[NLP] Information Extraction
Information Extraction
The task of extracting structured information from unstructured documents
Information extraction (IE) systems
- Find and understand limited relevant parts of texts
- Gather information from many pieces of text
- Produce a structured representation of relevant information:
- Relations (in the database sense), a.k.a. a knowledge base
Goals of Information extraction (IE) systems
- Organize information so that it is useful to people
- Put information in a semantically precise form that allows further inferences to be made by computer algorithms
Example of IE
Classified Advertisements (Real Estate)
- Plain text advertisements
- Lowest common denominator: only thing that 70+ newspapers using many different publishing systems can all handle
Why doesn’t text search (IR) work?
What you search for in real estate advertisements:
- Town/suburb. You might think easy, but:
- Real estate agents: Coldwell Banker, Mosman
- Phrases: Only 45 minutes from Parramatta
- Multiple property ads have different suburbs in one ad
- Money: want a range not textual match
- Multiple amounts: was $155K, now $145K
- Variations: offers in the high 700s [but not rents for $270]
- Bedrooms: similar issues: br, bdr, beds, B/R
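To make the contrast with plain text search concrete, here is a minimal sketch of normalizing prices and bedroom counts into numbers that can be range-queried. The patterns `PRICE_RE` and `BEDROOM_RE` and the sample ad are assumptions made for illustration; real ads would need many more variants.

```python
import re

# Hypothetical patterns for illustration only.
PRICE_RE = re.compile(r"\$(\d+(?:,\d{3})*)(K?)", re.IGNORECASE)
BEDROOM_RE = re.compile(r"(\d+)\s*(?:br|bdr|beds?|b/r)\b", re.IGNORECASE)

def extract_prices(ad):
    """Return every dollar amount in the ad as an integer number of dollars."""
    prices = []
    for digits, k_suffix in PRICE_RE.findall(ad):
        value = int(digits.replace(",", ""))
        # A trailing "K" means thousands: "$155K" -> 155000.
        prices.append(value * 1000 if k_suffix else value)
    return prices

def extract_bedrooms(ad):
    """Return every bedroom count, whatever abbreviation the ad uses."""
    return [int(n) for n in BEDROOM_RE.findall(ad)]

# Made-up advertisement text.
ad = "Mosman: 3 bdr house, was $155K, now $145K. Also 2 B/R unit."
prices = extract_prices(ad)
bedrooms = extract_bedrooms(ad)
print(prices, bedrooms)
```

Deciding which of several amounts is the current asking price still needs context ("was", "now"), which is exactly what a keyword match cannot provide.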
Named Entity Recognition (NER)
Named Entity Recognition
- Example
- Find entities in text
- Classify entities in text
Task: Predict entities in a text
Standard Methods for NER
- Hand-written regular expressions
- Perhaps stacked
- Using classifiers
- Generative: Naive Bayes
- Discriminative: Maxent models
- Sequence models
- HMMs
- CMMs/MEMMs
- CRFs
Hand-written regular expressions
Hand-written Patterns for Information
- If extracting from automatically generated web pages, simple regex patterns usually work
- Amazon page
<div class="buying"><h1 class="parseasinTitle"><span id="btAsinTitle" style="">(.*?)</span></h1>
- For certain restricted, common types of entities in unstructured text, simple regex patterns also usually work
- Finding (US) phone numbers
- (?:\(?[0-9]{3}\)?[ .-])?[0-9]{3}[ .-]?[0-9]{4}
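As a quick check, the phone-number pattern can be run with Python's `re` module. This sketch uses a version of the pattern with the parentheses escaped and the hyphen placed last in the character class; the sample text is made up.

```python
import re

# Optional area code (with or without parentheses), then 3 digits,
# an optional separator, and 4 digits.
PHONE_RE = re.compile(r"(?:\(?[0-9]{3}\)?[ .-])?[0-9]{3}[ .-]?[0-9]{4}")

text = "Call (650) 555-1234 or 555-9876; fax 650.555.0000."
matches = PHONE_RE.findall(text)
print(matches)
```

The pattern has no capturing groups, so `findall` returns the full matched strings.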
Natural Language Processing-based Hand-Written Information Extraction
- For unstructured human-written text, some NLP may help
- Part-of-speech (POS) tagging
- Mark each word as a noun, verb, preposition, etc.
- Syntactic parsing
- Identify phrases: NP, VP, PP
- Semantic word categories (e.g. from WordNet)
- KILL: kill, murder, assassinate, strangle, suffocate
Rule-based Extraction Examples
- Determining which person holds what office in what organization
- [person], [office], of [org]
- Vuk Draskovic, leader of the Serbian Renewal Movement
- [org] (named, appointed, etc.) [person] Prep [office]
- NATO appointed Wesley Clark as Commander in Chief
- Determining where an organization is located
- [org] in [loc]
- NATO headquarters in Brussels
- [org][loc] (division, branch, headquarters, etc.)
- KFOR Kosovo headquarters
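Two of these rules can be sketched as regular expressions. A real system would match over NER and POS tags; here a run of capitalized words (`NAME`) is a crude, assumed stand-in for the [person], [org], and [loc] slots, and the function name is hypothetical.

```python
import re

# Crude stand-in for an entity tagger: one or more capitalized words.
NAME = r"(?:[A-Z][\w.]*)(?: [A-Z][\w.]*)*"

# [person], [office] of [org]
OFFICE_RE = re.compile(
    rf"(?P<person>{NAME}), (?P<office>[a-z][a-z ]*?) of (?:the )?(?P<org>{NAME})"
)
# [org] headquarters in [loc]
LOCATION_RE = re.compile(rf"(?P<org>{NAME}) headquarters in (?P<loc>{NAME})")

def apply_rule(rule, sentence):
    """Return the named slots of the first match, or None if the rule fails."""
    match = rule.search(sentence)
    return match.groupdict() if match else None

office = apply_rule(OFFICE_RE, "Vuk Draskovic, leader of the Serbian Renewal Movement")
location = apply_rule(LOCATION_RE, "NATO headquarters in Brussels")
print(office)
print(location)
```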
Using classifiers
Information Extraction as Text Classification
- Use conventional classification algorithms to classify substrings of document as “to be extracted” or not.
- In some simple but compelling domains, this naive technique is remarkably effective.
- But do think about when it would and wouldn’t work!
“Change of Address” Email
Change-of-Address Detection Results
- Corpus of 36 CoA emails and 5720 non-CoA emails
- Results from 2-fold cross validations (train on half, test on other half)
- Very skewed distribution intended to be realistic
- Note very limited training data: only 18 training CoA messages per fold
- 36 CoA messages have 86 email addresses; old, new, and miscellaneous
Sequence models
The ML Sequence Model Approach to NER
- Training
- Collect a set of representative training documents
- Label each token for its entity class or other (O)
- Design feature extractors appropriate to the text and classes
- Train a sequence classifier to predict the labels from the data
- Testing
- Receive a set of testing documents
- Run sequence model inference to label each token
- Appropriately output the recognized entities
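The last testing step, turning per-token labels back into entities, can be sketched as follows. This assumes the simplest encoding, where each token carries its entity class or `O` and adjacent tokens with the same class are merged; the function name is illustrative.

```python
def labels_to_entities(tokens, labels):
    """Merge runs of identically labeled, non-O tokens into (text, class) pairs."""
    entities = []
    current_tokens, current_label = [], "O"
    for token, label in zip(tokens, labels):
        if label == current_label and label != "O":
            # Same entity class as the previous token: extend the current run.
            current_tokens.append(token)
        else:
            # The run (if any) has ended: emit it and start a new one.
            if current_label != "O":
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
    if current_label != "O":
        entities.append((" ".join(current_tokens), current_label))
    return entities

tokens = ["Fred", "showed", "Sue", "Mengqiu", "Huang", "'s", "new", "painting"]
labels = ["PER", "O", "PER", "PER", "PER", "O", "O", "O"]
entities = labels_to_entities(tokens, labels)
print(entities)
```

Note that "Sue" and "Mengqiu Huang" come out as a single person here. Separating adjacent entities of the same class needs an encoding that marks where each entity begins.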
Encoding Classes for Sequence Labeling
Features for Sequence Labeling
- Word
- Current word (essentially like a learned dictionary)
- Previous/next word (context)
- Other kinds of inferred linguistic classification
- Part-of-speech tags
- Previous (and perhaps next) label
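A feature extractor for these features might look like the sketch below. The feature names and the `<START>`/`<END>` padding symbols are assumptions; the resulting dictionary could be fed to any classifier that accepts sparse named features.

```python
def token_features(tokens, pos_tags, i, prev_label):
    """Build the feature dictionary for the token at position i."""
    return {
        "word": tokens[i].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        "pos": pos_tags[i],
        "prev_label": prev_label,
    }

tokens = ["NATO", "appointed", "Wesley", "Clark"]
pos_tags = ["NNP", "VBD", "NNP", "NNP"]
features = token_features(tokens, pos_tags, 2, prev_label="O")
print(features)
```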
Inference in Systems
Inference Methods
Greedy inference
- We just start at the left, and use our classifier at each position to assign a label
- The classifier can depend on previous labeling decisions as well as observed data
- Advantages
- Fast, no extra memory requirements
- Very easy to implement
- With rich features including observations to the right, it may perform quite well
- Disadvantage
- Greedy. We commit to errors we cannot recover from
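Greedy inference fits in a few lines. In this sketch, `score(tokens, i, prev_label, label)` is an assumed interface standing in for a trained classifier, and `toy_score` is a hand-made scorer used only to make the example runnable.

```python
def greedy_decode(tokens, labels, score):
    """Label left to right, committing to the best label at each position."""
    output = []
    prev_label = "<START>"
    for i in range(len(tokens)):
        best = max(labels, key=lambda y: score(tokens, i, prev_label, y))
        output.append(best)
        prev_label = best  # later decisions depend on this one
    return output

def toy_score(tokens, i, prev_label, label):
    """Hypothetical scorer: capitalized words look like PER, others like O."""
    wanted = "PER" if tokens[i][0].isupper() else "O"
    points = 1.0 if label == wanted else 0.0
    if prev_label == "PER" and label == "PER":
        points += 0.5  # person names tend to span several tokens
    return points

result = greedy_decode(["Wesley", "Clark", "spoke"], ["PER", "O"], toy_score)
print(result)
```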
Beam inference
- At each position keep the top k complete sequences.
- Extend each sequence in each local way.
- The extensions compete for the k slots at the next position
- Advantages
- Fast; beam sizes of 3-5 are almost as good as exact inference in many cases
- Easy to implement (no dynamic programming required)
- Disadvantage
- Inexact: the globally best sequence can fall off the beam
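A minimal sketch of beam inference, assuming the same kind of `score(tokens, i, prev_label, label)` function as a stand-in for a trained model. The scores in `LOCAL` are invented so that a beam of 1 (which behaves like greedy inference) commits to the wrong first label, while a beam of 2 recovers.

```python
def beam_decode(tokens, labels, score, k=3):
    """Keep the k highest-scoring partial label sequences at each position."""
    beam = [(0.0, [])]  # (total score, label sequence so far)
    for i in range(len(tokens)):
        candidates = []
        for total, seq in beam:
            prev_label = seq[-1] if seq else "<START>"
            for y in labels:
                candidates.append((total + score(tokens, i, prev_label, y), seq + [y]))
        # All extensions compete for the k slots at the next position.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:k]
    return beam[0][1]

# Invented local scores: "Paris" alone looks slightly more like a location.
LOCAL = {"Paris": {"LOC": 0.6, "PER": 0.4}, "Hilton": {"LOC": 0.1, "PER": 0.5}}

def toy_score(tokens, i, prev_label, label):
    bonus = 0.5 if prev_label == "PER" and label == "PER" else 0.0
    return LOCAL[tokens[i]][label] + bonus

tokens = ["Paris", "Hilton"]
narrow = beam_decode(tokens, ["LOC", "PER"], toy_score, k=1)
wide = beam_decode(tokens, ["LOC", "PER"], toy_score, k=2)
print(narrow, wide)
```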
Viterbi inference
- Dynamic programming or memoization
- Requires small window of state influence (e.g., past two states are relevant)
- Advantages
- Exact: the global best sequence is returned
- Disadvantage
- Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway)
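The dynamic program can be sketched for the simplest case, where only the previous label influences the current one. The `score(tokens, i, prev_label, label)` interface and the numbers in `LOCAL` are assumptions for illustration, not a trained model.

```python
def viterbi_decode(tokens, labels, score):
    """Return the highest-scoring label sequence under a first-order model."""
    if not tokens:
        return []
    # best[i][y] = (best total score of a sequence ending in y at i, previous label)
    best = [{y: (score(tokens, 0, "<START>", y), None) for y in labels}]
    for i in range(1, len(tokens)):
        column = {}
        for y in labels:
            column[y] = max(
                (best[i - 1][p][0] + score(tokens, i, p, y), p) for p in labels
            )
        best.append(column)
    # Follow the back-pointers from the best final label.
    last = max(labels, key=lambda y: best[-1][y][0])
    sequence = [last]
    for i in range(len(tokens) - 1, 0, -1):
        last = best[i][last][1]
        sequence.append(last)
    return sequence[::-1]

# Invented scores for illustration.
LOCAL = {"Paris": {"LOC": 0.6, "PER": 0.4}, "Hilton": {"LOC": 0.1, "PER": 0.5}}

def toy_score(tokens, i, prev_label, label):
    bonus = 0.5 if prev_label == "PER" and label == "PER" else 0.0
    return LOCAL[tokens[i]][label] + bonus

result = viterbi_decode(["Paris", "Hilton"], ["LOC", "PER"], toy_score)
print(result)
```

The table has one cell per (position, label) pair, so the cost grows with the square of the label set rather than exponentially with sentence length.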
Neural Approaches for NER
Relation Extraction
Relation Extraction
Checking if groupings of entities are instances of a relation
- Manually engineered rules
- Rules defined over words/entities:
"<company> located in <location>"
- Rules defined over parsed text:
"((Obj <company>) (Verb located) (*) (Subj <location>))"
- Machine Learning-based
- Supervised: Learn relation classifier from examples
- Partially-supervised: bootstrap rules/patterns from “seed” examples
Disease Outbreaks
Protein Interactions
Binary Relation Association as Binary Classification
Resolving Coreference
Extracting Relation Triples
Why Relation Extraction?
- Create new structured knowledge bases, useful for many applications
- Augment current knowledge bases
- Adding words to the WordNet thesaurus, facts to Freebase or DBpedia
- Support question answering
- The granddaughter of which actor starred in the movie “E.T.”?
(acted-in ?x "E.T.")(is-a ?y actor)(granddaughter-of ?x ?y)
- But which relations should we extract?
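The question-answering query above can be sketched over a tiny in-memory set of relation triples. The four triples below are hand-entered for illustration; a real system would query a knowledge base populated by relation extraction.

```python
# (subject, relation, object) triples, hand-entered for this example.
triples = {
    ("Drew Barrymore", "acted-in", "E.T."),
    ("Henry Thomas", "acted-in", "E.T."),
    ("John Barrymore", "is-a", "actor"),
    ("Drew Barrymore", "granddaughter-of", "John Barrymore"),
}

def grandparent_actors(triples, movie):
    """Solve (acted-in ?x movie)(is-a ?y actor)(granddaughter-of ?x ?y) for ?y."""
    return sorted(
        y
        for (x, r1, m) in triples
        if r1 == "acted-in" and m == movie
        for (x2, r2, y) in triples
        if r2 == "granddaughter-of" and x2 == x
        if (y, "is-a", "actor") in triples
    )

answer = grandparent_actors(triples, "E.T.")
print(answer)
```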
Automated Content Extraction (ACE)
UMLS: Unified Medical Language System
Databases of Wikipedia
Relation Databases that Draw From Wikipedia
- Resource Description Framework (RDF) triples
- subject predicate object
- Golden Gate Park location San Francisco
- dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
- DBpedia: 1 billion RDF triples, 385 million from English Wikipedia
- Frequent Freebase Relations
- people/person/nationality
- location/location/contains
- people/person/profession
- people/person/place-of-birth
- biology/organism_higher_classification
- film/film/genre
Ontological Relations
- IS-A (hypernym): subsumption between classes
- Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal …
- Instance-of: relation between individual and class
- San Francisco instance-of city
How to build relation extractors
- Hand-written patterns
- Supervised machine learning
- Semi-supervised and unsupervised
- Bootstrapping (using seeds)
- Distant supervision
- Unsupervised learning from the web
Rules for Extracting IS-A Relation
- Early intuition from Hearst (1992)
- “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
- What does Gelidium mean?
- How do you know?
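The cue in that sentence is the phrase "such as": whatever follows it is probably a kind of whatever precedes it. A minimal sketch of this single pattern follows. The regular expression is an assumption that takes the one or two words before the comma as the broader class; a real implementation would match noun phrases from a parser instead.

```python
import re

# "Y, such as X" suggests X IS-A Y. The one-or-two-word hypernym is a crude
# approximation of a noun phrase.
SUCH_AS_RE = re.compile(r"(?P<hypernym>\w+(?: \w+)?), such as (?P<hyponym>[A-Z]\w+)")

def extract_is_a(sentence):
    """Return (hyponym, hypernym) for the first "such as" match, else None."""
    match = SUCH_AS_RE.search(sentence)
    if match is None:
        return None
    return (match.group("hyponym"), match.group("hypernym"))

sentence = (
    "Agar is a substance prepared from a mixture of red algae, "
    "such as Gelidium, for laboratory or industrial use"
)
pair = extract_is_a(sentence)
print(pair)
```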
Hearst’s Patterns for Extracting IS-A Relations
Extracting Richer Relations Using Rules
- Intuition: relations often hold between specific entities
- Located-in (ORGANIZATION, LOCATION)
- Founded (PERSON, ORGANIZATION)
- Cures (DRUG, DISEASE)
- Start with Named Entity tags to help extract relations
NER for Extracting Relations?
- Named Entities aren’t quite enough
- Which relations hold between 2 entities?
Extracting Richer Relations Using Rules and Named Entities
Hand-built Patterns for Relations
- Pros
- Human patterns tend to be high-precision
- Can be tailored to specific domains
- Cons
- Human patterns are often low-recall
- A lot of work to think of all possible patterns
- Don’t want to have to do this for every relation
- We’d like better accuracy
Supervised Relation Extraction
- Choose a set of relations we’d like to extract
- Choose a set of relevant named entities
- Find and label data
- Choose a representative corpus
- Label the named entities in the corpus
- Hand-label the relations between those entities
- Break into training, development, and test
- Train a classifier on the training set
Classification in Supervised Relation Extraction
- Find all pairs of named entities (usually in same sentence)
- Decide if 2 entities are related
- If yes, classify the relation
- Why the extra step?
- Faster classification training by eliminating most pairs
- Can use distinct feature-sets appropriate for each task
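The two-stage design can be sketched as a small pipeline. The two classifiers are passed in as functions; the type-based stand-ins below (`RELATION_BY_TYPES`, `toy_is_related`, `toy_classify`) are hypothetical placeholders for trained models, included only so the sketch runs.

```python
from itertools import combinations

def extract_relations(entities, is_related, classify_relation):
    """Run a cheap related/unrelated filter, then classify the surviving pairs."""
    relations = []
    for e1, e2 in combinations(entities, 2):
        if is_related(e1, e2):  # stage 1: eliminates most pairs
            relations.append((e1[0], classify_relation(e1, e2), e2[0]))  # stage 2
    return relations

# Hypothetical stand-ins for trained classifiers, keyed on entity types.
RELATION_BY_TYPES = {("ORG", "LOC"): "located-in", ("PER", "ORG"): "employed-by"}

def toy_is_related(e1, e2):
    return (e1[1], e2[1]) in RELATION_BY_TYPES

def toy_classify(e1, e2):
    return RELATION_BY_TYPES[(e1[1], e2[1])]

# Entities of one sentence as (text, type) pairs, in sentence order.
entities = [("NATO", "ORG"), ("Brussels", "LOC"), ("Wesley Clark", "PER")]
relations = extract_relations(entities, toy_is_related, toy_classify)
print(relations)
```

Keeping the stages separate lets each use its own feature set, and the second classifier only ever sees pairs that passed the filter.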
Relation Extraction
Features for Relation Extraction
Classifier for Supervised Method
- Now you can use any classifier you like
- MaxEnt
- Naive Bayes
- SVM
- Neural Network
- BERT
- Train it on the training set, tune on the dev set, test on the test set
Example: Neural Relation Extraction
Supervised Relation Extraction
- Pros
- Can get high accuracies with enough hand-labeled training data, if the test data is similar enough to the training data
- Cons
- Labeling a large training set is expensive
- Supervised models are brittle, don’t generalize well to different genres
Seed-based Or Bootstrapping Approaches To Relation Extraction
- No training set? Maybe you have
- A few seed tuples or
- A few high-precision patterns
- Can you use those seeds to do something useful?
- Bootstrapping: use the seeds to directly learn to populate a relation
Relation Bootstrapping (Hearst 1992)
- Gather a set of seed pairs that have relation R
- Iterate:
- Find sentences with these pairs
- Look at the context between or around the pair and generalize the context to create patterns
- Use the patterns to grep for more pairs
Bootstrapping
- <Mark Twain, Elmira> Seed tuple
- Grep for the environments of the seed tuple
- Use those patterns to grep for new tuples
- Iterate
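The loop can be sketched on a toy corpus. Everything here is an illustrative assumption: the four corpus sentences are made up for the example, `NAME` approximates an entity as a run of capitalized words, and a "pattern" is simply the literal text between the two members of a pair.

```python
import re

NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"  # crude stand-in for an entity tagger

def bootstrap(corpus, seeds, iterations=2):
    """Alternate between learning patterns from pairs and pairs from patterns."""
    pairs = set(seeds)
    patterns = set()
    for _ in range(iterations):
        # 1. Grep for the environments of the known pairs.
        for x, y in list(pairs):
            for sentence in corpus:
                match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
                if match:
                    patterns.add(match.group(1))
        # 2. Use those patterns to grep for new pairs.
        for middle in patterns:
            regex = re.compile(rf"({NAME}){re.escape(middle)}({NAME})")
            for sentence in corpus:
                pairs.update(regex.findall(sentence))
    return pairs, patterns

corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "Emily Dickinson is buried in Amherst.",
    "The grave of Emily Dickinson is in Amherst.",
    "The grave of Karl Marx is in London.",
]
pairs, patterns = bootstrap(corpus, {("Mark Twain", "Elmira")})
print(sorted(pairs))
print(sorted(patterns))
```

The second iteration learns the looser pattern " is in ", which would also match unrelated sentences in a larger corpus. Practical systems score patterns and tuples to limit this kind of drift.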
DIPRE: Extract (author, book) pairs
Unsupervised Relation Extraction
- Open Information Extraction
- Extract relations from the web with no training data, no list of relations
- Use parsed data to train a “trustworthy tuple” classifier
- Single-pass extract all relations between NPs, keep if trustworthy
- Assessor ranks relations based on text redundancy
Other topics in IE
Extracting Times
- Temporal expression extraction
- Temporal normalization
Extracting Events and their Times
Event extraction: identify mentions of events in text
- An event mention is any expression denoting an event or state that can be assigned to a particular point, or interval, in time
- Events are to be classified as actions, states, reporting events (say, report, tell, explain), perception reporting events, and so on
Entity Linking
Task: Given a database of candidate referents, identify the correct referent for a mention in context
Korean NER with BERT
BERT + CRF (conditional random field)