[NLP] Information Extraction

Information Extraction

The task of extracting structured information from unstructured documents

Information extraction (IE) systems

Find and understand limited relevant parts of texts
Gather information from many pieces of text
Produce a structured representation of relevant information:
- Relations (in the database sense), a.k.a.,
- A knowledge base

Goals of Information extraction (IE) systems

Organize information so that it is useful to people
Put information in a semantically precise form that allows further inferences to be made by computer algorithms

Example of IE

Classified Advertisements (Real Estate)

Plain text advertisements
Lowest common denominator: only thing that 70+ newspapers using many different publishing systems can all handle

Why doesn’t text search (IR) work?

What you search for in real estate advertisements:

Town/suburb. You might think easy, but:
- Real estate agents: Coldwell Banker, Mosman
- Phrases: Only 45 minutes from Parramatta
- Multiple property ads have different suburbs in one ad
Money: want a range not textual match
- Multiple amounts: was $155K, now $145K
- Variations: offers in the high 700s [but not rents for $270]
Bedrooms: similar issues: br, bdr, beds, B/R

Named Entity Recognition (NER)

Named Entity Recognition

Example

Find entities in text

Classify entities in text

Task: Predict entities in a text

Standard Methods for NER

Hand-written regular expressions
- Perhaps stacked
Using classifiers
- Generative: Naive Bayes
- Discriminative: Maxent models
Sequence models
- HMMs
- CMMs/MEMMs
- CRFs

Hand-written regular expressions

Hand-written Patterns for Information

If extracting from automatically generated web pages, simple regex patterns usually work
- Amazon page
- <div class="buying"><h1 class="parseasintitle"><span id="btAsinTitle" style="">(.*?)</span></h1>
For certain restricted, common types of entities in unstructured text, simple regex patterns also usually work
- Finding (US) phone numbers
- (?:\(?[0-9]{3}\)?[ .-])?[0-9]{3}[ .-]?[0-9]{4}
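
The phone-number pattern can be tried out directly in Python. This is a minimal sketch: it assumes the parentheses around the area code are escaped (`\(` and `\)`) and that the separator class is written `[ .-]` so the hyphen is read literally. The sample sentence and the function name `find_phone_numbers` are made up for illustration.

```python
import re

# US phone numbers: an optional area code (with or without parentheses),
# then 3 digits, an optional separator, and 4 digits.
PHONE = re.compile(r"(?:\(?[0-9]{3}\)?[ .-])?[0-9]{3}[ .-]?[0-9]{4}")


def find_phone_numbers(text: str) -> list[str]:
    """Return every substring of `text` that matches PHONE."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return PHONE.findall(text)


sample = "Call (415) 555-0132 or 555.0199; the office line is 650-555-0147."
print(find_phone_numbers(sample))
```

A pattern this simple will also fire inside longer digit strings, so in practice it is usually combined with word-boundary checks or some validation of the match.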

Natural Language Processing-based Hand-Written Information Extraction

For unstructured human-written text, some NLP may help
- Part-of-speech (POS) tagging
  - Mark each word as a noun, verb, preposition, etc.
- Syntactic parsing
  - Identify phrases: NP, VP, PP
- Semantic word categories (e.g. from WordNet)
  - KILL: kill, murder, assassinate, strangle, suffocate

Rule-based Extraction Examples

Determining which person holds what office in what organization
- [person], [office] of [org]
  - Vuk Draskovic, leader of the Serbian Renewal Movement
- [org] (named, appointed, etc.) [person] Prep [office]
  - NATO appointed Wesley Clark as Commander in Chief
Determining where an organization is located
- [org] in [loc]
  - NATO headquarters in Brussels
- [org] [loc] (division, branch, headquarters, etc.)
  - KFOR Kosovo headquarters
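
Two of the rules above can be sketched as regular expressions. The sketch assumes there is no entity tagger available, so a "name" is approximated as a run of capitalized words and an "office" as a run of lowercase words. The names `NAME`, `OFFICE`, `extract_facts` and the relation labels are hypothetical.

```python
import re

# Crude stand-ins for real entity recognizers (assumptions, not real NER).
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"    # run of capitalized words
OFFICE = r"[a-z]+(?: [a-z]+)*"      # run of lowercase words

# [person], [office] of [org]
PERSON_OFFICE_ORG = re.compile(
    rf"(?P<person>{NAME}), (?P<office>{OFFICE}) of (?:the )?(?P<org>{NAME})"
)
# [org] in [loc], optionally with the word "headquarters" after the org
ORG_IN_LOC = re.compile(rf"(?P<org>{NAME})(?: headquarters)? in (?P<loc>{NAME})")


def extract_facts(text: str) -> list[tuple]:
    """Apply both rules and return the extracted facts as tuples."""
    facts = []
    for m in PERSON_OFFICE_ORG.finditer(text):
        facts.append(("holds-office", m["person"], m["office"], m["org"]))
    for m in ORG_IN_LOC.finditer(text):
        facts.append(("located-in", m["org"], m["loc"]))
    return facts


print(extract_facts("Vuk Draskovic, leader of the Serbian Renewal Movement"))
print(extract_facts("NATO headquarters in Brussels"))
```

Without real entity types the rules over-generate: on "NATO appointed Wesley Clark as Commander in Chief" the second rule extracts ("located-in", "Commander", "Chief"), which is why such patterns are normally written over tagged entities rather than raw strings.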

Using classifiers

Information Extraction as Text Classification

Use conventional classification algorithms to classify substrings of document as “to be extracted” or not.
In some simple but compelling domains, this naive technique is remarkably effective.
- But do think about when it would and wouldn’t work!

“Change of Address” Email

Change-of-Address Detection Results

Corpus of 36 CoA emails and 5720 non-CoA emails
- Results from 2-fold cross-validation (train on half, test on the other half)
- Very skewed distribution intended to be realistic
- Note very limited training data: only 18 training CoA messages per fold
- 36 CoA messages have 86 email addresses; old, new, and miscellaneous
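
The skew in this corpus is worth working out, because it shows why plain accuracy says little here. The sketch below uses only the two counts given above; the function name `majority_baseline` is made up for illustration.

```python
def majority_baseline(n_pos: int, n_neg: int) -> tuple[float, float]:
    """Accuracy of always predicting the larger class, and the positive rate."""
    total = n_pos + n_neg
    accuracy = max(n_pos, n_neg) / total
    positive_rate = n_pos / total
    return accuracy, positive_rate


# 36 CoA emails versus 5720 non-CoA emails
accuracy, positive_rate = majority_baseline(36, 5720)
print(f"positive rate: {positive_rate:.2%}")        # 0.63%
print(f"always 'not CoA' accuracy: {accuracy:.2%}")  # 99.37%
```

A classifier that never detects a single change of address would still score about 99.4% accuracy, so precision and recall on the CoA class are the more informative measures.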

Sequence models

The ML Sequence Model Approach to NER

Training
1. Collect a set of representative training documents
2. Label each token for its entity class or other (O)
3. Design feature extractors appropriate to the text and classes
4. Train a sequence classifier to predict the labels from the data
Testing
1. Receive a set of testing documents
2. Run sequence model inference to label each token
3. Appropriately output the recognized entities

Encoding Classes for Sequence Labeling
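
One widely used encoding is IOB (also written BIO): the first token of an entity gets `B-<class>`, later tokens of the same entity get `I-<class>`, and everything else gets `O`. The sketch below is illustrative only. It converts labeled token spans to IOB tags and reads entities back out of a tag sequence, which is also what the "output the recognized entities" step at test time amounts to. The function names and the `PER`/`ORG` class names are assumptions.

```python
def spans_to_iob(tokens, spans):
    """spans: (start, end_exclusive, label) triples over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def iob_to_entities(tokens, tags):
    """Group tagged tokens back into (entity text, label) pairs."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new or tag == "O":
            if current:  # close the entity that was open so far
                entities.append((" ".join(current), label))
            current, label = ([token], tag[2:]) if starts_new else ([], None)
        else:  # an I- tag continuing the current entity
            current.append(token)
    if current:
        entities.append((" ".join(current), label))
    return entities


tokens = "NATO appointed Wesley Clark as Commander in Chief".split()
tags = spans_to_iob(tokens, [(0, 1, "ORG"), (2, 4, "PER")])
print(tags)
print(iob_to_entities(tokens, tags))
```

The simpler IO encoding drops the `B-` prefix, at the cost of not being able to separate two adjacent entities of the same class.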

Features for Sequence Labeling

Word
- Current word (essentially like a learned dictionary)
- Previous/next word (context)
Other kinds of inferred linguistic classification
- Part-of-speech tags
Label context
- Previous (and perhaps next) label
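
A feature extractor for the features listed above might look like the following sketch. The feature names, the `<s>`/`</s>` padding symbols, and the assumption that part-of-speech tags arrive as a parallel list are all illustrative choices.

```python
def token_features(tokens, pos_tags, i, prev_label):
    """Features for the token at position i, as a name -> value dict."""
    return {
        "word": tokens[i].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "pos": pos_tags[i],
        # Label context: the label already assigned to the previous token.
        "prev_label": prev_label,
    }


tokens = ["NATO", "appointed", "Wesley", "Clark"]
pos_tags = ["NNP", "VBD", "NNP", "NNP"]
print(token_features(tokens, pos_tags, 2, prev_label="O"))
```

A dict of this shape can be passed to most feature-based sequence classifiers after one-hot encoding.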

Inference in Systems

Inference Methods

Greedy inference

We just start at the left, and use our classifier at each position to assign a label
The classifier can depend on previous labeling decisions as well as observed data
Advantages
- Fast, no extra memory requirements
- Very easy to implement
- With rich features including observations to the right, it may perform quite well
Disadvantage
- Greedy: once we commit to an error, we cannot recover from it
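
Greedy inference can be sketched in a few lines. The scores below are a hypothetical toy model standing in for a trained classifier: `EMIT` scores a label given the word, `TRANS` scores a label given the previous label, and all numbers are made up.

```python
LABELS = ["O", "PER", "ORG"]
START = "O"  # label assumed to precede the first token

# Hypothetical scores; a real system would learn these from data.
EMIT = {
    "Washington": {"O": 0.0, "PER": 1.0, "ORG": 0.5},
    "Post": {"O": 0.5, "PER": -1.0, "ORG": 2.0},
}
DEFAULT_EMIT = {"O": 2.0, "PER": 0.0, "ORG": 0.0}
TRANS = {("PER", "PER"): 1.0, ("ORG", "ORG"): 1.0,
         ("PER", "ORG"): -1.0, ("ORG", "PER"): -1.0}


def local_score(tokens, i, prev_label, label):
    """Score of assigning `label` at position i given the previous label."""
    emit = EMIT.get(tokens[i], DEFAULT_EMIT)[label]
    return emit + TRANS.get((prev_label, label), 0.0)


def greedy_decode(tokens):
    """Left to right, commit to the best-scoring label at each position."""
    labels, prev = [], START
    for i in range(len(tokens)):
        best = max(LABELS, key=lambda lab: local_score(tokens, i, prev, lab))
        labels.append(best)
        prev = best
    return labels


print(greedy_decode(["Washington", "Post", "reported"]))
```

On this toy input greedy labels "Washington" as PER because that is locally best, and then cannot revise it once "Post" arrives: its sequence scores 4.0 in total, while ORG ORG O would score 5.5 under the same model.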

Beam inference

At each position keep the top k complete sequences.
Extend each sequence in each local way.
The extensions compete for the k slots at the next position
Advantages
- Fast; beam sizes of 3-5 are almost as good as exact inference in many cases
- Easy to implement (no dynamic programming required)
Disadvantage
- Inexact: the globally best sequence can fall off the beam
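
Beam inference, sketched under the assumption of a hypothetical toy scorer (`EMIT` and `TRANS` hold made-up scores that stand in for a trained classifier). Each partial sequence is extended with every label, and only the `k` highest-scoring extensions survive to the next position.

```python
LABELS = ["O", "PER", "ORG"]
START = "O"  # label assumed to precede the first token

# Hypothetical scores; a real system would learn these from data.
EMIT = {
    "Washington": {"O": 0.0, "PER": 1.0, "ORG": 0.5},
    "Post": {"O": 0.5, "PER": -1.0, "ORG": 2.0},
}
DEFAULT_EMIT = {"O": 2.0, "PER": 0.0, "ORG": 0.0}
TRANS = {("PER", "PER"): 1.0, ("ORG", "ORG"): 1.0,
         ("PER", "ORG"): -1.0, ("ORG", "PER"): -1.0}


def local_score(tokens, i, prev_label, label):
    """Score of assigning `label` at position i given the previous label."""
    emit = EMIT.get(tokens[i], DEFAULT_EMIT)[label]
    return emit + TRANS.get((prev_label, label), 0.0)


def beam_decode(tokens, k=3):
    """Return (labels, total score) of the best sequence left on the beam."""
    beam = [([], 0.0)]
    for i in range(len(tokens)):
        candidates = []
        for labels, total in beam:
            prev = labels[-1] if labels else START
            for lab in LABELS:
                score = total + local_score(tokens, i, prev, lab)
                candidates.append((labels + [lab], score))
        # The extensions compete for the k slots at the next position.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]
    return beam[0]


tokens = ["Washington", "Post", "reported"]
print(beam_decode(tokens, k=1))  # k=1 behaves like greedy inference
print(beam_decode(tokens, k=2))
```

With `k=1` the locally best first label is kept and the result scores 4.0; with `k=2` the ORG reading of "Washington" stays alive long enough for "Post" to promote it, giving ORG ORG O with score 5.5.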

Viterbi inference

Dynamic programming or memoization
Requires small window of state influence (e.g., past two states are relevant)
Advantages
- Exact: the global best sequence is returned
Disadvantage
- Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway)
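
A minimal Viterbi sketch follows. It assumes the smallest useful window, where each score depends only on the current token and the previous label, and it uses a hypothetical toy scorer (`EMIT` and `TRANS` hold made-up numbers).

```python
LABELS = ["O", "PER", "ORG"]
START = "O"  # label assumed to precede the first token

# Hypothetical scores; a real system would learn these from data.
EMIT = {
    "Washington": {"O": 0.0, "PER": 1.0, "ORG": 0.5},
    "Post": {"O": 0.5, "PER": -1.0, "ORG": 2.0},
}
DEFAULT_EMIT = {"O": 2.0, "PER": 0.0, "ORG": 0.0}
TRANS = {("PER", "PER"): 1.0, ("ORG", "ORG"): 1.0,
         ("PER", "ORG"): -1.0, ("ORG", "PER"): -1.0}


def local_score(tokens, i, prev_label, label):
    """Score of assigning `label` at position i given the previous label."""
    emit = EMIT.get(tokens[i], DEFAULT_EMIT)[label]
    return emit + TRANS.get((prev_label, label), 0.0)


def viterbi_decode(tokens):
    """Return (labels, total score) of the globally best label sequence."""
    if not tokens:
        return [], 0.0
    # best[i][lab] = (score of the best path ending in lab at i, backpointer)
    best = [{lab: (local_score(tokens, 0, START, lab), None) for lab in LABELS}]
    for i in range(1, len(tokens)):
        column = {}
        for lab in LABELS:
            prev, score = max(
                ((p, best[i - 1][p][0] + local_score(tokens, i, p, lab))
                 for p in LABELS),
                key=lambda item: item[1],
            )
            column[lab] = (score, prev)
        best.append(column)
    # Follow the backpointers from the best final label.
    last = max(LABELS, key=lambda lab: best[-1][lab][0])
    total = best[-1][last][0]
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    path.reverse()
    return path, total


print(viterbi_decode(["Washington", "Post", "reported"]))
```

The table has one cell per (position, label), so the cost grows with the number of labels squared per token. Widening the window to two previous labels means using label pairs as states, which is where the implementation effort mentioned above comes from.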

Neural Approaches for NER

Relation Extraction

Checking if groupings of entities are instances of a relation

Manually engineered rules
- Rules defined over words/entities: "<company> located in <location>"
- Rules defined over parsed text: "((Obj <company>) (Verb located) (*) (Subj <location>))"
Machine Learning-based
- Supervised: Learn relation classifier from examples
- Partially-supervised: bootstrap rules/patterns from “seed” examples

Disease Outbreaks

Protein Interactions

Binary Relation Association as Binary Classification

Resolving Coreference

Extracting Relation Triples

Why Relation Extraction?

Create new structured knowledge bases, useful for many applications
Augment current knowledge bases
- Adding words to the WordNet thesaurus, facts to Freebase or DBpedia
Support question answering
- The granddaughter of which actor starred in the movie “E.T.”?
- (acted-in ?x "E.T.")(is-a ?y actor)(granddaughter-of ?x ?y)
But which relations should we extract?

Automated Content Extraction (ACE)

UMLS: Unified Medical Language System

Databases of Wikipedia

Relation Databases that Draw From Wikipedia

Resource Description Framework (RDF) triples
- subject predicate object
- Golden Gate Park location San Francisco
- dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
DBpedia: 1 billion RDF triples, 385 million from English Wikipedia
Frequent Freebase Relations
- people/person/nationality
- location/location/contains
- people/person/profession
- people/person/place-of-birth
- biology/organism_higher_classification
- film/film/genre
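
The subject-predicate-object shape of an RDF triple maps onto a very small data structure. The sketch below is an in-memory illustration only, not an RDF library. The `Triple` and `match` names are made up, and the two triples are the Golden Gate Park example above and the San Francisco instance-of example from this post.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


store = [
    Triple("Golden Gate Park", "location", "San Francisco"),
    Triple("San Francisco", "instance-of", "city"),
]


def match(triples, subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None):
    """Return the triples that agree with every argument that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything located in San Francisco
print(match(store, predicate="location", obj="San Francisco"))
```

Leaving an argument as `None` plays the role of a query variable such as `?x`.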

Ontological Relations

IS-A (hypernym): subsumption between classes
- Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal …
Instance-of: relation between individual and class
- San Francisco instance-of city

How to build relation extractors

Hand-written patterns
Supervised machine learning
Semi-supervised and unsupervised
- Bootstrapping (using seeds)
- Distant supervision
- Unsupervised learning from the web

Rules for Extracting IS-A Relation

Early intuition from Hearst (1992)
- “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
What does Gelidium mean?
How do you know?

Hearst’s Patterns for Extracting IS-A Relations
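
One of Hearst's patterns is "Y such as X", which suggests that X IS-A Y. The sketch below applies it to the Agar sentence. It is a crude approximation: a real implementation would match noun phrases from a chunker or parser, whereas this one assumes the hypernym is the one or two lowercase words before "such as" and the hyponym is a single capitalized word.

```python
import re

# "Y such as X": Y is approximated as one or two lowercase words,
# X as a single capitalized word (assumptions made for this sketch).
SUCH_AS = re.compile(
    r"\b(?P<hypernym>[a-z]+(?: [a-z]+)?),? such as (?P<hyponym>[A-Z][a-z]+)"
)


def extract_is_a(text: str) -> list[tuple[str, str, str]]:
    """Return (hyponym, 'IS-A', hypernym) triples found in `text`."""
    return [(m["hyponym"], "IS-A", m["hypernym"]) for m in SUCH_AS.finditer(text)]


sentence = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium, for laboratory or industrial use")
print(extract_is_a(sentence))
```

This mirrors how a reader works out that Gelidium is a kind of red alga without ever having seen the word before.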

Extracting Richer Relations Using Rules

Intuition: relations often hold between specific entities
- Located-in (ORGANIZATION, LOCATION)
- Founded (PERSON, ORGANIZATION)
- Cures (DRUG, DISEASE)
Start with Named Entity tags to help extract relations

NER for Extracting Relations?

Named Entities aren’t quite enough
Which relations hold between 2 entities?

Extracting Richer Relations Using Rules and Named Entities

Hand-built Patterns for Relations

Pros
- Human patterns tend to be high-precision
- Can be tailored to specific domains
Cons
- Human patterns are often low-recall
- A lot of work to think of all possible patterns
- Don’t want to have to do this for every relation
- We’d like better accuracy

Supervised Relation Extraction

Choose a set of relations we’d like to extract
Choose a set of relevant named entities
Find and label data
- Choose a representative corpus
- Label the named entities in the corpus
- Hand-label the relations between those entities
- Break into training, development, and test
Train a classifier on the training set

Classification in Supervised Relation Extraction

Find all pairs of named entities (usually in same sentence)
Decide if 2 entities are related
If yes, classify the relation
- Why the extra step?
  - Faster classification training by eliminating most pairs
  - Can use distinct feature-sets appropriate for each task
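
The two-stage setup can be sketched as a small pipeline. Both classifiers below are hypothetical stand-ins that look for trigger phrases in the text between two mentions; a real system would use trained models with separate feature sets. The last two words of the sample sentence and the relation names are invented for illustration.

```python
from itertools import combinations

# Hypothetical trigger phrases standing in for two trained classifiers.
TRIGGERS = {"leader of": "holds-office-in", "headquarters in": "located-in"}


def context_between(sentence, first, second):
    """Text between two mentions, where `first` occurs before `second`."""
    start = sentence.index(first) + len(first)
    return sentence[start:sentence.index(second, start)]


def is_related(context, max_tokens=4):
    """Stage 1 stand-in: a cheap binary filter that rejects most pairs."""
    is_short = len(context.split()) <= max_tokens
    return is_short and any(trigger in context for trigger in TRIGGERS)


def classify_relation(context):
    """Stage 2 stand-in: pick a label; only called on pairs that passed."""
    return next(rel for trigger, rel in TRIGGERS.items() if trigger in context)


def extract_relations(sentence, entities):
    """`entities` lists the mentions in the order they appear in `sentence`."""
    relations = []
    for first, second in combinations(entities, 2):
        context = context_between(sentence, first, second)
        if is_related(context):
            relations.append((first, classify_relation(context), second))
    return relations


sentence = "Vuk Draskovic, leader of the Serbian Renewal Movement, visited Brussels"
entities = ["Vuk Draskovic", "Serbian Renewal Movement", "Brussels"]
print(extract_relations(sentence, entities))
```

Three entities give three candidate pairs, and only one survives stage 1, so the more expensive relation classifier runs once instead of three times.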

Relation Extraction

Features for Relation Extraction

Classifier for Supervised Method

Now you can use any classifier you like
- MaxEnt
- Naive Bayes
- SVM
- Neural Network
- BERT
Train it on the training set, tune on the dev set, test on the test set

Example: Neural Relation Extraction

Supervised Relation Extraction

Pros
- Can get high accuracies with enough hand-labeled training data, if the test data is similar enough to the training data
Cons
- Labeling a large training set is expensive
- Supervised models are brittle, don’t generalize well to different genres

Seed-based Or Bootstrapping Approaches To Relation Extraction

No training set? Maybe you have
- A few seed tuples or
- A few high-precision patterns
Can you use those seeds to do something useful?
Bootstrapping: use the seeds to directly learn to populate a relation

Relation Bootstrapping (Hearst 1992)

Gather a set of seed pairs that have relation R
Iterate:
- Find sentences with these pairs
- Look at the context between or around the pair and generalize the context to create patterns
- Use the patterns to grep for more pairs

Bootstrapping

<Mark Twain, Elmira> Seed tuple
Grep for the environments of the seed tuple
Use those patterns to grep for new tuples
Iterate
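
The loop above can be sketched with plain regular expressions. Only the seed tuple comes from the text; the four corpus sentences are hypothetical, entities are approximated as runs of capitalized words, and a "pattern" is simply the literal text found between the two members of a known pair.

```python
import re

ENTITY = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"  # run of capitalized words


def learn_patterns(corpus, pairs):
    """Collect the text that appears between the members of known pairs."""
    patterns = set()
    for sentence in corpus:
        for x, y in pairs:
            m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(corpus, patterns):
    """Grep for new pairs: any two entities joined by a learned pattern."""
    pairs = set()
    for sentence in corpus:
        for middle in patterns:
            for m in re.finditer(ENTITY + re.escape(middle) + ENTITY, sentence):
                pairs.add((m.group(1), m.group(2)))
    return pairs


def bootstrap(corpus, seeds, iterations=2):
    pairs = set(seeds)
    for _ in range(iterations):
        patterns = learn_patterns(corpus, pairs)
        pairs |= apply_patterns(corpus, patterns)
    return pairs


corpus = [  # hypothetical sentences
    "Mark Twain is buried in Elmira.",
    "Emily Dickinson is buried in Amherst.",
    "The grave of Mark Twain is in Elmira.",
    "The grave of Louisa Alcott is in Concord.",
]
print(sorted(bootstrap(corpus, {("Mark Twain", "Elmira")})))
```

One of the two learned patterns, " is in ", is far too general and would match many unrelated sentences in a larger corpus. This is the usual weakness of bootstrapping, and practical systems score patterns and tuples by confidence before accepting them.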

DIPRE: Extract (author, book) pairs

Unsupervised Relation Extraction

Open Information Extraction
- Extract relations from the web with no training data, no list of relations

Use parsed data to train a “trustworthy tuple” classifier
Single-pass extract all relations between NPs, keep if trustworthy
Assessor ranks relations based on text redundancy

LEE CHANWOO

Other topics in IE

Extracting Times

Extracting Events and their Times

Entity Linking

Korean NER with BERT
