Information Extraction

The task of extracting structured information from unstructured documents


Information extraction (IE) systems

  • Find and understand limited relevant parts of texts
  • Gather information from many pieces of text
  • Produce a structured representation of relevant information:
    • Relations (in the database sense), a.k.a.,
    • A knowledge base

Goals of Information extraction (IE) systems

  • Organize information so that it is useful to people
  • Put information in a semantically precise form that allows further inferences to be made by computer algorithms

Example of IE

Classified Advertisements (Real Estate)

  • Plain text advertisements
  • Lowest common denominator: only thing that 70+ newspapers using many different publishing systems can all handle


Why doesn’t text search (IR) work?

What you search for in real estate advertisements:

  • Town/suburb. You might think easy, but:
    • Real estate agents: Coldwell Banker, Mosman
    • Phrases: Only 45 minutes from Parramatta
    • Multiple property ads have different suburbs in one ad
  • Money: want a range not textual match
    • Multiple amounts: was $155K, now $145K
    • Variations: offers in the high 700s [but not rents for $270]
  • Bedrooms: similar issues: br, bdr, beds, B/R
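
The bedroom variants above can be normalized with a small pattern. This is a minimal sketch: the name `BEDROOMS` and the exact list of abbreviations are assumptions for illustration, and real advertisements will contain variants it misses.

```python
import re

# Hypothetical pattern: a number followed by one of the abbreviations
# listed on the slide (br, bdr, beds, B/R), plus the full word.
BEDROOMS = re.compile(r"(\d+)\s*(?:bedrooms?|beds?|bdr|br|b/r)\b", re.IGNORECASE)

def bedroom_counts(ad_text):
    """Return every bedroom count mentioned in the ad as an integer."""
    return [int(n) for n in BEDROOMS.findall(ad_text)]

print(bedroom_counts("3 br, 2 bdr, 4 beds, 1 B/R"))
```

Once the count is an integer, a search can ask for "at least 3 bedrooms", which a plain text match cannot express.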

Named Entity Recognition (NER)

Named Entity Recognition

  • Example


  • Find entities in text


  • Classify entities in text


Task: Predict entities in a text


Standard Methods for NER

  • Hand-written regular expressions
    • Perhaps stacked
  • Using classifiers
    • Generative: Naive Bayes
    • Discriminative: Maxent models
  • Sequence models
    • HMMs
    • CMMs/MEMMs
    • CRFs

Hand-written regular expressions

Hand-written Patterns for Information

  • If extracting from automatically generated web pages, simple regex patterns usually work
    • Amazon page
    • <div class="buying"><h1 class="parseasinTitle"><span id="btAsinTitle" style="">(.*?)</span></h1>
  • For certain restricted, common types of entities in unstructured text, simple regex patterns also usually work
    • Finding (US) phone numbers
    • (?:\(?[0-9]{3}\)?[ -.])?[0-9]{3}[ -.]?[0-9]{4}
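
As a rough illustration, the phone-number pattern can be tried out in Python. The pattern below writes the optional parentheses around the area code with backslash escapes; the function name and the sample sentence are made up for this sketch.

```python
import re

# Optional area code (with optional parentheses), then 3 digits,
# an optional separator, and 4 digits.
PHONE_RE = re.compile(r"(?:\(?[0-9]{3}\)?[ -.])?[0-9]{3}[ -.]?[0-9]{4}")

def find_phone_numbers(text):
    """Return every substring that matches the phone pattern."""
    return PHONE_RE.findall(text)

sample = "Call (650) 723-2300 or 555-1234 before noon."
print(find_phone_numbers(sample))
```

Note that `[ -.]` is a character range from space to period, so it accepts a few more separators than space, hyphen, and period. The pattern will also match inside longer digit strings, so it is a starting point rather than a validator.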

Natural Language Processing-based Hand-Written Information Extraction

  • For unstructured human-written text, some NLP may help
    • Part-of-speech (POS) tagging
      • Mark each word as a noun, verb, preposition, etc.
    • Syntactic parsing
      • Identify phrases: NP, VP, PP
    • Semantic word categories (e.g. from WordNet)
      • KILL: kill, murder, assassinate, strangle, suffocate

Rule-based Extraction Examples

  • Determining which person holds what office in what organization
    • [person], [office], of [org]
      • Vuk Draskovic, leader of the Serbian Renewal Movement
    • [org] (named, appointed, etc.) [person] Prep [office]
      • NATO appointed Wesley Clark as Commander in Chief
  • Determining where an organization is located
    • [org] in [loc]
      • NATO headquarters in Brussels
    • [org] [loc] (division, branch, headquarters, etc.)
      • KFOR Kosovo headquarters
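
A minimal sketch of the first rule, "[person], [office] of [org]", as a regular expression. It assumes that a run of capitalized words is an acceptable stand-in for a recognized entity; that shortcut, and the names `CAP` and `extract_office`, are hypothetical and not part of the original rules.

```python
import re

# Hypothetical stand-in for entity recognition: a run of capitalized words.
CAP = r"[A-Z]\w*(?: [A-Z]\w*)*"

# [person], [office] of [org]
PERSON_OFFICE_ORG = re.compile(
    rf"(?P<person>{CAP}), (?P<office>[a-z]+(?: [a-z]+)*) of (?:the )?(?P<org>{CAP})"
)

def extract_office(sentence):
    """Return (person, office, org) if the rule fires, else None."""
    m = PERSON_OFFICE_ORG.search(sentence)
    if m is None:
        return None
    return (m.group("person"), m.group("office"), m.group("org"))

print(extract_office("Vuk Draskovic, leader of the Serbian Renewal Movement"))
```

In a fuller system the capitalized-word shortcut would be replaced by NER output, so the rule matches entity types rather than surface capitalization.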

Using classifiers

Information Extraction as Text Classification

  • Use conventional classification algorithms to classify substrings of a document as “to be extracted” or not.
  • In some simple but compelling domains, this naive technique is remarkably effective.
    • But do think about when it would and wouldn’t work!


“Change of Address” Email



Change-of-Address Detection Results

  • Corpus of 36 CoA emails and 5720 non-CoA emails
    • Results from 2-fold cross-validation (train on half, test on other half)
    • Very skewed distribution intended to be realistic
    • Note very limited training data: only 18 training CoA messages per fold
    • 36 CoA messages have 86 email addresses; old, new, and miscellaneous


Sequence models

The ML Sequence Model Approach to NER

  • Training
    1. Collect a set of representative training documents
    2. Label each token for its entity class or other (O)
    3. Design feature extractors appropriate to the text and classes
    4. Train a sequence classifier to predict the labels from the data
  • Testing
    1. Receive a set of testing documents
    2. Run sequence model inference to label each token
    3. Appropriately output the recognized entities
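
The last testing step, turning per-token labels into entities, can be sketched as follows. This assumes the simplest labeling scheme, where each token carries its entity class or O; the function name is hypothetical.

```python
def labels_to_entities(tokens, labels):
    """Group consecutive tokens that share a non-O label into entities."""
    entities = []
    current_tokens, current_label = [], "O"
    for token, label in zip(tokens, labels):
        if label != current_label:
            # The label changed: close the entity in progress, if any.
            if current_label != "O":
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], label
        if label != "O":
            current_tokens.append(token)
    # Close an entity that runs to the end of the sentence.
    if current_label != "O":
        entities.append((" ".join(current_tokens), current_label))
    return entities

tokens = ["NATO", "appointed", "Wesley", "Clark", "in", "Brussels"]
labels = ["ORG", "O", "PER", "PER", "O", "LOC"]
print(labels_to_entities(tokens, labels))
```

Under this scheme two adjacent entities of the same class are merged into one, which is the usual motivation for richer encodings that mark entity beginnings.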

Encoding Classes for Sequence Labeling


Features for Sequence Labeling

  • Word
    • Current word (essentially like a learned dictionary)
    • Previous/next word (context)
  • Other kinds of inferred linguistic classification
    • Part-of-speech tags
    • Previous (and perhaps next) label
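
A feature extractor covering the features listed above might look like this. It is a sketch only: the feature names, the lowercasing, and the boundary symbols `<S>` and `</S>` are assumptions, and real systems typically add many more features.

```python
def token_features(tokens, pos_tags, i, prev_label):
    """Features for position i: current, previous, and next word,
    the part-of-speech tag, and the previously assigned label."""
    return {
        "word": tokens[i].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</S>",
        "pos": pos_tags[i],
        "prev_label": prev_label,
    }

tokens = ["NATO", "appointed", "Wesley", "Clark"]
pos_tags = ["NNP", "VBD", "NNP", "NNP"]
print(token_features(tokens, pos_tags, 2, "O"))
```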

Inference in Systems


Inference Methods


Greedy inference

  • We just start at the left, and use our classifier at each position to assign a label
  • The classifier can depend on previous labeling decisions as well as observed data

  • Advantages
    • Fast, no extra memory requirements
    • Very easy to implement
    • With rich features including observations to the right, it may perform quite well
  • Disadvantage
    • Greedy. We may commit errors we cannot recover from
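
A minimal sketch of greedy inference. The scorer `toy_score` is a hypothetical stand-in for a trained local classifier; it only looks at capitalization and the previous label.

```python
LABELS = ["O", "PER"]

def toy_score(prev_label, word, label):
    """Hypothetical local score; a trained classifier would go here."""
    score = 0.0
    if label == "PER" and word[0].isupper():
        score += 1.0
    if label == "PER" and prev_label == "PER":
        score += 0.5
    if label == "O" and not word[0].isupper():
        score += 1.0
    return score

def greedy_decode(words, score=toy_score, labels=LABELS):
    """Label left to right, committing to the best label at each position."""
    output, prev = [], "<S>"
    for word in words:
        # The decision depends on the word and the previous decision only.
        prev = max(labels, key=lambda lab: score(prev, word, lab))
        output.append(prev)
    return output

print(greedy_decode(["Wesley", "Clark", "spoke"]))
```

Each decision is final, which is why the method is fast and also why an early mistake propagates to later positions.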

Beam inference

  • At each position keep the top k complete sequences.
  • Extend each sequence in each local way.
  • The extensions compete for the k slots at the next position

  • Advantages
    • Fast; beam sizes of 3-5 are almost as good as exact inference in many cases
    • Easy to implement (no dynamic programming required)
  • Disadvantage
    • Inexact: the globally best sequence can fall off the beam
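
Beam inference can be sketched in a few lines. As an assumption for illustration, sequence scores are sums of local scores, and `toy_score` is a hypothetical scorer standing in for a trained model.

```python
def toy_score(prev_label, word, label):
    """Hypothetical local score; a trained classifier would go here."""
    capitalized = word[0].isupper()
    if label == "PER":
        return (1.0 if capitalized else 0.0) + (0.5 if prev_label == "PER" else 0.0)
    return 0.0 if capitalized else 1.0

def beam_decode(words, score, labels, k=3):
    """Keep the k highest-scoring label sequences at each position."""
    beam = [(0.0, [])]  # (total score, labels so far)
    for word in words:
        candidates = []
        for total, seq in beam:
            prev = seq[-1] if seq else "<S>"
            # Extend each sequence in the beam with every possible label.
            for lab in labels:
                candidates.append((total + score(prev, word, lab), seq + [lab]))
        # The extensions compete for the k slots at the next position.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:k]
    return beam[0][1]

print(beam_decode(["Wesley", "Clark", "spoke"], toy_score, ["O", "PER"], k=2))
```

With `k=1` this reduces to greedy inference; larger `k` trades time for a better chance of keeping the globally best sequence.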

Viterbi inference

  • Dynamic programming or memoization
  • Requires small window of state influence (e.g., past two states are relevant)

  • Advantages
    • Exact: the global best sequence is returned
  • Disadvantage
    • Harder to implement long-distance state-state interactions (but beam inference tends not to allow long-distance resurrection of sequences anyway)
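
A compact Viterbi sketch, assuming the score at each position depends only on the previous label (a window of one state) and that sequence scores are sums of local scores. For brevity it stores the best sequence per label instead of backpointers; `toy_score` is a hypothetical scorer.

```python
def toy_score(prev_label, word, label):
    """Hypothetical local score; a trained classifier would go here."""
    capitalized = word[0].isupper()
    if label == "PER":
        return (1.0 if capitalized else 0.0) + (0.5 if prev_label == "PER" else 0.0)
    return 0.0 if capitalized else 1.0

def viterbi_decode(words, score, labels):
    """Return the highest-scoring label sequence for a non-empty word list."""
    # best[lab] = (score of the best sequence ending in lab, that sequence)
    best = {lab: (score("<S>", words[0], lab), [lab]) for lab in labels}
    for word in words[1:]:
        new_best = {}
        for lab in labels:
            # Best way to reach `lab` here, over all previous labels.
            new_best[lab] = max(
                ((total + score(prev, word, lab), seq + [lab])
                 for prev, (total, seq) in best.items()),
                key=lambda c: c[0],
            )
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]

print(viterbi_decode(["Wesley", "Clark", "spoke"], toy_score, ["O", "PER"]))
```

The table has one entry per label, so the cost per position is quadratic in the number of labels; widening the state window multiplies the table size, which is the practical obstacle to long-distance interactions.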

Neural Approaches for NER



Relation Extraction

Relation Extraction

Checking if groupings of entities are instances of a relation

  • Manually engineered rules
    • Rules defined over words/entities: "<company> located in <location>"
    • Rules defined over parsed text: "((Obj <company>) (Verb located) (*) (Subj <location>))"
  • Machine Learning-based
    • Supervised: Learn relation classifier from examples
    • Partially-supervised: bootstrap rules/patterns from “seed” examples

Disease Outbreaks


Protein Interactions


Binary Relation Association as Binary Classification


Resolving Coreference


Extracting Relation Triples


Why Relation Extraction?

  • Create new structured knowledge bases, useful for many applications
  • Augment current knowledge bases
    • Adding words to WordNet thesaurus, facts to FreeBase or DBPedia
  • Support question answering
    • The granddaughter of which actor starred in the movie “E.T.”?
    • (acted-in ?x "E.T.")(is-a ?y actor)(granddaughter-of ?x ?y)
  • But which relations should we extract?

Automated Content Extraction (ACE)



UMLS: Unified Medical Language System



Databases of Wikipedia


Relation Databases that Draw From Wikipedia

  • Resource Description Framework (RDF) triples
    • subject predicate object
    • Golden Gate Park location San Francisco
    • dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
  • DBPedia: 1 billion RDF triples, 385 million from English Wikipedia
  • Frequent Freebase Relations
    • people/person/nationality
    • location/location/contains
    • people/person/profession
    • people/person/place-of-birth
    • biology/organism_higher_classification
    • film/film/genre
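
To make the subject-predicate-object structure concrete, triples can be held as plain tuples and queried with wildcards. This is a toy in-memory sketch, not an RDF library; the `match` helper is hypothetical and the second triple is added only for illustration.

```python
# Each triple is (subject, predicate, object).
triples = [
    ("Golden_Gate_Park", "location", "San_Francisco"),
    ("San_Francisco", "instance-of", "city"),  # illustrative extra triple
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the given fields; None is a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

# Everything located in San Francisco:
print(match(triples, predicate="location", obj="San_Francisco"))
```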

Ontological Relations

  • IS-A (hypernym): subsumption between classes
    • Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal …
  • Instance-of: relation between individual and class
    • San Francisco instance-of city

How to build relation extractors

  • Hand-written patterns
  • Supervised machine learning
  • Semi-supervised and unsupervised
    • Bootstrapping (using seeds)
    • Distant supervision
    • Unsupervised learning from the web

Rules for Extracting IS-A Relation

  • Early intuition from Hearst (1992)
    • “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
  • What does Gelidium mean?
  • How do you know?
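
The cue in the Agar sentence is the phrase "such as". A rough sketch of turning that cue into a rule follows; it approximates the noun phrase before the comma by its last word, which is an assumption made to keep the example short (a fuller version would use noun-phrase chunking).

```python
import re

# "X, such as Y": Y is read as a kind of X.
# Only the last word before the comma is captured for X.
SUCH_AS = re.compile(r"(\w+), such as (\w+)")

def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' cue."""
    return [(hyponym, hypernym) for hypernym, hyponym in SUCH_AS.findall(sentence)]

sentence = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium, for laboratory or industrial use")
print(hearst_such_as(sentence))
```

Here the rule yields Gelidium IS-A algae; the intended hypernym is the full phrase "red algae", which the single-word approximation cannot recover.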

Hearst’s Patterns for Extracting IS-A Relations



Extracting Richer Relations Using Rules

  • Intuition: relations often hold between specific entities
    • Cures (DRUG, DISEASE)
  • Start with Named Entity tags to help extract relations

NER for Extracting Relations?

  • Named Entities aren’t quite enough
  • Which relations hold between 2 entities?


Extracting Richer Relations Using Rules and Named Entities


Hand-built Patterns for Relations

  • Pros
    • Human patterns tend to be high-precision
    • Can be tailored to specific domains
  • Cons
    • Human patterns are often low-recall
    • A lot of work to think of all possible patterns
    • Don’t want to have to do this for every relation
    • We’d like better accuracy

Supervised Relation Extraction

  • Choose a set of relations we’d like to extract
  • Choose a set of relevant named entities
  • Find and label data
    • Choose a representative corpus
    • Label the named entities in the corpus
    • Hand-label the relations between those entities
    • Break into training, development, and test
  • Train a classifier on the training set

Classification in Supervised Relation Extraction

  1. Find all pairs of named entities (usually in same sentence)
  2. Decide if 2 entities are related
  3. If yes, classify the relation
    • Why the extra step?
      • Faster classification training by eliminating most pairs
      • Can use distinct feature-sets appropriate for each task

Relation Extraction


Features for Relation Extraction


Classifier for Supervised Method

  • Now you can use any classifier you like
    • MaxEnt
    • Naive Bayes
    • SVM
    • Neural Network
    • BERT
  • Train it on the training set, tune on the dev set, test on the test set

Example: Neural Relation Extraction


Supervised Relation Extraction

  • Pros
    • Can get high accuracies with enough hand-labeled training data, if the test data is similar enough to the training data
  • Cons
    • Labeling a large training set is expensive
    • Supervised models are brittle, don’t generalize well to different genres

Seed-based Or Bootstrapping Approaches To Relation Extraction

  • No training set? Maybe you have
    • A few seed tuples or
    • A few high-precision patterns
  • Can you use those seeds to do something useful?
  • Bootstrapping: use the seeds to directly learn to populate a relation

Relation Bootstrapping (Hearst 1992)

  • Gather a set of seed pairs that have relation R
  • Iterate:
    • Find sentences with these pairs
    • Look at the context between or around the pair and generalize the context to create patterns
    • Use the patterns to grep for more pairs


  • <Mark Twain, Elmira> Seed tuple
  • Grep for the environments of the seed tuple
  • Use those patterns to grep for new tuples
  • Iterate
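
The loop can be sketched on a toy corpus. Several simplifications are assumed here: a pattern is just the literal text between the two members of a pair, an entity is a run of capitalized words, and the second sentence uses made-up placeholder names.

```python
import re

def bootstrap(sentences, seeds, rounds=2):
    """Grow a set of (x, y) pairs from seed pairs via the text between them."""
    pairs = set(seeds)
    entity = r"([A-Z]\w*(?: [A-Z]\w*)*)"  # hypothetical: capitalized words
    for _ in range(rounds):
        # Find sentences containing a known pair; keep the text in between.
        patterns = set()
        for x, y in pairs:
            for s in sentences:
                m = re.search(re.escape(x) + r"(.{1,30}?)" + re.escape(y), s)
                if m:
                    patterns.add(m.group(1))
        # Use those contexts as patterns to grep for new pairs.
        for p in patterns:
            regex = re.compile(entity + re.escape(p) + entity)
            for s in sentences:
                pairs.update(regex.findall(s))
    return pairs

corpus = [
    "Mark Twain is buried in Elmira",
    "Jane Doe is buried in Springfield",  # made-up placeholder sentence
]
print(bootstrap(corpus, {("Mark Twain", "Elmira")}))
```

Nothing in this sketch scores patterns or pairs, so one bad pattern can pull in wrong pairs that then generate more bad patterns; practical systems add confidence estimates to limit that drift.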

DIPRE: Extract (author, book) pairs


Unsupervised Relation Extraction

  • Open Information Extraction
    • Extract relations from the web with no training data, no list of relations
  1. Use parsed data to train a “trustworthy tuple” classifier
  2. Single-pass extract all relations between NPs, keep if trustworthy
  3. Assessor ranks relations based on text redundancy


Other topics in IE

Extracting Times

  • Temporal expression extraction
  • Temporal normalization


Extracting Events and their Times

Event extraction: identify mentions of events in text

  • An event mention is any expression denoting an event or state that can be assigned to a particular point, or interval, in time
  • Events are to be classified as actions, states, reporting events (say, report, tell, explain), perception reporting events, and so on


Entity Linking

Task: Given a database of candidate referents, identify the correct referent for a mention in context




Korean NER with BERT

BERT + CRF (conditional random field)

