[NLP] Question Answering
Question Answering
- The task of answering a (natural language) question
- One of the oldest NLP tasks (punched card systems in 1961)
- Idea: do dependency parsing on the question and search for the most similar answer
QUESTION ANSWERING
- Non-factoid QA: answer is free text - "why is Dracula so evil?"
QA subtypes (could be factoid or non-factoid):
- Semantic parsing: question is mapped to a logical form which is then executed over some database
- “how many people did Dracula bite?”
- Reading comprehension: answer is a span of text within a document
- Community-based QA: question is answered by multiple web users
- Visual QA: questions about images
Reading Comprehension
Textual Question Answering
Conversational Question Answering
Long-form Question answering
Open-domain Question Answering
Knowledge Base Question Answering
Table-based Question Answering
Visual Question Answering
Stanford Question Answering Dataset (SQuAD)
METHODS FOR QUESTION ANSWERING
Feature Based Methods
Bi-Directional Attention Flow for Machine Comprehension
Coattention Encoder
BiLSTM-based Models
- Encode the question using word/char embeddings
- Pass it through a BiLSTM encoder
- Encode the passage similarly
- Passage-to-question and question-to-passage attention
- Modeling layer: another BiLSTM layer
- Output layer: two classifiers for predicting start and end points
- The entire model can be trained in an end-to-end way (see the sketch after this list)
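A minimal PyTorch sketch of this BiLSTM span-prediction pipeline, with the full bidirectional attention flow simplified to a basic bilinear passage-to-question attention; all module names and dimensions are illustrative assumptions, not taken from the original papers:

```python
import torch
import torch.nn as nn

class BiLSTMSpanPredictor(nn.Module):
    """Simplified BiLSTM reader: encode question and passage,
    attend, model, then predict answer start/end positions."""
    def __init__(self, vocab_size, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.q_enc = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.p_enc = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        # Modeling layer: another BiLSTM over the attention-augmented passage
        self.model_enc = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)
        self.start_clf = nn.Linear(2 * hidden, 1)
        self.end_clf = nn.Linear(2 * hidden, 1)

    def forward(self, passage_ids, question_ids):
        p, _ = self.p_enc(self.embed(passage_ids))   # (B, Lp, 2H)
        q, _ = self.q_enc(self.embed(question_ids))  # (B, Lq, 2H)
        # Passage-to-question attention: each passage token attends over question tokens
        scores = torch.bmm(p, q.transpose(1, 2))     # (B, Lp, Lq)
        attn = torch.softmax(scores, dim=-1)
        q_aware = torch.bmm(attn, q)                 # (B, Lp, 2H)
        m, _ = self.model_enc(torch.cat([p, q_aware], dim=-1))
        start_logits = self.start_clf(m).squeeze(-1)  # (B, Lp)
        end_logits = self.end_clf(m).squeeze(-1)      # (B, Lp)
        return start_logits, end_logits
```

The two output heads give per-token start and end logits; training with cross-entropy against the gold span positions is what makes the whole model end-to-end trainable, as noted above.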
BERT-based Models
- Concatenate question and passage as one single sequence separated with a [SEP] token, then pass it to the BERT encoder
- Train two classifiers on top of the passage token representations to predict the answer start and end (see the sketch below)
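A minimal sketch using the Hugging Face transformers pipeline API; the checkpoint name is just one example of a BERT-style extractive-QA model, and any similar checkpoint works:

```python
from transformers import pipeline

# The pipeline concatenates question and passage as
# [CLS] question [SEP] passage [SEP] and scores every passage
# token as a candidate answer start/end, as described above.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")  # example checkpoint

result = qa(question="How many people did Dracula bite?",
            context="According to the story, Dracula bit three people in the village.")
print(result["answer"], round(result["score"], 3))
```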
Experiments on SQuAD
Is Reading Comprehension Solved?
Question Answering Datasets
- Reading comprehension
- CNN/Daily Mail, CoQA, HotpotQA, QuAC, RACE, SQuAD, SWAG, RecipeQA, NarrativeQA, DROP, Story Cloze Test, …
- Open-domain question answering
- DuReader, Quasar, SearchQA, …
- Knowledge base question answering
- More datasets
- http://nlpprogress.com/english/question_answering.html
READING COMPREHENSION
CNN Article
- Dataset Statistics
- Articles were collected from April 2007 for CNN and June 2010 for Daily Mail, until the end of April 2015
- Validation data is from March 2015, test data from April 2015
- Question Difficulty
- Distribution (in percent) of queries over category and number of context sentences required to answer them, based on a subset of the CNN validation data
- Simple Baselines
Reading vs Encoding
Use neural encoding models for estimating the probability of word type a from document d answering query q:

p(a|d,q) ∝ exp(W(a) g(d,q)), where
- W(a) indexes row a of weight matrix W
- g(d,q) returns a vector embedding of a document and query pair
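A worked sketch of this scoring rule in Python; the dimensions are illustrative, and `g_dq` stands in for the output of whichever encoder computes g(d,q):

```python
import numpy as np

def answer_distribution(W, g_dq):
    """p(a|d,q) ∝ exp(W(a) · g(d,q)): a softmax over the
    candidate-answer vocabulary, one row of W per word type."""
    logits = W @ g_dq            # (vocab_size,)
    logits -= logits.max()       # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

vocab_size, dim = 10_000, 256
W = np.random.randn(vocab_size, dim) * 0.01   # answer embedding matrix
g_dq = np.random.randn(dim)                   # encoder output g(d, q)
p = answer_distribution(W, g_dq)
print(p.argmax(), p.max())  # most probable answer word type and its probability
```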
Deep LSTM Reader
Attentive Reader
Attentive Reader Training
Impatient Reader
Summary
- Supervised machine reading is a viable research direction with the available data
- LSTM-based recurrent networks constantly surprise with their ability to encode dependencies in sequences
- Attention is a very effective and flexible modeling technique
OPEN-DOMAIN QUESTION ANSWERING
- Simmons et al. (1964) conducted the first exploration of answering questions from an expository text, based on matching dependency parses of a question and answer
- Murax (Kupiec, 1993) aimed to answer questions over an online encyclopedia using IR and shallow linguistic processing
- The NIST TREC QA track, begun in 1999, first rigorously investigated answering fact questions over a large collection of documents
- IBM's Jeopardy! system (DeepQA, 2011) brought attention to a version of the problem; it used an ensemble of many methods
- DrQA (Chen et al., 2017) uses IR followed by neural reading comprehension to bring deep learning to open-domain QA
IBM’s Watson and Jeopardy! Challenge
Compared to Reading Comprehension?
- Challenges from both large-scale open-domain QA and machine comprehension
- The question can be any open-domain question
- instead of a question posed after reading a given passage
Traditional QA System
QA System
DrQA: Two-Stage Retriever and Reader
Document Retriever: Two Steps
- TF-IDF bag-of-words vector
- Efficient bigram hashing (Weinberger et al., 2009)
- Map each bigram to 2^24 bins with an unsigned murmur3 hash
- Preserves speed and memory efficiency
- Murmur3 maps a word or string to a 32-bit or 128-bit value (see the hashing sketch below)
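A minimal sketch of the hashed-bigram featurization, using the third-party mmh3 MurmurHash3 binding; the function name and tokenization here are illustrative, not DrQA's actual code:

```python
import mmh3  # pip install mmh3 -- MurmurHash3 bindings

NUM_BINS = 2 ** 24  # feature-hashing space for bigrams

def hashed_bigram_ids(tokens):
    """Map each token bigram to one of 2^24 bins with an unsigned
    murmur3 hash (feature hashing, Weinberger et al., 2009)."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return [mmh3.hash(bg, signed=False) % NUM_BINS for bg in bigrams]

tokens = "how many people did dracula bite".split()
print(hashed_bigram_ids(tokens))  # sparse column indices for the TF-IDF matrix
```

Hashing keeps the feature matrix a fixed size regardless of how many distinct bigrams the corpus contains, which is what preserves speed and memory efficiency.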
Document Reader
Paragraph Encoding
- Represent tokens in a paragraph as a sequence of feature vectors
- Word embedding
- Exact match
- Token features
- Aligned question embedding
- Pass the features as input to an RNN (multi-layer bidirectional LSTM); see the sketch below
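A minimal sketch of assembling these per-token feature vectors and feeding them to a BiLSTM; the token-feature set and all dimensions are illustrative assumptions, not the exact DrQA features:

```python
import torch
import torch.nn as nn

def paragraph_features(p_emb, q_emb, exact_match, token_feats):
    """Concatenate DrQA-style per-token features for one paragraph.
    p_emb:       (Lp, D)  word embeddings of paragraph tokens
    q_emb:       (Lq, D)  word embeddings of question tokens
    exact_match: (Lp, 1)  1.0 if the token also appears in the question
    token_feats: (Lp, F)  e.g. POS/NER one-hots, term frequency
    """
    # Aligned question embedding: soft attention of each paragraph
    # token over the question words in embedding space
    scores = p_emb @ q_emb.T                          # (Lp, Lq)
    aligned = torch.softmax(scores, dim=-1) @ q_emb   # (Lp, D)
    return torch.cat([p_emb, exact_match, token_feats, aligned], dim=-1)

Lp, Lq, D, F = 30, 8, 300, 20
feats = paragraph_features(torch.randn(Lp, D), torch.randn(Lq, D),
                           torch.randint(0, 2, (Lp, 1)).float(),
                           torch.randn(Lp, F))
rnn = nn.LSTM(feats.size(-1), 128, num_layers=3,
              bidirectional=True, batch_first=True)
p_hidden, _ = rnn(feats.unsqueeze(0))  # (1, Lp, 256) contextual token vectors
```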
Question Encoding
- Apply another RNN on top of the question word embeddings {q_i} to get hidden units {q_j}
- Combine the resulting hidden units into one single vector q = Σ_j b_j q_j, where the weights b_j are learned attention scores (see the sketch below)
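A minimal sketch of this self-attentive pooling; module names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Run a BiLSTM over question embeddings, then pool the hidden
    states into one vector q = sum_j b_j * q_j, where
    b_j = softmax_j(w . q_j) for a learned scoring vector w."""
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.w = nn.Linear(2 * hidden, 1, bias=False)  # scoring vector w

    def forward(self, q_emb):                  # (B, Lq, emb_dim)
        h, _ = self.rnn(q_emb)                 # (B, Lq, 2H)
        b = torch.softmax(self.w(h), dim=1)    # (B, Lq, 1) attention weights
        return (b * h).sum(dim=1)              # (B, 2H) single question vector

enc = QuestionEncoder()
q_vec = enc(torch.randn(4, 8, 300))  # batch of 4 questions, 8 tokens each
print(q_vec.shape)                   # torch.Size([4, 256])
```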
Prediction
Dataset and Example Training Data & Result
Summary: DrQA
- DrQA was the first attempt to scale up reading comprehension to open-domain question answering, by combining IR techniques and neural reading comprehension models
- Although we achieved good accuracy on SQuAD in 2017, the final QA accuracy still remains low:
- Distant supervision + multi-task learning helps
Limitations of Current Models
Latent Retrieval for Weakly Supervised Open Domain Question Answering
Information Retrieval (IR)
Reader (QA)
Result
MULTI-HOP QUESTION ANSWERING
- Very few SQuAD questions actually require combining multiple pieces of information
- This is an important capability QA systems should have
- Several datasets test multi-hop reasoning
- Ability to answer questions that draw on several sentences or several documents to answer
WikiHop
- Annotators were shown Wikipedia and asked to pose a simple question linking two entities that requires a third (bridging) entity to associate them
- A model shouldn’t be able to answer these without doing some reasoning about the intermediate entity
HotpotQA
Multi-Hop Reasoning
- This is an idealized version of multi-hop reasoning
- Do models need to do this to do well on this task?
- A model can ignore the bridging entity and directly predict the answer:
- there is no simple lexical overlap,
- but only one government position appears in the context
NEW TYPES OF QA
DROP
- Let’s build QA datasets to help the community focus on modeling particular things
- Question types: subtraction, comparison (which did he visit first), counting and sorting (which kicker kicked more field goals), …
- Invites ad hoc solutions (structure the model around predicting differences between numbers)
NarrativeQA
- Humans see a summary of a book
- … Peter’s former girlfriend Dana Barrett has had a son, Oscar …
- Question: How is Oscar related to Dana?
- Answering these questions from the source text (not the summary) requires complex inferences and is extremely challenging
Summary
- Lots of problems with current QA settings, lots of new datasets
- Models can often work well for one QA task but don’t generalize
- We still do not have (solvable) QA settings which seem to require really complex reasoning as opposed to surface-level pattern recognition