[NLP] Conversation Model
Conversation Model
Dialogue Systems
The task of generating responses to carry on a conversation with a human
- Turing test
Ability to understand and generate language as a marker of intelligence: “Can machines think?”
ELIZA
- Created 1964-1966 at MIT, heavily scripted
- DOCTOR script was most successful: repeats user’s input, asks silly questions
- Identify keyword, Identify context, apply transformation rule
- Very little need to generate new content, but can only have one type of conversation
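A toy sketch of the keyword-and-transformation loop described above; the two rules here are made up for illustration and are not from the actual DOCTOR script.

```python
import re

# ELIZA-style loop: identify a keyword pattern, then apply its transformation rule.
RULES = [
    (re.compile(r"\bI am (.+)", re.I), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.I), "Do you often feel {0}?"),
]

def respond(utterance):
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."  # fallback when no keyword matches

print(respond("I am stuck on my homework"))
```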
Cleverbot
- Carpenter (1986), online system built in 2006
- “Nearest neighbors”: when a human says statement A, find a response to statement A in logged human-human or human-computer chats and repeat it
- Can often give sensible answers, but the bot doesn’t really impose high-level discourse structure
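A minimal sketch of the nearest-neighbors idea, assuming a tiny log of past statement-response pairs and TF-IDF cosine similarity; the real system's retrieval details are not described here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy chat log: past statements and the responses that followed them.
log_statements = ["how are you today", "what is your name", "tell me a joke"]
log_responses = ["i'm fine, thanks for asking", "people call me cleverbot", "why did the chicken cross the road?"]

vectorizer = TfidfVectorizer().fit(log_statements)
log_vectors = vectorizer.transform(log_statements)

def reply(user_utterance):
    # Find the most similar logged statement and repeat the response that followed it.
    similarities = cosine_similarity(vectorizer.transform([user_utterance]), log_vectors)
    return log_responses[similarities.argmax()]

print(reply("hi, how are you doing?"))
```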
Conversation Model (1)
Data-Driven Approaches
- Can treat as a machine translation problem
- “translate” from current utterance to next one
- Filter the data, use statistical measures to prune extracted phrases to get better performance
Seq2seq models
Just like conventional MT, can train seq2seq models for this task
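A minimal sketch of treating response generation as translation with an off-the-shelf encoder-decoder; the checkpoint name and decoding settings are placeholders, and in practice the model would first be fine-tuned on (utterance, response) pairs.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Any pretrained encoder-decoder can stand in for the seq2seq conversation model.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

utterance = "what are you doing tonight?"
inputs = tokenizer(utterance, return_tensors="pt")

# Decode the next utterance exactly as one would decode a translation.
output_ids = model.generate(**inputs, num_beams=5, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```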
Lack of Diversity
Training to maximize likelihood gives a system that prefers common responses
- Solution
- Mutual information criterion
- Response $R$ should be predictive of user utterance $U$ as well
- Standard conditional likelihood: $\log P(R \mid U)$
- Mutual information: $\log P(R \mid U)-\log P(R)$
- $\log P(R)$: probability of the response under a language model
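A sketch of reranking candidate responses with the mutual-information criterion, assuming two scoring functions are available: one for $\log P(R \mid U)$ (the seq2seq model) and one for $\log P(R)$ (a language model). The weight `lam` on the language-model term is a common generalization; the formula above corresponds to `lam = 1`.

```python
def mmi_rerank(candidates, log_p_r_given_u, log_p_r, lam=1.0):
    """Rerank responses by log P(R | U) - lam * log P(R).

    `candidates` are candidate responses (e.g., from beam search);
    `log_p_r_given_u(r)` and `log_p_r(r)` are assumed wrappers around a
    seq2seq model and a language model, respectively.
    """
    scored = [(log_p_r_given_u(r) - lam * log_p_r(r), r) for r in candidates]
    return [r for _, r in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```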
PersonaChat
Wizard of Wikipedia
Conversation Model
Motivation
1. Inconsistent personality
- Existing conversation models tend to generate inconsistent personal responses even when the speaker is the same
2. Average personality of all speakers
- Existing conversation models tend to generate the same responses even when the speakers are different
Idea
- Use a stochastic variable for the context from speakers
- Learns the context of conversations between two speakers
- Infers the context of new conversation from the speakers
- Provide speaker info to response generator indirectly
- Learns speakers’ preferences from their own utterances
- Infers the mixture of the speakers’ preferences and the context
VHUCM
- Variational Hierarchical User-based Conversation Model
VHUCM - Idea
- Use a stochastic variable for the context from speakers
- Provide speaker info to response generator indirectly
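A rough sketch of the “stochastic variable for the context from speakers” idea: embed the two speakers, parameterize a Gaussian, and sample a context vector that conditions the response decoder. Layer names and sizes are assumptions, not the actual VHUCM architecture.

```python
import torch
import torch.nn as nn

class SpeakerContext(nn.Module):
    """Sample a latent conversation-context vector z from the two speakers."""
    def __init__(self, num_speakers, dim=64):
        super().__init__()
        self.speaker_emb = nn.Embedding(num_speakers, dim)
        self.to_mu = nn.Linear(2 * dim, dim)
        self.to_logvar = nn.Linear(2 * dim, dim)

    def forward(self, speaker_a, speaker_b):
        pair = torch.cat([self.speaker_emb(speaker_a), self.speaker_emb(speaker_b)], dim=-1)
        mu, logvar = self.to_mu(pair), self.to_logvar(pair)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return z  # fed to the response decoder as extra conditioning
```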
Conversation Corpus
- Requirements of corpus
- Naturally-occurring conversations
- Many conversations between two speakers
- Multiple conversation partners of a speaker
Twitter Conversation Corpus
- A Twitter conversation
- Five or more tweets
- At least two replies by each user (filter sketched after the statistics below)
- Statistics
- 27K users
- 107K dyads
- 770K conversations
- 6M tweets
- 7 years
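A small sketch of the conversation filter implied above, interpreting “at least two replies by each user” as each of the two speakers contributing at least two tweets to the thread; the record format is an assumption.

```python
def keep_conversation(tweets):
    """tweets: list of (user_id, text) in thread order."""
    if len(tweets) < 5:                               # five or more tweets
        return False
    users = {user for user, _ in tweets}
    if len(users) != 2:                               # a dyadic conversation
        return False
    counts = {user: sum(1 for u, _ in tweets if u == user) for user in users}
    return all(count >= 2 for count in counts.values())  # two or more tweets per user
```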
Experiment - Personalized Response
- Experiment Setup
- Set two users as questioner and answerer
- Ask demographic questions
VHUCM - Result
Challenges
- Experiment Setup
- Set two users as questioner and answerer
- Ask relationship questions
- Top five answers of “Do you love me?” by VHUCM
- Whoever asks, VHUCM always discloses personal information
Meena (Google, Jan 2020)
Model
- Evolved Transformer seq2seq model
- 2.6B parameters
Data
- Social media conversation
- 876M context-response pairs
- 8K BPE unique subwords
- 341GB text file
- 61B BPE tokens (400B tokens for GPT-3)
Train
- Device: 2048 TPU cores
- 16GB memory per core (only 8 training examples fit per core)
- Data: 61B BPE tokens
- Time: 30 days
- Optimizer: Adafactor
- keep the initial learning rate for the first 10K steps
- Decay with the inverse square root of the number of steps (schedule sketched below)
- Others
- 0.1 dropout
- Tensor2Tensor code base
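A sketch of the learning-rate schedule described above: constant for the first 10K steps, then inverse-square-root decay. The base learning rate and the exact scaling at the switch point are assumptions.

```python
def rsqrt_decay_lr(step, base_lr=1e-3, constant_steps=10_000):
    # Keep the initial learning rate for the first 10K steps,
    # then decay with the inverse square root of the step number.
    if step <= constant_steps:
        return base_lr
    return base_lr * (constant_steps / step) ** 0.5
```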
BeamSearch
Sample-and-Rank
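Beam search keeps the highest-probability continuations; sample-and-rank instead draws several independent temperature samples and keeps the one with the best length-normalized log-likelihood. A sketch, with the model hidden behind two assumed wrapper functions:

```python
def sample_and_rank(sample_fn, logprob_fn, context, num_samples=20):
    """sample_fn(context) draws one temperature-sampled response;
    logprob_fn(context, response) returns its log-likelihood under the model.
    Both wrappers are assumptions about how the model is exposed."""
    candidates = [sample_fn(context) for _ in range(num_samples)]

    def length_normalized_score(response):
        return logprob_fn(context, response) / max(len(response.split()), 1)

    return max(candidates, key=length_normalized_score)
```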
LaMDA (Google, May 2021)
- Model: Transformers (similar to Meena)
- Data: Conversation corpus (Web documents for GPT-3)
- Features
- Specificity
- Factuality
- Interestingness (related to emotion)
- Sensibleness (related to emotion)
BlenderBot (Facebook, Apr 2020)
Model
- Generate: standard seq2seq Transformer model (BART-style)
- Retrieve: candidate responses for a given dialogue
- Blend: combine the above (retrieve-and-refine)
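A very rough sketch of blending retrieval and generation: fetch candidate responses, append them to the dialogue context, and let the seq2seq generator decide how much to reuse. Function names and the separator token are assumptions, not the actual BlenderBot interface.

```python
def retrieve_and_refine(retriever, generator, dialogue_history, top_k=1):
    """retriever(history, top_k) -> list of candidate response strings;
    generator(text) -> generated response. Both are assumed wrappers."""
    candidates = retriever(dialogue_history, top_k=top_k)
    # Condition generation on both the dialogue and the retrieved candidates.
    augmented_context = dialogue_history + " [RETRIEVED] " + " ".join(candidates)
    return generator(augmented_context)
```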
Data
- Pretraining
- Reddit discussion
- 1.5B comments
- 88.8B BPE tokens (61B for Meena, 400B for GPT-3)
- Fine-tuning
- ConvAI2 (140k utterances)
- Empathetic Dialogue (50k utterances)
- Wizard of Wikipedia (194k utterances)
Training
- Model size: 9.4B parameters (Meena: 2.6B)
- Platform: Fairseq toolkit
- Data: 88.8B BPE tokens
- Time: 200k SGD updates (with 2400 warmup steps)
- Optimizer: Adam
BlenderBot 2.0 (Facebook, July 2021)
Model
- Memorizes the context of multi-turn conversations
- Augments responses with external knowledge from the internet
Data
- Long-term Memory
- Multi-turn conversation with summary
- 300K utterances
- Internet-Augmented
- Wizard-Apprentice relationship
- 93K utterances
Conversation Model (2)
Evaluation Metrics in Machine Learning
- Classification - Class
- Accuracy
- Precision, Recall, F1
- Area Under Curve
- $\ldots$
- Regression - Number
- Mean Squared Error
- $R^2$
- Explained Variance
- $\ldots$
- Clustering - Cluster
- Purity
- Davies-Bouldin Index
- Jaccard Index
- $\ldots$
- Reinforcement Learning - Policy
- Total rewards
- Dispersion of Fixed Policy
- Conditional Value at Risk
- $\ldots$
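A few of the metrics above computed with scikit-learn on toy data, just to make the categories concrete (values are illustrative):

```python
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             r2_score, davies_bouldin_score)

# Classification
y_true, y_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred), f1_score(y_true, y_pred))

# Regression
targets, predictions = [2.0, 1.5, 3.0], [2.1, 1.2, 2.8]
print(mean_squared_error(targets, predictions), r2_score(targets, predictions))

# Clustering (lower Davies-Bouldin index is better)
X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]]
labels = [0, 0, 1, 1]
print(davies_bouldin_score(X, labels))
```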
Natural Language Generation
- Generate natural language text
- Machine translation
- Automatic summarization
- Conversation model
- Image captioning
- $\ldots$
Evaluation Metrics in Natural Language Generation
- Task-based evaluation
- Ask humans to rate the usefulness of the generated text for a specific task
- Human evaluation
- Ask humans to rate the quality of the generated text
- Automatic evaluation
- Measure the correspondence between the generated text and ground truth text
- BLEU, ROUGE, METEOR, $\ldots$
- Averaged word embedding, $\ldots$
- BERTScore, BLEURT, $\ldots$
- $\ldots$
BLEU
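A minimal sentence-level BLEU example with NLTK (smoothing is needed because short sentences often have zero higher-order n-gram matches):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["i", "am", "going", "home", "now"]
hypothesis = ["i", "am", "heading", "home", "now"]

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```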
BERTScore
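A minimal example with the `bert_score` package, which compares candidate and reference through contextual token embeddings and returns precision, recall, and F1:

```python
from bert_score import score

candidates = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```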
Conversation Model (3)
Motivation
- Responses in a conversation can be diverse
- Existing metrics (e.g., BLEU) cannot measure this diversity
- Existing metrics that consider the given conversation
- High scores to inappropriate responses
- Need human-labeled scores for responses to train the model
- Human evaluation is resource-consuming
- Requires money and evaluation time
- Low scalability
SSREM (Speaker Sensitive Response Evaluation Model)
SSREM - Train
- Same Conversation ($SC_A$): Speaker $A$’s utterances in a conversation
- Same Partner ($SP_A$): $A$’s utterances in conversations with the same partner
- Same Speaker ($SS_A$): $A$’s utterances
- Random ($Rand_A$): Random utterances from speakers who are not $A$
- Training is analogous to a Korean SAT English problem: pick the appropriate response among candidate utterances
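A sketch of collecting speaker $A$'s utterance sets defined above for one target conversation, assuming the corpus is a list of (conversation_id, speaker, partner, utterance) records; the record format and the size of the random sample are assumptions.

```python
import random

def build_utterance_sets(corpus, speaker_a, conversation_id, partner):
    """Collect SC_A, SP_A, SS_A, and Rand_A for one target conversation."""
    sc, sp, ss, others = [], [], [], []
    for conv_id, speaker, part, utterance in corpus:
        if speaker == speaker_a:
            ss.append(utterance)                     # Same Speaker: all of A's utterances
            if part == partner:
                sp.append(utterance)                 # Same Partner
            if conv_id == conversation_id:
                sc.append(utterance)                 # Same Conversation
        else:
            others.append(utterance)                 # pool for Random negatives
    rand = random.sample(others, k=min(len(others), len(sc)))
    return sc, sp, ss, rand
```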
Experiment 1
- Goal: Correlation with human scores
- Human scores
- Annotate the appropriateness of 1,200 responses
- Use Amazon MTurk
- Comparison metrics
- BLEU [Papineni et al., ACL 2002]
- ROUGE-L [Lin, TSBO 2004]
- EMB [Liu et al., EMNLP 2016]
- RUBER [Tao et al., AAAI 2018]
- SSREM ($R_{cand} = \{r_A, rand_A^{(1)}, rand_A^{(2)}, rand_A^{(3)}, rand_A^{(4)}\}$)
Experiment 1 - Result
- Correlation with human scores
Experiment 2
- Goal: Identifying true/false responses
- Responses
- True
- Ground truth (GT)
- False
- Same conversation (SC)
- Same Partner (SP)
- Same Speaker (SS)
- Random (Rand)
- Comparison metrics
- RUBER [Tao et al., AAAI 2018]
- SSREM ($R_{cand} = \{r_A, rand_A^{(1)}, rand_A^{(2)}, rand_A^{(3)}, rand_A^{(4)}\}$)
Experiment 2 - Result
Experiment 3
- Goal: applicability of SSREM
- Data
- Train: Twitter conversation corpus
- Test: Movie script
- Method
- Correlation with human scores
- Identifying true/false responses
Experiment 3 - Result
- Correlation with human scores
Challenges
- More robustness to adversarial attacks
- e.g., an attack that copies an utterance from the context
- How can we overcome various adversarial attacks?