[NLP] Conversation Model
Dialogue Systems
The task of generating responses to carry on a conversation with a human


- Turing test

Ability to understand and generate language as a marker of intelligence: “Can machines think?”
ELIZA
- Created 1964-1966 at MIT, heavily scripted
- DOCTOR script was most successful: repeats user’s input, asks silly questions

- Identify keyword, Identify context, apply transformation rule

- Requires very little generation of new content, but can only hold one kind of conversation
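
A minimal sketch of this keyword-plus-transformation-rule pipeline; the patterns and templates below are made up for illustration and are not the original DOCTOR script:

```python
import re

# Illustrative ELIZA-style rules: (keyword pattern, response template).
# The captured group is reflected back into the response.
RULES = [
    (re.compile(r"\bI need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.I),   "How long have you been {0}?"),
    (re.compile(r"\bmother\b", re.I),    "Tell me more about your family."),
]

# Simple pronoun reflection applied to the captured text.
REFLECT = {"i": "you", "my": "your", "me": "you", "am": "are"}

def reflect(text: str) -> str:
    return " ".join(REFLECT.get(w.lower(), w) for w in text.split())

def respond(utterance: str) -> str:
    # 1) identify a keyword, 2) apply its transformation rule,
    # 3) otherwise fall back to a content-free prompt.
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            groups = [reflect(g) for g in match.groups()]
            return template.format(*groups)
    return "Please go on."

print(respond("I need a break from my mother"))
# -> "Why do you need a break from your mother?"
```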
Cleverbot
- Carpenter (1986); online system launched in 2006
- “Nearest neighbors”: when a human says statement A, find the response that followed a similar statement A in logged human-human or human-computer chats, and repeat it
- Can often give sensible answers, but the bot doesn’t really impose high-level discourse structure
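
A toy sketch of that nearest-neighbour idea, assuming a small in-memory chat log and TF-IDF cosine similarity (the real system's representation and index are not public):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative log of (statement, response) pairs from past chats.
chat_log = [
    ("how are you today", "i am fine, thanks for asking"),
    ("what is your favourite movie", "i really like science fiction films"),
    ("do you like music", "yes, especially jazz"),
]

statements = [s for s, _ in chat_log]
vectorizer = TfidfVectorizer().fit(statements)
statement_vecs = vectorizer.transform(statements)

def retrieve_response(user_utterance: str) -> str:
    # Nearest neighbour over logged statements; return the reply that followed it.
    sims = cosine_similarity(vectorizer.transform([user_utterance]), statement_vecs)
    return chat_log[sims.argmax()][1]

print(retrieve_response("how are you doing today"))
# -> "i am fine, thanks for asking"
```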

Conversation Model (1)
Data-Driven Approaches
- Can treat as a machine translation problem
- “translate” from current utterance to next one
- Filter the data, use statistical measures to prune extracted phrases to get better performance
Seq2seq models
Just like conventional MT, can train seq2seq models for this task
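
A minimal PyTorch encoder-decoder sketch of this setup; the architecture, sizes, and random toy tensors are illustrative, not a specific published model:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal GRU encoder-decoder: encode the utterance U, decode the response R."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, utterance_ids, response_ids):
        _, state = self.encoder(self.embed(utterance_ids))          # summarize U into a state
        dec_out, _ = self.decoder(self.embed(response_ids), state)  # teacher-forced decoding of R
        return self.out(dec_out)                                    # logits over the vocabulary

# Training maximizes log P(R | U), exactly as in MT; toy shapes only.
model = Seq2Seq(vocab_size=8000)
utterance = torch.randint(0, 8000, (2, 10))   # batch of 2 utterances, 10 tokens each
response = torch.randint(0, 8000, (2, 12))    # batch of 2 responses, 12 tokens each
logits = model(utterance, response)
# In practice the targets are the response shifted by one token (next-token prediction).
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 8000), response.reshape(-1))
```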

Lack of Diversity
Training to maximize likelihood gives a system that prefers common, generic responses (e.g., “I don’t know”)

- Solution
- Mutual information criterion
- Response $R$ should be predictive of user utterance $U$ as well
- Standard conditional likelihood: $\log P(R \mid U)$
- Mutual information: $\log P(R \mid U)-\log P(R)$
- $\log P(R)$: probability of $R$ under a language model
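
One common way to apply this criterion is at reranking time: score each candidate response by $\log P(R \mid U) - \lambda \log P(R)$, so that generic responses a language model already likes are penalized. A sketch, assuming the two scoring functions are supplied by a trained seq2seq model and a language model:

```python
def mmi_rerank(candidates, utterance, cond_logprob, lm_logprob, lam=0.5):
    """Rerank candidate responses by log P(R|U) - lam * log P(R).

    cond_logprob(R, U): log-probability of R under the conditional (seq2seq) model.
    lm_logprob(R):      log-probability of R under an unconditional language model.
    Both are assumed to be supplied by the caller; `lam` weights the anti-generic penalty.
    """
    scored = [
        (cond_logprob(r, utterance) - lam * lm_logprob(r), r)
        for r in candidates
    ]
    return max(scored)[1]  # response with the highest mutual-information score
```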

PersonaChat

Wizard of Wikipedia

Conversation Model

Motivation
1. Inconsistent personality
- Existing conversation models tend to generate inconsistent personal responses even when the speaker is the same

2. Average personality of all speakers
- Existing conversation models tend to generate the same responses even when the speakers are different

Idea
- Use a stochastic variable for the context from speakers
- Learns the context of conversations between two speakers
- Infers the context of new conversation from the speakers
- Provide speaker info to response generator indirectly
- Learns each speaker’s preferences from their own utterances
- Infers the mixture of the speakers’ preferences and the context
VHUCM
- Variational Hierarchical User-based Conversation Model

VHUCM - Idea
- Use a stochastic variable for the context from speakers
- Provide speaker info to response generator indirectly

Conversation Corpus
- Requirements of corpus
- Naturally-occurring conversations
- Many conversations between two speakers
- Multiple conversation partners of a speaker
Twitter Conversation Corpus
- A Twitter conversation
- Five or more tweets
- At least two replies by each user
- Statistics
- 27K users
- 107K dyads
- 770K conversations
- 6M tweets
- 7 years

Experiment - Personalized Response
- Experiment Setup
- Set two users as questioner and answerer
- Ask demographic questions

VHUCM - Result

Challenges
- Experiment Setup
- Set two users as questioner and answerer
- Ask relationship questions

- Top five answers of “Do you love me?” by VHUCM


- Regardless of who asks, VHUCM always discloses personal information
Meena (Google, Jan 2020)

Model
- Evolved Transformer seq2seq model
- 2.6B parameters

Data
- Social media conversation
- 876M context-response pairs
- 8K BPE unique subwords
- 341GB text file
- 61B BPE tokens (400B tokens for GPT-3)
Train
- Device: 2048 TPU cores
- 16GB memory per core (only 8 examples fit per core)
- Data: 61B BPE tokens
- Time: 30 days
- Optimizer: Adafactor
- keep the initial learning rate for the first 10K steps
- Decay with the inverse square root of the number of steps
- Others
- 0.1 dropout
- Tensor2Tensor code base
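
A sketch of that learning-rate schedule (constant for the first 10K steps, then inverse-square-root decay); the base rate here is a placeholder:

```python
def learning_rate(step: int, base_lr: float = 1e-2, warm_steps: int = 10_000) -> float:
    """Keep the initial rate for the first 10K steps, then decay ~ 1/sqrt(step)."""
    if step <= warm_steps:
        return base_lr
    return base_lr * (warm_steps / step) ** 0.5

# e.g. learning_rate(10_000) == base_lr, learning_rate(40_000) == base_lr / 2
```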
BeamSearch

Sample-and-Rank
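
Sample-and-Rank decodes by drawing several independent candidate responses with temperature sampling and keeping the one the model scores most highly, rather than relying on beam search. A sketch, where `generate` is an assumed helper that returns one sampled response with its token log-probabilities, and the sample count, temperature, and length normalization are illustrative choices:

```python
def sample_and_rank(context, generate, n_samples: int = 20, temperature: float = 0.9):
    """Sample N candidate responses, then return the highest-scoring one.

    `generate(context, temperature)` is assumed to return
    (response_text, token_logprobs) for one sampled response.
    """
    best_score, best_response = float("-inf"), None
    for _ in range(n_samples):
        response, token_logprobs = generate(context, temperature)
        # Length-normalized log-likelihood so longer responses are not unfairly penalized.
        score = sum(token_logprobs) / max(len(token_logprobs), 1)
        if score > best_score:
            best_score, best_response = score, response
    return best_response
```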

LaMDA (Google, May 2021)

- Model: Transformers (similar to Meena)
- Data: Conversation corpus (Web documents for GPT-3)
- Features
- Specificity
- Factuality
- Interestingness (related to emotion)
- Sensibleness (related to emotion)
BlenderBot (Facebook, Apr 2020)

Model
- Generator: standard seq2seq Transformer model (as in BART)
- Retriever: retrieves candidate responses for a given dialogue
- Retrieve-and-refine: blends the two above (see the sketch below)
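
A rough sketch of blending retrieval into generation in a retrieve-and-refine style: retrieve a candidate reply, append it to the dialogue context, and let the seq2seq generator condition on both. The `retriever`/`generator` interfaces are assumptions for illustration, not the actual Fairseq implementation:

```python
def retrieve_and_refine(dialogue_history: str, retriever, generator, sep: str = " || ") -> str:
    """Blend retrieval and generation: condition the generator on a retrieved candidate."""
    candidate = retriever(dialogue_history)          # best reply from a retrieval model
    augmented = dialogue_history + sep + candidate   # append the candidate to the context
    return generator(augmented)                      # seq2seq model refines / rewrites it
```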

Data
- Pretraining
- Reddit discussion
- 1.5B comments
- 88.8B BPE tokens (61B for Meena, 400B for GPT-3)
- Fine-tuning
- ConvAI2 (140k utterances)
- Empathetic Dialogue (50k utterances)
- Wizard of Wikipedia (194k utterances)
Training
- Model size: 9.4B parameters (Meena: 2.6B)
- Platform: Fairseq toolkit
- Data: 88.8B BPE tokens
- Time: 200k SGD updates (with 2400 warmup steps)
- Optimizer: Adam
BlenderBot 2.0 (Facebook, July 2021)

Model
- Memorizes the context of a multi-turn conversation (long-term memory)
- Augments responses with external knowledge retrieved from the internet

Data
- Long-term Memory
- Multi-turn conversation with summary
- 300K utterances
- Internet-Augmented
- Wizard-Apprentice relationship
- 93K utterances

Conversation Model (2)

Evaluation Metrics in Machine Learning
- Classification - Class
- Accuracy
- Precision, Recall, F1
- Area Under Curve
- $\ldots$
- Regression - Number
- Mean Squared Error
- $R^2$
- Explained Variance
- $\ldots$

- Clustering - Cluster
- Purity
- Davies-Bouldin Index
- Jaccard Index
- $\ldots$
- Reinforcement Learning - Policy
- Total rewards
- Dispersion of Fixed Policy
- Conditional Value at Risk
- $\ldots$

Natural Language Generation
- Generate natural language text
- Machine translation
- Automatic summarization
- Conversation model
- Image captioning
- $\ldots$

Evaluation Metrics in Natural Language Generation
- Task-based evaluation
- Ask human to rate the usefulness of the generated text for a specific task
- Human evaluation
- Ask human to rate the quality of the generated text
- Automatic evaluation
- Measure the correspondence between the generated text and ground truth text
- BLEU, ROUGE, METEOR, $\ldots$
- Averaged word embedding, $\ldots$
- BERTScore, BLEURT, $\ldots$
- $\ldots$
BLEU

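For reference, sentence-level BLEU can be computed with NLTK; the tokenization and smoothing choice below are just one reasonable setup for short dialogue responses:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["i", "am", "doing", "fine", "thanks"]   # ground-truth response (tokenized)
candidate = ["i", "am", "fine", "thanks"]            # generated response

# BLEU: geometric mean of modified n-gram precisions times a brevity penalty.
# Smoothing avoids zero scores when a higher-order n-gram never matches.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```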
BERTScore

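BERTScore compares generated and reference text via contextual token embeddings rather than exact n-gram overlap; a minimal usage sketch with the `bert-score` package, leaving the underlying model at its defaults:

```python
from bert_score import score

candidates = ["i am fine thanks"]
references = ["i am doing fine, thanks"]

# Greedy soft alignment of contextual token embeddings gives precision/recall/F1.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```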
Conversation Model (3)

Motivation
- Responses in a conversation can vary widely

- Existing metrics (e.g., BLEU) cannot measure this diversity

- Existing metrics that do consider the given conversation
- Give high scores to inappropriate responses
- Need human-labeled scores for responses to train the model

- Human evaluation is resource-consuming
- Requires money and evaluation time
- Low scalability

SSREM (Speaker Sensitive Response Evaluation Model)

SSREM - Train

- Same Conversation ($SC_A$): Speaker $A$’s utterances in a conversation
- Same Partner ($SP_A$): $A$’s utterances in conversations with the same partner
- Same Speaker ($SS_A$): $A$’s utterances
- Random ($Rand_A$): Random utterances from speakers who are not $A$

- Korean SAT English subject problem
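
A small sketch of how the four utterance groups above could be collected from a conversation corpus; the record format (speaker, partner, conversation id, utterance) is an assumption for illustration, not the paper's actual preprocessing:

```python
import random

def build_groups(corpus, speaker_a, conv_id, partner, n_random=4):
    """Collect the SC_A, SP_A, SS_A and Rand_A utterance sets for speaker A.

    `corpus` is assumed to be a list of records:
    (speaker, partner, conversation_id, utterance).
    """
    sc = [u for s, p, c, u in corpus if s == speaker_a and c == conv_id]      # same conversation
    sp = [u for s, p, c, u in corpus if s == speaker_a and p == partner]      # same partner
    ss = [u for s, p, c, u in corpus if s == speaker_a]                       # same speaker
    others = [u for s, p, c, u in corpus if s != speaker_a]
    rand = random.sample(others, min(n_random, len(others)))                  # random speakers
    return {"SC": sc, "SP": sp, "SS": ss, "Rand": rand}
```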

Experiment 1
- Goal: Correlation with human scores
- Human scores
- Annotate the appropriateness of 1,200 responses
- Use Amazon MTurk
- Comparison metrics
- BLEU [Papineni et al., ACL 2002]
- ROUGE-L [Lin, TSBO 2004]
- EMB [Liu et al., EMNLP 2016]
- RUBER [Tao et al., AAAI 2018]
- RSREM ($R_{cand}$={$r_A$,$rand_A^{(1)}$,$rand_A^{(2)}$,$rand_A^{(3)}$,$rand_A^{(4)}$})
Experiment 1 - Result
- Correlation with human scores

Experiment 2
- Goal: Identifying true/false responses
- Responses
- True
- Ground truth (GT)
- False
- Same conversation (SC)
- Same Partner (SP)
- Same Speaker (SS)
- Random (Rand)
- Comparison metrics
- RUBER [Tao et al., AAAI 2018]
- RSREM ($R_{cand}$={$r_A$,$rand_A^{(1)}$,$rand_A^{(2)}$,$rand_A^{(3)}$,$rand_A^{(4)}$})
Experiment 2 - Result

Experiment 3
- Goal: applicability of SSREM to a new domain
- Data
- Train: Twitter conversation corpus
- Test: Movie script
- Method
- Correlation with human scores
- Identifying true/false responses

Experiment 3 - Result
- Correlation with human scores

Challenges
- More robustness against adversarial attacks
- How can we overcome various adversarial attacks?
- e.g., copying an utterance from the context
Perplexity

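Perplexity is the exponential of the average per-token negative log-likelihood that a model assigns to the ground-truth text; a minimal computation sketch given token log-probabilities from some model:

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# e.g. a model assigning probability 0.25 to every token has perplexity 4
print(perplexity([math.log(0.25)] * 10))  # 4.0
```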
ACUTE-Eval
