[NLP] Conversation Model
Dialogue Systems
The task of generating responses to carry on a conversation with a human


- Turing test

Ability to understand and generate language as a marker of intelligence: “Can machines think?”
ELIZA
- Created 1964-1966 at MIT, heavily scripted
- DOCTOR script was most successful: repeats user’s input, asks silly questions

- Identify keyword, Identify context, apply transformation rule

- Requires very little generation of new content, but can only hold one kind of conversation
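
A minimal sketch of this keyword-plus-transformation-rule pipeline; the patterns and templates below are made up for illustration and are not the original DOCTOR script:

```python
import re

# Illustrative ELIZA-style rules: (keyword pattern, response template).
# The captured group is reflected back into the response.
RULES = [
    (re.compile(r"\bI need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.I),   "How long have you been {0}?"),
    (re.compile(r"\bmother\b", re.I),    "Tell me more about your family."),
]

# Simple pronoun reflection applied to the captured text.
REFLECT = {"i": "you", "my": "your", "me": "you", "am": "are"}

def reflect(text: str) -> str:
    return " ".join(REFLECT.get(w.lower(), w) for w in text.split())

def respond(utterance: str) -> str:
    # 1) identify a keyword, 2) apply its transformation rule,
    # 3) otherwise fall back to a content-free prompt.
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            groups = [reflect(g) for g in match.groups()]
            return template.format(*groups)
    return "Please go on."

print(respond("I need a break from my mother"))
# -> "Why do you need a break from your mother?"
```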
Cleverbot
- Carpenter (1986); online system launched in 2006
- “Nearest neighbors”: when a human says statement A, find the response that followed a similar statement A in logged human-human or human-computer chats, and repeat it
- Can often give sensible answers, but the bot doesn’t really impose high-level discourse structure
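
A toy sketch of that nearest-neighbour idea, assuming a small in-memory chat log and TF-IDF cosine similarity (the real system's representation and index are not public):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative log of (statement, response) pairs from past chats.
chat_log = [
    ("how are you today", "i am fine, thanks for asking"),
    ("what is your favourite movie", "i really like science fiction films"),
    ("do you like music", "yes, especially jazz"),
]

statements = [s for s, _ in chat_log]
vectorizer = TfidfVectorizer().fit(statements)
statement_vecs = vectorizer.transform(statements)

def retrieve_response(user_utterance: str) -> str:
    # Nearest neighbour over logged statements; return the reply that followed it.
    sims = cosine_similarity(vectorizer.transform([user_utterance]), statement_vecs)
    return chat_log[sims.argmax()][1]

print(retrieve_response("how are you doing today"))
# -> "i am fine, thanks for asking"
```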

Conversation Model (1)
Data-Driven Approaches
- Can treat as a machine translation problem
- “translate” from current utterance to next one
- Filter the data, use statistical measures to prune extracted phrases to get better performance
Seq2seq models
Just like conventional MT, can train seq2seq models for this task
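
A minimal PyTorch encoder-decoder sketch of this setup; the architecture, sizes, and random toy tensors are illustrative, not a specific published model:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal GRU encoder-decoder: encode the utterance U, decode the response R."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, utterance_ids, response_ids):
        _, state = self.encoder(self.embed(utterance_ids))          # summarize U into a state
        dec_out, _ = self.decoder(self.embed(response_ids), state)  # teacher-forced decoding of R
        return self.out(dec_out)                                    # logits over the vocabulary

# Training maximizes log P(R | U), exactly as in MT; toy shapes only.
model = Seq2Seq(vocab_size=8000)
utterance = torch.randint(0, 8000, (2, 10))   # batch of 2 utterances, 10 tokens each
response = torch.randint(0, 8000, (2, 12))    # batch of 2 responses, 12 tokens each
logits = model(utterance, response)
# In practice the targets are the response shifted by one token (next-token prediction).
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 8000), response.reshape(-1))
```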

Lack of Diversity
Training to maximize likelihood gives a system that prefers common, generic responses (e.g., “I don’t know”)

- Solution
- Mutual information criterion
- Response $R$ should be predictive of user utterance $U$ as well
- Standard conditional likelihood: $\log P(R \mid U)$
- Mutual information: $\log P(R \mid U)-\log P(R)$
- $\log P(R)$: probability of $R$ under a language model
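
One common way to apply this criterion is at reranking time: score each candidate response by $\log P(R \mid U) - \lambda \log P(R)$, so that generic responses a language model already likes are penalized. A sketch, assuming the two scoring functions are supplied by a trained seq2seq model and a language model:

```python
def mmi_rerank(candidates, utterance, cond_logprob, lm_logprob, lam=0.5):
    """Rerank candidate responses by log P(R|U) - lam * log P(R).

    cond_logprob(R, U): log-probability of R under the conditional (seq2seq) model.
    lm_logprob(R):      log-probability of R under an unconditional language model.
    Both are assumed to be supplied by the caller; `lam` weights the anti-generic penalty.
    """
    scored = [
        (cond_logprob(r, utterance) - lam * lm_logprob(r), r)
        for r in candidates
    ]
    return max(scored)[1]  # response with the highest mutual-information score
```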

PersonaChat

Wizard of Wikipedia

Conversation Model

Motivation
1. Inconsistent personality
- Existing conversation models tend to generate inconsistent personal responses even when the speaker is the same

2. Average personality of all speakers
- Existing conversation models tend to generate the same responses even when the speakers are different

Idea
- Use a stochastic variable for the context from speakers
- Learns the context of conversations between two speakers
- Infers the context of new conversation from the speakers
- Provide speaker info to response generator indirectly
- Learns each speaker’s preferences from their own utterances
- Infers the mixture of the speakers’ preferences and the context
VHUCM
- Variational Hierarchical User-based Conversation Model

VHUCM - Idea
- Use a stochastic variable for the context from speakers
- Provide speaker info to response generator indirectly

Conversation Corpus
- Requirements of corpus
- Naturally-occurring conversations
- Many conversations between two speakers
- Multiple conversation partners of a speaker
Twitter Conversation Corpus
- A Twitter conversation
- Five or more tweets
- At least two replies by each user
- Statistics
- 27K users
- 107K dyads
- 770K conversations
- 6M tweets
- 7 years

Experiment - Personalized Response
- Experiment Setup
- Set two users as questioner and answerer
- Ask demographic questions

VHUCM - Result

Challenges
- Experiment Setup
- Set two users as questioner and answerer
- Ask relationship questions

- Top five answers of “Do you love me?” by VHUCM


- Regardless of who asks, VHUCM always discloses personal information
Meena (Google, Jan 2020)

Model
- Evolved Transformer seq2seq model
- 2.6B parameters

Data
- Social media conversation
- 876M context-response pairs
- 8K BPE unique subwords
- 341GB text file
- 61B BPE tokens (400B tokens for GPT-3)
Train
- Device: 2048 TPU cores
- 16GB memory per core (only 8 examples fit per core)
- Data: 61B BPE tokens
- Time: 30 days
- Optimizer: Adafactor
- keep the initial learning rate for the first 10K steps
- Decay with the inverse square root of the number of steps
- Others
- 0.1 dropout
- Tensor2Tensor code base
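
A sketch of that learning-rate schedule (constant for the first 10K steps, then inverse-square-root decay); the base rate here is a placeholder:

```python
def learning_rate(step: int, base_lr: float = 1e-2, warm_steps: int = 10_000) -> float:
    """Keep the initial rate for the first 10K steps, then decay ~ 1/sqrt(step)."""
    if step <= warm_steps:
        return base_lr
    return base_lr * (warm_steps / step) ** 0.5

# e.g. learning_rate(10_000) == base_lr, learning_rate(40_000) == base_lr / 2
```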
BeamSearch

Sample-and-Rank
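
Sample-and-Rank decodes by drawing several independent candidate responses with temperature sampling and keeping the one the model scores most highly, rather than relying on beam search. A sketch, where `generate` is an assumed helper that returns one sampled response with its token log-probabilities, and the sample count, temperature, and length normalization are illustrative choices:

```python
def sample_and_rank(context, generate, n_samples: int = 20, temperature: float = 0.9):
    """Sample N candidate responses, then return the highest-scoring one.

    `generate(context, temperature)` is assumed to return
    (response_text, token_logprobs) for one sampled response.
    """
    best_score, best_response = float("-inf"), None
    for _ in range(n_samples):
        response, token_logprobs = generate(context, temperature)
        # Length-normalized log-likelihood so longer responses are not unfairly penalized.
        score = sum(token_logprobs) / max(len(token_logprobs), 1)
        if score > best_score:
            best_score, best_response = score, response
    return best_response
```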

LaMDA (Google, May 2021)

- Model: Transformers (similar to Meena)
- Data: Conversation corpus (Web documents for GPT-3)
- Features
- Specificity
- Factuality
- Interestingness (related to emotion)
- Sensibleness (related to emotion)
BlenderBot (Facebook, Apr 2020)

Model
- Generator: standard seq2seq Transformer model (as in BART)
- Retriever: retrieves candidate responses for a given dialogue
- Retrieve-and-refine: blends the two above (see the sketch below)
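
A rough sketch of blending retrieval into generation in a retrieve-and-refine style: retrieve a candidate reply, append it to the dialogue context, and let the seq2seq generator condition on both. The `retriever`/`generator` interfaces are assumptions for illustration, not the actual Fairseq implementation:

```python
def retrieve_and_refine(dialogue_history: str, retriever, generator, sep: str = " || ") -> str:
    """Blend retrieval and generation: condition the generator on a retrieved candidate."""
    candidate = retriever(dialogue_history)          # best reply from a retrieval model
    augmented = dialogue_history + sep + candidate   # append the candidate to the context
    return generator(augmented)                      # seq2seq model refines / rewrites it
```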

Data
- Pretraining
- Reddit discussion
- 1.5B comments
- 88.8B BPE tokens (61B for Meena, 400B for GPT-3)
- Fine-tuning
- ConvAI2 (140k utterances)
- Empathetic Dialogue (50k utterances)
- Wizard of Wikipedia (194k utterances)
Training
- Model size: 9.4B parameters (Meena: 2.6B)
- Platform: Fairseq toolkit
- Data: 88.8B BPE tokens
- Time: 200k SGD updates (with 2400 warmup steps)
- Optimizer: Adam
BlenderBot 2.0 (Facebook, July 2021)

Model
- Memorizes the context of a multi-turn conversation (long-term memory)
- Augments responses with external knowledge retrieved from the internet

Data
- Long-term Memory
- Multi-turn conversation with summary
- 300K utterances
- Internet-Augmented
- Wizard-Apprentice relationship
- 93K utterances

Conversation Model (2)

Evaluation Metrics in Machine Learning
- Classification - Class
- Accuracy
- Precision, Recall, F1
- Area Under Curve
- $\ldots$
- Regression - Number
- Mean Squared Error
- $R^2$
- Explained Variance
- $\ldots$

- Clustering - Cluster
- Purity
- Davies-Bouldin Index
- Jaccard Index
- $\ldots$
- Reinforcement Learning - Policy
- Total rewards
- Dispersion of Fixed Policy
- Conditional Value at Risk
- $\ldots$

Natural Language Generation
- Generate natural language text
- Machine translation
- Automatic summarization
- Conversation model
- Image captioning
- $\ldots$

Evaluation Metrics in Natural Language Generation
- Task-based evaluation
- Ask human to rate the usefulness of the generated text for a specific task
- Human evaluation
- Ask human to rate the quality of the generated text
- Automatic evaluation
- Measure the correspondence between the generated text and ground truth text
- BLEU, ROUGE, METEOR, $\ldots$
- Averaged word embedding, $\ldots$
- BERTScore, BLEURT, $\ldots$
- $\ldots$
BLEU

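For reference, sentence-level BLEU can be computed with NLTK; the tokenization and smoothing choice below are just one reasonable setup for short dialogue responses:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["i", "am", "doing", "fine", "thanks"]   # ground-truth response (tokenized)
candidate = ["i", "am", "fine", "thanks"]            # generated response

# BLEU: geometric mean of modified n-gram precisions times a brevity penalty.
# Smoothing avoids zero scores when a higher-order n-gram never matches.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```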
BERTScore

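BERTScore compares generated and reference text via contextual token embeddings rather than exact n-gram overlap; a minimal usage sketch with the `bert-score` package, leaving the underlying model at its defaults:

```python
from bert_score import score

candidates = ["i am fine thanks"]
references = ["i am doing fine, thanks"]

# Greedy soft alignment of contextual token embeddings gives precision/recall/F1.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```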
Conversation Model (3)

Motivation
- Responses in a conversation can vary widely

- Existing metrics (e.g., BLEU) cannot measure this diversity

- Existing metrics that do consider the given conversation
- Give high scores to inappropriate responses
- Need human-labeled scores for responses to train the model

- Human evaluation is resource-consuming
- Requires money and evaluation time
- Low scalability

SSREM (Speaker Sensitive Response Evaluation Model)

SSREM - Train

- Same Conversation ($SC_A$): Speaker $A$’s utterances in a conversation
- Same Partner ($SP_A$): $A$’s utterances in conversations with the same partner
- Same Speaker ($SS_A$): $A$’s utterances
- Random ($Rand_A$): Random utterances from speakers who are not $A$

- Korean SAT English subject problem
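
A small sketch of how the four utterance groups above could be collected from a conversation corpus; the record format (speaker, partner, conversation id, utterance) is an assumption for illustration, not the paper's actual preprocessing:

```python
import random

def build_groups(corpus, speaker_a, conv_id, partner, n_random=4):
    """Collect the SC_A, SP_A, SS_A and Rand_A utterance sets for speaker A.

    `corpus` is assumed to be a list of records:
    (speaker, partner, conversation_id, utterance).
    """
    sc = [u for s, p, c, u in corpus if s == speaker_a and c == conv_id]      # same conversation
    sp = [u for s, p, c, u in corpus if s == speaker_a and p == partner]      # same partner
    ss = [u for s, p, c, u in corpus if s == speaker_a]                       # same speaker
    others = [u for s, p, c, u in corpus if s != speaker_a]
    rand = random.sample(others, min(n_random, len(others)))                  # random speakers
    return {"SC": sc, "SP": sp, "SS": ss, "Rand": rand}
```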

Experiment 1
- Goal: Correlation with human scores
- Human scores
- Annotate the appropriateness of 1,200 responses
- Use Amazon MTurk
- Comparison metrics
- BLEU [Papineni et al., ACL 2002]
- ROUGE-L [Lin, TSBO 2004]
- EMB [Liu et al., EMNLP 2016]
- RUBER [Tao et al., AAAI 2018]
- RSREM ($R_{cand}$={$r_A$,$rand_A^{(1)}$,$rand_A^{(2)}$,$rand_A^{(3)}$,$rand_A^{(4)}$})
Experiment 1 - Result
- Correlation with human scores

Experiment 2
- Goal: Identifying true/false responses
- Responses
- True
- Ground truth (GT)
- False
- Same conversation (SC)
- Same Partner (SP)
- Same Speaker (SS)
- Random (Rand)
- Comparison metrics
- RUBER [Tao et al., AAAI 2018]
- RSREM ($R_{cand}$={$r_A$,$rand_A^{(1)}$,$rand_A^{(2)}$,$rand_A^{(3)}$,$rand_A^{(4)}$})
Experiment 2 - Result

Experiment 3
- Goal: applicability of SSREM to a new domain
- Data
- Train: Twitter conversation corpus
- Test: Movie script
- Method
- Correlation with human scores
- Identifying true/false responses

Experiment 3 - Result
- Correlation with human scores

Challenges
- More robustness against adversarial attacks
- How can we overcome various adversarial attacks?
- e.g., copying an utterance from the context
Perplexity

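Perplexity is the exponential of the average per-token negative log-likelihood that a model assigns to the ground-truth text; a minimal computation sketch given token log-probabilities from some model:

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# e.g. a model assigning probability 0.25 to every token has perplexity 4
print(perplexity([math.log(0.25)] * 10))  # 4.0
```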
ACUTE-Eval
