
Conversation Model

Dialogue Systems

The task of generating responses to carry on a conversation with a human

image

image

  • Turing test

image

The ability to understand and generate language as a mark of intelligence: “Can machines think?”

ELIZA

  • Created 1964-1966 at MIT, heavily scripted
  • The DOCTOR script was the most successful: it reflects the user’s input back and asks simple follow-up questions image

  • Identify a keyword, identify the context, and apply a transformation rule

image

  • Very little need to generate new content, but it can only hold one type of conversation
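The keyword-and-transformation pipeline above can be sketched in a few lines; the patterns and templates below are illustrative stand-ins, not Weizenbaum's actual DOCTOR script (which also swaps pronouns such as "my" → "your"):

```python
import re

# Minimal ELIZA-style sketch: match a keyword pattern, then apply a
# transformation rule to the captured text. Rules here are hypothetical.
RULES = [
    (re.compile(r"\bI need (.*)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bI am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bmy (\w+)", re.I), "Tell me more about your {0}."),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        m = pattern.search(utterance)
        if m:
            return template.format(*m.groups())
    return "Please go on."  # default when no keyword matches

print(respond("I need a break"))  # → "Why do you need a break?"
```

The rule order matters: earlier patterns act as higher-priority keywords, mirroring ELIZA's keyword ranking.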

Cleverbot

  • Carpenter (1986); online system launched in 2006
  • “Nearest neighbors”: when a human says statement A, find a human response to statement A in logged human-human or human-computer chats, and repeat it
  • Can often give sensible answers, but the bot does not really impose high-level discourse structure
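A minimal sketch of this nearest-neighbor retrieval idea, using a hypothetical toy chat log and plain string similarity in place of Cleverbot's actual matching:

```python
import difflib

# Hypothetical toy chat log of (statement, observed human reply) pairs.
CHAT_LOG = [
    ("hello there", "hi! how are you?"),
    ("what is your name", "people call me Bot."),
    ("do you like music", "yes, especially jazz."),
]

def nearest_neighbor_reply(utterance: str) -> str:
    """Return the logged reply whose statement is most similar to the input."""
    best = max(
        CHAT_LOG,
        key=lambda pair: difflib.SequenceMatcher(
            None, utterance.lower(), pair[0]
        ).ratio(),
    )
    return best[1]

print(nearest_neighbor_reply("hello there!"))  # → "hi! how are you?"
```

Each reply is locally sensible, but nothing in the lookup enforces consistency across turns, which is exactly the missing high-level discourse structure noted above.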

image

Conversation Model (1)

Data-Driven Approaches

  • Can treat as a machine translation problem
    • “translate” from current utterance to next one
  • Filter the data, use statistical measures to prune extracted phrases to get better performance

Seq2seq models

Just like conventional MT, can train seq2seq models for this task image

Lack of Diversity

Training to maximize likelihood gives a system that prefers common responses image


  • Solution
    • Mutual information criterion
    • Response $R$ should be predictive of user utterance $U$ as well
  • Standard conditional likelihood: $\log P(R \mid U)$
  • Mutual information: $\log P(R \mid U)-\log P(R)$
  • $\log P(R)$: probability of $R$ under a language model
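A toy illustration of how the MMI criterion reranks candidates; the log-probabilities below are made up, not from a real model. A generic response has high $\log P(R \mid U)$ but also high $\log P(R)$, so subtracting the language-model term demotes it:

```python
# Hypothetical log-probabilities for two candidate responses.
candidates = {
    "i don't know":        {"log_p_r_given_u": -2.0, "log_p_r": -1.5},
    "i saw it on tuesday": {"log_p_r_given_u": -2.5, "log_p_r": -6.0},
}

def mmi_score(s):
    # Mutual information criterion: log P(R|U) - log P(R)
    return s["log_p_r_given_u"] - s["log_p_r"]

best = max(candidates, key=lambda r: mmi_score(candidates[r]))
print(best)  # → "i saw it on tuesday"
```

Under plain conditional likelihood, "i don't know" would win (-2.0 > -2.5); under MMI the specific response wins (3.5 > -0.5).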

image

PersonaChat

image

Wizard of Wikipedia

image

Conversation Model

image

Motivation

1. Inconsistent personality

  • Existing conversation models tend to generate inconsistent personal responses even when the speaker is the same image

2. Average personality of all speakers

  • Existing conversation models tend to generate the same responses even when the speakers are different image

Idea

  • Use a stochastic variable for the context from speakers
    • Learns the context of conversations between two speakers
    • Infers the context of new conversation from the speakers
  • Provide speaker info to response generator indirectly
    • Learns speakers’ preference from own utterances
    • Infers the mixture of speakers’ preference and the context

VHUCM

  • Variational Hierarchical User-based Conversation Model image

VHUCM - Idea

  • Use a stochastic variable for the context from speakers
  • Provide speaker info to response generator indirectly image

Conversation Corpus

  • Requirements of corpus
    • Naturally-occurring conversations
    • Many conversations between two speakers
    • Multiple conversation partners of a speaker

Twitter Conversation Corpus

  • A Twitter conversation
    • Five or more tweets
    • At least two replies by each user
  • Statistics
    • 27K users
    • 107K dyads
    • 770K conversations
    • 6M tweets
    • 7 years

image

Experiment - Personalized Response

  • Experiment Setup
    • Set two users as questioner and answerer
    • Ask demographic questions

image

VHUCM - Result

image

Challenges

  • Experiment Setup
    • Set two users as questioner and answerer
    • Ask relationship questions image
  • Top five answers of “Do you love me?” by VHUCM image

image

  • No matter who asks, VHUCM always discloses personal information

Meena (Google, Jan 2020)

image

Model

  • Evolved Transformer seq2seq model
  • 2.6B parameters image

Data

  • Social media conversation
  • 876M context-response pairs
  • 8K BPE unique subwords
  • 341GB text file
  • 61B BPE tokens (400B tokens for GPT-3)

Train

  • Device: 2048 TPU cores
    • 16GB memory per core (only 8 examples fit at a time)
  • Data: 61B BPE tokens
  • Time: 30 days
  • Optimizer: Adafactor
    • keep the initial learning rate for the first 10K steps
    • Decay with the inverse square root of the number of steps
  • Others
    • 0.1 dropout
    • Tensor2Tensor code base
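The learning-rate schedule described above (hold the initial rate for the first 10K steps, then decay with the inverse square root of the step count) can be sketched as follows; the base learning rate is a hypothetical value, as the slide does not state it:

```python
import math

def meena_lr(step: int, base_lr: float = 1e-3, warmup: int = 10_000) -> float:
    """Hold base_lr for the first `warmup` steps, then decay with the
    inverse square root of the step count (base_lr is hypothetical)."""
    if step <= warmup:
        return base_lr
    return base_lr * math.sqrt(warmup / step)

print(meena_lr(5_000))   # → 0.001
print(meena_lr(40_000))  # → 0.0005
```

The `sqrt(warmup / step)` form makes the schedule continuous at the boundary: at `step == warmup` the factor is exactly 1.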

BeamSearch

image
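Beam search keeps only the top-k partial hypotheses at each step; a minimal sketch over a toy (prefix-independent) next-token distribution standing in for a real decoder:

```python
import math

def next_token_logprobs(prefix):
    # Toy stand-in for a decoder: a fixed distribution regardless of prefix.
    vocab = {"a": 0.6, "b": 0.3, "</s>": 0.1}
    return {tok: math.log(p) for tok, p in vocab.items()}

def beam_search(beam_size=2, max_len=3):
    beams = [([], 0.0)]  # (tokens, cumulative log-prob)
    for _ in range(max_len):
        expanded = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "</s>":
                expanded.append((tokens, score))  # finished hypothesis
                continue
            for tok, lp in next_token_logprobs(tokens).items():
                expanded.append((tokens + [tok], score + lp))
        # Prune: keep only the beam_size highest-scoring hypotheses.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

print(beam_search())  # → ['a', 'a', 'a']
```

Because it always commits to the highest-likelihood continuations, beam search tends toward safe, generic responses, which motivates the alternative decoding scheme below.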

Sample-and-Rank

image
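Meena's sample-and-rank decoding draws N independent samples with temperature T (the paper uses N = 20, T = 0.88) and keeps the one with the highest total log-likelihood; a sketch with a hypothetical toy unigram model in place of the real decoder:

```python
import math
import random

# Hypothetical toy unigram "model".
VOCAB = {"yes": 0.5, "maybe": 0.3, "no": 0.2}

def sample_response(temperature=0.88, length=3, rng=random):
    # Temperature-adjusted sampling: sharpen/flatten the distribution.
    probs = {t: p ** (1 / temperature) for t, p in VOCAB.items()}
    z = sum(probs.values())
    return rng.choices(list(probs), weights=[p / z for p in probs.values()], k=length)

def loglik(tokens):
    return sum(math.log(VOCAB[t]) for t in tokens)

def sample_and_rank(n=20, seed=0):
    rng = random.Random(seed)
    samples = [sample_response(rng=rng) for _ in range(n)]
    return max(samples, key=loglik)  # rank by total log-likelihood

print(sample_and_rank())
```

Sampling supplies diversity; the final ranking by likelihood filters out the low-quality samples.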

LaMDA (Google, May 2021)

image

  • Model: Transformers (similar to Meena)
  • Data: Conversation corpus (Web documents for GPT-3)
  • Features
    • Specificity
    • Factuality
    • Interestingness (related to emotion)
    • Sensibleness (related to emotion)

BlenderBot (Facebook, Apr 2020)

image

Model

  • Generator: standard seq2seq Transformer model (BART)
  • Retriever: retrieves candidate responses for a given dialogue
  • Blender: blends the two together image image

Data

  • Pretraining
    • Reddit discussion
    • 1.5B comments
    • 88.8B BPE tokens (61B for Meena, 400B for GPT-3)
  • Fine-tuning
    • ConvAI2 (140k utterances)
    • Empathetic Dialogue (50k utterances)
    • Wizard of Wikipedia (194k utterances)

Training

  • Model size: 9.4B parameters (Meena: 2.6B)
  • Platform: Fairseq toolkit
  • Data: 88.8B BPE tokens
  • Time: 200k SGD updates (with 2400 warmup steps)
  • Optimizer: Adam

BlenderBot 2.0 (Facebook, July 2021)

image

Model

  • Memorize context of multi-turn conversation
  • Augment external knowledge from internet image

Data

  • Long-term Memory
    • Multi-turn conversation with summary
    • 300K utterances
  • Internet-Augmented
    • Wizard-Apprentice relationship
    • 93K utterances

image

Conversation Model (2)

image

Evaluation Metrics in Machine Learning

  • Classification - Class
    • Accuracy
    • Precision, Recall, F1
    • Area Under Curve
    • $\ldots$
  • Regression - Number
    • Mean Squared Error
    • $R^2$
    • Explained Variance
    • $\ldots$ image
  • Clustering - Cluster
    • Purity
    • Davies-Bouldin Index
    • Jaccard Index
    • $\ldots$
  • Reinforcement Learning - Policy
    • Total rewards
    • Dispersion of Fixed Policy
    • Conditional Value at Risk
    • $\ldots$ image image

Natural Language Generation

  • Generate natural language text
    • Machine translation
    • Automatic summarization
    • Conversation model
    • Image captioning
    • $\ldots$ image

Evaluation Metrics in Natural Language Generation

  • Task-based evaluation
    • Ask human to rate the usefulness of the generated text for a specific task
  • Human evaluation
    • Ask human to rate the quality of the generated text
  • Automatic evaluation
    • Measure the correspondence between the generated text and ground truth text
      • BLEU, ROUGE, METEOR, $\ldots$
      • Averaged word embedding, $\ldots$
      • BERTScore, BLEURT, $\ldots$
      • $\ldots$

BLEU

image
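A sentence-level sketch of BLEU (clipped n-gram precision combined with a brevity penalty); real BLEU is usually computed at corpus level and with smoothing options:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precisions for n=1..4,
    geometric mean, times a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

ref = "the cat is on the mat".split()
print(round(bleu(ref, ref), 2))  # → 1.0
```

Note that BLEU only rewards exact n-gram overlap with the reference, which is exactly why it struggles with the many valid, lexically different responses in conversation.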

BERTScore

image
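A sketch of BERTScore's greedy token matching, with hypothetical 2-d vectors standing in for contextual BERT embeddings; the real metric adds optional IDF weighting and baseline rescaling:

```python
import math

def cos(u, v):
    # Cosine similarity between two 2-d vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def bert_score_f1(cand_emb, ref_emb):
    """Greedy matching: each reference token takes its best-matching
    candidate token (recall), and vice versa (precision); combine as F1."""
    recall = sum(max(cos(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    precision = sum(max(cos(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    return 2 * precision * recall / (precision + recall)

cand = [(1.0, 0.0), (0.6, 0.8)]
ref = [(1.0, 0.0), (0.6, 0.8)]
print(round(bert_score_f1(cand, ref), 2))  # → 1.0
```

Because matching happens in embedding space rather than on surface n-grams, a paraphrase can score highly even with no word overlap.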

Conversation Model (3)

image

Motivation

  1. Responses of a conversation can be various image

  2. Existing metrics (e.g., BLEU) cannot measure this diversity image

  3. Existing metrics that consider the given conversation
    • Give high scores to inappropriate responses
    • Need human-labeled scores for responses to train the model image
  4. Human evaluation is resource-consuming
    • Requires money and evaluation time
    • Low scalability image

SSREM (Speaker Sensitive Response Evaluation Model)

image

SSREM - Train

image image image

  • Same Conversation ($SC_A$): Speaker $A$’s utterances in a conversation
  • Same Partner ($SP_A$): $A$’s utterances in conversations with the same partner
  • Same Speaker ($SS_A$): $A$’s utterances
  • Random ($Rand_A$): Random utterances from speakers who are not $A$
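Building the four utterance groups above can be sketched as follows, over a hypothetical toy corpus of (conversation, speaker, partner, utterance) records:

```python
# Hypothetical toy corpus: (conversation_id, speaker, partner, utterance).
CORPUS = [
    ("c1", "A", "B", "hi bob"),
    ("c1", "A", "B", "see you"),
    ("c2", "A", "C", "hey carol"),
    ("c3", "D", "E", "hello ed"),
]

def groups_for(speaker, conv_id, partner):
    """Collect the SC/SP/SS/Rand utterance groups for one speaker."""
    sc = [u for c, s, p, u in CORPUS if s == speaker and c == conv_id]
    sp = [u for c, s, p, u in CORPUS if s == speaker and p == partner]
    ss = [u for c, s, p, u in CORPUS if s == speaker]
    rand = [u for c, s, p, u in CORPUS if s != speaker]
    return {"SC": sc, "SP": sp, "SS": ss, "Rand": rand}

g = groups_for("A", "c1", "B")
print(g["SS"])  # → ['hi bob', 'see you', 'hey carol']
```

The groups are nested (SC ⊆ SP ⊆ SS), which gives SSREM negative samples of graded similarity to the ground-truth response.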

image image

  • Korean SAT English subject problem image image

Experiment 1

  • Goal: Correlation with human scores
  • Human scores
    • Annotate the appropriateness of 1,200 responses
    • Use Amazon MTurk
  • Comparison metrics
    • BLEU [Papineni et al., ACL 2002]
    • ROUGE-L [Lin, TSBO 2004]
    • EMB [Liu et al., EMNLP 2016]
    • RUBER [Tao et al., AAAI 2018]
    • RSREM ($R_{cand} = \{r_A, rand_A^{(1)}, rand_A^{(2)}, rand_A^{(3)}, rand_A^{(4)}\}$)

Experiment 1 - Result

  • Correlation with human scores image

Experiment 2

  • Goal: Identifying true/false responses
  • Responses
    • True
      • Ground truth (GT)
    • False
      • Same conversation (SC)
      • Same Partner (SP)
      • Same Speaker (SS)
      • Random (Rand)
  • Comparison metrics
    • RUBER [Tao et al., AAAI 2018]
    • RSREM ($R_{cand} = \{r_A, rand_A^{(1)}, rand_A^{(2)}, rand_A^{(3)}, rand_A^{(4)}\}$)

Experiment 2 - Result

image

Experiment 3

  • Goal: applicability of SSREM
  • Data
    • Train: Twitter conversation corpus
    • Test: Movie script
  • Method
    • Correlation with human scores
    • Identifying true/false responses image

Experiment 3 - Result

  • Correlation with human scores image image

Challenges

  • More robust on adversarial attacks
    • How can we overcome various adversarial attacks?
      • Copying an utterance in the context image

Perplexity

image
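Perplexity is the exponential of the average negative log-probability the model assigns to each token; a quick sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A uniform model over a 4-word vocabulary assigns log(1/4) to every token,
# so its perplexity is exactly 4: it is as confused as a 4-way coin flip.
lp = [math.log(0.25)] * 10
print(round(perplexity(lp), 6))  # → 4.0
```

Lower is better; the Meena paper uses perplexity as its main automatic metric and reports that it correlates with human-judged sensibleness and specificity.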

ACUTE-Eval

image
