
Word Representation

Word Representations

  • Count-based
    • Created by a simple function of the counts of nearby words (tf-idf, PPMI; see the PPMI sketch after this list)
  • Class-based
    • Created through hierarchical clustering (Brown clusters)
  • Distributed prediction-based embeddings
    • Created by training a classifier to distinguish nearby and far-away words (Word2vec, Fasttext)
  • Distributed contextual embeddings from language models
    • Embeddings from language model (ELMo, BERT, GPT)
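
As a concrete illustration of the count-based family above, here is a minimal sketch that turns a toy word-by-context co-occurrence matrix into PPMI vectors; the vocabulary, counts, and window choice are all made up for illustration.

```python
import numpy as np

# Toy word-by-context co-occurrence counts (rows: target words, columns: context words).
# The vocabulary and the numbers are invented purely for illustration.
vocab = ["bicycle", "riding", "computer", "keyboard"]
counts = np.array([
    [0, 8, 1, 0],
    [8, 0, 2, 1],
    [1, 2, 0, 7],
    [0, 1, 7, 0],
], dtype=float)

total = counts.sum()
p_xy = counts / total                              # joint probability P(w, c)
p_x = counts.sum(axis=1, keepdims=True) / total    # marginal P(w)
p_y = counts.sum(axis=0, keepdims=True) / total    # marginal P(c)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_xy / (p_x * p_y))              # pointwise mutual information
ppmi = np.maximum(pmi, 0)                          # PPMI: clip negative (and -inf) values to 0

print(np.round(ppmi, 2))                           # each row is a count-based vector for one word
```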

Language Models

  • Probability distributions over sentences
    • P(W) = P(w1, w2, w3, …, wk)
    • Ex) Probability of “I like riding a bicycle”
  • Can use them to generate strings
    • P(wk | w1, w2, w3, …, wk-1)
    • Ex) Probability of “bicycle” given “I like riding a”
  • Rank possible sentences (see the toy scoring sketch after this list)
    • Ex) P(“I like riding a bicycle”) > P(“like a I bicycle riding”)
    • Ex) P(“I like riding a bicycle”) > P(“I like riding a computer”)
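
A minimal sketch of the ranking idea: score each sentence with the chain rule P(w1, …, wk) = ∏ P(wi | w1, …, wi-1), here approximated with a hand-made bigram table. Every probability below is invented for illustration, not estimated from data.

```python
import math

# Hand-made bigram probabilities P(w_i | w_{i-1}); the numbers are invented for illustration.
bigram_p = {
    ("<s>", "I"): 0.4, ("I", "like"): 0.3, ("like", "riding"): 0.1,
    ("riding", "a"): 0.5, ("a", "bicycle"): 0.2, ("a", "computer"): 0.0001,
    ("<s>", "like"): 0.01, ("like", "a"): 0.05, ("a", "I"): 0.0001,
    ("I", "bicycle"): 0.0001, ("bicycle", "riding"): 0.0001,
    ("bicycle", "</s>"): 0.3, ("computer", "</s>"): 0.3, ("riding", "</s>"): 0.1,
}

def sentence_logprob(words):
    """Chain rule with a bigram (first-order Markov) approximation."""
    tokens = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_p.get(pair, 1e-6)) for pair in zip(tokens, tokens[1:]))

print(sentence_logprob("I like riding a bicycle".split()))   # highest score
print(sentence_logprob("like a I bicycle riding".split()))   # lower
print(sentence_logprob("I like riding a computer".split()))  # lower
```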

(Figure: applications of language models)

  • N-gram models
    • P(wk | w1, w2, w3, …, wk-1) ≈ P(wk | wk-n+1, …, wk-1)
    • Unigram, Bigram, Trigram, … (see the counting sketch after this list)
  • Neural language models
    • RNN
    • ELMo
    • BERT
    • GPT
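
A small sketch of the n-gram approximation (here n = 2): estimate P(wk | wk-1) by maximum likelihood from bigram counts over a toy corpus. The corpus below is made up, and a real model would also need smoothing for unseen n-grams.

```python
from collections import Counter

# Tiny made-up corpus; in practice this would be a large text collection.
corpus = [
    "I like riding a bicycle",
    "I like riding a bike",
    "you like riding a bicycle",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """MLE estimate: P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("bicycle", "a"))  # 2/3: "a" is followed by "bicycle" in 2 of the 3 sentences
```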

RNN (Recurrent Neural Network)

A family of neural networks for processing sequential data
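
A minimal NumPy sketch of a single recurrent cell (the sizes and parameter names are illustrative, not tied to any library): the same weights are reused at every time step, and the hidden state carries information along the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5            # illustrative sizes

# Parameters shared across all time steps.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

x = rng.normal(size=(seq_len, input_dim))           # a toy input sequence
h = np.zeros(hidden_dim)                            # initial hidden state

for t in range(seq_len):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)

print(h.shape)  # (8,): the final hidden state summarizes the whole sequence
```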


Limitation of a naive RNN: the long-term dependency problem (information from early time steps is hard to preserve because gradients vanish or explode over long sequences)


ELMo

Builds word embeddings from two separate directional LSTMs (a forward and a backward language model)
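
A rough PyTorch sketch of this idea (not the actual ELMo implementation): run a forward LSTM and a separate backward LSTM over the same tokens and concatenate their hidden states into one contextual vector per token. All sizes are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 16, 32               # illustrative sizes
emb = nn.Embedding(vocab_size, emb_dim)
fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # reads left-to-right
bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # reads right-to-left

tokens = torch.randint(0, vocab_size, (1, 6))               # a toy sentence of 6 token ids
x = emb(tokens)

h_fwd, _ = fwd_lstm(x)                                      # (1, 6, hidden_dim)
h_bwd, _ = bwd_lstm(torch.flip(x, dims=[1]))                # run over the reversed sequence
h_bwd = torch.flip(h_bwd, dims=[1])                         # re-align to the original word order

contextual = torch.cat([h_fwd, h_bwd], dim=-1)              # (1, 6, 2 * hidden_dim)
print(contextual.shape)
```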


BERT

Builds word embeddings with a Transformer encoder
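
A minimal sketch of pulling contextual word embeddings out of a pretrained BERT with the Hugging Face transformers library (this assumes transformers and torch are installed and the bert-base-uncased checkpoint can be downloaded):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I like riding a bicycle", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)word token, including [CLS] and [SEP].
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # e.g. (1, number_of_tokens, 768) for bert-base
```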


Attention

Humans pay attention in order to correlate words within a sentence or different regions of an image
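
A minimal NumPy sketch of (scaled dot-product) attention, the mechanism behind this intuition: each query scores all keys, the scores are turned into weights with a softmax, and the output is a weighted sum of the values. The shapes below are toy choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each query attends to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))              # 5 positions, dimension 8 (toy sizes)
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)                # (5, 8) (5, 5)
```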


Transformer

  • Limitation of RNNs: they cannot run in parallel (strictly sequential computation over time steps)
  • A deep model built from a stack of attention-based Transformer blocks
  • Self-attention model (see the sketch after this list)
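
A short sketch of a self-attention layer using PyTorch’s built-in nn.MultiheadAttention (the sizes are illustrative): in self-attention the same sequence serves as query, key, and value, and every position is processed in parallel rather than step by step as in an RNN.

```python
import torch
import torch.nn as nn

seq_len, d_model, n_heads = 6, 64, 8     # illustrative sizes
x = torch.randn(1, seq_len, d_model)     # (batch, sequence, embedding)

self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
# Self-attention: the sequence attends to itself (query = key = value = x).
out, attn_weights = self_attn(x, x, x)

print(out.shape)           # (1, 6, 64): one updated vector per position
print(attn_weights.shape)  # (1, 6, 6): how much each position attends to every other
```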


  • Language understanding is bidirectional (forward and backward)
    • ELMo models bidirectionality only shallowly (two separate one-directional LSTMs)
  • Let’s use a bidirectional encoder to encode text
    • But an RNN is too slow
    • Let’s use a Transformer to make the encoder fast
  • How do we train the model?

  • Language understanding is bidirectional (forward and backward)
  • Let’s use a Transformer to encode text
  • Let’s mask out some input words, and then predict the masked words (see the fill-mask sketch below)
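
A sketch of the masked-word objective at inference time, using the Hugging Face transformers fill-mask pipeline with a pretrained BERT checkpoint (assumes the library is installed and bert-base-uncased can be downloaded):

```python
from transformers import pipeline

# Pretrained BERT was trained this way: mask a word and predict it from both directions.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for candidate in unmasker("I like riding a [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```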


Pretrained BERT models perform well on a wide range of NLP tasks
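
This transfer usually works by putting a small task-specific head on top of the pretrained encoder and fine-tuning it; a minimal sketch with transformers (the two-label classification setup is just an illustrative choice of downstream task):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pretrained BERT body plus a freshly initialized classification head (2 labels as an example).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["I like riding a bicycle"], return_tensors="pt")
logits = model(**batch).logits   # shape (1, 2); only meaningful after fine-tuning on task data
print(logits.shape)
```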


GPT-2

  • Uses Transformer decoder blocks (BERT uses encoder blocks)
  • Trained by predicting the next word given the preceding strings (see the generation sketch after this list)
  • Trained on a large dataset (40 GB of text) with a large model (1.5B parameters)
  • Also performs well on various NLP tasks
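
A short sketch of next-word generation with the released GPT-2 weights via the transformers text-generation pipeline (assumes the library is installed and the gpt2 checkpoint can be downloaded; this loads the small 124M-parameter model, not the 1.5B one):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts the next word given the strings generated so far.
out = generator("I like riding a", max_new_tokens=10, num_return_sequences=1)
print(out[0]["generated_text"])
```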

GPT-3

(Figures: GPT-3 examples)

(Figures: Language Models; Unmodeled Representation)