[NLP] Word Representation (2)
Word Representations
- Count-based
- Created by a simple function of the counts of nearby words (tf-idf, PPMI)
- Class-based
- Created through hierarchical clustering (Brown clusters)
- Distributed prediction-based embeddings
- Created by training a classifier to distinguish nearby and far-away words (Word2vec, Fasttext)
- Distributed contextual embeddings from language models
- Embeddings from language model (ELMo, BERT, GPT)
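To make the count-based row above concrete, here is a minimal sketch (the toy corpus, window size, and all names are illustrative, not from the original notes) that turns counts of nearby words into PPMI scores:

```python
import numpy as np

# Count-based representation sketch: co-occurrence counts within a +/-1 word
# window, converted to PPMI = max(0, log P(w,c) / (P(w)P(c))).
corpus = ["i like riding a bicycle", "i like riding a horse"]
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):                 # nearby words (window = 1)
            if 0 <= j < len(words):
                counts[idx[w], idx[words[j]]] += 1

total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                        # keep only positive PMI
ppmi[~np.isfinite(ppmi)] = 0.0                   # zero out log(0) cells

print(dict(zip(vocab, np.round(ppmi[idx["riding"]], 2))))  # row vector for "riding"
```

The resulting row is a word vector built from a simple function of counts, in contrast to the prediction-based and contextual embeddings listed above.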
Language Models
- Probability distributions over sentences
- P(W) = P(w1, w2, w3, …, wk)
- Ex) Probability of “I like riding a bicycle”
- Can use them to generate strings
- P(wk | w1, w2, w3, …, wk-1)
- Ex) Probability of “bicycle” given the string “I like riding a”
- Rank possible sentences
- Ex) P(“I like riding a bicycle”) > P(“like a I bicycle riding”)
- Ex) P(“I like riding a bicycle”) > P(“I like riding a computer”)
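For reference, the sentence probability P(W) above factors by the chain rule (standard language-model algebra, not spelled out in the original notes), which is what both the generation and ranking uses rely on:

$$
P(w_1, w_2, \dots, w_k) = \prod_{i=1}^{k} P(w_i \mid w_1, \dots, w_{i-1})
$$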
Application
- N-gram model (see the sketch after this list)
- P(wk | w1, w2, w3, …, wk-1) ≈ P(wk | wk-n+1, …, wk-1)
- Unigram, Bigram, Trigram, …
- Neural language models
- RNN
- ELMo
- BERT
- GPT
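A minimal sketch of the n-gram approximation (bigram case, n = 2), with a toy corpus and plain maximum-likelihood counts chosen only for illustration; it reproduces the kind of sentence ranking shown above:

```python
from collections import Counter, defaultdict

# Bigram model: P(w_k | w_{k-1}) estimated as count(w_{k-1}, w_k) / count(w_{k-1}).
corpus = ["<s> i like riding a bicycle </s>",
          "<s> i like riding a horse </s>"]

bigram_counts = defaultdict(Counter)
for sent in corpus:
    words = sent.split()
    for prev, cur in zip(words, words[1:]):
        bigram_counts[prev][cur] += 1

def p(cur, prev):
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][cur] / total if total else 0.0

def sentence_prob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= p(cur, prev)           # chain rule with the bigram approximation
    return prob

print(sentence_prob("i like riding a bicycle"))   # 0.5
print(sentence_prob("like a i bicycle riding"))   # 0.0: contains unseen bigrams
```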
RNN (Recurrent Neural Network)
A family of neural networks for processing sequential data
Limitation of naive RNNs: the long-term dependency problem
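A minimal sketch of the RNN recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b); the sizes and random weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 3, 5
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                  # initial hidden state
for t in range(seq_len):                  # must step through time sequentially
    x_t = rng.normal(size=input_dim)      # stand-in for the t-th word vector
    h = np.tanh(W_x @ x_t + W_h @ h + b)  # same weights reused at every step
print(h)                                  # final hidden state summarizes the sequence
```

Because the same W_h is multiplied in at every step, the influence of early inputs tends to vanish (or explode) over long sequences, which is the long-term dependency problem noted above.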
ELMo
Builds word embeddings from two separate directional LSTMs (forward and backward)
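This is not real ELMo (which trains deep LSTM language models on a large corpus); it is only a minimal sketch of the "two separate directional LSTMs" idea, with made-up sizes, assuming PyTorch is available:

```python
import torch
import torch.nn as nn

class TinyBiLM(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # left-to-right
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # right-to-left

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        fwd_out, _ = self.fwd_lstm(x)
        bwd_out, _ = self.bwd_lstm(torch.flip(x, dims=[1]))
        bwd_out = torch.flip(bwd_out, dims=[1])      # re-align to the original order
        # ELMo-style contextual embedding: concatenate both directions per token
        return torch.cat([fwd_out, bwd_out], dim=-1)

tokens = torch.randint(0, 1000, (1, 5))
print(TinyBiLM()(tokens).shape)   # torch.Size([1, 5, 128])
```

Each direction only ever sees one side of the context, which is why the transformer section below calls this kind of bidirectionality "shallow".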
BERT
Builds word embeddings from a transformer encoder
Attention
Humans pay attention in order to correlate words within a sentence or different regions of an image
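A minimal sketch of scaled dot-product self-attention, the mechanism the transformer below is built from; the shapes and random vectors are illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each word attends to the others
    weights = softmax(scores, axis=-1)   # attention distribution over the sentence
    return weights @ V                   # weighted mix of the value vectors

seq_len, d_k = 5, 8                      # e.g. a 5-word sentence
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(seq_len, d_k))   # self-attention: Q, K, V come from the same words
print(attention(Q, K, V).shape)          # (5, 8): one contextualized vector per word
```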
Transformer
- Limitation of RNNs: cannot run in parallel (sequential modeling)
- A deep model with a sequence of attention-based transformer blocks
- Self-attention model
- Language understanding is bidirectional (forward and backward)
- ELMo models bidirectionality only shallowly
- Let’s use a bidirectional encoder to encode text
- But RNNs are too slow
- Let’s use a transformer to run the encoder fast
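A minimal sketch, assuming PyTorch: a stack of transformer encoder layers processes every position of the sequence in one parallel pass (no step-by-step recurrence), which is why it is faster than an RNN encoder; the dimensions are illustrative only:

```python
import torch
import torch.nn as nn

# One self-attention-based encoder block, stacked twice.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randn(1, 10, 64)   # (batch, seq_len, d_model): 10 word embeddings
out = encoder(tokens)             # every position attends to every other position at once
print(out.shape)                  # torch.Size([1, 10, 64])
```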
How to train the model?
- Language understanding is bidirectional (forward and backward)
- Let’s use transformer to encode text
- Let’s mask out some input words, and then predict the masked words
A pretrained BERT model performs well on various NLP tasks
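A minimal usage sketch, assuming the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint: the pretrained masked language model fills in a masked word, which is exactly the training objective described above:

```python
from transformers import pipeline

# Ask pretrained BERT to predict the masked token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("I like riding a [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```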
GPT-2
- Uses transformer decoder blocks (BERT uses encoder blocks)
- Trained by predicting the next word given the preceding words
- Uses a large dataset (40GB) and a large model (1,500M parameters)
- Also performs well on various NLP tasks
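A minimal usage sketch, assuming the Hugging Face `transformers` package and the public `gpt2` checkpoint (the small 124M model, not the full 1,500M one): the model continues a prompt by repeatedly predicting the next word:

```python
from transformers import pipeline

# Generate a continuation of the prompt with pretrained GPT-2.
generator = pipeline("text-generation", model="gpt2")
result = generator("I like riding a", max_length=20, num_return_sequences=1)
print(result[0]["generated_text"])
```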
GPT-3
- Introduced in “Language Models are Few-Shot Learners”: a much larger GPT-style model that performs many NLP tasks from only a few examples given in the prompt, without fine-tuning
References
- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- http://jalammar.github.io/illustrated-bert/
- https://lilianweng.github.io/posts/2018-06-24-attention/
- http://jalammar.github.io/illustrated-transformer/
- https://medium.com/analytics-vidhya/openai-gpt-3-language-models-are-few-shot-learners-82531b3d3122
- https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html
- https://blog.pingpong.us/gpt3-review/