[NLP] Transformer
Transformer
Attention: humans pay attention to correlate words in one sentence or different regions of an image
Seq2seq Model
Encoder-Decoder model
Problems with Seq2seq Models
- Encoder-decoder models like to repeat themselves
- Example: …
- Why does this happen?
- Models trained poorly
- LSTM state is not behaving as expected so it gets stuck in a “loop” of generating the same output tokens again and again
- Solution?
- Need some notion of input coverage, or which input words we’ve translated
- Bad at long sentences
- A fixed-size hidden representation doesn’t scale
- LSTMs still have a hard time remembering for really long periods of time
- Unknown words
- Encoding these rare words into a vector space is really hard
- In fact, we don’t want to encode them; we want a way of directly looking back at the input and copying them
Aligned Inputs
- Suppose we knew the source and target would be word-by-word translated
- Can look at the corresponding input word when translating
- Much less burden on the hidden state
- How can we achieve this without hardcoding it?
Attention
At each decoder state, compute a distribution over the source inputs based on the current decoder state, and use it in the output layer
For each decoder state, compute a weighted sum of the input states
- No attention: $P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W h_i)$
- Attention: $P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W [c_i; h_i])$, where $c_i = \sum_j \alpha_{ij} h_j$, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}$, and $e_{ij} = f(h_i, h_j)$ (here $h_i$ is the decoder state and $h_j$ an encoder state)
- Encoder hidden states capture contextual source word identity
- Decoder hidden states are now mostly responsible for selecting what to attend to
- Doesn’t take a complex hidden state to walk monotonically through a sentence and spit out word-by-word translations
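Below is a minimal NumPy sketch of this attention step, assuming a plain dot-product scoring function f(h_i, h_j) and random stand-in states; the variable names (`encoder_states`, `decoder_state`) and the toy dimensions are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Toy setup: 5 source words, hidden size 8 (random stand-ins for real encoder/decoder states)
encoder_states = rng.normal(size=(5, 8))   # h_j for each source position j
decoder_state = rng.normal(size=(8,))      # h_i at the current decoder step

# e_ij = f(h_i, h_j): here a plain dot product
scores = encoder_states @ decoder_state             # shape (5,)

# a_ij: softmax over source positions
attn_weights = softmax(scores)                      # shape (5,), sums to 1

# c_i = sum_j a_ij * h_j: weighted sum of encoder states
context = attn_weights @ encoder_states             # shape (8,)

# Output layer now sees [c_i; h_i] instead of h_i alone
W = rng.normal(size=(10, 16))                       # 10 = toy vocabulary size
logits = W @ np.concatenate([context, decoder_state])
probs = softmax(logits)                             # P(y_i | x, y_1, ..., y_{i-1})
```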
Self-Attention
Sentence Encoders
- LSTM abstraction
- Maps each vector in a sentence to a new, context-aware vector
- CNNs do something similar with filters
- Attention can give us a third way to do this
What words need to be contextualized here?
- Pronouns need to look at antecedents
- Ambiguous words should look at context
- Words should look at syntactic parents/children
- Want: [image]
- LSTMs/CNNs: tend to look at local context [image]
- To appropriately contextualize embeddings, we need to pass information over long distances dynamically for each word
Each word forms a “query”, which then computes attention over every word in the sentence
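As a minimal illustration of that idea, the sketch below computes self-attention with raw dot products between word vectors and no learned projections (those appear with multi-head attention below); the shapes and names are illustrative assumptions.

```python
import numpy as np

def self_attention(X):
    """Each row of X is a word vector; every word 'queries' all words
    (including itself) and gets back a weighted sum of the word vectors."""
    scores = X @ X.T                                          # (n, n) dot-product scores
    scores = scores - scores.max(axis=-1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax row by row
    return weights @ X                                        # (n, d) contextualized vectors

# Toy sentence: 4 words with 6-dimensional embeddings
X = np.random.default_rng(1).normal(size=(4, 6))
contextualized = self_attention(X)                            # same shape as X
```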
Multiple “heads”, analogous to different convolutional filters: use parameters Wk & Vk to get different attention values and transform the vectors (a sketch follows the Query/Key/Value notes below)
- Attend nearby & to semantically related terms
- Why multiple heads?
- Softmax ends up being peaked; a single distribution cannot easily put weight on multiple things
- Query: the vector for the word that is doing the attending (what it is looking for)
- Key: the vector each word is matched against; the query–key score determines how much attention that word gets
- Value: the vector that is actually weighted by the attention scores and summed into the output
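Below is a sketch of multi-head self-attention under the standard scaled dot-product formulation, with separate per-head projection matrices playing the role of the Wk & Vk above; the shapes, names, and the scaling by sqrt(d_k) are common-practice assumptions rather than details taken from this post.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads):
    """X: (n, d) word vectors. heads: list of (Wq, Wk, Wv) projection triples,
    one per head. Each head computes its own attention distribution, and the
    per-head outputs are concatenated."""
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values for this head
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)           # (n, n) scaled dot products
        A = softmax(scores)                       # one attention distribution per word
        outputs.append(A @ V)                     # (n, d_k) per-head output
    return np.concatenate(outputs, axis=-1)       # (n, num_heads * d_k)

# Toy example: 4 words, model dimension 8, 2 heads of size 4
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 8))
heads = [(rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
         for _ in range(2)]
out = multi_head_self_attention(X, heads)         # shape (4, 8)
```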
Transformer
References
- https://lilianweng.github.io/posts/2018-06-24-attention/