[NLP] Machine Translation
Machine Translation
The task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language)
Challenges
- Ambiguities
- Word
- Morphology
- Syntax
- Semantics
- Pragmatics
- Gaps in data
- Availability of parallel corpora
- Commonsense knowledge
- Understanding of context, connotation, social norms, etc.
When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” (Warren Weaver, 1947)
Noisy Channel Models
- A pattern for modeling a pair of random variables, W and A
- W is the plaintext, the true message, the missing information
- A is the ciphertext, the garbled message, the observable evidence
- Decoding: select w given A = a
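Concretely, decoding applies Bayes' rule, so the search decomposes into two simpler models (a sketch of the standard derivation; in MT terms, w is the target sentence and a is the observed source sentence):

```latex
\hat{w} = \arg\max_{w} p(w \mid a)
        = \arg\max_{w} \frac{p(a \mid w)\, p(w)}{p(a)}
        = \arg\max_{w} \underbrace{p(a \mid w)}_{\text{channel model}} \; \underbrace{p(w)}_{\text{language model}}
```

Since p(w) is an ordinary target language model, it can be estimated from monolingual data alone.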
MT as Direct Modeling
- one model does everything
- Trained to reproduce a corpus of translations
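As an equation, the direct view models the conditional distribution itself, typically factored left to right (a sketch of the standard formulation):

```latex
\hat{y} = \arg\max_{y} p(y \mid x; \theta)
        = \arg\max_{y} \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x; \theta)
```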
Two Views of MT
- Noisy channel model
- I know the target language
- I have example translated texts (example enciphered data)
- Direct model
- I have really good learning algorithms and a bunch of example inputs (source language sentences) and outputs (target language translations)
- Noisy channel model
- Easy to use monolingual target language data
- Search happens under a product of two models
- Individual models can be simple, product can be powerful
- Direct model
- Directly model the process you care about
- Model must be very powerful
Now?
- Direct modeling is where most of the action is
- Neural networks are very good at generalizing and conceptually very simple
- Inference in “product of two models” is hard
- Noisy channel ideas are incredibly important and still play a big role in how we think about translation
Parallel Corpora
- Europarl (proceedings of European parliament, 50M words/language)
- http://www.statmt.org/europarl/
- UN Corpus (United Nations documents, six languages, 300M words/language)
- http://www.euromatrixplus.net/multi-un
- Common Crawl (web documents, long tail of language pairs)
Challenges
Word Translation Ambiguity
- What is the best translation?
- Solution intuition: use counts in parallel corpus
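A minimal sketch of that counting intuition in Python; the three sentence pairs and the `translation_counts` helper are invented for illustration (real systems use proper word alignment, e.g., IBM Model 1, rather than raw co-occurrence):

```python
from collections import Counter

# Toy "parallel corpus" of (source, target) sentence pairs.
corpus = [
    ("das Haus ist klein", "the house is small"),
    ("das Haus ist alt", "the house is old"),
    ("die Bank ist alt", "the bank is old"),
]

def translation_counts(src_word, corpus):
    """Count target words co-occurring with a source word."""
    counts = Counter()
    for src, tgt in corpus:
        if src_word in src.split():
            counts.update(tgt.split())
    return counts

# "Haus" co-occurs most often with "house" (plus noise words).
print(translation_counts("Haus", corpus).most_common(3))
```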
Word Order
- Problem: different languages organize words in different order to express the same idea
- Solution intuition: language modeling!
Output Language Fluency
- What is most fluent?
- Solution intuition: a language modeling problem!
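A minimal sketch of the language-modeling intuition behind both word order and fluency: estimate a bigram model from toy target-language text (invented here) and compare candidate outputs. The more fluent order scores higher.

```python
import math
from collections import Counter

# Toy target-language data for estimating a bigram language model.
text = "the house is small . the house is old . the bank is old .".split()
bigrams = Counter(zip(text, text[1:]))
unigrams = Counter(text)

def score(sentence):
    """Log-probability under a bigram LM with add-one smoothing."""
    tokens = sentence.split()
    V = len(unigrams)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
        for a, b in zip(tokens, tokens[1:])
    )

print(score("the house is small"))   # higher (fluent order)
print(score("house the small is"))   # lower (scrambled order)
```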
How Good is Machine Translation Today?
MT History
Neural Machine Translation
Encoder-decoder framework
Encoder
- Reads the source sentence and encodes it into hidden vector representations
Decoder
- Generates the target sentence one token at a time, conditioned on the encoder output
- We have a model; how can we generate translations?
- Answers
- Sampling: generate a random sentence according to probability distribution
- Argmax: generate sentence with highest probability
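The two answers differ only in how the next token is chosen from the decoder's distribution. A minimal sketch with an invented next-token distribution over a toy vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-token distribution from one decoder step.
vocab = ["the", "a", "house", "Haus", "</s>"]
probs = np.array([0.45, 0.20, 0.25, 0.05, 0.05])

# Sampling: draw a random token according to the distribution.
sampled = rng.choice(vocab, p=probs)

# Argmax: always pick the single most probable token.
best = vocab[int(np.argmax(probs))]

print(sampled, best)
```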
Inference Methods
- Greedy inference
- We just start at the left, and use our classifier at each position to assign a label
- One by one, pick single highest probability word
- Problems
- Often generates easy words first
- Often prefers multiple common words to rare words
- Beam inference
- At each position keep the top k complete sequences
- Extend each sequence in every possible local way (each possible next word)
- The extensions compete for the k slots at the next position
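A minimal beam search sketch; `step_probs` is a hand-written stand-in for a real NMT decoder's next-word distribution:

```python
import math

def beam_search(step_probs, k=2, max_len=10, eos="</s>"):
    """Keep the k best partial sequences by total log-probability."""
    beams = [([], 0.0)]  # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens and tokens[-1] == eos:        # finished hypothesis
                candidates.append((tokens, logp))
                continue
            for tok, p in step_probs(tokens).items():
                candidates.append((tokens + [tok], logp + math.log(p)))
        # Extensions compete for the k slots at the next position.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams

# Toy deterministic "model" for illustration.
def step_probs(prefix):
    if not prefix:
        return {"the": 0.6, "a": 0.4}
    if prefix[-1] in ("the", "a"):
        return {"house": 0.7, "cat": 0.3}
    return {"</s>": 1.0}

for tokens, logp in beam_search(step_probs):
    print(" ".join(tokens), round(logp, 3))
```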
Google’s NMT System 2016
- Deep LSTM encoder-decoder with attention (Wu et al., 2016)
Transformer NMT 2017
- Encoder and decoder are both transformers
- The decoder consumes the previously generated token (and attends to the input), but has no recurrent state
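For intuition, the attention operation can be sketched as scaled dot-product attention, the core operation of the transformer (toy dimensions, NumPy only):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))   # one decoder query
K = rng.normal(size=(5, 8))   # five encoder states
V = rng.normal(size=(5, 8))
print(attention(Q, K, V).shape)  # (1, 8)
```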
Evaluation
- How good is a given machine translation system?
- Many different translations acceptable
- Evaluation metric
- Subjective judgments by human evaluators
- Automatic evaluation metrics
- Task-based evaluation
Adequacy and Fluency
- Adequacy: does the output preserve the meaning of the source sentence?
- Fluency: is the output fluent, grammatical text in the target language?
Automatic Evaluation Metrics
- Goal: computer program that computes quality of translations
- Advantages: low cost, optimizable, consistent
- Basic strategy
- Given: MT output
- Given: human reference translation
- Task: compute similarity between them
Precision and Recall of Words
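In the standard formulation, with correct the number of output words that also occur in the reference translation:

```latex
\text{precision} = \frac{\text{correct}}{\text{output-length}}, \qquad
\text{recall} = \frac{\text{correct}}{\text{reference-length}}, \qquad
F = \frac{\text{precision} \cdot \text{recall}}{(\text{precision} + \text{recall})/2}
```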
Bilingual Evaluation Understudy (BLEU)
- N-gram overlap between machine translation output and reference translation
- Compute precision for n-grams of size 1 to 4
- Add brevity penalty (for too short translations)
- Typically computed over the entire corpus, not single sentences
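A toy sentence-level BLEU sketch under simplifying assumptions (add-one smoothing, a single reference; real BLEU pools n-gram counts over the whole corpus and allows multiple references):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, reference, max_n=4):
    """Clipped n-gram precision for n = 1..4, geometric mean,
    times a brevity penalty for too-short hypotheses."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp = Counter(ngrams(hypothesis, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum((hyp & ref).values())             # clip by reference counts
        precisions.append((clipped + 1) / (sum(hyp.values()) + 1))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * geo_mean

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(round(bleu(hyp, ref), 3))  # ~0.49 on this toy pair
```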
Drawbacks of Automatic Metrics
- All words are treated as equally relevant
- Operate on local level
- Scores are meaningless (absolute value not informative)
- Human translators score low on BLEU
BLEU Correlates with Human Judgement