[NLP] Word Representation (1)
Word Representation
Unmodeled Representation
What does it mean to “know” a language?
Levels of Linguistic Knowledge
Phonetics / Phonology
- Pronunciation modeling
Orthography
- Character modeling
Word
- Language modeling
- Tokenization
- Spelling correction
Morphology
- Morphological analysis
- Tokenization
- Lemmatization
Syntax
- Syntactic parsing
- Ex) A ship-shipping ship ships shipping-ships.
Semantics
- Named entity recognition
- Word sense disambiguation
- Semantic role labelling
Discourse
- Reference resolution
- Discourse parsing
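To make a few of these levels concrete, here is a minimal sketch (not from the original post) that runs tokenization, part-of-speech tagging, and lemmatization on the ship example with NLTK; it assumes the usual NLTK data packages (tokenizer, tagger, WordNet) have already been downloaded.

```python
# Minimal sketch: word, syntax, and morphology levels on one sentence.
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

sentence = "A ship-shipping ship ships shipping-ships."

tokens = word_tokenize(sentence)   # word level: tokenization
tagged = pos_tag(tokens)           # syntax level: part-of-speech tags

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(tok.lower(), pos='v' if tag.startswith('V') else 'n')
          for tok, tag in tagged]  # morphology level: lemmatization

print(tagged)   # 'ships' may be tagged as a verb or a plural noun
print(lemmas)   # 'ships' -> 'ship' in either case
```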
Word
Lexical Semantics
- How should we represent the meaning of the word?
- Words, lemmas, senses, definitions
- Relationships between words or senses
- Lemma ‘pepper’
- Sense 1: spice from pepper plant
- Sense 2: the pepper plant itself
- Sense 3: another similar plant (Jamaican pepper)
- Sense 4: another plant with peppercorns (California pepper)
- Sense 5: capsicum (i.e. chili, paprika, bell pepper, etc)
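One concrete way to look up lemmas and senses is a sense inventory such as WordNet. The sketch below is an illustration only, assuming NLTK's WordNet corpus; its sense numbering need not match the list above.

```python
# Minimal sketch: list the WordNet senses (synsets) of the lemma 'pepper'.
from nltk.corpus import wordnet as wn

for synset in wn.synsets('pepper'):
    print(synset.name(), '-', synset.definition())
# e.g. pepper.n.01, pepper.n.02, ... each synset is one sense of 'pepper'
```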
Relations
- Synonymy
- Synonyms have the same meaning in some or all contexts
- Ex) big/large, automobile/car, water/H2O
- Antonymy
- Senses that are opposites with respect to one feature of meaning
- Ex) dark/light, short/long, in/out
- Similarity
- Not synonyms, but sharing some element of meaning
- Ex) car/bicycle, cow/horse
- Word relatedness
- Words can be related in any way, perhaps via a semantic frame or field
- Ex) car/bicycle: similar, car/gasoline: related but not similar
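WordNet also encodes several of these relations explicitly. A minimal sketch, again assuming NLTK's WordNet; the synset names (`car.n.01`, `dark.a.01`, ...) are WordNet identifiers, not something defined in this post.

```python
from nltk.corpus import wordnet as wn

# Synonymy: lemmas that share a synset
print(wn.synset('car.n.01').lemma_names())     # ['car', 'auto', 'automobile', ...]

# Antonymy: an explicit lemma-level relation
dark = wn.synset('dark.a.01').lemmas()[0]
print([a.name() for a in dark.antonyms()])     # ['light']

# Similarity: distance in the noun taxonomy (higher = more similar)
car = wn.synset('car.n.01')
bicycle = wn.synset('bicycle.n.01')
gasoline = wn.synset('gasoline.n.01')
print(car.path_similarity(bicycle))            # similar concepts
print(car.path_similarity(gasoline))           # related, but less similar
```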
Taxonomy
Lexical Semantics
How should we represent the meaning of the word?
- Dictionary definition
- Lemma and wordforms
- Senses
- Relationships between words or senses
- Taxonomic relationships
- Word similarity, word relatedness
- Semantic frames and roles
- Connotation and sentiment
- …
Distributional Hypothesis
- The meaning of a word is its use in the language [Wittgenstein PI 1943]
- If A and B have almost identical environments we say that they are synonyms [Harris 1954]
- You shall know a word by the company it keeps [Firth 1957]
Each word = one vector
- Similar words are nearby in space
- The standard way to represent meaning in NLP
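"Nearby in space" is usually measured with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors, invented for illustration only
vectors = {
    'car':      np.array([0.9, 0.1, 0.0]),
    'bicycle':  np.array([0.8, 0.2, 0.1]),
    'gasoline': np.array([0.4, 0.6, 0.2]),
}

print(cosine(vectors['car'], vectors['bicycle']))   # high: nearby in space
print(cosine(vectors['car'], vectors['gasoline']))  # lower: related, not similar
```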
Word Representations
- Count-based
- Created by a simple function of the counts of nearby words (tf-idf, PPMI); see the PPMI sketch after this list
- Class-based
- Created through hierarchical clustering (Brown clusters)
- Distributed prediction-based embeddings
- Created by training a classifier to distinguish nearby and far-away words (Word2vec, Fasttext)
- Distributed contextual embeddings from language models
- Embeddings from language model (ELMo, BERT, GPT)
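As an example of the count-based family, the sketch below computes PPMI from a toy co-occurrence matrix; the words, contexts, and counts are invented for illustration.

```python
import numpy as np

words = ['car', 'bicycle', 'gasoline']
contexts = ['drive', 'ride', 'fuel']
counts = np.array([[8., 2., 5.],    # car      co-occurs with drive/ride/fuel
                   [1., 9., 0.],    # bicycle
                   [3., 0., 7.]])   # gasoline

total = counts.sum()
p_wc = counts / total                     # joint probabilities P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)     # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)     # marginal P(c)

with np.errstate(divide='ignore'):
    pmi = np.log2(p_wc / (p_w * p_c))     # pointwise mutual information
ppmi = np.maximum(pmi, 0)                 # PPMI: clip negative PMI to 0

print(np.round(ppmi, 2))                  # each row is the PPMI vector of one word
```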
Word2Vec
- Popular embedding method
- Very fast to train
- Ideas
- Predict a word rather than count
- Words that are semantically similar often occur near each other in text
- Embeddings capture relational meaning Ex) king - man + woman ≒ queen
- Examples of Word2vec analogy test (Korean)
- [여름 (summer) - 더위 (heat) + 겨울 (winter)] = [마름 (dryness)]
- [선풍기 (electric fan) - 바람 (wind) + 눈 (eye)] = [눈물 (tears)]
- [사람 (person) - 지능 (intelligence) + 컴퓨터 (computer)] = [소프트웨어 (software)]
- [인생 (life) - 사람 (person) + 컴퓨터 (computer)] = [관리자 (administrator)]
- [그림 (picture) - 연필 (pencil) + 영화 (movie)] = [스타 (star)]
- [손 (hand) - 박수 (clapping) + 발 (foot)] = [달리기 (running)]
- [삼겹살 (pork belly) - 소주 (soju) + 맥주 (beer)] = [햄 (ham)]
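Analogy queries like these are answered with vector arithmetic over the embeddings. A minimal sketch using gensim and pretrained English word2vec vectors (an assumption; the Korean examples above come from a separately trained Korean model, and the pretrained file is a large download):

```python
import gensim.downloader as api

# Pretrained Google News word2vec vectors (~1.6 GB download)
wv = api.load('word2vec-google-news-300')

# king - man + woman ~= queen
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
```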
- Limitations
- Tokenizing words is a hard problem
- Similar words can appear in many different surface forms, and each form gets its own vector
FastText
- Subword (character n-gram) representation
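Because FastText builds a word's vector from its character n-grams, even unseen words get a vector. A minimal sketch with gensim's FastText on a toy corpus; the corpus and hyperparameters are placeholders.

```python
from gensim.models import FastText

sentences = [['the', 'ship', 'ships', 'cargo'],
             ['the', 'shipping', 'company', 'ships', 'goods']]

model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=50)   # min_n/max_n: character n-gram sizes

# 'shipment' never appears in the corpus, but still gets a vector from its n-grams
print(model.wv['shipment'][:5])
print(model.wv.similarity('ship', 'shipment'))
```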
Word2Vec & FastText - Limitations
One word can have several meanings/roles depending on the context. Ex) the word 'play'
- Elmo and Cookie Monster play a game.
- The Broadway play premiered yesterday.
- Flowers play an important role in mental health of people.
- Girls will play characters with a role to build the group’s social relationship.
- Some play the piano, while others dance, sing, and perform plays.
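Contextual models address this by giving 'play' a different vector in each sentence. A minimal sketch with Hugging Face Transformers and bert-base-uncased (an assumption; any contextual language model would illustrate the same point):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def play_vector(sentence):
    # Contextual embedding of the token 'play' in this particular sentence
    inputs = tokenizer(sentence, return_tensors='pt')
    idx = tokenizer.tokenize(sentence).index('play') + 1   # +1 for [CLS]
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (1, seq_len, 768)
    return hidden[0, idx]

game = play_vector('Elmo and Cookie Monster play a game.')
theatre = play_vector('The Broadway play premiered yesterday.')
music = play_vector('Some play the piano, while others dance and sing.')

cos = torch.nn.functional.cosine_similarity
print(cos(game, music, dim=0))      # 'play' as a verb in both sentences
print(cos(game, theatre, dim=0))    # verb vs. the noun 'play'
```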
References
- https://blog.naver.com/saltluxmarketing/221607368769
- https://github.com/MrBananaHuman/JamoFastText