[NLP] Word Representation (1)

Word Representation

Unmodeled Representation

What does it mean to “know” a language?

Levels of Linguistic Knowledge

Phonetics / Phonology

Pronunciation modeling

Orthography

Character modeling

Word

Language modeling
Tokenization
Spelling correction

Morphology

Morphological analysis
Tokenization
Lemmatization

Syntax

Syntactic parsing

A ship-shipping ship ships shipping-ships.

Semantics

Named entity recognition
Word sense disambiguation
Semantic role labelling

Discourse

Reference resolution
Discourse parsing

Word

Lexical Semantics

How should we represent the meaning of a word?
- Words, lemmas, senses, definitions
- Relationships between words or senses

Lemma ‘pepper’
- Sense 1: spice from pepper plant
- Sense 2: the pepper plant itself
- Sense 3: another similar plant (Jamaican pepper)
- Sense 4: another plant with peppercorns (California pepper)
- Sense 5: capsicum (e.g. chili, paprika, bell pepper)

Relations

Synonymy
- Synonyms have the same meaning in some or all contexts
- Ex) big/large, automobile/car, water/H2O
Antonymy
- Senses that are opposites with respect to one feature of meaning
- Ex) dark/light, short/long, in/out
Similarity
- Not synonyms, but sharing some element of meaning
- Ex) car/bicycle, cow/horse
Word relatedness
- Words can be related in any way, perhaps via a semantic frame or field
- Ex) car/bicycle: similar, car/gasoline: related but not similar

Taxonomy

Lexical Semantics

How should we represent the meaning of a word?

Dictionary definition
Lemma and wordforms
Senses
Relationships between words or senses
Taxonomic relationships
Word similarity, word relatedness
Semantic frames and roles
Connotation and sentiment
…

Distributional Hypothesis

The meaning of a word is its use in the language [Wittgenstein, PI §43]
If A and B have almost identical environments we say that they are synonyms [Harris 1954]
You shall know a word by the company it keeps [Firth 1957]

Each word = one vector

Similar words are nearby in space
The standard way to represent meaning in NLP
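
To make "similar words are nearby in space" concrete, here is a minimal sketch that compares word vectors by cosine similarity, one common closeness measure. The three-dimensional vectors in `toy_vectors` are made up for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example only.
toy_vectors = {
    "car":     [0.9, 0.8, 0.1],
    "bicycle": [0.8, 0.9, 0.2],
    "cow":     [0.1, 0.2, 0.9],
}

sim_car_bicycle = cosine_similarity(toy_vectors["car"], toy_vectors["bicycle"])
sim_car_cow = cosine_similarity(toy_vectors["car"], toy_vectors["cow"])

# With these toy values, car is closer to bicycle than to cow.
print(sim_car_bicycle > sim_car_cow)
```

Cosine similarity ignores vector length and looks only at direction, which is why it is often preferred over Euclidean distance for comparing embeddings.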

Word Representations

Count-based
- Created by a simple function of the counts of nearby words (tf-idf, PPMI)
Class-based
- Created through hierarchical clustering (Brown clusters)
Distributed prediction-based embeddings
- Created by training a classifier to distinguish nearby and far-away words (Word2vec, Fasttext)
Distributed contextual embeddings from language models
- Embeddings from language model (ELMo, BERT, GPT)
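
As a rough illustration of the count-based family above, the sketch below counts how often words co-occur within a small window and then converts the counts to PPMI, defined as max(0, log2(P(w, c) / (P(w) P(c)))). The toy corpus, the window size, and the function names are assumptions chosen for this example.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs that fall within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

def ppmi(counts):
    """Positive pointwise mutual information for every observed pair."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (w, c), n in counts.items():
        word_totals[w] += n
        context_totals[c] += n
    scores = {}
    for (w, c), n in counts.items():
        # P(w, c) / (P(w) * P(c)) simplifies to n * total / (count(w) * count(c)).
        pmi = math.log2(n * total / (word_totals[w] * context_totals[c]))
        scores[(w, c)] = max(0.0, pmi)
    return scores

# Toy corpus, invented for illustration.
corpus = [
    ["black", "pepper", "is", "a", "spice"],
    ["chili", "pepper", "is", "a", "plant"],
]
scores = ppmi(cooccurrence_counts(corpus, window=2))
print(scores[("black", "pepper")])
```

Each row of the resulting (word, context) table can serve as a sparse vector for that word.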

Word2Vec

Popular embedding method
Very fast to train
Ideas
- Predict a word rather than count
- Words that are semantically similar often occur near each other in text
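
The "predict rather than count" idea can be sketched by looking at the training data a skip-gram style model sees: each word is paired with the words around it, and the model learns to predict the neighbor from the target. The function below only generates those pairs. The window size and the function name are assumptions, and the actual Word2Vec training loop (for example negative sampling) is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["you", "shall", "know", "a", "word"]
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
```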

Embeddings capture relational meaning
- Ex) king - man + woman ≈ queen
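
The analogy above is usually computed as vector arithmetic followed by a nearest-neighbor search that excludes the query words. The sketch below does this over a tiny table of made-up vectors. The values in `toy_vectors` are hand-picked so that the arithmetic works out and do not come from a trained model.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c], excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

# Hypothetical embeddings; the dimensions loosely stand for (royalty, male, female).
toy_vectors = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
}

print(analogy("king", "man", "woman", toy_vectors))
```

With real embeddings the result is only approximately right, as the Korean examples in this section show.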

Examples of Word2vec analogy test (Korean)
- [여름 - 더위 + 겨울] = [마름]
- [선풍기 - 바람 + 눈] = [눈물]
- [사람 - 지능 + 컴퓨터] = [소프트웨어]
- [인생 - 사람 + 컴퓨터] = [관리자]
- [그림 - 연필 + 영화] = [스타]
- [손 - 박수 + 발] = [달리기]
- [삼겹살 - 소주 + 맥주] = [햄]
Limitations
- Tokenizing text into words is a hard problem
- The same word appears in many similar but slightly different surface forms

FastText

Subword representation
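
A minimal sketch of the subword idea: the word is wrapped in boundary markers and split into character n-grams, and the word vector is then built from the vectors of those n-grams. The function name is hypothetical. The default range of 3 to 6 characters mirrors the usual fastText defaults, but treat the details here as an illustration rather than the library's implementation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word`, with '<' and '>' marking the word boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because unseen or misspelled words still share n-grams with known words, a vector can be assembled for them as well, which addresses the surface-form limitation of Word2Vec noted earlier.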

Word2Vec & FastText - Limitations

One word can have several meanings or roles depending on the context, but these models assign it a single vector. Ex) the word ‘play’

Elmo and Cookie Monster play a game.
The Broadway play premiered yesterday.
Flowers play an important role in mental health of people.
Girls will play characters with a role to build the group’s social relationship.
Some play the piano, while others dance, sing, and perform plays.
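
The limitation can be seen in a small sketch: a static embedding is a plain lookup table, so ‘play’ gets the same vector whether it is a verb or a noun. The table `static_embeddings` and its values are invented for illustration.

```python
# Hypothetical static embedding table; the values are made up.
static_embeddings = {
    "play": [0.2, -0.5, 0.7],
    "game": [0.1, -0.4, 0.6],
}

def embed(tokens, table):
    """Look up each token; the surrounding context is never consulted."""
    return [table.get(tok.lower()) for tok in tokens]

sentence_1 = "Elmo and Cookie Monster play a game .".split()
sentence_2 = "The Broadway play premiered yesterday .".split()

vec_1 = embed(sentence_1, static_embeddings)[sentence_1.index("play")]
vec_2 = embed(sentence_2, static_embeddings)[sentence_2.index("play")]

# Same vector for the verb sense and the noun sense.
print(vec_1 == vec_2)
```

Contextual embeddings from language models (ELMo, BERT, GPT) are meant to address this by computing a vector per occurrence rather than per word type.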

References

https://blog.naver.com/saltluxmarketing/221607368769
https://github.com/MrBananaHuman/JamoFastText

LEE CHANWOO
