
Word Representation

Unmodeled Representation


What does it mean to “know” a language?

Levels of Linguistic Knowledge


Phonetics / Phonology

  • Pronunciation modeling


Orthography

  • Character modeling


Word

  • Language modeling
  • Tokenization
  • Spelling correction


Morphology

  • Morphological analysis
  • Tokenization
  • Lemmatization


Syntax

  • Syntactic parsing


Ex) A ship-shipping ship ships shipping-ships.

Semantics

  • Named entity recognition
  • Word sense disambiguation
  • Semantic role labelling


Discourse

  • Reference resolution
  • Discourse parsing



Word


Lexical Semantics

  • How should we represent the meaning of a word?
    • Words, lemmas, senses, definitions
    • Relationships between words or senses


  • Lemma ‘pepper’
    • Sense 1: spice from pepper plant
    • Sense 2: the pepper plant itself
    • Sense 3: another similar plant (Jamaican pepper)
    • Sense 4: another plant with peppercorns (California pepper)
    • Sense 5: capsicum (e.g. chili, paprika, bell pepper)


Relations

  • Synonymy
    • Synonyms have the same meaning in some or all contexts
    • Ex) big/large, automobile/car, water/H2O
  • Antonymy
    • Senses that are opposites with respect to one feature of meaning
    • Ex) dark/light, short/long, in/out
  • Similarity
    • Not synonyms, but sharing some element of meaning
    • Ex) car/bicycle, cow/horse
  • Word relatedness
    • Words can be related in any way, perhaps via a semantic frame or field
    • Ex) car/bicycle: similar, car/gasoline: related but not similar

Taxonomy


Lexical Semantics

How should we represent the meaning of a word?

  • Dictionary definition
  • Lemma and wordforms
  • Senses
  • Relationships between words or senses
  • Taxonomic relationships
  • Word similarity, word relatedness
  • Semantic frames and roles
  • Connotation and sentiment

Distributional Hypothesis

  • The meaning of a word is its use in the language [Wittgenstein, PI §43]
  • If A and B have almost identical environments we say that they are synonyms [Harris 1954]
  • You shall know a word by the company it keeps [Firth 1957]

Each word = one vector

  • Similar words are nearby in space
  • The standard way to represent meaning in NLP
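
"Nearby in space" is usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors made up for illustration (real embeddings are learned and have hundreds of dimensions); the names `cosine_similarity` and `toy_vectors` are not from the lecture.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, chosen by hand so that the two synonyms point
# in almost the same direction. They are NOT trained embeddings.
toy_vectors = {
    "car": [0.90, 0.80, 0.10],
    "automobile": [0.85, 0.82, 0.12],
    "banana": [0.10, 0.05, 0.95],
}

sim_close = cosine_similarity(toy_vectors["car"], toy_vectors["automobile"])
sim_far = cosine_similarity(toy_vectors["car"], toy_vectors["banana"])
print(f"car vs automobile: {sim_close:.3f}")
print(f"car vs banana:     {sim_far:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing word vectors.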


Word Representations

  • Count-based
    • Created by a simple function of the counts of nearby words (tf-idf, PPMI)
  • Class-based
    • Created through hierarchical clustering (Brown clusters)
  • Distributed prediction-based embeddings
    • Created by training a classifier to distinguish nearby and far-away words (Word2vec, FastText)
  • Distributed contextual embeddings from language models
    • Embeddings from language model (ELMo, BERT, GPT)
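
As a rough illustration of the count-based family, the sketch below counts co-occurrences within a window and converts them to PPMI (positive pointwise mutual information), i.e. max(0, log2(P(w,c) / (P(w)P(c)))). The toy corpus, the window size, and the function names are assumptions made for this example.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs that appear within `window` tokens."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts

def ppmi(pair_counts):
    """PPMI score for every observed (word, context) pair."""
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (w, c), n in pair_counts.items():
        word_totals[w] += n
        context_totals[c] += n
    scores = {}
    for (w, c), n in pair_counts.items():
        # P(w,c) / (P(w) * P(c)) simplifies to n * total / (count(w) * count(c)).
        pmi = math.log2((n * total) / (word_totals[w] * context_totals[c]))
        scores[(w, c)] = max(0.0, pmi)
    return scores

# Toy corpus, made up for illustration.
corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
scores = ppmi(cooccurrence_counts(corpus, window=1))
for pair in sorted(scores, key=scores.get, reverse=True)[:5]:
    print(pair, round(scores[pair], 3))
```

Each row of the resulting word-by-context table can serve as a sparse vector for that word. Pairs that were never observed are simply absent here, which corresponds to a PPMI of 0.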

Word2Vec

  • Popular embedding method
  • Very fast to train
  • Ideas
    • Predict a word rather than count
    • Words that are semantically similar often occur near each other in text
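
The "predict rather than count" idea can be sketched as building training data for a classifier: nearby words form positive pairs, and randomly drawn words form negatives. This is a simplified sketch, not the reference implementation. In particular, negatives are drawn uniformly here, whereas word2vec samples them from a smoothed unigram distribution. The function names are hypothetical.

```python
import random

def skipgram_pairs(tokens, window=2):
    """Positive (target, context) pairs: words within `window` of each other."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def negative_samples(target, context, vocab, k, rng):
    """Draw k 'far-away' words uniformly (a simplification, see lead-in).

    Assumes the vocabulary contains at least one word other than
    `target` and `context`.
    """
    candidates = [w for w in vocab if w != target and w != context]
    return [rng.choice(candidates) for _ in range(k)]

tokens = "the quick brown fox jumps".split()
vocab = sorted(set(tokens))
rng = random.Random(0)  # fixed seed so runs are repeatable

positives = skipgram_pairs(tokens, window=1)
for target, context in positives[:3]:
    negatives = negative_samples(target, context, vocab, k=2, rng=rng)
    print(f"target={target!r} positive={context!r} negatives={negatives}")
```

A classifier trained to score the positive pairs high and the negative pairs low ends up with weights that serve as the word embeddings.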


  • Embeddings capture relational meaning. Ex) king - man + woman ≈ queen
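
The analogy test can be sketched as vector arithmetic followed by a nearest-neighbor search. The vectors below are hypothetical and hand-built so that the first axis loosely stands for "royalty" and the second for "maleness"; trained embeddings have no such labeled axes, and the analogy only holds approximately.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c).

    The three query words are excluded from the candidates, as is
    customary in analogy tests.
    """
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Hypothetical hand-made vectors: [royalty, maleness, fruitiness].
toy = {
    "king": [0.9, 0.9, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man": [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

print(analogy("king", "man", "woman", toy))
```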


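  • Examples of Word2vec analogy test (Korean)
    • [여름 (summer) - 더위 (heat) + 겨울 (winter)] = [마름 (approx. dryness)]
    • [선풍기 (electric fan) - 바람 (wind) + 눈 (snow or eye)] = [눈물 (tears)]
    • [사람 (person) - 지능 (intelligence) + 컴퓨터 (computer)] = [소프트웨어 (software)]
    • [인생 (life) - 사람 (person) + 컴퓨터 (computer)] = [관리자 (administrator)]
    • [그림 (drawing) - 연필 (pencil) + 영화 (movie)] = [스타 (star)]
    • [손 (hand) - 박수 (clapping) + 발 (foot)] = [달리기 (running)]
    • [삼겹살 (pork belly) - 소주 (soju) + 맥주 (beer)] = [햄 (ham)]
  • Limitations
    • Tokenizing text into words is a hard problem
    • The same word appears in many similar but distinct surface forms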


FastText

  • Subword representation
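
A minimal sketch of the subword idea: FastText wraps each word in the boundary markers `<` and `>` and breaks it into character n-grams, and the word vector is built from the vectors of those n-grams. The function name `char_ngrams` is hypothetical, and the 3 to 6 range is an assumption based on the commonly cited defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# With n fixed to 3, "where" yields five trigrams.
print(char_ngrams("where", min_n=3, max_n=3))

# Different surface forms of a word share many n-grams, so their
# vectors share many components even if one form is rare or unseen.
shared = set(char_ngrams("eating")) & set(char_ngrams("eaten"))
print(sorted(shared))
```

Because an unseen word can still be split into known n-grams, this approach can produce a vector for out-of-vocabulary words, which addresses the surface-form limitation noted for Word2Vec.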


Word2Vec & FastText - Limitations

One word can have several meanings or roles depending on the context, but a static embedding assigns it a single vector. Ex) the word ‘play’

  • Elmo and Cookie Monster play a game.
  • The Broadway play premiered yesterday.
  • Flowers play an important role in mental health of people.
  • Girls will play characters with a role to build the group’s social relationship.
  • Some play the piano, while others dance, sing, and perform plays.
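
The limitation can be seen in a small sketch: a static embedding is a lookup table, so ‘play’ receives the same vector whether it is a verb or a noun. The table and its values are hypothetical.

```python
# Hypothetical static embedding table: one fixed vector per word type.
static_embeddings = {
    "play": [0.2, 0.7, 0.1],
    "game": [0.3, 0.6, 0.2],
    "broadway": [0.8, 0.1, 0.4],
}

def embed(tokens, table):
    """Look up each known token; the surrounding context is never consulted."""
    return {t: table[t] for t in tokens if t in table}

verb_sentence = "elmo and cookie monster play a game".split()
noun_sentence = "the broadway play premiered yesterday".split()

vec_as_verb = embed(verb_sentence, static_embeddings)["play"]
vec_as_noun = embed(noun_sentence, static_embeddings)["play"]
print(vec_as_verb == vec_as_noun)
```

Contextual embeddings from language models (ELMo, BERT, GPT) address this by computing the vector from the whole sentence rather than from the word alone.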




References

  • https://blog.naver.com/saltluxmarketing/221607368769
  • https://github.com/MrBananaHuman/JamoFastText
