[NLP] Topic Models
Topic Modeling
Motivation
Suppose you are given a massive corpus and asked to carry out the following tasks:
- Organize the documents into thematic categories
- Describe the evolution of these categories over time
- Enable a domain expert to analyze and understand the content
- Find relationships between the categories
- Understand how authorship influences the content
Topic modeling is a method of (usually unsupervised) discovery of latent or hidden structure in a corpus
- Applied primarily to text corpora, but techniques are more general
- Provides a modeling toolbox
- Has prompted the exploration of a variety of new inference methods to accommodate large-scale datasets
Beta-Bernoulli Model
- A fast food chain is considering a change in the blend of coffee beans they use to make their coffee
- To determine whether their customers prefer the new blend, the company plans to select a random sample of $n=100$ coffee-drinking customers and ask them to taste coffee made with the new blend and the old blend, in cups marked “A” and “B”
- Half the time the new blend will be in cup A, and half the time it will be in cup B
- Management wants to know if there is a difference in preference for the two blends
Model
- $\theta$: probability that a consumer will choose the new blend
- $X_1, \ldots, X_n$: a random sample from a $\mathrm{Bernoulli}(\theta)$ distribution
Prior
- Beta distribution $\mathrm{Beta}(\alpha, \beta)$, the conjugate prior for the Bernoulli likelihood
Posterior
- Proportional to the density of a Beta distribution $\mathrm{Beta}(\alpha', \beta')$
- $\alpha' = \alpha + \sum_i x_i$
- $\beta' = \beta + n - \sum_i x_i$
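As a rough sketch of how this conjugate update looks in code (the uniform $\mathrm{Beta}(1,1)$ prior and the simulated taste-test data below are illustrative assumptions, not part of the example):

```python
import numpy as np
from scipy import stats

# Illustrative assumptions: a uniform Beta(1, 1) prior and simulated taste-test outcomes.
rng = np.random.default_rng(0)
alpha, beta = 1.0, 1.0             # prior hyperparameters
n = 100                            # number of sampled customers
x = rng.binomial(1, 0.6, size=n)   # 1 = customer prefers the new blend (simulated)

# Conjugate update: posterior is Beta(alpha + sum(x), beta + n - sum(x))
alpha_post = alpha + x.sum()
beta_post = beta + n - x.sum()
posterior = stats.beta(alpha_post, beta_post)

print(f"posterior mean of theta: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

A posterior mean well above 0.5, with a credible interval excluding 0.5, would suggest that customers prefer the new blend.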
Dirichlet-Multinomial Model
- The Beta distribution generalizes to the Dirichlet distribution, just as the Bernoulli likelihood generalizes to the multinomial over $K$ outcomes
- The posterior is again a Dirichlet, with the observed counts added to the prior hyperparameters
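The same conjugate update carries over to counts with $K$ outcomes; a minimal sketch, assuming a symmetric Dirichlet prior and a toy count vector:

```python
import numpy as np

# Illustrative assumptions: a symmetric Dirichlet(1) prior over K word types
# and a toy document represented by its word-type counts.
K = 5
alpha = np.ones(K)                      # Dirichlet prior hyperparameters
counts = np.array([10, 3, 0, 2, 5])     # observed counts for each of the K types

# Conjugate update: posterior is Dirichlet(alpha + counts)
alpha_post = alpha + counts
posterior_mean = alpha_post / alpha_post.sum()

print("posterior Dirichlet parameters:", alpha_post)
print("posterior mean of theta:", np.round(posterior_mean, 3))
```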
Dirichlet-Multinomial Mixture Model
Mixture vs. Admixture
- Mixture (e.g., the Dirichlet-Multinomial mixture): each document is generated from a single topic
- Admixture (e.g., LDA): each document has its own distribution over topics, and each word draws its own topic
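A minimal sketch of the two generative processes (the topic matrix, vocabulary size, and document length are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 8, 20                 # topics, vocabulary size, words per document (toy values)
topics = rng.dirichlet(np.ones(V), K)    # each row: a topic's distribution over the vocabulary

def mixture_document():
    """Mixture model: one topic is drawn per document; every word comes from it."""
    z = rng.choice(K)                               # single topic for the whole document
    return rng.choice(V, size=doc_len, p=topics[z])

def admixture_document(alpha=0.5):
    """Admixture (LDA-style): per-document topic proportions; each word draws its own topic."""
    theta = rng.dirichlet(alpha * np.ones(K))       # document-specific topic proportions
    z = rng.choice(K, size=doc_len, p=theta)        # one topic per word
    return np.array([rng.choice(V, p=topics[zi]) for zi in z])

print("mixture doc:  ", mixture_document())
print("admixture doc:", admixture_document())
```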
Latent Dirichlet Allocation
- Q) Why does LDA work?
- A) LDA trades off two goals
- For each document, allocate its words to as few topics as possible
- For each topic, assign high probability to as few terms as possible
- These goals are at odds
- Putting a document in a single topic makes #2 hard:
- All of its words must have probability under that topic.
- Putting very few words in each topic makes #1 hard:
- To cover a document’s words, it must assign many topics to it.
- Trading off these goals finds groups of tightly co-occurring words
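One way to see this trade-off in practice is to fit LDA on a small bag-of-words matrix; the toy corpus below and the choice of scikit-learn's LatentDirichletAllocation are illustrative, not the only way to fit the model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus for illustration only.
docs = [
    "coffee beans blend taste customer",
    "customer prefers new coffee blend",
    "topic model corpus latent structure",
    "latent dirichlet allocation topic model",
]

# Bag-of-words counts, then LDA with 2 topics.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)          # per-document topic proportions (goal #1)
vocab = vectorizer.get_feature_names_out()

# Top words per topic (goal #2: each topic concentrates probability on few terms).
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
print("document-topic proportions:\n", doc_topic.round(2))
```

On a real corpus, `doc_topic` tends to be sparse per document while each row of `lda.components_` puts most of its mass on a small set of terms, which is the trade-off described above.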