[NLP] Topic Models
Topic Modeling
Motivation
Suppose you are given a massive corpus and asked to carry out the following tasks:
- Organize the documents into thematic categories
- Describe the evolution of these categories over time
- Enable a domain expert to analyze and understand the content
- Find relationships between the categories
- Understand how authorship influences the content
Topic modeling is a method of (usually unsupervised) discovery of latent or hidden structure in a corpus
- Applied primarily to text corpora, but techniques are more general
- Provides a modeling toolbox
- Has prompted the exploration of a variety of new inference methods to accommodate large-scale datasets
Beta-Bernoulli Model
- A fast food chain is considering a change in the blend of coffee beans they use to make their coffee
- To determine whether their customers prefer the new blend, the company plans to select a random sample of $n=100$ coffee-drinking customers and ask them to taste coffee made with the new blend and the old blend, in cups marked “A” and “B”
- Half the time the new blend will be in cup A, and half the time it will be in cup B
- Management wants to know if there is a difference in preference for the two blends
Model
- $\theta$: probability that a consumer will choose the new blend
- $X_1, \ldots, X_n$: a random sample from a $\mathrm{Bernoulli}(\theta)$ distribution
Prior
- Beta distribution $\mathrm{Beta}(\alpha, \beta)$, the conjugate prior for the Bernoulli likelihood
Posterior
- Proportional to the density of a Beta distribution $\mathrm{Beta}(\alpha', \beta')$
- $\alpha' = \alpha + \sum_i x_i$
- $\beta' = \beta + n - \sum_i x_i$
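As a rough sketch of how this conjugate update looks in code (the uniform $\mathrm{Beta}(1,1)$ prior and the simulated taste-test data below are illustrative assumptions, not part of the example):

```python
import numpy as np
from scipy import stats

# Illustrative assumptions: a uniform Beta(1, 1) prior and simulated taste-test outcomes.
rng = np.random.default_rng(0)
alpha, beta = 1.0, 1.0             # prior hyperparameters
n = 100                            # number of sampled customers
x = rng.binomial(1, 0.6, size=n)   # 1 = customer prefers the new blend (simulated)

# Conjugate update: posterior is Beta(alpha + sum(x), beta + n - sum(x))
alpha_post = alpha + x.sum()
beta_post = beta + n - x.sum()
posterior = stats.beta(alpha_post, beta_post)

print(f"posterior mean of theta: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

A posterior mean well above 0.5, with a credible interval excluding 0.5, would suggest that customers prefer the new blend.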
Dirichlet-Multinomial Model
- The Beta distribution generalizes to the Dirichlet distribution, just as the Bernoulli likelihood generalizes to the multinomial over $K$ outcomes
- The posterior is again a Dirichlet, with the observed counts added to the prior hyperparameters
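The same conjugate update carries over to counts with $K$ outcomes; a minimal sketch, assuming a symmetric Dirichlet prior and a toy count vector:

```python
import numpy as np

# Illustrative assumptions: a symmetric Dirichlet(1) prior over K word types
# and a toy document represented by its word-type counts.
K = 5
alpha = np.ones(K)                      # Dirichlet prior hyperparameters
counts = np.array([10, 3, 0, 2, 5])     # observed counts for each of the K types

# Conjugate update: posterior is Dirichlet(alpha + counts)
alpha_post = alpha + counts
posterior_mean = alpha_post / alpha_post.sum()

print("posterior Dirichlet parameters:", alpha_post)
print("posterior mean of theta:", np.round(posterior_mean, 3))
```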
Dirichlet-Multinomial Mixture Model
Mixture vs. Admixture
- Mixture (e.g., the Dirichlet-Multinomial mixture): each document is generated from a single topic
- Admixture (e.g., LDA): each document has its own distribution over topics, and each word draws its own topic
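A minimal sketch of the two generative processes (the topic matrix, vocabulary size, and document length are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 8, 20                 # topics, vocabulary size, words per document (toy values)
topics = rng.dirichlet(np.ones(V), K)    # each row: a topic's distribution over the vocabulary

def mixture_document():
    """Mixture model: one topic is drawn per document; every word comes from it."""
    z = rng.choice(K)                               # single topic for the whole document
    return rng.choice(V, size=doc_len, p=topics[z])

def admixture_document(alpha=0.5):
    """Admixture (LDA-style): per-document topic proportions; each word draws its own topic."""
    theta = rng.dirichlet(alpha * np.ones(K))       # document-specific topic proportions
    z = rng.choice(K, size=doc_len, p=theta)        # one topic per word
    return np.array([rng.choice(V, p=topics[zi]) for zi in z])

print("mixture doc:  ", mixture_document())
print("admixture doc:", admixture_document())
```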
Latent Dirichlet Allocation
- Q) Why does LDA work?
- A) LDA trades off two goals
- For each document, allocate its words to as few topics as possible
- For each topic, assign high probability to as few terms as possible
- These goals are at odds
- Putting a document in a single topic makes #2 hard:
- All of its words must have probability under that topic.
- Putting very few words in each topic makes #1 hard:
- To cover a document’s words, it must assign many topics to it.
- Trading off these goals finds groups of tightly co-occurring words
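One way to see this trade-off in practice is to fit LDA on a small bag-of-words matrix; the toy corpus below and the choice of scikit-learn's LatentDirichletAllocation are illustrative, not the only way to fit the model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus for illustration only.
docs = [
    "coffee beans blend taste customer",
    "customer prefers new coffee blend",
    "topic model corpus latent structure",
    "latent dirichlet allocation topic model",
]

# Bag-of-words counts, then LDA with 2 topics.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)          # per-document topic proportions (goal #1)
vocab = vectorizer.get_feature_names_out()

# Top words per topic (goal #2: each topic concentrates probability on few terms).
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
print("document-topic proportions:\n", doc_topic.round(2))
```

On a real corpus, `doc_topic` tends to be sparse per document while each row of `lda.components_` puts most of its mass on a small set of terms, which is the trade-off described above.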