
Topic Models

Topic Modeling

Motivation

Suppose you are given a massive corpus and asked to carry out the following tasks:

  • Organize the documents into thematic categories
  • Describe the evolution of these categories over time
  • Enable a domain expert to analyze and understand the content
  • Find relationships between the categories
  • Understand how authorship influences the content

A method of (usually unsupervised) discovery of latent or hidden structure in a corpus

  • Applied primarily to text corpora, but techniques are more general
  • Provides a modeling toolbox
  • Has prompted the exploration of a variety of new inference methods to accommodate large-scale datasets


Beta-Bernoulli Model

  • A fast food chain is considering a change in the blend of coffee beans they use to make their coffee
  • To determine whether their customers prefer the new blend, the company plans to select a random sample of $n=100$ coffee-drinking customers and ask them to taste coffee made with the old blend and the new blend, served in cups marked “A” and “B”
  • Half the time the new blend will be in cup A, and half the time it will be in cup B
  • Management wants to know if there is a difference in preference for the two blends

Model

  • $\theta$: probability that a consumer will choose the new blend
  • $X_1, \ldots, X_n$: a random sample from a $\mathrm{Bernoulli}(\theta)$ distribution

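With $x_i \in \{0, 1\}$ recording whether customer $i$ chose the new blend, the likelihood takes the standard Bernoulli product form (written out here for reference):

$$p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum_i x_i} (1-\theta)^{n - \sum_i x_i}$$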

Prior

  • Beta distribution

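As a reminder, the $\mathrm{Beta}(\alpha, \beta)$ prior places the density

$$p(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1} (1-\theta)^{\beta - 1}$$

on $\theta \in (0, 1)$; here $\alpha$ and $\beta$ act as pseudo-counts of prior successes and failures.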

Posterior

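Multiplying the Beta prior by the Bernoulli likelihood above gives

$$p(\theta \mid x_{1:n}) \propto \theta^{\alpha - 1}(1-\theta)^{\beta - 1} \cdot \theta^{\sum_i x_i}(1-\theta)^{n - \sum_i x_i} = \theta^{\alpha + \sum_i x_i - 1} (1-\theta)^{\beta + n - \sum_i x_i - 1}$$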

  • Proportional to the density of a Beta distribution $\mathrm{Beta}(\alpha', \beta')$
    • $\alpha' = \alpha + \sum_i x_i$
    • $\beta' = \beta + n - \sum_i x_i$

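A minimal numerical sketch of this conjugate update with SciPy; the observed count (58 of 100 preferring the new blend) and the uniform $\mathrm{Beta}(1,1)$ prior are illustrative assumptions, not values from the example above:

```python
from scipy import stats

# Hypothetical outcome: 58 of the n = 100 sampled customers choose the new blend.
n, successes = 100, 58

# Beta(1, 1) prior, i.e. uniform on (0, 1) -- an illustrative choice.
alpha, beta = 1.0, 1.0

# Conjugate update: posterior is Beta(alpha + sum_i x_i, beta + n - sum_i x_i).
posterior = stats.beta(alpha + successes, beta + n - successes)

# Management's question -- is there a preference? -- becomes P(theta > 0.5 | data).
print(posterior.sf(0.5))          # posterior probability that theta > 0.5
print(posterior.interval(0.95))   # central 95% credible interval for theta
```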

Dirichlet-Multinomial Model

  • Beta distribution


  • Dirichlet distribution

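The Dirichlet generalizes the Beta from the unit interval to the probability simplex: for $\theta = (\theta_1, \ldots, \theta_K)$ with $\theta_k \geq 0$ and $\sum_k \theta_k = 1$,

$$p(\theta \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$

It is conjugate to the multinomial: observing category counts $n_1, \ldots, n_K$ yields the posterior $\mathrm{Dirichlet}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$, exactly mirroring the Beta-Bernoulli update above.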

Dirichlet-Multinomial Mixture Model

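In this model every document is assigned exactly one topic. A standard statement of the generative process, with $K$ topics, corpus-level topic proportions $\theta$, and per-topic word distributions $\beta_k$ (notation chosen here for consistency with what follows):

$$\begin{aligned} & \theta \sim \mathrm{Dirichlet}(\alpha), \qquad \beta_k \sim \mathrm{Dirichlet}(\eta), \quad k = 1, \ldots, K \\ & z_d \mid \theta \sim \mathrm{Categorical}(\theta) \\ & w_{d,n} \mid z_d, \beta \sim \mathrm{Categorical}(\beta_{z_d}) \end{aligned}$$

A single draw $z_d$ governs every word in document $d$.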

Mixture vs. Admixture

In a mixture model, each document is generated from a single topic; in an admixture model such as LDA, each document has its own proportions over topics, and different words in the same document can come from different topics.

Latent Dirichlet Allocation

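For reference, the standard LDA generative process; unlike the mixture model, the topic indicator is drawn per word rather than per document:

$$\begin{aligned} & \beta_k \sim \mathrm{Dirichlet}(\eta), && k = 1, \ldots, K \\ & \theta_d \sim \mathrm{Dirichlet}(\alpha), && d = 1, \ldots, D \\ & z_{d,n} \mid \theta_d \sim \mathrm{Categorical}(\theta_d) \\ & w_{d,n} \mid z_{d,n}, \beta \sim \mathrm{Categorical}(\beta_{z_{d,n}}) \end{aligned}$$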

  • Q) Why does LDA work?
  • A)
    • LDA trades off two goals
      1. For each document, allocate its words to as few topics as possible
      2. For each topic, assign high probability to as few terms as possible
    • These goals are at odds
      • Putting a document in a single topic makes #2 hard:
        • All of its words must have high probability under that one topic.
      • Putting very few words in each topic makes #1 hard:
        • To cover a document’s words, it must assign many topics to it.
    • Trading off these goals finds groups of tightly co-occurring words
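A minimal sketch of fitting LDA in practice, using scikit-learn's `LatentDirichletAllocation`; the toy corpus, number of topics, and prior values are illustrative assumptions. Small `doc_topic_prior` and `topic_word_prior` values push toward the two sparsity goals described above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the coffee blend tastes smooth and rich",
    "customers prefer the new coffee blend",
    "topic models uncover latent structure in text",
    "latent dirichlet allocation is a topic model",
]

# LDA operates on bag-of-words counts, not tf-idf weights.
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,         # K, the number of topics (assumed here)
    doc_topic_prior=0.1,    # alpha: small value favors few topics per document
    topic_word_prior=0.01,  # eta: small value favors few terms per topic
    random_state=0,
).fit(counts)

# Approximate per-document topic proportions theta_d (rows sum to 1).
print(lda.transform(counts))
```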


Bayesian Topic Models


Neural Topic Models

