LDA: Latent Dirichlet Allocation

Motivation

Bag-of-words models in general perform quite well, and we're looking for a probabilistic model of text. Let's look at the underlying assumptions of a bag-of-words model:

  • Ordering of words within a document is irrelevant
  • Ordering of documents in entire corpus is irrelevant

In probability we call this exchangeability. Do you remember anything about this? The Dirichlet process, maybe?!

Notation

  • word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1,... ,V\}$. Words are represented using unit-basis vectors that have a single component equal to 1 and all other components equal to 0. Thus, using superscripts to denote components, the $v^{\text{th}}$ word in the vocabulary is represented by a $V$-vector $w$ such that $w^v=1$ and $w^u=0$ for $u \neq v$.
  • document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, ..., w_N)$ where $w_n$ is the $n^{\text{th}}$ word in the sequence
  • corpus is a collection of $M$ documents denoted by $D= \{\mathbf{w}_1, \mathbf{w}_2,... , \mathbf{w}_M\}$.
  • topic is characterized by a distribution over words
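To make the notation concrete, here is a tiny sketch of a word as a one-hot $V$-vector and a document as a sequence of such vectors. The vocabulary size and word indices are toy values assumed here, not taken from the text:

    import numpy as np

    V = 5                                    # toy vocabulary size (assumption)

    def one_hot(v, V=V):
        """Return the one-hot V-vector for the v-th vocabulary word (0-indexed)."""
        w = np.zeros(V)
        w[v] = 1                             # w^v = 1, all other components 0
        return w

    # A document is a sequence of N such word vectors, here N = 3.
    document = np.stack([one_hot(2), one_hot(0), one_hot(2)])
    print(document)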

Latent Dirichlet Allocation

  • Generative model of a corpus
  • Documents are represented as random mixtures over latent topics

Notation

  • $\alpha$ parameter for Dirichlet distribution
  • $\theta$ latent random variable for some document ($\theta_d$ for a specific document $d$) drawn from a $\text{Dir}(\alpha)$
  • $z$ is a $k$-dimensional one-hot vector representing the chosen topic for a word, i.e. with the $i$-th component equal to 1 and all other components equal to 0 ($z^i = 1$ for a unique $i$)
  • $z_n$ is the same as $z$, but for the specific $n$-th word $w_n$ in a document, i.e. the topic for this word.
  • $\beta$ is a $k \times V$ matrix parameterizing the word probabilities, i.e. $\beta_{ij} = p(w^j = 1 | z^i = 1)$

Model

Assume the following generative model for each document $\mathbf{w}$ in a corpus $D$:

  1. Choose $N \sim \text{Poisson} (\xi)$, i.e. the length of the document (following this distribution is not a strict requirement)
  2. Choose $\theta \sim \text{Dir} (\alpha)$, used as a parameter for a multinomial later on
  3. Each word $w_n$ in the document of length $N$ is assigned as follows:
    1. Choose a topic $z_n \sim \text{Multinomial} ( \theta )$
    2. Choose a word $w_n \sim p(w_n | z_n, \beta)$, multinomial distribution conditioned on the topic $z_n$

[Figure: generative_model.png, illustrating the generative process above]
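As a concrete illustration, here is a minimal sketch of this generative process. The number of topics $k$, vocabulary size $V$, hyperparameter $\alpha$, topic-word matrix $\beta$ and Poisson mean $\xi$ below are all toy assumptions, not values from the text:

    import numpy as np

    rng = np.random.default_rng(0)

    k, V = 3, 10                              # number of topics, vocabulary size (toy values)
    alpha = np.full(k, 0.5)                   # symmetric Dirichlet prior over topic mixtures
    beta = rng.dirichlet(np.ones(V), size=k)  # k x V: each row is one topic's word distribution
    xi = 8                                    # Poisson mean for the document length

    def generate_document():
        N = rng.poisson(xi)                   # 1. choose the document length
        theta = rng.dirichlet(alpha)          # 2. choose the topic mixture theta ~ Dir(alpha)
        words = []
        for _ in range(N):
            z = rng.choice(k, p=theta)        # 3a. choose a topic z_n ~ Multinomial(theta)
            w = rng.choice(V, p=beta[z])      # 3b. choose a word w_n from topic z_n's word distribution
            words.append(w)
        return theta, words

    theta, doc = generate_document()
    print(theta, doc)                         # topic mixture and word indices of one sampled document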

Dirichlet

My interpretation

From a Dirichlet we draw a $k$-vector which corresponds to a multinomial distribution, with each component corresponding to the probability of the corresponding topic / label / class. In our case, $p(z^i = 1 | \theta)$ then corresponds to the $i$-th component of $\theta$, i.e. the probability of topic $i$.

From the paper

A k-dimensional Dirichlet random variable $\theta$ can take values in the (k−1)-simplex (a k-vector $\theta$ lies in the (k − 1)-simplex if $\theta_i \ge 0, \sum_{i=1}^k \theta_i = 1$ ), and has the following probability density on this simplex:

    \begin{equation*}
    p(\theta | \alpha) = \frac{\Gamma \Big( \sum_{i=1}^k \alpha_i \Big)}{\prod_{i=1}^k \Gamma (\alpha_i)} \prod_{i=1}^k \theta_i^{\alpha_i - 1} \tag{1}
    \end{equation*}

When we're talking about the (k - 1)-simplex we're referring to the region of space where the parameters of the multinomial can lie (the domain of all $\theta_i$). Why $k - 1$? Since we require that the sum over all $\theta_i$ equals 1, after choosing values for $k-1$ of them, the last one is determined as $1 - \sum_{i=1}^{k-1} \theta_i$.
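A quick numerical sanity check of this (toy $\alpha$ assumed): samples drawn from a Dirichlet are $k$-vectors with non-negative components that sum to 1, i.e. points in the $(k-1)$-simplex.

    import numpy as np

    rng = np.random.default_rng(1)
    alpha = np.array([1.0, 2.0, 3.0])         # k = 3, so samples live in the 2-simplex
    samples = rng.dirichlet(alpha, size=5)

    print(samples)                            # each row: 3 non-negative components
    print(samples.sum(axis=1))                # every row sums to 1 (up to floating point error)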

The $k$ and $\theta$ being referred to in this section on the Dirichlet have nothing to do with the notation used elsewhere in this text; they simply represent some arbitrary integer $k$ and some random variable $\theta$.

Joint distributions

First we have the joint distribution over the topic mixture $\theta$, topics $\mathbf{z}$ and words $\mathbf{w}$ for one document:

    \begin{equation*}
    p(\theta, \mathbf{z}, \mathbf{w} | \alpha, \beta) = p(\theta | \alpha) \prod_{n=1}^N p(z_n | \theta) p(w_n | z_n, \beta) \tag{2}
    \end{equation*}

Here $p(z_n | \theta) = \theta_i$ for the unique $i$ such that $z_n^i = 1$, since $z_n$ is a one-hot vector representing the topic chosen for the $n$-th word.

Topic mixture simply refers to the latent variable $\theta$ on which the topic random variables $z_n$ depend.
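A minimal sketch of evaluating equation (2) for a single document, assuming $\theta$, the one-hot topic assignments $z_n$, the one-hot words $w_n$, $\alpha$ and $\beta$ are all given (scipy's Dirichlet density is used for $p(\theta | \alpha)$; the function name is just illustrative):

    import numpy as np
    from scipy.stats import dirichlet

    def joint_density(theta, z, w, alpha, beta):
        """p(theta, z, w | alpha, beta) for a single document, as in equation (2)."""
        p = dirichlet.pdf(theta, alpha)       # p(theta | alpha)
        for z_n, w_n in zip(z, w):            # z, w: sequences of one-hot vectors
            i = np.argmax(z_n)                # topic index with z_n^i = 1
            j = np.argmax(w_n)                # word index with w_n^j = 1
            p *= theta[i]                     # p(z_n | theta) = theta_i
            p *= beta[i, j]                   # p(w_n | z_n, beta) = beta_ij
        return p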

Then to obtain the probability of a document $\mathbf{w}$, i.e. a sequence of words $w_n$, we simply integrate over all topic mixtures $\theta$ and sum over all possible topic assignments $z_n$:

\begin{equation*}
p(\mathbf{w} | \alpha, \beta) = \int p(\theta | \alpha) \Bigg( \prod_{n=1}^N \sum_{z_{n}} p(z_{n} | \theta) p(w_n | z_n, \beta) \Bigg) d\theta \tag{3}
\end{equation*}
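The integral in equation (3) has no closed form, but a crude Monte Carlo sketch makes its structure explicit: draw topic mixtures $\theta$ from $\text{Dir}(\alpha)$ and average the product of per-word sums $\sum_i \theta_i \beta_{i w_n}$. The function below is an illustrative assumption, taking the document as a list of word indices rather than one-hot vectors:

    import numpy as np

    def doc_likelihood_mc(word_ids, alpha, beta, n_samples=10_000, seed=0):
        """Monte Carlo estimate of p(w | alpha, beta) from equation (3)."""
        rng = np.random.default_rng(seed)
        thetas = rng.dirichlet(alpha, size=n_samples)    # samples from p(theta | alpha)
        # For each sampled theta: prod_n sum_i theta_i * beta[i, w_n]
        per_word = thetas @ beta[:, word_ids]            # shape (n_samples, N)
        return per_word.prod(axis=1).mean()              # average over the theta samples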

And finally, to obtain the probability of our corpus $D$, i.e. our set of documents, we simply take the product over all documents:

\begin{equation*}
p(D | \alpha, \beta) = \prod_{d=1}^M \int p(\theta_d | \alpha) \Bigg( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} | \theta_d) p(w_{dn} | z_{dn}, \beta) \Bigg) d\theta_d \tag{4}
\end{equation*}

Inference

Our inference problem is to compute the posterior distribution of the latent variables $\theta$ and $\mathbf{z}$ given a document $\mathbf{w}$:

\begin{equation*}
p(\theta, \mathbf{z} | \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} | \alpha, \beta)}{p(\mathbf{w} | \alpha, \beta)} \tag{5}
\end{equation*}

Both of these we derived in the previous section! Unfortunately, this computation is intractable: the denominator $p(\mathbf{w} | \alpha, \beta)$ cannot be computed exactly in general. Luckily there exist methods to deal with this: variational inference, MCMC and the Laplace approximation.
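In practice one rarely implements this inference by hand. As one possible route (not the paper's own code), scikit-learn's LatentDirichletAllocation fits LDA with variational inference; the tiny corpus below is purely illustrative:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "cats and dogs are pets",
        "dogs chase cats",
        "stocks and bonds are investments",
        "investors buy stocks",
    ]

    X = CountVectorizer().fit_transform(docs)              # bag-of-words counts
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(X)                      # per-document topic mixtures (theta estimates)

    print(doc_topics)                                      # rows sum (approximately) to 1
    print(lda.components_)                                 # unnormalized topic-word weights (the beta analogue)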

Resources