LDA: Latent Dirichlet Allocation

Motivation
Notation
Latent Dirichlet Allocation
- Notation
- Model
Resources

Motivation

Bag-of-words models in general performs quite well, and we're looking for a probabilistic model of text. Let's look at the underlying assumptions of a bag-of-word model:

Ordering of words within a document is irrelevant
Ordering of documents in entire corpus is irrelevant

In probability we call this exchangebility. You remember anything about this? [BROKEN LINK: No match for fuzzy expression: *Dirichlet%20process] maybe?!

Notation

word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1,... ,V\}$ . Represented using unit-basis vectors that have a single component equal to 1 and all other components equal to 0. Thus, using superscripts to denote components,the $v^{\text{th}}$ word in the vocabulary is represented by a V-vector $w$ such that $w^i=1$ and $w^j=0$ for $i \neq j$ .
document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, ..., w_N)$ where $w_n$ is the $n^{\text{th}}$ word in the sequence
corpus is a collection of $M$ documents denoted by $D= \{\mathbf{w}_1, \mathbf{w}_2,... , \mathbf{w}_M\}$ .
topic is characterized by a distribution over words

Latent Dirichlet Allocation

Generative model of a corpus
Documents are represented as a random mixtures over latent topics

Notation

$\alpha$ parameter for Dirichlet distribution
$\theta$ latent random variable for some document ( $\theta_d$ for a specific document $d$ ) drawn from a $\text{Dir}(\alpha)$
$z$ is a k-dimensional one-hot-vector representing the chosen topic for a word i.e. with the i-th component equal to 1 and all other components being 0 ( $z^i = 1$ for a unique $i$ )
$z_n$ is the same as $z$ , but for the specific n-th word $w_n$ in a document, i.e. topic for this word.
$\beta$ is a $k \times V$ matrix parameterizing the word probabilities, i.e. $\beta_{ij} = p(w^j = 1 | z^i = 1)$

Model

Assume the following generative model for each document $\mathbf{w}$ in a corpus $D$ :

Choose $N \sim \text{Poisson} (\xi)$ , i.e. the length of the document (following this distribution is not a strict requirement)
Choose $\theta \sim \text{Dir} (\alpha)$ , used as a parameter for a multinomial later on
Each word in the document of length is assigned as follows:
1. Choose a topic $z_n \sim \text{Multinomial} ( \theta )$
2. Choose a word $w_n \sim p(w_n | z_n, \beta)$ , multinomial distribution conditioned on the topic $z_n$

Dirichlet

My interpretation

From a Dirichlet we draw a $k$ -vector which corresponds to a multinomial distribution, with each component corresponding to the probability of the corresponding topic / label / class. In our case, $p(z^i = 1 | \alpha)$ then corresponds to the i-th component of $\theta$ , i.e. the probability of topic $i$ .

From the paper

A k-dimensional Dirichlet random variable $\theta$ can take values in the (k−1)-simplex (a k-vector $\theta$ lies in the (k − 1)-simplex if $\theta_i \ge 0, \sum_{i=1}^k \theta_i = 1$ ), and has the following probability density on this simplex:

$\begin{equation*} p(\theta | \alpha) = \frac{\Gamma \Big( \sum_{i=1}^k \alpha_i \Big) }{\prod_{i=1}^k \Gamma (\alpha_i)} \prod_{i=1}^k \theta_i^{(1 - \alpha_i)} \end{equation*}$ 1

When we're talking about the (k - 1)-simplex we're referring to a region of space where the parameters of the mutlinomial can be (domain of all $\theta_i$ ). Why $k - 1$ ? Since we require that the sum over all $\theta_i$ equals 1, after choosing a value for $k-1$ of these, the last one is determined as $1 - \sum_{i=1}^{k-1} \theta_i$ .

The $k$ and $\theta$ being referred to in this section on the Dirichlet has nothing to do with notation used elswhere in this text, and simply represents some arbitrary integer $k$ and some random variable $\theta_i$ .

Joint distributions

First we have the joint distribution over the topic mixture $\theta$ , topics $\mathbf{z}$ and words $\mathbf{w}$ for one document:

$\begin{equation*} p(\theta, \mathbf{z}, \mathbf{w} | \alpha, \beta) = p(\theta | \alpha) \prod_{n=1}^N p(z_n | \theta) p(w_n | z_n, \beta) \end{equation*}$ 2

Where $p(z_n | \theta) = \theta_i$ where $z_n^i = 1$ since $z_n$ is a one-hot vector representing the n-th chosen topic.

topic mixture simply refers to a latent variable which the rv. topic depends on.

Then to obtain the probability of a document $\mathbf{w}$ , i.e. a set of words $w$ , we simply integrate over all topic mixtures $\theta$ and summing over all possible topic assignments $z_n$

$\begin{equation*} p(\mathbf{w} | \alpha, \beta) = \int p(\theta | \alpha) \prod_{n=1}^N \sum_{z_{n}} p(z_{n} | \theta) p(w_n | z_n, \beta) d\theta \end{equation*}$ 3

And finally, to obtain the distribution over our corpus, i.e. our set of documents, we simply take the product over all documents

$\begin{equation*} p(D | \alpha, \beta) = \prod_{d=1}^M \int p(\theta_d | \alpha) \prod_{n = 1}^N_d \sum_{z_d_n} p(z_d_n | \theta_d) p(w_d_n | z_d_n, \beta) d\theta_d \end{equation*}$ 4

Inference

Our inference problem is to compute the distributions of the latent variables $\mathbf{z}$ and $\theta$ given a document $\mathbf{w}$ .

$\begin{equation*} p(\theta, \mathbf{z}, \alpha, \beta, \mathbf{w}) = \frac{p(\theta, \mathbf{z}, \mathbf{w} | \alpha, \beta)}{p(\mathbf{w} | \alpha, \beta)} \end{equation*}$ 5

Both of which we deduced in the previous section! Unfortunately, this computation is intractable. Luckily there exists methods to deal with this: variational inference, MCMC and Laplace approximation.

Resources

Paper which most of the notes are based on

LDA: Latent Dirichlet Allocation

Table of Contents

Motivation

Notation

Latent Dirichlet Allocation

Notation

Model

Dirichlet

My interpretation

From the paper

Joint distributions

Inference

Resources