LDA: Latent Dirichlet Allocation

Motivation

Bag-of-words models generally perform quite well, and we're looking for a probabilistic model of text. Let's look at the underlying assumptions of a bag-of-words model:

  • Ordering of words within a document is irrelevant
  • Ordering of documents in entire corpus is irrelevant

In probability theory we call this exchangeability. Do you remember anything about this? The Dirichlet process, maybe?!

Notation

  • word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1, \dots, V\}$. Words are represented using unit-basis vectors that have a single component equal to 1 and all other components equal to 0. Thus, using superscripts to denote components, the $v$-th word in the vocabulary is represented by a $V$-vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u \neq v$.
  • document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \dots, w_N)$, where $w_n$ is the $n$-th word in the sequence
  • corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M\}$.
  • topic is characterized by a distribution over words
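As a concrete sketch of this notation (the toy vocabulary below is an assumption for illustration, not from the paper), a word's unit-basis representation can be built like so:

```python
import numpy as np

# Toy vocabulary of V = 4 words (illustrative assumption).
vocab = ["cat", "dog", "runs", "sleeps"]
V = len(vocab)

def one_hot(v, V):
    """Return the V-vector for the v-th vocabulary word (0-indexed):
    a single component equal to 1, all others 0."""
    w = np.zeros(V)
    w[v] = 1.0
    return w

# A document is then a sequence of N such vectors.
doc = [one_hot(vocab.index(tok), V) for tok in ["cat", "runs"]]
```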

Latent Dirichlet Allocation

  • Generative model of a corpus
  • Documents are represented as a random mixtures over latent topics

Notation

  • $\alpha$ is the parameter for the Dirichlet distribution
  • $\theta$ is a latent random variable for some document ($\theta_d$ for a specific document $d$) drawn from $\mathrm{Dir}(\alpha)$
  • $z$ is a $k$-dimensional one-hot vector representing the chosen topic for a word, i.e. with the $i$-th component equal to 1 and all other components being 0 ($z^i = 1$ for a unique $i$)
  • $z_n$ is the same as $z$, but for the specific $n$-th word $w_n$ in a document, i.e. the topic for this word.
  • $\beta$ is a $k \times V$ matrix parameterizing the word probabilities, i.e. $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$

Model

Assume the following generative model for each document $\mathbf{w}$ in a corpus $D$:

  1. Choose $N \sim \mathrm{Poisson}(\xi)$, i.e. the length of the document (following this distribution is not a strict requirement)
  2. Choose $\theta \sim \mathrm{Dir}(\alpha)$, used as a parameter for a multinomial later on
  3. Each word $w_n$ in the document of length $N$ is assigned as follows:
    1. Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$
    2. Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial distribution conditioned on the topic $z_n$
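The generative process above can be sketched in NumPy. The values of $\xi$, $\alpha$ and $\beta$ below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

k, V = 3, 5                               # number of topics, vocabulary size (assumed)
alpha = np.ones(k)                        # symmetric Dirichlet prior (assumed)
beta = rng.dirichlet(np.ones(V), size=k)  # k x V word probabilities; each row sums to 1
xi = 10                                   # Poisson rate for document length (assumed)

def generate_document():
    N = max(1, rng.poisson(xi))       # 1. choose document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)      # 2. choose theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)    # 3a. choose topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])  # 3b. choose word w_n from p(w_n | z_n, beta)
        words.append(w)
    return words

doc = generate_document()
```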

[Figure: plate diagram of the generative model (generative_model.png)]

Dirichlet

My interpretation

From a Dirichlet we draw a $k$-vector which corresponds to a multinomial distribution, with each component corresponding to the probability of the corresponding topic / label / class. In our case, $\theta_i$ then corresponds to the $i$-th component of $\theta$, i.e. the probability of topic $i$.

From the paper

A $k$-dimensional Dirichlet random variable $\theta$ can take values in the $(k-1)$-simplex (a $k$-vector $\theta$ lies in the $(k-1)$-simplex if $\theta_i \ge 0$ and $\sum_{i=1}^{k} \theta_i = 1$), and has the following probability density on this simplex:

$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$

When we're talking about the $(k-1)$-simplex we're referring to the region of space where the parameters of the multinomial can lie (the domain of all $\theta_i$). Why $k - 1$? Since we require that the sum over all $\theta_i$ equals 1, after choosing a value for $k - 1$ of these, the last one is determined as $\theta_k = 1 - \sum_{i=1}^{k-1} \theta_i$.
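A quick numerical check of the simplex constraint, using NumPy's Dirichlet sampler (the concentration parameters are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
alpha = np.array([0.5, 1.0, 2.0, 4.0])  # illustrative concentration parameters

theta = rng.dirichlet(alpha)

# theta lies in the (k-1)-simplex: non-negative components summing to 1,
# so only k - 1 components are free.
assert np.all(theta >= 0)
assert np.isclose(theta.sum(), 1.0)

# The last component is fully determined by the first k - 1:
theta_k = 1.0 - theta[:-1].sum()
```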

The $k$ and $\theta$ being referred to in this section on the Dirichlet have nothing to do with the notation used elsewhere in this text; they simply represent some arbitrary integer $k$ and some random variable $\theta_i$.

Joint distributions

First we have the joint distribution over the topic mixture $\theta$, topics $\mathbf{z}$ and words $\mathbf{w}$ for one document:

$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)$

Here $p(z_n \mid \theta) = \theta_i$ for the unique $i$ such that $z_n^i = 1$, since $z_n$ is a one-hot vector representing the $n$-th chosen topic.

Topic mixture simply refers to the latent variable on which the r.v. topic depends.

Then to obtain the probability of a document $\mathbf{w}$, i.e. a sequence of words $w_n$, we simply integrate over all topic mixtures $\theta$ and sum over all possible topic assignments $z_n$:

$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta$
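This integral has no closed form, but a crude Monte Carlo sketch makes its structure concrete: sample $\theta$ from the prior and average the product-of-sums inside the parentheses. The $\alpha$, $\beta$ and word ids below are toy values, assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 3, 5
alpha = np.ones(k)
beta = rng.dirichlet(np.ones(V), size=k)  # k x V topic-word probabilities

def doc_prob(word_ids, n_samples=20000):
    """Monte Carlo estimate of p(w | alpha, beta).

    For each sampled theta, the sum over z_n is done analytically:
    sum_i theta_i * beta[i, w_n] = (theta @ beta)[w_n].
    """
    thetas = rng.dirichlet(alpha, size=n_samples)  # samples from p(theta | alpha)
    word_probs = thetas @ beta                     # shape (n_samples, V)
    per_sample = word_probs[:, word_ids].prod(axis=1)  # product over words n
    return per_sample.mean()                       # average approximates the integral

p = doc_prob([0, 2, 2, 4])
```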

And finally, to obtain the probability of our corpus, i.e. our set of documents, we simply take the product over all documents:

$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$

Inference

Our inference problem is to compute the posterior distribution of the latent variables $\mathbf{z}$ and $\theta$ given a document $\mathbf{w}$:

$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$

Both the numerator and the denominator are expressions we derived in the previous section! Unfortunately, this computation is intractable. Luckily there exist methods to deal with this: variational inference, MCMC, and Laplace approximation.
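To make one of these methods concrete, here is a minimal collapsed Gibbs sampler (an MCMC approach) for LDA. This is a sketch, not the variational method from the paper, and the hyperparameters and toy documents are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda(docs, k, V, alpha=0.1, eta=0.01, iters=50):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of documents, each a list of word ids in [0, V)."""
    z = [rng.integers(k, size=len(d)) for d in docs]  # random initial topic assignments
    n_dk = np.zeros((len(docs), k))                   # doc-topic counts
    n_kw = np.zeros((k, V))                           # topic-word counts
    n_k = np.zeros(k)                                 # total words per topic
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]                           # remove current assignment
                n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # p(z = t | rest): (doc-topic count + alpha) * smoothed topic-word prob
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                t = rng.choice(k, p=p / p.sum())      # resample topic
                z[d][n] = t
                n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    return n_dk, n_kw

# Two tiny "documents" over a vocabulary of 4 words (toy data).
docs = [[0, 0, 1, 0], [2, 3, 3, 2]]
n_dk, n_kw = gibbs_lda(docs, k=2, V=4)
```

Normalizing the rows of `n_dk` and `n_kw` (with the smoothing terms) gives point estimates of $\theta_d$ and $\beta$.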

Resources