# Kullback-Leibler Divergence

## Definition

Let $p$ and $q$ be two probability distributions for a random variable $X$.

The **Kullback-Leibler (KL) divergence** from $q$ to $p$, if $X$ is *discrete*, is given by

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

and, if $X$ is *continuous*, by

$$D_{\mathrm{KL}}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$
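As a minimal sketch of the discrete case, the sum above can be computed directly with numpy (the distributions `p` and `q` below are illustrative; we assume $q(x) > 0$ wherever $p(x) > 0$, and use the convention that terms with $p(x) = 0$ contribute zero):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in nats.

    p, q: arrays of probabilities over the same support.
    Assumes q(x) > 0 wherever p(x) > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, p))  # 0.0: a distribution has zero divergence from itself
print(kl_divergence(p, q))
```

Note that $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$ in general: the KL divergence is not symmetric, so it is not a metric.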

## Interpretation

### Probability

The Kullback-Leibler divergence between two probability distributions $p$ and $q$, denoted $D_{\mathrm{KL}}(p \,\|\, q)$,
is a measure of the information gained when one revises one's beliefs from the *prior* distribution $q$
to the *posterior* distribution $p$. In other words, it is the amount of information lost when $q$
is used to approximate $p$.

Most importantly, the KL divergence can be written

$$D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_{\theta}) = \mathbb{E}_{p_{\theta^*}} \big[ \log p_{\theta^*}(x) \big] - \mathbb{E}_{p_{\theta^*}} \big[ \log p_{\theta}(x) \big]$$

where $\theta^*$ is the optimal parameter and $\theta$ is the one we vary to approximate $p_{\theta^*}$. The second term in the equation above is the only one which depends on the "unknown" parameter $\theta$ ($\theta^*$ is fixed, since this is the parameter we assume the true distribution to take on). Now, suppose we have samples $x_1, \dots, x_n$ from $p_{\theta^*}$; then observe that the (average) negative log-likelihood for some parametrizable distribution $p_{\theta}$ is given by

$$-\frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i)$$

By the Law of Large Numbers, we have

$$-\frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i) \xrightarrow{\; n \to \infty \;} -\mathbb{E}_{p_{\theta^*}} \big[ \log p_{\theta}(x) \big]$$

where $\mathbb{E}_{p_{\theta^*}}$ denotes the expectation over the probability density $p_{\theta^*}$. But this is exactly the second term in the KL divergence! Hence, **minimizing the KL divergence between $p_{\theta^*}$ and $p_{\theta}$ is equivalent to minimizing the negative log-likelihood, or, equivalently, maximizing the likelihood!**
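This equivalence can be sketched numerically. In the toy example below (the Gaussian family, seed, grid, and true mean $\theta^* = 2$ are all illustrative choices, not from the text), minimizing the average negative log-likelihood of $\mathcal{N}(\mu, 1)$ over samples drawn from $\mathcal{N}(2, 1)$ recovers approximately $\mu = 2$, the parameter that also minimizes the KL divergence:

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples from the "true" distribution p_{theta*}: here N(2, 1)
samples = rng.normal(loc=2.0, scale=1.0, size=10_000)

def avg_nll(mu, x):
    """Average negative log-likelihood of N(mu, 1) over samples x."""
    return float(np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (x - mu) ** 2))

# Minimize the NLL over a grid of candidate means...
grid = np.linspace(0.0, 4.0, 401)
best_mu = grid[np.argmin([avg_nll(m, samples) for m in grid])]

# ...which recovers (approximately) theta* = 2, the minimizer of
# D_KL(p_{theta*} || p_theta).
print(best_mu)
```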

### Coding

The Kraft-McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities $X$ can be seen as representing an implicit probability distribution $q(x_i) = 2^{-\ell_i}$ over $X$, where $\ell_i$ is the length of the code for $x_i$ in bits.

Therefore, we can interpret the Kullback-Leibler divergence $D_{\mathrm{KL}}(p \,\|\, q)$ as the expected extra message length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $q$ is used, compared to using a code based on the true distribution $p$.

Let's break this down a bit:

- The Kraft-McMillan theorem basically says that if we take $2^{-\ell_i}$ for the length $\ell_i$ of each valid "codeword", the resulting set (of exponentials) looks like a probability mass function.
- If we were to create a coding scheme for the values in $X$ (our set of symbols) using a binary representation (bits) based on our "suggested" probability distribution $q$, the KL divergence gives us the expected extra message length per datum compared to using a code based on the true distribution $p$.
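The points above can be checked numerically. In this sketch (the two distributions are illustrative; $p$ is chosen dyadic so its idealized code lengths $-\log_2 p(x)$ are whole bits), the expected extra message length from using the code implied by $q$ equals the KL divergence in bits:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # true distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # assumed (wrong) distribution

# Idealized code lengths in bits implied by each distribution (Kraft-McMillan)
len_p = -np.log2(p)
len_q = -np.log2(q)

# Expected message length per datum under the true distribution p
expected_p = float(np.sum(p * len_p))  # = H(p), the entropy of p
expected_q = float(np.sum(p * len_q))  # cost of using the code built for q

extra_bits = expected_q - expected_p
kl_bits = float(np.sum(p * np.log2(p / q)))
print(extra_bits, kl_bits)  # the two coincide
```

Using base-2 logarithms gives the divergence in bits; the natural logarithm used earlier gives it in nats.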