# Kullback-Leibler Divergence

## Definition

Let $p$ and $q$ be two probability distributions for a random variable $X$.

The **Kullback-Leibler (KL) divergence** from $q$ to $p$, if $X$ is *discrete*, is given by

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

and, if $X$ is *continuous*, by

$$D_{\mathrm{KL}}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$
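As a minimal sketch of the discrete case, the sum above can be computed directly with numpy (the distributions `p` and `q` below are illustrative; we assume $q(x) > 0$ wherever $p(x) > 0$, and use the convention that terms with $p(x) = 0$ contribute zero):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in nats.

    p, q: arrays of probabilities over the same support.
    Assumes q(x) > 0 wherever p(x) > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, p))  # 0.0: a distribution has zero divergence from itself
print(kl_divergence(p, q))
```

Note that $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$ in general: the KL divergence is not symmetric, so it is not a metric.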

## Interpretation

### Probability

The Kullback-Leibler divergence between two probability distributions $p$ and $q$, denoted $D_{\mathrm{KL}}(p \,\|\, q)$,
is a measure of the information gained when one revises one's beliefs from the *prior* distribution $q$
to the *posterior* distribution $p$. In other words, it is the amount of information lost when $q$
is used to approximate $p$.

Most importantly, the KL divergence can be written

$$D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_{\theta}) = \mathbb{E}_{p_{\theta^*}} \big[ \log p_{\theta^*}(x) \big] - \mathbb{E}_{p_{\theta^*}} \big[ \log p_{\theta}(x) \big]$$

where $\theta^*$ is the optimal parameter and $\theta$ is the one we vary to approximate $p_{\theta^*}$. The second term in the equation above is the only one which depends on the "unknown" parameter $\theta$ ($\theta^*$ is fixed, since this is the parameter we assume the true distribution to take on). Now, suppose we have samples $x_1, \dots, x_n$ from $p_{\theta^*}$; then observe that the (average) negative log-likelihood for some parametrizable distribution $p_{\theta}$ is given by

$$-\frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i)$$

By the Law of Large Numbers, we have

$$-\frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i) \xrightarrow{\; n \to \infty \;} -\mathbb{E}_{p_{\theta^*}} \big[ \log p_{\theta}(x) \big]$$

where $\mathbb{E}_{p_{\theta^*}}$ denotes the expectation over the probability density $p_{\theta^*}$. But this is exactly the second term in the KL divergence! Hence, **minimizing the KL divergence between $p_{\theta^*}$ and $p_{\theta}$ is equivalent to minimizing the negative log-likelihood, or, equivalently, maximizing the likelihood!**
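This equivalence can be sketched numerically. In the toy example below (the Gaussian family, seed, grid, and true mean $\theta^* = 2$ are all illustrative choices, not from the text), minimizing the average negative log-likelihood of $\mathcal{N}(\mu, 1)$ over samples drawn from $\mathcal{N}(2, 1)$ recovers approximately $\mu = 2$, the parameter that also minimizes the KL divergence:

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples from the "true" distribution p_{theta*}: here N(2, 1)
samples = rng.normal(loc=2.0, scale=1.0, size=10_000)

def avg_nll(mu, x):
    """Average negative log-likelihood of N(mu, 1) over samples x."""
    return float(np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (x - mu) ** 2))

# Minimize the NLL over a grid of candidate means...
grid = np.linspace(0.0, 4.0, 401)
best_mu = grid[np.argmin([avg_nll(m, samples) for m in grid])]

# ...which recovers (approximately) theta* = 2, the minimizer of
# D_KL(p_{theta*} || p_theta).
print(best_mu)
```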

### Coding

The Kraft-McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities $X$ can be seen as representing an implicit probability distribution $q(x_i) = 2^{-\ell_i}$ over $X$, where $\ell_i$ is the length of the code for $x_i$ in bits.

Therefore, we can interpret the Kullback-Leibler divergence $D_{\mathrm{KL}}(p \,\|\, q)$ as the expected extra message length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $q$ is used, compared to using a code based on the true distribution $p$.

Let's break this down a bit:

- The Kraft-McMillan theorem basically says that if we take $2^{-\ell_i}$ for the length $\ell_i$ of each valid "codeword", the resulting set (of exponentials) looks like a probability mass function.
- If we were to create a coding scheme for the values in $X$ (our set of symbols) using a binary representation (bits) based on our "suggested" probability distribution $q$, the KL divergence gives us the expected extra message length per datum compared to using a code based on the true distribution $p$.
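The points above can be checked numerically. In this sketch (the two distributions are illustrative; $p$ is chosen dyadic so its idealized code lengths $-\log_2 p(x)$ are whole bits), the expected extra message length from using the code implied by $q$ equals the KL divergence in bits:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # true distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # assumed (wrong) distribution

# Idealized code lengths in bits implied by each distribution (Kraft-McMillan)
len_p = -np.log2(p)
len_q = -np.log2(q)

# Expected message length per datum under the true distribution p
expected_p = float(np.sum(p * len_p))  # = H(p), the entropy of p
expected_q = float(np.sum(p * len_q))  # cost of using the code built for q

extra_bits = expected_q - expected_p
kl_bits = float(np.sum(p * np.log2(p / q)))
print(extra_bits, kl_bits)  # the two coincide
```

Using base-2 logarithms gives the divergence in bits; the natural logarithm used earlier gives it in nats.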