Kullback-Leibler Divergence
Definition
Let $p$ and $q$ be two probability distributions for a random variable $X$. The Kullback-Leibler (KL) divergence, if $X$ is discrete, is given by

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)},$$

and, if $X$ is continuous,

$$D_{\mathrm{KL}}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx.$$
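As a quick sketch (not part of the original definition; the distributions below are made-up examples), the discrete form can be computed directly in Python:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) returns D_KL(p || q) in nats

# Two made-up probability mass functions over the same three outcomes
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Direct computation of sum_x p(x) * log(p(x) / q(x))
kl_manual = np.sum(p * np.log(p / q))

# Same quantity via scipy
kl_scipy = entropy(p, q)

print(kl_manual, kl_scipy)  # both ≈ 0.025 nats
```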
Interpretation
Probability
The Kullback-Leibler divergence from some probability distribution $q$ to a probability distribution $p$, denoted $D_{\mathrm{KL}}(p \,\|\, q)$, is a measure of the information gained when one revises one's beliefs from the prior distribution $q$ to the posterior distribution $p$. In other words, it is the amount of information lost when $q$ is used to approximate $p$.
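To make the "information gained" reading concrete, here is a toy example of my own (the coin hypotheses and numbers are not from the text): a single Bayesian update from a prior to a posterior, with the KL divergence measuring how much the observation moved our beliefs.

```python
import numpy as np

# Toy example: two hypotheses about a coin, H1 = fair (P(heads) = 0.5), H2 = biased (P(heads) = 0.9)
prior = np.array([0.5, 0.5])              # q: beliefs before seeing data
likelihood_heads = np.array([0.5, 0.9])   # P(heads | hypothesis)

# Observe a single "heads" and update with Bayes' rule
posterior = prior * likelihood_heads
posterior /= posterior.sum()              # p: beliefs after seeing data

# D_KL(posterior || prior): information gained by revising prior -> posterior
info_gained = np.sum(posterior * np.log2(posterior / prior))
print(posterior, info_gained)             # ≈ [0.357, 0.643], ≈ 0.06 bits
```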
Most importantly, the KL-divergence can be written

$$D_{\mathrm{KL}}(p_{\theta^*} \,\|\, p_{\theta}) = \mathbb{E}_{p_{\theta^*}}\big[\log p_{\theta^*}(x)\big] - \mathbb{E}_{p_{\theta^*}}\big[\log p_{\theta}(x)\big],$$

where $\theta^*$ is the optimal parameter and $\theta$ is the one we vary to approximate $p_{\theta^*}$. The second term in the equation above is the only one which depends on the "unknown" parameter $\theta$ ($\theta^*$ is fixed, since this is the parameter we assume $p$ to take on). Now, suppose we have $n$ samples $x_1, \dots, x_n$ from $p_{\theta^*}$; then observe that the (average) negative log-likelihood for some parametrizable distribution $p_{\theta}$ is given by

$$\mathrm{NLL}(\theta) = - \frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i).$$

By the Law of Large Numbers, we have

$$- \frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i) \;\xrightarrow{\; n \to \infty \;}\; - \mathbb{E}_{p_{\theta^*}}\big[\log p_{\theta}(x)\big],$$

where $\mathbb{E}_{p_{\theta^*}}$ denotes the expectation over the probability density $p_{\theta^*}$. But this is exactly the second term in the KL-divergence! Hence, minimizing the KL-divergence between $p_{\theta^*}$ and $p_{\theta}$ is equivalent to minimizing the negative log-likelihood, or equivalently, maximizing the likelihood!
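A minimal numerical sketch of this equivalence (my own illustration, assuming a unit-variance Gaussian family so that the KL divergence has a simple closed form): the parameter that minimizes the average negative log-likelihood on samples from $p_{\theta^*}$ is also the one that minimizes the KL divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

theta_star = 2.0                                   # true mean of p_{theta*} = N(theta*, 1)
x = rng.normal(theta_star, 1.0, size=10_000)       # n samples from the true distribution

thetas = np.linspace(0.0, 4.0, 401)

# Average negative log-likelihood of N(theta, 1): 0.5*log(2*pi) + 0.5*mean((x - theta)^2)
nll = np.array([0.5 * np.log(2 * np.pi) + 0.5 * np.mean((x - t) ** 2) for t in thetas])

# Closed form for unit-variance Gaussians: D_KL(N(theta*, 1) || N(theta, 1)) = (theta* - theta)^2 / 2
kl = (theta_star - thetas) ** 2 / 2

print("argmin NLL:", thetas[np.argmin(nll)])       # ≈ 2.0 (the sample mean)
print("argmin KL :", thetas[np.argmin(kl)])        # exactly 2.0
```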
Coding
The Kraft-McMillan theorem establishes that any decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities $X$ can be seen as representing an implicit probability distribution $q(x_i) = 2^{-\ell_i}$ over $X$, where $\ell_i$ is the length of the code for $x_i$ in bits.
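As a small illustration of this statement (the prefix code below is a made-up example), the implicit distribution assigns each symbol the probability $2^{-\ell_i}$:

```python
# A made-up prefix-free binary code over four symbols
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Kraft-McMillan: the implicit probability of each symbol is 2^(-code length)
implicit_q = {sym: 2 ** -len(bits) for sym, bits in code.items()}

print(implicit_q)                # {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
print(sum(implicit_q.values()))  # 1.0, i.e. a valid probability mass function
```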
Therefore, we can interpret the Kullback-Leibler divergence as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $q$ is used, compared to using a code based on the true distribution $p$.
Let's break this down a bit:
- The Kraft-McMillan theorem basically says that when taking $2^{-\ell_i}$ for the length $\ell_i$ of each valid "codeword", the resulting set (of exponentials) looks like a probability mass function.
- If we were to create a coding scheme for the values in $X$ (our set of symbols) using a binary representation (bits) based on our "suggested" probability distribution $q$, the KL-divergence gives us the expected extra message-length per datum compared to using a code based on the true distribution $p$ (see the sketch after this list).
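Here is a sketch tying the two points together (the distributions are made up, and I assume idealized code lengths of $-\log_2 q(x)$ bits, ignoring rounding to whole bits): the expected extra message-length per symbol equals the KL divergence.

```python
import numpy as np

# True distribution p and the "suggested" (wrong) distribution q over four symbols
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Idealized code lengths in bits: -log2 of the distribution the code is designed for
len_q = -np.log2(q)                    # code optimized for q
len_p = -np.log2(p)                    # code optimized for the true p

# Expected message length per symbol when symbols are actually drawn from p
expected_with_q = np.sum(p * len_q)    # cross-entropy H(p, q) = 2.0 bits
expected_with_p = np.sum(p * len_p)    # entropy H(p) = 1.75 bits

extra_bits = expected_with_q - expected_with_p
kl = np.sum(p * np.log2(p / q))

print(extra_bits, kl)                  # both 0.25 bits per symbol
```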