Kullback-Leibler Divergence

Definition

Let $p$ and $q$ be two probability distributions for a random variable $X$.

The Kullback-Leibler (KL) divergence

  • if $X$ is discrete, is given by

    $$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

  • if $X$ is continuous,

    $$D_{\mathrm{KL}}(p \,\|\, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$
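
As a quick sanity check of the discrete formula, here is a minimal NumPy sketch; the distributions $p$ and $q$ below are made up purely for illustration:

    import numpy as np

    def kl_divergence(p, q):
        """Discrete KL divergence D_KL(p || q) in nats.

        Assumes p and q are probability vectors over the same support
        and that q(x) > 0 wherever p(x) > 0.
        """
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    print(kl_divergence(p, q))  # ≈ 0.0253 nats
    print(kl_divergence(q, p))  # ≈ 0.0258 nats

Note that $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$ in general, so the KL-divergence is not a metric.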

Interpretation

Probability

The Kullback-Leibler divergence between two probability distributions $Q$ and $P$, denoted $D_{\mathrm{KL}}(P \,\|\, Q)$, is a measure of the information gained when one revises one's beliefs from the prior distribution $Q$ to the posterior distribution $P$. In other words, it is the amount of information lost when $Q$ is used to approximate $P$.

Most importantly, the KL-divergence can be written

$$D_{\mathrm{KL}}(p_{\theta^*} \,\|\, q_\theta) = \mathbb{E}_{p_{\theta^*}}\!\left[\log p_{\theta^*}(X)\right] - \mathbb{E}_{p_{\theta^*}}\!\left[\log q_\theta(X)\right]$$

where $\theta^*$ is the optimal parameter and $\theta$ is the one we vary to approximate $p_{\theta^*}$. The second term in the equation above is the only one which depends on the "unknown" parameter $\theta$ ($\theta^*$ is fixed, since this is the parameter we assume $p$ to take on). Now, suppose we have $n$ samples $x_1, \dots, x_n$ from $p$; then observe that the negative log-likelihood (averaged over the samples) for some parametrizable distribution $q_\theta$ is given by

$$-\frac{1}{n} \sum_{i=1}^{n} \log q_\theta(x_i)$$

By the Law of Large Numbers, we have

$$-\frac{1}{n} \sum_{i=1}^{n} \log q_\theta(x_i) \;\xrightarrow{\;n \to \infty\;}\; -\mathbb{E}_{p}\!\left[\log q_\theta(X)\right]$$

where $\mathbb{E}_{p}$ denotes the expectation over the probability density $p = p_{\theta^*}$. But this is exactly the second term in the KL-divergence! Hence, minimizing the KL-divergence between $p_{\theta^*}$ and $q_\theta$ is equivalent to minimizing the negative log-likelihood, or equivalently, maximizing the likelihood!
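
To make this concrete, here is a small sketch with a made-up setup: samples drawn from a unit-variance Gaussian $p_{\theta^*} = \mathcal{N}(\mu^*, 1)$, fitting the mean of $q_\theta = \mathcal{N}(\theta, 1)$. The $\theta$ minimizing the empirical negative log-likelihood essentially coincides with the $\theta$ minimizing the (closed-form) KL-divergence:

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up setup: true distribution p = N(mu_star, 1); we fit the mean
    # of a unit-variance Gaussian q_theta = N(theta, 1).
    mu_star = 2.0
    x = rng.normal(mu_star, 1.0, size=10_000)  # samples from p

    thetas = np.linspace(0.0, 4.0, 401)

    # Averaged negative log-likelihood of q_theta over the samples.
    nll = np.array([0.5 * np.mean((x - t) ** 2) + 0.5 * np.log(2 * np.pi)
                    for t in thetas])

    # Closed-form KL between unit-variance Gaussians:
    # D_KL(N(mu_star, 1) || N(theta, 1)) = (theta - mu_star)^2 / 2.
    kl = 0.5 * (thetas - mu_star) ** 2

    print(thetas[np.argmin(nll)])  # ≈ 2.0 (the sample mean, up to grid resolution)
    print(thetas[np.argmin(kl)])   # 2.0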

Coding

From Wikipedia:

The Kraft-McMillan theorem establishes that any decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities $X$ can be seen as representing an implicit probability distribution $q(x_i) = 2^{-\ell_i}$ over $X$, where $\ell_i$ is the length of the code for $x_i$ in bits.

Therefore, we can interpret the Kullback-Leibler divergence as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $Q$ is used, compared to using a code based on the true distribution $P$.

Let's break this down a bit:

  • The Kraft-McMillan theorem basically says that if we take $2^{-\ell_i}$ for the length $\ell_i$ of each valid "codeword", the resulting set of values looks like a probability mass function.
  • If we were to create a coding scheme for the values in $X$ (our set of symbols) using binary representation (bits) based on our "suggested" probability distribution $Q$, the KL-divergence gives us the expected extra message-length per datum compared to using a code based on the true distribution $P$, as illustrated in the sketch below.
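
A minimal sketch of this coding view, using idealized code lengths $-\log_2 q(x)$ (ignoring rounding to whole bits) and made-up distributions $P$ and $Q$; the extra expected cost per symbol comes out to exactly $D_{\mathrm{KL}}(P \,\|\, Q)$ measured in bits:

    import numpy as np

    # Made-up example: true distribution P and a "wrong" coding distribution Q
    # over four symbols.
    P = np.array([0.5, 0.25, 0.125, 0.125])
    Q = np.array([0.25, 0.25, 0.25, 0.25])

    # Idealized code lengths in bits: -log2 of the probability a code assigns
    # to each symbol.
    len_P = -np.log2(P)  # code optimal for P
    len_Q = -np.log2(Q)  # code optimal for Q, but used on data drawn from P

    expected_len_P = np.sum(P * len_P)  # = H(P), the entropy of P: 1.75 bits
    expected_len_Q = np.sum(P * len_Q)  # cross-entropy H(P, Q): 2.0 bits

    kl_bits = np.sum(P * np.log2(P / Q))

    print(expected_len_Q - expected_len_P)  # 0.25 bits of overhead per symbol
    print(kl_bits)                          # 0.25 -- the same number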