Maximum entropy models

Table of Contents

Notation

  • $\int_{x}$ for a probability distribution denotes the integral over all inputs if continuous, and the sum if discrete

Overview

  • MaxEnt models are models that maximize the information (Shannon) entropy subject to constraints on statistics (e.g. the first moment) of the "data" or underlying generating distribution
  • The optimization problem is then

    $$\max_{p} \; - \int p(x) \log p(x) \, \mathrm{d}x \quad \text{s.t.} \quad \int p(x) \, \mathrm{d}x = 1, \quad \int p(x) f_k(x) \, \mathrm{d}x = F_k, \quad k = 1, \dots, K$$

    where $f_k$ are arbitrary functions on the input space, and $F_k$ are the values which the expectations of $f_k$ should take

  • Using Lagrange multipliers, we can write this as the minimization problem

    $$J[p] = \int p(x) \log p(x) \, \mathrm{d}x + \lambda_0 \left( \int p(x) \, \mathrm{d}x - 1 \right) - \sum_{k=1}^{K} \lambda_k \left( \int p(x) f_k(x) \, \mathrm{d}x - F_k \right)$$

    where we've chosen these arbitrary $F_k$ to be the observed expectations of $f_k$.

  • Solving for $p(x)$ by taking the functional derivative and setting to zero:

    $$\frac{\delta J}{\delta p(x)} = \log p(x) + 1 + \lambda_0 - \sum_{k=1}^{K} \lambda_k f_k(x) = 0$$

The general form of the maximum entropy distribution is then given by

$$p(x) = \frac{1}{Z} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(x) \right)$$

where

$$Z = \int \exp\left( \sum_{k=1}^{K} \lambda_k f_k(x) \right) \mathrm{d}x$$

Furthermore, the values of the Lagrange multipliers $\lambda_k$ are chosen to match the expectations of the functions $f_k$, hence

$$\int p(x) f_k(x) \, \mathrm{d}x = F_k, \quad k = 1, \dots, K$$

i.e. the parameters of the distribution can be chosen s.t.

$$\frac{\partial \log Z}{\partial \lambda_k} = F_k$$

The same holds for discrete random variables, with sums in place of integrals.

Still gotta estimate those moments yo!
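As a small numerical sketch of the moment-matching condition above: for a discrete distribution $p(k) \propto \exp(\lambda k)$ over the faces of a die, we can solve for the single multiplier $\lambda$ so that the mean matches a target value. This is the classic die-with-mean-4.5 setup; the function name `maxent_die` and the bisection approach are my own illustrative choices, not from the source.

```python
import math

def maxent_die(target_mean, faces=range(1, 7), tol=1e-12):
    """Fit the MaxEnt distribution p(k) ∝ exp(λk) over `faces`
    such that E[k] = target_mean (one moment constraint)."""
    def mean(lam):
        w = [math.exp(lam * k) for k in faces]
        z = sum(w)  # partition function Z
        return sum(k * wk for k, wk in zip(faces, w)) / z

    # E[k] = ∂log Z/∂λ is monotonically increasing in λ, so bisection works.
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(lam * k) for k in faces]
    z = sum(w)
    return lam, [wk / z for wk in w]

# λ > 0 because the target mean 4.5 exceeds the uniform mean 3.5.
lam, p = maxent_die(4.5)
```

Bisection suffices here because there is only one multiplier; with several constraints one would instead minimize $\log Z - \sum_k \lambda_k F_k$ in $\lambda$.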

Principle of Maximum Entropy

Shannon entropy

Say we want to define a function $h$ describing the information, in terms of an event $x$ with probability $p(x)$; that is, how much information is acquired if we observe event $x$?

One would expect such a function to have the following properties:

  1. $h(x)$ is monotonically decreasing in $p(x)$, i.e. if $p(x)$ is large, then we gain little "information" by observing event $x$.
  2. $h(x) \ge 0$, i.e. "information" is a non-negative quantity.
  3. $h(x) = 0$ if $p(x) = 1$, i.e. events occurring with probability 1 provide no information.
  4. $h(x, y) = h(x) + h(y)$ for two independent events $x$ and $y$, i.e. the "joint information" of two independent events is the information of each added together.

It turns out that $h(x) = - \log p(x)$ satisfies exactly these properties (and, up to the base of the logarithm, it is the unique such function):

The average amount of information received, averaged over events $x$, is then the Shannon entropy

$$H = \mathbb{E}[h] = - \sum_{x} p(x) \log p(x)$$
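The four properties are easy to check numerically. A quick sanity-check sketch (illustrative only; the helper `h` is just the formula above):

```python
import math

def h(p):
    """Information content of an event with probability p."""
    return -math.log(p)

# Two independent events and their joint probability.
p_x, p_y = 0.5, 0.25
p_joint = p_x * p_y
```

Property 1 says `h(0.1) > h(0.9)`, property 3 says `h(1.0) == 0`, and property 4 says `h(p_joint)` equals `h(p_x) + h(p_y)` up to floating-point error.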

Justification

  • Consider a discrete probability distribution among $m$ mutually exclusive propositions
  • Most informative distribution would occur when exactly one of the propositions was known to be true, in which case the information entropy would be zero (no "uncertainty").
  • Least informative distribution would occur when all propositions are equally likely (uniform), in which case the information entropy would be equal to its maximum possible value, $\log m$.
  • One can therefore view information entropy as a numerical measure of how uninformative a particular distribution is, ranging from zero to $\log m$.
  • By choosing a distribution with maximum entropy, we are in a way choosing the most uninformative distribution possible; to choose a distribution with lower entropy would be to assume information which is unknown.
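To make the range concrete, a small illustrative check (my own example, with $m = 4$ propositions): a degenerate distribution has entropy 0, the uniform one has entropy $\log m$, and everything else lies in between.

```python
import math

def entropy(ps):
    """Shannon entropy -sum p log p; terms with p = 0 contribute 0."""
    return -sum(p * math.log(p) for p in ps if p > 0)

certain = [1.0, 0.0, 0.0, 0.0]   # one proposition known true
uniform = [0.25] * 4             # all equally likely
skewed  = [0.7, 0.1, 0.1, 0.1]   # somewhere in between
```

Here `entropy(certain)` is 0, `entropy(uniform)` is `log 4`, and `entropy(skewed)` falls strictly between the two.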

Wallis derivation - combinatorial argument

  • https://en.wikipedia.org/wiki/Principle_of_maximum_entropy?oldformat=true#The_Wallis_derivation
  • Makes no reference to entropy as a measure of "uncertainty"
  • Want to make probability assignment to maximum_entropy_models_fe52c58a96549528260fe0d7d04caf390dd47865.png mutually exclusive propositions
  • Imagine distributing $N$ quanta of probability (each of size $1/N$) at random among the $m$ propositions; let $n_i$ be the number of times the i-th proposition "occurs", so that $p_i = n_i / N$ is the probability assigned to it
  • Probability of any particular result is given by the multinomial distribution:

    $$\Pr(p_1, \dots, p_m) = \frac{N!}{n_1! \, n_2! \cdots n_m!} \, m^{-N}$$

  • The most probable result is the one which maximizes the multiplicity $W = N! / (n_1! \cdots n_m!)$; equivalently, we can maximize any monotonically increasing function of $W$, such as

    $$\frac{1}{N} \log W$$

  • Assuming $N \to \infty$ and applying Stirling's approximation, we have

    $$\frac{1}{N} \log W \to - \sum_{i=1}^{m} \frac{n_i}{N} \log \frac{n_i}{N} = - \sum_{i=1}^{m} p_i \log p_i$$
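The convergence above can be watched numerically. A sketch (my own example, using `math.lgamma` for exact log-factorials): for $p = (1/2, 1/4, 1/4)$, the quantity $\frac{1}{N} \log W$ approaches $H(p)$ as $N$ grows.

```python
import math

def log_multiplicity(ns):
    """log W = log( N! / (n_1! ... n_m!) ) via lgamma(n+1) = log n!."""
    n_total = sum(ns)
    return math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in ns)

p = [0.5, 0.25, 0.25]
H = -sum(pi * math.log(pi) for pi in p)  # target entropy

approx = {}
for N in (100, 10_000, 1_000_000):
    ns = [round(pi * N) for pi in p]     # n_i = p_i * N (exact here)
    approx[N] = log_multiplicity(ns) / N
```

The Stirling correction terms shrink like $\log N / N$, so the approximation at $N = 10^6$ is far tighter than at $N = 100$.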

Issues

  • The information about the probability distribution (the constrained expectations) is assumed to exist a priori
  • Information provided by measurements is only an estimate of the "true" information; hence we require methods for entropy estimation.

Resources

Bibliography

  • [paninski_2003] L. Paninski, "Estimation of Entropy and Mutual Information", Neural Computation, 15(6), 1191–1253 (2003).