Expectation propagation

Table of Contents

Notation

  • expectation_propagation_e10034d7090b31559f9048908da8f9d1c3b86ef6.png denotes the Kullback-Leibler divergence, to be minimized wrt. expectation_propagation_6ae104c4780deb7c11d31e7d5bda0a4cef326b64.png
  • expectation_propagation_6ae104c4780deb7c11d31e7d5bda0a4cef326b64.png is the approximate distribution, which is a member of the exponential family, i.e. can be written:

    expectation_propagation_5922f11fb0d8f47e1b146da6353124d048e15baa.png

  • expectation_propagation_748b9cdd03935265e822ea224b8b9e596a599c4d.png is a fixed distribution
  • expectation_propagation_336cc4816ec0737847752cbcbdf529fc618cae33.png denotes the data
  • expectation_propagation_9c2669845b34c10ad034612182786ca7b99fe0e4.png denotes the parameters / hidden variables

Exponential family

The exponential family of distributions over expectation_propagation_40fb973cd997849a029e392c385d32d4b8c40196.png, given parameters expectation_propagation_7fa6d5539258f1df66040916ec01134f8545cd0f.png, is defined to be the set of distribtions of the form

expectation_propagation_b7bd4114d74b18b8307034a76b732cdf4d4e2942.png

where:

  • expectation_propagation_40fb973cd997849a029e392c385d32d4b8c40196.png may be discrete or continuous
  • expectation_propagation_48a293c705a51b60eeacbf589724ff80b03e747b.png is just some function of expectation_propagation_40fb973cd997849a029e392c385d32d4b8c40196.png

MLE

If we wanted to estimate expectation_propagation_7fa6d5539258f1df66040916ec01134f8545cd0f.png in a density of the exponential family, we would simply take the derivative wrt expectation_propagation_7fa6d5539258f1df66040916ec01134f8545cd0f.png of the likelihood of the distribution and set to zero:

expectation_propagation_619f76ea0a6be3a6689a54f47c2698cad95ed52b.png

Rearranging, we get

expectation_propagation_ce7d897d9ec15f865a00e8e3c5ac1ba07af140dc.png

which is just

expectation_propagation_1f141d4b87535943da6b6c343dc57e9481f9218b.png

The covariance of expectation_propagation_48a293c705a51b60eeacbf589724ff80b03e747b.png can be expressed in terms of the second derivatives of expectation_propagation_2348c5700dbeee948e145693a60c32a9b5b48529.png, and similarily for higher order moments.

Thus, if we had set of i.i.d. data denoted by expectation_propagation_9ff495689e9f0e027d3955dc5dc062a4f8a1e97e.png, for which the likelihood function is given by

expectation_propagation_d895754e10eb91913ad89b4c3d12c88e18309511.png

which is minimized wrt. expectation_propagation_7fa6d5539258f1df66040916ec01134f8545cd0f.png when

expectation_propagation_25a235993f68b3bdb3ea023409d6ae83cb392e67.png

giving us the MLE estimator

expectation_propagation_f564f128e7a9d7217693bdf711921ba406185225.png

i.e. the MLE estimator depends on the data only through expectation_propagation_c9c4b2db1cb1e99924e98ce487b5feeeb676b4ba.png, hence expectation_propagation_48a293c705a51b60eeacbf589724ff80b03e747b.png is a sufficient statistic.

expectation_propagation_f564f128e7a9d7217693bdf711921ba406185225.png

Minimizing KL-divergence between two exponential distributions

The KL divergence then becomes

expectation_propagation_55bbe3aa19fd4c54933af89c8275f3e7a6bf9160.png

where expectation_propagation_7fa6d5539258f1df66040916ec01134f8545cd0f.png, expectation_propagation_bd37da5802801ca5b15db43b40c21cb3d859a4f2.png and expectation_propagation_57b5c04d7d379e53366d9dd7ef286e3935387aee.png define the approximation distribution expectation_propagation_f63749027365e29025a5cb867262c465d2de65bb.png.

We're not making any assumptions regarding the underlying distribution of expectation_propagation_7225b076f6e6326f1636b11d1aad8de58bcc4761.png.

Which is minimized wrt. expectation_propagation_7fa6d5539258f1df66040916ec01134f8545cd0f.png by

expectation_propagation_eeff0010762121b83f7f039618043de79f75c31b.png

From MLE estimator for exponential we then have

expectation_propagation_dc2333b13693024ba8d3d829b4c5643aeca15884.png

hence, the optimum solution corresponds to matching the expected sufficient statistics!

Expectation propagation

  • Joint probability of data and parameters given by

    expectation_propagation_e726877626e98ff9ad638171d1a2599c9b1d1df4.png

  • Want to evaluate

    expectation_propagation_d536c0ee0933e4f755b45a87ffac08a623af7773.png

    and expectation_propagation_5b500e44c6c86f3246bfd54f03f33392c2cac592.png for model comparison:

    expectation_propagation_ce39e265f9d1d3e07c5301ce12f77f504ca9fb34.png

  • Expectation propagation is based on the approximation to the posterior distribution expectation_propagation_ae7c97ad4032bd14e8ccf75bdab9d700fff6bc93.png by expectation_propagation_6bdb30b6ad5f607c41238d96fcf461c62c9be49e.png

    expectation_propagation_199a1e40a4087e003d7d0556c7807fb6115db4e7.png

    where expectation_propagation_2cdbb8952bd06b1348733763ff006a3b5ee9b331.png is an approximation to the factor expectation_propagation_4bacf410ab82dc0c0bf765ad573f7f6ca28c7bdd.png in the true posterior

  • To be tractable, need to constrain the approximators expectation_propagation_cfbd736fcdd4773e9f0e5e46891a6ebb5c91e428.png: assume to be exponential
  • Want to determine expectation_propagation_2cdbb8952bd06b1348733763ff006a3b5ee9b331.png by minimizing KL divergence between true and approx. posterior:

    expectation_propagation_27ebd11dcc528a7770a02412be60de8624234cc4.png

One approach of doing this would be to minimize KL divergence between the pairs expectation_propagation_4bacf410ab82dc0c0bf765ad573f7f6ca28c7bdd.png and expectation_propagation_2cdbb8952bd06b1348733763ff006a3b5ee9b331.png of factors.

It turns out that this no good; even though each of the factors are approximated individually, the product could still give a poor approximation.

(Also, if our assumptions are not true, i.e. true posterior is actually not of the exponential family, then clearly this approach would not be a good one.)

Expectation propagation optimizes each factor expectation_propagation_2cdbb8952bd06b1348733763ff006a3b5ee9b331.png in turn, given the "context" of all remaining factors.

Suppose we wish to update our estimate for expectation_propagation_2e0dad0bdb864eb9c84507c4fe261d3802cfe1b9.png, then we would like the following (assuming data was drawn from some exponential distribution):

expectation_propagation_38994ad36c2fcf5fec988a46e6293fefcccd3b15.png

i.e. want to optimize for the j-th factor conditioned on our estimate for all the other factors. (Expectation maximization, anyone?)

This problem we can pose as minimizing the following KL divergence

expectation_propagation_93d659213fc46b53459a11891eac48eb2982e743.png

where:

  • new approximate distribution:

    expectation_propagation_7a940759da770d1bf7e3154074b6bc884d092676.png

  • the unnormalized distribution:

    expectation_propagation_c8eb9b9f24d6cf82b6d9d951abfd6cd072cd9752.png

  • normalization with expectation_propagation_c8d9b3f50e695697fb751ba6be296b0a9e3f625f.png replaced by expectation_propagation_d91966a8a3797ca3fd442f49a8674f92da2e8089.png:

    expectation_propagation_3377720ddbbfd89c3b03f0b8ff732c3c44cc7c89.png

Which we already know comes down to having:

expectation_propagation_dc2333b13693024ba8d3d829b4c5643aeca15884.png

expectation_propagation_f9a6ba4dc4483b3182dfddeb474d3be20d1447a0.png

and we wish to approximate the posterior distribution expectation_propagation_ae7c97ad4032bd14e8ccf75bdab9d700fff6bc93.png by a distribution of the form

expectation_propagation_42f351772e9e29067003cc2830203305f8c810ec.png

and the model evidence expectation_propagation_5b500e44c6c86f3246bfd54f03f33392c2cac592.png.

  1. Initialize all of the approximating factors expectation_propagation_2cdbb8952bd06b1348733763ff006a3b5ee9b331.png
  2. Initialize the posterior approximation by setting

    expectation_propagation_16c0b8fe5ed618eac85897dcba48d5a5a1483b29.png

  3. Until convergence:
    1. Choose a factor expectation_propagation_2e0dad0bdb864eb9c84507c4fe261d3802cfe1b9.png to improve
    2. Remove expectation_propagation_2e0dad0bdb864eb9c84507c4fe261d3802cfe1b9.png from posterior by division:

      expectation_propagation_f051e0cf1cccf81424eb388928a317aec5e7e24b.png

    3. Let expectation_propagation_47ce2445c622d5103b4afb5e7dc88d09f4f5aa2a.png be such that

      expectation_propagation_9c62c927f2511b983598d8f1e948d2c1ac3dac87.png

      i.e. matching the sufficient statistics (moments) of expectation_propagation_47ce2445c622d5103b4afb5e7dc88d09f4f5aa2a.png with those of expectation_propagation_13b8e6ab9f10b2fed2253cf8e40d63474858ed2f.png, including evaluating the normalization constant

      expectation_propagation_ea37a48926006cad7ed09142d886321a80e1fba3.png

    4. Evaluate and store new factor

      expectation_propagation_026f99b1b2e7ca497fbfb2e09f28197f5d3658d4.png

  4. Evaluate the approximation to the model evidence:

    expectation_propagation_000af59c6bd1b3c26c814c551af0038171d620ee.png

Notes

  • No guarantee that iterations will converge
    • However, if iteratiors do converge; resulting solution will be a stationary point of a particular energy function (although each iteration is not guaranteed to actually decrease this energy function)
  • Does not make sense to apply EP to mixtures as the approximation tries to capture all of the modes of the posterior distribution

Moment matching / Assumed density filtering (ADF): a special case of EP

  • Initializes all expectation_propagation_2e0dad0bdb864eb9c84507c4fe261d3802cfe1b9.png to unity and performs a single pass through the approx. factors, updating each of them once
  • Does not make use of batching
  • Highly dependent on order of data points, as expectation_propagation_2e9d5bc71fd5bde1f5de12736b69ddc40722b8d0.png are only updated once