# Expectation propagation

## Notation

• denotes the Kullback-Leibler divergence, to be minimized wrt.
• is the approximate distribution, which is a member of the exponential family, i.e. can be written:

• is a fixed distribution
• denotes the data
• denotes the parameters / hidden variables

## Exponential family

The exponential family of distributions over , given parameters , is defined to be the set of distribtions of the form

where:

• may be discrete or continuous
• is just some function of

### MLE

If we wanted to estimate in a density of the exponential family, we would simply take the derivative wrt of the likelihood of the distribution and set to zero:

Rearranging, we get

which is just

The covariance of can be expressed in terms of the second derivatives of , and similarily for higher order moments.

Thus, if we had set of i.i.d. data denoted by , for which the likelihood function is given by

which is minimized wrt. when

giving us the MLE estimator

i.e. the MLE estimator depends on the data only through , hence is a sufficient statistic.

## Minimizing KL-divergence between two exponential distributions

The KL divergence then becomes

where , and define the approximation distribution .

We're not making any assumptions regarding the underlying distribution of .

Which is minimized wrt. by

From MLE estimator for exponential we then have

hence, the optimum solution corresponds to matching the expected sufficient statistics!

## Expectation propagation

• Joint probability of data and parameters given by

• Want to evaluate

and for model comparison:

• Expectation propagation is based on the approximation to the posterior distribution by

where is an approximation to the factor in the true posterior

• To be tractable, need to constrain the approximators => assume to be exponential
• Want to determine by minimizing KL divergence between true and approx. posterior:

One approach of doing this would be to minimize KL divergence between the pairs and of factors.

It turns out that this no good; even though each of the factors are approximated individually, the product could still give a poor approximation.

(Also, if our assumptions are not true, i.e. true posterior is actually not of the exponential family, then clearly this approach would not be a good one.)

Expectation propagation optimizes each factor in turn, given the "context" of all remaining factors.

Suppose we wish to update our estimate for , then we would like the following (assuming data was drawn from some exponential distribution):

i.e. want to optimize for the j-th factor conditioned on our estimate for all the other factors. (Expectation maximization, anyone?)

This problem we can pose as minimizing the following KL divergence

where:

• new approximate distribution:

• the unnormalized distribution:

• normalization with replaced by :

Which we already know comes down to having:

and we wish to approximate the posterior distribution by a distribution of the form

and the model evidence .

1. Initialize all of the approximating factors
2. Initialize the posterior approximation by setting

3. Until convergence:
1. Choose a factor to improve
2. Remove from posterior by division:

3. Let be such that

i.e. matching the sufficient statistics (moments) of with those of , including evaluating the normalization constant

4. Evaluate and store new factor

4. Evaluate the approximation to the model evidence:

### Notes

• No guarantee that iterations will converge
• Fixed-points guaranteed to exist when the approximations are in the exponential family
• So at least then it converges, but to what? Dunno
• Can also have multiple fixed-points
• However, if iteratiors do converge; resulting solution will be a stationary point of a particular energy function (although each iteration is not guaranteed to actually decrease this energy function)
• Does not make sense to apply EP to mixtures as the approximation tries to capture all of the modes of the posterior distribution

### Moment matching / Assumed density filtering (ADF): a special case of EP

• Initializes all to unity and performs a single pass through the approx. factors, updating each of them once
• Does not make use of batching
• Highly dependent on order of data points, as are only updated once