# Expectation Maximization (EM)

## Summary

Expectation Maximization (EM) is an algorithm for estimating the parameters $\theta$ of a model involving some hidden random variable $z$, given only some observations $x$.

It is a two-step algorithm; the two steps are repeated in order until convergence:

**Expectation-step:** Choose some probability distribution $Q(z)$, which gives us a (tight) lower bound on the log-likelihood $\ell(\theta)$ for the current estimate of $\theta$:

$$Q(z) := p\big(z \mid x, \theta^{(t)}\big)$$

**Maximization-step:** Optimize this lower bound with respect to $\theta$:

$$\theta^{(t+1)} := \arg\max_{\theta} \sum_z Q(z) \log \frac{p(x, z \mid \theta)}{Q(z)}$$

Or rather, following the notation of Wikipedia:

**Expectation step (E step):** Calculate (or obtain an expression for) the expected value of the (*complete*) log-likelihood function $\log p(x, z \mid \theta)$, with respect to the conditional distribution of $z$ given $x$ under the current estimate of the parameters $\theta^{(t)}$ (note that this $Q\big(\theta \mid \theta^{(t)}\big)$ is a different object from the distribution $Q(z)$ used above):

$$Q\big(\theta \mid \theta^{(t)}\big) = \mathbb{E}_{z \sim p(z \mid x, \theta^{(t)})}\big[ \log p(x, z \mid \theta) \big]$$

**Maximization step (M step):** Find the parameter that maximizes this quantity:

$$\theta^{(t+1)} = \arg\max_{\theta} Q\big(\theta \mid \theta^{(t)}\big)$$
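To make the two steps concrete, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture. The model, the synthetic data, and all names (`em_gaussian_mixture`, `q`, etc.) are assumptions made purely for illustration, not something fixed by the notes above:

```python
import numpy as np

def em_gaussian_mixture(x, n_iter=100, seed=0):
    """EM for a two-component 1-D Gaussian mixture p(x, z | theta)."""
    rng = np.random.default_rng(seed)
    # theta = (pi, mu, sigma): mixing weights, means, standard deviations.
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False).astype(float)
    sigma = np.array([x.std(), x.std()])

    def log_gauss(x, mu, sigma):
        return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

    for _ in range(n_iter):
        # E-step: Q(z) = p(z | x, theta), the posterior "responsibilities".
        log_joint = np.log(pi) + log_gauss(x[:, None], mu, sigma)  # (n, 2)
        log_px = np.logaddexp(log_joint[:, 0], log_joint[:, 1])    # log p(x | theta)
        q = np.exp(log_joint - log_px[:, None])                    # rows sum to 1

        # M-step: closed-form maximizers of the lower bound for fixed Q.
        nk = q.sum(axis=0)
        pi = nk / len(x)
        mu = (q * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    return pi, mu, sigma

# Usage: data drawn from two Gaussians; EM recovers the mixture parameters.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gaussian_mixture(x))
```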

## Notation

- $x$ is the observation
- $z$ is the hidden rv. related to the observation $x$
- $\theta$ is a set of parameters for our model, which we want to estimate
- $\ell(\theta) = \log p(x \mid \theta)$ is the log-likelihood of the data given the estimated parameters
- $\mathbb{E}_{z \sim Q}[\,\cdot\,]$ means the expectation over the rv. $z$ using the probability distribution $Q$

## Goal

- Have some model $p(x, z \mid \theta)$ for the joint distribution of the observation and the hidden rv.
- Observe only $x$

The objective is then to *maximize*:

$$\ell(\theta) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$

i.e. maximize the log-likelihood over the parameters $\theta$ by marginalizing over the rvs. $z$.
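As a concrete illustration of what this marginalization looks like (assuming, just as an example, a $K$-component Gaussian mixture with $n$ observations), the sum over $z$ becomes a sum over the mixture components:

$$\ell(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} p\big(x^{(i)}, z^{(i)} = k \mid \theta\big) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}\big(x^{(i)}; \mu_k, \sigma_k^2\big)$$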

## Idea

The idea behind the EM algorithm is, at each step, to construct a lower bound on the log-likelihood using our estimate of $\theta$ from the previous step. In this sense, the EM algorithm is a type of inference known as variational inference.

The reason for doing this is that we can often find a closed-form optimal value of $\theta$ for the lower bound, which is not likely to be the case if we simply tried to take the derivative of $\ell(\theta)$, set it to zero, and solve.

Finding the optimal parameters in the M-step then amounts to taking the derivative of the M-step objective, setting it to zero, and solving. It turns out that this is often easier than doing the same directly for $\ell(\theta)$, as sketched below.
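As a small, hedged example of this "differentiate and solve" step (assuming the Gaussian-mixture example from the Summary, with responsibilities $w_k^{(i)} = Q_i(z^{(i)} = k)$ held fixed from the E-step; this notation is introduced here purely for illustration), setting the derivative of the lower bound with respect to $\mu_k$ to zero gives a closed-form update:

$$\frac{\partial}{\partial \mu_k} \sum_i \sum_j w_j^{(i)} \left( -\frac{\big(x^{(i)} - \mu_j\big)^2}{2\sigma_j^2} + \text{const} \right) = \sum_i w_k^{(i)} \frac{x^{(i)} - \mu_k}{\sigma_k^2} = 0 \quad\Longrightarrow\quad \mu_k = \frac{\sum_i w_k^{(i)} x^{(i)}}{\sum_i w_k^{(i)}}$$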

### Viewing it as coordinate ascent

If we let

$$J(Q, \theta) = \sum_z Q(z) \log \frac{p(x, z \mid \theta)}{Q(z)}$$

then we know from Jensen's inequality that $\ell(\theta) \ge J(Q, \theta)$, and so increasing $J(Q, \theta)$
at each step also monotonically increases the log-likelihood $\ell(\theta)$. It can be shown that
EM is then equivalent to running *coordinate ascent* on $J(Q, \theta)$: the E-step maximizes $J$ with respect to $Q$, the M-step with respect to $\theta$.
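Here is a small numerical sanity check of the $\ell(\theta) \ge J(Q, \theta)$ relationship, using a toy discrete model with a single observation and a binary latent $z$; the joint table `p_xz` is made up purely for illustration:

```python
import numpy as np

# Toy joint p(x, z | theta) for a single observed x and a binary latent z.
p_xz = np.array([0.3, 0.1])          # p(x, z=0 | theta), p(x, z=1 | theta)
log_lik = np.log(p_xz.sum())         # l(theta) = log p(x | theta)

def J(q):
    """Lower bound J(Q, theta) = sum_z Q(z) log( p(x, z | theta) / Q(z) )."""
    return np.sum(q * np.log(p_xz / q))

q_arbitrary = np.array([0.5, 0.5])   # any distribution over z
q_posterior = p_xz / p_xz.sum()      # Q(z) = p(z | x, theta), the E-step choice

print(log_lik, J(q_arbitrary))       # l(theta) >= J(Q, theta) for any Q
print(log_lik, J(q_posterior))       # equal: the bound is tight at the posterior
```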

## Derivation

If we take our objective defined above and simply multiply and divide by $Q(z)$, we get:

$$\ell(\theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z Q(z) \frac{p(x, z \mid \theta)}{Q(z)}$$

where we choose $Q$ to be s.t. $Q(z) \ge 0$ and $\sum_z Q(z) = 1$, i.e. $Q$ is a probability distribution.

We then observe that the last sum (within the $\log$) is actually the following expectation:

$$\sum_z Q(z) \frac{p(x, z \mid \theta)}{Q(z)} = \mathbb{E}_{z \sim Q}\left[ \frac{p(x, z \mid \theta)}{Q(z)} \right]$$

Now using Jensen's inequality for a *concave* function $f$ (because $\log$ is concave), which tells us $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$:

$$\log \mathbb{E}_{z \sim Q}\left[ \frac{p(x, z \mid \theta)}{Q(z)} \right] \ge \mathbb{E}_{z \sim Q}\left[ \log \frac{p(x, z \mid \theta)}{Q(z)} \right]$$

Thus, we have constructed a *lower bound* on the log-likelihood:

$$\ell(\theta) \ge \sum_z Q(z) \log \frac{p(x, z \mid \theta)}{Q(z)} = J(Q, \theta)$$

For the current estimate $\theta^{(t)}$ we would like this bound to be *tight*, which we can achieve by choosing $Q$ such that $\frac{p(x, z \mid \theta)}{Q(z)} = c$ for some constant $c$ that does not depend on $z$.

**Why?** This would imply that the rv. inside the expectation above is constant. And since $f(\mathbb{E}[c]) = \mathbb{E}[f(c)]$
is true for any constant $c$, the inequality would be *tight*, i.e. be an equality, as wanted!

This is equivalent to saying that we want $Q(z)$ to be proportional to $p(x, z \mid \theta)$.
Combining this with the fact that we want $Q$ to be a probability distribution, we
need to satisfy both the proportionality and $\sum_z Q(z) = 1$. Then it turns out (which makes sense, since $Q$ needs to sum to 1 *and* be proportional to $p(x, z \mid \theta)$)
that we want:

$$Q(z) = \frac{p(x, z \mid \theta)}{\sum_z p(x, z \mid \theta)} = \frac{p(x, z \mid \theta)}{p(x \mid \theta)} = p(z \mid x, \theta)$$

Which gives us the **E-step** of the EM-algorithm:

$$Q(z) := p\big(z \mid x, \theta^{(t)}\big)$$

giving us a (tight) lower bound on $\ell(\theta)$ for the current estimate $\theta^{(t)}$.
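For the Gaussian-mixture example assumed earlier (an illustration only, not the general case), this E-step works out to the familiar "responsibilities":

$$Q_i\big(z^{(i)} = k\big) = p\big(z^{(i)} = k \mid x^{(i)}, \theta^{(t)}\big) = \frac{\pi_k \, \mathcal{N}\big(x^{(i)}; \mu_k, \sigma_k^2\big)}{\sum_j \pi_j \, \mathcal{N}\big(x^{(i)}; \mu_j, \sigma_j^2\big)}$$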

The **M-step** is then to optimize that lower bound wrt. $\theta$:

$$\theta^{(t+1)} := \arg\max_{\theta} \sum_z Q(z) \log \frac{p(x, z \mid \theta)}{Q(z)}$$
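Note that, since $Q$ is held fixed during the M-step, the $-\sum_z Q(z) \log Q(z)$ term is constant in $\theta$, so this is the same maximization as the Wikipedia-style M step from the Summary:

$$\arg\max_{\theta} \sum_z Q(z) \log \frac{p(x, z \mid \theta)}{Q(z)} = \arg\max_{\theta} \Big( \mathbb{E}_{z \sim Q}\big[\log p(x, z \mid \theta)\big] \underbrace{- \sum_z Q(z) \log Q(z)}_{\text{constant in } \theta} \Big) = \arg\max_{\theta} \mathbb{E}_{z \sim Q}\big[\log p(x, z \mid \theta)\big]$$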

## Appendix A: Jensen's inequality

If

- $f$ is a *convex* function
- $X$ is a rv.

Then $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$.

**Further**, if $f''(x) > 0$ ($f$ is *strictly* convex), then $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$ holds if and only if $X = \mathbb{E}[X]$
"with probability 1", i.e. $X$ is a constant.

If we instead have:

- $f$ is a *concave* function

Then $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$. **This is the one we need for the derivation of the EM-algorithm.**
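A quick numerical sanity check of the concave case using $f = \log$ (the sample values are arbitrary):

```python
import numpy as np

# X takes a few values with equal probability; f = log is concave.
x = np.array([0.5, 1.0, 2.0, 4.0])
print(np.log(x.mean()))   # f(E[X]) ~ 0.63
print(np.log(x).mean())   # E[f(X)] ~ 0.35  <=  f(E[X]), as Jensen states
```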

To get a visual intuition, here is an image from Wikipedia: