Expectation Maximization (EM)
Summary
Expectation Maximization (EM) is an algorithm for obtaining the probability of some hidden random variable $z$ given some observations $x$.
It is a two-step algorithm, whose two steps are repeated in order until convergence:

Expectation-step: Choose some probability distribution $Q(z)$, which gives us a (tight) lower-bound for the log-likelihood $\ell(\theta)$ for the current estimate of $\theta$.

Maximization-step: Optimize that lower-bound wrt. $\theta$.
Or rather, following the notation of Wikipedia:

Expectation step (E step): Calculate (or obtain an expression for) the expected value of the (complete) log-likelihood function, with respect to the conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ under the current estimate $\boldsymbol{\theta}^{(t)}$ of the parameters $\boldsymbol{\theta}$:

$$Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}} \big[ \log p(\mathbf{X}, \mathbf{Z}; \boldsymbol{\theta}) \big]$$

Maximization step (M step): Find the parameters that maximize this quantity:

$$\boldsymbol{\theta}^{(t+1)} = \underset{\boldsymbol{\theta}}{\arg\max} \; Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$$

(Note that this $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ is the expected complete log-likelihood, a different object from the distribution $Q(z)$ used in the rest of this note.)
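To make the two steps concrete, here is a minimal sketch (my own illustrative example, assuming NumPy; not part of the original note) of EM for a two-component 1D Gaussian mixture, where the E-step computes the responsibilities $Q_i(z_i) = p(z_i \mid x_i; \theta)$ and the M-step re-estimates $\theta$ in closed form:

```python
# Minimal EM sketch for a two-component 1D Gaussian mixture (illustrative only).
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(x, n_components=2, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initial guesses for theta = (weights, means, variances).
    weights = np.full(n_components, 1.0 / n_components)
    means = rng.choice(x, size=n_components, replace=False)
    variances = np.full(n_components, x.var())

    for _ in range(n_iters):
        # E-step: responsibilities r[i, k] = Q_i(z_i = k) = p(z_i = k | x_i; theta).
        joint = weights * normal_pdf(x[:, None], means, variances)  # p(x_i, z_i = k; theta)
        r = joint / joint.sum(axis=1, keepdims=True)

        # M-step: closed-form maximization of the lower-bound wrt. theta.
        n_k = r.sum(axis=0)
        weights = n_k / len(x)
        means = (r * x[:, None]).sum(axis=0) / n_k
        variances = (r * (x[:, None] - means) ** 2).sum(axis=0) / n_k

    return weights, means, variances

# Toy data: samples drawn from a mixture of two Gaussians.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm(x))
```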
Notation
- $x$ is the observation
- $z$ is the rv. related to the hidden (latent) variable
- $\theta$ is a set of parameters for our model, which we want to estimate
- $\ell(\theta)$ is the log-likelihood of the data given the estimated parameters $\theta$
- $\mathbb{E}_{z \sim Q}[\,\cdot\,]$ means the expectation over the rv. $z$ using the probability distribution $Q$
Goal
- Have some model for $p(x, z; \theta)$
- Observe only $x$

Objective is then to maximize:

$$\ell(\theta) = \log p(x; \theta) = \log \sum_z p(x, z; \theta)$$

i.e. maximize the log-likelihood over the parameters $\theta$ by marginalizing over the rvs. $z$.
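For instance (a standard example, not spelled out in the note), if $z \in \{1, \dots, K\}$ indexes the component of a Gaussian mixture with parameters $\theta = \{\pi_k, \mu_k, \sigma_k^2\}_{k=1}^{K}$, the objective becomes

$$\ell(\theta) = \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \sigma_k^2),$$

where the sum sitting inside the $\log$ is exactly the marginalization over $z$ that makes direct optimization awkward.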
Idea
The idea behind the EM algorithm is, at each step, to construct a lower-bound for the log-likelihood $\ell(\theta)$ using our estimate of $\theta$ from the previous step. In this sense, the EM algorithm is a type of inference known as variational inference.

The reason for doing this is that we can often find a closed-form optimal value for $\theta$ in the M-step, which is not likely to be the case if we simply tried to take the derivative of $\ell(\theta)$, set it to zero, and try to solve.

Finding the optimal parameters in the M-step is then just a matter of taking the derivative of the M-step objective (the lower-bound), setting it to zero, and solving. It turns out that this is often easier than doing the same directly for $\ell(\theta)$; a concrete contrast is sketched below.
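To see why the lower-bound is easier to work with, consider a Gaussian mixture with observations $x_1, \dots, x_n$ (my own illustration, using slightly extended notation). With the responsibilities $\gamma_{ik} = Q_i(z_i = k)$ held fixed, maximizing the lower-bound wrt. the component mean $\mu_k$ gives the closed-form update

$$\mu_k = \frac{\sum_{i=1}^{n} \gamma_{ik} \, x_i}{\sum_{i=1}^{n} \gamma_{ik}},$$

whereas setting $\partial \ell(\theta) / \partial \mu_k = 0$ directly gives

$$\sum_{i=1}^{n} \frac{\pi_k \, \mathcal{N}(x_i; \mu_k, \sigma_k^2)}{\sum_{j} \pi_j \, \mathcal{N}(x_i; \mu_j, \sigma_j^2)} \cdot \frac{x_i - \mu_k}{\sigma_k^2} = 0,$$

where the weights themselves depend on $\mu_k$, so there is no closed-form solution.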
Viewing it as coordinate descent
If we let

$$J(Q, \theta) = \sum_z Q(z) \log \frac{p(x, z; \theta)}{Q(z)}$$

then we know from Jensen's inequality that $\ell(\theta) \geq J(Q, \theta)$, and so increasing $J(Q, \theta)$ at each step also monotonically increases the log-likelihood $\ell(\theta)$. It can be shown that this is then equivalent to using coordinate descent (or rather, coordinate ascent, since we are maximizing) on $J(Q, \theta)$: the E-step maximizes it wrt. $Q$, and the M-step maximizes it wrt. $\theta$.
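One way to make this coordinate view concrete (a standard identity, not derived in this note) is to rewrite $J(Q, \theta)$ using $p(x, z; \theta) = p(z \mid x; \theta) \, p(x; \theta)$:

$$J(Q, \theta) = \log p(x; \theta) - \mathrm{KL}\big( Q(z) \,\|\, p(z \mid x; \theta) \big).$$

Maximizing over $Q$ with $\theta$ fixed drives the KL term to zero (recovering the E-step choice $Q(z) = p(z \mid x; \theta)$), and maximizing over $\theta$ with $Q$ fixed is exactly the M-step; since $J$ never exceeds $\ell(\theta)$, each full round can only increase the log-likelihood.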
Derivation
If we take our objective defined above and simply multiply the terms inside the sum by $\frac{Q(z)}{Q(z)} = 1$, we get:

$$\ell(\theta) = \log \sum_z p(x, z; \theta) = \log \sum_z Q(z) \frac{p(x, z; \theta)}{Q(z)}$$
where we choose $Q$ to be s.t. $Q(z) \geq 0$ and $\sum_z Q(z) = 1$, i.e. $Q$ is a probability distribution.

We then observe that the last sum (within the $\log$) is actually the following expectation:

$$\sum_z Q(z) \frac{p(x, z; \theta)}{Q(z)} = \mathbb{E}_{z \sim Q}\left[ \frac{p(x, z; \theta)}{Q(z)} \right]$$
Now, using Jensen's inequality for a concave function $f$ (because $\log$ is concave), which tells us $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$:

$$\log \mathbb{E}_{z \sim Q}\left[ \frac{p(x, z; \theta)}{Q(z)} \right] \geq \mathbb{E}_{z \sim Q}\left[ \log \frac{p(x, z; \theta)}{Q(z)} \right]$$
Thus, we have constructed a lower-bound on the log-likelihood:

$$\ell(\theta) \geq \sum_z Q(z) \log \frac{p(x, z; \theta)}{Q(z)}$$
This holds for any valid choice of $Q$. To make the bound tight at our current estimate of $\theta$, we require the ratio inside the expectation to be constant, i.e.

$$\frac{p(x, z; \theta)}{Q(z)} = c$$

for some constant $c$ not depending on $z$. Why? This would imply that the expectation above would also be over a constant. And since $f(\mathbb{E}[c]) = \mathbb{E}[f(c)]$ is true for any constant $c$, the inequality would be tight, i.e. be an equality, as wanted!

It's also equivalent to saying that we want $Q(z)$ to be proportional to $p(x, z; \theta)$.

Combining this with the fact that we want $Q$ to be a probability distribution, we need to satisfy both the proportionality and $\sum_z Q(z) = 1$. Then it turns out (which makes sense, since we need it to sum to 1 and be proportional to $p(x, z; \theta)$) that we want:

$$Q(z) = \frac{p(x, z; \theta)}{\sum_{z'} p(x, z'; \theta)} = \frac{p(x, z; \theta)}{p(x; \theta)} = p(z \mid x; \theta)$$
Which gives us the E-step of the EM-algorithm:

$$Q(z) := p(z \mid x; \theta)$$

giving us a (tight) lower-bound for $\ell(\theta)$ for the current estimate of $\theta$.
The M-step is then to optimize that lower-bound wrt. $\theta$:

$$\theta := \underset{\theta}{\arg\max} \sum_z Q(z) \log \frac{p(x, z; \theta)}{Q(z)}$$
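As a quick sanity check (not spelled out above), plugging the E-step choice $Q(z) = p(z \mid x; \theta)$ back into the lower-bound shows that it is indeed tight at the current $\theta$:

$$\sum_z p(z \mid x; \theta) \log \frac{p(x, z; \theta)}{p(z \mid x; \theta)} = \sum_z p(z \mid x; \theta) \log p(x; \theta) = \log p(x; \theta) = \ell(\theta).$$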
Appendix A: Jensen's inequality

If:
- $f$ is a convex function
- $X$ is a rv.

Then $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$.

Further, if $f''(x) > 0$ ($f$ is strictly convex), then $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$ holds if and only if $X = \mathbb{E}[X]$ "with probability 1", i.e. $X$ is a constant.

If we instead have that $f$ is a concave function, then $\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])$. This is the one we need for the derivation of the EM-algorithm.
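A quick numerical sanity check of the concave case (my own snippet, not from the note), using $f = \log$ and an exponentially distributed $X$:

```python
# Check Jensen's inequality for the concave function log: E[log X] <= log E[X].
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # a positive random variable X

lhs = np.mean(np.log(x))   # E[log X]
rhs = np.log(np.mean(x))   # log E[X]
print(f"E[log X] = {lhs:.4f} <= log E[X] = {rhs:.4f}: {lhs <= rhs}")
```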
To get a visual intuition, see the illustration on the Wikipedia page for Jensen's inequality.