Expectation Maximization (EM)

Summary

Expectation Maximization (EM) is an algorithm for estimating the parameters $\theta$ of a model involving some hidden random variable $z$, given only observations $x$.

It is a two-step algorithm; the steps are repeated in order until convergence:

  1. Expectation-step: Choose some probability distribution $Q_i$,

    $$Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$$

    which gives us a (tight) lower-bound for $\ell(\theta)$ at the current estimate of $\theta$

  2. Maximization-step: optimize that lower-bound wrt. $\theta$:

    $$\theta := \underset{\theta}{\arg\max} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

Or rather, following the notation of Wikipedia:

  1. Expectation step (E step): Calculate (or obtain an expression for) the expected value of the (complete) log likelihood function, with respect to the conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ under the current estimate of the parameters $\theta^{(t)}$:

    $$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}}\left[ \log L(\theta; \mathbf{X}, \mathbf{Z}) \right]$$

  2. Maximization step (M step): Find the parameter that maximizes this quantity:

    $$\theta^{(t+1)} = \underset{\theta}{\arg\max}\; Q(\theta \mid \theta^{(t)})$$
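
In code, one full run of the algorithm is just this alternation. Below is a minimal Python sketch of the loop, assuming the model-specific e_step and m_step functions are supplied by the caller (these names and the fixed iteration count are illustrative, not part of the notes above):

  def em(x, theta0, e_step, m_step, n_iter=100):
      # Generic EM loop: alternate the E-step and the M-step.
      # x: observations, theta0: initial guess for the parameters theta.
      theta = theta0
      for _ in range(n_iter):
          post = e_step(x, theta)   # E-step: posterior over the hidden variables under the current theta
          theta = m_step(x, post)   # M-step: maximize the lower-bound wrt. theta
      return theta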

Notation

  • $x^{(i)}$ is the $i$-th observation
  • $z^{(i)}$ is the rv. related to the $i$-th observation
  • $\theta$ is a set of parameters for our model, which we want to estimate
  • $\ell(\theta)$ is the log-likelihood of the data given the estimated parameters $\theta$
  • $\mathbb{E}_{z^{(i)} \sim Q_i}$ means the expectation over the rv. $z^{(i)}$ using the probability distribution $Q_i$

Goal

  • Have some model for $p(x, z; \theta)$
  • Observe only $x$
  • Objective is then to maximize:

    $$\ell(\theta) = \sum_{i=1}^{n} \log p(x^{(i)}; \theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$

    i.e. maximize the log-likelihood over the parameters $\theta$ by marginalizing over the rvs. $z^{(i)}$ (a concrete sketch of this objective follows below).
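
To make the objective concrete, here is a small Python sketch for a hypothetical two-component 1-D Gaussian mixture, where $z^{(i)}$ is the component label. The convention theta = (pi, mu, sigma) and the function name are chosen for this example only:

  import numpy as np
  from scipy.stats import norm

  def log_likelihood(x, theta):
      # ell(theta) = sum_i log sum_z p(x^(i), z^(i); theta), with z the component label
      pi, mu, sigma = theta                         # mixing weights, means, std devs, each of shape (K,)
      joint = pi * norm.pdf(x[:, None], mu, sigma)  # joint[i, k] = p(x^(i), z^(i) = k; theta)
      return np.log(joint.sum(axis=1)).sum()        # marginalize over z, then sum the logs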

Idea

The idea behind the EM algorithm is to construct, at each step, a lower-bound for the log-likelihood using our estimate of $\theta$ from the previous step. In this sense, the EM algorithm is a type of inference known as variational inference.

The reason for doing this is that we can often find a closed-form optimum of the lower-bound wrt. $\theta$, which is unlikely to be the case if we simply took the derivative of $\ell(\theta)$, set it to zero, and tried to solve.

Finding the optimal parameters in the M-step then amounts to taking the derivative of the lower-bound, setting it to zero and solving. It turns out that this is often easier than doing the same directly for $\ell(\theta)$.
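
For example, in a Gaussian mixture model (a standard case, used here only as an illustration) setting this derivative to zero gives closed-form updates such as

  $$\phi_k = \frac{1}{n} \sum_{i=1}^{n} Q_i(z^{(i)} = k), \qquad \mu_k = \frac{\sum_{i=1}^{n} Q_i(z^{(i)} = k)\, x^{(i)}}{\sum_{i=1}^{n} Q_i(z^{(i)} = k)}$$

for the mixing weights and means, whereas $\ell(\theta)$ itself has no such closed-form maximizer.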

Viewing it as coordinate ascent

If we let

$$J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

then we know from Jensen's inequality that $\ell(\theta) \geq J(Q, \theta)$, and so increasing $J(Q, \theta)$ at each step also monotonically increases the log-likelihood $\ell(\theta)$. It can be shown that EM is then equivalent to coordinate ascent on $J$: the E-step maximizes it wrt. $Q$ and the M-step maximizes it wrt. $\theta$.
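
Spelled out (this reformulation is implied rather than written out above), one EM iteration then consists of the two coordinate updates

$$Q^{(t+1)} = \underset{Q}{\arg\max}\; J(Q, \theta^{(t)}), \qquad \theta^{(t+1)} = \underset{\theta}{\arg\max}\; J(Q^{(t+1)}, \theta)$$

where the first update is the E-step and the second the M-step.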

Derivation

If we take our objective defined above and simply multiply by $\frac{Q_i(z^{(i)})}{Q_i(z^{(i)})} = 1$, we get:

$$\ell(\theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

Where we choose $Q_i$ to be s.t. $Q_i(z^{(i)}) \geq 0$ and $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$, i.e. $Q_i$ is a probability distribution.

We then observe that the last sum (within the $\log$) is actually the following expectation:

$$\sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = \mathbb{E}_{z^{(i)} \sim Q_i}\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]$$

Now using Jensen's inequality for a concave function (here $f = \log$, which is concave), which tells us $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$:

$$\log \mathbb{E}_{z^{(i)} \sim Q_i}\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \geq \mathbb{E}_{z^{(i)} \sim Q_i}\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]$$

Thus, we have constructed a lower-bound on the log-likelihood:

$$\ell(\theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

This bound holds for any valid choice of $Q_i$. To make it tight at the current estimate of $\theta$, we want the ratio $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$ to be a constant $c$ that does not depend on $z^{(i)}$. Why? This would imply that the expectation above would also be constant. And since $\mathbb{E}[c] = c$ is true for any constant $c$, the inequality would be tight, i.e. be an equality, as wanted!

This is equivalent to saying that we want $Q_i(z^{(i)})$ to be proportional to $p(x^{(i)}, z^{(i)}; \theta)$. Combining this with the fact that we want $Q_i$ to be a probability distribution, we need to satisfy both the proportionality and $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$. It then turns out (which makes sense, since $Q_i$ needs to sum to 1 and be proportional to $p(x^{(i)}, z^{(i)}; \theta)$) that we want:

$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$

Which gives us the E-step of the EM-algorithm:

$$Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$$

This gives us a (tight) lower-bound for $\ell(\theta)$ at the current estimate of $\theta$.
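
Continuing the hypothetical Gaussian-mixture sketch from the Goal section (same theta = (pi, mu, sigma) convention, not part of the original notes), the E-step is just this Bayes'-rule normalization written in code:

  import numpy as np
  from scipy.stats import norm

  def e_step(x, theta):
      # Q_i(z = k) = p(x^(i), z = k; theta) / p(x^(i); theta)  (the "responsibilities")
      pi, mu, sigma = theta
      joint = pi * norm.pdf(x[:, None], mu, sigma)     # joint p(x^(i), z = k; theta), shape (n, K)
      return joint / joint.sum(axis=1, keepdims=True)  # normalize over the components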

The M-step is then to optimize that lower-bound wrt. expectation_maximization_2a3127fb707893083ca58b0c58c9ba5ef423ac3d.png:

$$\theta := \underset{\theta}{\arg\max} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$
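
For the same hypothetical Gaussian mixture, setting the derivative of this lower-bound to zero gives the familiar closed-form M-step. A sketch, again under the illustrative conventions used above:

  import numpy as np

  def m_step(x, resp):
      # Closed-form maximizer of the lower-bound for a 1-D Gaussian mixture.
      # resp[i, k] = Q_i(z^(i) = k) are the responsibilities from the E-step.
      n_k = resp.sum(axis=0)                                   # effective number of points per component
      pi = n_k / len(x)                                        # mixing weights
      mu = (resp * x[:, None]).sum(axis=0) / n_k               # component means
      var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k  # component variances
      return pi, mu, np.sqrt(var)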

Appendix A: Jensen's inequality

If

  • $f$ is a convex function.
  • $X$ is a rv.

Then $\mathbb{E}[f(X)] \geq f(\mathbb{E}[X])$.

Further, if $f''(x) > 0$ ($f$ is strictly convex), then $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$ holds if and only if $X = \mathbb{E}[X]$ "with probability 1".

If we instead have:

  • $f$ is a concave function

Then $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$. This is the one we need for the derivation of the EM algorithm.

To get a visual intuition, here is an image from Wikipedia: jensens_inequality.png
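
A quick numerical sanity check of the concave case (a toy example, not part of the original notes):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.exponential(scale=2.0, size=100_000)  # any positive random variable will do

  # Concave case used in the derivation: log(E[X]) >= E[log(X)]
  print(np.log(x.mean()))                       # left-hand side
  print(np.mean(np.log(x)))                     # right-hand side, should be smaller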