Hidden Markov Models

A Hidden Markov Model (HMM) is a probability model used to model latent states $z_k$ based on observations which depend on the current state of the system. That is; at some step $k$ the system is in some "internal" / unobservable / latent state $z_k$. We then observe an output $x_k$ which depends on the current state, $z_k$, the system is in.



  • Hidden states/rvs $z_1, \dots, z_n$
    • $z_k \in S = \{1, \dots, m\}$, i.e. a discrete set of states
  • Observed rvs $x_1, \dots, x_n$
    • $\mathcal{X}$ is the set of all possible values that $x_k$ can take, e.g. discrete, $\mathcal{X} = \{1, \dots, p\}$



The reason why we use $\mathcal{X}$ here is that elements of $\mathcal{X}$ can potentially be multi-dimensional, e.g. $\mathcal{X} = \mathbb{R}^d$; then $x_k$ represents a vector of observed rvs, not just a single observed rv.


  • Transition probabilities/matrix: $T_{ij} = p(z_{k+1} = j \mid z_k = i)$
  • Emission probabilities: $\varepsilon_i(x) = p(x \mid z_k = i)$ for $i \in S$
    • $\varepsilon_i$ is then a prob. dist. on $\mathcal{X}$
  • Initial dist: $\pi_i = p(z_1 = i)$

The difference between $z_k$ and $Z_k$, i.e. the lower- and uppercase notation, is that $Z_k$ is the actual rv, while $z_k$ stands for a value that $Z_k$ can take; $Z_k = z_k$ thus means the case where the rv $Z_k$ takes on the value $z_k$. Writing $p(z_k)$ for some arbitrary value $z_k$ is then talking about a discrete distribution.

Using the substitutions from above, the joint distribution is the following:

$$p(x_{1:n}, z_{1:n}) = \pi(z_1)\, \varepsilon_{z_1}(x_1) \prod_{k=2}^{n} T_{z_{k-1} z_k}\, \varepsilon_{z_k}(x_k)$$

The emission prob. dist for the different values of the hidden states can be basically any prob. dist.

The plot below shows a problem where an HMM might be a good application: the states $z_k$ in blue and the normally distributed observations $x_k$ in red, with emission probability $p(x_k \mid z_k)$.
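Data like that can be generated directly from the model definition. Below is a minimal sketch of such a sampler; the two-state chain, the unit-variance Gaussian emissions, and all parameter values are made up for illustration and are not the parameters behind the actual plot:

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical parameters, purely for illustration
T = np.array([[0.95, 0.05],    # T[i, j] = p(z_{k+1} = j | z_k = i)
              [0.10, 0.90]])
means = np.array([0.0, 3.0])   # emission: p(x_k | z_k) = N(means[z_k], 1)
pi = np.array([0.5, 0.5])      # initial distribution p(z_1)

def sample_hmm(n):
    zs = np.zeros(n, dtype=int)
    xs = np.zeros(n)
    zs[0] = rng.choice(2, p=pi)                # draw z_1 ~ pi
    xs[0] = rng.normal(means[zs[0]], 1.0)
    for k in range(1, n):
        zs[k] = rng.choice(2, p=T[zs[k - 1]])  # step the hidden chain
        xs[k] = rng.normal(means[zs[k]], 1.0)  # emit an observation
    return zs, xs

zs, xs = sample_hmm(100)
```

The hidden states `zs` are exactly what the forward-backward and Viterbi algorithms below try to recover from `xs` alone.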


Forward-backward algorithm / alpha-beta recursion

Example of a dynamic programming algorithm. This provides us with a way of efficiently estimating the probability of the system being in some state $z_k$ at the $k$-th step given all $n$ observations we've seen, $x_{1:n}$, i.e. $p(z_k \mid x_{1:n})$.

This does not provide us with a way to calculate the probability of a sequence of states, given the observations we've seen. For that we have to have a look at the Viterbi algorithm.


  • $x_{1:n} = (x_1, \dots, x_n)$, i.e. the sequence of observations made, not the rv representing some $x_k$
  • $x_{i:j}$ represents the sequence of observations from the $i$-th observation to the $j$-th observation


We make the assumption that we know the emission probabilities $\varepsilon$, the transition matrix $T$ and the initial distribution $\pi$.

Our goal for the entire algorithm is to compute $p(z_k \mid x_{1:n})$.

Forward algorithm
computes $\alpha_k(z_k) = p(z_k, x_{1:k})$ for all $k = 1, \dots, n$
Backward algorithm
computes $\beta_k(z_k) = p(x_{k+1:n} \mid z_k)$ for all $k = 1, \dots, n$

We can write our goal in the following way:

$$p(z_k \mid x_{1:n}) \propto p(z_k, x_{1:n}) = \underbrace{p(x_{k+1:n} \mid z_k)}_{\beta_k(z_k)}\, \underbrace{p(z_k, x_{1:k})}_{\alpha_k(z_k)}$$

Which means that if we can solve the forward and backward problems described above, we can compute $p(z_k \mid x_{1:n})$.

What we're actually saying here is:

  • Given that we know the value of the rv $Z_k$, the sequence of observations $x_{k+1:n}$ does not depend on the observations prior to the $k$-th observation.
  • Observations prior to the $k$-th observation do not depend on the observations in the future.

Forward algorithm

Our goal is to compute $\alpha_k(z_k) = p(z_k, x_{1:k})$. We start out with the following:

$$p(z_k, x_{1:k}) = \sum_{z_{k-1}} p(x_k \mid z_k, z_{k-1}, x_{1:k-1})\, p(z_k \mid z_{k-1}, x_{1:k-1})\, p(z_{k-1}, x_{1:k-1})$$

In words: probability of observing $x_k$ given that we have observed the sequence $x_{1:k-1}$ and we're in state $z_k$, marginalized / summed over the possible previous states $z_{k-1}$.

But the observation at the $k$-th step, $x_k$, is independent of $z_{k-1}$ and $x_{1:k-1}$ given that we know $z_k$. And so we have:

$$p(z_k, x_{1:k}) = \sum_{z_{k-1}} p(x_k \mid z_k)\, p(z_k \mid z_{k-1}, x_{1:k-1})\, p(z_{k-1}, x_{1:k-1})$$

And the $k$-th state, $z_k$, is independent of the previous observations $x_{1:k-1}$ given that we know $z_{k-1}$. And so further we have:

$$p(z_k, x_{1:k}) = \sum_{z_{k-1}} p(x_k \mid z_k)\, p(z_k \mid z_{k-1})\, p(z_{k-1}, x_{1:k-1})$$

Thus, we end up with:

$$\alpha_k(z_k) = \sum_{z_{k-1}} p(x_k \mid z_k)\, p(z_k \mid z_{k-1})\, \alpha_{k-1}(z_{k-1})$$

And if you go back to our assumptions you see that we have assumed that we know the emission probabilities $p(x_k \mid z_k)$ and the transition probabilities $p(z_k \mid z_{k-1})$. Thus we only have one unknown in our equation, $\alpha_{k-1}(z_{k-1})$.

Tracing this recursive relationship all the way back to $z_1$, we end up with zero unknowns, since we also assumed we know the initial distribution $\pi$, giving $\alpha_1(z_1) = p(x_1 \mid z_1)\, \pi(z_1)$.

So just to conclude:

$$\alpha_k(z_k) = \sum_{z_{k-1}} p(x_k \mid z_k)\, p(z_k \mid z_{k-1})\, \alpha_{k-1}(z_{k-1}), \qquad \alpha_1(z_1) = p(x_1 \mid z_1)\, \pi(z_1)$$

provides us with enough info to compute $\alpha_k(z_k) = p(z_k, x_{1:k})$ in a recursive manner.

Time complexity
  • For each step $k$ we have to sum over the $m$ possible values of the previous state $z_{k-1}$, and so this is of time complexity $O(m)$
  • But we have to do the step above for every possible value of $z_k$, thus we have an additional factor of $m$, giving $O(m^2)$
  • Finally, we gotta do $n$ of these steps, and so we have the two steps above with total time complexity $O(m^2)$ being computed $n$ times, leading us to the time complexity of the Forward algorithm being $O(nm^2)$

This is cool as it is linear in the number of steps, $n$!
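To see where the $O(nm^2)$ comes from in code: the sum over $z_{k-1}$, done for every value of $z_k$, is exactly one matrix-vector product per step. A vectorized sketch of the recursion, using the same Wikipedia-example numbers that appear in the implementation further down:

```python
import numpy as np

# the wiki-example parameters used later in these notes
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])   # T[i, j] = p(z_k = j | z_{k-1} = i)
E = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # E[i, x] = p(x | z = i)
pi = np.array([0.5, 0.5])

def forward_vectorized(X):
    # alpha holds p(z_k = i, x_{1:k}) for every state i at once
    alpha = pi * E[:, X[0]]
    alphas = [alpha]
    for x in X[1:]:
        # the O(m^2) sum over z_{k-1} is a single mat-vec: alpha @ T
        alpha = (alpha @ T) * E[:, x]
        alphas.append(alpha)
    return np.array(alphas)

alphas = forward_vectorized([0, 0, 1, 0, 0])
```

Normalizing each row of `alphas` reproduces the filtering distributions that the loop-based `forward()` below prints, since per-step normalization only rescales each row by a constant.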

Backward algorithm

Our goal is to compute $\beta_k(z_k) = p(x_{k+1:n} \mid z_k)$ for all $k = 1, \dots, n$. And again we try to deduce a recursive algorithm.

We start out with:

$$p(x_{k+1:n} \mid z_k) = \sum_{z_{k+1}} p(x_{k+2:n} \mid z_{k+1})\, p(x_{k+1} \mid z_{k+1})\, p(z_{k+1} \mid z_k)$$

Then if we simply define some auxiliary variable $\beta_{k+1}(z_{k+1})$ to be

$$\beta_{k+1}(z_{k+1}) = p(x_{k+2:n} \mid z_{k+1}) \quad \implies \quad \beta_k(z_k) = \sum_{z_{k+1}} \beta_{k+1}(z_{k+1})\, p(x_{k+1} \mid z_{k+1})\, p(z_{k+1} \mid z_k)$$

Aaaand we got ourselves a nice recursive relationship once again! Again, by assumption, we know what $p(x_{k+1} \mid z_{k+1})$ and $p(z_{k+1} \mid z_k)$ are.

And the final step is one extra application of the same recursion, covering the slot before the first observation (index $0$ in our arrays):

$$\beta_0(z_0) = \sum_{z_1} p(z_1 \mid z_0)\, p(x_1 \mid z_1)\, \beta_1(z_1)$$

Thus, we conclude the Backward algorithm with the following:

$$\beta_k(z_k) = \sum_{z_{k+1}} p(x_{k+1} \mid z_{k+1})\, p(z_{k+1} \mid z_k)\, \beta_{k+1}(z_{k+1}), \qquad \beta_n(z_n) = 1$$
Time complexity

Again we have the same reasoning as for the Forward algorithm's time complexity and end up with a time complexity of $O(nm^2)$.


Throughout this implementation we will be validating against this wikipedia example.

First we gotta turn those data representations into structures which our implementation can work with:

import numpy as np

wiki_emission_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
wiki_initial_dist = np.array([[0.5, 0.5]])
wiki_observations = np.array([0, 0, 1, 0, 0])
wiki_transition_probs = np.array([[0.7, 0.3], [0.3, 0.7]])

S_wiki = np.arange(2)
T_wiki = wiki_transition_probs
X_wiki = wiki_observations
epsilon_wiki = lambda z, x: wiki_emission_probs[z][x]
pi_wiki = np.array([0.5, 0.5])
cnt = 0

def forward(pi, X, S, epsilon, T):
    n_obs = len(X)
    n_states = len(S)

    alpha = np.zeros((n_obs + 1, n_states))
    alpha[0] = pi

    for k_minus_one, x in enumerate(X):
        # readjust the index of alpha since
        # we're including pi too
        k = k_minus_one + 1

        alpha_k = alpha[k]

        for z_k in S:
            alpha_k[z_k] = sum(
                # p(z_k | z_{k-1}) * p(x_k | z_k) * alpha_{k-1}(z_{k-1})
                T[z, z_k] * epsilon(z_k, x) * alpha[k - 1][z]
                for z in S
            )

        # normalize; this turns the joint p(z_k, x_{1:k}) into
        # the filtering distribution p(z_k | x_{1:k})
        alpha[k] = alpha_k / alpha_k.sum()

    return alpha

forward(pi_wiki, X_wiki, S_wiki, epsilon_wiki, T_wiki)
array([[ 0.5       ,  0.5       ],
       [ 0.81818182,  0.18181818],
       [ 0.88335704,  0.11664296],
       [ 0.19066794,  0.80933206],
       [ 0.730794  ,  0.269206  ],
       [ 0.86733889,  0.13266111]])


cnt = 0

def backward(X, S, epsilon, T):
    n_obs = len(X)
    n_states = len(S)
    beta = np.zeros((n_obs + 1, n_states))  # 1 extra for initial state

    beta[0] = np.ones(*S.shape)
    beta[0] /= beta[0].sum()

    # beta is stored "in reverse": beta[0] holds beta_n, and we reverse
    # the array before returning
    for k_plus_one in range(n_obs):
        k = k_plus_one + 1
        beta_k = beta[k]                         # beta_k (to be filled)
        beta_k_plus_one = beta[k_plus_one]       # beta_{k+1}
        x_plus_one = X[n_obs - 1 - k_plus_one]   # x_{k+1}, counting from the end

        for z_k in S:                    # for each val of z_k
            beta_k[z_k] = sum(
                # p(z_{k+1} | z_k) * p(x_{k+1} | z_{k+1}) * beta_{k+1}(z_{k+1})
                T[z_k, z] * epsilon(z, x_plus_one) * beta_k_plus_one[z]
                for z in S               # z_{k+1}
            )

            # validating the time complexity
            global cnt
            cnt += n_states

        # note: no per-step normalization here; beta_k represents the
        # conditional p(x_{k+1:n} | z_k), which need not sum to one over z_k,
        # and forward_backward() normalizes the product anyway

    # finally we gotta get the probabilities of the intial state
    x = X[0]
    for z_1 in S:
        beta[-1][z_1] = sum(T[z_1, z] * epsilon(z, x) * beta[-2][z] for z in S)

    # and normalize this too
    beta[-1] /= beta[-1].sum()

    # reversing the results so that we have 0..n
    return beta[::-1]

backward(X_wiki, S_wiki, epsilon_wiki, T_wiki)
array([[ 0.64693556,  0.35306444],
       [ 0.03305881,  0.02275383],
       [ 0.0453195 ,  0.0751255 ],
       [ 0.22965   ,  0.12185   ],
       [ 0.345     ,  0.205     ],
       [ 0.5       ,  0.5       ]])

Which is the wanted result! Great!

<2017-03-05 Sun> I find it interesting that the forward and backward algorithms provide me with similar, but slightly different answers when we run both of them independently on all of our observations.

BRAINFART! The forward algorithm aims to compute the joint distribution $p(z_k, x_{1:k})$, while the backward algorithm computes the conditional $p(x_{k+1:n} \mid z_k)$, in the case where we give the algorithms all the observations. They're not computing the same thing, duh.

And just to validate the time-complexity:

print("Want it to be about %d; it is in fact %d" % (len(X_wiki) * len(S_wiki) ** 2, cnt))
Want it to be about 20; it is in fact 20


Now for bringing it all together: forward-backward.

cnt = 0

def forward_backward(pi, X, S, epsilon, T):
    n_obs = len(X)
    n_states = len(S)

    posterior = np.zeros((n_obs + 1, n_states))

    forward_ = forward(pi, X, S, epsilon, T)
    backward_ = backward(X, S, epsilon, T)

    for k in range(n_obs + 1):
        posterior[k] = forward_[k] * backward_[k]
        posterior[k] /= posterior[k].sum()

    return posterior

forward_backward(pi_wiki, X_wiki, S_wiki, epsilon_wiki, T_wiki)
array([[ 0.64693556,  0.35306444],
       [ 0.86733889,  0.13266111],
       [ 0.82041905,  0.17958095],
       [ 0.30748358,  0.69251642],
       [ 0.82041905,  0.17958095],
       [ 0.86733889,  0.13266111]])

Eureka! We got the entire posterior distribution of the hidden states given the observations!


<2017-02-07 Tue> Even though the plot looks nice and all, it seems weird to me that everyone seems to be talkin about issues with underflow while I'm getting overflow issues when dealing with large n o.O

I wasn't normalizing at each step, but instead waited until I had all the values and then normalized. Which means that I would get insanely large values of alpha, and thus doom.

I'm not 100% sure, but I think it would still be a probability distribution in the end, if, you know, I could get to the end without overflow errors…
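For the record, the standard remedy for both under- and overflow is to run the whole recursion in log-space with the log-sum-exp trick, so no products of many small (or large) numbers ever materialize. A sketch of a log-space forward pass, reusing the wiki-example parameters (redefined here so the block stands alone):

```python
import numpy as np

def log_sum_exp(v):
    # numerically stable log(sum(exp(v)))
    m = v.max()
    return m + np.log(np.sum(np.exp(v - m)))

# wiki-example parameters, redefined so this block is self-contained
log_T = np.log(np.array([[0.7, 0.3], [0.3, 0.7]]))
log_E = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
log_pi = np.log(np.array([0.5, 0.5]))

def log_forward(X):
    log_alpha = log_pi + log_E[:, X[0]]
    for x in X[1:]:
        # log of sum over z_{k-1} of exp(log_alpha[j] + log_T[j, i]), per state i
        log_alpha = np.array([
            log_sum_exp(log_alpha + log_T[:, i]) for i in range(2)
        ]) + log_E[:, x]
    return log_alpha

log_alpha = log_forward([0, 0, 1, 0, 0])
# normalizing in log-space gives the filtering distribution p(z_n | x_{1:n})
posterior_final = np.exp(log_alpha - log_sum_exp(log_alpha))
```

Since additions in log-space replace multiplications, the magnitudes stay bounded no matter how large n gets.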


predicted = []
posterior = forward_backward(pi, X, S, epsilon, T)

for p in posterior:
    predicted.append(np.argmax(p))

plt.scatter(steps, X, s=25)
plt.scatter(steps, zs, s=10)
plt.scatter(steps, predicted[:-1], alpha=0.1, c="g")
plt.title("%d samples" % n)


Here you can see the actual states plotted in blue, the normally distributed observations in red, and the predictions of the hidden states as a vague green. Yeah, I should probably mess with the sizes and colors to make it more apparent..

Viterbi algorithm

Allows us to compute the posterior probability of a sequence of states, $z_{1:n}$, given some sequence of observations, $x_{1:n}$.


  • $x_{1:n} = (x_1, \dots, x_n)$, i.e. the sequence of observations made, not the rv representing some $x_k$
  • $x_{i:j}$ represents the sequence of observations from the $i$-th observation to the $j$-th observation


The main idea behind the Viterbi algorithm is to find the optimal / most probable sub-sequence at each step, and then find the optimal / most probable sub-sequence at the next step using the previous step's optimal path.

The naive approach would be to simply obtain the probabilities for all the different sequences of states $z_{1:n}$ given the sequence of observations $x_{1:n}$. The issue with this approach is quite clearly the number of possible combinations, which would be $m^n$, where $m$ is the number of possible states.
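To make the $m^n$ blowup concrete, here's a brute-force sketch that scores every one of the $2^5 = 32$ state sequences for the wiki example, which is only feasible because $n$ is tiny:

```python
import itertools
import numpy as np

# wiki-example parameters, redefined so this block is self-contained
T = np.array([[0.7, 0.3], [0.3, 0.7]])
E = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
X = [0, 0, 1, 0, 0]

def joint(zs):
    # p(z_{1:n}, x_{1:n}) = pi(z_1) e(x_1|z_1) prod_k T[z_{k-1}, z_k] e(x_k|z_k)
    p = pi[zs[0]] * E[zs[0], X[0]]
    for k in range(1, len(X)):
        p *= T[zs[k - 1], zs[k]] * E[zs[k], X[k]]
    return p

# scoring all m^n sequences: hopeless for large n, fine for n = 5
best = max(itertools.product([0, 1], repeat=len(X)), key=joint)
```

For $n = 100$ and $m = 2$ this would already be $2^{100}$ sequences, which is exactly why we want a dynamic programming approach instead.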

A "dynamic programming" approach would be to attempt to find a more feasible sub-problem and check if we can recursively solve these sub-problems, reducing computation by storing intermediate solutions as we go.



So, let's start with the most likely initial path:

$$V_{1, i} = p(x_1 \mid z_1 = i)\, \pi_i$$

  • $V_{1, i}$, i.e. the probability of the first state taking value $i$, where $i \in S$

Our next step is then to define $V_{k, i}$ in terms of $V_{k-1, j}$.

The most likely state sequence $z_{1:n}$ is given by the recurrence relations:

$$V_{k, i} = p(x_k \mid z_k = i)\, \max_{j \in S} \left( T_{ji}\, V_{k-1, j} \right)$$

  • $V_{k, i}$ - probability of ???
  • $V_{k, i} = \max_{z_{1:k-1}} p(z_{1:k-1}, z_k = i, x_{1:k})$, i.e. the probability of the most probable sequence of the first $k$ elements, given that the (entire?) path ends in state $i$.
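For reference, this recurrence maps directly onto the textbook table-filling implementation with backpointers; a sketch against the wiki example (this is the standard formulation, distinct from the incremental variant explored in the rest of these notes):

```python
import numpy as np

# wiki-example parameters, redefined so this block is self-contained
T = np.array([[0.7, 0.3], [0.3, 0.7]])
E = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])

def viterbi_dp(X, m=2):
    n = len(X)
    V = np.zeros((n, m))       # V[k, i] = prob of best path ending in state i
    back = np.zeros((n, m), dtype=int)
    V[0] = pi * E[:, X[0]]     # V_{1,i} = p(x_1 | i) pi_i
    for k in range(1, n):
        for i in range(m):
            # best predecessor j for state i at step k
            scores = V[k - 1] * T[:, i]
            back[k, i] = np.argmax(scores)
            V[k, i] = scores[back[k, i]] * E[i, X[k]]
    # trace the backpointers from the best final state
    path = [int(np.argmax(V[-1]))]
    for k in range(n - 1, 0, -1):
        path.append(int(back[k, path[-1]]))
    return path[::-1]

path = viterbi_dp([0, 0, 1, 0, 0])
```

Filling the $n \times m$ table costs $O(nm^2)$, and the backtrace recovers the argmax sequence without ever storing whole sub-paths.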

My interpretation

The Viterbi algorithm is all about making locally optimal choices, storing these locally optimal sub-paths as you go, and reusing this computation when trying different choices of states.

You start out by the following:

  1. Choose the initial state $\hat{z}_1$ s.t. you maximize the probability of observing $x_1$, i.e.

$$\hat{z}_1 = \arg\max_{z_1}\ p(x_1 \mid z_1)\, \pi(z_1)$$

    Now to the crucial part (and the part which makes this a dynamic programming algorithm!), we store both the probability of the sequence we've chosen so far and the states we've chosen. In this case, it's simply $\hat{z}_1$ and $p(x_1 \mid \hat{z}_1)\, \pi(\hat{z}_1)$.

  2. Choose the next state $\hat{z}_2$ s.t. you maximize the probability of observing $x_2$, conditioned on already being in state $\hat{z}_1$, i.e.

$$\hat{z}_2 = \arg\max_{z_2}\ p(x_2 \mid z_2)\, p(z_2 \mid \hat{z}_1)$$

    And again, we store the current path and its probability, i.e. $(\hat{z}_1, \hat{z}_2)$ and the corresponding probability. Important: We store this somewhere else than the first sub-path obtained, i.e. $\hat{z}_1$ and its probability.

  3. Perform this step until you reach the final step/observation, i.e. $n$ times, and you end up in $\hat{z}_n$.

Pretty complex, ehh? Simply choosing what seems the best at each step you take.

Now, let's think about the number of computations we got to do here.

  • At each step we got to decide between $m$ possible states, computing the probability $p(x_k \mid z_k)\, p(z_k \mid \hat{z}_{k-1})$ for each one of them. This is required so that we know which one to choose for our next state. That is; at each step we do $m$ computations.
  • We make the above choice until we reach the end, and so this is $n$ steps.

This leads us to a complexity of $O(nm)$, where

  • $m$ is the number of possible states
  • $n$ is the number of steps/observations

Unfortunately we're not done yet. We've performed $O(nm)$ operations, but what if the initial state we chose first, $\hat{z}_1$, has "horrible" probabilities as soon as you get past the first step?! Maybe if we set $z_1$ equal to some other value, we would get a much better probability in the end?

Hence, smart as we are, we go "Okay, let's try again, but this time we choose the second best initial state!". And because we're running out of notation, we call this $\hat{z}_1^{(2)}$. Yeah I know, it's pretty bad, but you get the picture. The $(2)$ denotes that this is the 2nd try. We keep walking until we've either realized that it's worse than our current $\hat{z}_1$, in which case we discard the new value, or we realize it's in fact better, and so we simply replace it, i.e. set $\hat{z}_1 = \hat{z}_1^{(2)}$, also replacing the probability.

But how do we decide that the new suggested value is better? By walking through the path again, of course! But, this time we're going to try and be smart.

  • If we at some step $k$ reach a node that we visited in our first "walk", we don't have to keep going until the end, but instead we terminate and do the following:
    • Compare the probability of the sub-sequence up until this step that we're currently looking at with the previous best estimate for this same sub-sequence, i.e. compare $p(\hat{z}_{1:k}^{(2)}, x_{1:k})$ to $p(\hat{z}_{1:k}, x_{1:k})$.
    • If the new sub-path is more probable, then we replace our current best estimate for this sub-path, and recompute the probabilities for the rest of the path.
      • Simply divide away our previous estimate, and multiply with the new and better sub-path probability:

$$p(\hat{z}_{1:j}, x_{1:j}) \leftarrow p(\hat{z}_{1:j}, x_{1:j})\, \frac{p(\hat{z}_{1:k}^{(2)}, x_{1:k})}{p(\hat{z}_{1:k}, x_{1:k})} \quad \text{for } j = k+1, \dots, n$$

For the sake of clarification, when we use the word sub-path, we're talking about a sequence of states $z_{1:k}$ for some $k \leq n$, i.e. all sub-paths start at step $1$, ending at some $k$. We're not talking about sequences starting at some arbitrary step.

We repeat the process above for all possible initial states, which gives us another factor of $m$ in the time-complexity. This is due to having to perform the (possibly, see below) $O(nm)$ operations we talked about earlier for each initial state, of which there are $m$ different ones.

This means that, at worst, we will not reach any already known / visited sequence, and so will do all the $n$ steps for each possible initial state. For example, consider the case of every state only being able to transition to itself. In such cases we will have a time complexity of $O(nm^2)$, which is still substantially better than $O(m^n)$, the naive approach. But as I said, this is the worst case scenario! More often we will find sub-sequences that we have already computed, and so we won't have to do all the $n$ steps for every initial state!

Thus, for the Viterbi algorithm we get a time complexity of $O(nm^2)$ (note Big-O, not Big-Theta)!


First we need to set up the memoization part of the algorithm. Between each "attempt" we need to remember the following:

  • At each step $i$, the state chosen by the most likely sub-path so far. We'll store this as an array best_sequence, with the $i$-th element being the state our best guess passes through at step $i$.
  • At each step $i$, the probability of the most likely sub-path so far. We'll store this in an array best_sequence_probas, with the $i$-th element being the probability of the sub-path ending at the state stored at best_sequence[i].
n_obs = len(X_wiki)
n_states = len(S_wiki)

best_sequence = [None] * n_obs
best_sequence_probas = np.zeros(n_obs)

n_obs, n_states
(5, 2)

First we choose the "best" initial state as our starting point, and then start stepping through. The best initial state corresponds to maximizing $p(x_1 \mid z_1)\, \pi(z_1)$ over $z_1$.

Notice how we're not actually computing real probabilities as we're not normalizing. The reason for this is simply that it is not necessary to obtain the most likely sequence.

x_1 = X_wiki[0]

z_hat = S_wiki[0]
z_hat_proba = epsilon_wiki(S_wiki[0], x_1) * pi_wiki[S_wiki[0]]

initial_step_norm_factor = z_hat_proba

for z in S_wiki[1:]:
    z_proba = epsilon_wiki(z, x_1) * pi_wiki[z]
    initial_step_norm_factor += z_proba

    if z_proba > z_hat_proba:
        z_hat = z
        z_hat_proba = z_proba

# we start from state 1 here rather than the argmax z_hat = 0;
# the second attempt below then tries state 0 and demonstrates
# the path-replacement step
best_sequence[0] = 1
best_sequence_probas[0] = epsilon_wiki(1, x_1) * pi_wiki[1] / initial_step_norm_factor

best_sequence[0], best_sequence_probas[0]
(1, 0.18181818181818182)
for i, x in enumerate(X_wiki):
    if i == 0:
        # step 0 was handled above
        continue

    z_hat_prev = best_sequence[i - 1]
    z_hat_prev_proba = best_sequence_probas[i - 1]

    # initialize
    z_hat = S_wiki[0]
    z_hat_proba = T_wiki[z_hat_prev, z_hat] * epsilon_wiki(z_hat, x)

    step_norm_factor = z_hat_proba

    for z in S_wiki[1:]:
        z_proba = T_wiki[z_hat_prev, z] * epsilon_wiki(z, x)
        step_norm_factor += z_proba

        if z_proba > z_hat_proba:
            z_hat = z
            z_hat_proba = z_proba

    best_sequence[i] = z_hat
    best_sequence_probas[i] = z_hat_proba * z_hat_prev_proba / step_norm_factor

best_sequence, best_sequence_probas
([1, 0, 1, 0, 0], array([ 0.18181818,  0.11973392,  0.09269723,  0.06104452,  0.0557363 ]))

When I started doing this, I wanted to save the sub-path at each step. Then I realized that this was in fact not necessary; instead we could simply store the state $\hat{z}_i$ at step $i$ and the probability of the best estimate for the sub-path ending in state $\hat{z}_i$ at step $i$. When we're finished, we're left with a sequence $(\hat{z}_1, \dots, \hat{z}_n)$ which is the best estimate.

If you don't remember: what we did previously was storing the entire sub-path up until the current step, i.e. at step $i$ we stored the sequence $(\hat{z}_1, \dots, \hat{z}_i)$, which is bloody unnecessary. This would allow us to afterwards obtain the best estimate for any sub-path though, so that would be nice. But whenever we would hit a better estimate for a previously obtained path, we would have to step through and replace every element of the stored sub-paths.

That was our first run; now we gotta give it a shot for the second best initial state. Since, as mentioned before, we've assumed a uniform prior, we simply pick the next state.

curr_seq = [0]
curr_seq_probas = [epsilon_wiki(curr_seq[0], x_1) * pi_wiki[curr_seq[0]] / initial_step_norm_factor]

for i, x in enumerate(X_wiki):
    if i == 0:
        # the initial state is already in curr_seq
        continue

    z_prev = curr_seq[i - 1]
    z_prev_proba = curr_seq_probas[i - 1]

    # initialize the argmax process
    z_next = S_wiki[0]
    z_next_proba = T_wiki[z_prev, z_next] * epsilon_wiki(z_next, x)

    step_norm_factor = z_next_proba

    for z in S_wiki[1:]:
        z_proba = T_wiki[z_prev, z] * epsilon_wiki(z, x)
        step_norm_factor += z_proba

        if z_proba > z_next_proba:
            z_next = z
            z_next_proba = z_proba

    curr_seq_proba = z_prev_proba * z_next_proba / step_norm_factor
    curr_seq.append(z_next)
    curr_seq_probas.append(curr_seq_proba)

    # check if the rest of the path has been
    # computed already
    z_hat = best_sequence[i]
    if z_hat == z_next:
        # we've computed this before, and the rest
        # of the path is the best estimate we have
        # as of right now
        best_seq_proba = best_sequence_probas[i]
        if best_seq_proba < curr_seq_proba:
            # the new path is better, so we gotta
            # replace the best estimate
            best_sequence[:i + 1] = curr_seq
            # and the preceding probabilities
            best_sequence_probas[:i + 1] = curr_seq_probas

            # then we have to recompute the probabilities
            # for the rest of the best estimate path
            for j in range(i + 1, n_obs):
                best_sequence_probas[j] = curr_seq_proba * \
                         best_sequence_probas[j] \
                         / best_seq_proba

            print("We found a better one! Current is %f, new is %f"
                  % (best_seq_proba, curr_seq_proba))

        # since we've already computed the sub-path after this step
        # given the state we are currently in, we terminate
        break
curr_seq, curr_seq_probas
([0, 0], [0.8181818181818181, 0.7470355731225296])
best_sequence, best_sequence_probas
([0, 0, 1, 0, 0], array([ 0.81818182,  0.74703557,  0.57835012,  0.38086471,  0.34774604]))

This looks very good to me!

def viterbi(pi, X, S, epsilon, T):
    n_obs = len(X)

    # initialize
    best_sequence = [None] * n_obs
    best_sequence_probas = np.zeros(n_obs)

    x_1 = X[0]
    initial_marginal = sum(epsilon(z, x_1) * pi[z] for z in S)

    sorted_S = sorted(S, key=lambda z: epsilon(z, x_1) * pi[z], reverse=True)

    for z_1 in sorted_S:
        curr_seq = [z_1]
        curr_seq_probas = [epsilon(z_1, x_1) * pi[z_1] / initial_marginal]

        for i, x in enumerate(X):
            if i == 0:
                # first step was handled during initialization
                continue

            z_prev = curr_seq[i - 1]
            z_prev_proba = curr_seq_probas[i - 1]

            # Max and argmax simultaneously
            # initialize the argmax process
            z_next = S[0]
            z_next_proba = T[z_prev, z_next] * epsilon(z_next, x)
            step_marginal = z_next_proba

            for z in S[1:]:
                z_proba = T[z_prev, z] * epsilon(z, x)

                step_marginal += z_proba

                if z_proba > z_next_proba:
                    z_next = z
                    z_next_proba = z_proba

            # Memoize our results
            curr_seq_proba = z_prev_proba * z_next_proba / step_marginal
            curr_seq.append(z_next)
            curr_seq_probas.append(curr_seq_proba)

            # check if the rest of the path has been computed already
            z_hat = best_sequence[i]
            if z_hat == z_next:
                # We've computed this before, and the rest
                # of the path is the best estimate we have.
                # Thus we check if this new path is better!
                best_seq_proba = best_sequence_probas[i]
                if best_seq_proba < curr_seq_proba:
                    # replace the best estimate sequence
                    best_sequence[:i + 1] = curr_seq
                    # and the corresponding probabilities
                    best_sequence_probas[:i + 1] = curr_seq_probas

                    # recompute the probabilities
                    # for the rest of the best estimate path
                    for j in range(i + 1, n_obs):
                        best_sequence_probas[j] = curr_seq_proba * \
                                 best_sequence_probas[j] \
                                 / best_seq_proba

                    print("We found a better one at step %d! "
                          "Current is %f, new is %f"
                          % (i, best_seq_proba, curr_seq_proba))

                # since we've already computed the sub-path after this step
                # given the state we are currently in, we terminate
                break

        # gone through all the observations for this state
        if i == n_obs - 1:
            best_seq_proba = best_sequence_probas[i]
            if best_seq_proba < curr_seq_proba:
                best_sequence = curr_seq
                best_sequence_probas = curr_seq_probas

    return best_sequence, best_sequence_probas
viterbi(pi_wiki, X_wiki, S_wiki, epsilon_wiki, T_wiki)
([0, 0, 1, 0, 0],
 [0.8181818181818181, 0.7470355731225296, 0.5783501211271196, 0.3808647139129812, 0.3477460431379394])
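
As a sanity check on the greedy scheme above, here is a self-contained textbook dynamic-programming Viterbi in log-space. The toy model below (`pi_toy`, `T_toy`, `B_toy`) is made up purely for illustration; it is not the wiki example used elsewhere in these notes.

```python
import numpy as np

def viterbi_dp(pi, X, S, epsilon, T):
    """Textbook dynamic-programming Viterbi in log-space."""
    n_obs, n_states = len(X), len(S)

    # delta[i, z] = log-prob of the best path ending in state z at step i
    delta = np.full((n_obs, n_states), -np.inf)
    # backpointers for path reconstruction
    psi = np.zeros((n_obs, n_states), dtype=int)

    for z in range(n_states):
        delta[0, z] = np.log(pi[z]) + np.log(epsilon(S[z], X[0]))

    for i in range(1, n_obs):
        for z in range(n_states):
            scores = delta[i - 1] + np.log(T[:, z])
            psi[i, z] = np.argmax(scores)
            delta[i, z] = scores[psi[i, z]] + np.log(epsilon(S[z], X[i]))

    # backtrack from the best final state
    path = [int(np.argmax(delta[-1]))]
    for i in range(n_obs - 1, 0, -1):
        path.append(int(psi[i, path[-1]]))
    return path[::-1]

# toy two-state model (made-up numbers)
pi_toy = np.array([0.6, 0.4])
T_toy = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
B_toy = np.array([[0.9, 0.1],   # state 0 mostly emits symbol 0
                  [0.2, 0.8]])  # state 1 mostly emits symbol 1
eps_toy = lambda z, x: B_toy[z, x]

print(viterbi_dp(pi_toy, [0, 0, 1, 1], [0, 1], eps_toy, T_toy))
# -> [0, 0, 1, 1]
```

Unlike the greedy scheme, this version is guaranteed to find the globally most probable state sequence, at O(n_obs * n_states^2) cost.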


Using numpy.ndarray for curr_seq

I'm actually not sure whether or not this would speed things up. Python lists give us an amortized constant-time append, which means we can potentially save a lot of memory compared to preallocating n_obs elements with a numpy.ndarray. Most of the time we won't walk through the entire sequence of observations, so most of the preallocated memory wouldn't even be used.

Nonetheless I haven't actually run any benchmarks, so it's all speculation. I just figured it wasn't worth trying, for the reason given above.
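
For what it's worth, a microbenchmark along these lines would settle the question. This is just a sketch of how one could measure it; absolute timings are machine-dependent, and the names below are made up for the comparison.

```python
import timeit
import numpy as np

n = 1000  # stand-in for n_obs

def grow_list():
    # amortized O(1) appends, memory grows on demand
    seq = []
    for i in range(n):
        seq.append(i)
    return seq

def prealloc_ndarray():
    # all memory allocated up front, then filled in
    seq = np.zeros(n)
    for i in range(n):
        seq[i] = i
    return seq

t_list = timeit.timeit(grow_list, number=1000)
t_array = timeit.timeit(prealloc_ndarray, number=1000)
print("list append: %.4fs, ndarray prealloc: %.4fs" % (t_list, t_array))
```

Note this only measures time; the memory argument above (unused preallocated slots when we stop early) would need a separate measurement.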

Python class implementation

Should this even be a class? I think it sort of makes sense, considering it's kind of a model of a state machine, so modelling it in an imperative manner seems reasonable to me.

Would it be a good idea to implement all of these as staticmethods, and then have the instance-specific methods simply call the staticmethod? This would allow us to choose whether or not we actually want to create a full instance of a HiddenMarkovModel.

Naaah, we'll simply have functions which take those arguments, and then have the class HiddenMarkovModel call those functions.
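
For reference, the staticmethod variant being dismissed here would look roughly like this. It's only a sketch of the pattern; `_viterbi` is a stand-in that just echoes its input, not the real algorithm.

```python
class HiddenMarkovModelAlt:
    """Sketch of the staticmethod pattern: the algorithm lives in a
    staticmethod (usable without an instance), and the instance method
    just fills in its own attributes."""

    def __init__(self, pi, S, T):
        self.pi = pi
        self.S = S
        self.T = T

    @staticmethod
    def _viterbi(pi, X, S, T):
        # stand-in for the real algorithm: just echoes the observations
        return list(X)

    def viterbi(self, X):
        # the instance method simply delegates to the staticmethod
        return HiddenMarkovModelAlt._viterbi(self.pi, X, self.S, self.T)
```

The module-level-functions approach chosen below achieves the same "no instance required" property with less ceremony.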


  • forward
  • backward
  • forward-backward
  • viterbi


  • posterior
  • epsilon and emission_probabilities
  • T and transition_probabilities
  • pi and initial_probabilities
  • S and states
  • X and observations
class HiddenMarkovModel:
    """A Hidden Markov Model (HMM) is fully defined by the
    state space, transition probabilities, emission probabilities,
    and initial probabilities of the states.

    This makes it easier to define a model, and then
    rerun it on different observations `X`.

    Parameters
    ----------
    pi: array-like floats, 1d
        Initial probability vector.
    S: array-like ints, 1d
        Defines the possible states the system can be in.
    epsilon: callable(z, x) -> float
        Takes current state :int:`z` and observation :float:`x`
        as argument, returning a :float: representing the
        probability of seeing observation :float:`x` given
        the system is in state :int:`z`.
    T: array-like floats, 2d
        Matrix / 2-dimensional array where the index `[i][j]`
        gives us the probability of transitioning from state
        `i` to state `j`.
    X: array-like (optional)
        Represents the observations. This can be an array-like
        structure of basically anything, as long as :callable:`epsilon`
        can handle the elements of `X`. So observations can be
        scalar values, vectors, or whatever.
    """

    def __init__(self, pi=None, S=None, epsilon=None, T=None, X=None):
        self.pi = pi
        self.S = S
        self.epsilon = epsilon
        self.T = T
        self.X = X

        # some aliases for easier use for others
        self.initial_probabilities = self.pi
        self.states = self.S
        self.emission_probabilities = self.epsilon
        self.transition_probabilities = self.T
        self.observations = self.X

        self.posterior = None
        self.optimal_path = None

    def forward(self, X=None):
        if X is None:
            X = self.X

        return forward(self.pi, X, self.S, self.epsilon, self.T)

    def backward(self, X=None):
        if X is None:
            X = self.X

        return backward(X, self.S, self.epsilon, self.T)

    def forward_backward(self, X=None):
        if X is None:
            X = self.X

        n_obs = len(X)
        n_states = len(self.S)

        posterior = np.zeros((n_obs + 1, n_states))

        forward_ = self.forward(X)
        backward_ = self.backward(X)

        for k in range(n_obs + 1):
            posterior[k] = forward_[k] * backward_[k]
            posterior[k] /= posterior[k].sum()

        # Should we actually set it, or leave that to the user?
        self.posterior = posterior

        return posterior

    def viterbi(self, X=None):
        if X is None:
            X = self.X

        self.optimal_path = viterbi(self.pi, X, self.S, self.epsilon, self.T)
        return self.optimal_path
hmm = HiddenMarkovModel(pi_wiki, S_wiki, epsilon_wiki, T_wiki)
hmm.viterbi(X_wiki)
([0, 0, 1, 0, 0],
 [0.8181818181818181, 0.7470355731225296, 0.5783501211271196, 0.3808647139129812, 0.3477460431379394])
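
As a self-contained sanity check of the idea behind the forward_backward method above (multiply the forward and backward messages, then normalize per step), here is a compact matrix-form version on a toy model. All numbers are made up, and an emission matrix `B` stands in for `epsilon`.

```python
import numpy as np

def forward_backward_toy(pi, T, B, X):
    """Smoothing posteriors p(z_i | x_1..x_n) for a discrete toy HMM.
    Unnormalized alpha/beta recursions, normalized per step at the end."""
    n_obs, n_states = len(X), len(pi)

    # forward: alpha[i, z] = p(x_1..x_i, z_i = z)
    alpha = np.zeros((n_obs, n_states))
    alpha[0] = pi * B[:, X[0]]
    for i in range(1, n_obs):
        alpha[i] = (alpha[i - 1] @ T) * B[:, X[i]]

    # backward: beta[i, z] = p(x_{i+1}..x_n | z_i = z)
    beta = np.ones((n_obs, n_states))
    for i in range(n_obs - 2, -1, -1):
        beta[i] = T @ (B[:, X[i + 1]] * beta[i + 1])

    posterior = alpha * beta
    return posterior / posterior.sum(axis=1, keepdims=True)

# sticky two-state chain with fairly reliable emissions (toy numbers)
pi = np.array([0.5, 0.5])
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])
B = np.array([[0.8, 0.2],
              [0.2, 0.8]])
print(forward_backward_toy(pi, T, B, [0, 0, 1]))
```

Each row of the result sums to one, and the first two observations pull the posterior towards state 0, just as the per-step normalization in forward_backward is meant to produce.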

Appendix A: Full code examples