# Hidden Markov Models

## Model

Hidden Markov Model (HHM) is a probability model used to model latent states ( ) based on observations which depends on the current state of the system. That is; at some step the system is in some "internal" / unobservable / latent state . We then observe an output which depends on the current state, , the system is in.

### Derivation

• Hidden states/rvs
• i.e. a discrete set of states
• Observed rvs
• is the set of all possible values that can take, e.g. discrete,

Then,

1

The reason why we use here is that elements of , can potentially be multi-dimensional, e.g. , then represents a vector of observed rvs, not just a single observed rv.

#### Parameters

• Transition probabilities/matrix:
• Emission probabilites: for
• is then a prob. dist. on
• Initial dist:

The difference between and , e.g. lower- and uppercase notation is that represents the distribution of the rv , while is the actual rv, and so means the case where the rv takes on the value . The notation for some arbitrary is then talking about a discrete distribution.

Using the substitutions from above, the join distribution is the following:

2

The emission prob. dist for the different values of the hidden states can be basically any prob. dist.

The plot below shows a problem where HMM might be a good application, where we have the states in blue and observations in red with emission probability .

## Forward-backward algorithm / alpha-beta recursion

Example of a dynamic programming algorithm. This provides us with a way of efficiently estimating the probability of the system being in some state at the step given all observations we've seen, , i.e. .

This does not provide us with a way to calculate the probability of a sequence of states, given the observations we've seen. For that we have to have a look at the Viterbi algorithm.

### Notation

• i.e. the sequence of observations made, not the rv representing some
• represents the sequence of observations from the observation to the observation

### Algorithm

We make the assumptions that we know , and .

Our goal of the entire algorithm is to compute .

Forward algorithm
computes for all
Backward algorithm
computes for all

We can write our goal in the following way:

3

Which means that if we can solve the forward and backward algorithms described above, we can compute .

What we're actually saying here is:

• Given that we know the value of the rv. the sequence of observations does not depend on the observations prior to the observation.
• Observations prior to the observation, does not depend on the observations in the future

#### Forward algorithm

Our goal is to compute . We start out with the following:

4

In words: probability of observing given that we have observed the sequence and we're in state marginalized / summed over the possible previous states .

But since the observation at the step, , is independent of given that we know . And so we have:

5

And the state, , is independent of the previous observations given that we know . And so further we have:

6

Thus, we end up with:

7

And if you go back to our assumptions you see that we have assumed that we know the emission probabilities and the transition probabilities . And thus we only have one unknown in our equation.

Tracking this recursive relationship all the way back to the , we end up with zero unknowns, since we also assumed to know the initial probability .

So just to conclude:

8

provides us with the enough info to compute in a recursive manner.

##### Time complexity
• For each step we have to sum over the possible values for the previous state , and so this is of time complexity
• But we have to do the step above for every possible value of , thus we have an additional
• Finally, we gotta do of these steps, and so we have the two steps above with total time complexity being computed times, leading us the time complexity of the Forward algorithm to be

This is cool as it is linear in the number of steps, !

#### Backward algorithm

Our goal is to compute for all . And again we try to deduce some recursive algorithm.

We start out with:

9

Then if we simply define some auxilary variable to be

10

Aaaand we got ourselves a nice recursive relationship once again! Again, by assumption, we know what and is.

And the final step is as follows

11

Thus, we conclude the Backward algorithm with the following:

12
##### Time complexity

Again we have the same reasoning as for the forward algorithm time complexity and end up with time complexity of .

### Implementation

Throughout this implementation we will be validating against this wikipedia example.

First we gotta turn those data representations into structures which our implementation can work with:

wiki_emission_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
wiki_initial_dist = np.array([[0.5, 0.5]])
wiki_observations = np.array([0, 0, 1, 0, 0])
wiki_transition_probs = np.array([[0.7, 0.3], [0.3, 0.7]])

S_wiki = np.arange(2)
T_wiki = wiki_transition_probs
X_wiki = wiki_observations
epsilon_wiki = lambda z, x: wiki_emission_probs[z][x]
pi_wiki = np.array([0.5, 0.5])

##### Forward
cnt = 0

def forward(pi, X, S, epsilon, T):
n_obs = len(X)
n_states = len(S)

alpha = np.zeros((n_obs + 1, n_states))
alpha[0] = pi

for k_minus_one, x in enumerate(X):
# readjust the index of alpha since
# we're including pi too
k = k_minus_one + 1

alpha_k = alpha[k]

for z_k in S:
alpha_k[z_k] = sum(
# p(z_k | z_{k-1}) * p(x_k | z_k) * alpha_{k-1}(z_{k-1})
T[z, z_k] * epsilon(z_k, x) * alpha[k - 1][z]
for z in S
)

# gotta normalize to get probab p(z_k, x_{1:k})
alpha[k] = alpha_k / alpha_k.sum()

return alpha

forward(pi_wiki, X_wiki, S_wiki, epsilon_wiki, T_wiki)

array([[ 0.5       ,  0.5       ],
[ 0.81818182,  0.18181818],
[ 0.88335704,  0.11664296],
[ 0.19066794,  0.80933206],
[ 0.730794  ,  0.269206  ],
[ 0.86733889,  0.13266111]])


YAY!

##### Backward
cnt = 0

def backward(X, S, epsilon, T):
n_obs = len(X)
n_states = len(S)
beta = np.zeros((n_obs + 1, n_states))  # 1 extra for initial state

beta[0] = np.ones(*S.shape)
beta[0] /= beta[0].sum()

for k_plus_one in range(n_obs):  # n-1, ..., 1
k = k_plus_one + 1               # iterating backwards yah know
beta_k = beta[k]                       # just aliasing the pointers
beta_k_plus_one = beta[k_plus_one]     # beta_{k+1}
x_plus_one = X[k_plus_one]       # x_{k+1}

for z_k in S:                    # for each val of z_k
beta_k[z_k] = sum(
# p(z_{k+1} | z_k) * p(X_{k+1} | z_{k+1}) * beta_{k+1}(z_{k+1})
T[z_k, z] * epsilon(z, x_plus_one) * beta_k_plus_one[z]
for z in S               # z_{k+1}
)

# validating the time complexity
global cnt
cnt += n_states

# normalize the beta over the z's at each stage
# since it represents p(x_{k+1:n} | z_k)
beta_k = beta_k / beta_k.sum()

# finally we gotta get the probabilities of the intial state
x = X[0]
for z_1 in S:
beta[-1][z_1] = sum(T[z_1, z] * epsilon(z, x) * beta[-2][z] for z in S)

# and normalize this too
beta[-1] /= beta[-1].sum()

# reversing the results so that we have 0..n
return beta[::-1]

backward(X_wiki, S_wiki, epsilon_wiki, T_wiki)

array([[ 0.64693556,  0.35306444],
[ 0.03305881,  0.02275383],
[ 0.0453195 ,  0.0751255 ],
[ 0.22965   ,  0.12185   ],
[ 0.345     ,  0.205     ],
[ 0.5       ,  0.5       ]])


Which is the wanted result! Great!

<2017-03-05 Sun> I find it interesting that the forward and backward algorithms provide me with similar, but slightly different answers when we run both of them independently on all of our observations.

BRAINFART! The forward algorithm aims to compute the joint distribution , while the backward algorithm computes the conditional , in the case where we give the algorithms all the observations. They're not computing the same thing, duh.

And just to validate the time-complexity:

print "Want it to be about %d; it is in fact %d" % (len(X) * len(S_wiki), cnt)

Want it to be about 400; it is in fact 20

##### Forward-backward

Now for bringing it all together: forward-backward.

cnt = 0

def forward_backward(pi, X, S, epsilon, T):
n_obs = len(X)
n_states = len(S)

posterior = np.zeros((n_obs + 1, n_states))

forward_ = forward(pi, X, S, epsilon, T)
backward_ = backward(X, S, epsilon, T)

for k in range(n_obs + 1):
posterior[k] = forward_[k] * backward_[k]
posterior[k] /= posterior[k].sum()

return posterior

forward_backward(pi_wiki, X_wiki, S_wiki, epsilon_wiki, T_wiki)

array([[ 0.64693556,  0.35306444],
[ 0.86733889,  0.13266111],
[ 0.82041905,  0.17958095],
[ 0.30748358,  0.69251642],
[ 0.82041905,  0.17958095],
[ 0.86733889,  0.13266111]])


Eureka! We got the entire posterior distribution of the hidden states given the observations!

### Discussion

<2017-02-07 Tue> Even though the plot looks nice and all, it seems weird to me that everyone seems to be talkin about issues with underflow while I'm getting overflow issues when dealing with large n o.O

I wasn't normalizing at each step, but instead waited until I had all the values and then normalized. Which means that I would get insanely large values of alpha, and thus doom.

I'm not 100% but I think it would still be a probability distribution in the end, if, you know, I could get to the end without overflow error…

### Example

predicted = []
posterior = forward_backward(pi, X, S, epsilon, T)

for p in posterior:
predicted.append(np.argmax(p))

plt.scatter(steps, X, s=25)
plt.scatter(steps, zs, s=10)
plt.scatter(steps, predicted[:-1], alpha=0.1, c="g")
plt.title("%d samples" % n)


Here you can see the actual states plotted in blue, the normally distributed observations in red, and the predictions of the hidden states as a vague green. Yeah, I should probably mess with the sizes and colors to make it more apparent..

## Viterbi algorithm

Allows us to compute the posterior probability of a sequence of states, , given some sequence of observations, .

### Notation

• i.e. the sequence of observations made, not the rv representing some
• represents the sequence of observations from the observation to the observation

### Algorithm

The main idea behind the Viterbi algorithm is to find the optimal / most probable sub-sequence at each step, and then find the optimal / most probable sub-sequence at the next step, using the previous steps optimal path.

The naive approach would be to simply obtain the probabilities for all the different sequences of states, given the sequence of observations . The issue with this approach is quite clearly the number of possible combinations, which would be , where is the number of possible states.

A "dynamic programming" approach would be to attept to find a more feasible sub-problem and check if we can recursively solve these sub-problems, reducing computation by storing intermediate solutions as we go.

#### Wikipedia

where

• , i.e. probability of first state taking value , where

Our next step is then to define in terms of

The most likely state sequence is given by the recurrence relations:

13

where

• - probability of ???
• , i.e. probability of most probable sequence of the first elements, given that the (entire?) path ends in state .

#### My interpretation

Viterbi algorithm is all about making locally optimal choices, storing this locally optimal sub-paths as you go, and reuse this computation when trying different choices of states.

You start out by the following:

1. Choose the initial state s.t. you maximize the probability of observing , i.e.

14

Now to the crucial part (and the part which makes this a dynamic programming algorithm!), we store both the probability of the sequence we've chosen so far and the states we've chosen. In this case, it's simply and .

2. Choose the next state s.t. you maximize the probability of observing , conditioned on already being in state , i.e.

15

And again, we store the current path and it's probability, i.e. and . Important: We store this somewhere else than the first sub-path obtained, i.e. and .

3. Perform this step until you reach the final step/observation, i.e. times and you end up in .

Pretty complex, ehh? Simply choosing what seems the best at each step you take.

Now, let's think about the number of computations we got to do here.

• At each step we got to decide between possible states, computing the probability for each one of them. This is required so that we know which one to choose for our next state. That is; at each step we do computations.
• We make the above choice until we reach the end, and so we this is steps.

This leads us to a complexity of , where

• is number of possible states
• is number of steps/observations

Unfortunately we're not done yet. We've performed operations, but what if the initial probability of the state we chose first, , has "horrible" probabilites as soon as you get past the first step?! Maybe if we set equal to some other value, we would get a much better probability in the end?

Hence, smart as we are, we go "Okay, let's try again, but this time we choose the second best initial state!". And because we're running out of notation, we call this . Yeah I know, it's pretty bad, but you get the picture. The denotes that this is the 2nd try until we've either realized that it's worse than our current , in which case we discard the new value, or we realize it's in fact better, and so we simply replace it, i.e. set , also replacing the probability.

But how do we decide that the new suggested value is better? By walking through the path again, of course! But, this time we're going to try and be smart.

• If we at some step reach a node that we visited in our first "walk", we don't have to keep going until the end, but instead we terminate and do the following:
• Compare the probability of the sub-sequence up until this step that we're currently looking at, with the previous best estimate for this same sub-sequnce, i.e. compare to .
• If the new sub-path is more probable, then we replace our current best estimate for this sub-path, and recompute the joint conditional distribution for the rest of the path.

• Simply divide away our previous estimate, and multiply with

new and better sub-path probability:

16

For the sake of clarification, when we use the word sub-path, we're talking about a sequence of states for some , i.e. all sub-paths start at step , ending at some . We're not talking about sequences starting at any arbitrary .

We repeat the process above for all possible initial states, which gives us another factor of in the time-complexity. This is due to having to perform the (possibly, see below) number of operations we talked about earlier for each initial state, for which there are different ones.

This means that, at worst, we will not reach any already known / visited sequence, and so will do all the steps for each possible initial state. For example, consider the case of every state only being able to transition to itself. In such cases we will have a time complexity of which is still substantially better than , the naive approach. But as I said, this is worst case scenario! More often we will find sub-sequences that we already have computed, and so we won't have to do all the steps for every initial state!

Thus, for the Viterbi algorithm we get a time complexity of (note Big-O, not Big-Theta)!

### Implementation

First we need to set up the memoization part of the algorithm. Between each "attempt" we need to remember the following:

• At each step , the probability of most the likely sub-path. We'll store as an array best﻿_sequence, with the the element pointing to an array of length , holding the sequence/path of states for our best guess so far.
• At each step , the state which was chosen by the most likely sub-path. We'll store this in an array best_sequence, with the element being the probability of the sequence/path stored at best_sequence_probas[i].
n_obs = len(X_wiki)
n_states = len(S_wiki)

best_sequence = [None] * n_obs
best_sequence_probas = np.zeros(n_obs)

n_obs, n_states

 5 2

First we choose the "best" initial state as our starting point, and then start stepping through. Best initial state corresponds to maximize the over .

Notice how we're not actually computing real probabilities as we're not normalizing. The reason for this is simply that it is not necessary to obtain the most likely sequence.

x_1 = X_wiki[0]

z_hat = S[0]
z_hat_proba = epsilon_wiki(S[0], x_1) * pi_wiki[S[0]]

initial_step_norm_factor = z_hat_proba

for z in S_wiki[1:]:
z_proba = epsilon_wiki(z, x_1) * pi_wiki[z]
initial_step_norm_factor += z_proba

if z_proba > z_hat_proba:
z_hat = z
z_hat_proba = z_proba

best_sequence[0] = 1
best_sequence_probas[0] = epsilon_wiki(1, x_1) * pi_wiki[1] / initial_step_norm_factor

best_sequence[0], best_sequence_probas[0]

 1 0.181818
for i, x in enumerate(X_wiki):
if i == 0:
continue

prev_sequence = best_sequence[i - 1]
z_hat_prev = best_sequence[i - 1]
z_hat_prev_proba = best_sequence_probas[i - 1]

# initialize
z_hat = S_wiki[0]
z_hat_proba = T_wiki[z_hat_prev, z_hat] * epsilon_wiki(z_hat, x)

step_norm_factor = z_hat_proba

for z in S_wiki[1:]:
z_proba = T_wiki[z_hat_prev, z] * epsilon_wiki(z, x)
step_norm_factor += z_proba

if z_proba > z_hat_proba:
z_hat = z
z_hat_proba = z_proba

best_sequence[i] = z_hat
best_sequence_probas[i] = z_hat_proba * z_hat_prev_proba / step_norm_factor

best_sequence, best_sequence_probas

 (1 0 1 0 0) array ((0.18181818 0.11973392 0.09269723 0.06104452 0.0557363))

When I started doing this, I wanted to save the sub-path at each step. Then I realized that this was in fact not necessary, but we instead could simply store the state at step and the probability of the previous best estimate for the sub-path being in state at step . When we're finished, we're left with a sequence which is the best estimate.

If you don't remember; what we did previously was storing the entire sub-path up until the current step, i.e. at step we stored the sequence , which is bloody unecessary. This would allow us to afterwards obtain the best estimate for any path though, so that would be nice. But whenever we would hit a better estimate for a previously obtained path, we would have to step through each element for the stored sub-paths and replace them.

That was our first run, now we gotta give it a shot for the second best one. Since, as mentioned before, we've assumed a uniform prior, we simply pick the next state.

curr_seq = [0]
curr_seq_probas = [epsilon_wiki(curr_seq[0], x_1) * pi[curr_seq[0]] / initial_step_norm_factor]

for i, x in enumerate(X_wiki):
if i == 0:
continue

z_prev = curr_seq[i - 1]
z_prev_proba = curr_seq_probas[i - 1]

# initialize the argmax process
z_next = S_wiki[0]
z_next_proba = T_wiki[z_prev, z_next] * epsilon_wiki(z_next, x)

step_norm_factor = z_next_proba

for z in S_wiki[1:]:
z_proba = T_wiki[z_prev, z] * epsilon_wiki(z, x)
step_norm_factor += z_proba

if z_proba > z_next_proba:
z_next = z
z_next_proba = z_proba

curr_seq_proba = curr_seq_probas[i - 1] * z_next_proba / step_norm_factor
curr_seq.append(z_next)
curr_seq_probas.append(curr_seq_proba)

# check if the rest of the path has been
z_hat = best_sequence[i]
if z_hat == z_next:
# we've computed this before, and the rest
# of the path is the best estimate we have
# as of right now
best_seq_proba = best_sequence_probas[i]
if best_seq_proba < curr_seq_proba:
# the new path is better, so we gotta
# replace the best estimate
best_sequence[:i + 1] = curr_seq
# and the preceding probabilities
best_sequence_probas[:i + 1] = curr_seq_probas

# then we have to recompute the probabilities
# for the rest of the best estimate path
for j in range(i + 1, n_obs):
best_sequence_probas[j] = curr_seq_proba * \
best_sequence_probas[j] \
/ best_seq_proba

print "We found a better one! Current is %f, new is %f" \
% (best_seq_proba, curr_seq_proba)

# since we've already computed the sub-path after this step
# given the state we are currently in, we terminate
break

curr_seq, curr_seq_probas

 0 0 0.818182 0.747036
best_sequence, best_sequence_probas

 (0 0 1 0 0) array ((0.81818182 0.74703557 0.57835012 0.38086471 0.34774604))

This looks very good to me!

def viterbi(pi, X, S, epsilon, T):
n_obs = len(X)

# initialize
best_sequence = [None] * n_obs
best_sequence_probas = np.zeros(n_obs)

x_1 = X[0]
initial_marginal = sum(epsilon(z, x_1) * pi[z] for z in S)

sorted_S = sorted(S, key=lambda z: epsilon(z, x_1) * pi[z], reverse=True)

for z_1 in sorted_S:
curr_seq = [z_1]
curr_seq_probas = [epsilon(z_1, x_1) * pi[z_1] / initial_marginal]

for i, x in enumerate(X):
if i == 0:
continue

z_prev = curr_seq[i - 1]
z_prev_proba = curr_seq_probas[i - 1]

# Max and argmax simultaenously
# initialize the argmax process
z_next = S[0]
z_next_proba = T[z_prev, z_next] * epsilon(z_next, x)
step_marginal = z_next_proba

for z in S[1:]:
z_proba = T[z_prev, z] * epsilon(z, x)

step_marginal += z_proba

if z_proba > z_next_proba:
z_next = z
z_next_proba = z_proba

# Memoize our results
curr_seq_proba = z_prev_proba * z_next_proba / step_marginal
curr_seq.append(z_next)
curr_seq_probas.append(curr_seq_proba)

# check if the rest of the path has been computed already
z_hat = best_sequence[i]
if z_hat == z_next:
# We've computed this before, and the rest
# of the path is the best estimate we have.
# Thus we check if this new path is better!
best_seq_proba = best_sequence_probas[i]
if best_seq_proba < curr_seq_proba:
# replace the best estimate sequence
best_sequence[:i + 1] = curr_seq
# and the corresponding probabilities
best_sequence_probas[:i + 1] = curr_seq_probas

# recompute the probabilities
# for the rest of the best estimate path
for j in range(i + 1, n_obs):
best_sequence_probas[j] = curr_seq_proba * \
best_sequence_probas[j] \
/ best_seq_proba

print "We found a better one at step %d!" \
"Current is %f, new is %f" \
% (i, best_seq_proba, curr_seq_proba)

# since we've already computed the sub-path after this step
# given the state we are currently in, we terminate
break

# gone through all the observations for this state
if i == n_obs - 1:
best_seq_proba = best_sequence_probas[i]
if best_seq_proba < curr_seq_proba:
best_sequence = curr_seq
best_sequence_probas = curr_seq_probas

return best_sequence, best_sequence_probas

viterbi(pi_wiki, X_wiki, S_wiki, epsilon_wiki, T_wiki)

 0 0 1 0 0 0.818182 0.747036 0.57835 0.380865 0.347746

#### Discussion

##### Using numpy.ndarray for curr_seq

I'm actually not sure whether or not this would speed up things. The way Python list work provides us with a ammortized constant time append method. This means that we can potentially save loads of memory compared to having to preallocate memory for n_obs elements using numpy.ndarray. Most of the time we won't be walking through the entire sequence of observations, and so most of the memory allocated won't even be used.

Nonetheless I haven't actually made any benchmarks, so it's all speculation. I just figured that it wasn't worth trying for the reason given here.

## Python class implementation

Should this even be a class? I think it sort of makes sense, considering it's kind of a model of a state machine. And so modelling it in a imperative manner makes sense in my opinion.

Would it be a good idea to implement all of these as a staticmethod, and then have the instance specific method simply call the staticmethod? This would allow us to choose whether or not we actually want to create a full instance of a HiddenMarkovModel or not.

Naaah, we simply have functions which take those arguments and then have the class HiddenMarkovModel call those methods.

Methods:

• forward
• backward
• forward-backward
• viterbi

Properties:

• posterior
• epsilon and emission_probabilities
• T and transition_probabilities
• pi and initial_probabilities
• S and states
• X and observations
class HiddenMarkovModel:
"""
A Hidden Markov Model (HMM) is fully defined by the
state space, transition probabilities, emission probabilities,
and initial probabilities of the states.

This makes it easier to define a model, and then
rerun it on different observations X.

Parameters
----------
pi: array-like floats, 1d
Initial probability vector.
S: array-like ints, 1d
Defines the possible states the system can be in.
epsilon: callable(z, x) -> float
Takes current state :int:z and observation :float:x
as argument, returning a :float: representing the
probability of seeing observation :float:x given
the system is in state :int:z.
T: array-like floats, 2d
Matrix / 2-dimensional array where the index [i][j]
gives us the probability of transitioning from state
i to state j.
X: array-like (optional)
Represents the observations. This can be a array-like
structure of basically anything, as long as :callable:epsilon
can handle the elements of X. So observations can be
scalar values, vectors, or whatever.
"""
def __init__(self, pi=None, S=None, epsilon=None, T=None, X=None):
self.pi = pi
self.S = S
self.epsilon = epsilon
self.T = T
self.X = X

# some aliases for easier use for others
self.initial_probabilities = self.pi
self.states = self.S
self.emission_probabilities = self.epsilon
self.transition_probabilities = self.T
self.observations = self.X

self.posterior = None
self.optimal_path = None

def forward(self, X=None):
if X is None:
X = self.X

return forward(self.pi, X, self.S, self.epsilon, self.T)

def backward(self, X=None):
if X is None:
X = self.X

return backward(X, self.S, self.epsilon, self.T)

def forward_backward(self, X=None):
if X is None:
X = self.X

n_obs = len(X)
n_states = len(S)

posterior = np.zeros((n_obs + 1, n_states))

forward_ = self.forward(X)
backward_ = self.backward(X)

for k in range(n_obs + 1):
posterior[k] = forward_[k] * backward_[k]
posterior[k] /= posterior[k].sum()

# Should we actually set it, or leave that to the user?
self.posterior = posterior

return posterior

def viterbi(self, X=None):
if X is None:
X = self.X

self.optimal_path = viterbi(self.pi, X, self.S, self.epsilon, self.T)
return self.optimal_path

hmm = HiddenMarkovModel(pi_wiki, S_wiki, epsilon_wiki, T_wiki)
hmm.viterbi(X_wiki)
hmm.forward_backward(X_wiki)

hmm.optimal_path

 0 0 1 0 0 0.818182 0.747036 0.57835 0.380865 0.347746