Conditional Random Fields (CRFs)

Table of Contents

Overview

  • Graphical model
  • Is a discriminative model => attempts to directly model $p(\mathbf{y} | \mathbf{x})$

Graphical models

Distributions over very many variables can often be represented as a product of local functions that depend on a much smaller subset of variables.

Undirected models

Notation

  • $Y$ is a set of rvs. that we target
  • $Y_s \in Y$ takes outcomes from the (continuous or discrete) set $\mathcal{Y}$
  • $\mathbf{y}$ is an arbitrary assignment to $Y$
  • $y_s$ denotes value assigned to $Y_s$ by $\mathbf{y}$
  • $\underset{\mathbf{y} \backslash y_s}{\sum}$ is the summation over all possible assignments $\mathbf{y}$ with $Y_s$ held fixed at $y_s$ (i.e. we sum over all the other variables)
  • $\Psi_a (\mathbf{y})$ is factor where $a$ is an index in $\{1, ..., A\}$, $A$ being the number of factors
  • $1_{y=y'}$ is an indicator function, having value $1$ if $y = y'$ and $0$ otherwise
  • $N(s)$ for some rvs. $s$ denotes the neighbors of $s$ in some graph $G$
  • $V = \{1, 2, ..., |Y| \}$ usually denotes indices for rvs. of some graph $G$
  • $F = \{ 1, 2, ..., A \}$ usually denotes the indices for factors of some graph $G$

Factors

  • We assume the probability distribution $p$ can be represented as a product of factors $\Psi_a (\mathbf{y})$
  • Each factor $\Psi_a (\mathbf{y})$ depends only on a subset $Y_a \subseteq Y$ of the variables
  • Each factor is a non-negative scalar which can be thought of as a compatibility measure between the values $\mathbf{y}_a$

Formally, given a collection of subsets $\{Y_a\}_{a=1}^A$ of $Y$, an undirected graphical model is the set of all distributions that can be written as

    \begin{equation*}
    p(\mathbf{y}) = \frac{1}{Z} \overset{A}{\underset{a=1}{\prod}} \Psi_a (\mathbf{y}_a)
    \end{equation*}

for any choice of factors $\mathcal{F} = \{\Psi_a\}$ that have $\Psi_a(\mathbf{y}_a) \ge 0$ for all $\mathbf{y}_a$.

We use the term random field to refer to a particular distribution among those defined by an undirected graphical model.

The constant $Z$ is the normalization factor (also referred to as the partition function) that ensures the distribution $p$ sums to $1$.

    \begin{equation*}
    Z = \underset{\mathbf{y}}{\sum} \overset{A}{\underset{a=1}{\prod}} \Psi_a (\mathbf{y}_a)
    \end{equation*}

Notice that the summation in $Z$ is over the exponentially many possible assignments $\mathbf{y}$.
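
To make the factorization and the cost of computing $Z$ concrete, here is a minimal brute-force sketch in Python (numpy assumed available; the three binary variables and the factor tables are invented purely for illustration):

    import itertools

    import numpy as np

    # Toy model: three binary variables Y_0, Y_1, Y_2 and two factors.
    # Each factor is (indices of its argument variables, table of non-negative values).
    factors = [
        ((0, 1), np.array([[3.0, 1.0], [1.0, 3.0]])),  # Psi_1(y_0, y_1)
        ((1, 2), np.array([[2.0, 1.0], [1.0, 2.0]])),  # Psi_2(y_1, y_2)
    ]

    def unnormalized(y):
        """Product of all factor values for the full assignment y."""
        p = 1.0
        for scope, table in factors:
            p *= table[tuple(y[i] for i in scope)]
        return p

    # The partition function Z sums this product over all 2^3 assignments --
    # exactly the exponentially large summation mentioned above.
    assignments = list(itertools.product([0, 1], repeat=3))
    Z = sum(unnormalized(y) for y in assignments)

    p = {y: unnormalized(y) / Z for y in assignments}
    assert abs(sum(p.values()) - 1.0) < 1e-12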

Factor graph

A factor graph is a bipartite graph $G = (V, E, F)$, i.e. a graph with two disjoint sets of nodes $V$ and $F$:

  • $V = \{ 1, 2, ... |Y| \}$ indexes the rvs. in the model
  • $F = \{ 1, 2, ..., A \}$ indexes the factors

By this representation, we mean that if a variable node $Y_s$ for $s \in V$ is connected to a factor node $\Psi_a$ for $a \in F$ then $Y_s$ is one of the arguments of $\Psi_a$.
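
As a tiny sketch (reusing the toy model from the snippet above; the names are just illustrative), the bipartite structure can be represented directly as the two index sets plus the factor scopes:

    # Variable nodes V and factor nodes F of the bipartite factor graph.
    V = {0, 1, 2}                  # indexes the rvs. Y_0, Y_1, Y_2
    F = {1: (0, 1), 2: (1, 2)}     # factor index -> indices of its argument variables

    # An edge (s, a) exists iff Y_s is an argument of Psi_a.
    E = {(s, a) for a, scope in F.items() for s in scope}

    def N(s):
        """Neighboring factors of the variable node s."""
        return {a for a, scope in F.items() if s in scope}

    print(N(1))                    # -> {1, 2}: Y_1 is an argument of both factors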

Markov network

A Markov network represents the conditional independence relationships in a multivariate distribution, and is a graph over random variables only. That is, $G$ is a graph over the rv. indices $V$.

A distribution $p$ is Markov with respect to $G$ if it satisfies the local Markov property: for any two variables $Y_s, Y_t \in Y$, the variable $Y_s$ is independent of $Y_t$ conditioned on its neighbors, $Y_{N(s)}$. Intuitively, this means that $Y_{N(s)}$ contains all the information that is useful for predicting $Y_s$.

An issue with a Markov network is that it is ambiguous, i.e. "not descriptive enough" from the factorization perspective. If you have a look at the figure below you can see an example of what we're talking about.

markov-field-ambiguity.png

Any distribution that factorizes as $p(y_1, y_2, y_3) \propto f(y_1, y_2, y_3)$ for some positive function $f$ is Markov with respect to this graph. However, we might want to represent the distribution using a more restricted parameterization, where $p(y_1, y_2, y_3) \propto f(y_1, y_2) g(y_2, y_3) h(y_1, y_3)$. This is a stricter family of distributions, and so one would expect it to be easier to estimate. Unfortunately, as we can see in the figure, Markov networks don't distinguish between these cases. Luckily for us, factor graphs do!
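
Here's a small throwaway check of that last point (toy code, not from the text): both parameterizations below induce the very same Markov network over three variables (the fully connected triangle in the figure), yet they correspond to different factor graphs.

    from itertools import combinations

    three_way = [(0, 1, 2)]                 # p ∝ f(y_1, y_2, y_3)
    pairwise  = [(0, 1), (1, 2), (0, 2)]    # p ∝ f(y_1, y_2) g(y_2, y_3) h(y_1, y_3)

    def markov_edges(factor_scopes):
        """Two variables are adjacent in the Markov network iff they share a factor."""
        return {frozenset(pair)
                for scope in factor_scopes
                for pair in combinations(scope, 2)}

    # Same Markov network, so it cannot tell the two parameterizations apart.
    assert markov_edges(three_way) == markov_edges(pairwise)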

Directed Models

  • Describes how the distribution factorizes into local conditional probability distributions.

Inference

Belief propagation

Notation

  • $s$ is the variable we want to compute the marginal distribution for
  • $N(s)$ is the set of factors neighboring the variable $s$; $a \in N(s)$ is the index of one such neighboring factor
  • $V_a$ is the set of variables "upstream" of $a$, i.e. the set of variables $v$ for which the factor $a$ lies between $v$ and $s$ ($a$ separates $v$ from $s$)
  • $F_a$ is the set of factors "upstream" of $a$, i.e. the set of factors which $a$ separates from $s$, including the factor $a$ itself
  • Message from $a$ to $s$

    \begin{equation*}
    m_{as}(y_s) = \sum_{\mathbf{y}_{V_a}} \prod_{\Psi_b \in F_a} \Psi_b(\mathbf{y}_b)
    \end{equation*}

Stuff

  • Direct generalization of the exact inference algorithms for linear-chain CRFs (i.e. the forward-backward algorithm for computing marginals and the Viterbi algorithm for computing the most likely assignment)
  • Each neighboring factor of $s$ makes a multiplicative contribution to the marginal of $s$, called a message

In the special case where our graph is a tree, the graph itself is the one and only spanning tree, hence we can partition the entire graph wrt. $s$: the sets $V_a$ for $a \in N(s)$, together with $\{ s \}$, form a partition of the variables in $G$. In a bit simpler wording:

  • You're looking at $s$
  • Pick any other variable (node in the graph) $v$
  • Since the graph is a tree, there is exactly one path from $v$ to $s$, and that path enters $s$ through exactly one of its neighboring factors $a$; assign $v$ to the corresponding set $V_a$
  • Therefore, the sets $V_1, V_2, \dots, V_{N_s}$, together with $\{ s \}$, form a partition of the variables in $G$, where $N_s$ denotes the number of neighboring factors of $s$
  • Thus, we can split up the summation required for the marginal into a product of independent subproblems:
\begin{equation*}
\begin{split}
  p(y_s) & \propto \sum_{\mathbf{y} \setminus y_s} \prod_a \Psi_a(\mathbf{y}_a) \\
  &= \prod_{a \in N(s)} \sum_{\mathbf{y}_{V_a}} \prod_{\Psi_b \in F_a} \Psi_b(\mathbf{y}_b)
\end{split}
\end{equation*}
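
A quick brute-force sanity check of this decomposition (toy numbers, not from the text): for the chain $Y_0 - \Psi_A - Y_1 - \Psi_B - Y_2$ with $s = 1$, we have $V_A = \{0\}$, $F_A = \{A\}$, $V_B = \{2\}$, $F_B = \{B\}$, and the marginal of $Y_1$ indeed splits into one independent sum per neighboring factor.

    import numpy as np

    Psi_A = np.array([[3.0, 1.0], [1.0, 3.0]])   # arguments (Y_0, Y_1)
    Psi_B = np.array([[2.0, 1.0], [1.0, 2.0]])   # arguments (Y_1, Y_2)

    # Left-hand side: marginalize the full (unnormalized) joint over y_0 and y_2.
    joint = np.einsum("ij,jk->ijk", Psi_A, Psi_B)    # indexed by (y_0, y_1, y_2)
    marginal = joint.sum(axis=(0, 2))                # unnormalized p(y_1)

    # Right-hand side: one independent subproblem per neighboring factor of Y_1.
    m_A = Psi_A.sum(axis=0)    # sum over the upstream variable Y_0
    m_B = Psi_B.sum(axis=1)    # sum over the upstream variable Y_2

    assert np.allclose(marginal, m_A * m_B)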

Denoting each term of the outer product in the above equation by $m_{as}$, i.e.,

\begin{equation*}
m_{as}(y_s) = \sum_{\mathbf{y}_{V_a}} \prod_{\Psi_b \in F_a} \Psi_b(\mathbf{y}_b)
\end{equation*}

which can be thought of as a message from $a$ to $s$.

Similarly, if we instead partition the graph with respect to a factor $a$ and take the product over its neighboring variables, we can define the message from variables to factors as

\begin{equation*}
m_{sa}(y_s) = \sum_{\mathbf{y}_{V_s}} \prod_{\Psi_b \in F_s} \Psi_b(\mathbf{y}_b)
\end{equation*}

which would give us

\begin{equation*}
p(\mathbf{y}_a) \propto \prod_{s \in N(a)} \sum_{\mathbf{y}_{V_s}} \prod_{\Psi_b \in F_s} \Psi_b (\mathbf{y}_b)
\end{equation*}

Let's break this down:

  • $s \in N(a)$: the product runs over all variables $s$ that are neighbors of the factor $a$
  • $\mathbf{y}_{V_s}$: all possible assignments to the variables in $V_s$ (i.e. the variables on $s$'s side of the factor $a$, including $s$ itself)
  • $\prod_{\Psi_b \in F_s} \Psi_b(\mathbf{y}_b)$: the product over all factors $b$ upstream of the variable $s$, excluding $a$ itself (which is our "focus" at the moment)

Observe that when talking about the "marginal of a factor", we're looking at the marginal of a vector $\mathbf{y}_a$ rather than a single value $y_s$ as in the marginal of a variable.

  • $p(\mathbf{y}_a)$ denotes the marginal over the vector of random variables $y_s$ with $s \in N(a)$ which are the arguments of $\Psi_a$, NOT the distribution over the output of $\Psi_a$

Luckily, we can rewrite the messages recursively:

\begin{equation*}
\begin{split}
  m_{as}(y_s) &= \sum_{\mathbf{y}_a \setminus y_s} \Psi_a(\mathbf{y}_a) \prod_{t \in N(a) \setminus s} m_{ta}(y_t) \\
  m_{sa}(y_s) &= \prod_{b \in N(s) \setminus a} m_{bs}(y_s)
\end{split}
\end{equation*}

These recursion relations define the belief propagation algorithm.

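Here is a compact sketch of the sum-product message passing implied by these relations, on the same toy chain as before (pairwise factors over three binary variables; the code and its factor tables are illustrative, not a reference implementation). On a tree, synchronous updates converge after a few sweeps, and the belief at each variable, i.e. the normalized product of its incoming messages, equals the exact marginal.

    import numpy as np

    K = 2                                        # each variable is binary
    factors = {                                  # factor index -> (scope, table)
        "A": ((0, 1), np.array([[3.0, 1.0], [1.0, 3.0]])),
        "B": ((1, 2), np.array([[2.0, 1.0], [1.0, 2.0]])),
    }
    variables = [0, 1, 2]

    def N(s):
        """Factors neighboring the variable s."""
        return [a for a, (scope, _) in factors.items() if s in scope]

    # Messages m_{a -> s} and m_{s -> a}, initialized to uniform.
    m_as = {(a, s): np.ones(K) for a, (scope, _) in factors.items() for s in scope}
    m_sa = {(s, a): np.ones(K) for (a, s) in m_as}

    for _ in range(5):                           # a few sweeps, more than the tree diameter
        # Variable-to-factor: product of incoming factor messages, excluding a.
        for (s, a) in m_sa:
            incoming = [m_as[(b, s)] for b in N(s) if b != a]
            m_sa[(s, a)] = np.prod(incoming, axis=0) if incoming else np.ones(K)
        # Factor-to-variable: sum out the other argument of the pairwise factor.
        for (a, s) in m_as:
            (u, v), table = factors[a]
            m_as[(a, s)] = table @ m_sa[(v, a)] if s == u else table.T @ m_sa[(u, a)]

    # Beliefs: product of incoming messages, normalized.
    for s in variables:
        b = np.prod([m_as[(a, s)] for a in N(s)], axis=0)
        print(s, b / b.sum())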

In the case where the graph $G$ is not a tree, we can still iterate the same updates in the hope of reaching a fixed point, but we no longer have a guarantee of convergence; this variant is referred to as loopy belief propagation.

As it turns out, Loopy BP can in fact be seen as a variational method, meaning that there actually exists an objective function over beliefs that is approximately minimized by the iterative BP procedure.

Appendix A: Vocabulary

random field
a particular distribution among those defined by an undirected graphical model
bipartite graph
set of graph vertices decomposed into two disjoint sets s.t. no two graph vertices from the same set are adjacent / "directly connected".
markov blanket
the set of neighbors of a variable which, when conditioned on, makes the variable independent of all other variables
multivariate distribution
distribution of multiple variables

Appendix B: Discriminative vs. Generative methods

Discriminative model
attempts to model $p(y | \mathbf{x})$ directly
Generative model
attempts to model $p(\mathbf{x}, y)$, and thus indirectly model $p(y | \mathbf{x})$ using Bayes' rule

First, it's very hard (impossible?) to tell beforehand whether a generative or discriminative model will perform better on a given dataset. Nonetheless, there are some benefits and drawbacks to both.

Benefits of discriminative:

  • Agnostic about the form of $p(\mathbf{x})$
  • Make independence assumptions among $\mathbf{y}$, and assumptions about how $\mathbf{y}$ depends on $\mathbf{x}$, but do not make conditional independence assumptions among the components of $\mathbf{x}$
  • Thus, these models are better suited for including rich, overlapping (dependent) features.
  • Better probability estimates than generative models; consider a vector of features $\mathbf{x} = (x_1, x_2, ..., x_k)$ which we transform to repeat the features, like so: $\mathbf{x}' = (x_1, x_1, x_2, x_2, ..., x_k, x_k)$. Then train, say, a naive Bayes classifier on this data. Even though no new information has been added (we just added completely correlated copies of the features), the model will be more confident in its probability estimates, i.e. $p(y | \mathbf{x}')$ will be further away from $0.5$ than $p(y| \mathbf{x})$. Note that this in a way demonstrates the independence assumption between the features which is being made by a generative model (see the sketch after this list).
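
A quick experiment illustrating the duplicated-feature argument above; it assumes scikit-learn is available, and the synthetic data and printed numbers are only meant to show the direction of the effect, not anything precise:

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=500)
    # Three noisy binary features that each weakly indicate the label.
    X = (rng.random((500, 3)) < np.where(y[:, None] == 1, 0.6, 0.4)).astype(int)
    X_dup = np.hstack([X, X])      # duplicate every feature: no new information

    p     = BernoulliNB().fit(X, y).predict_proba(X)[:, 1]
    p_dup = BernoulliNB().fit(X_dup, y).predict_proba(X_dup)[:, 1]

    # The generative model treats the copies as independent evidence, so its
    # estimates move further from 0.5 even though the information is unchanged.
    print(np.mean(np.abs(p - 0.5)), np.mean(np.abs(p_dup - 0.5)))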

Benefits of generative model:

  • More natural for handling latent variables, partially-labeled data, and unlabeled data. Even in the extreme case of entirely unlabeled data, generative models can be applied, while discriminative methods are less natural.
  • In some cases, a generative model can perform better, intuitively because $p(\mathbf{x})$ might have a smoothing effect on the conditional.
  • If application of the model requires predicting both future inputs and outputs, generative models are a good choice since they model the joint distribution, and so we can potentially marginalize over whatever we don't want to predict to obtain a prediction for our target variable (be it input or output).

It may not be clear why these approaches are different, as you could convert from one to the other, both ways, using Bayes' rule.

TODO Ongoing work

One interesting way to look at it is as follows:

Suppose we have a generative model $p_g$ which takes the form:

\begin{equation*}
      p_g(\mathbf{y}, \mathbf{x}; \theta) = p_g(\mathbf{y}; \theta) p_g(\mathbf{x} | \mathbf{y}; \theta)
\end{equation*}

And using the chain rule of probability we get

\begin{equation*}
      p_g(\mathbf{y}, \mathbf{x}; \theta) = p_g(\mathbf{x}; \theta) p_g(\mathbf{y} | \mathbf{x}; \theta)
\end{equation*}

where:

  • $p_g(\mathbf{x}; \theta) = \underset{\mathbf{y}}{\sum} p_g(\mathbf{y}, \mathbf{x}; \theta)$ is computed by inference
  • $p_g(\mathbf{y} | \mathbf{x}; \theta) = \frac{p_g(\mathbf{y}, \mathbf{x}; \theta)}{p_g(\mathbf{x}; \theta)}$ is computed by inference (see the toy example after this list)
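
A tiny concrete version of the two bullet points above (the joint table is made up): given a discrete joint $p_g(\mathbf{y}, \mathbf{x})$, we recover $p_g(\mathbf{x})$ by marginalizing and $p_g(\mathbf{y} | \mathbf{x})$ by dividing.

    import numpy as np

    # Rows index y in {0, 1}, columns index x in {0, 1, 2}; entries sum to 1.
    joint = np.array([[0.10, 0.25, 0.15],
                      [0.20, 0.05, 0.25]])

    p_x = joint.sum(axis=0)        # p_g(x) = sum_y p_g(y, x)
    p_y_given_x = joint / p_x      # p_g(y | x) = p_g(y, x) / p_g(x)

    assert np.allclose(p_y_given_x.sum(axis=0), 1.0)
    print(p_x)
    print(p_y_given_x)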

And then suppose we have another model, but this one a discriminative model $p_c$:

Resources

  • Named Entity Recognition with a Maximum Entropy Approach: a paper using MEMMs for word-tagging. I link this here since MEMMs are very similar to CRFs, but also because they provide a very nice set of features for use in word-tagging, which is something you might want to use CRFs for.

    In fact, both CRFs and MEMMs are discriminative sequence models, unlike the generative HMMs. Both CRFs and MEMMs are MaxEnt (maximum entropy) models, but a CRF models the entire sequence jointly while an MEMM makes decisions for each state independently of the other states.