# Conditional Random Fields (CRFs)

## Overview

- A graphical model
- A *discriminative* model, i.e. it attempts to directly model the conditional $p(y \mid x)$

## Graphical models

Distributions over very many variables can often be represented as
a *product of local functions* that depend on a much smaller subset
of variables.
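For example (a hypothetical three-variable case), a distribution might factor as

$$p(y_1, y_2, y_3) = \frac{1}{Z}\, f(y_1, y_2)\, g(y_2, y_3)$$

where $Z$ is a normalization constant and each local function depends on only two of the three variables.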

### Undirected models

#### Notation

- $Y$ is the set of rvs. that we target
- Each $Y_s \in Y$ takes outcomes from the (continuous or discrete) set $\mathcal{Y}$
- $y$ is an arbitrary assignment to $Y$
- $y_s$ denotes the value assigned to $Y_s$ by $y$
- $\sum_{y \setminus y_s}$ is the summation over all possible combinations of $y$ where $y_s$ is held fixed
- $\Psi_a$ is a *factor*, where $a$ is an index in $F = \{1, \dots, A\}$, $A$ being the number of factors
- $\mathbf{1}_{\{y = y'\}}$ is an indicator variable, having value $1$ if $y = y'$ and $0$ otherwise
- $N(s)$ for some rv. $Y_s$ denotes the *neighbors* of $Y_s$ in some graph
- $V$ usually denotes the indices for rvs. of some graph
- $F$ usually denotes the indices for factors of some graph

#### Factors

- We believe the probability distribution can be represented as a product of *factors*
- Each factor depends only on a *subset* of the variables
- Each factor is a non-negative scalar, which can be thought of as a compatibility measure between the values of its arguments

Formally, given a collection of subsets $\{Y_a\}_{a=1}^{A}$ of $Y$, an **undirected graphical model**
is the *set* of all distributions that can be written as

$$p(y) = \frac{1}{Z} \prod_{a=1}^{A} \Psi_a(y_a)$$

for any choice of *factors* $\{\Psi_a\}$ that have $\Psi_a(y_a) \geq 0$ for all $y_a$.

We use the term **random field** to refer to a *particular distribution*
among those defined by an undirected graphical model.

The constant $Z$ is the normalization factor (also referred to as the
*partition function*) that ensures the distribution sums to $1$:

$$Z = \sum_{y} \prod_{a=1}^{A} \Psi_a(y_a)$$

Notice that the summation in $Z$ is over the *exponentially* many possible assignments $y$.
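As a concrete sketch (with made-up factor values), the normalization can be computed by brute-force enumeration for three binary variables and two hypothetical pairwise factors:

```python
import itertools

# Two hypothetical pairwise factors over binary variables:
# psi_a(y1, y2) and psi_b(y2, y3).
psi_a = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
psi_b = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}

# Z sums the product of factors over all 2**3 joint assignments --
# the number of terms grows exponentially with the number of variables.
Z = sum(psi_a[y1, y2] * psi_b[y2, y3]
        for y1, y2, y3 in itertools.product((0, 1), repeat=3))

def p(y1, y2, y3):
    """Normalized probability of one assignment."""
    return psi_a[y1, y2] * psi_b[y2, y3] / Z

total = sum(p(*y) for y in itertools.product((0, 1), repeat=3))
print(Z, total)  # the normalized distribution sums to 1.0
```

With 100 binary variables the same sum would have $2^{100}$ terms, which is why exact inference needs more structure than brute force.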

#### Factor graph

A **factor graph** is a *bipartite* graph $G = (V, F, E)$, i.e. a graph with
two *disjoint* sets of nodes $V$ and $F$:

- $V$ indexes the rvs. in the model
- $F$ indexes the factors

By this representation, we mean that if a *variable node* for $Y_s$ is connected to a
factor node for $\Psi_a$, then $Y_s$ is one of the *arguments* of $\Psi_a$.

##### Markov network

A **Markov network** represents the conditional independence relationships
in a multivariate distribution, and is a graph over random variables *only*.
That is, $G$ is a graph over the integers (rv. indices) $V$.

A distribution $p$ is *Markov with respect to $G$* if it satisfies the local
**Markov property**: for any two variables $Y_s, Y_t \in Y$, the variable $Y_s$
is independent of $Y_t$ conditioned on its neighbors, $Y_{N(s)}$. Intuitively,
this means that $Y_{N(s)}$ contains all the information that is useful for
predicting $Y_s$.

An issue with a Markov network is that it is ambiguous, i.e. "not descriptive enough" from the factorization perspective. If you have a look at the figure below you can see an example of what we're talking about.

Any distribution that factorizes as $p(y) \propto f(y_1, y_2, y_3)$ for some
positive function $f$ is Markov with respect to this graph. *However*, we
might want to represent the graph using a more *restricted* parameterization,
where $p(y) \propto f(y_1, y_2)\, g(y_2, y_3)\, h(y_1, y_3)$. This is a stricter
family of distributions, and so one would expect it to be easier to estimate.
Unfortunately, as we can see in the figure, Markov networks don't
distinguish between these cases. Luckily, *factor graphs* do!

#### Directed Models

- Describes how the distribution factorizes into a product of local conditional probability distributions.

## Inference

### Belief propagation

#### Notation

- $Y_s$ is the variable we want to compute the marginal distribution for
- $N(s)$ is the **neighborhood of factors** of $Y_s$, and $a \in N(s)$ is the **index of a neighboring factor** of $Y_s$
- $V_a$ is the set of variables "upstream" of $a$, i.e. the set of variables $Y_v$ for which the *factor* $\Psi_a$ is *between* $Y_s$ and $Y_v$ ($a$ separates $s$ and $v$), with $a$ being the *index* of a neighboring factor of $Y_s$
- $F_a$ is the set of factors "upstream" of $a$, i.e. the set of factors $\Psi_b$ for which $a$ is *between* those and $Y_s$ ($a$ "separates" $Y_s$ from this set of factors), including the factor $\Psi_a$ itself
- $m_{as}(y_s)$ is the **message** from factor $a$ to variable $s$

#### Key ideas

- A direct generalization of the *exact* inference algorithms for linear-chain CRFs (i.e. the Viterbi algorithm for the most likely sequence and the forward-backward algorithm for marginal distributions)
- Each neighboring factor of $Y_s$ makes a multiplicative contribution to the marginal of $Y_s$, called a **message**

In the special case where our graph is a tree, the graph itself is the one and only spanning tree, hence we can partition the entire graph wrt. $Y_s$: the *sets* $V_a$ form a *partition* of the variables in $Y \setminus \{Y_s\}$. In slightly simpler wording:

- You're looking at $Y_s$
- Choose some other variable $Y_v$ (node in the graph)
- All variables *between* $Y_s$ and $Y_v$ are along a *single path* (i.e. there is only one possible way to traverse from $Y_s$ to $Y_v$)
- Therefore, the sets $V_{a_1}, \dots, V_{a_d}$ form a *partition* of the variables in $Y \setminus \{Y_s\}$, where $d$ denotes the number of neighboring factors of $Y_s$
- Thus, we can split up the summation required for the marginal into a product of *independent subproblems*:

$$p(y_s) \propto \prod_{a \in N(s)} \sum_{y_{V_a}} \prod_{b \in F_a} \Psi_b(y_b)$$

Denoting each of the factors in the above equation by $m_{as}(y_s)$, i.e.,

$$m_{as}(y_s) = \sum_{y_{V_a}} \prod_{b \in F_a} \Psi_b(y_b)$$

which can be thought of as a **message** from the factor $a$ to the variable $s$.

And if we instead take the product over neighboring *factors*, we can define the message **from variables to factors** as

$$m_{sa}(y_s) = \prod_{b \in N(s) \setminus \{a\}} m_{bs}(y_s)$$

which would give us the marginal for a *factor*,

$$p(y_a) \propto \Psi_a(y_a) \prod_{s \in N(a)} m_{sa}(y_s)$$

Let's break this down:

- $\prod_{s \in N(a)}$ is the product over all *variables* which are neighbors to the *factor* $\Psi_a$
- $y_a$ ranges over all possible values the *variables* $Y_a$ (i.e. "the variables attached to $\Psi_a$, including $Y_s$") can take on
- $\prod_{b \in N(s) \setminus \{a\}}$ is the product over all neighboring *factors* of the variable $Y_s$, excluding $\Psi_a$ itself (which is our "focus" at the moment)

Observe that when talking about the "marginal for a *factor* $\Psi_a$", we're looking at the marginal of a *vector* $y_a$ rather than a single value as in the marginal for a *variable*.

- $Y_a$ denotes the vector of random variables which are *incoming* to $\Psi_a$, NOT the distribution over the *output* of $\Psi_a$

Luckily, we can rewrite the messages *recursively*:

$$m_{as}(y_s) = \sum_{y_a \setminus y_s} \Psi_a(y_a) \prod_{t \in N(a) \setminus \{s\}} m_{ta}(y_t)$$

$$m_{sa}(y_s) = \prod_{b \in N(s) \setminus \{a\}} m_{bs}(y_s)$$

**Belief propagation** is the algorithm defined by these recursion relations.
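The message computations can be sketched on a tiny tree-structured example — a hypothetical 3-variable chain with made-up factor values — and checked against a brute-force marginal:

```python
import itertools

# A minimal sum-product sketch on the chain  Y1 -- psi_a -- Y2 -- psi_b -- Y3,
# all variables binary, factor values chosen arbitrarily.
psi_a = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}  # psi_a(y1, y2)
psi_b = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 1.0, (1, 1): 1.0}  # psi_b(y2, y3)

# Brute-force marginal p(y2) for reference.
unnorm = [sum(psi_a[y1, y2] * psi_b[y2, y3]
              for y1, y3 in itertools.product((0, 1), repeat=2))
          for y2 in (0, 1)]
Z = sum(unnorm)
brute = [u / Z for u in unnorm]

# Messages: a leaf variable sends the constant-1 message to its factor,
# and a factor-to-variable message sums out the upstream variable.
m_1a = [1.0, 1.0]  # m_{1 -> a}(y1)
m_3b = [1.0, 1.0]  # m_{3 -> b}(y3)
m_a2 = [sum(psi_a[y1, y2] * m_1a[y1] for y1 in (0, 1)) for y2 in (0, 1)]
m_b2 = [sum(psi_b[y2, y3] * m_3b[y3] for y3 in (0, 1)) for y2 in (0, 1)]

# The marginal of Y2 is proportional to the product of its incoming messages.
bp = [m_a2[y2] * m_b2[y2] for y2 in (0, 1)]
s = sum(bp)
bp = [b / s for b in bp]

assert all(abs(b - r) < 1e-12 for b, r in zip(bp, brute))
print(bp)
```

On a chain like this, the two passes of messages are exactly the forward-backward algorithm mentioned above.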

In the case where the graph is *not* a tree, we no longer have a guarantee of convergence: we initialize all messages, iterate the updates, and hope to reach a fixed point. This algorithm is referred to as **loopy belief propagation**.

As it turns out, loopy BP can in fact be seen as a variational method, meaning that there actually exists an objective function over beliefs (the Bethe free energy) that is *approximately* minimized by the iterative BP procedure.

## Appendix A: Vocabulary

- random field
  - a particular distribution among those defined by an undirected graphical model
- bipartite graph
  - a graph whose vertices are decomposed into two *disjoint* sets s.t. no two vertices from the same set are adjacent / "directly connected"
- Markov blanket
  - the set of neighbors of a variable which, conditioned on, render the variable independent of the rest of the graph
- multivariate distribution
  - a distribution over multiple variables

## Appendix B: Discriminative vs. Generative methods

- Discriminative model
  - attempts to model $p(y \mid x)$ directly
- Generative model
  - attempts to model the joint $p(x, y)$, and thus indirectly models $p(y \mid x)$ using Bayes' rule

First, it's very hard (perhaps impossible) to tell beforehand whether a generative or discriminative model will perform better on a given dataset. Nonetheless, both come with benefits and drawbacks.

Benefits of **discriminative** models:

- Agnostic about the form of $p(x)$
- Make independence assumptions among the $y$s and about how the $y$s depend on $x$, but do **not** make conditional independence assumptions among the $x$s
  - Thus, these models are better suited for including rich, overlapping (dependent) features
- Better probability estimates than generative models; consider a vector of features $x = (x_1, \dots, x_K)$ which we transform to repeat the features, like so: $x' = (x_1, x_1, \dots, x_K, x_K)$. Then train, say, a naive Bayes model on this data. Even though no new information has been added (we just added completely correlated copies of the features), the model will become more confident in its probability estimates, i.e. $p(y \mid x')$ will be further away from $0.5$ than $p(y \mid x)$. Note that this demonstrates the independence assumption between the features which is being made by a generative model.
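This overconfidence effect is easy to reproduce. Below is a minimal sketch (with made-up likelihood values) of a Bernoulli naive Bayes posterior; duplicating every feature moves the posterior further from $0.5$ even though no information was added:

```python
import math

# Hand-set (hypothetical) per-feature likelihoods:
# p(x_i = 1 | y = 1) = 0.8, p(x_i = 1 | y = 0) = 0.4, uniform prior p(y=1) = 0.5.
def nb_posterior(x, p1=0.8, p0=0.4):
    """Posterior p(y=1 | x) under the naive (feature-independence) assumption."""
    like1 = math.prod(p1 if xi else 1 - p1 for xi in x)
    like0 = math.prod(p0 if xi else 1 - p0 for xi in x)
    return like1 / (like1 + like0)

x = [1, 0, 1]
x_dup = [xi for xi in x for _ in range(2)]  # repeat every feature once

print(nb_posterior(x))      # ~0.571
print(nb_posterior(x_dup))  # ~0.640, further from 0.5 with no new information
```

Because naive Bayes multiplies per-feature likelihoods, the duplicated (perfectly correlated) features are counted as independent evidence, squaring each likelihood ratio.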

Benefits of **generative** models:

- More natural for handling latent variables, partially-labeled data, and unlabeled data. Even in the extreme case of entirely unlabeled data, generative models can be applied, while discriminative methods are less natural.
- In some cases a generative model can perform better, intuitively because the model of $p(x)$ might have a smoothing effect on the conditional.
- If application of the model requires predicting future *inputs* as well as outputs, generative models are a good choice since they model the *joint* distribution, and so we can marginalize over whatever we *don't* want to predict to obtain a prediction for our target variable (be it input or output).

It may not be clear why these approaches are different, as you could convert from one to the other, both ways, using Bayes' rule.

### TODO Ongoing work

One interesting way to look at it is as follows:

Suppose we have a *generative* model which takes the form $p_\theta(x, y)$.

And using the chain rule of probability we get

$$p_\theta(x, y) = p_\theta(y \mid x)\, p_\theta(x)$$

where:

- $p_\theta(y \mid x)$ is computed by inference
- $p_\theta(x)$ is computed by inference (marginalizing out $y$)

And then suppose we have another model, but this one is a *discriminative* model $q_\phi(y \mid x)$:

## Resources

*Named Entity Recognition with a Maximum Entropy Approach*: a paper using MEMMs for word-tagging. I link this here since MEMMs are very similar to CRFs, but also because it provides a very nice set of features for use in word-tagging, which is something you might want to use CRFs for.

In fact, both CRFs and MEMMs are discriminative sequence models, unlike the generative HMMs. Both CRFs and MEMMs are MaxEnt (maximum entropy) models, but a CRF models the *entire sequence* jointly, while an MEMM makes decisions for each state independently of the other states.