Conditional Random Fields (CRFs)

Table of Contents

Overview

  • Graphical model
  • Is a discriminative model => attempts to directly model conditional_random_fields_86c3e954b12721b3c2b701da6db84f92514b2f18.png

Graphical models

Distributions over very many variables can often be represented as a product of local functions that depend on a much smaller subset of variables.

Undirected models

Notation

  • conditional_random_fields_3ca302b80fb078e124ac6b194794815069394f1a.png is a set of rvs. that we target
  • conditional_random_fields_b36676847a55b51b2bf4ac7e5fc24ea9a3eb457c.png takes outcomes from the (continuous or discrete) set conditional_random_fields_78fe10aaa74d6bed9ab2238be53774fe93bfbf0e.png
  • conditional_random_fields_29c26837fff0673a98d6523a80b727d38797f426.png is an arbitrary assignment to conditional_random_fields_3ca302b80fb078e124ac6b194794815069394f1a.png
  • conditional_random_fields_5d8382a9fce8f1444904ab0e9e37f228c3d3b49e.png denotes value assigned to conditional_random_fields_b9ef6b6eae1302cd9376186c18c5c9deb79de1fb.png by conditional_random_fields_29c26837fff0673a98d6523a80b727d38797f426.png
  • conditional_random_fields_db736db08a1ae929986c2303329a14a654e73897.png is summation over all possible combinations of conditional_random_fields_29c26837fff0673a98d6523a80b727d38797f426.png where conditional_random_fields_5180e837714186b45b37eed238fb22b38871789f.png
  • conditional_random_fields_6673bf10cbe9d2637727c15b1b7c29830112c37d.png is factor where conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png is an index in conditional_random_fields_01e3a84e971ae604c0e6cb58613128d241051d1e.png, conditional_random_fields_8939ac39ed16887f3b3c033a47d4ab60d120251a.png being the number of factors
  • conditional_random_fields_7decdcb76704547d9f774b707811a36b514df798.png is a indicator variable, having value conditional_random_fields_7ba43ff00864c097bf5a587573e23164e2417a0b.png if conditional_random_fields_c9976ef028406c341619065ec811b9aedb8bb5d7.png and conditional_random_fields_d83ba0b34682ac87f8d84c8310f6bbd28d1fe65b.png otherwise
  • conditional_random_fields_a8f0934a01b2829c58aa965fb91cbccee114424a.png for some rvs. conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png denotes the neighbors of conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png in some graph conditional_random_fields_2de98136973021abb46a5a3fc1e4318bafb84264.png
  • conditional_random_fields_d2e7d6be78e2eb303e3f969beafd8ba142094bde.png usually denotes indices for rvs. of some graph conditional_random_fields_2de98136973021abb46a5a3fc1e4318bafb84264.png
  • conditional_random_fields_0438b73607389fefbca3dd48b96b209856ed1d00.png usually denotes the indices for factors of some graph conditional_random_fields_2de98136973021abb46a5a3fc1e4318bafb84264.png

Factors

  • Believe probability distribution conditional_random_fields_fefe9e556d399665a26a37824ec578cbffb0cabe.png can be represented as a product of factors conditional_random_fields_6673bf10cbe9d2637727c15b1b7c29830112c37d.png
  • Each factor conditional_random_fields_6673bf10cbe9d2637727c15b1b7c29830112c37d.png depends only on a subset conditional_random_fields_46305e5f92376df3d2116588be2f3cf775b04d3f.png of the variables
  • Non-negative scalar which can be thought of a compatibility measure between the values conditional_random_fields_ea89a2620d072880d1edc4199808be946190b40c.png

Formally, given a collections of subs conditional_random_fields_b99a2f12edb4e9993c505a0940d19c6524bbf1f9.png of conditional_random_fields_3ca302b80fb078e124ac6b194794815069394f1a.png, an unidirected graphical model is the set of all distributions that can be written as

conditional_random_fields_93854b560e3777d8557d18a9c83438f8b0322893.png

for any choice of factors conditional_random_fields_956c99406fecb4274282edba49fbdebfad7fafb3.png that have conditional_random_fields_a2f5e09313b3f397fd005a9c9ef5eaa1dff3b145.png for all conditional_random_fields_ea89a2620d072880d1edc4199808be946190b40c.png.

We use the term random field to refer to a particular distribution among those defined by a unidirected graphical model.

The constant conditional_random_fields_0aa0470120f07b61ef9e985225cf9bbd6b79177c.png is the normalization factor (also referred to as the parition function) that ensures the distribution conditional_random_fields_fefe9e556d399665a26a37824ec578cbffb0cabe.png sums to conditional_random_fields_7ba43ff00864c097bf5a587573e23164e2417a0b.png.

conditional_random_fields_d75e759f236b7d9b52446f6572694de0dc435f1f.png

Notice that the summation in conditional_random_fields_0aa0470120f07b61ef9e985225cf9bbd6b79177c.png is over the exponentially many possible assignments conditional_random_fields_29c26837fff0673a98d6523a80b727d38797f426.png.

Factor graph

A factor graph is biparite graph conditional_random_fields_822355074d30fc49f3ba5ce7b9966db945b3ca01.png, i.e. a graph with two disjoint sets of nodes conditional_random_fields_79dc7b66c784a82e557ef3df794e82b67a292b17.png and conditional_random_fields_e08421ac16591ee60adecc1e14a359550f81a1e2.png:

  • conditional_random_fields_ed5335aa3d79877f1325a9c88ca8884ffe2022e5.png indexes the rvs. in the model
  • conditional_random_fields_0438b73607389fefbca3dd48b96b209856ed1d00.png indexes the factors

By this representation, we mean that if a variable node conditional_random_fields_b9ef6b6eae1302cd9376186c18c5c9deb79de1fb.png for conditional_random_fields_fcc0a9f8fe554673696bfc45aab261811771e989.png is connected to a factor node conditional_random_fields_ada4a80c345bd3a0947224d17cfa595249ce3bac.png for conditional_random_fields_f633ce510c0a77282cf1a59283f074496f9e5f65.png then conditional_random_fields_b9ef6b6eae1302cd9376186c18c5c9deb79de1fb.png is one of the arguments of conditional_random_fields_ada4a80c345bd3a0947224d17cfa595249ce3bac.png.

Markov network

A Markov network represents the conditional independence relationships in a multivariate distribution, and is a graph over random variables only. That is; conditional_random_fields_2de98136973021abb46a5a3fc1e4318bafb84264.png is a graph over integers (rv. indices) conditional_random_fields_79dc7b66c784a82e557ef3df794e82b67a292b17.png.

A distribution conditional_random_fields_fefe9e556d399665a26a37824ec578cbffb0cabe.png is Markov with respect to conditional_random_fields_2de98136973021abb46a5a3fc1e4318bafb84264.png if it satisfies the local Markov property: for any two variables conditional_random_fields_ad1e3fe58109ebbabb85d1aa14bf4f2a252e3cf5.png, the variable conditional_random_fields_b9ef6b6eae1302cd9376186c18c5c9deb79de1fb.png is independent of conditional_random_fields_a316f788976a5692ce509a55d8a3de2b1d46a068.png conditioned on its neighbors, conditional_random_fields_b5bbc68fd5fa20e3902f5cba7611f6b49f811efd.png. Intuitively, this means that conditional_random_fields_b5bbc68fd5fa20e3902f5cba7611f6b49f811efd.png contains all the information that is useful for predicting conditional_random_fields_b9ef6b6eae1302cd9376186c18c5c9deb79de1fb.png

An issue with a Markov network is that it is ambigious, i.e. "not descriptive enough" from the factorization persepective. If you have a look at the figure below you can see an example of what we're talking about.

markov-field-ambiguity.png

Any distribution that factorizes as conditional_random_fields_04a6377b571508a451f8b28dfda53be8ad83e3da.png for some positive function conditional_random_fields_cdd1cc131da6040eca078917132a377727053c44.png is Markov with respect to this graph. However, we might want to represent the graph using a more restricted parameterization, where conditional_random_fields_ae84a502ae1812f0a384c3dd1871f12ce5f0584b.png. This is a more strict family of functions, and so one would expect it to be easier to estimate. Unfortunately, as we can see in the figure, Markov networks doesn't really distinguish between these cases. Lucky us, factor graphs do!

Directed Models

  • Describes how the distribution factorizes into a local conditional probability distribution.

Inference

Belief propagation

Notation

  • conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png is the variable we want to compute the marginal distribution for
  • conditional_random_fields_f33936c30a19c15507617e49579bcf0bc5ced7ea.png is the neighborhood of *factors of conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png, and it's the index of the neighboring factor of conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png
  • conditional_random_fields_706fa02cdb48720b7744478a076bd8686ed44463.png the set of variables "upstream" of conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png, i.e. set of variables conditional_random_fields_ebac26f7eba81eb516f423f028596d8e6018aa7a.png for which the factor conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png is between conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png and conditional_random_fields_ebac26f7eba81eb516f423f028596d8e6018aa7a.png (conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png separates conditional_random_fields_ebac26f7eba81eb516f423f028596d8e6018aa7a.png and conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png), with conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png being the index of a neighboring factor of conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png
  • conditional_random_fields_af3fc4570389c662d1f18f2b397f75142f8c102f.png the set of factors "upstream" of conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png, i.e. set of factors for which conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png is between those and conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png ("separates" conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png from this set of factors), including the factor conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png itself
  • Message from conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png to conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png

    conditional_random_fields_ed69c1c578927dbedbb447b4be311acfe02d3b9a.png

Stuff

  • Direct generalization of exact algorithms for Linear CRFs (i.e. Viterbi algorithm for sequences and Forward-backward algorithm for joint distributions)
  • Each neighboring factor of conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png makes a multiplicative contribution to the marginal of conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png, called a message

In the special case where our graph is a tree, the graph itself is a the one and only spanning tree, hence we partition the entire graph wrt. conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png: the sets conditional_random_fields_7916f3e1bf4a39ee04a9c989b2a0f51b6e50d422.png form a partition of the variables in conditional_random_fields_2de98136973021abb46a5a3fc1e4318bafb84264.png. In a bit simpler wording:

  • You're looking at conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png
  • Choose some other variable (node in the graph) conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png
  • All variables between conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png and conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png are along a single path (i.e. only one possible way to traverse from conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png to conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png)
  • Therefore, the sets conditional_random_fields_130d61b33fa55124d962e9d81233842ce04f278a.png form a partition of the variables in conditional_random_fields_2de98136973021abb46a5a3fc1e4318bafb84264.png, where conditional_random_fields_bb17a2ec1cec88e1936e09f64d5a7dd14d0e2efe.png denotes the number of neighboring nodes of conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png
  • Thus, we can split upt the summation required for the marginal into a product of independent subproblems

conditional_random_fields_a02cd71d93fe322c7806c91d7fa2154a558a9278.png

And denoting each of the factors in the above equation by conditional_random_fields_0439eab8ac089d8158d77c6f3886bf27cb648365.png, i.e.,

conditional_random_fields_ed69c1c578927dbedbb447b4be311acfe02d3b9a.png

which can be though of as a message from conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png to conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png.

And also, if instead take the product over neighboring factors, we can define the message from variables to factors as

conditional_random_fields_0a980082ea0f65e776c13c17122ea5c69d5aab48.png

which would give us

conditional_random_fields_6b5e39bc409a1c954950fe8f34d8d0b19c9b3bcb.png

Let's break this down:

  • conditional_random_fields_8f46b6558bdfd8b34aa00188be94abba37812d93.png is product over all variables conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png which are neighbors to the factor conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png
  • conditional_random_fields_a2c4392e3f74690808f02c5e30f78114dc3e1631.png are all possible values the variables conditional_random_fields_7c9368bef5eff6987112a369f652515c8e0d6198.png (i.e. "variables on one side of conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png, including conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png") can take on
  • conditional_random_fields_a8dc3e6669cd63ee3401c8c011cc1397d75f7c1f.png is the product over all neigboring factors conditional_random_fields_311bf31f836cbf61de75e9461effe47dc0f184eb.png of the variable conditional_random_fields_9493b9776211ef51195c1379ad5f5540a04e566e.png, excluding conditional_random_fields_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png itself (which is our "focus" at the moment)

Observe that when talking about the "marginal for a factor ", we're looking at the marginal of a vector conditional_random_fields_ea89a2620d072880d1edc4199808be946190b40c.png rather than a single value conditional_random_fields_5d8382a9fce8f1444904ab0e9e37f228c3d3b49e.png as in the marginal for a variable.

  • conditional_random_fields_fe55c3754c4c074c85920b40462e967cc4459f39.png denotes the vector of random variables conditional_random_fields_5d8382a9fce8f1444904ab0e9e37f228c3d3b49e.png with conditional_random_fields_8f46b6558bdfd8b34aa00188be94abba37812d93.png which are incoming to conditional_random_fields_ada4a80c345bd3a0947224d17cfa595249ce3bac.png, NOT the distribution over the output of conditional_random_fields_ada4a80c345bd3a0947224d17cfa595249ce3bac.png

Luckily, we can rewrite the messages recursively

conditional_random_fields_7df30b1161903dee33cd1629e51b30f39297775e.png

These recursion-relations define the belief propagation algorithm.

Belief progagation is the algorithm defined by the following recursion relations:

conditional_random_fields_7df30b1161903dee33cd1629e51b30f39297775e.png

In the case where the graph conditional_random_fields_2de98136973021abb46a5a3fc1e4318bafb84264.png is not a tree, we do not have the guarantee of convergence and refer to this algorithm as loopy belief propagation, as we can still find a fixed point.

As it turns out, Loopy BP can in fact be seen as a variational method, meaning that there actually exists an objective function over beliefs that is approximately minimized by the iterative BP procedure.

Appendix A: Vocabular

random field
a particular distribution among those defined by an unidirected graphical model
biparite graph
set of graph vertices decomposed into two disjoint sets s.t. no two graph vertices from the same set are adjacent / "directly connected".
markov blanket
simply a fancy word for the independent neighbors
multivariate distribution
distribution of multiple variables

Appendix B: Discriminative vs. Generative methods

Discriminative model
attempts to model conditional_random_fields_c3063a3b645ff48ecf200782dfad41eb283818bd.png, and thus indirectly model conditional_random_fields_071e43e709e11cbb39388d129194deed488839fc.png using Bayes'
Generative model
attemps to model conditional_random_fields_071e43e709e11cbb39388d129194deed488839fc.png directly

First, it's very hard (impossible?) to tell beforehand whether a generative or discriminative model will perform better on a given dataset. Nonetheless, there are some benefits/negatives to both.

Benefits of discriminative:

  • Agnostic about the form of conditional_random_fields_1ed25cf746e36b58c8d980488e7f49fe0ccb1b39.png
  • Make independence assumptions among conditional_random_fields_29c26837fff0673a98d6523a80b727d38797f426.png and how conditional_random_fields_29c26837fff0673a98d6523a80b727d38797f426.png depend on conditional_random_fields_ed39d9a397196f8f0ce6388b0ea4e0c1dd8becee.png, but do not make conditional independence assumptions between the $$s
  • Thus, these models are better at suited for including rich, overlapping (dependent) features.
  • Better probability estimates than generative models; consider a vector of features conditional_random_fields_903f34c6b2ccbae65f5d78dcbe6aaf1eed6501a2.png which we transform to repeat the features, like so conditional_random_fields_9a209b76a6ae0d80ce0e698aa48a15bf9b3d6e15.png. Then train, say a Naive Bayes', on this data. Even though no new data has been added (just added multiple completely correlated features), the model will be more confident in the probability estimates, i.e. conditional_random_fields_6a8dc7bd5924d28080e3657998598b72a45d70ec.png will be further away from conditional_random_fields_b16b09ba22d8a820fe30270ba6dda98931fb21b0.png than conditional_random_fields_101a146b89ac6627838d779a611c2ea71b97e0f4.png. Note that his in a way demonstrates the independence assumption between the features which is beeing made a generative model.

Benefits of generative model:

  • More natural for handling latent variables, partially-labeled data, and unlabeled data. Even in extreme case of unlabeled data, generative models can be applied, while discriminative methods is less natural.
  • In some cases, generative model can perform better, intuitively because conditional_random_fields_1ed25cf746e36b58c8d980488e7f49fe0ccb1b39.png might have a smoothening effect on the conditional.
  • If application of the model requires both prediction of future inputs as well as outputs, generative models is a good choice since they model the joint distribution, and so we can potentially marginalize over whatever we don't want to predict to obtain a prediction for our target variable (being input or output).

It may not be clear why these approaches are different, as you could convert from one to the other, both ways, using Bayes' rule.

TODO Ongoing work

One interesting way to look at it is as follows:

Suppose we have a generative model conditional_random_fields_50861d9ecb45ec87d68fb67db39f0cc471069974.png which takes the form:

conditional_random_fields_d305f8891f7f9203902ff1c499be2905260e7b3d.png

And using the chain rule of probability we get

conditional_random_fields_04b91aab9db01c9aaf1e85e0cac45f0d7e4343c9.png

where:

  • conditional_random_fields_ab79a1dacadc6abf4007bc7191b5198b84f2ef64.png is computed by inference
  • conditional_random_fields_141c3a15b467cb62343d8872763cdec227761eee.png is computed by inference

And then suppose we another model, but this is discriminative model conditional_random_fields_6b0416fee5ce9f78e544118a95f32000c928a5a1.png:

Resources

  • Named Entity Recognition with a Maximum Entropy Approach A paper using MEMMs for word-tagging. I link this here since MEMMs are very similar to CRFs, but also because they provide a very nice set of features for use in word-tagging, which is something you might want to use CRFs for.

    In fact, both CRFs and MEMMs are discriminative, unlike the generative HMMs, sequence models. Both CRFs and MEMMs are MaxEnt (maximum entropy) models, but a CRF is models the entire sequence while MEMMs makes decisions for each state independently of the other states.