Graphical models

Notation

  • $\mathrm{pa}(x_i)$ denotes the parent nodes of the r.v. $x_i$ in the BN
  • $C$ denotes a clique
  • $\mathbf{x}_C$ denotes the set of variables in the clique $C$
  • $\psi_C(\mathbf{x}_C)$ are potential functions

Markov Random Fields

A Markov random field, also known as a Markov network or an undirected graphical model, has a set of nodes each of which corresponds to a variable or group of variables, as well as a set of links each of which connects a pair of nodes.

The links are undirected, i.e. do not carry arrows.

Belief Network

A belief network is a distribution of the form

$$p(x_1, \dots, x_D) = \prod_{i=1}^{D} p(x_i \mid \mathrm{pa}(x_i))$$

where $\mathrm{pa}(x_i)$ denotes the parental variables of variable $x_i$.

This can be written as a DAG, with the i-th node in the graph corresponding to the factor $p(x_i \mid \mathrm{pa}(x_i))$.
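As a quick sanity check, the factorization can be evaluated directly. A minimal sketch in Python, with made-up conditional probability tables for an assumed chain a → b → c:

```python
# Sketch (made-up CPTs): a belief network a -> b -> c over binary variables,
# whose joint is the product of the per-node factors p(x_i | pa(x_i)).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.7, (1, 0): 0.3,   # keys are (b, a)
               (0, 1): 0.2, (1, 1): 0.8}
p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1,   # keys are (c, b)
               (0, 1): 0.5, (1, 1): 0.5}

def joint(a, b, c):
    """p(a, b, c) = p(a) p(b | a) p(c | b)."""
    return p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]

# Summing over all configurations recovers 1, confirming normalization.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Because each factor is a normalized conditional, the product is automatically normalized — no partition function is needed, unlike the undirected case below.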

Factorization properties

  • Without loss of generality, only consider maximal cliques, since all other cliques are subsets of maximal ones and their potentials can be absorbed into the maximal-clique potentials

The joint distribution is written as a product of potential functions $\psi_C(\mathbf{x}_C)$ over the maximal cliques of the graph

$$p(\mathbf{x}) = \frac{1}{Z} \prod_C \psi_C(\mathbf{x}_C)$$

where the partition function is given by

$$Z = \sum_{\mathbf{x}} \prod_C \psi_C(\mathbf{x}_C)$$

which ensures that the distribution $p(\mathbf{x})$ is correctly normalized.

To make a formal connection between conditional independence and factorization for undirected graphs, we restrict attention to potential functions $\psi_C(\mathbf{x}_C)$ that are strictly positive, so we can express them as

$$\psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\}$$

where $E(\mathbf{x}_C)$ is called the energy function, and the exponential representation is called the Boltzmann distribution or Gibbs distribution.
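A brute-force sketch of the potentials, partition function, and Boltzmann form for a tiny assumed MRF (chain and energies are made up):

```python
import itertools
import math

# Sketch (made-up energies): a pairwise MRF on the chain x1 - x2 - x3 with
# potentials psi(x_C) = exp(-E(x_C)); this energy favours agreeing neighbours.
def E(xi, xj):
    return 0.0 if xi == xj else 1.0

def unnormalized(x1, x2, x3):
    return math.exp(-E(x1, x2)) * math.exp(-E(x2, x3))

# Partition function: brute-force sum over all joint states (feasible only
# for tiny models; computing Z is what makes exact inference hard in general).
Z = sum(unnormalized(*x) for x in itertools.product((0, 1), repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z
```

Here fully agreeing states such as (0, 0, 0) get the highest probability, since they have the lowest total energy.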

A graph is said to be a D map (for dependency map) of a distribution if every conditional independence statement satisfied by the distribution is reflected in the graph.

Thus a completely disconnected graph (no links) will be a trivial D map for any distribution.

A graph is said to be an I map (for independence map) if every conditional independence statement implied by the graph is satisfied by a specific distribution.

Clearly a fully connected graph will be a trivial I map for any distribution.

If it is the case that every conditional independence property of the distribution is reflected in the graph, and vice versa, then the graph is said to be a perfect map.

Inference in Graphical Models


  • $x$ denotes a variable (can be a scalar, vector, matrix, etc.)
  • $\mathbf{x}_s$ denotes a subset of the variables
  • $\mathrm{ne}(x)$ denotes the set of factor nodes that are neighbours of $x$
  • $X_s$ denotes the set of variables in the subtree connected to the variable node $x$ via a factor node $f_s$
  • $X_{sm}$ denotes the set of variables in the subtree connected to the variable node $x_m$ via a factor node $f_s$
  • $f_s$ depends on the variables $(x, x_1, \dots, x_M)$, which we denote $\mathbf{x}_s$
  • $F_s(x, X_s)$ represents the product of all the factors in the group associated with factor $f_s$
  • Message from factor node $f_s$ to variable node $x$:

    $$\mu_{f_s \to x}(x) = \sum_{X_s} F_s(x, X_s)$$

  • Message from variable node $x_m$ to factor node $f_s$:

    $$\mu_{x_m \to f_s}(x_m) = \sum_{X_{sm}} G_m(x_m, X_{sm})$$

    where $G_m(x_m, X_{sm})$ denotes the product of the factors in the subtree connected to the factor node $f_s$ via the variable node $x_m$

Factor graphs

  • Can write the joint distribution over a set of variables in the form of a product of factors

    $$p(\mathbf{x}) = \prod_s f_s(\mathbf{x}_s)$$

    where $\mathbf{x}_s$ denotes a subset of the variables.

  • Directed graphs represent special cases in which the factors $f_s(\mathbf{x}_s)$ are local conditional distributions
  • Undirected graphs are a special case in which the factors are potential functions over the maximal cliques (the normalizing coefficient $1/Z$ can be viewed as a factor defined over the empty set of variables)

Converting distribution expressed in terms of undirected graph to factor graph

  1. Create variable nodes corresponding to the nodes in the original undirected graph
  2. Create additional factor nodes corresponding to the maximal cliques $\mathbf{x}_s$
  3. The factors $f_s(\mathbf{x}_s)$ are set equal to the clique potentials $\psi_s(\mathbf{x}_s)$

There might be several factor graphs corresponding to the same undirected graph, i.e. this construction is not unique.
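The construction above can be sketched as follows (toy graph, hand-listed cliques, and placeholder potential are all assumptions):

```python
# Sketch (hypothetical toy graph): converting an undirected graph to a factor
# graph via the three steps above. Maximal cliques are listed by inspection.
undirected_edges = {("x1", "x2"), ("x2", "x3"), ("x1", "x3"), ("x3", "x4")}
maximal_cliques = [("x1", "x2", "x3"), ("x3", "x4")]

# Step 1: variable nodes are the nodes of the original graph.
variable_nodes = {v for edge in undirected_edges for v in edge}

# Steps 2-3: one factor node per maximal clique; its factor is the clique
# potential (here a placeholder that scores agreement of the clique's values).
def potential(assignment):
    return 2.0 if len(set(assignment)) == 1 else 1.0

factor_nodes = {f"f{i}": (clique, potential)
                for i, clique in enumerate(maximal_cliques)}

# Bipartite structure: each factor node links to exactly its clique's variables.
fg_edges = [(name, v) for name, (clique, _) in factor_nodes.items()
            for v in clique]
```

A different clique-to-factor assignment (e.g. splitting a potential across several factors over the same clique) would give another valid factor graph, which is exactly the non-uniqueness noted above.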

Sum-product algorithm

  • Problem: evaluate local marginals over nodes or subsets of nodes
  • Applicable to tree-structured graphs
  • Exploit structure of the graph to achieve:
    1. an efficient, exact inference algorithm for finding marginals
    2. efficient sharing of computations in situations where several marginals are required
  • Summarized: barber2012

sum_product_part1.png sum_product_part2.png

Belief propagation is a special case of sum-product algorithm, which performs exact inference on directed graphs without loops. (You have already written about this)

For actual implementation it's useful to use the logarithmic version of the messages. See barber2012.

  • The marginal is obtained by summing the joint distribution over all variables except $x$, so that

    $$p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x})$$

    where $\mathbf{x} \setminus x$ denotes the set of variables in $\mathbf{x}$ with the variable $x$ omitted.

  • The joint distribution of the set of variables $\mathbf{x}$ can be written as a product of the form

    $$p(\mathbf{x}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s)$$

  • Rewriting the marginal above using the joint distribution notation:

    $$p(x) = \prod_{s \in \mathrm{ne}(x)} \left[ \sum_{X_s} F_s(x, X_s) \right] = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)$$

    i.e. the marginal is given by the product of all incoming messages arriving at node $x$.

  • Each factor $F_s(x, X_s)$ can be described by a factor (sub-)graph and so can itself be factorized:

    $$F_s(x, X_s) = f_s(x, x_1, \dots, x_M) \, G_1(x_1, X_{s1}) \cdots G_M(x_M, X_{sM})$$

    where, for convenience, we have denoted the variables associated with factor $f_s$, in addition to $x$, by $x_1, \dots, x_M$.

  • Message $f_s$ to $x$ can be rewritten:

    $$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \dots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$$

    i.e. we marginalize over all the variables which $f_s$ depends on, weighting by the "messages" sent from those variables to the factor $f_s$. Thus, to evaluate the message sent by a factor to a variable along the link connecting them:

    1. Take the product of incoming messages along all other links coming into the factor node
    2. Multiply by the factor associated with that node
    3. Marginalize over all of the variables associated with the incoming messages
  • Message $x_m$ to $f_s$ can be written:

    $$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

    i.e. to evaluate messages passed from variables to adjacent factors we simply take the product of the incoming messages along all other links.

  • Messages can be computed recursively
    • Start: if a leaf node is a:
      • variable: the message it sends along its one and only link is given by

        $$\mu_{x \to f}(x) = 1$$

      • factor: the message it sends is simply

        $$\mu_{f \to x}(x) = f(x)$$

  • Finally, if we wish to find the marginal distributions $p(\mathbf{x}_s)$ associated with the sets of variables belonging to each factor:

    $$p(\mathbf{x}_s) = f_s(\mathbf{x}_s) \prod_{i \in \mathrm{ne}(f_s)} \mu_{x_i \to f_s}(x_i)$$


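The two message updates can be sketched for a single discrete factor (the factor table and the incoming messages are made up):

```python
import itertools

# Sketch (made-up numbers): the two message updates for a discrete factor
# f(x, x1, x2) over binary variables.
def f(x, x1, x2):
    return float(4 * x + 2 * x1 + x2)  # placeholder factor table

# Incoming variable-to-factor messages on the other two links.
mu_x1_to_f = [1.0, 1.0]   # e.g. x1 is a leaf variable, so it sends ones
mu_x2_to_f = [0.9, 0.1]

# Factor-to-variable: multiply incoming messages, multiply by the factor,
# marginalize over the other variables (the three steps above).
mu_f_to_x = [
    sum(f(x, x1, x2) * mu_x1_to_f[x1] * mu_x2_to_f[x2]
        for x1, x2 in itertools.product((0, 1), repeat=2))
    for x in (0, 1)
]

# Variable-to-factor: product of incoming factor-to-variable messages on the
# other links (here two hypothetical other factors g and h).
mu_g_to_x = [0.5, 0.5]
mu_h_to_x = [0.8, 0.2]
mu_x_to_f = [g * h for g, h in zip(mu_g_to_x, mu_h_to_x)]
```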

The sum-product algorithm can then be summarized as:

  1. A variable node $x$ is chosen as the root of the factor graph
  2. Messages are initiated at the leaves of the graph; if a leaf is a:
    • variable: the message it sends along its one and only link is given by

      $$\mu_{x \to f}(x) = 1$$

    • factor: the message it sends is simply

      $$\mu_{f \to x}(x) = f(x)$$

  3. The message passing steps

      $$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \dots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$$

      $$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

     are applied recursively until messages have been propagated along every link

  4. Once the root node has received messages from all of its neighbours, the marginal can be evaluated by:

      $$p(x) = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)$$

     or, if one wants the marginal distribution $p(\mathbf{x}_s)$ associated with the sets of variables belonging to each of the factors:

      $$p(\mathbf{x}_s) = f_s(\mathbf{x}_s) \prod_{i \in \mathrm{ne}(f_s)} \mu_{x_i \to f_s}(x_i)$$

  5. If we started from a:
    • directed graph: the marginals are already correctly normalized
    • undirected graph: we started with an unknown normalization constant $Z$, so steps 1-4 yield an unnormalized marginal $\tilde{p}(x)$. Summing it over the single variable $x$ then gives the normalization constant:

      $$Z = \sum_x \tilde{p}(x)$$

We could find the marginals for every variable node in the graph by first passing messages from the leaves to the root, as in the sum-product algorithm, and then, once the root has received all messages, passing messages from the root back out to the leaves.

Then we would have computed the messages along each of the links / edges in both directions, and the wanted marginal at any node is simply the product of its incoming messages.
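This two-pass scheme can be sketched on an assumed three-variable chain with made-up factor tables:

```python
# Sketch (made-up factor tables): two-pass sum-product on the chain
#   x1 -- f_a -- x2 -- f_b -- x3   with binary variables and x3 as root.
f_a = {(x1, x2): 1.0 + x1 + 2 * x2 for x1 in (0, 1) for x2 in (0, 1)}
f_b = {(x2, x3): 2.0 - x2 + x3 for x2 in (0, 1) for x3 in (0, 1)}

# Forward pass: leaves -> root. Leaf variable x1 sends the unit message.
mu_x1_fa = [1.0, 1.0]
mu_fa_x2 = [sum(f_a[x1, x2] * mu_x1_fa[x1] for x1 in (0, 1)) for x2 in (0, 1)]
mu_x2_fb = mu_fa_x2                     # x2 has only one other link
mu_fb_x3 = [sum(f_b[x2, x3] * mu_x2_fb[x2] for x2 in (0, 1)) for x3 in (0, 1)]

# Backward pass: root -> leaves, so every link carries messages both ways.
mu_x3_fb = [1.0, 1.0]
mu_fb_x2 = [sum(f_b[x2, x3] * mu_x3_fb[x3] for x3 in (0, 1)) for x2 in (0, 1)]
mu_x2_fa = mu_fb_x2
mu_fa_x1 = [sum(f_a[x1, x2] * mu_x2_fa[x2] for x2 in (0, 1)) for x1 in (0, 1)]

# Unnormalized marginals: product of all incoming messages at each variable.
tilde_p1 = mu_fa_x1
tilde_p2 = [mu_fa_x2[x] * mu_fb_x2[x] for x in (0, 1)]
tilde_p3 = mu_fb_x3

# Z is obtained by summing any one unnormalized marginal.
Z = sum(tilde_p1)

# Sanity check against brute-force marginalization of f_a(x1,x2) f_b(x2,x3).
brute_p1 = [sum(f_a[x1, x2] * f_b[x2, x3] for x2 in (0, 1) for x3 in (0, 1))
            for x1 in (0, 1)]
```

Note that all three unnormalized marginals sum to the same Z, as they must, and the message-passing result agrees with direct summation over the joint.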

Max-sum algorithm

  • Problem: find the most probable state or sequence:

    $$\mathbf{x}^{\max} = \arg\max_{\mathbf{x}} p(\mathbf{x})$$

    which will have the joint probability

    $$p(\mathbf{x}^{\max}) = \max_{\mathbf{x}} p(\mathbf{x})$$

  • Summarized: barber2012


Figure 1: barber2012

  • Will make use of

    $$\max(ab, ac) = a \max(b, c)$$

    which holds if $a \geq 0$ (always true for factors in graphical models!)

  • Work with $\ln p(\mathbf{x})$ instead (for numerical purposes), i.e.

    $$\ln p(\mathbf{x}^{\max}) = \max_{\mathbf{x}} \ln p(\mathbf{x})$$

    which holds because the logarithm is monotonic; the distributive property is preserved since

    $$\max(a + b, a + c) = a + \max(b, c)$$

    hence we replace products with sums.
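A quick numerical check of the identity and its log-domain counterpart (the values are arbitrary):

```python
import math

# Sketch (made-up numbers): the distributivity identity behind max-sum.
a, b, c = 0.5, 0.2, 0.9
lhs = max(a * b, a * c)           # max distributes over products when a >= 0
rhs = a * max(b, c)

# In the log domain the same identity holds with sums instead of products,
# which is what lets max-sum push max inside sums of log-factors.
log_lhs = max(math.log(a) + math.log(b), math.log(a) + math.log(c))
log_rhs = math.log(a) + max(math.log(b), math.log(c))
```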


Causality

In setting any variable into a particular state we need to surgically remove all parental links of that variable.

This is called the do operator, and it contrasts observational ("see") inference $p(y \mid x)$ with causal ("make" or "do") inference $p(y \mid \mathrm{do}(x))$.

Formally, let all variables $\mathbf{x}$ be written in terms of the intervention variables $x_c$ and the non-intervention variables $\mathbf{x}_{\setminus c}$.

For a Belief Network $p(\mathbf{x}) = \prod_i p(x_i \mid \mathrm{pa}(x_i))$, inferring the effect of setting the variables $X_c$, $c \in \mathcal{C}$, in states $x_c$, is equivalent to standard evidential inference in the post-intervention distribution:

$$p(\mathbf{x}_{\setminus \mathcal{C}} \mid \mathrm{do}(x_c)) = \prod_{j \notin \mathcal{C}} p(x_j \mid \mathrm{pa}(x_j))$$

where any parental variable included in the intervention set is set to its intervention state. An alternative notation is $p(\mathbf{x}_{\setminus \mathcal{C}} \,\|\, x_c)$.

Intervention here refers to the "act of intervening", as in making some variable take on a particular state, i.e. we're interfering with the "total" state.
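The "see" vs "do" distinction can be sketched on an assumed confounded network z → x, z → y, x → y (all numbers made up):

```python
# Sketch (made-up numbers): a confounded belief network z -> x, z -> y, x -> y.
# Conditioning ("see") differs from intervening ("do"): the do operator severs
# the parental link z -> x and keeps p(z) at its prior.
p_z = {0: 0.5, 1: 0.5}

def p_x1_given_z(z):
    return {0: 0.1, 1: 0.9}[z]          # p(x=1 | z)

def p_y1_given_xz(x, z):
    return {(0, 0): 0.2, (0, 1): 0.6,   # p(y=1 | x, z)
            (1, 0): 0.5, (1, 1): 0.9}[(x, z)]

# Observational: p(y=1 | x=1) = sum_z p(z | x=1) p(y=1 | x=1, z),
# where Bayes' rule pulls the posterior over the confounder z.
p_x1 = sum(p_z[z] * p_x1_given_z(z) for z in (0, 1))
p_see = sum(p_z[z] * p_x1_given_z(z) / p_x1 * p_y1_given_xz(1, z)
            for z in (0, 1))

# Interventional: p(y=1 | do(x=1)) = sum_z p(z) p(y=1 | x=1, z),
# i.e. evidential inference in the post-intervention distribution.
p_do = sum(p_z[z] * p_y1_given_xz(1, z) for z in (0, 1))
```

Here seeing x = 1 makes z = 1 more likely (and z = 1 independently raises y), so the observational probability overstates the causal effect of setting x = 1.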


  • [barber2012] Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press (2012).