# Graphical models

## Notation

- $\text{pa}(x)$ denotes the *parent nodes* of the r.v. $x$ in the BN
- $C$ denotes a clique
- $\mathbf{x}_C$ denotes the set of variables in the clique $C$
- $\psi_C(\mathbf{x}_C)$ are **potential functions**

## Markov Random Fields

A **Markov random field**, also known as a **Markov network** or an **undirected graphical model**, has a set of nodes each of which corresponds to a variable or group of variables, as well as a set of links each of which connects a pair of nodes.

The links are undirected, i.e. they do not carry arrows.

The **Markov blanket** of a node in a graphical model contains all the variables that "shield" the node from the rest of the network.

What this really means is that a node $x_i$ is *conditionally independent of all remaining nodes when conditioned on the set $\text{MB}(x_i)$*, where $\text{MB}(x_i)$ denotes the **Markov blanket** of $x_i$; formally

$$p(x_i \mid \mathbf{x}_{\setminus i}) = p(x_i \mid \text{MB}(x_i))$$

i.e. $x_i \perp x_j \mid \text{MB}(x_i)$ for distinct nodes $x_i$ and $x_j \notin \text{MB}(x_i)$.

In a Bayesian network, $\text{MB}(x_i)$ is simply the set of nodes consisting of:

- parents of $x_i$
- children of $x_i$
- parents of children of $x_i$ (these nodes are sometimes referred to as having an "explaining-away" effect)
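The blanket-in-a-BN definition above can be sketched directly in code. A minimal sketch (the DAG, node names, and the `markov_blanket` helper are illustrative, not from the source):

```python
# Compute the Markov blanket of a node in a Bayesian network, given only
# each node's parent set: parents(x) | children(x) | parents-of-children(x).
def markov_blanket(node, parents):
    """parents maps each node to the set of its parent nodes."""
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | children | co_parents

# Toy DAG: a -> c <- b, c -> d
parents = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
print(markov_blanket("c", parents))  # {'a', 'b', 'd'}
```

Note that `b` ends up in the blanket of `a` only through the shared child `c`, which is exactly the "explaining-away" co-parent case.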

## Belief Network

A **belief network** is a distribution of the form

$$p(x_1, \dots, x_D) = \prod_{i=1}^{D} p(x_i \mid \text{pa}(x_i))$$

where $\text{pa}(x_i)$ denotes the *parental* variables of variable $x_i$.

This can be written as a DAG, with the $i$-th node in the graph corresponding to the factor $p(x_i \mid \text{pa}(x_i))$.
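As a toy illustration of the factorization (all tables and numbers below are made up), a chain-structured belief network over three binary variables is just a product of conditional probability tables:

```python
import itertools

# Belief network p(a, b, c) = p(a) p(b | a) p(c | b), binary variables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # keyed (b, a)
p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # keyed (c, b)

def joint(a, b, c):
    # Product of one factor per node, each conditioned on its parents.
    return p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]

# Because each factor is a normalized conditional, the joint sums to 1
# without any extra partition function.
total = sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))
print(total)  # 1.0 (up to floating point)
```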

## Factorization properties

- Without loss of generality, we only consider maximal cliques, since every other clique is a subset of some maximal clique

The joint distribution is written as a product of **potential functions** $\psi_C(\mathbf{x}_C)$ over the maximal cliques of the graph:

$$p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C)$$

where the **partition function** $Z$ is given by

$$Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C)$$

which ensures that the distribution is correctly normalized.
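For small discrete models the partition function can be evaluated by brute force. A minimal sketch (the chain structure and the agreement-favouring potential are illustrative assumptions):

```python
import itertools
import math

# Tiny pairwise MRF over three binary variables with cliques (x1, x2), (x2, x3).
def psi(u, v):
    # Potential favouring agreement between neighbouring variables.
    return math.exp(1.0 if u == v else 0.0)

def unnormalized(x1, x2, x3):
    # Product of potentials over the (maximal) cliques.
    return psi(x1, x2) * psi(x2, x3)

# Z sums the product of potentials over every joint state.
Z = sum(unnormalized(*x) for x in itertools.product((0, 1), repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z
```

Dividing by `Z` is what turns the arbitrary non-negative potentials into a proper distribution; for larger graphs this sum is exponential in the number of variables, which is why one needs the inference algorithms below.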

To make a formal connection between conditional independence and factorization for undirected graphs, we restrict attention to **potential functions** that are *strictly positive*, so that we can express them as

$$\psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\}$$

where $E(\mathbf{x}_C)$ is called the **energy function**, and the exponential representation is called the **Boltzmann distribution** or **Gibbs distribution**.

A graph is said to be a **D map** (for **dependency map**) of a distribution if every conditional independence statement satisfied *by the distribution* is reflected in the graph.

Thus a completely disconnected graph (no links) will be a trivial **D map** for any distribution.

A graph is said to be an **I map** (for **independence map**) of a distribution if every conditional independence statement implied *by the graph* is satisfied by that specific distribution.

Clearly a fully connected graph will be a trivial **I map** for any distribution.

If it is the case that every conditional independence property of the distribution is reflected in the graph, and vice versa, then the graph is said to be a **perfect map**.

## Inference in Graphical Models

### Notation

- $x$ denotes a variable (can be a scalar, vector, matrix, etc.)
- $\mathbf{x}_s$ denotes a subset of variables
- $\text{ne}(x)$ denotes the set of *factor* nodes that are neighbours of the *variable* node $x$
- $X_s$ denotes the set of variables in the subtree connected to the *variable* node $x$ via the *factor* node $f_s$
- $X_{sm}$ denotes the set of variables in the subtree connected to the *variable* node $x_m$ via the *factor* node $f_s$
- $f_s$ depends on $\mathbf{x}_s$, which we denote $f_s(\mathbf{x}_s)$
- $F_s(x, X_s)$ represents the product of all the factors in the group associated with factor $f_s$

Message from *factor* node $f_s$ to *variable* node $x$:

$$\mu_{f_s \to x}(x) = \sum_{X_s} F_s(x, X_s)$$

Message from *variable* node $x_m$ to *factor* node $f_s$:

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \text{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

where $\text{ne}(x_m) \setminus f_s$ denotes the set of factor nodes neighbouring the *variable* node $x_m$, i.e. the factors in the subtrees connected to $x_m$, excluding the *factor* node $f_s$ itself.

### Factor graphs

Can write the joint distribution over a set of variables in the form of a product of *factors*

$$p(\mathbf{x}) = \prod_{s} f_s(\mathbf{x}_s)$$

where $\mathbf{x}_s$ denotes a *subset* of the variables.

- Directed graphs represent special cases in which the factors are *local conditional distributions*
- Undirected graphs are a special case in which the factors are *potential functions* over the maximal cliques (the normalizing coefficient $1/Z$ can be viewed as a factor defined over the empty set of variables)

#### Converting distribution expressed in terms of undirected graph to factor graph

- Create variable nodes corresponding to the nodes in the original undirected graph
- Create additional factor nodes corresponding to the maximal cliques
- Factors are set equal to the *clique potentials*

There might be several factor graphs corresponding to the same undirected graph, i.e. this construction is not *unique*.
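The construction above can be sketched with plain data structures. A minimal sketch (the clique list and the `f0, f1, ...` naming scheme are illustrative assumptions):

```python
# Undirected graph given by its maximal cliques.
max_cliques = [("a", "b", "c"), ("c", "d")]

# Variable nodes: every node appearing in some clique.
variable_nodes = sorted({v for clique in max_cliques for v in clique})

# Factor nodes: one per maximal clique, linked to exactly that clique's
# variables; each factor would be set equal to the clique potential.
factor_graph = {f"f{i}": clique for i, clique in enumerate(max_cliques)}

print(variable_nodes)      # ['a', 'b', 'c', 'd']
print(factor_graph["f1"])  # ('c', 'd')
```

Non-uniqueness shows up here too: e.g. a single clique potential could instead be split across several factor nodes whose product equals the same potential.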

#### Sum-product algorithm

**Problem:** evaluate local marginals over nodes or subsets of nodes.

- Applicable to tree-structured graphs
- Exploits the structure of the graph to achieve:
    - an efficient, exact inference algorithm for *finding marginals*
    - sharing of computations in situations where several marginals are required
- Summarized in barber2012

**Belief propagation** is a special case of the **sum-product algorithm**, which performs exact inference on directed graphs *without loops*. (You have already written about this)

For actual implementation it's useful to use the logarithmic version of the messages. See barber2012.

##### Derivation

The marginal is obtained by summing the joint distribution over all variables except $x$, so that

$$p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x})$$

where $\mathbf{x} \setminus x$ denotes the set of variables in $\mathbf{x}$ with the variable $x$ omitted.

The joint distribution over the set of variables can be written as a product of the form

$$p(\mathbf{x}) = \prod_{s \in \text{ne}(x)} F_s(x, X_s)$$

Rewriting the marginal above using this form of the joint distribution:

$$p(x) = \prod_{s \in \text{ne}(x)} \left[ \sum_{X_s} F_s(x, X_s) \right] = \prod_{s \in \text{ne}(x)} \mu_{f_s \to x}(x)$$

i.e. the marginal is given by the product of all incoming messages arriving at node $x$.

Each factor $F_s(x, X_s)$ can be described by a *factor* (sub-)graph and so can itself be factorized:

$$F_s(x, X_s) = f_s(x, x_1, \dots, x_M) \, G_1(x_1, X_{s1}) \cdots G_M(x_M, X_{sM})$$

where, for convenience, we have denoted the variables associated with factor $f_s$, in addition to $x$, by $x_1, \dots, x_M$.

The message $\mu_{f_s \to x}(x)$ can then be rewritten:

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \dots, x_M) \prod_{m \in \text{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$$

where $\mu_{x_m \to f_s}(x_m) \equiv \sum_{X_{sm}} G_m(x_m, X_{sm})$, i.e. we marginalize over all the variables the factor depends on, weighted by the "messages" being sent from the *variables* to the *factor*. Thus, to evaluate the message sent by a factor node to a variable node along the link connecting them:

- Take the product of the incoming messages along all *other* links coming into the factor node
- Multiply by the factor associated with that node
- Marginalize over all of the variables associated with the incoming messages
The message $\mu_{x_m \to f_s}(x_m)$ can be written:

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \text{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

i.e. to evaluate the message passed from a variable node to an adjacent factor node we simply take the product of the incoming messages along all *other* links.

- Messages can be computed recursively
- **Start:** if the leaf node is a:
    - *variable*: the message it sends along its one and only link is $\mu_{x \to f}(x) = 1$
    - *factor*: the message it sends is simply $\mu_{f \to x}(x) = f(x)$

Finally, if we wish to find the marginal distributions associated with the sets of variables belonging to each factor:

$$p(\mathbf{x}_s) = f_s(\mathbf{x}_s) \prod_{i \in \text{ne}(f_s)} \mu_{x_i \to f_s}(x_i)$$

##### Summarized

The **sum-product algorithm** can then be summarized as:

1. A variable node $x$ is chosen as the root of the factor graph.
2. Messages are initiated at the leaves of the graph. If the leaf is a *variable* node, the message it sends along its one and only link is $\mu_{x \to f}(x) = 1$; if it is a *factor* node, the message it sends is simply $\mu_{f \to x}(x) = f(x)$.
3. Message passing steps are applied recursively until messages have been propagated along every link, using $\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \dots, x_M) \prod_{m \in \text{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$ and $\mu_{x_m \to f_s}(x_m) = \prod_{l \in \text{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$.
4. Once the root node has received messages from all of its neighbours, the marginal can be evaluated by $p(x) = \prod_{s \in \text{ne}(x)} \mu_{f_s \to x}(x)$, or, if one wants the marginal distribution associated with the set of variables belonging to each factor, by $p(\mathbf{x}_s) = f_s(\mathbf{x}_s) \prod_{i \in \text{ne}(f_s)} \mu_{x_i \to f_s}(x_i)$.

If we started from a:

- *directed* graph: the resulting marginals are already correctly normalized
- *undirected* graph: we started with an unknown normalization constant $Z$. From steps 1-4 we now have an unnormalized marginal $\widetilde{p}(x)$, and normalizing a *single* one of these (summing over that variable) gives us the normalization constant: $Z = \sum_{x} \widetilde{p}(x)$

Could find marginals for every variable node in the graph by first passing messages from the leaves to the root, as in the sum-product algorithm, and then, once the root has received all of its messages, passing messages *from* the root back to the leaves.

Then we have computed the messages along each of the links / edges in both directions, hence we can simply combine these messages at each node to obtain the wanted marginal.
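The steps above can be sketched on the smallest non-trivial case. A minimal sketch (assumptions: a 3-variable chain `x1 - f12 - x2 - f23 - x3` over binary states with made-up factor tables, root chosen as `x2`):

```python
import itertools
import math

# Illustrative pairwise factors, keyed by the states of their two variables.
f12 = {(a, b): math.exp(1.0 if a == b else 0.0) for a in (0, 1) for b in (0, 1)}
f23 = {(b, c): 1.0 + b + 2.0 * c for b in (0, 1) for c in (0, 1)}

# Leaf variable nodes x1 and x3 send the unit message, so each factor just
# sums out its other argument: mu_{f -> x2}(x2) = sum_x f(x, x2) * 1.
mu_f12_to_x2 = {b: sum(f12[(a, b)] for a in (0, 1)) for b in (0, 1)}
mu_f23_to_x2 = {b: sum(f23[(b, c)] for c in (0, 1)) for b in (0, 1)}

# The unnormalized marginal at the root is the product of incoming messages;
# summing it gives the normalization constant Z.
unnorm = {b: mu_f12_to_x2[b] * mu_f23_to_x2[b] for b in (0, 1)}
Z = sum(unnorm.values())
p_x2 = {b: unnorm[b] / Z for b in (0, 1)}

# Brute-force check against the full joint distribution.
joint = {x: f12[(x[0], x[1])] * f23[(x[1], x[2])]
         for x in itertools.product((0, 1), repeat=3)}
Zb = sum(joint.values())
p_x2_brute = {b: sum(v for x, v in joint.items() if x[1] == b) / Zb
              for b in (0, 1)}
print(p_x2)  # matches p_x2_brute up to floating point
```

On a chain the messages touch each factor once, whereas the brute-force sum enumerates all joint states; this is exactly the shared-computation saving the algorithm exploits.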

#### Max-sum algorithm

**Problem:** find the most probable *state* or *sequence*

$$\mathbf{x}^{\max} = \arg\max_{\mathbf{x}} p(\mathbf{x})$$

which will have the joint probability

$$p(\mathbf{x}^{\max}) = \max_{\mathbf{x}} p(\mathbf{x})$$

- Summarized: barber2012

Figure 1: barber2012

##### Derivation

Will make use of

$$\max(ab, ac) = a \max(b, c)$$

which holds if $a \geq 0$ (always true for the factors in graphical models!).

Work with $\ln p(\mathbf{x})$ instead (for numerical purposes), i.e.

$$\ln \left( \max_{\mathbf{x}} p(\mathbf{x}) \right) = \max_{\mathbf{x}} \ln p(\mathbf{x})$$

where the distributive property is preserved since

$$\max(a + b, a + c) = a + \max(b, c)$$

hence we replace products with sums.
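A minimal sketch of max-sum on the same kind of chain as before (assumptions: 3 binary variables `x1 - f12 - x2 - f23 - x3`, made-up factors, working in log space with back-pointers for the traceback):

```python
import itertools
import math

# Log-factors for an illustrative chain; note products have become sums.
log_f12 = {(a, b): (1.0 if a == b else 0.0) for a in (0, 1) for b in (0, 1)}
log_f23 = {(b, c): math.log(1.0 + b + 2.0 * c) for b in (0, 1) for c in (0, 1)}

def log_joint(a, b, c):
    return log_f12[(a, b)] + log_f23[(b, c)]

# Forward pass towards the root x3: take maxima instead of sums, and keep
# the maximizing state (a back-pointer) alongside each message value.
msg_x1 = {b: max((log_f12[(a, b)], a) for a in (0, 1)) for b in (0, 1)}
msg_x2 = {c: max((msg_x1[b][0] + log_f23[(b, c)], b) for b in (0, 1))
          for c in (0, 1)}

# Backtrack from the root through the stored back-pointers.
best_c = max((msg_x2[c][0], c) for c in (0, 1))[1]
best_b = msg_x2[best_c][1]
best_a = msg_x1[best_b][1]

# Brute-force check: the backtracked state maximizes the log joint.
brute = max(itertools.product((0, 1), repeat=3), key=lambda x: log_joint(*x))
assert (best_a, best_b, best_c) == brute
print(best_a, best_b, best_c)  # 1 1 1
```

The back-pointers are what turn the max-marginal value into an actual most probable *sequence*; without them only $\max_{\mathbf{x}} \ln p(\mathbf{x})$ would be recovered.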

## Causality

In setting any variable into a particular state we need to surgically remove all parental links of that variable.

This is called the **do operator**, and contrasts *observational* ("see") inference with *causal* ("make" or "do") inference.

Formally, let all variables $\mathbf{x}$ be written in terms of the intervention variables $\mathbf{x}_I$ and the non-intervention variables $\mathbf{x}_{\setminus I}$.

For a Belief Network $p(\mathbf{x}) = \prod_i p(x_i \mid \text{pa}(x_i))$, inferring the effect of setting the variables $\mathbf{x}_I$ in states $\tilde{\mathbf{x}}_I$ is equivalent to standard evidential inference in the *post-intervention distribution*:

$$p(\mathbf{x}_{\setminus I} \mid \text{do}(\mathbf{x}_I = \tilde{\mathbf{x}}_I)) = \prod_{i \notin I} p(x_i \mid \text{pa}(x_i))$$

where any parental variable included in the intervention set is set to its intervention state. An alternative notation is $p(\mathbf{x}_{\setminus I} \,\|\, \tilde{\mathbf{x}}_I)$.

**Intervention** here refers to the "act of intervening", as in making some variable take on a particular state, i.e. we are interfering with the "total" state.
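The see/do distinction can be made concrete on a toy network with a confounder. A minimal sketch (assumptions: structure $c \to a$, $c \to b$, $a \to b$ over binary variables, with made-up probability tables):

```python
# p(c): prior over the confounder; p(a=1|c) and p(b=1|a,c) are illustrative.
p_c = {0: 0.5, 1: 0.5}
p_a1_given_c = {0: 0.2, 1: 0.7}             # p(a = 1 | c)
p_b1_given_ac = {(0, 0): 0.1, (1, 0): 0.5,  # p(b = 1 | a, c), keyed (a, c)
                 (0, 1): 0.4, (1, 1): 0.9}

# Observational ("see"): conditioning on a = 1 reweights c by p(c | a = 1).
w = {c: p_a1_given_c[c] * p_c[c] for c in (0, 1)}
norm = sum(w.values())
p_c_given_a1 = {c: v / norm for c, v in w.items()}
see = sum(p_b1_given_ac[(1, c)] * p_c_given_a1[c] for c in (0, 1))

# Causal ("do"): the surgery removes the link c -> a (i.e. the factor p(a | c)
# is deleted and a is clamped), so c keeps its prior p(c).
do = sum(p_b1_given_ac[(1, c)] * p_c[c] for c in (0, 1))

print(round(see, 4), round(do, 4))  # 0.8111 0.7
```

The two answers differ because observing `a = 1` is evidence about the confounder `c`, while setting `a = 1` by intervention severs that back-door path.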

## Bibliography

- [barber2012] Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press (2012).