# Graphical models

## Notation

• $\mathrm{pa}(x_i)$ denotes the parent nodes of the r.v. $x_i$ in the BN
• $C$ denotes a clique
• $\mathbf{x}_C$ denotes the set of variables in the clique $C$
• $\psi_C(\mathbf{x}_C)$ are potential functions

## Markov Random Fields

A Markov random field, also known as a Markov network or an undirected graphical model, has a set of nodes each of which corresponds to a variable or group of variables, as well as a set of links each of which connects a pair of nodes.

The links are undirected, i.e. do not carry arrows.

The Markov blanket of a node in a graphical model contains all the variables that "shield" the node from the rest of the network.

What this really means is that every node $x$ in a graphical model is conditionally independent of the rest of the network when conditioned on the set $\mathrm{MB}(x)$, where $\mathrm{MB}(x)$ denotes the Markov blanket of $x$; formally, for distinct nodes $x$ and $y \notin \mathrm{MB}(x)$,

$$p(x \mid \mathrm{MB}(x), y) = p(x \mid \mathrm{MB}(x))$$

In a Bayesian network, $\mathrm{MB}(x)$ is simply the set of nodes consisting of:

• the parents of $x$
• the children of $x$
• the parents of the children of $x$ (these co-parents are sometimes referred to as having an "explaining-away" effect)
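As a minimal sketch, the three-part definition above can be computed directly from a parent map; the network, node names, and the helper `markov_blanket` below are illustrative (the classic sprinkler example), not from the text:

```python
# Sketch: Markov blanket of a node in a Bayesian network, given as a dict
# mapping each node to its set of parents. Names are illustrative.
def markov_blanket(parents, node):
    children = {v for v, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    # parents + children + co-parents, excluding the node itself
    return (parents[node] | children | co_parents) - {node}

# Classic sprinkler network: Rain and Sprinkler are co-parents of WetGrass.
parents = {
    "Cloudy": set(),
    "Sprinkler": {"Cloudy"},
    "Rain": {"Cloudy"},
    "WetGrass": {"Sprinkler", "Rain"},
}
print(markov_blanket(parents, "Sprinkler"))  # {'Cloudy', 'WetGrass', 'Rain'}
```

Note that `Rain` enters the blanket of `Sprinkler` only through the co-parent ("explaining-away") rule.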

## Belief Network

A belief network is a distribution of the form

$$p(x_1, \dots, x_D) = \prod_{i=1}^{D} p(x_i \mid \mathrm{pa}(x_i))$$

where $\mathrm{pa}(x_i)$ denotes the parental variables of variable $x_i$.

This can be written as a DAG, with the $i$-th node in the graph corresponding to the factor $p(x_i \mid \mathrm{pa}(x_i))$.
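A quick numerical check of this factorization, using a hypothetical two-node network $a \to b$ with made-up tables; a belief network built this way is normalized by construction:

```python
# Sketch: the joint of a belief network as a product of local conditionals,
# p(x) = prod_i p(x_i | pa(x_i)). Tables are illustrative binary CPTs.
import itertools

p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}  # key: (b, a)

def joint(a, b):
    return p_a[a] * p_b_given_a[(b, a)]

total = sum(joint(a, b) for a, b in itertools.product([0, 1], repeat=2))
print(total)  # 1.0 — no partition function needed, unlike an undirected model
```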

## Factorization properties

• Without loss of generality, we can consider only maximal cliques, since all other cliques are subsets of maximal cliques

The joint distribution is written as a product of potential functions over the maximal cliques of the graph

$$p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C)$$

where the partition function $Z$ is given by

$$Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C)$$

which ensures that the distribution is correctly normalized.
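To make the role of the partition function concrete, it can be computed by brute force on a small model; the chain structure and potentials below are invented for illustration:

```python
# Sketch: brute-force partition function for a tiny pairwise MRF over three
# binary variables on a chain. The potential psi is illustrative.
import itertools
import math

def psi(xi, xj):
    # favour agreement between neighbouring variables
    return math.exp(1.0 if xi == xj else -1.0)

edges = [(0, 1), (1, 2)]  # maximal cliques of a 3-node chain

Z = sum(
    math.prod(psi(x[i], x[j]) for i, j in edges)
    for x in itertools.product([0, 1], repeat=3)
)

def p(x):
    # normalized joint: product of clique potentials divided by Z
    return math.prod(psi(x[i], x[j]) for i, j in edges) / Z

assert abs(sum(p(x) for x in itertools.product([0, 1], repeat=3)) - 1.0) < 1e-12
```

Brute-force summation is exponential in the number of variables, which is exactly why the inference algorithms later in these notes matter.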

To make a formal connection between conditional independence and factorization for undirected graphs, we restrict attention to potential functions that are strictly positive, so that we can express them as

$$\psi_C(\mathbf{x}_C) = \exp\{-E(\mathbf{x}_C)\}$$

where $E(\mathbf{x}_C)$ is called the *energy function*, and the exponential representation is called the *Boltzmann distribution* or *Gibbs distribution*.

A graph is said to be a D map (for dependency map) of a distribution if every conditional independence statement satisfied by the distribution is reflected in the graph.

Thus a completely disconnected graph (no links) will be a trivial D map for any distribution.

A graph is said to be an I map (for independence map) of a distribution if every conditional independence statement implied by the graph is satisfied by that distribution.

Clearly a fully connected graph will be a trivial I map for any distribution.

If it is the case that every conditional independence property of the distribution is reflected in the graph, and vice versa, then the graph is said to be a perfect map.

## Inference in Graphical Models

### Notation

• $x$ denotes a variable node (can be a scalar, vector, matrix, etc.)
• $\mathbf{x}_s$ denotes a subset of variables
• $\mathrm{ne}(x)$ denotes the set of factor nodes that are neighbours of $x$
• $X_s$ denotes the set of variables in the subtree connected to the variable node $x$ via the factor node $f_s$
• $f_s$ depends on the variables $x, x_1, \dots, x_M$, which we denote $\mathbf{x}_s = \{x, x_1, \dots, x_M\}$
• $F_s(x, X_s)$ represents the product of all the factors in the group associated with factor $f_s$
• Message from factor node $f_s$ to variable node $x$:

$$\mu_{f_s \to x}(x) \equiv \sum_{X_s} F_s(x, X_s)$$

• Message from variable node $x_m$ to factor node $f_s$:

$$\mu_{x_m \to f_s}(x_m) \equiv \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \sum_{X_{ml}} F_l(x_m, X_{ml})$$

where $F_l(x_m, X_{ml})$ denotes the product of the factors in the subtree connected to the factor node $f_l$ via the variable node $x_m$

### Factor graphs

• Can write the joint distribution over a set of variables in the form of a product of factors

$$p(\mathbf{x}) = \prod_{s} f_s(\mathbf{x}_s)$$

where $\mathbf{x}_s$ denotes a subset of the variables.

• Directed graphs represent special cases in which factors are local conditional distributions
• Undirected graphs are a special case in which factors are potential functions over the maximal cliques (the normalizing coefficient $1/Z$ can be viewed as a factor defined over the empty set of variables)

#### Converting distribution expressed in terms of undirected graph to factor graph

1. Create variable nodes corresponding to the nodes in the original undirected graph
2. Create additional factor nodes corresponding to the maximal cliques
3. Factors are set equal to the clique potentials

There might be several factor graphs corresponding to the same undirected graph, i.e. this construction is not unique.
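The three conversion steps can be sketched as a small data-structure transformation; the bipartite representation (dicts and tuples) and the factor names `f0`, `f1` are one arbitrary choice, not a fixed convention:

```python
# Sketch: undirected graph -> factor graph. One factor node per maximal
# clique; each factor connects to exactly the variables of its clique.
cliques = [("a", "b"), ("b", "c")]  # maximal cliques of a 3-node chain a-b-c

# step 1: variable nodes mirror the original graph's nodes
variable_nodes = sorted({v for clique in cliques for v in clique})

# step 2: one additional factor node per maximal clique
factor_nodes = {f"f{i}": clique for i, clique in enumerate(cliques)}

# step 3: (in a full implementation, each factor would be set equal to the
# corresponding clique potential; here we only build the bipartite structure)
edges = [(f, v) for f, clique in factor_nodes.items() for v in clique]

print(variable_nodes)  # ['a', 'b', 'c']
print(edges)           # [('f0', 'a'), ('f0', 'b'), ('f1', 'b'), ('f1', 'c')]
```

Non-uniqueness shows up here too: e.g. splitting a clique potential into a product of smaller factors would give a different, equally valid factor graph for the same undirected graph.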

#### Sum-product algorithm

• Problem: evaluate local marginals over nodes or subsets of nodes
• Applicable to tree-structured graphs
• Exploits the structure of the graph to achieve:
1. an efficient, exact inference algorithm for finding marginals
2. efficient sharing of computations in situations where several marginals are required
• Summarized: barber2012

Belief propagation is a special case of the sum-product algorithm, which performs exact inference on directed graphs without loops.

For actual implementations it is useful to use the logarithmic version of the messages; see barber2012.

##### Derivation
• Marginal is obtained by summing the joint distribution over all variables except $x$, so that

$$p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x})$$

where $\mathbf{x} \setminus x$ denotes the set of variables in $\mathbf{x}$ with the variable $x$ omitted.

• Joint distribution of the set of variables can be written as a product of the form

$$p(\mathbf{x}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s)$$

• Rewriting the marginal above using the joint distribution notation:

$$p(x) = \prod_{s \in \mathrm{ne}(x)} \left[ \sum_{X_s} F_s(x, X_s) \right] = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)$$

i.e. the marginal is given by the product of all incoming messages arriving at node $x$.

• Each factor $F_s(x, X_s)$ is described by a factor (sub-)graph and so can itself be factorized:

$$F_s(x, X_s) = f_s(x, x_1, \dots, x_M)\, G_1(x_1, X_{s1}) \cdots G_M(x_M, X_{sM})$$

where, for convenience, we have denoted the variables associated with factor $f_s$, in addition to $x$, by $x_1, \dots, x_M$.

• Message $\mu_{f_s \to x}(x)$ can be rewritten:

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \dots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$$

i.e. marginalizing over all the variables which $f_s$ depends on, weighted by the "messages" which are being sent from those variables to the factor $f_s$. Thus, to evaluate the message sent by a factor node to a variable node along the link connecting them:

1. Take product of incoming messages along all other links coming into the factor node
2. Multiply by the factor associated with that node
3. Marginalize over all of the variables associated with the incoming messages
• Message $\mu_{x_m \to f_s}(x_m)$ can be written:

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

i.e. to evaluate the message passed from a variable to an adjacent factor we simply take the product of the incoming messages along all other links.

• Messages can be computed recursively
• Start: if the leaf node is a:
• variable: message it sends along its one and only link is given by

$$\mu_{x \to f}(x) = 1$$

• factor: message it sends is simply

$$\mu_{f \to x}(x) = f(x)$$

• Finally, if we wish to find the marginal distributions $p(\mathbf{x}_s)$ associated with the sets of variables belonging to each factor:

$$p(\mathbf{x}_s) = f_s(\mathbf{x}_s) \prod_{i \in \mathrm{ne}(f_s)} \mu_{x_i \to f_s}(x_i)$$

##### Summarized

The sum-product algorithm can then be summarized as:

1. A variable node $x$ is chosen as the root of the factor graph
2. Messages are initiated at the leaves of the graph; if a leaf is a:
• variable: message it sends along its one and only link is given by

$$\mu_{x \to f}(x) = 1$$

• factor: message it sends is simply

$$\mu_{f \to x}(x) = f(x)$$

3. Message passing steps are applied recursively until messages have been propagated along every link:

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \dots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$$

and

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

4. Once the root node has received messages from all of its neighbours, the marginal can be evaluated by:

$$p(x) = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)$$

or, if one wants the marginal distribution associated with the sets of variables belonging to each of the factors:

$$p(\mathbf{x}_s) = f_s(\mathbf{x}_s) \prod_{i \in \mathrm{ne}(f_s)} \mu_{x_i \to f_s}(x_i)$$

5. If we started from a:
• directed graph: the marginals are already correctly normalized
• undirected graph: we started with an unknown normalization constant $Z$. From steps 1-4 we now have an unnormalized marginal $\widetilde{p}(x)$; summing it over the single variable $x$ then gives us the normalization constant:

$$Z = \sum_{x} \widetilde{p}(x)$$

We could find the marginals for every variable node in the graph by first passing messages from the leaves to the root, as above, and then, once the root has received all of its messages, passing messages from the root back down to the leaves.

Then we will have computed the messages along each of the links / edges in both directions, hence we can simply take the product of the incoming messages at any node to obtain its marginal.
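The full schedule (leaf messages, factor-to-variable and variable-to-factor updates, normalization) can be sketched on a three-variable chain and checked against brute-force marginalization; the factor tables below are made up for illustration:

```python
# Sketch: sum-product on the chain x0 - f01 - x1 - f12 - x2, computing the
# marginal of the middle variable x1. Factor tables are illustrative.
f01 = {(x0, x1): [[0.9, 0.1], [0.2, 0.8]][x0][x1] for x0 in (0, 1) for x1 in (0, 1)}
f12 = {(x1, x2): [[0.6, 0.4], [0.3, 0.7]][x1][x2] for x1 in (0, 1) for x2 in (0, 1)}

# leaf variable nodes send the unit message
mu_x0_to_f01 = {x0: 1.0 for x0 in (0, 1)}
mu_x2_to_f12 = {x2: 1.0 for x2 in (0, 1)}

# factor-to-variable: multiply by the factor, sum out the other variable
mu_f01_to_x1 = {x1: sum(f01[x0, x1] * mu_x0_to_f01[x0] for x0 in (0, 1)) for x1 in (0, 1)}
mu_f12_to_x1 = {x1: sum(f12[x1, x2] * mu_x2_to_f12[x2] for x2 in (0, 1)) for x1 in (0, 1)}

# marginal at x1: product of all incoming messages, normalized by Z
tilde_p = {x1: mu_f01_to_x1[x1] * mu_f12_to_x1[x1] for x1 in (0, 1)}
Z = sum(tilde_p.values())
p_x1 = {x1: tilde_p[x1] / Z for x1 in (0, 1)}

# brute-force check: sum the unnormalized joint over x0 and x2
brute = {x1: sum(f01[x0, x1] * f12[x1, x2] for x0 in (0, 1) for x2 in (0, 1))
         for x1 in (0, 1)}
bZ = sum(brute.values())
assert all(abs(p_x1[x1] - brute[x1] / bZ) < 1e-12 for x1 in (0, 1))
```

On a chain this costs a handful of multiplications per link, while the brute-force sum grows exponentially with the number of variables.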

#### Max-sum algorithm

• Problem: find the most probable state or sequence:

$$\mathbf{x}^{\max} = \operatorname*{arg\,max}_{\mathbf{x}} p(\mathbf{x})$$

which will have the joint probability

$$p(\mathbf{x}^{\max}) = \max_{\mathbf{x}} p(\mathbf{x})$$

• Summarized: barber2012
##### Derivation
• Will make use of

$$\max(ab, ac) = a \max(b, c)$$

which holds if $a \geq 0$ (always true for factors in graphical models!)

• Work with $\ln p(\mathbf{x})$ instead (for numerical purposes), i.e.

$$\ln \left( \max_{\mathbf{x}} p(\mathbf{x}) \right) = \max_{\mathbf{x}} \ln p(\mathbf{x})$$

where the distributive property is preserved since

$$\max(a + b, a + c) = a + \max(b, c)$$

hence we replace products with sums.
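A minimal sketch of max-sum in log space on a hypothetical three-variable chain, with back-tracking to recover the most probable joint state (the factor tables are invented):

```python
# Sketch: max-sum (Viterbi-style) on the chain x0 - f01 - x1 - f12 - x2,
# working with log-factors and back-pointers. Tables are illustrative.
import math

f01 = [[0.9, 0.1], [0.2, 0.8]]  # f01[x0][x1]
f12 = [[0.6, 0.4], [0.3, 0.7]]  # f12[x1][x2]

log_f01 = [[math.log(v) for v in row] for row in f01]
log_f12 = [[math.log(v) for v in row] for row in f12]

# forward pass: max (not sum) over earlier variables, keeping arg-maxes
msg1 = [max(log_f01[x0][x1] for x0 in (0, 1)) for x1 in (0, 1)]
back1 = [max((0, 1), key=lambda x0: log_f01[x0][x1]) for x1 in (0, 1)]
msg2 = [max(msg1[x1] + log_f12[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]
back2 = [max((0, 1), key=lambda x1: msg1[x1] + log_f12[x1][x2]) for x2 in (0, 1)]

# back-track from the best final state to the jointly most probable config
x2 = max((0, 1), key=lambda v: msg2[v])
x1 = back2[x2]
x0 = back1[x1]
print((x0, x1, x2))  # (1, 1, 1)
```

Note the back-pointers: taking the individually most probable state of each variable separately can give a configuration that is not the jointly most probable one.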

## Causality

In setting any variable into a particular state we need to surgically remove all parental links of that variable.

This is called the *do operator*, and it contrasts observational ("see") inference with causal ("make" or "do") inference.

Formally, let all variables be written $X = X_c \cup X_{\setminus c}$, in terms of the intervention variables $X_c$ and the non-intervention variables $X_{\setminus c}$.

For a Belief Network $p(X) = \prod_i p(x_i \mid \mathrm{pa}(x_i))$, inferring the effect of setting variables $X_c$ in states $\mathbf{x}_c$ is equivalent to standard evidential inference in the *post-intervention distribution*:

$$p(X_{\setminus c} \mid \mathrm{do}(X_c = \mathbf{x}_c)) = \prod_{i \notin c} p(x_i \mid \mathrm{pa}(x_i))$$

where any parental variable included in the intervention set is set to its intervention state. An alternative notation is

$$p(X_{\setminus c} \,\|\, \mathbf{x}_c)$$

Intervention here refers to the "act of intervening", as in making some variable take on a particular state, i.e. we are interfering with the "total" state.
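A small sketch contrasting "see" and "do" on a hypothetical two-node network $a \to b$ with made-up tables: conditioning on $b$ changes our beliefs about its cause $a$, while intervening on $b$ (deleting the factor $p(b \mid a)$, i.e. cutting the parental link) does not:

```python
# Sketch: observational vs interventional inference in the network a -> b.
# Tables are illustrative binary CPTs.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}  # key: (b, a)

# observational: p(a | b=1) via Bayes' rule
num = {a: p_a[a] * p_b_given_a[(1, a)] for a in (0, 1)}
p_a_given_b1 = {a: num[a] / sum(num.values()) for a in (0, 1)}

# interventional: p(a | do(b=1)) — surgically remove the factor p(b | a),
# so the post-intervention distribution over a is just the prior p(a)
p_a_do_b1 = p_a

print(p_a_given_b1)  # conditioning on b shifts belief towards a = 1
print(p_a_do_b1)     # intervening on b leaves a untouched: b is not a cause of a
```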

# Bibliography

• [barber2012] Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press (2012).