# Monte Carlo methods

## Overview

- Monte Carlo methods attempt to solve two types of problems:
  - Generate samples $\{ x^{(r)} \}_{r = 1}^{R}$ from a given probability distribution $p(x)$
  - Estimate expectations of functions under this distribution, e.g. $\mathbb{E}_p[f] = \int f(x)\, p(x)\, \mathrm{d}x$

- MCMC methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution.
- The state of the chain after a number of steps is then used as a sample from the desired distribution.

*Really* good stuff about MCMC methods can be found in Chapters 29 and 30 of mackay2003information.

## Notation

- $\hat{\mathbb{E}}[f]$ denotes the *empirical* expectation of $f$
- $Z$ denotes the *partition function* or *normalization factor* of a distribution $\tilde{p}$, i.e. we assume that $p(x) = \tilde{p}(x) / Z$ defines a probability distribution.

## Importance Sampling (IS)

Consider the following expectation:

$$\mathbb{E}_p[f] = \int f(x)\, p(x)\, \mathrm{d}x = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, \mathrm{d}x = \mathbb{E}_q[w f]$$

where we've let $w(x) = p(x) / q(x)$. Using Monte Carlo methods, we can approximate this expectation

$$\hat{\mathbb{E}}[f] = \frac{1}{R} \sum_{r=1}^{R} f\big(x^{(r)}\big)\, w\big(x^{(r)}\big), \qquad x^{(r)} \sim q$$

or equivalently,

$$\hat{\mathbb{E}}[f] = \frac{1}{R} \sum_{r=1}^{R} f\big(x^{(r)}\big)\, w_r, \qquad w_r := \frac{p\big(x^{(r)}\big)}{q\big(x^{(r)}\big)}$$

This is the **importance sampling** estimator of $\mathbb{E}_p[f]$ and is unbiased.

The importance sampling *problem* is then to find some *biasing* distribution $q$ such that the variance of the importance sampling estimator is lower than the variance of the plain Monte Carlo estimator.

We can replace $p$ and $q$ with the unnormalized $\tilde{p}$ and $\tilde{q}$, assuming

$$p(x) = \frac{\tilde{p}(x)}{Z_p}, \qquad q(x) = \frac{\tilde{q}(x)}{Z_q}$$

Then

$$\hat{\mathbb{E}}[f] = \frac{\sum_{r=1}^{R} \tilde{w}_r\, f\big(x^{(r)}\big)}{\sum_{r=1}^{R} \tilde{w}_r}$$

where

$$\tilde{w}_r = \frac{\tilde{p}\big(x^{(r)}\big)}{\tilde{q}\big(x^{(r)}\big)}$$

Since

$$\frac{1}{R} \sum_{r=1}^{R} \tilde{w}_r \overset{R \to \infty}{\longrightarrow} \frac{Z_p}{Z_q}$$

we have

$$\hat{\mathbb{E}}[f] \overset{R \to \infty}{\longrightarrow} \frac{1}{R} \sum_{r=1}^{R} \frac{Z_q}{Z_p}\, \tilde{w}_r\, f\big(x^{(r)}\big) = \frac{1}{R} \sum_{r=1}^{R} w_r\, f\big(x^{(r)}\big)$$

with $w_r = \frac{Z_q}{Z_p} \tilde{w}_r = \frac{p(x^{(r)})}{q(x^{(r)})}$. That is, as $R \to \infty$, taking the *unnormalized* weights $\tilde{w}_r$ we arrive at the same result as if we used the proper weights $w_r$.
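A minimal numeric sketch of the self-normalized estimator above. The target, proposal, and test function here are my own illustrative choices: the target is a standard normal known only up to a constant, the proposal is a wider Gaussian, and $f(x) = x^2$ so the true expectation is 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: target p = N(0, 1) known only up to a constant,
# proposal q = N(0, 2^2), f(x) = x^2 so that E_p[f] = 1.
p_tilde = lambda x: np.exp(-0.5 * x**2)                        # unnormalized target
q_pdf = lambda x: np.exp(-0.5 * (x / 2) ** 2) / (2 * np.sqrt(2 * np.pi))
f = lambda x: x**2

R = 200_000
x = rng.normal(0.0, 2.0, size=R)       # x^(r) ~ q
w_tilde = p_tilde(x) / q_pdf(x)        # unnormalized weights w~_r

# self-normalized estimate: sum_r w~_r f(x^(r)) / sum_r w~_r
estimate = np.sum(w_tilde * f(x)) / np.sum(w_tilde)
print(estimate)  # ≈ 1.0
```

Note that only the *unnormalized* target is evaluated; the normalization constant is absorbed by the self-normalization.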

### Issues

#### Large number of dimensions

In a large number of dimensions, one will often have the problem that a *small* number of large weights dominate the sampling process. In the case where large values of $f(x)\, p(x)$ lie in regions of low probability under $q$, this becomes a huge issue.

## Rejection Sampling

- Have a target $\tilde{p}(x)$ which is too complex to sample from directly
- We have a *simpler* density $q(x)$ which we can evaluate (up to a factor $Z_q$) and draw samples from
- Assume that we know the value of a constant $c$ s.t. $c\, q(x) \ge \tilde{p}(x)$ for all $x$

- For $r = 1, \dots, R$:
  - Generate $x \sim q$
  - Generate $u \sim \mathrm{Uniform}\big(0, c\, q(x)\big)$
  - *Accept* $x$ if $u \le \tilde{p}(x)$, *reject* if $u > \tilde{p}(x)$

Then the accepted samples are distributed according to $p(x) = \tilde{p}(x) / Z_p$.

We're simply sampling from a distribution we can compute, and then keeping the samples under $c\, q$ which *also* fell under $\tilde{p}$. Since we then know the "area" under $c\, q$, we know that the area under $\tilde{p}$ is simply that area times the ratio of points accepted.

This is basically a more general version of box sampling, where instead of having some distribution $q$ to sample from, we simply use an n-dimensional box (or rather, we let $q$ be a multi-dimensional uniform distribution).
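The procedure above can be sketched numerically. The target, proposal, and bound $c$ below are my own illustrative choices: an unnormalized standard normal as target, a wider Gaussian as proposal, and $c = 5.1$, which dominates the ratio $\tilde{p}(x)/q(x)$ everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: unnormalized target p~(x) = exp(-x^2 / 2), so the
# true Z_p = sqrt(2 pi); proposal q = N(0, 2^2); c = 5.1 satisfies
# c q(x) >= p~(x) for all x (the ratio peaks at 2 sqrt(2 pi) ≈ 5.01).
p_tilde = lambda x: np.exp(-0.5 * x**2)
q_pdf = lambda x: np.exp(-0.5 * (x / 2) ** 2) / (2 * np.sqrt(2 * np.pi))
c = 5.1

R = 100_000
x = rng.normal(0.0, 2.0, size=R)       # x ~ q
u = rng.uniform(0.0, c * q_pdf(x))     # u ~ Uniform(0, c q(x))
accepted = x[u <= p_tilde(x)]          # keep points falling under p~

print(accepted.mean(), accepted.std())  # ≈ 0.0, ≈ 1.0
print(len(accepted) / R * c)            # acceptance rate * c ≈ Z_p ≈ 2.5066
```

The last line illustrates the "area" argument: the acceptance rate estimates $Z_p / c$.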

### Issues

#### Large number of dimensions

In a large number of dimensions it's very likely that the requirement

$$c\, q(x) \ge \tilde{p}(x) \quad \forall x$$

forces $c$ to be so *huge* that acceptances become very rare.

Even finding such a $c$ in a high-dimensional space might be incredibly difficult.

## Markov Chain Monte Carlo (MCMC)

### Markov chains

A Markov chain is said to be reversible if there is a probability distribution $\pi$ over its states such that

$$\pi(x)\, P_n(x' \mid x) = \pi(x')\, P_n(x \mid x')$$

for all times $n$ and all states $x$ and $x'$. This condition is known as the **detailed balance condition** (some books call it the **local balance equation**).
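A small numeric check of the consequence we rely on: detailed balance implies stationarity. The 3-state chain below is my own illustrative example.

```python
import numpy as np

# A 3-state transition matrix P satisfying detailed balance w.r.t. pi,
# which implies pi is a stationary distribution: pi P = pi.
pi = np.array([0.2, 0.3, 0.5])
P = np.array([
    [0.70, 0.15, 0.15],
    [0.10, 0.80, 0.10],
    [0.06, 0.06, 0.88],
])

flows = pi[:, None] * P             # flows[i, j] = pi_i * P_ij
assert np.allclose(flows, flows.T)  # detailed balance: pi_i P_ij = pi_j P_ji
print(pi @ P)                       # [0.2 0.3 0.5] -- pi is stationary
```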

### Methods

#### Metropolis-Hastings

The family of **Metropolis-Hastings** MCMC methods all involve designing a Markov process using two sub-steps:

- *Propose.* Use some **proposal distribution** $g(x' \mid x)$ to propose the next state $x'$ given the current state $x$.
- *Acceptance-rejection.* Use some **acceptance distribution** $A(x', x)$ and accept / reject depending on this conditional distribution.

The transition probability is then

$$P(x' \mid x) = g(x' \mid x)\, A(x', x)$$

Inserting into the detailed balance equation for a Markov chain we have

$$\frac{A(x', x)}{A(x, x')} = \frac{p(x')\, g(x \mid x')}{p(x)\, g(x' \mid x)}$$

This is satisfied by (for example) the *Metropolis* choice:

$$A(x', x) = \min\left( 1, \frac{p(x')\, g(x \mid x')}{p(x)\, g(x' \mid x)} \right)$$
The **Metropolis-Hastings algorithm** in its full glory is then:

- Initialize
  - Pick initial state $x_0$
  - Set $t = 0$
- Iterate
  - **Generate**: randomly generate a candidate state $x'$ according to $g(x' \mid x_t)$
  - **Compute**: compute the acceptance probability $A(x', x_t)$
  - **Accept or Reject**:
    - Generate a uniform random number $u \in [0, 1]$
    - If $u \le A(x', x_t)$, *accept* the new state and set $x_{t+1} = x'$
    - If $u > A(x', x_t)$, *reject* the new state and set $x_{t+1} = x_t$
  - **Increment**: set $t = t + 1$

The empirical distribution of the states $x_0, x_1, \dots, x_T$ will approach $p(x)$.
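The steps above can be sketched for a simple unnormalized target (my own choice: $\tilde{p}(x) = e^{-x^2/2}$, a standard normal) with a symmetric Gaussian random-walk proposal, so the $g$-ratio cancels and we get the plain Metropolis acceptance rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target p~(x) = exp(-x^2 / 2); symmetric random-walk
# proposal g(x'|x) = N(x, step_size^2), so g cancels in A.
log_p_tilde = lambda x: -0.5 * x**2

def metropolis(n_steps, step_size=1.0, x0=0.0):
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + rng.normal(0.0, step_size)        # generate candidate
        log_A = log_p_tilde(x_prop) - log_p_tilde(x)   # log acceptance prob
        if np.log(rng.uniform()) <= log_A:             # accept ...
            x = x_prop
        samples[t] = x                                 # ... else keep x_t
    return samples

samples = metropolis(100_000)[10_000:]  # discard burn-in
print(samples.mean(), samples.var())    # ≈ 0.0, ≈ 1.0
```

Working in log space avoids underflow when $\tilde{p}$ is tiny, which matters for the log-likelihood targets discussed next.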

When using MCMC methods for inference, it's important to remember that what we evaluate is usually the *log-likelihood* of the data, so ratios of probabilities become differences of log-likelihoods.

This can also be seen by considering the posterior distribution of the parameters given the data $\mathcal{D}$, and considering the ratio between two different parameters, say $\theta$ and $\theta'$. Then

$$\frac{p(\theta' \mid \mathcal{D})}{p(\theta \mid \mathcal{D})} = \frac{p(\mathcal{D} \mid \theta')\, p(\theta')}{p(\mathcal{D} \mid \theta)\, p(\theta)}$$

since the evidence $p(\mathcal{D})$ cancels.

Letting the prior distribution on $\theta$ be the proposal distribution in the Metropolis-Hastings algorithm, we see that in this case we're simply taking steps towards the parameter which maximizes the likelihood under some prior.

Further, one could make the prior ratio $\frac{p(\theta')}{p(\theta)} = 1$ by using a uniform prior (if we know that $\theta$ lies in some bounded region).

The choice of $g$ is often referred to as a **(transition) kernel** when talking about different Metropolis-Hastings methods.

#### Gibbs

**Gibbs sampling** or a **Gibbs sampler** is a special case of the Metropolis-Hastings algorithm. The point of a **Gibbs sampler** is that given a multivariate distribution it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution.

Suppose we want to sample from a joint distribution $p(x_1, \dots, x_n)$. Denote the k-th sample by $\mathbf{x}^{(k)} = \big( x_1^{(k)}, \dots, x_n^{(k)} \big)$. We proceed as follows:

- **Initial value**: pick $\mathbf{x}^{(0)}$
- **Sample**. For $i = 1, \dots, n$:
  - Sample $x_i^{(k+1)} \sim p\big( x_i \mid x_1^{(k+1)}, \dots, x_{i-1}^{(k+1)}, x_{i+1}^{(k)}, \dots, x_n^{(k)} \big)$. If sampling from this conditional distribution is not tractable, then see Step 2 in the Metropolis-Hastings algorithm for how to sample.
- **Repeat** the above step until we have the number of samples we want.
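As a concrete sketch (the model is my own illustrative choice): for a bivariate Gaussian with correlation $\rho$, both full conditionals are closed-form, $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\, 1 - \rho^2)$ and symmetrically for $x_2 \mid x_1$, so the Gibbs sweep is trivial to implement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bivariate Gaussian with unit marginals and correlation rho; both full
# conditionals are N(rho * other, 1 - rho^2).
rho = 0.8
n_samples = 50_000
x1, x2 = 0.0, 0.0
samples = np.empty((n_samples, 2))

for k in range(n_samples):
    # sweep the coordinates sequentially, each from its full conditional
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[k] = (x1, x2)

samples = samples[5_000:]                # discard burn-in
corr = np.corrcoef(samples.T)[0, 1]
print(corr)  # ≈ 0.8
```

Note that no accept/reject step is needed: in the Metropolis-Hastings view, proposals drawn from the full conditional are always accepted.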

##### Proof of convergence

Suppose we want to verify detailed balance for the transition from a state $x$ to a state $x'$ produced by updating the $i$-th coordinate. Let $x_{-i}$ denote all components of $x$ except $x_i$, i.e. the part of the state kept constant during the update.

Assuming $x'_{-i} = x_{-i}$, the Gibbs procedure gives us

$$T(x \to x') = p(x'_i \mid x_{-i})$$

Which we can rewrite as

$$p(x)\, T(x \to x') = p(x_i \mid x_{-i})\, p(x_{-i})\, p(x'_i \mid x_{-i})$$

by using the identity

$$p(x) = p(x_i \mid x_{-i})\, p(x_{-i})$$

We note that

$$p(x_{-i}) = p(x'_{-i})$$

and get

$$p(x)\, T(x \to x') = p(x_i \mid x'_{-i})\, p(x'_{-i})\, p(x'_i \mid x'_{-i})$$

Recall that $p(x') = p(x'_i \mid x'_{-i})\, p(x'_{-i})$, so we have

$$p(x)\, T(x \to x') = p(x')\, p(x_i \mid x'_{-i})$$

which is simply

$$p(x)\, T(x \to x') = p(x')\, T(x' \to x)$$

That is, the detailed balance equation is satisfied, hence $p$ is the *stationary distribution* of the resulting Markov chain.

Further, to ensure that this stationary distribution is actually reached, observe that (assuming $p$ has full support) we can reach any state $x'$ from any state $x$ in some number of steps, and that there are no "isolated" states, hence the chain constructed by the Gibbs transitions is an *aperiodic* and *irreducible* Markov chain, which is guaranteed to converge to its stationary distribution $p$.

In our definition of Gibbs sampling, we sample the different dimensions *sequentially*, but the proof above does not assume any particular order; the coordinate to update could instead be chosen according to some arbitrary distribution. Hence Gibbs sampling also works when not performing the sampling sequentially, but the sequential version is the standard approach and usually what people refer to when talking about Gibbs sampling.

The notation $\pi$ instead of $p$ is often used for the stationary distribution when talking about Markov chains. In this case we're mainly interested in sampling from a distribution $p$, therefore we instead let $p$ denote the stationary distribution of the chain.

In a Markov Random Field (MRF) we only need to condition on the immediate neighbours, so the conditional distribution can be written

$$p(x_i \mid x_{-i}) = p\big( x_i \mid \mathrm{ne}(x_i) \big)$$

where $\mathrm{ne}(x_i)$ denotes the neighbours of $x_i$ in the graph.
##### Convergence bound

If $P$ is the transition matrix of the Gibbs chain, then the convergence rate of the periodic Gibbs sampler for the stationary distribution $p$ of an MRF is bounded in terms of

- $\mu$, the arbitrary initial distribution
- the *distance in variation*

  $$d_V(\mu, \nu) = \frac{1}{2} \sum_{x} \big| \mu(x) - \nu(x) \big|$$

  for distributions $\mu$ and $\nu$ on a finite state space

### Finding a starting point

#### Burn-In

- Throw away some iterations at the beginning of an MCMC run, since the early samples still depend strongly on the (arbitrary) initial state rather than the stationary distribution

### Issues

#### Dependent samples

The samples of a Markov Chain are in general *dependent*, since we're conditioning on the *current* state when sampling the *proposal* state.

## Partition function estimation

### Notation

- $\tilde{p}(x)$ is an *unnormalized* probability distribution
- Probability distribution

  $$p(x) = \frac{1}{Z} \tilde{p}(x)$$

  with

  $$Z = \int \tilde{p}(x)\, \mathrm{d}x$$

### Overview

The log-likelihood is given by

$$\ell(\theta) = \sum_{n=1}^{N} \log p(x_n; \theta) = \sum_{n=1}^{N} \log \tilde{p}(x_n; \theta) - N \log Z(\theta)$$

The gradient of $\ell(\theta)$ is then

$$\nabla_\theta \ell(\theta) = \sum_{n=1}^{N} \nabla_\theta \log \tilde{p}(x_n; \theta) - N\, \nabla_\theta \log Z(\theta)$$

These two terms are often referred to as:

- **positive phase** (the first term)
- **negative phase** (the second term)

The 2nd term in the summation can be written:

$$\nabla_\theta \log Z(\theta) = \frac{\nabla_\theta Z(\theta)}{Z(\theta)}$$

Interchanging integration and gradient, which we can do under suitable conditions:

$$\nabla_\theta Z(\theta) = \nabla_\theta \int \tilde{p}(x; \theta)\, \mathrm{d}x = \int \nabla_\theta \tilde{p}(x; \theta)\, \mathrm{d}x$$

Hence, using $\nabla_\theta \tilde{p} = \tilde{p}\, \nabla_\theta \log \tilde{p}$,

$$\nabla_\theta \log Z(\theta) = \frac{1}{Z(\theta)} \int \nabla_\theta \tilde{p}(x; \theta)\, \mathrm{d}x = \int \frac{\tilde{p}(x; \theta)}{Z(\theta)}\, \nabla_\theta \log \tilde{p}(x; \theta)\, \mathrm{d}x$$

That is,

$$\nabla_\theta \log Z(\theta) = \mathbb{E}_{p}\big[ \nabla_\theta \log \tilde{p}(x; \theta) \big]$$

And, employing MCMC algorithms, we can estimate this expectation as:

$$\nabla_\theta \log Z(\theta) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_\theta \log \tilde{p}\big( x^{(m)}; \theta \big)$$

i.e. $x^{(m)} \sim p$ are MCMC samples, which only require evaluating the *unnormalized* $\tilde{p}$.

Finally, this gives us a more tractable gradient estimate of $\ell(\theta)$:

$$\nabla_\theta \ell(\theta) \approx \sum_{n=1}^{N} \nabla_\theta \log \tilde{p}(x_n; \theta) - \frac{N}{M} \sum_{m=1}^{M} \nabla_\theta \log \tilde{p}\big( x^{(m)}; \theta \big)$$

The exact same argument holds for discrete random variables too, simply exchanging the integral for a summation.
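A quick numeric sanity check of the identity $\nabla_\theta \log Z = \mathbb{E}_p[\nabla_\theta \log \tilde{p}]$, on a toy model of my own choosing where everything is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: p~(x; theta) = exp(-theta x^2), for which Z(theta) =
# sqrt(pi / theta), so d/dtheta log Z = -1 / (2 theta), while
# d/dtheta log p~(x; theta) = -x^2.
theta = 1.5

# exact samples from p = N(0, 1 / (2 theta)) stand in for the MCMC samples x^(m)
x = rng.normal(0.0, np.sqrt(1.0 / (2.0 * theta)), size=100_000)

mc_estimate = np.mean(-x**2)   # (1/M) sum_m d/dtheta log p~(x^(m); theta)
exact = -1.0 / (2.0 * theta)   # d/dtheta log Z(theta)
print(mc_estimate, exact)      # both ≈ -0.3333
```

In practice the exact sampler would be replaced by an MCMC chain on $\tilde{p}$, which is where the burn-in cost below comes from.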

#### Problems

- Using standard MCMC methods, we would have to perform a burn-in before drawing the samples used in the negative phase. This can be very expensive.
- k-step Contrastive Divergence (CD-k): instead of initializing the chain from random samples, initialize it from samples *from the dataset* and run only $k$ steps of the chain.

### Simulated Annealing (SA)

**Simulated annealing** is a method of moving from a tractable distribution to a distribution of interest via a sequence of intermediate distributions, used as an inexact method of handling *isolated* or *spurious* modes in Markov chain samplers. *Isolated* modes are a problem for MCMC methods since we're making small changes between each step, hence moving from one isolated mode to another is "difficult".

It employs a sequence of distributions $p_0, \dots, p_n$, with probability densities given by $f_0(x)$ to $f_n(x)$, in which each $f_j$ differs only slightly from $f_{j-1}$. The distribution $p_0$ is the one of interest. The distribution $p_n$ is designed so that the Markov chain used to sample from it allows movement between all regions of the state space.

A traditional scheme is to set

$$f_j(x) = f_0(x)^{\beta_j}$$

or the *geometric average*:

$$f_j(x) = f_0(x)^{\beta_j}\, f_n(x)^{1 - \beta_j}$$

where $1 = \beta_0 > \beta_1 > \dots > \beta_n = 0$.
An **annealing run** is started from some initial state, from which we first simulate a Markov chain designed to converge to $p_n$. Next we simulate some number of iterations of a Markov chain designed to converge to $p_{n-1}$, starting from the final state of the simulation for $p_n$. Continue in this fashion, using the final state of the simulation for $p_j$ as the initial state of the simulation for $p_{j-1}$, until we finally simulate the chain designed to converge to $p_0$.

Unfortunately, there is no reason to think that annealing will give the precisely correct result, in which each mode of is found with *exactly* the right probability.

This is of little consequence in an optimization context, where the final distribution is degenerate (at the maximum), but it is a serious flaw for the many applications in statistics and statistical physics that require a sample from a *non-degenerate* distribution. neal98_anneal_impor_sampl

#### TODO Example: SA applied to RBMs

### Annealed Importance Sampling (AIS)

- Combination of importance sampling and simulated annealing, where we perform importance sampling over the sequence of intermediate distributions which converge to the target distribution

#### Notation

$\{ p_j \}_{j=0}^{n}$ is a set of Gibbs distributions defined by $p_j(x) = \frac{1}{Z_j} f_j(x)$, where:

- $\mathcal{X}^{n}$ denotes the product space of the states $(x_0, \dots, x_{n-1})$, which we call the *extended state space*
- The use of square brackets, e.g. $P[x_0, \dots, x_{n-1}]$, denotes a distribution over the extended state space

We define

- $P_{\mathrm{fw}}$ (*forward distribution*), which creates samples from $p_{j-1}$ given a sample from $p_j$, moving from the tractable $p_n$ towards $p_0$
- $P_{\mathrm{bw}}$ (*reverse distribution*), which creates samples from $p_j$ given a sample from $p_{j-1}$, moving from $p_0$ towards $p_n$

#### Derivation

Assume we use the set of Gibbs distributions to construct a pair of distributions $P_{\mathrm{fw}}$ and $P_{\mathrm{bw}}$ on the extended state space with

- $P_{\mathrm{fw}}$ (*forward distribution*) creating samples from $p_{j-1}$ given a sample from $p_j$
- $P_{\mathrm{bw}}$ (*reverse distribution*) creating samples from $p_j$ given a sample from $p_{j-1}$

Then

$$\frac{P_{\mathrm{bw}}[x_0, \dots, x_{n-1}]}{P_{\mathrm{fw}}[x_0, \dots, x_{n-1}]} = w(x_0, \dots, x_{n-1})$$

where $w$ is the importance weight of the path. Then, for a function $f$, we have

$$\mathbb{E}_{P_{\mathrm{fw}}}\big[ w\, f \big] = \mathbb{E}_{P_{\mathrm{bw}}}\big[ f \big]$$

thus the expectation of $f$ under the intractable $P_{\mathrm{bw}}$ can be estimated from samples from $P_{\mathrm{fw}}$.

Then, letting $f = 1$, i.e. the constant function, we get

$$\mathbb{E}_{P_{\mathrm{fw}}}\big[ w \big] = 1$$

**(Generalized) Annealed Importance Sampling (AIS)** is a method of estimating the partition function using the following relation:

$$Z_0 = Z_n\, \mathbb{E}_{P_{\mathrm{fw}}}\left[ \prod_{j=1}^{n} \frac{f_{j-1}(x_{j-1})}{f_j(x_{j-1})} \right]$$

where $Z_n$ is the partition function of the known "proxy" distribution $p_n$.

The "standard" **AIS** method is obtained by a specific choice of $P_{\mathrm{bw}}$, namely the reversal of the forward transition kernels.
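A minimal AIS sketch under illustrative assumptions of my own: geometric intermediate distributions between an unnormalized $\mathcal{N}(0, 1)$ proxy and an unnormalized $\mathcal{N}(0, 0.5^2)$ target, with one Metropolis step per intermediate level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate Z_0 for the unnormalized target f_0(x) = exp(-x^2 / (2 * 0.5^2)),
# whose true Z_0 = 0.5 * sqrt(2 pi) ≈ 1.2533, annealing from the
# unnormalized N(0, 1) density f_n with known Z_n = sqrt(2 pi).
log_f0 = lambda x: -0.5 * x**2 / 0.25   # unnormalized target (sigma = 0.5)
log_fn = lambda x: -0.5 * x**2          # unnormalized proxy N(0, 1)
Z_n = np.sqrt(2 * np.pi)

betas = np.linspace(0.0, 1.0, 51)       # schedule from f_n (beta=0) to f_0
n_runs = 2_000
log_w = np.zeros(n_runs)
x = rng.normal(0.0, 1.0, size=n_runs)   # initial samples from p_n

for b_prev, b in zip(betas[:-1], betas[1:]):
    # weight increment: log f_b(x) - log f_{b_prev}(x) at the current x
    log_w += (b - b_prev) * (log_f0(x) - log_fn(x))
    # one Metropolis step leaving the intermediate f_b invariant
    log_fb = lambda y, b=b: b * log_f0(y) + (1 - b) * log_fn(y)
    x_prop = x + rng.normal(0.0, 0.5, size=n_runs)
    accept = np.log(rng.uniform(size=n_runs)) <= log_fb(x_prop) - log_fb(x)
    x = np.where(accept, x_prop, x)

Z0_estimate = Z_n * np.mean(np.exp(log_w))
print(Z0_estimate)  # ≈ 1.25
```

Accumulating the weights in log space avoids overflow; with more widely separated endpoint distributions one would also average in log space (e.g. via log-sum-exp).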

#### Bennett's Acceptance Ratio (BAR) method

## Bibliography

- [mackay2003information] MacKay, David J. C., *Information Theory, Inference and Learning Algorithms*, Cambridge University Press (2003).
- [neal98_anneal_impor_sampl] Neal, Radford M., *Annealed Importance Sampling*, *CoRR* (1998). link.