Monte Carlo methods

Overview

  • Monte Carlo methods attempt to solve two types of problems:
    1. Generate samples $\{ \mathbf{x}^{(r)} \}_{r = 1}^{R}$ from a given probability distribution $p(\mathbf{x})$
    2. Estimate expectations of functions under this distribution, e.g.

      \begin{equation*}
\Phi = \left\langle \phi(\mathbf{x}) \right\rangle = \int \dd^N{\mathbf{x}} p(\mathbf{x}) \phi(\mathbf{x})
\end{equation*}
  • MCMC methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution.
  • The state of the chain after a number of steps is then used as a sample from the desired distribution
  • Really good material on MCMC methods can be found in Chapters 29 and 30 of mackay2003information.

Notation

  • $\left\langle f \right\rangle$ denotes the expectation of $f$
  • $Z_p$ denotes the partition function or normalization factor of distribution $p$, i.e. we assume that $p(x) = \tilde{p}(x) / Z_p$ defines a probability distribution.

Importance Sampling (IS)

Consider the following

\begin{equation*}
\mathbb{E}_p[f(x)] = \int f(x) p(x) \dd{x} = \int f(x) \frac{p(x)}{q(x)} q(x) \dd{x} = \mathbb{E}_q[ f(x) w(x)]
\end{equation*}

where we've let $w(x) = \frac{p(x)}{q(x)}$. Sampling from $q$, we can approximate this expectation with a Monte Carlo estimate

\begin{equation*}
\mathbb{E}_p[f(x)] = \mathbb{E}_q[f(x) w(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) w(x_i), \quad x_i \sim q(x)
\end{equation*}

or equivalently,

\begin{equation*}
\mathbb{E}_p[f(x)] = \mathbb{E}_q[f(x) w(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q(x)
\end{equation*}

This is the importance sampling estimator of $\mathbb{E}_p[f(x)]$, and it is unbiased.

The importance sampling problem is then to find some biasing distribution $q$ such that the variance of the importance sampling estimator is lower than the variance for the general Monte Carlo estimate.

We can replace $p(x)$ and $q(x)$ with $\tilde{p}(x)$ and $\tilde{q}(x)$, assuming

\begin{equation*}
p(x) = \frac{1}{Z_p} \tilde{p}(x), \quad q(x) = \frac{1}{Z_q} \tilde{q}(x)
\end{equation*}

Then the self-normalized importance sampling estimator is

\begin{equation*}
\frac{\sum_{i=1}^{N} f(x_i) \tilde{w}(x_i)}{\sum_{i=1}^{N} \tilde{w}(x_i)}, \quad x_i \sim q(x)
\end{equation*}

where

\begin{equation*}
\tilde{w}(x_i) = \frac{\tilde{p}(x_i)}{\tilde{q}(x_i)}
\end{equation*}

Since, by the law of large numbers,

\begin{equation*}
\frac{1}{N} \sum_{i=1}^{N} \tilde{w}(x_i) \to \frac{Z_p}{Z_q} \quad \text{as} \quad N \to \infty
\end{equation*}

We have

\begin{equation*}
\begin{split}
  \lim_{N \to \infty} \frac{\sum_{i=1}^{N} f(x_i) \tilde{w}(x_i)}{\sum_{i=1}^{N} \tilde{w}(x_i)} &= \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f(x_i) \tilde{w}(x_i) \frac{Z_q}{Z_p} \\
  &= \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f(x_i) \frac{\tilde{p}(x_i)}{Z_p} \frac{Z_q}{\tilde{q}(x_i)} \\
  &= \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f(x_i) \frac{p(x_i)}{q(x_i)} \\
  &= \mathbb{E}_q[f(x) w(x)] \\
  &= \mathbb{E}_p[f(x)]
\end{split}
\end{equation*}

with $x_i \sim q(x)$. That is, as $N \to \infty$, using the unnormalized weights $\tilde{w}(x_i)$ we arrive at the same result as if we had used the proper weights $w(x_i)$.
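
As a minimal illustration of the two estimators above (a toy setup of my own, not taken from the references): the target is a standard normal known only up to its constant, the proposal is a wider Gaussian, and we estimate $\mathbb{E}_p[x^2] = 1$ as well as $Z_p$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: standard normal up to its constant, so Z_p = sqrt(2*pi).
p_tilde = lambda x: np.exp(-0.5 * x**2)
# Proposal: N(0, 2^2), which we can both sample from and evaluate exactly (Z_q = 1).
sigma_q = 2.0
q = lambda x: np.exp(-0.5 * (x / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))

f = lambda x: x**2                         # we estimate E_p[x^2] = 1

N = 100_000
x = rng.normal(0.0, sigma_q, size=N)       # x_i ~ q
w_tilde = p_tilde(x) / q(x)                # unnormalized weights

# Self-normalized importance sampling estimate (no Z_p needed):
print(np.sum(f(x) * w_tilde) / np.sum(w_tilde))      # ≈ 1.0

# The average unnormalized weight also estimates Z_p / Z_q = Z_p:
print(np.mean(w_tilde), np.sqrt(2 * np.pi))          # both ≈ 2.507
```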

Issues

Large number of dimensions

In a large number of dimensions, one will often have the problem that a small number of large weights $w(\mathbf{x}_k)$ dominate the sampling process. This becomes a serious issue when large values of $f(x)$ lie in low-probability regions.

Rejection Sampling

  • Have $p(x) = \tilde{p}(x) / Z_p$ which is too complex to sample from
  • We have a simpler density $q(x)$ which we can evaluate (up to a factor $Z_q$) and draw samples from
  • Assume that we know the value of a constant $c$ s.t.

    \begin{equation*}
c \ \tilde{q}(x) > \tilde{p}(x), \quad \forall x
\end{equation*}
  • For $k = 1, \dots, K$:
    1. Generate $x_k \sim q(x)$
    2. Generate $u \sim \text{Uniform}\Big(0, c \ \tilde{q}(x_k) \Big)$
    3. Accept $x_k$ if $u \le \tilde{p}(x_k)$, reject $x_k$ if $u > \tilde{p}(x_k)$
  • Then

    \begin{equation*}
Z_p = r_a \, c \, Z_q, \quad \text{where} \quad r_a = \frac{|\text{accepted}|}{K}
\end{equation*}

We're simply sampling from a distribution we can compute, and then computing the fraction of samples under $c \tilde{q}(x)$ which also fell under $\tilde{p}(x)$. Since we know the "area" under $c \tilde{q}(x)$, the area under $\tilde{p}(x)$ is simply this fraction times the area under $c \tilde{q}(x)$.
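
A minimal sketch of this procedure (my own toy example): a standard normal target known only up to its constant, a wider Gaussian proposal, and a check of the relation $Z_p = r_a \, c \, Z_q$ stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: standard normal up to its constant, so Z_p = sqrt(2*pi).
p_tilde = lambda x: np.exp(-0.5 * x**2)
# Proposal: N(0, 2^2); q is already normalized here, so Z_q = 1 and q_tilde = q.
sigma_q = 2.0
q_tilde = lambda x: np.exp(-0.5 * (x / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))
# c * q_tilde(x) >= p_tilde(x) for all x; the ratio is maximized at x = 0.
c = p_tilde(0.0) / q_tilde(0.0)

K = 100_000
x = rng.normal(0.0, sigma_q, size=K)       # x_k ~ q
u = rng.uniform(0.0, c * q_tilde(x))       # u ~ Uniform(0, c * q_tilde(x_k))
samples = x[u <= p_tilde(x)]               # accepted points are samples from p

r_a = len(samples) / K
print(r_a * c * 1.0, np.sqrt(2 * np.pi))   # Z_p = r_a * c * Z_q, both ≈ 2.507
```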

This is basically a more general approach to the box method of sampling, where, instead of having some distribution $q(x)$ to sample from, we simply use an n-dimensional box (or rather, we let $q(x)$ be a multi-dimensional $\text{Uniform}(0, 1)$).

Issues

Large number of dimensions

In a large number of dimensions it's very likely that the requirement

\begin{equation*}
c \ \tilde{q}(x) > \tilde{p}(x)
\end{equation*}

forces $c$ to be so huge that acceptances become very rare.

Even finding $c$ in a high-dimensional space might be incredibly difficult.

Markov Chain Monte Carlo (MCMC)

Markov chains

A Markov chain is said to be reversible if there is a probability distribution $\pi$ over its states such that

\begin{equation*}
\pi _{i}\Pr(X_{n+1}=j\mid X_{n}=i)=\pi _{j}\Pr(X_{n+1}=i\mid X_{n}=j)
\end{equation*}

for all times n and all states $i$ and $j$. This condition is known as the detailed balance condition (some books call it the local balance equation).
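
A quick numerical check of this (my own toy example): build a small reversible chain by a Metropolis-style choice of transition probabilities and verify that detailed balance holds and implies stationarity.

```python
import numpy as np

# Desired stationary distribution over 3 states.
pi = np.array([0.2, 0.3, 0.5])
n = len(pi)

# Reversible chain: propose one of the other states uniformly, accept with
# probability min(1, pi_j / pi_i); the rest of the mass stays on the diagonal.
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
    P[i, i] = 1.0 - P[i].sum()

M = pi[:, None] * P                          # M_ij = pi_i * Pr(j | i)
print(np.allclose(M, M.T))                   # detailed balance: True
print(np.allclose(pi @ P, pi))               # hence pi is stationary: True
```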

Methods

Metropolis-Hastings

The family of Metropolis-Hastings MCMC methods all involve designing a Markov process using two sub-steps:

  1. Propose. Use some proposal distribution $g(x' \mid x)$ to propose the next state $x'$ given the current state $x$.
  2. Acceptance-rejection. Use some acceptance distribution $A(x' \mid x)$ and accept / reject depending on this conditional distribution.

The transition probability is then

\begin{equation*}
P(x' \mid x) = g(x' \mid x) A(x' \mid x)
\end{equation*}

Inserting this into the detailed balance equation for a Markov chain we have

\begin{equation*}
\frac{A(x' \mid x)}{A(x \mid x')} = \frac{P(x')}{P(x)} \frac{g(x \mid x')}{g(x' \mid x)}
\end{equation*}

This is satisfied by (for example) the Metropolis choice:

\begin{equation*}
A(x' \mid x) = \min \Bigg( 1, \frac{P(x')}{P(x)} \frac{g(x \mid x')}{g(x' \mid x)} \Bigg)
\end{equation*}

The Metropolis-Hastings algorithm in its full glory is then:

  1. Initialize
    1. Pick initial state $x_0$
    2. Set $t = 0$
  2. Iterate
    1. Generate: randomly generate a candidate state $x'$ according to $g(x' \mid x)$
    2. Compute: compute acceptance probability $A(x' \mid x) = \min \big( 1, \frac{P(x')}{P(x)} \frac{g(x \mid x')}{g(x' \mid x)} \big)$
    3. Accept or Reject:
      1. Generate a uniform random number $u \in [0, 1]$
      2. If $u \le A(x' \mid x)$, accept new state and set $x_{t + 1} = x'$
      3. If $u > A(x' \mid x)$, reject new state and set $x_{t + 1} = x$
    4. Increment: set $t = t + 1$

The empirical distribution of the states $x_0, \dots, x_T$ will approach $P(x)$.
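
Below is a minimal random-walk Metropolis sketch (the two-mode toy target is my own, and everything is kept in log-space for numerical stability). Since the Gaussian random-walk proposal is symmetric, the ratio $g(x \mid x') / g(x' \mid x)$ cancels in the acceptance probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target P(x): equal mixture of N(-2, 1) and N(2, 1), in log-space.
log_p = lambda x: np.logaddexp(-0.5 * (x - 2.0)**2, -0.5 * (x + 2.0)**2)

def metropolis_hastings(log_p, x0, n_steps, step=1.0):
    """Random-walk Metropolis: g(x' | x) = N(x'; x, step^2) is symmetric, so
    A(x' | x) = min(1, P(x') / P(x))."""
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + step * rng.normal()
        log_accept = min(0.0, log_p(x_prop) - log_p(x))   # log A(x' | x)
        if np.log(rng.uniform()) <= log_accept:
            x = x_prop                                     # accept
        samples[t] = x                                     # on rejection keep old x
    return samples

samples = metropolis_hastings(log_p, x0=0.0, n_steps=50_000, step=2.0)
print(samples.mean(), samples.var())   # ≈ 0 and ≈ 5 for this mixture
```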

When using MCMC methods for inference, it's important to remember that $P(x)$ is usually the (unnormalized) posterior or likelihood of the data, which in practice is evaluated through its logarithm.

This can also be seen by considering the posterior distribution of the parameters given the data, and the ratio between two parameter values, say $\theta$ and $\theta_{\text{new}}$. Then

\begin{equation*}
\frac{P(\theta_{\text{new}} \mid x_1, \dots, x_n)}{P(\theta \mid x_1, \dots, x_n)} = \frac{P( x_1, \dots, x_n \mid \theta_{\text{new}})}{P( x_1, \dots, x_n \mid \theta)} \frac{P(\theta_{\text{new}})}{P(\theta)}
\end{equation*}

Letting the prior distribution on $\theta$ be the proposal distribution $g$ in the Metropolis-Hastings algorithm, we see that in this case we're simply taking steps towards parameters $\theta$ which increase the likelihood, weighted by the prior.

Further, one could make the ratio $\frac{P(\theta_{\text{new}})}{P(\theta)} = 1$ by using a uniform prior (if we know that $\theta \in [a, b]$ for some $a, b \in \mathbb{R}$).

The choice of $g(\theta' \mid \theta)$ is often referred to as a (transition) kernel when talking about different Metropolis-Hastings methods.

Gibbs

Gibbs sampling or a Gibbs sampler is a special case of a Metropolis-Hastings algorithm. The point of a Gibbs sampler is that, given a multivariate distribution, it is simpler to sample from the conditional distributions than to marginalize by integrating over the joint distribution.

Suppose we want to sample $\mathbf{X} = \left( x_1, \dots, x_n \right)$ from a joint distribution $p(x_1, \dots, x_n)$. Denote the k-th sample by $\mathbf{X}^{(k)} = \left(x_1^{(k)}, \dots, x_n^{(k)}\right)$. We proceed as follows:

  1. Pick an initial value $\mathbf{X}^{(0)}$
  2. Sample $\mathbf{X}^{(k + 1)}$. For $j = 1 \dots n$:
    1. Sample $x_j^{(k + 1)} \sim p(x_j^{(k + 1)} \mid x_1^{(k + 1)}, \dots, x_{j - 1}^{(k + 1)}, x_{j + 1}^{(k)}, \dots, x_n^{(k)})$. If sampling from this conditional distribution is not tractable, then see Step 2 in the Metropolis-Hastings algorithm for how to sample.
  3. Repeat the above step until we have the number of samples we want.
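
A minimal sketch of this sequential scheme for a toy target where the full conditionals are available in closed form (a zero-mean bivariate Gaussian with unit variances and correlation $\rho$; the example is my own).

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8   # correlation of the target bivariate Gaussian

def gibbs_bivariate_normal(n_samples, x_init=(0.0, 0.0)):
    """Full conditionals for this target: x1 | x2 ~ N(rho * x2, 1 - rho^2)
    and x2 | x1 ~ N(rho * x1, 1 - rho^2)."""
    x1, x2 = x_init
    out = np.empty((n_samples, 2))
    for k in range(n_samples):
        x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho**2))   # sample x1 | x2
        x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho**2))   # sample x2 | x1
        out[k] = (x1, x2)
    return out

samples = gibbs_bivariate_normal(50_000)
print(np.corrcoef(samples.T)[0, 1])   # ≈ 0.8
```
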
Proof of convergence

Suppose we want to estimate $p(\mathbf{x})$ for a state $\mathbf{x} \in \mathcal{X}$.

Let

  • $\mathcal{X}_{-i}$ denote the values which $(X_1, \dots, X_{i - 1}, X_{i + 1}, \dots, X_n)$ can take on, i.e. $\mathbf{x}_{-i}$ is the state $\mathbf{x}$ with the i-th component left out.
  • $q(i)$ denote some probability mass function over the indices $i = 1, \dots, n$ (with non-zero mass everywhere)

Assuming $x_i' \ne x_i$, the Gibbs procedure gives us

\begin{equation*}
\begin{split}
  p(\mathbf{x}' \mid \mathbf{x}) p(\mathbf{x}) &= q(i) p(x_i' \mid \mathbf{x}_{-i}) p(\mathbf{x}) \\
  &\overset{(1)}{=} q(i) p(\mathbf{x}) \frac{p(x_i', \mathbf{x}_{-i})}{p(\mathbf{x}_{-i})} \\
  &\overset{(2)}{=} q(i) p(x_i, \mathbf{x}_{-i}') \frac{p(x_i', \mathbf{x}_{-i}')}{p(\mathbf{x}_{-i}')} \\
  & \overset{(3)}{=} q(i) \frac{p(x_i, \mathbf{x}_{-i}')}{p(\mathbf{x}_{-i}')} p(x_i', \mathbf{x}_{-i}') \\
  & = q(i)  p(x_i \mid \mathbf{x}_{-i}') p(x_i', \mathbf{x}_{-i}')
\end{split}
\end{equation*}

where in

  • $(1)$ we used

    \begin{equation*}
p(x_i' \mid \mathbf{x}_{-i}) = \frac{p(x_i', \mathbf{x}_{-i})}{p(\mathbf{x}_{-i})}
\end{equation*}
  • $(2)$ we used

    \begin{equation*}
p(\mathbf{x}) = p(x_i, \mathbf{x}_{-i})
\end{equation*}
  • $(3)$ we used the fact that $\mathbf{x}_{-i} = \mathbf{x}_{-i}'$

Therefore we have

\begin{equation*}
p(\mathbf{x}' \mid \mathbf{x}) p(\mathbf{x}) = q(i) p(x_i' \mid \mathbf{x}_{-i}) p(\mathbf{x}) = q(i) p(x_i \mid \mathbf{x}_{-i}') p(\mathbf{x}') = p(\mathbf{x} \mid \mathbf{x}') p(\mathbf{x}')
\end{equation*}

which implies the detailed balance equation is satisfied, hence $p(\mathbf{x})$ is the stationary distribution of the resulting Markov Chain.

Further, to ensure that this stationary distribution $p(\mathbf{x})$ actually exists, observe that we can reach any state $\mathbf{x}'$ from the state $\mathbf{x}$ in some number of steps, and that there are no "isolated" states, hence the chain constructed by the Gibbs transition creates an aperiodic and irreducible Markov Chain, which is guaranteed to converge to its stationary distribution $p(\mathbf{x})$.

In our definition of Gibbs sampling, we sample the different dimensions sequentially, but here we have not assumed how we choose which dimension to update, instead using some arbitrary distribution $q(i)$. Hence Gibbs sampling also works when not performing the updates sequentially, but the sequential version is the standard approach and usually what people refer to when talking about Gibbs sampling.

The notation $\pi(\mathbf{x})$ instead of $p(\mathbf{x})$ is often used when talking about Markov chains. In this case we're mainly interested in sampling from a distribution $p(\mathbf{x})$, therefore we instead let $p(\mathbf{x})$ denote the stationary distribution of the chain.

In a Markov Random Field (MRF) the conditional distribution of $X_i$ given all other variables depends only on its immediate neighbours, so it can be written

\begin{equation*}
p(x_i' \mid \mathbf{x}_{-i}) = p\big(x_i' \mid \mathbf{x}_{\mathcal{N}(X_i)}\big)
\end{equation*}

i.e. when Gibbs sampling we only need to touch the neighbours of $X_i$.
Convergence bound

If $\mathbf{P}$ is the transition matrix of the Gibbs chain, then the convergence rate of the periodic Gibbs sampler for the stationary distribution of an MRF is bounded by

\begin{equation*}
|\boldsymbol{\mu} \mathbf{P}^k - \boldsymbol{\pi}| \le \frac{1}{2} |\boldsymbol{\mu} - \boldsymbol{\pi}| \big(1 - e^{- N \Delta} \big)^k
\end{equation*}

where

  • $\Delta = \sup_{l \in V} \delta_l$
  • $\delta_l = \sup \left\{ |\mathcal{E}(\mathbf{x}) - \mathcal{E}(\mathbf{y})| \mid x_j = y_j, \forall j \in V, j \ne l \right\}$
  • $\mu$ is the arbitrary initial distribution
  • $\frac{1}{2} | \boldsymbol{\mu} - \boldsymbol{\pi}|$ is the distance in variation

    \begin{equation*}
d_V(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2} |\boldsymbol{\alpha} - \boldsymbol{\beta}| = \frac{1}{2} \sum_{x \in \Omega}^{} | \alpha(x) - \beta(x)|
\end{equation*}

    for distributions $\alpha$ and $\beta$ on a finite state space $\Omega$

Slice sampling

TODO Elliptical Slice sampling

Particle MCMC (PMCMC)

TODO Sequential Monte-Carlo (SMC)
TODO Particle Gibbs (PG)
TODO Particle Gibbs with Backward Sampling (PGBS)
TODO Particle Gibbs with Ancestor Sampling (PGAS)

TODO Embedded MCMC (EMCMC)

Specific case: Hidden Markov-Models (HMMs)

Follows shestopaloff16_sampl_laten_states_high_dimen.

The following assumes a state-space model where

  • $x_1 \sim p(x)$ is the initial latent state
  • $p(x_i \mid x_{i - 1})$ is the transition probability
  • $p(y_i \mid x_i)$ is the emission/observation probability
  • $i = 1, \dots, n$ indexes the time-steps in the SSM
  • Current sequence is $x = (x_1, \dots, x_n)$
  • Update $x$ to $x'$ as follows:
    1. For each time $i = 1, \dots, n$, generate a set of $L$ "pool states", denoted by

      \begin{equation*}
\mathcal{P}_i := \left\{ x_i^{[1]}, \dots, x_i^{[L]} \right\}
\end{equation*}
      • Pool states are sampled independently across different times $i$
    2. Choose $l_i \in \left\{ 1, \dots, L \right\}$ uniformly at random and let

      \begin{equation*}
x_i^{[l_i]} := x_i
\end{equation*}
    3. Sample the remaining $L - 1$ pool states $x_i^{[1]}, \dots, x_i^{[l_i - 1]}, x_i^{[l_i + 1]}, \dots, x_i^{[L]}$ using a Markov chain that leaves the pool density $\kappa_i$ invariant:
      1. Let $R_i(x' \mid x)$ be the transition prob of this Markov chain, with $\tilde{R}_i(x \mid x')$ the transition for this chain reversed, i.e.

        \begin{equation*}
\tilde{R}_i(x \mid x') = R_i(x' \mid x) \frac{\kappa_i(x)}{\kappa_i(x')}
\end{equation*}

        s.t.

        \begin{equation*}
\kappa_i(x) R_i(x' \mid x) = \kappa_i(x') \tilde{R}_i(x \mid x')
\end{equation*}

        for all $x$ and $x'$.

      2. For $k = l_i - 1, \dots, 1$:
        • Generate $x_i^{[k]}$ by sampling according to the reverse transition kernel $\tilde{R}_i\big(x_i^{[k]} \mid x_i^{[k + 1]}\big)$
      3. For $k = l_i + 1, \dots, L$:
        • Generate $x_i^{[k]}$ by sampling according to the forward transition kernel $R_i\big(x_i^{[k + 1]} \mid x_i^{[k]}\big)$
    4. Generate new sequence $x'$ using stochastic backwards pass:
      1. For each time $i = 1, \dots, n$, we then compute the forward probabilities $\alpha_i(x)$, with $x$ taking values in $\mathcal{P}_i$:
        1. if $i = 1$:

          \begin{equation*}
\alpha_1(x) = \frac{p(x) p(y_1 \mid x)}{\kappa_1(x)}
\end{equation*}
        2. else:

          \begin{equation*}
\alpha_i(x) = \frac{p(y_i \mid x)}{\kappa_i(x)} \sum_{l=1}^{L} p\big(x \mid x_{i - 1}^{[l]} \big) \alpha_{i - 1} \big( x_{i - 1}^{[l]} \big)
\end{equation*}
      2. Sample the new state sequence $x'$ using a stochastic backwards pass:
        1. Sample $x_n'$ from $\mathcal{P}_n$ (pool states at time $n$) with probabilities prop. to $\alpha_n(x)$
        2. For $k = n - 1, \dots, 1$:
          • Sample $x_k'$ from $\mathcal{P}_k$ (pool states at time $k$) with probabilities prop. to $\alpha_{k}(x) p(x_{k + 1}' \mid x)$

Alternatively, we can replace step 4 above by instead computing the backward probabilities:

  1. For $i = n, \dots, 1$, compute backward probabilities $\beta_i(x)$ with $x \in \mathcal{P}_i$:
    1. if $i = n$:

      \begin{equation*}
\beta_i(x) = 1
\end{equation*}
    2. else:

      \begin{equation*}
\beta_i(x) := \frac{1}{\kappa_i(x)} \sum_{l=1}^{L} p \big( y_{i + 1} \mid x_{i + 1}^{[l]} \big) p \big( x_{i + 1}^{[l]} \mid x \big) \beta_{i + 1} \big( x_{i + 1}^{[l]} \big)
\end{equation*}
  2. Sample new state sequence $x'$ using a stochastic forward pass:
    1. Sample $x_1'$ from $\mathcal{P}_1$ with prob. prop. to $\beta_1(x) p(x) p(y_1 \mid x)$
    2. For $i = 2, \dots, n$:
      • Sample $x_i'$ from $\mathcal{P}_i$ with prob. prop. to $\beta_i(x) p(x \mid x_{i - 1}') p(y_i \mid x)$
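
A rough sketch of the first variant (step 4 above): compute the forward probabilities $\alpha_i$ over the pool states and draw a new sequence with the stochastic backwards pass. The pool states are assumed to have been generated already, and the density functions `p_init`, `p_trans`, `p_emit`, `kappa` are placeholders of my own (with `p_trans(x_next, x_prev)` meaning $p(x_{\text{next}} \mid x_{\text{prev}})$ and `rng` a `numpy.random.Generator`).

```python
import numpy as np

def sample_sequence_from_pools(pools, y, p_init, p_trans, p_emit, kappa, rng):
    """pools[i] holds the L pool states for time i; y[i] is the observation.
    Returns a new sequence x' drawn via forward probabilities over the pools
    followed by a stochastic backwards pass."""
    n, L = len(pools), len(pools[0])
    alpha = np.empty((n, L))
    for i in range(n):
        for l, x in enumerate(pools[i]):
            if i == 0:
                alpha[0, l] = p_init(x) * p_emit(y[0], x) / kappa(0, x)
            else:
                trans = sum(p_trans(x, xp) * alpha[i - 1, k]
                            for k, xp in enumerate(pools[i - 1]))
                alpha[i, l] = p_emit(y[i], x) * trans / kappa(i, x)
        alpha[i] /= alpha[i].sum()   # only relative values matter; avoids underflow

    x_new = [None] * n
    idx = rng.choice(L, p=alpha[-1])             # x_n' proportional to alpha_n
    x_new[-1] = pools[-1][idx]
    for i in range(n - 2, -1, -1):               # backwards over times n-1, ..., 1
        probs = np.array([alpha[i, l] * p_trans(x_new[i + 1], pools[i][l])
                          for l in range(L)])
        idx = rng.choice(L, p=probs / probs.sum())
        x_new[i] = pools[i][idx]
    return x_new
```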

TODO Unification of PMCMC and EMCMC andrieu2010particle

Notation
  • $\left\{ x_t \right\}_{t \ge 1}$ is an $\mathcal{X} \text{-valued}$ latent Markov process satisfying

    \begin{equation*}
\begin{split}
  x_1 &\sim \mu_{\theta}(\cdot) \\
  x_t \mid (x_{t - 1} = x) &\sim f_{\theta}(\cdot \mid x), \quad \forall t \ge 2
\end{split}
\end{equation*}
  • $\left\{ y_t \right\}_{t \ge 1}$ is a sequence of $\mathcal{Y} \text{-valued}$ observations which are conditionally independent given $\left\{ x_t \right\}_{t \ge 1}$ and satisfy

    \begin{equation*}
y_t \mid (x_1, \dots, x_t = x, x_{t + 1}, \dots) \sim g_{\theta}(\cdot \mid x), \quad \forall t \ge 1
\end{equation*}

    or, with a slight abuse of notation,

    \begin{equation*}
p(y_t \mid x_1, x_2, \dots) = p(y_t \mid x_t)
\end{equation*}
  • $\theta \in \Theta$ denotes the parameters of the model
Particle Independent Metropolis-Hastings (PIMH)
  • Independent Metropolis-Hastings (IMH) refers to the fact that the proposal distribution is independent of the current state
  • Proof idea:
    1. Check that $\frac{\tilde{\pi}^N}{q^N} = \frac{\hat{Z}^N}{Z}$ when $q^N > 0$
    2. Deduce that acceptance ratio of an IMH update with target $\tilde{\pi}^N$ and proposal density $q^N$ takes the form

      \begin{equation*}
1 \wedge \frac{\hat{Z}^{N, '}}{\hat{Z}^N}
\end{equation*}

      where $\hat{Z}^{N, '}$ denotes the estimate of the normalization constant computed for the proposed state

  • Proof
    • If we assume (1), then for an IMH update the acceptance probability

\begin{equation*}
1 \wedge \frac{p(x') q(x \mid x')}{p(x) q(x' \mid x)}
\end{equation*}

      reduces, since the proposal is independent of the current state (i.e. $q(x' \mid x) = q(x')$), to

\begin{equation*}
1 \wedge \frac{p(x') q(x)}{p(x) q(x')}
\end{equation*}

      and substituting the target $\tilde{\pi}^N$ and proposal $q^N$ into this ratio gives $1 \wedge \frac{\hat{Z}^{N, '}}{\hat{Z}^N}$ by (1).
  • Construct a kernel which leaves the extended target distribution invariant, which is defined in such a way that we also leave the actual target density invariant. The normalization constant of the target is
\begin{equation*}
Z := \int \pi(x_{1:P}) \dd{x_{1:P}}
\end{equation*}

and the particle estimate

\begin{equation*}
\hat{Z}^N := \prod_{n=1}^{P} \bigg( \frac{1}{N} \sum_{k=1}^{N} w_n \big(X_{1:n}^k \big) \bigg)
\end{equation*}

Note that

\begin{equation*}
\pi(x_{1:P}) = p(x_1) \prod_{n=1}^{P - 1} p(x_{n + 1} \mid x_{n})
\end{equation*}
TODO Particle marginal Metropolis-Hastings (PMMH)
  • Extended target distribution $\tilde{\pi}(\theta, b_{1:T}, \mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1}^{-b_2:T})$
  • Proposal:

    \begin{equation*}
q(\theta, \theta') \times \underbrace{\Psi_{\theta'}(\mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1})}_{\text{law of PF}} \times \underbrace{w_{\theta', T}^{b_T}}_{\text{path selection}}
\end{equation*}

    where

    • $\Psi_{\theta}(\mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1})$ is the law induced by bootstrap/resampling PF:

      \begin{equation*}
\Psi_{\theta}(\mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1}) := \prod_{i=1}^{N} \mu_{\theta}(x_1^i) \prod_{t = 2}^{T} \prod_{i=1}^{N} w_{\theta, t - 1}^{a_{t - 1}^i} f_{\theta} \big( x_t^i \mid x_{t - 1}^{a_{t - 1}^i} \big)
\end{equation*}

      where

      • $\mu_{\theta}$ is the prior for $x_1^i$
      • $f\big(x_t^i \mid x_{t - 1}^{a_{t - 1}^i} \big)$ is the probability of the i-th particle taking on the value $x_t^i$ at time-step $t$ given that its parent (indexed by $a_{t - 1}^i$) has taken on the value $x_{t - 1}^{a_{t - 1}^i}$
      • $w_{\theta, t - 1}^{a_{t - 1}^i}$ is the probability of having chosen $a_{t - 1}^i$ as the parent of the i-th particle at time-step $t$
  • Resulting MH acceptance step is

    \begin{equation*}
1 \wedge \frac{\hat{p}_{\theta'}(y_{1:T})p(\theta')}{\hat{p}_{\theta}(y_{1:T}) p(\theta)} \frac{q(\theta', \theta)}{q(\theta, \theta')}
\end{equation*}

    where

    \begin{equation*}
\hat{p}_{\theta}(y_{1:T}) := \prod_{t = 1}^{T} \bigg[ \frac{1}{N} \sum_{i=1}^{N} g_{\theta}\big(y_t \mid x_t^i\big) \bigg]
\end{equation*}

    is an unbiased estimate of $p_{\theta}(y_{1:T})$ (a small sketch of this estimator is given right after this list)

  • The fact that it's correct can be shown by dividing the extended target by the probability of the auxiliary variables:

    \begin{equation*}
\frac{\tilde{\pi}\big( \theta, b_{1:T}, \mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1}^{- b_2:T} \big)}{\Psi_{\theta}(\mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1}) w_{\theta, T}^{b_T}} = p(\theta, x_{1:T}^{b_{1:T}} \mid y_{1:T})
\end{equation*}

    since

    \begin{equation*}
\tilde{\pi} \big( \theta, b_{1:T}, \mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1}^{-b_{2:T}} \big) = p\big(\theta, x_{1:T}^{b_{1:T}}, \mathbf{x}_{1:T}^{-b_{1:T}}, \mathbf{a}_{1:T - 1}^{-b_{2:T}} \big)
\end{equation*}

    and

    \begin{equation*}
\Psi_{\theta}\big(\mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1} \big) w_{\theta, T}^{b_T} = p(\theta, \mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1}) \times p(\theta, b_T)
\end{equation*}
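
As referenced in the list above, here is a minimal bootstrap particle filter sketch producing the estimate $\hat{p}_{\theta}(y_{1:T})$ that enters the PMMH acceptance ratio. The toy linear-Gaussian SSM and the function names are my own; the parameters $\theta$ are baked into the user-supplied functions, and it is $\hat{p}_{\theta}$ itself (not its logarithm) that is unbiased.

```python
import numpy as np

def bootstrap_pf_loglik(y, sample_x1, sample_trans, log_g, N, rng):
    """Returns the log of the particle estimate of p(y_{1:T}): at every
    time-step the average unnormalized weight (1/N) sum_i g(y_t | x_t^i)
    contributes one factor of the product."""
    T = len(y)
    x = sample_x1(N, rng)                                  # x_1^i ~ mu_theta
    log_Z_hat = 0.0
    for t in range(T):
        log_w = log_g(y[t], x)                             # log g(y_t | x_t^i)
        m = log_w.max()
        log_Z_hat += m + np.log(np.mean(np.exp(log_w - m)))
        w = np.exp(log_w - m)
        ancestors = rng.choice(N, size=N, p=w / w.sum())   # a_t^i (resampling)
        if t < T - 1:
            x = sample_trans(x[ancestors], rng)            # x_{t+1}^i ~ f(. | parent)
    return log_Z_hat

# Toy linear-Gaussian SSM: x_t = 0.9 x_{t-1} + N(0, 1), y_t = x_t + N(0, 1).
rng = np.random.default_rng(0)
sample_x1 = lambda N, rng: rng.normal(0.0, 1.0, size=N)
sample_trans = lambda x, rng: 0.9 * x + rng.normal(0.0, 1.0, size=x.shape)
log_g = lambda y_t, x: -0.5 * (y_t - x)**2 - 0.5 * np.log(2 * np.pi)
y = rng.normal(size=20)                                    # some fake observations
print(bootstrap_pf_loglik(y, sample_x1, sample_trans, log_g, N=500, rng=rng))
```
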
PMCMC methods
  • Notation
    • $N \ge 2$ denotes the number of particles
    • $x_{1:T}^{b_{1:T}}$ represents the latent-state trajectory with "ancestor path" $(b_1, \dots, b_T)$:

      \begin{equation*}
x_{1:T}^{b_{1:T}} = (x_1^{b_1}, \dots, x_{T}^{b_T})
\end{equation*}

      and thus $b_{1:T}$ is the indices of the reference path.

    • Particles are denoted by bold-face:

      \begin{equation*}
\mathbf{x}_{t} = \left\{ x_t^1, \dots, x_t^N \right\} \in \mathcal{X}^N
\end{equation*}
    • Ancestor particles

      \begin{equation*}
\mathbf{a}_{t - 1}^{-b_t} = \mathbf{a}_{t - 1} \setminus a_{t - 1}^{b_t}
\end{equation*}

      and

      \begin{equation*}
\mathbf{a}_{1:T - 1}^{-b_2:T} := \left\{ \mathbf{a}_1^{-b_2}, \dots, \mathbf{a}_{T - 1}^{-b_T} \right\}
\end{equation*}
    • Extended target distribution on $\Theta \times \mathcal{X}^{N T} \times \left\{ 1, \dots, N \right\}^{N (T - 1) + 1}$:

      \begin{equation*}
\tilde{\pi} \big( \theta, b_{1:T}, \mathbf{x}_{1:T}, \mathbf{a}_{1:T - 1}^{- b_2:T} \big) := \frac{1}{N^T} \times \underbrace{\pi \big( \theta, x_{1:T}^{b_{1:T}} \big)}_{\text{target}} \times \underbrace{\phi_{\theta} \big( \mathbf{x}_{1:T}^{-b_{1:T}}, \mathbf{a}_{1:T - 1}^{-b_{2:T}} \mid x_{1:T}^{b_{1:T}}, b_{1:T} \big)}_{\text{law of conditional PF}}
\end{equation*}
    • $\Theta$ denotes the parameter domain
    • $\mathcal{X}^{N T}$ is the domain of the hidden states in the state-space model for each of the $N$ particles
    • $\left\{ 1, \dots, N \right\}^{N (T - 1) + 1}$ represents all the ancestor indices we can possibly generate, since $\mathbf{a}_{0}^{b_1} = \emptyset$ for any choice of $b_1$, i.e. the first node (the root) of a path does not have an ancestor.
    • In our case of considering SSMs, we have $\pi(\theta, x_{1:T}) := p(x_{1:T}, \theta \mid y_{1:T})$
    • Conditional particle filter (CPF):

      \begin{equation*}
\phi_{\theta} \big( \mathbf{x}_{1:T}^{- b_{1:T}}, \mathbf{a}_{1:T - 1}^{-b_{2:T}} \mid x_{1:T}^{b_{1:T}}, b_{1:T} \big) := \prod_{i=1, i \ne b_1}^{N} \mu_{\theta}(x_1^i) \prod_{t=2}^{T} \prod_{i=1, i \ne b_t}^{N} w_{\theta, t - 1}^{a_{t - 1}^i} f_{\theta} \big( x_t^i \mid x_{t - 1}^{a_{t - 1}^i} \big)
\end{equation*}

      with

      \begin{equation*}
w_{\theta, t}^i := \frac{g_{\theta}(y_t \mid x_t^i)}{\sum_{j=1}^{N} g_{\theta}(y_t \mid x_t^j)}
\end{equation*}

      denoting the normalized weight associated with the i-th particle at time $t$

  • Extended target distribution

    To prove correctness we take the following approach:

    1. Find a kernel which leaves the extended target distribution $\tilde{\pi}$ invariant
    2. Show that the PGAS kernel on the original space is a collapsed version of this extended kernel
      • Showing that a Gibbs kernel is a collapsed Gibbs kernel can be done by considering a kernel on the extended space where some of the variables of the extended space are treated as auxiliary variables
    3. For $\tilde{\pi}$ we're really just interested in the chosen path $x_{1:T}^{b_{1:T}}$; everything else is redundant information.
    • Note that $\tilde{\pi}$ is basically the product of the following distributions:
      • From $\mathbf{x}_{1:T}$ we pick the trajectory with indices $b_{1:T}$
      • We consider this jointly with the probability of generating all these other trajectories
      • Inputs:
        1. Parameters $\theta$
        2. Candidate path $b_{1:T}$
      • Note that the ancestor indices $a_{1:T - 1}^{b_{2:T}}$ alone do not define a path! They define a path through an already sampled set of $N$ particles $\mathbf{x}_{1:T}$. Hence we need both the particles $\mathbf{x}_{1:T}$ and the ancestor indices $a_{1:T - 1}^{b_{2:T}}$ to define a path
  • Proof of correctness

Finding a starting point

Burn-In

  • Throw away some iterations at the beginning of an MCMC run

Issues

Dependent samples

The samples of a Markov Chain are in general dependent, since we're conditioning on the current state when sampling the proposal state.

Partition function estimation

Notation

  • $\tilde{p}(\mathbf{x} ; \boldsymbol{\theta})$ is an unnormalized probability distribution
  • Probability distribution

    \begin{equation*}
p( \mathbf{x}; \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \tilde{p}(\mathbf{x}; \boldsymbol{\theta})
\end{equation*}

    with

    \begin{equation*}
Z(\boldsymbol{\theta}) = \int \tilde{p}(\mathbf{x}; \boldsymbol{\theta}) \dd{\mathbf{x}} \quad \text{or} \quad Z(\boldsymbol{\theta}) = \sum_{\mathbf{x}}^{} \tilde{p}(\mathbf{x}; \boldsymbol{\theta})
\end{equation*}

Overview

Log-likelihood is given by

\begin{equation*}
\mathcal{L} = \sum_{i=1}^{n} \log p(\mathbf{x}_i ; \boldsymbol{\theta}) = \sum_{i=1}^{n} \log \tilde{p}(\mathbf{x}_i; \boldsymbol{\theta}) - n \log Z(\boldsymbol{\theta})
\end{equation*}

Gradient of $\mathcal{L}$ is then

\begin{equation*}
\frac{\partial \mathcal{L}}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\partial }{\partial \theta_j} \log \tilde{p}(\mathbf{x}_i; \boldsymbol{\theta}) - n \frac{\partial }{\partial \theta_j} \log Z(\boldsymbol{\theta})
\end{equation*}

These two terms are often referred to as:

  1. positive phase
  2. negative phase

The second term involves $\frac{\partial }{\partial \theta_j} \log Z(\boldsymbol{\theta})$, which can be written:

\begin{equation*}
\frac{\partial }{\partial \theta_j} \log Z(\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \frac{\partial Z(\boldsymbol{\theta})}{\partial \theta_j} = \frac{1}{Z(\boldsymbol{\theta})} \frac{\partial }{\partial \theta_j} \int_{\mathbf{x}} \tilde{p}(\mathbf{x}; \boldsymbol{\theta}) \dd{\mathbf{x}}
\end{equation*}

Interchanging integration and differentiation, which we can do under suitable conditions:

\begin{equation*}
\begin{split}
\frac{\partial }{\partial \theta_j} \log Z(\boldsymbol{\theta}) &= \frac{1}{Z(\boldsymbol{\theta})} \int_{\mathbf{x}}^{} \frac{\partial \tilde{p}(\mathbf{x}; \boldsymbol{\theta})}{\partial \theta_j} \dd{\mathbf{x}} \\
&= \frac{1}{Z(\boldsymbol{\theta})} \int_{\mathbf{x}}^{} \tilde{p}(\mathbf{x}; \boldsymbol{\theta}) \frac{\partial \log \tilde{p}(\mathbf{x}; \boldsymbol{\theta})}{\partial \theta_j} \dd{\mathbf{x}}
\end{split}
\end{equation*}

Hence,

\begin{equation*}
\frac{\partial \mathcal{L}}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\partial \log \tilde{p}(\mathbf{x}_i; \boldsymbol{\theta})}{\partial \theta_j} - n \, \frac{\int \tilde{p}(\mathbf{x}; \boldsymbol{\theta}) \frac{\partial \log \tilde{p}(\mathbf{x} ; \boldsymbol{\theta})}{\partial \theta_j} \dd{\mathbf{x}}}{\int \tilde{p}(\mathbf{x}; \boldsymbol{\theta}) \dd{\mathbf{x}}}
\end{equation*}

Since $Z(\boldsymbol{\theta}) = \int \tilde{p}(\mathbf{x}; \boldsymbol{\theta}) \dd{\mathbf{x}}$ is just the normalization constant, we observe that the ratio in the above expression can be written as an expectation under $p$:

\begin{equation*}
\begin{split}
  \frac{\int \tilde{p}(\mathbf{x}; \boldsymbol{\theta}) \frac{\partial \log \tilde{p}(\mathbf{x} ; \boldsymbol{\theta})}{\partial \theta_j} \dd{\mathbf{x}}}{\int \tilde{p}(\mathbf{x}; \boldsymbol{\theta}) \dd{\mathbf{x}}} 
&= \int \frac{\tilde{p}(\mathbf{x}; \boldsymbol{\theta})}{Z(\boldsymbol{\theta})} \frac{\partial \log \tilde{p}(\mathbf{x} ; \boldsymbol{\theta})}{\partial \theta_j} \dd{\mathbf{x}} \\
&= \mathbb{E}_p \bigg[ \frac{\partial \log \tilde{p}(\mathbf{x}; \boldsymbol{\theta})}{\partial \theta_j} \bigg]
\end{split}
\end{equation*}

That is,

\begin{equation*}
\frac{\partial \log Z(\boldsymbol{\theta})}{\partial \theta_j} = \mathbb{E}_p \bigg[ \frac{\partial \log \tilde{p}(\mathbf{x}; \boldsymbol{\theta})}{\partial \theta_j} \bigg]
\end{equation*}

And, employing MCMC algorithms, we can estimate this expectation as:

\begin{equation*}
\mathbb{E}_p \bigg[ \frac{\partial \log \tilde{p}(\mathbf{x}; \boldsymbol{\theta})}{\partial \theta_j} \bigg] \approx \frac{1}{K} \sum_{k=1}^{K} \frac{\partial \log \tilde{p}(\mathbf{x}_k; \boldsymbol{\theta})}{\partial \theta_j}, \quad \mathbf{x}_k \sim p(\ \cdot \ ; \boldsymbol{\theta})
\end{equation*}

i.e. $\mathbf{x}_k$ are $K$ MCMC samples from $p$, obtained using only the unnormalized $\tilde{p}$.

Finally, this gives us a more tractable gradient estimate of $\mathcal{L}$:

\begin{equation*}
\frac{\partial \mathcal{L}}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\partial \log \tilde{p}(\mathbf{x}_i ; \boldsymbol{\theta})}{\partial \theta_j} - \frac{n}{K} \sum_{k=1}^{K} \frac{\partial \log \tilde{p}(\mathbf{x}_k; \boldsymbol{\theta})}{\partial \theta_j}
\end{equation*}

The exact same argument holds for discrete random variables too, simply exchanging the integrals for summations.
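
A small sketch of this positive/negative phase decomposition for a toy exponential-family model $\tilde{p}(x; \theta) = e^{-\theta x^2}$ (my own example), where $Z(\theta) = \sqrt{\pi / \theta}$ so that $\frac{\partial}{\partial \theta} \log Z(\theta) = -\frac{1}{2\theta}$ can be checked in closed form. Here $p(\cdot; \theta)$ happens to be a Gaussian we can sample exactly; in general the model samples would be (burned-in) MCMC samples obtained using only $\tilde{p}$.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.5
dlog_p_tilde = lambda x: -x**2      # d/dtheta log p_tilde(x; theta) = -x^2

# Negative phase: E_p[ d/dtheta log p_tilde ], estimated with model samples.
# For this model p(.; theta) = N(0, 1 / (2 theta)), so we can sample exactly.
K = 100_000
x_model = rng.normal(0.0, np.sqrt(1.0 / (2.0 * theta)), size=K)
negative_phase = np.mean(dlog_p_tilde(x_model))
print(negative_phase, -1.0 / (2.0 * theta))        # both ≈ -0.333

# Gradient of the log-likelihood for data x_1, ..., x_n:
x_data = rng.normal(0.0, 1.0, size=200)            # some fake data
grad = np.sum(dlog_p_tilde(x_data)) - len(x_data) * negative_phase
```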

Problems

  • Using standard MCMC methods, we would have to perform a burn-in before drawing those $K$ samples in the negative phase. This can be very expensive.

Simulated Annealing (SA)

Simulated annealing is a method of moving from a tractable distribution to a distribution of interest via a sequence of intermediate distributions, used as an inexact method of handling isolated or spurious modes in Markov chain samplers. Isolated modes are a problem for MCMC methods since we're making small changes between each step, hence moving from one isolated mode to another is "difficult".

It employs a sequence of distributions, with probability densities given by $p_0(x)$ to $p_n(x)$, in which each $p_j$ differs only slightly from $p_{j + 1}$. The distribution $p_0$ is the one of interest. The distribution $p_n$ is designed so that the Markov chain used to sample from it allows movement between all regions of the state space.

A traditional scheme is to set

\begin{equation*}
p_j(x) \propto p_0(x)^{\beta_j}, \quad 1 = \beta_0 > \beta_1 > \cdots > \beta_n
\end{equation*}

or the geometric average between $p_0$ and $p_n$:

\begin{equation*}
p_j(x) \propto p_0(x)^{\beta_j} \, p_n(x)^{1 - \beta_j}
\end{equation*}
\end{equation*}

An annealing run is started from some initial state, from which we first simulate a Markov chain designed to converge to $p_n$. Next we simulate some number of iterations of a Markov chain designed to converge to $p_{n - 1}$, starting from the final state of the simulation for $p_n$. Continue in this fashion, using the final state of the simulation for $p_j$ as the initial state of the simulation for $p_{j - 1}$, until we finally simulate the chain designed to converge to $p_0$.
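
A minimal sketch of such an annealing run (the two-mode toy target is my own), using the scheme $p_j(x) \propto p_0(x)^{\beta_j}$ and a few Metropolis steps at each level, starting from the flattest distribution and ending at $p_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p_0: two well-separated modes at ±10 (unnormalized, in log-space).
log_p0 = lambda x: np.logaddexp(-0.5 * ((x - 10.0) / 0.5)**2,
                                -0.5 * ((x + 10.0) / 0.5)**2)

def metropolis_steps(log_p, x, n_steps, step, rng):
    for _ in range(n_steps):
        x_prop = x + step * rng.normal()
        if np.log(rng.uniform()) <= log_p(x_prop) - log_p(x):
            x = x_prop
    return x

# Annealing run: start from p_n (beta small, nearly flat) and move towards
# p_0 (beta = 1), using the final state of each level as the initial state
# of the next.
betas = np.linspace(0.01, 1.0, 50)     # beta_n, ..., beta_0
x = rng.normal(0.0, 20.0)              # initial state
for beta in betas:
    x = metropolis_steps(lambda z: beta * log_p0(z), x, n_steps=50, step=2.0, rng=rng)
print(x)                               # typically ends near one of the modes at ±10
```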

Unfortunately, there is no reason to think that annealing will give the precisely correct result, in which each mode of $p_0$ is found with exactly the right probability.

This is of little consequence in an optimization context, where the final distribution is degenerate (at the maximum), but it is a serious flaw for the many applications in statistics and statistical physics that require a sample from a non-degenerate distribution. neal98_anneal_impor_sampl

TODO Example: SA applied to RBMs

Annealed Importance Sampling (AIS)

Notation

  • $p_{\text{ref}} = p_0, p_1, \dots, p_N = p_{\text{target}}$ is a set of Gibbs distributions defined:

    \begin{equation*}
\begin{split}
  p_i: \quad & \Omega \to \mathbb{R} \\
  & p_i(x) = \frac{1}{Z_i} e^{- E_i(x)} = \frac{1}{Z_i} p_i^*(x)
\end{split}
\end{equation*}
  • $\Omega^* = \Omega^2 \times \Theta$, which we call the extended state space
  • The use of square brackets, e.g. $\mathcal{P}[\cdot]$, denotes a distribution over $\Omega^*$
  • We define

    \begin{equation*}
\mathcal{W}[\mathbf{x}] = \ln \frac{\mathcal{P}[\mathbf{y}, x_N \mid x_0] p_0^*(x_0)}{\tilde{\mathcal{P}}[\mathbf{y}, x_0 \mid x_N] p_N^*(x_N)}
\end{equation*}
  • $\mathcal{P}$ (forward distribution) creates samples from $p_N$ given a sample $x_0$ from $p_0$:

    \begin{equation*}
\mathcal{P}[\mathbf{x}] = \mathcal{P}[\mathbf{y}, x_N \mid x_0] p_0(x_0)
\end{equation*}
  • $\tilde{\mathcal{P}}$ (reverse distribution) creates samples from $p_0$ given a sample $x_N$ from $p_N$:

    \begin{equation*}
\tilde{\mathcal{P}}[\mathbf{x}] = \tilde{\mathcal{P}}[\mathbf{y}, x_0 \mid x_N] p_N(x_N)
\end{equation*}

Derivation

Assume we use a set of Gibbs distributions $p_0, p_1, \dots, p_N$ to construct a pair of distributions $\mathcal{P}$ and $\tilde{\mathcal{P}}$ on $\Omega^*$ with

\begin{equation*}
\begin{split}
  \mathcal{P}[\mathbf{x}] &= \mathcal{P}[\mathbf{y}, x_N \mid x_0] p_0(x_0) \\
  \tilde{\mathcal{P}}[\mathbf{x}] &= \tilde{\mathcal{P}}[\mathbf{y}, x_0 \mid x_N] p_N(x_N)
\end{split}
\end{equation*}
  • $\mathcal{P}$ (forward distribution) creates samples from $p_N$ given a sample $x_0$ from $p_0$
  • $\tilde{\mathcal{P}}$ (reverse distribution) creates samples from $p_0$ given a sample $x_N$ from $p_N$

Then

\begin{equation*}
\frac{\tilde{\mathcal{P}}[\mathbf{x}]}{\mathcal{P}[\mathbf{x}]} = \frac{Z_0 \tilde{\mathcal{P}}[\mathbf{y}, x_0 \mid x_N] p_N^*(x_N)}{Z_N \mathcal{P}[\mathbf{y}, x_N \mid x_0] p_0^*(x_0)} = \frac{Z_0}{Z_N} e^{-\mathcal{W}[\mathbf{x}]}
\end{equation*}

where $\mathcal{W}[\mathbf{x}] = \ln \frac{\mathcal{P}[\mathbf{y}, x_N \mid x_0] p_0^*(x_0)}{\tilde{\mathcal{P}}[\mathbf{y}, x_0 \mid x_N] p_N^*(x_N)}$ as defined above. Then, for a function $f: \Omega^* \to \mathbb{R}$, we have

\begin{equation*}
\tilde{\mathcal{P}}[\mathbf{x}] f(\mathbf{x}) = \mathcal{P}[\mathbf{x}] \frac{\tilde{\mathcal{P}}[\mathbf{x}]}{\mathcal{P}[\mathbf{x}]} f(\mathbf{x}) = \frac{Z_0}{Z_N} \mathcal{P}[\mathbf{x}] e^{- \mathcal{W}[\mathbf{x}]} f(\mathbf{x})
\end{equation*}

thus the expectation of $f$

\begin{equation*}
\left\langle f \right\rangle_{\tilde{\mathcal{P}}} = \frac{Z_0}{Z_N} \left\langle f e^{- \mathcal{W}} \right\rangle_{\mathcal{P}}
\end{equation*}

Then, letting $f = 1$, i.e. the constant function, we obtain the following.

(Generalized) Annealed Importance Sampling (AIS) is a method of estimating the partition function using the relation

\begin{equation*}
\frac{Z_N}{Z_0} = \left\langle e^{- \mathcal{W}} \right\rangle_{\mathcal{P}}
\end{equation*}

where $Z_0$ is the partition function of the known reference distribution $p_0$.

The "standard" AIS method is obtained by a specific choice of $\mathcal{P}$.
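
A minimal sketch of that standard choice (my own toy example, with two Gaussians so that $Z_N / Z_0$ is known exactly): draw $x_0 \sim p_0$, then for each intermediate level accumulate the work $\mathcal{W}$ and apply a Metropolis transition leaving the next distribution invariant, and finally average $e^{-\mathcal{W}}$ over independent runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference p_0 = N(0, 5^2) (easy to sample, Z_0 known) and "target" p_N = N(3, 0.5^2)
# treated as unnormalized, so the true ratio is Z_N / Z_0 = 0.5 / 5 = 0.1.
E0 = lambda x: 0.5 * (x / 5.0)**2              # p_0^*(x) = exp(-E0(x))
EN = lambda x: 0.5 * ((x - 3.0) / 0.5)**2      # p_N^*(x) = exp(-EN(x))
E = lambda x, beta: (1.0 - beta) * E0(x) + beta * EN(x)   # intermediate energies

betas = np.linspace(0.0, 1.0, 200)             # beta_0 = 0, ..., beta_N = 1
n_runs = 2_000
log_w = np.zeros(n_runs)                       # will hold -W for each run

for r in range(n_runs):
    x = rng.normal(0.0, 5.0)                   # x_0 ~ p_0
    for i in range(len(betas) - 1):
        # Accumulate the work: W += E_{i+1}(x_i) - E_i(x_i).
        log_w[r] -= E(x, betas[i + 1]) - E(x, betas[i])
        # One Metropolis step leaving p_{i+1} invariant.
        x_prop = x + rng.normal(0.0, 1.0)
        if np.log(rng.uniform()) <= E(x, betas[i + 1]) - E(x_prop, betas[i + 1]):
            x = x_prop

# Z_N / Z_0 = < exp(-W) >_P, averaged in a numerically stable way.
m = log_w.max()
print(np.exp(m) * np.mean(np.exp(log_w - m)))   # ≈ 0.1
```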

Bennett's Acceptance Ratio (BAR) method

Bibliography

  • [mackay2003information] MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press (2003).
  • [neal00_slice_sampl] Neal, Slice Sampling, CoRR, (2000). link.
  • [del2004feynman] Del Moral, Feynman-Kac Formulae, 47-93, Springer (2004).
  • [chopin2020introduction] Chopin & Papaspiliopoulos, An Introduction to Sequential Monte Carlo (2020).
  • [finke2015extended] Finke, On extended state-space constructions for Monte Carlo methods, PhD thesis (2015).
  • [shestopaloff16_sampl_laten_states_high_dimen] Shestopaloff & Neal, Sampling Latent States for High-Dimensional Non-Linear State Space Models With the Embedded Hmm Method, CoRR, (2016). link.
  • [andrieu2010particle] Andrieu, Doucet & Holenstein, Particle markov chain monte carlo methods, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3), 269-342 (2010).
  • [neal98_anneal_impor_sampl] Neal, Annealed Importance Sampling, CoRR, (1998). link.