# Notes on: Tosh, C. (): Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines

## 1 Notation

- \(Q(x, y) = \text{Pr}(X_t = y \mid X_{t - 1} = x)\) is the transition probability of a Markov chain; \(Q^t(x, y) = \text{Pr}(X_t = y \mid X_0 = x)\) is the \(t\)-step transition probability.

The **total variation distance** is

\begin{equation*} ||\mu - \nu||_{\rm{TV}} = \frac{1}{2} \sum_{\omega \in \Omega}^{} | \mu(\omega) - \nu(\omega) | \end{equation*}where \(\Omega\) is the state space.
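For a finite state space this is a direct sum; a minimal sketch (the two distributions are made up for illustration):

```python
# Total variation distance between two distributions on a finite state space,
# represented as dicts mapping state -> probability.
def tv_distance(mu, nu):
    """0.5 * sum over all states of |mu(w) - nu(w)|."""
    states = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(w, 0.0) - nu.get(w, 0.0)) for w in states)

mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.25, "b": 0.25, "c": 0.5}
print(tv_distance(mu, nu))  # 0.3
```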

The mixing rate of Markov Chain \(Q\) with stationary distribution \(\pi\) is the function

\begin{equation*} \tau(\varepsilon) = \min \left\{ t : \max_{x \in \Omega} || Q^t(x, \cdot) - \pi||_{\rm{TV}} < \varepsilon \right\} \end{equation*}

- \(\tau_{\text{mix}} = \tau \big( 1 / 4 \big)\) is the convention used in this paper
- \(X_{t + 1/2}\) denotes the intermediate step after sampling \(h_j\)
- \(d_v\) and \(d_h\) are distances on the visible units \(v\) and the hidden units \(h\), respectively
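The mixing-rate definition can be checked numerically on a toy chain by powering the transition matrix; a sketch (the two-state chain below is a made-up illustration, not from the paper):

```python
# Toy two-state chain: stay in state 0 w.p. 0.9, stay in state 1 w.p. 0.8.
Q = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2/3, 1/3]  # stationary distribution: pi Q = pi

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def tau(eps, Q, pi, t_max=1000):
    """Smallest t such that max_x ||Q^t(x, .) - pi||_TV < eps."""
    Qt = [[1.0, 0.0], [0.0, 1.0]]  # Q^0 = identity
    for t in range(1, t_max + 1):
        Qt = mat_mul(Qt, Q)
        worst = max(0.5 * sum(abs(Qt[i][j] - pi[j]) for j in range(len(pi)))
                    for i in range(len(pi)))
        if worst < eps:
            return t
    return None

print(tau(1/4, Q, pi))  # tau_mix = 3 for this chain
```

Here the second eigenvalue of \(Q\) is \(0.7\), so the worst-case TV distance decays like \(0.7^t\) and drops below \(1/4\) at \(t = 3\).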

## 2 Upper bounds for the Gibbs sampler

Let

- \(a\in \mathbb{R}^n\)
- \(b \in \mathbb{R}^m\)
- \(W \in \mathbb{R}^{n \times m}\)

be parameters of an RBM, s.t.

\begin{equation*} ||W||_1 \ ||W^T||_1 < 4 \end{equation*}
Then the *Gibbs sampler* mixes rapidly (Theorem 2.1).
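Assuming \(||\cdot||_1\) denotes the induced (operator) 1-norm, i.e. the maximum absolute column sum, the condition is easy to check numerically; a sketch with a made-up weight matrix:

```python
def one_norm(M):
    """Operator 1-norm of a matrix: maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in M) for j in range(len(M[0])))

def transpose(M):
    return [list(col) for col in zip(*M)]

# Made-up RBM weights (n = 3 visible units, m = 2 hidden units).
W = [[ 0.3, -0.2],
     [ 0.1,  0.4],
     [-0.2,  0.1]]

product = one_norm(W) * one_norm(transpose(W))
print(product, product < 4)  # roughly 0.35, True -> the condition holds
```

Note that \(||W^T||_1\) is the maximum absolute *row* sum of \(W\), so the product bounds the interaction strength in both sampling directions.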

A **Markovian coupling** of a Markov chain \(Z_t\) over probability space \(\Omega\) with transition "matrix" \(Q\) is a Markov chain \(\big( X_t, Y_t \big)\) over \(\Omega \times \Omega\) whose transitions satisfy

\begin{equation*} \text{Pr}(X_{t + 1} = x' \mid X_t = x, Y_t = y) = Q(x, x'), \qquad \text{Pr}(Y_{t + 1} = y' \mid X_t = x, Y_t = y) = Q(y, y') \end{equation*}

i.e. each marginal chain is a faithful copy of \(Z_t\).

To relate *couplings of Markov chains* to *mixing time*, we have the following lemma.

**Lemma 2.2.** Let \((X_t, Y_t)\) be a Markovian coupling for a Markov chain \(Z_t\) s.t. there exists a function \(\tau_{\text{couple}} : (0, 1) \to \mathbb{N}\) satisfying

\begin{equation*} \text{Pr}(X_{\tau_{\text{couple}}(\varepsilon)} \ne Y_{\tau_{\text{couple}}(\varepsilon)} \mid X_0 = x, Y_0 = y) \le \varepsilon, \quad \forall x, y \in \Omega, \quad \forall \varepsilon > 0 \end{equation*}

Then the mixing rate of \(Z_t\) satisfies

\begin{equation*} \tau_{\text{mix}} \le \tau_{\text{couple}}(1 / 4) \end{equation*}

Lemma 2.2 says that if the coupled chains have met with probability at least \(1 - \varepsilon\) by time \(\tau_{\text{couple}}(\varepsilon)\), then the chain has mixed by that time: the coupling time upper-bounds the mixing time.

Put differently: after enough steps, the two chains \(X_t\) and \(Y_t\) have coalesced with arbitrarily high probability.
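A coupling can be made concrete by driving two copies of a chain with shared randomness and counting the steps until they meet. Below, a sketch for a made-up two-state chain (stay in state 0 w.p. 0.9, in state 1 w.p. 0.8); this toy construction is mine, not the paper's RBM coupling:

```python
import random

def step(x, u):
    """Deterministic update from a shared uniform draw u; with u ~ Unif(0,1)
    each chain individually has the correct transition probabilities."""
    if x == 0:
        return 0 if u < 0.9 else 1
    return 1 if u < 0.8 else 0

def coupling_time(x0, y0, seed=0):
    """Steps until the two coupled chains meet (0 if they start together)."""
    rng = random.Random(seed)
    x, y, t = x0, y0, 0
    while x != y:
        u = rng.random()        # the same u drives both chains
        x, y = step(x, u), step(y, u)
        t += 1
    return t

times = [coupling_time(0, 1, seed=s) for s in range(2000)]
print(sum(times) / len(times))  # roughly 10: the per-step meeting probability is 0.1
```

From unequal states the chains meet exactly when \(u \in [0.8, 0.9)\), so the coupling time is geometric with mean 10; by Lemma 2.2 this bounds the mixing time of the underlying chain.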

**Lemma 2.3.** Let \(a, b, W\) be as in Theorem 2.1.

Then there exists a Markovian coupling \((X_t, Y_t)\) of the Gibbs sampler such that

\begin{equation*} \begin{split} \mathbb{E} \Big[ d_h \big( X_{t + 1 / 2}, Y_{t + 1/2} \big) \mid X_t, Y_t \Big] &\le \frac{1}{2} ||W^T||_1 d_v(X_t, Y_t) \\ \mathbb{E} \Big[ d_v \big( X_{t + 1}, Y_{t + 1} \big) \mid X_t, Y_t \Big] &\le \frac{1}{2} ||W||_1 d_h \big( X_{t + 1/2}, Y_{t + 1 / 2} \big) \end{split} \end{equation*}

Chaining the two inequalities (via the tower property of conditional expectation) gives

\begin{equation*} \mathbb{E} \Big[ d_v \big( X_{t + 1}, Y_{t + 1} \big) \mid X_t, Y_t \Big] \le \frac{1}{4} ||W||_1 \ ||W^T||_1 \ d_v(X_t, Y_t) \end{equation*}

so under the assumption \(||W||_1 \ ||W^T||_1 < 4\) the expected distance between the coupled chains contracts geometrically, and Lemma 2.2 then yields the mixing-time bound of Theorem 2.1.

## 3 Changes to work with GRBMs

- [ ] Distance metric: cannot use Hamming distance (probably use \(\ell_2\))
- [ ] Problems arising if we use different metrics on the visibles and hiddens?
- [ ] Define a *valid coupling* for the continuous case to prove Lemma 2.3
    - Step through one transition
    - Only necessary to alter the definition for the visibles \(V_j\)
    - Need to satisfy the marginalization constraint for the less probable chain (e.g. \(\text{Pr}\big(X_{t + 1}(v_j)\big)\))
- [ ] Prove the inequality of Lemma 2.3 for the Gaussian case for the visibles \(V_j\)
    - Can re-use the proof of the inequality for the coupled probabilities of the hiddens, \(p_X^{(\mu)}\) and \(p_Y^{(\mu)}\)
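For the continuous visibles, one standard ingredient is a maximal (reflection) coupling of two Gaussians with the same variance: sample from the first, accept as the second with probability \(\min(1, p_Y(x)/p_X(x))\), otherwise reflect about the midpoint. This construction is my sketch of a candidate, not taken from the paper:

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def coupled_gaussians(mu_x, mu_y, sigma, rng):
    """Maximal (reflection) coupling of N(mu_x, sigma^2) and N(mu_y, sigma^2).
    Marginally X ~ N(mu_x, sigma^2) and Y ~ N(mu_y, sigma^2), and
    Pr(X == Y) = 1 - TV between the two Gaussians."""
    x = rng.gauss(mu_x, sigma)
    if rng.random() <= gauss_pdf(x, mu_y, sigma) / gauss_pdf(x, mu_x, sigma):
        return x, x                    # the chains coalesce at this coordinate
    return x, mu_x + mu_y - x          # otherwise reflect about the midpoint

rng = random.Random(0)
pairs = [coupled_gaussians(0.0, 0.5, 1.0, rng) for _ in range(5000)]
meet = sum(x == y for x, y in pairs) / len(pairs)
print(meet)  # close to 1 - TV = 2 * Phi(-0.25), i.e. about 0.80
```

The reflection step preserves the second marginal because the rejected density of the first Gaussian, mirrored about the midpoint \((\mu_x + \mu_y)/2\), is exactly the excess density of the second; the meeting probability is maximal, which is what a coupling argument for the \(V_j\) update would want.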