# Notes on: Tosh, C. (): Mixing rates for the gibbs sampler over restricted boltzmann machines

## 1 Notation

• $$Q(x, y) = Q^t(x, y) = \text{Pr}(X_t = y \mid X_{t - 1} = x)$$ is the transition probability of a Markov Chain
• Total variance

\begin{equation*} ||\mu - \nu||_{\rm{TV}} = \frac{1}{2} \sum_{\omega \in \Omega}^{} | \mu(\omega) - \nu(\omega) | \end{equation*}

where $$\Omega$$ is the measure space.

• The mixing rate of Markov Chain $$Q$$ with stationary distribution $$\pi$$ is the function

\begin{equation*} \tau(\varepsilon) = \min \left\{ t : \max_{x \in \Omega} || Q^t(x, \cdot) - \pi||_{\rm{TV}} < \varepsilon \right\} \end{equation*}
• $$\tau_{\text{mix}} = \tau \big( 1 / 4 \big)$$ is used in this paper
• $$X_{t + 1/2}$$ denotes the intermediate step after sampling $$h_j$$
• $$d_v$$ and $$d_h$$ is some distance between visible units $$v$$ and $$h$$, respectively

## 2 Upper bounds for the Gibbs sampler

Let

• $$a\in \mathbb{R}^n$$
• $$b \in \mathbb{R}^m$$
• $$W \in \mathbb{R}^{n \times m}$$

be parameters of an RBM, s.t.

\begin{equation*} ||W||_1 \ ||W^T||_1 < 4 \end{equation*}

Then the Gibbs sampler satisfies

\begin{equation*} \tau_{\text{mix}} \le 1 + \frac{\ln 4n}{\ln 4 - \ln \big( ||W||_1 \ ||W^T||_1 \big)} \end{equation*}

A Markovian coupling of a Markov Chain $$Z_t$$ over probability space $$\Omega$$ with transition "matrix" $$Q$$ is a Markov chain $$\big( X_t, Y_t \big)$$ over $$\Omega \times \Omega$$ whose transitions satisfy

\begin{equation*} \begin{split} \text{Pr} \big( X_{t + 1} = x' \mid X_t = x, Y_t = y \big) &= Q(x, x') \\ \text{Pr} \big( Y_{t + 1} = y' \mid X_t = x, Y_t = y \big) &= Q(y, y') \end{split} \end{equation*}

To relate couplings of Markov chains to mixing time, we have the following lemma.

Let $$(X_t, Y_t)$$ be a Markovian coupling for a Markov chain $$Z_t$$ s.t. there exists a function $$\tau_{\text{couple}} : (0, 1) \to \mathbb{N}$$ satisfying

\begin{equation*} \text{Pr}(X_{\tau_{\text{couple}}(\varepsilon)} \ne Y_{\tau_{\text{couple}}(\varepsilon)} \mid X_0 = x, Y_0 = y) \le \varepsilon, \quad \forall x, y \in \Omega, \quad \forall \varepsilon > 0 \end{equation*}

In this case, there exists a mixing rate for $$Z_t$$ such that

\begin{equation*} \tau_{\text{mix}} \le \tau_{\text{couple}}(1 / 4) \end{equation*}

Lemma 2.2 is basically saying that there exists a mixing rate, dependent on $$\varepsilon > 0$$ such that the probability of $$X_t = Y_t$$ is very likely.

Or nicer: the two chains $$X_t$$ and $$Y_t$$ converges to an "arbitrary large probability".

Let $$a, b, W$$ be as in Theorem 2.1.

Then there exists a Markovian coupling $$(X_t, Y_t)$$ of the Gibbs sampler such that

\begin{equation*} \begin{split} \mathbb{E} \Big[ d_h \big( X_{t + 1 / 2}, Y_{t + 1/2} \big) \mid X_t, Y_t \Big] &\le \frac{1}{2} ||W^T||_1 d_v(X_t, Y_t) \\ \mathbb{E} \Big[ d_v \big( X_{t + 1}, Y_{t + 1} \big) \mid X_t, Y_t \Big] &\le \frac{1}{2} ||W||_1 d_h \big( X_{t + 1/2}, Y_{t + 1 / 2} \big) \end{split} \end{equation*}

## 3 Changes to work with GRBMs

• [ ] Distance metric: cannot use Hamming distance (probably use $$\ell_2$$)
• [ ] Problems arising if we use different metrics on the visibles and hiddens?
• [ ] Define valid coupling for continuous case to prove Lemma 2.3
• Step through one transition
• Only necessary to alter definition for visibles $$V_j$$
• Need to satisfy marginalization over least probably chain (e.g. $$\text{Pr}\big(X_{t + 1}(v_j)\big)$$)
• [ ] Prove inequality for Lemma 2.3 for Gaussian case for visibles $$V_j$$
• Can re-use proof of inequality for coupled probability of hiddens $$p_X^{(\mu)}$$ and $$p_y^{(\mu)}$$