Notes on: Tosh, C. (): Mixing rates for the Gibbs sampler over restricted Boltzmann machines

Table of Contents

1 Notation

  • \(Q(x, y) = \text{Pr}(X_t = y \mid X_{t - 1} = x)\) is the (time-homogeneous) transition probability of a Markov chain; \(Q^t(x, y) = \text{Pr}(X_t = y \mid X_0 = x)\) is its \(t\)-step version
  • Total variation distance

    \begin{equation*} ||\mu - \nu||_{\rm{TV}} = \frac{1}{2} \sum_{\omega \in \Omega}^{} | \mu(\omega) - \nu(\omega) | \end{equation*}

    where \(\Omega\) is the finite state space; see the sketch after this list.

  • The mixing rate of Markov Chain \(Q\) with stationary distribution \(\pi\) is the function

    \begin{equation*} \tau(\varepsilon) = \min \left\{ t : \max_{x \in \Omega} || Q^t(x, \cdot) - \pi||_{\rm{TV}} < \varepsilon \right\} \end{equation*}
  • \(\tau_{\text{mix}} = \tau \big( 1 / 4 \big)\) is used in this paper
  • \(X_{t + 1/2}\) denotes the intermediate state after resampling the hidden units \(h_1, \dots, h_m\)
  • \(d_v\) and \(d_h\) are distances on the visible and hidden configurations, respectively
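
As a concrete reference for these definitions, here is a minimal Python sketch (my own toy example, not from the paper) that computes the total variation distance on a finite state space and brute-forces \(\tau(\varepsilon)\) for a made-up two-state chain; the names tv_distance and mixing_time are mine.

    import numpy as np

    def tv_distance(mu, nu):
        """Total variation distance: half the l1 distance between two pmfs."""
        return 0.5 * np.abs(mu - nu).sum()

    def mixing_time(Q, pi, eps=0.25, t_max=10_000):
        """Smallest t with max_x ||Q^t(x, .) - pi||_TV < eps (brute force)."""
        Qt = np.eye(len(pi))                      # Q^0
        for t in range(1, t_max + 1):
            Qt = Qt @ Q                           # Q^t
            if max(tv_distance(Qt[x], pi) for x in range(len(pi))) < eps:
                return t
        return None

    # Made-up two-state chain; pi is its stationary distribution (pi @ Q == pi).
    Q = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    pi = np.array([2 / 3, 1 / 3])
    print(mixing_time(Q, pi))                     # tau_mix = tau(1/4)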

2 Upper bounds for the Gibbs sampler

Let

  • \(a\in \mathbb{R}^n\)
  • \(b \in \mathbb{R}^m\)
  • \(W \in \mathbb{R}^{n \times m}\)

be parameters of an RBM, s.t.

\begin{equation*} ||W||_1 \ ||W^T||_1 < 4 \end{equation*}

Then the Gibbs sampler satisfies

\begin{equation*} \tau_{\text{mix}} \le 1 + \frac{\ln 4n}{\ln 4 - \ln \big( ||W||_1 \ ||W^T||_1 \big)} \end{equation*}
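
For concreteness, a minimal sketch of the block Gibbs sweep this bound refers to, assuming the standard binary-RBM conditionals \(\text{Pr}(h_j = 1 \mid v) = \sigma(b_j + (W^T v)_j)\) and \(\text{Pr}(v_i = 1 \mid h) = \sigma(a_i + (W h)_i)\); the parameter values are placeholders, and reading \(||\cdot||_1\) as the induced matrix 1-norm is my assumption about the paper's convention.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_sweep(v, a, b, W, rng):
        """One transition X_t -> X_{t+1/2} -> X_{t+1}: resample all hidden
        units given the visibles, then all visibles given the new hiddens."""
        h = (rng.random(b.shape) < sigmoid(b + W.T @ v)).astype(float)  # X_{t+1/2}
        v = (rng.random(a.shape) < sigmoid(a + W @ h)).astype(float)    # X_{t+1}
        return v, h

    rng = np.random.default_rng(0)
    n, m = 3, 2                                   # placeholder sizes
    a, b = np.zeros(n), np.zeros(m)
    W = 0.3 * rng.standard_normal((n, m))

    # Theorem 2.1 bound, reading ||.||_1 as the induced 1-norm (assumption).
    prod = np.linalg.norm(W, 1) * np.linalg.norm(W.T, 1)
    if prod < 4:
        print("tau_mix <=", 1 + np.log(4 * n) / (np.log(4) - np.log(prod)))

    v = rng.integers(0, 2, size=n).astype(float)
    for _ in range(10):
        v, h = gibbs_sweep(v, a, b, W, rng)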

A Markovian coupling of a Markov Chain \(Z_t\) over probability space \(\Omega\) with transition "matrix" \(Q\) is a Markov chain \(\big( X_t, Y_t \big)\) over \(\Omega \times \Omega\) whose transitions satisfy

\begin{equation*} \begin{split} \text{Pr} \big( X_{t + 1} = x' \mid X_t = x, Y_t = y \big) &= Q(x, x') \\ \text{Pr} \big( Y_{t + 1} = y' \mid X_t = x, Y_t = y \big) &= Q(y, y') \end{split} \end{equation*}
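
As a toy illustration (mine, not the paper's), a synchronous coupling of two copies of a two-state chain: both copies consume the same uniform variate, so each one marginally transitions according to \(Q\), and once they agree they stay together.

    import numpy as np

    Q = np.array([[0.9, 0.1],     # made-up two-state transition matrix
                  [0.2, 0.8]])

    def coupled_step(x, y, rng):
        """One step of a Markovian coupling: the shared uniform u guarantees
        that each chain, viewed on its own, follows Q."""
        u = rng.random()
        x_next = 0 if u < Q[x, 0] else 1
        y_next = 0 if u < Q[y, 0] else 1
        return x_next, y_next

    rng = np.random.default_rng(0)
    x, y, t = 0, 1, 0
    while x != y:                 # coupling time: first t with X_t = Y_t
        x, y = coupled_step(x, y, rng)
        t += 1
    print(t)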

To relate couplings of Markov chains to mixing time, we have the following lemma.

Let \((X_t, Y_t)\) be a Markovian coupling for a Markov chain \(Z_t\) s.t. there exists a function \(\tau_{\text{couple}} : (0, 1) \to \mathbb{N}\) satisfying

\begin{equation*} \text{Pr}(X_{\tau_{\text{couple}}(\varepsilon)} \ne Y_{\tau_{\text{couple}}(\varepsilon)} \mid X_0 = x, Y_0 = y) \le \varepsilon, \quad \forall x, y \in \Omega, \quad \forall \varepsilon > 0 \end{equation*}

Then the mixing time of \(Z_t\) satisfies

\begin{equation*} \tau_{\text{mix}} \le \tau_{\text{couple}}(1 / 4) \end{equation*}

Lemma 2.2 says that if the coupled chains have coalesced (\(X_t = Y_t\)) with probability at least \(1 - \varepsilon\) by time \(\tau_{\text{couple}}(\varepsilon)\), from every pair of starting states, then the coupling time at \(\varepsilon = 1/4\) bounds the mixing time.

Put differently: the two chains \(X_t\) and \(Y_t\) meet with arbitrarily high probability, and how quickly they do so controls how quickly the sampler mixes.
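
The step linking coalescence to mixing is the standard coupling inequality: if the second chain is started from the stationary distribution, so that \(Y_t \sim \pi\) for every \(t\), then

\begin{equation*} || Q^t(x, \cdot) - \pi ||_{\rm{TV}} \le \text{Pr} \big( X_t \ne Y_t \mid X_0 = x \big) \end{equation*}

so driving the right-hand side below \(1/4\) by time \(\tau_{\text{couple}}(1/4)\) immediately gives \(\tau_{\text{mix}} \le \tau_{\text{couple}}(1/4)\).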

Let \(a, b, W\) be as in Theorem 2.1.

Then there exists a Markovian coupling \((X_t, Y_t)\) of the Gibbs sampler such that

\begin{equation*} \begin{split} \mathbb{E} \Big[ d_h \big( X_{t + 1 / 2}, Y_{t + 1/2} \big) \mid X_t, Y_t \Big] &\le \frac{1}{2} ||W^T||_1 d_v(X_t, Y_t) \\ \mathbb{E} \Big[ d_v \big( X_{t + 1}, Y_{t + 1} \big) \mid X_t, Y_t \Big] &\le \frac{1}{2} ||W||_1 d_h \big( X_{t + 1/2}, Y_{t + 1 / 2} \big) \end{split} \end{equation*}
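
As a sanity check (my reconstruction, not quoted from the paper), chaining the two inequalities of Lemma 2.3 and taking expectations gives a per-sweep contraction on the visibles:

\begin{equation*} \mathbb{E} \Big[ d_v \big( X_{t + 1}, Y_{t + 1} \big) \mid X_t, Y_t \Big] \le \frac{1}{4} ||W||_1 \ ||W^T||_1 \ d_v \big( X_t, Y_t \big) \end{equation*}

With \(d_v\) the Hamming distance on the \(n\) visible units, \(d_v(X_0, Y_0) \le n\), so iterating and applying Markov's inequality yields

\begin{equation*} \text{Pr} \big( X_t \ne Y_t \big) \le \mathbb{E} \big[ d_v(X_t, Y_t) \big] \le n \left( \frac{||W||_1 \ ||W^T||_1}{4} \right)^t \end{equation*}

which drops below \(1/4\) once \(t \ge \ln (4n) / \big( \ln 4 - \ln ( ||W||_1 \ ||W^T||_1 ) \big)\); Lemma 2.2 then gives exactly the bound of Theorem 2.1.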

3 Changes to work with GRBMs

  • [ ] Distance metric: cannot use Hamming distance (probably use \(\ell_2\))
  • [ ] Problems arising if we use different metrics on the visibles and hiddens?
  • [ ] Define valid coupling for continuous case to prove Lemma 2.3 (see the coupling sketch after this list)
    • Step through one transition
    • Only necessary to alter definition for visibles \(V_j\)
      • Need to satisfy marginalization over least probable chain (e.g. \(\text{Pr}\big(X_{t + 1}(v_j)\big)\))
  • [ ] Prove inequality for Lemma 2.3 for Gaussian case for visibles \(V_j\)
    • Can re-use proof of inequality for coupled probabilities of hiddens \(p_X^{(\mu)}\) and \(p_Y^{(\mu)}\)
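
For the continuous-visible items above, one candidate per-unit construction is a maximal coupling of the two Gaussian conditionals. The sketch below is my own illustration (not from the paper), assuming univariate Gaussian conditionals with a shared variance; it makes the two draws coincide with probability \(1 - ||\cdot||_{\rm{TV}}\) between the two conditionals, which is the natural continuous analogue of the per-unit coupling used for binary visibles.

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def maximal_coupling_gaussian(mu_x, mu_y, sigma, rng):
        """Draw (X, Y) with X ~ N(mu_x, sigma^2) and Y ~ N(mu_y, sigma^2) such
        that Pr(X = Y) = 1 - TV distance between the two Gaussians."""
        x = rng.normal(mu_x, sigma)
        # Keep Y = X with probability min(1, q(x) / p(x)).
        if rng.random() * gauss_pdf(x, mu_x, sigma) <= gauss_pdf(x, mu_y, sigma):
            return x, x
        # Otherwise draw Y from the residual part of q by rejection.
        while True:
            y = rng.normal(mu_y, sigma)
            if rng.random() * gauss_pdf(y, mu_y, sigma) > gauss_pdf(y, mu_x, sigma):
                return x, y

    rng = np.random.default_rng(0)
    pairs = [maximal_coupling_gaussian(0.0, 0.5, 1.0, rng) for _ in range(10_000)]
    print(np.mean([x == y for x, y in pairs]))    # ~0.80 for these placeholder means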