RBM: Restricted Boltzmann Machines

Overview

  • One visible layer of observed binary units $\mathbf{x}$
  • One hidden layer of unobserved binary units $\mathbf{h}$
  • Connections between every pair of nodes in different layers; "restricted" means there are no hidden-to-hidden or visible-to-visible connections, only connections between all visible–hidden pairs.

    [Figure: rbm.png — the bipartite graph of an RBM: a layer of visible units fully connected to a layer of hidden units, with no intra-layer connections]

  • Models maximum-entropy distributions, and because of this, RBMs were originally motivated by Hinton as a "soft-constraint learner": the intention was that it would learn a distribution which satisfies the underlying constraints (see the Principle of Maximum Entropy) ackley_1985.

Notation

  • $\mathbf{v}$ and $v_j$ denote visible units
  • $\mathbf{h}$ and $h_{\mu}$ denote hidden units
  • $\mathbf{c}$ and $c_j$ will be used to denote the biases of the visible units
  • $\mathbf{b}$ and $b_{\mu}$ will be used to denote the biases of the hidden units
  • $\left\langle \cdot \right\rangle$ denotes the empirical expectation
  • The so-called free-energy:

    \begin{equation*}
F(\mathbf{v}) = - \log \sum_{\mathbf{h}}^{} \exp \big( - E(\mathbf{v}, \mathbf{h}) \big)
\end{equation*}
  • $\tilde{\mathbf{v}}^{(t)}$ and $\tilde{\mathbf{h}}^{(t)}$ denote the $t$-th sample from a sampling chain for the visible and hidden variables, respectively.
  • $\mathbf{h}_{-\alpha}$ denotes the vector $\mathbf{h}$ but excluding the entry at index $\alpha$:

    \begin{equation*}
\mathbf{h}_{- \alpha} = \big( h_1, \dots, h_{\alpha - 1}, h_{\alpha + 1}, \dots, h_{|\mathcal{H}|} \big)
\end{equation*}

Factor graph view

A Boltzmann Machine is an energy-based model consisting of a set of hidden units $\mathcal{H} = \{ H_{\mu} \}$ and a set of visible units $\mathcal{V} = \{ V_j \}$, where by "units" we mean random variables, taking on the values $\mathbf{h}$ and $\mathbf{v}$, respectively.

A Boltzmann Machine assumes the following joint probability distribution of the visible and hidden units:

\begin{equation*}
p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp \Big( - E(\mathbf{v}, \mathbf{h}) \Big)
\end{equation*}

with $Z$ being the partition function (normalization factor) and the energy function

\begin{equation*}
E(\mathbf{v}, \mathbf{h}) = - \mathbf{c}^T \mathbf{v} - \mathbf{b}^T \mathbf{h} - \mathbf{v}^T \mathbf{W} \mathbf{h} - \mathbf{v}^T \mathbf{\Sigma}_{\mathcal{V}} \mathbf{v} - \mathbf{h}^T \mathbf{\Sigma}_{\mathcal{H}} \mathbf{h}
\end{equation*}

where $\mathbf{\Sigma}_{\mathcal{V}}$ and $\mathbf{\Sigma}_{\mathcal{H}}$ denote the intra-layer coupling matrices (visible–visible and hidden–hidden interactions) for the visible and hidden units, respectively.

A Restricted Boltzmann Machine (RBM) is an energy-based model consisting of a set of hidden units $\mathcal{H} = \{ H_{\mu} \}$ and a set of visible units $\mathcal{V} = \{ V_j \}$, where by "units" we mean random variables, taking on the values $\mathbf{h}$ and $\mathbf{v}$, respectively.

The restricted part of the name comes from the fact that there are no intra-layer connections, i.e. the hidden units are conditionally independent given the visible units, and vice versa:

\begin{equation*}
\begin{split}
  p(h_{\mu} \mid \mathbf{v}, h_1, \dots, h_{\mu - 1}, h_{\mu + 1}, \dots, h_{|\mathcal{H}|}) &= p(h_{\mu} \mid \mathbf{v}) \\
  p(v_j \mid \mathbf{h}, v_1, \dots, v_{j - 1}, v_{j + 1}, \dots, v_{|\mathcal{V}|}) &= p(v_j \mid \mathbf{h})
\end{split}
\end{equation*}

An RBM assumes the following joint probability distribution of the visible and hidden units:

\begin{equation*}
p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp \Big( - E(\mathbf{v}, \mathbf{h}) \Big)
\end{equation*}

with $Z$ being the partition function (normalization factor) and the energy function

\begin{equation*}
\begin{split}
  E(\mathbf{v}, \mathbf{h}) &= - \mathbf{c}^T \mathbf{v} - \mathbf{b}^T \mathbf{h} - \mathbf{v}^T \mathbf{W} \mathbf{h} \\
  &= - c_j v_j - b_{\mu} h_{\mu} - v_j W_{j \mu} h_{\mu}
\end{split}
\end{equation*}

implicitly summing over repeating indices.
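
Since everything below is phrased in terms of this energy, it may help to see how cheap it is to evaluate. A minimal NumPy sketch (the array names v, h, c, b, W are assumptions matching the notation above):

  import numpy as np

  def energy(v, h, c, b, W):
      # E(v, h) = -c^T v - b^T h - v^T W h for one configuration
      return -(c @ v) - (b @ h) - (v @ W @ h)

  # tiny example: 3 visible units, 2 hidden units
  rng = np.random.default_rng(0)
  W = rng.normal(scale=0.01, size=(3, 2))
  c, b = np.zeros(3), np.zeros(2)
  v = np.array([1.0, 0.0, 1.0])
  h = np.array([0.0, 1.0])
  print(energy(v, h, c, b, W))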

Training

Log-likelihood using gradient ascent

Notation

  • Used in the derivation of $p(V_i = v_i \mid \mathbf{h})$

    \begin{equation*}
\gamma_i = c_i + \sigma_i \sum_{\mu=1}^{|\mathcal{H}|} W_{i \mu} h_{\mu}
\end{equation*}

Gradient of log-likelihood of energy-based models

An RBM assumes the following model

\begin{equation*}
p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp \Big( - E(\mathbf{v}, \mathbf{h}) \Big)
\end{equation*}

with

\begin{equation*}
\begin{split}
  E(\mathbf{v}, \mathbf{h}) &= - \mathbf{c}^T \mathbf{v} - \mathbf{b}^T \mathbf{h} - \mathbf{v}^T \mathbf{W} \mathbf{h} \\
  &= - c_j v_j - b_{\mu} h_{\mu} - v_j W_{j \mu} h_{\mu} \\
\end{split}
\end{equation*}

where we're implicitly summing over repeating indices. The minus-minus signs might seem redundant (and they are), but they are convention, owing to the Boltzmann distribution found in physics.

First we observe that

\begin{equation*}
p(\mathbf{v}) = \sum_{\mathbf{h}}^{} p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \sum_{\mathbf{h}}^{} \exp \Big( - E(\mathbf{v}, \mathbf{h}) \Big)
\end{equation*}

Taking the log of this, we have

\begin{equation*}
\log p(\mathbf{v}) = \log \bigg( \sum_{\mathbf{h}}^{} \exp \Big( - E(\mathbf{v}, \mathbf{h}) \Big) \bigg) - \log Z
\end{equation*}

Now suppose we're given a set of samples $\{ \mathbf{v}^{(n)}, n = 1, \dots, N \}$; the likelihood of the parameters (assuming i.i.d. samples) is then given by

\begin{equation*}
p\left(\big\{ \mathbf{v}^{(n)} \big\} \mid \big\{ c_j, b_{\mu}, W_{j \mu} \big\}\right) = \prod_{n = 1}^N p\left(\mathbf{v}^{(n)} \mid \big\{ c_j, b_{\mu}, W_{j \mu} \big\}\right)
\end{equation*}

Taking the log of this expression, and using the above expression for $p(\mathbf{v})$, we have

\begin{equation*}
\begin{split}
  \mathcal{L}(\big\{ c_j, b_{\mu}, W_{j \mu} \big\}) &= \sum_{n=1}^{N} \Bigg[ \log \bigg( \sum_{\mathbf{h}}^{} \exp \Big( - E \big( \mathbf{v}^{(n)}, \mathbf{h} \big) \Big) \bigg) - \log Z \Bigg] \\
  &= \sum_{n=1}^{N} \Bigg[ \log \bigg( \sum_{\mathbf{h}}^{} \exp \Big( - E \big( \mathbf{v}^{(n)}, \mathbf{h} \big) \Big) \bigg) \Bigg] - N \log Z
\end{split}
\end{equation*}

Let $\theta \in \big\{ c_j, b_{\mu}, W_{j \mu} \big\}$. Taking the partial derivative wrt. $\theta$ of the $n$-th term, we have

\begin{equation*}
\begin{split}
  \frac{\partial }{\partial \theta} \Bigg[ \log \bigg( \sum_{\mathbf{h}}^{} \exp \Big( - E \big( \mathbf{v}^{(n)}, \mathbf{h} \big) \Big) \bigg) - \log Z \Bigg] &= - \frac{\sum_{\mathbf{h}}^{} \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} p \Big( \mathbf{v}^{(n)}, \mathbf{h} \Big)}{\sum_{\mathbf{h}}^{} p \Big( \mathbf{v}^{(n)}, \mathbf{h} \Big)} - \frac{1}{Z} \frac{\partial Z}{\partial \theta}
\end{split}
\end{equation*}

The first term can be written as an expectation

\begin{equation*}
\frac{\sum_{\mathbf{h}}^{} \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} p \Big( \mathbf{v}^{(n)}, \mathbf{h} \Big)}{\sum_{\mathbf{h}}^{} p \Big( \mathbf{v}^{(n)}, \mathbf{h} \Big)} = \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \Bigg]
\end{equation*}

since the ratio on the LHS is exactly an expectation wrt. the conditional distribution $p(\mathbf{h} \mid \mathbf{v}^{(n)}) = p \big( \mathbf{v}^{(n)}, \mathbf{h} \big) / \sum_{\mathbf{h}} p \big( \mathbf{v}^{(n)}, \mathbf{h} \big)$. For the second term recall that $Z = \sum_{\mathbf{v}, \mathbf{h}}^{} \exp \Big( - E(\mathbf{v}, \mathbf{h}) \Big)$, therefore

\begin{equation*}
\frac{1}{Z} \frac{\partial Z}{\partial \theta} = - \frac{1}{Z} \sum_{\mathbf{v}, \mathbf{h}}^{} \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \exp \Big( - E(\mathbf{v}, \mathbf{h}) \Big) = - \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \Bigg]
\end{equation*}

Substituting it all back into the partial derivative of the log-likelihood, we get

\begin{equation*}
\frac{\partial \mathcal{L}}{\partial \theta} = - \sum_{n=1}^{N} \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \ \Bigg| \ \mathbf{v}^{(n)}\Bigg] + N \  \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \Bigg]
\end{equation*}

where the expectations are taken over the probability distribution defined by the model (the first one conditioned on $\mathbf{v}^{(n)}$).

Since maximizing $\mathcal{L}$ is equivalent to maximizing $\mathcal{L} / N$ we instead consider

\begin{equation*}
\frac{1}{N} \frac{\partial \mathcal{L}}{\partial \theta} = - \frac{1}{N} \sum_{n=1}^{N} \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \ \Bigg| \ \mathbf{v}^{(n)}\Bigg] + \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \Bigg]
\end{equation*}

Observe that the first term can be written as an empirical average over the data:

\begin{equation*}
\frac{1}{N} \sum_{n=1}^{N} \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \ \Bigg| \ \mathbf{v}^{(n)}\Bigg] = \left\langle \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \ \Bigg| \ \mathbf{v} \Bigg] \right\rangle_{\left\{ \mathbf{v}^{(n)} \right\}}
\end{equation*}

Giving us

\begin{equation*}
\frac{1}{N} \frac{\partial \mathcal{L}}{\partial \theta} = - \left\langle \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \ \Bigg| \ \mathbf{v} \Bigg] \right\rangle_{\left\{ \mathbf{v}^{(n)} \right\}} + \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \Bigg]
\end{equation*}

Clearly the second term is almost always intractable since we would have to sum over all possible $\mathbf{v}$ and $\mathbf{h}$, but the first term might also be intractable as we would have to marginalize over all the hidden states $\mathbf{h}$ with $\mathbf{v}$ fixed to $\mathbf{v}^{(n)}$.

When working with binary hidden variables, i.e. the $H_{\mu}$ are all assumed to be Bernoulli random variables, the conditional expectation is available in closed form, component-wise:

\begin{equation*}
\mathbb{E} [ h_{\mu} \mid \mathbf{v} ] = p(H_{\mu} = 1 \mid \mathbf{v}), \quad \mu = 1, \dots, |\mathcal{H}|
\end{equation*}

In general, when the hiddens are not necessarily Bernoulli, we would also like to sample the hidden states $\mathbf{h}$ to approximate the conditional expectation. It would make sense to sample some $M$ number of hidden states $\mathbf{h}$ for each observed $\mathbf{v}^{(n)}$, i.e.

\begin{equation*}
\left\langle \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \ \Bigg| \ \mathbf{v} \Bigg] \right\rangle_{\left\{ \mathbf{v}^{(n)} \right\}} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \ \Bigg| \ \mathbf{v}^{(n)}\Bigg] \approx \frac{1}{N} \sum_{n=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h}^{(m)})}{\partial \theta}
\end{equation*}

where $\mathbf{h}^{(m)} \sim p(\mathbf{h} \mid \mathbf{v}^{(n)})$. In fact, as mentioned on p. 45 of Fischer_2015, this method (with $M = 1$) is often used; in the Bernoulli case, however, taking the expectation exactly reduces the variance of our estimate compared to also sampling $\mathbf{h}$.

You will often see the notation

\begin{equation*}
\left\langle \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right\rangle_{\left\{ \mathbf{v}^{(n)} \right\}}
= \left\langle \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right\rangle_{\text{data}}
\end{equation*}

which we will use from now on. The $\{ \mathbf{v}^{(n)} \}$ was used to make it explicit that we're computing the expectation over the visible units, NOT a joint expectation obtained by also sampling $\mathbf{h}$ for each of the observed visible units.

For the second term we also use an empirical average as an approximation, but here the samples $\big( \tilde{\mathbf{v}}^{(s)}, \tilde{\mathbf{h}}^{(s)} \big)$ have to come from the model distribution (e.g. via MCMC), not from the data:

\begin{equation*}
\mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \Bigg] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\partial E(\tilde{\mathbf{v}}^{(s)}, \tilde{\mathbf{h}}^{(s)})}{\partial \theta}
\end{equation*}

Finally giving us a tractable approximation to the gradient of the log-likelihood:

\begin{equation*}
\frac{1}{N} \frac{\partial \mathcal{L}}{\partial \theta} \approx - \left\langle \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right\rangle_{\text{data}} + \left\langle \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right\rangle_{\text{model}}
\end{equation*}

Writing this out for each of the parameters, we have

\begin{equation*}
\begin{split}
  \frac{1}{N} \frac{\partial \mathcal{L}}{\partial c_j} &= \left\langle v_j \right\rangle_{\text{data}} - \left\langle v_j \right\rangle_{\text{model}} \\
  \frac{1}{N} \frac{\partial \mathcal{L}}{\partial b_{\mu}} &= \left\langle h_{\mu} \right\rangle_{\text{data}} - \left\langle h_{\mu} \right\rangle_{\text{model}} \\
  \frac{1}{N} \frac{\partial \mathcal{L}}{\partial W_{j \mu}} &= \left\langle v_j h_{\mu} \right\rangle_{\text{data}} - \left\langle v_j h_{\mu} \right\rangle_{\text{model}}
\end{split}
\end{equation*}

Observe that the signs have switched, which is simply due to the fact that $E(\mathbf{v}, \mathbf{h})$ depends negatively on each of the parameters $\{ c_j, b_{\mu}, W_{j \mu} \}$.

Now we can make use of any nice stochastic gradient ascent method.
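
For instance, once we have statistics approximating the two expectations, one ascent step is just a couple of array operations. A hedged sketch (all names are assumptions; rows of each array are samples, with the h-arrays holding sampled hidden states or, in the Bernoulli case, the conditional probabilities):

  import numpy as np

  def ascent_step(c, b, W, v_data, h_data, v_model, h_model, lr=0.1):
      # <.>_data and <.>_model are row-wise averages; assumes equally
      # many data and model samples so a single 1/N factor works
      N = len(v_data)
      c += lr * (v_data.mean(axis=0) - v_model.mean(axis=0))
      b += lr * (h_data.mean(axis=0) - h_model.mean(axis=0))
      W += lr * (v_data.T @ h_data - v_model.T @ h_model) / N
      return c, b, W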

Bernoulli RBM

In the case of a Bernoulli RBM, both the visible and hidden variables only take on the values $\{ 0, 1 \}$. Therefore,

\begin{equation*}
\begin{split}
  \mathbb{E}\Big[h_{\mu} \mid \mathbf{v}^{(n)} \Big] &= 1 \cdot p\Big(H_{\mu} = 1 \mid \mathbf{v}^{(n)} \Big) + 0 \cdot p\Big(H_{\mu} = 0 \mid \mathbf{v}^{(n)} \Big) \\
  &= p\Big(H_{\mu} = 1 \mid \mathbf{v}^{(n)} \Big)
\end{split}
\end{equation*}

Hence, for the hidden units, we have

\begin{equation*}
\left\langle h_{\mu} \right\rangle_{\text{data}} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E} \Big[ h_{\mu} \mid \mathbf{v}^{(n)} \Big] = \frac{1}{N} \sum_{n=1}^{N} p \Big( H_{\mu} = 1 \mid \mathbf{v}^{(n)} \Big)
\end{equation*}

and

\begin{equation*}
\left\langle h_{\mu} \right\rangle_{\text{model}} = \mathbb{E}[h_{\mu}] = \sum_{\mathbf{v}}^{} p \Big( H_{\mu} = 1 \mid \mathbf{v}\Big) p (\mathbf{v} )
\end{equation*}

For the visible units we have

\begin{equation*}
\left\langle v_j \right\rangle_{\text{data}} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\Big[ v_j \mid \mathbf{v}^{(n)} \Big] = \frac{1}{N} \sum_{n=1}^{N} v_j^{(n)}
\end{equation*}

and

\begin{equation*}
\left\langle v_j \right\rangle_{\text{model}} = \mathbb{E}[v_j] = \sum_{\mathbf{v}}^{} v_j \ p(\mathbf{v})
\end{equation*}

Thus, we end up with the gradients of the log-likelihood

\begin{equation*}
\begin{split}
  \frac{1}{N} \frac{\partial \mathcal{L}}{\partial c_j} &= \frac{1}{N} \sum_{n=1}^{N} v_j^{(n)} - \sum_{\mathbf{v}}^{} v_j \ p(\mathbf{v}) \\
  \frac{1}{N} \frac{\partial \mathcal{L}}{\partial b_{\mu}} &= \frac{1}{N} \sum_{n=1}^{N} p \Big( H_{\mu} = 1 \mid \mathbf{v}^{(n)} \Big) - \sum_{\mathbf{v}}^{} p \big( H_{\mu} = 1 \mid \mathbf{v} \big) p(\mathbf{v}) \\
  \frac{1}{N} \frac{\partial \mathcal{L}}{\partial W_{j \mu}} &= \frac{1}{N} \sum_{n=1}^{N} v_j^{(n)} \ p \Big( H_{\mu} = 1 \mid \mathbf{v}^{(n)} \Big) - \sum_{\mathbf{v}}^{} v_j \ p\Big( H_{\mu} = 1 \mid \mathbf{v} \Big) p(\mathbf{v})
\end{split}
\end{equation*}

In these expressions, the first terms are indeed tractable but the second terms still pose a problem. Therefore we turn to MCMC methods for estimating the expectations in the second terms.

A good method would be (block) Gibbs sampling: since the units within a layer are conditionally independent given the other layer, we can sample all visible or all hidden units simultaneously. To obtain unbiased estimates we would then have to run the sampler for a large number of steps. Unfortunately this is expensive, which is where Contrastive Divergence comes in.
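
One block-Gibbs sweep for the Bernoulli case, using the conditionals derived below, might look as follows (a sketch; `sigmoid` and the parameter names are assumptions, with `c` the visible and `b` the hidden biases as in the Notation section):

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def gibbs_sweep(v, c, b, W, rng):
      # sample every hidden unit at once given v, then every visible given h
      p_h = sigmoid(b + v @ W)           # p(H_mu = 1 | v) for all mu
      h = (rng.random(p_h.shape) < p_h).astype(float)
      p_v = sigmoid(c + h @ W.T)         # p(V_j = 1 | h) for all j
      v = (rng.random(p_v.shape) < p_v).astype(float)
      return v, h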

Expression for $p(H_{\alpha} = 1 \mid \mathbf{v})$

Let $\mathbf{h}_{-\alpha}$ denote the vector $\mathbf{h}$ but excluding the entry at index $\alpha$, then

\begin{equation*}
\begin{split}
  p(H_\alpha = 1 \mid \mathbf{v}) &= p(H_\alpha = 1 \mid \mathbf{v}, \mathbf{h}_{-\alpha}) \\
  &= \frac{p(H_{\alpha} = 1, \mathbf{v} \mid \mathbf{h}_{-\alpha})}{p(\mathbf{v} \mid \mathbf{h}_{-\alpha})} \\
  &= \frac{\textcolor{green}{p(H_{\alpha} = 1, \mathbf{v} \mid \mathbf{h}_{-\alpha})}}{\textcolor{blue}{p(H_{\alpha} = 1, \mathbf{v} \mid \mathbf{h}_{-\alpha}) + p(H_{\alpha} = 0, \mathbf{v} \mid \mathbf{h}_{-\alpha})}}
\end{split}
\end{equation*}

where in the last step we simply marginalize over $H_{\alpha}$.

The numerator will be the exponential of the following three terms added together

\begin{equation*}
b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha}, \qquad \sum_{j=1}^{|\mathcal{V}|} c_j v_j, \qquad \sum_{j=1}^{|\mathcal{V}|} \sum_{\mu = 1, \mu \ne \alpha}^{|\mathcal{H}|} v_j W_{j \mu} h_{\mu}
\end{equation*}

and the denominator will be

\begin{equation*}
\begin{cases}
  b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} & \text{if } H_{\alpha} = 1 \\
  0 & \text{if } H_{\alpha} = 0
\end{cases}, \qquad
\sum_{j=1}^{|\mathcal{V}|} c_j v_j, \qquad
\sum_{j=1}^{|\mathcal{V}|} \sum_{\mu = 1, \mu \ne \alpha}^{|\mathcal{H}|} v_j W_{j \mu} h_{\mu}
\end{equation*}

To be explicit, the denominator is

\begin{equation*}
\exp \Bigg( 0 + \sum_{j=1}^{|\mathcal{V}|} c_j v_j + \sum_{j=1}^{|\mathcal{V}|} \sum_{\mu = 1, \mu \ne \alpha}^{|\mathcal{H}|} v_j W_{j \mu} h_{\mu}\Bigg) +
\exp \Bigg( b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} + \sum_{j=1}^{|\mathcal{V}|} c_j v_j + \sum_{j=1}^{|\mathcal{V}|} \sum_{\mu = 1, \mu \ne \alpha}^{|\mathcal{H}|} v_j W_{j \mu} h_{\mu}\Bigg)
\end{equation*}

which we can write

\begin{equation*}
\Bigg( 1 + \exp \bigg( b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} \bigg) \Bigg) \exp \bigg( \sum_{j=1}^{|\mathcal{V}|} c_j v_j + \sum_{j=1}^{|\mathcal{V}|} \sum_{\mu = 1, \mu \ne \alpha}^{|\mathcal{H}|} v_j W_{j \mu} h_{\mu}\bigg)
\end{equation*}

Taking the logarithm (for readability's sake), we end up with

\begin{equation*}
\log \Bigg( 1 + \exp \bigg( b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} \bigg) \Bigg) 
+ \bigg( \sum_{j=1}^{|\mathcal{V}|} c_j v_j + \sum_{j=1}^{|\mathcal{V}|} \sum_{\mu = 1, \mu \ne \alpha}^{|\mathcal{H}|} v_j W_{j \mu} h_{\mu}\bigg)
\end{equation*}

Hence,

\begin{equation*}
\begin{split}
  \log p \big( H_{\alpha} = 1 \mid \mathbf{v} \big) =& \quad \textcolor{green}{\Bigg[ b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} + \sum_{j=1}^{|\mathcal{V}|} c_j v_j + \sum_{j=1}^{|\mathcal{V}|} \sum_{\mu = 1, \mu \ne \alpha}^{|\mathcal{H}|} v_j W_{j \mu} h_{\mu}\Bigg]} \\
  & - \textcolor{blue}{\Bigg[ \log \Bigg( 1 + \exp \bigg( b_{\alpha} 
  + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} \bigg) \Bigg) 
+ \bigg( \sum_{j=1}^{|\mathcal{V}|} c_j v_j + \sum_{j=1}^{|\mathcal{V}|} \sum_{\mu = 1, \mu \ne \alpha}^{|\mathcal{H}|} v_j W_{j \mu} h_{\mu}\bigg) \Bigg]}
\end{split}
\end{equation*}

Observe that the last two terms in these expressions cancel each other, thus

\begin{equation*}
  \log p \big( H_{\alpha} = 1 \mid \mathbf{v} \big) = \textcolor{green}{\Bigg[ b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} \Bigg]}
  -  \textcolor{blue}{\Bigg[ \log \Bigg( 1 + \exp \bigg( b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} \bigg) \Bigg) \Bigg]}
\end{equation*}

Finally, taking the exponential again,

\begin{equation*}
\begin{split}
  p \big( H_{\alpha} = 1 \mid \mathbf{v} \big) &= \frac{\textcolor{green}{\exp \Big( b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} \Big)}}{\textcolor{blue}{1 + \exp \Big( b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} \Big)}} \\
  &= \frac{1}{1 + \exp \Big( - b_{\alpha} - \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} \Big)} \\
  &= \text{sigmoid} \bigg( b_{\alpha} + \sum_{j=1}^{|\mathcal{V}|} v_j W_{j \alpha} \bigg)
\end{split}
\end{equation*}
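
This closed form is easy to check numerically against a brute-force marginalization over all hidden configurations. A small sketch (sizes and names are arbitrary assumptions):

  import numpy as np
  from itertools import product

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  rng = np.random.default_rng(1)
  nv, nh = 3, 3
  W = rng.normal(size=(nv, nh))
  c, b = rng.normal(size=nv), rng.normal(size=nh)
  v = np.array([1.0, 0.0, 1.0])

  # unnormalized joint p(v, h) ~ exp(-E(v, h))
  def unnorm(v, h):
      return np.exp(c @ v + b @ h + v @ W @ h)

  # brute force p(H_0 = 1 | v): sum the joint over all h with h_0 = 1
  num = sum(unnorm(v, np.array(h)) for h in product([0, 1], repeat=nh) if h[0] == 1)
  den = sum(unnorm(v, np.array(h)) for h in product([0, 1], repeat=nh))
  print(num / den, sigmoid(b[0] + v @ W[:, 0]))  # the two should agree
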
Computing $p(\mathbf{v})$

Summing the joint distribution over the hidden units, $\log \sum_{\mathbf{h}} \exp \big( - E(\mathbf{v}, \mathbf{h}) \big)$ splits into a visible-bias term

\begin{equation*}
\log \prod_{j = 1}^{|\mathcal{V}|} e^{c_j v_j} = \sum_{j=1}^{|\mathcal{V}|} c_j v_j = \mathbf{c}^T \mathbf{v}
\end{equation*}

plus one term per hidden unit, since each binary $h_{\mu}$ can be summed out independently:

\begin{equation*}
\log \prod_{\mu = 1}^{|\mathcal{H}|} \Bigg( 1 + \exp \Big( b_{\mu} + \sum_{j=1}^{|\mathcal{V}|} W_{j \mu} v_j \Big) \Bigg)
= \sum_{\mu=1}^{|\mathcal{H}|} \log \Bigg( 1 + \exp \Big( b_{\mu} + \sum_{j=1}^{|\mathcal{V}|} W_{j \mu} v_j \Big) \Bigg)
\end{equation*}

or, applying $\log$ and $\exp$ element-wise in vector notation,

\begin{equation*}
\mathbf{1}^T \log \Bigg( 1 + \exp \Big( \mathbf{b} + \mathbf{W}^T \mathbf{v} \Big) \Bigg)
\end{equation*}

Hence the free energy defined in the Notation section takes the closed form $F(\mathbf{v}) = - \mathbf{c}^T \mathbf{v} - \sum_{\mu} \log \big( 1 + \exp \big( b_{\mu} + \sum_{j} W_{j \mu} v_j \big) \big)$, and $p(\mathbf{v}) = e^{-F(\mathbf{v})} / Z$.
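
Putting the pieces together in code, a sketch of the free energy (using `np.logaddexp(0, x)` for a numerically stable $\log(1 + e^x)$; names as in the sketches above):

  import numpy as np

  def free_energy(v, c, b, W):
      # F(v) = -log sum_h exp(-E(v, h)) for a Bernoulli RBM
      return -(c @ v) - np.sum(np.logaddexp(0.0, b + v @ W))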

Gaussian RBM

Suppose now that the $V_j$ are continuous variables while the $H_{\mu}$ are still Bernoulli. We assume the following Hamiltonian / energy function

\begin{equation*}
E(\mathbf{v}, \mathbf{h}) = \frac{(v_j - c_j)^2}{2 \sigma_j^2} - b_{\mu} h_{\mu} - \frac{v_j}{\sigma_j} W_{j \mu} h_{\mu}
\end{equation*}

where, as usual, we assume summation over the repeated indices.

  • $\sigma_j$ denotes the standard deviation of $V_j$
  • We use $v_j / \sigma_j$ in the interaction term, as it makes sense to standardize $v_j$, removing the variation in $v_j$ which does not arise from its relationship with the hiddens.

Computing $p(V_i = v_i \mid \mathbf{h})$ and $p(H_{\alpha} = h_{\alpha} \mid \mathbf{v})$

First observe that

\begin{equation*}
p(V_i = v_i \mid \mathbf{h}) = p(V_i = v_i \mid \mathbf{h}, \mathbf{v}_{-i})
\end{equation*}

And,

\begin{equation*}
p(V_i = v_i \mid \mathbf{h}, \mathbf{v}_{-i}) = \frac{p(V_i = v_i, \mathbf{h} \mid \mathbf{v}_{-i})}{\int p(V_i = v_i, \mathbf{h} \mid \mathbf{v}_{-i}) \ d v_i}
\end{equation*}

We know that

\begin{equation*}
\begin{split}
  p(V_i = v_i, \mathbf{h} \mid \mathbf{v}_{-i}) 
  &\propto \exp \Big( - E(v_i, \mathbf{v}_{-i}, \mathbf{h}) \Big) \\
  &= \exp \Bigg[ \Bigg( - \frac{(v_i - c_i)^2}{2 \sigma_i^2} + \sum_{\mu = 1}^{|\mathcal{H}|} \frac{v_i}{\sigma_i} W_{i \mu} h_{\mu} \Bigg) + \\
  & \qquad \quad  \Bigg(  \sum_{j=1, j \ne i}^{|\mathcal{V}|}- \frac{(v_j - c_j)^2}{2 \sigma_j^2}  + \frac{v_j}{\sigma_j} \sum_{\mu = 1}^{|\mathcal{H}|} W_{j \mu} h_{\mu} + \sum_{\mu = 1}^{|\mathcal{H}|} b_{\mu} h_{\mu} \Bigg) \Bigg]
\end{split}
\end{equation*}

where we've separated the terms depending on $v_i$ and $v_j, j \ne i$. Why? In the denominator, we're integrating over all possible values of $v_i$, hence any factor not dependent on $v_i$ we can move out of the integral. And since we find the same factor in the numerator, we can simply cancel the second term in the above expression. This leaves us with

\begin{equation*}
p(V_i = v_i \mid \mathbf{h}, \mathbf{v}_{-i}) = \frac{\exp \Bigg( - \frac{(v_i - c_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} h_{\mu} \Bigg) }{\int \exp \Bigg( - \frac{(v_i - c_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} h_{\mu} \Bigg) \ d v_i}
\end{equation*}

Now, writing out the term in the exponent,

\begin{equation*}
\begin{split}
  - \frac{(v_i - c_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} h_{\mu} 
  &= - \frac{1}{2 \sigma_i^2} \Bigg( v_i^2 - 2 v_i c_i + c_i^2 - 2 v_i \sigma_i \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} h_{\mu} \Bigg) \\
  &= - \frac{1}{2 \sigma_i^2} \Bigg( v_i^2 - 2 v_i \Big(c_i + \sigma_i \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} h_{\mu} \Big) + c_i^2 \Bigg) \\
  &= - \frac{1}{2 \sigma_i^2} \Bigg( v_i^2 - 2 v_i \Big(c_i + \sigma_i \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} h_{\mu} \Big) \Bigg) - \frac{c_i^2}{2 \sigma_i^2}
\end{split}
\end{equation*}

For readability's sake, let

\begin{equation*}
\gamma_i = c_i + \sigma_i \sum_{\mu=1}^{|\mathcal{H}|} W_{i \mu} h_{\mu}
\end{equation*}

Then we can "complete the square" in the first term, which gives us

\begin{equation*}
- \frac{(v_i - c_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} h_{\mu} 
= - \frac{1}{2 \sigma_i^2} ( v_i - \gamma_i )^2 + \frac{1}{2 \sigma_i^2} \gamma_i^2 - \frac{c_i^2}{2 \sigma_i^2}
= - \frac{1}{2 \sigma_i^2} ( v_i - \gamma_i )^2 + \frac{1}{2 \sigma_i^2} (\gamma_i^2 - c_i^2)
\end{equation*}

Finally, substituting this back into our expression for $p(V_i = v_i, \mathbf{h} \mid \mathbf{v}_{-i})$, keeping in mind that $\gamma_i$ does not depend on $v_i$ and can therefore be pulled out of the integral, we get

\begin{equation*}
p(V_i = v_i \mid \mathbf{h}, \mathbf{v}_{-i}) = \frac{\exp \Big( - \frac{(v_i - \gamma_i)^2}{2 \sigma_i^2} \Big)}{\int \exp \Big( - \frac{(v_i - \gamma_i)^2}{2 \sigma_i^2} \Big) \ d v_i}
\end{equation*}

where we've cancelled the exponentials of $(\gamma_i^2 - c_i^2) / 2 \sigma_i^2$ due to the independence mentioned above. But hold on: these exponentials are simply unnormalized Gaussians with mean $\gamma_i$! Hence

\begin{equation*}
\int \exp \Big( - \frac{(v_i - \gamma_i)^2}{2 \sigma_i^2} \Big) \ d v_i = \sqrt{2 \pi} \sigma_i
\end{equation*}

Thus,

\begin{equation*}
p(V_i = v_i \mid \mathbf{h}) = \frac{1}{\sqrt{2 \pi} \sigma_i} \exp \bigg( - \frac{(v_i - \gamma_i)^2}{2 \sigma_i^2} \bigg)
\end{equation*}

Or equivalently,

\begin{equation*}
V_i \mid \mathbf{h} \sim \mathcal{N} \Big( \gamma_i, \sigma_i^2 \Big) = \mathcal{N} \Big( c_i + \sigma_i \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} h_{\mu}, \ \sigma_i^2 \Big)
\end{equation*}

And we now understand why the RBM resulting from this specific energy function is called a Gaussian RBM!

A final note is that

\begin{equation*}
p(\mathbf{v} \mid \mathbf{h}) = \prod_{j=1}^{|\mathcal{V}|} p(V_j = v_j \mid \mathbf{h})
\end{equation*}

due to the conditional independence of the visible units $V_j$ given the hidden units, as usual in an RBM. Hence

\begin{equation*}
p(\mathbf{v} \mid \mathbf{h}) = \prod_{j=1}^{|\mathcal{V}|} \frac{1}{\sqrt{2 \pi} \sigma_j} \exp \Bigg( - \frac{(v_j - \gamma_j)^2}{2 \sigma_j^2} \Bigg)
\end{equation*}

Or, in a numerically more convenient form,

\begin{equation*}
\log p(\mathbf{v} \mid \mathbf{h}) = - \frac{|\mathcal{V}|}{2} \log (2 \pi) - \sum_{j=1}^{|\mathcal{V}|} \Bigg( \log \sigma_j + \frac{(v_j - \gamma_j)^2}{2 \sigma_j^2} \Bigg)
\end{equation*}
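
Sampling the visible layer of a Gaussian RBM is then a single draw per unit. A sketch (the `sigma` array of per-unit standard deviations is an assumption):

  import numpy as np

  def sample_visible_gaussian(h, c, W, sigma, rng):
      # V_i | h ~ N(c_i + sigma_i * sum_mu W_{i mu} h_mu, sigma_i^2)
      mean = c + sigma * (W @ h)
      return rng.normal(mean, sigma)
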
  • What if the $h_{\mu}$ are also Gaussian?

    What happens if we instead have the following energy function

    \begin{equation*}
E(\mathbf{v}, \mathbf{h}) = \frac{(v_j - c_j)^2}{2 \sigma_j^2} + \frac{(h_{\mu} - b_{\mu})^2}{2 \sigma_{\mu}^2} - \frac{v_j}{\sigma_j} W_{j \mu} \frac{h_{\mu}}{\sigma_{\mu}}
\end{equation*}

    Going back to our expression for the distribution of $V_i$

    \begin{equation*}
V_i \mid \mathbf{h} \sim \mathcal{N} \Big( c_i + \sigma_i \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} h_{\mu}, \quad \sigma_i^2 \Big)
\end{equation*}

    The only dependence on $h_{\mu}$ arises from the multiplication with $W_{i \mu}$ in the energy function, hence the distribution of the visible units when the hidden can also take on continuous values, is simply

    \begin{equation*}
V_i \mid \mathbf{h} \sim \mathcal{N}\Big(c_i + \sigma_i \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} \frac{h_{\mu}}{\sigma_{\mu}}, \  \sigma_i^2 \Big)
\end{equation*}

    The only addition being the factor of $1 / \sigma_{\mu}$ in the sum.

What about $p(H_{\alpha} = h_{\alpha} \mid \mathbf{v})$?

Looking at our derivation for $p(V_i = v_i \mid \mathbf{h})$, the derivation for the hiddens given the visibles is exactly the same, substituting $c_i$ with $b_{\alpha}$ and $h_{\mu} / \sigma_{\mu}$ with $v_j / \sigma_j$, giving us

\begin{equation*}
H_{\alpha} \mid \mathbf{v} \sim \mathcal{N} \Big( b_{\alpha} + \sigma_{\alpha} \sum_{j=1}^{|\mathcal{V}|} \frac{v_j}{\sigma_j} W_{j \alpha}, \ \sigma_{\alpha}^2 \Big)
\end{equation*}

General expressions for both Bernoulli and Gaussians

If you've had a look at the derivations for the conditional probabilities $p(H_{\alpha} = h_{\alpha} \mid \mathbf{v})$ and $p(V_i = v_i \mid \mathbf{h})$ when $H_{\alpha}$ and $V_i$ are Bernoulli or Gaussian, you might have observed that the expressions in the exponentials are very similar. In fact, we have the following

\begin{equation*}
V_i \mid \mathbf{h} \sim
\begin{cases}
  \text{Bernoulli}\Big( \text{sigmoid} \Big( c_i + \sigma_i \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} \frac{h_{\mu}}{\sigma_{\mu}} \Big) \Big) & \text{if } V_i \mid \mathbf{h} \text{ Bernoulli} \\
  \mathcal{N}\Big(c_i + \sigma_i \sum_{\mu = 1}^{|\mathcal{H}|} W_{i \mu} \frac{h_{\mu}}{\sigma_{\mu}}, \ \sigma_i^2 \Big) & \text{if } V_i \mid \mathbf{h} \text{ Gaussian}
\end{cases}
\end{equation*}

with

\begin{equation*}
\sigma_{\alpha} = 
\begin{cases}
  1 & \text{if } H_{\alpha} \mid \mathbf{v} \text{ Bernoulli} \\
  \sqrt{\text{Var}(H_{\alpha})} & \text{if } H_{\alpha} \mid \mathbf{v} \text{ Gaussian}
\end{cases}
\end{equation*}

and for the hiddens

\begin{equation*}
H_{\alpha} \mid \mathbf{v} \sim
\begin{cases}
  \text{Bernoulli}\Big( \text{sigmoid} \Big( b_{\alpha} + \sigma_{\alpha} \sum_{j = 1}^{|\mathcal{V}|} \frac{v_{j}}{\sigma_{j}} W_{j \alpha} \Big) \Big) & \text{if } H_{\alpha} \mid \mathbf{v} \text{ Bernoulli} \\
  \mathcal{N}\Big(b_{\alpha} + \sigma_{\alpha} \sum_{j = 1}^{|\mathcal{V}|} \frac{v_{j}}{\sigma_{j}} W_{j \alpha}, \ \sigma_{\alpha}^2 \Big) & \text{if } H_{\alpha} \mid \mathbf{v} \text{ Gaussian}
\end{cases}
\end{equation*}

with

\begin{equation*}
\sigma_{i} = 
\begin{cases}
  1 & \text{if } V_{i} \mid \mathbf{h} \text{ Bernoulli} \\
  \sqrt{\text{Var}(V_{i})} & \text{if } V_{i} \mid \mathbf{h} \text{ Gaussian}
\end{cases}
\end{equation*}

If we want to implement an RBM allowing both Bernoulli and Gaussian hidden and / or visible units, we can; a sketch follows below.

We still need to make appropriate corrections in the free-energy computation, i.e. when computing $F(\mathbf{v}) = - \log \sum_{\mathbf{h}}^{} \exp \big( - E(\mathbf{v}, \mathbf{h}) \big)$.
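
One way to organize such an implementation is to compute the preactivation (bias plus scaled weighted input) once, and let a per-layer flag decide how to sample. A sketch under the expressions above (the flag names and `sigma` handling are assumptions):

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def sample_layer(pre, kind, sigma, rng):
      # pre: bias + scaled weighted input, e.g. for the hiddens
      #   pre = b + sigma_h * ((v / sigma_v) @ W)
      # and for the visibles
      #   pre = c + sigma_v * (W @ (h / sigma_h))
      if kind == "bernoulli":            # sigma is fixed to 1 in this case
          p = sigmoid(pre)
          return (rng.random(p.shape) < p).astype(float)
      return rng.normal(pre, sigma)      # kind == "gaussian"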

Contrastive Divergence (CD)

The idea of Contrastive Divergence is to approximate the gradient of the log-likelihood we saw earlier

\begin{equation*}
\frac{1}{N} \frac{\partial \mathcal{L}}{\partial \theta} = - \frac{1}{N} \sum_{n=1}^{N} \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \ \Bigg| \ \mathbf{v}^{(n)}\Bigg] + \mathbb{E} \Bigg[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \Bigg]
\end{equation*}

by

  1. Initialize "gradients" $\Delta \boldsymbol{\theta} = 0$
  2. For $n = 1, \dots, N$:
    1. Initialize sampler with training instance $\mathbf{v}^{(n)}$ (instead of using a random initialization)
    2. For $t = 0, \dots, k - 1$ do
      1. $\tilde{\mathbf{h}}^{(t)} \sim p(\mathbf{h} \mid \tilde{\mathbf{v}}^{(t)})$, where the entries are sampled independently of each other (due to the bipartite structure of an RBM)
      2. $\tilde{\mathbf{v}}^{(t + 1)} \sim p(\mathbf{v} \mid \tilde{\mathbf{h}}^{(t)})$, likewise with independent entries
    3. Compute the per-example gradient, replacing the model sample $\mathbf{v}$ with $\tilde{\mathbf{v}}^{(k)}$.
    4. $\Delta \boldsymbol{\theta} \leftarrow \Delta \boldsymbol{\theta} + \nabla_{\boldsymbol{\theta}} \mathcal{L}_n \big( \tilde{\mathbf{v}}^{(k)} \big)$
  3. Update parameters:

    \begin{equation*}
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \Delta \boldsymbol{\theta}
\end{equation*}

    where $\alpha$ is the learning rate

\begin{algorithm*}[H]\label{alg:contrastive-divergence}
  \KwData{Observations of the visible units: $\{ \mathbf{v}^{(n)}, n = 1, \dots, N \}$}
  \KwResult{Estimated parameters for an RBM: $(\mathbf{b}, \mathbf{c}, \mathbf{W})$}
  $c_j := 0$ for $j = 1, \dots, |\mathcal{V}|$ \\
  $b_{\mu} := 0$ for $\mu = 1, \dots, |\mathcal{H}|$ \\
  $W_{j\mu} \sim \mathcal{N}(0, 0.01)$ for $(j, \mu) \in \left\{ 1, \dots, |\mathcal{V}| \right\} \times \left\{ 1, \dots, |\mathcal{H}| \right\}$ \\
  \While{not converged}{
    $\Delta c_j := 0$ \\
    $\Delta b_{\mu} := 0$ \\
    $\Delta W_{j \mu} := 0$ \\
    \For{$n = 1, \dots, N$}{
      \tcp{initialize sampling procedure}
      $\hat{\mathbf{v}} := \mathbf{v}^{(n)}$ \\
      \tcp{sample using Gibbs sampling}
      \For{$t = 1, \dots, k$}{
        $\hat{\mathbf{h}} \sim p(\mathbf{h} \mid \hat{\mathbf{v}})$ \\
        $\hat{\mathbf{v}} \sim p(\mathbf{v} \mid \hat{\mathbf{h}})$
      }
      \tcp{accumulate changes}
      $\Delta c_j \leftarrow \Delta c_j + v_j^{(n)} - \hat{v}_j$ \\
      $\Delta b_{\mu} \leftarrow \Delta b_{\mu} + p\big(H_{\mu} = 1 \mid \mathbf{v}^{(n)} \big) - p \big( H_{\mu} = 1 \mid  \hat{\mathbf{v}} \big)$ \\
      $\Delta W_{j \mu} \leftarrow \Delta W_{j \mu} + v_j^{(n)} p\big(H_{\mu} = 1 \mid \mathbf{v}^{(n)} \big) -  \hat{v}_j p \big( H_{\mu} = 1 \mid \hat{\mathbf{v}} \big)$ \\
    }
    \tcp{update the parameters of the RBM using average gradient}
    $c_j \leftarrow c_j + \frac{\Delta c_j}{N}$ \\
    $b_{\mu} \leftarrow b_{\mu} + \frac{\Delta b_{\mu}}{N}$ \\
    $W_{j \mu} \leftarrow W_{j \mu} + \frac{\Delta W_{j \mu}}{N}$ \\
  }
  \caption{\textbf{Contrastive Divergence (CD-k)} with $k$ sampling steps.}
\end{algorithm*}

This is basically the algorithm presented in Fischer_2015, but in my opinion this ought to be the average gradient, not the summed-up change; there seems to be a factor of $\frac{1}{N}$ missing, which is included in the update step above.

Of course, one could just absorb this into the learning rate, as long as the number of samples in each batch is the same.
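
A compact NumPy rendering of CD-k for a Bernoulli RBM, with the $\frac{1}{N}$ averaging discussed above folded into the update (a sketch; shapes, names and hyperparameters are assumptions):

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def cd_k(V, nh, k=1, lr=0.1, epochs=10, seed=0):
      # V: (N, nv) array of binary training samples
      rng = np.random.default_rng(seed)
      N, nv = V.shape
      W = rng.normal(scale=0.01, size=(nv, nh))
      c, b = np.zeros(nv), np.zeros(nh)
      for _ in range(epochs):
          ph_data = sigmoid(b + V @ W)    # p(H = 1 | v^(n)), positive phase
          Vk = V.copy()                   # chains start at the data
          for _ in range(k):              # k steps of block Gibbs
              H = (rng.random((N, nh)) < sigmoid(b + Vk @ W)).astype(float)
              Vk = (rng.random((N, nv)) < sigmoid(c + H @ W.T)).astype(float)
          ph_model = sigmoid(b + Vk @ W)  # negative phase statistics
          W += lr * (V.T @ ph_data - Vk.T @ ph_model) / N
          c += lr * (V - Vk).mean(axis=0)
          b += lr * (ph_data - ph_model).mean(axis=0)
      return c, b, W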

Why not compute the expectation as one samples these $k$ steps, rather than only at the end of the chain?

That is, when performing the approximation for the log-likelihood for a single training example $\mathbf{v}^{(n)}$

\begin{equation*}
\frac{\partial \log p(\mathbf{v}^{(n)})}{\partial \theta} \approx - \frac{\partial F(\mathbf{v}^{(n)})}{\partial \theta} + \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \bigg[ \frac{\partial F(\tilde{\mathbf{v}}^{(k)})}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \bigg]
\end{equation*}

by replacing the expectation by sampling $\tilde{\mathbf{v}}^{(k)} \sim p\big(\tilde{\mathbf{v}}^{(k)} \mid \mathbf{v}^{(n)}\big)$, i.e.

\begin{equation*}
\mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \bigg[ \frac{\partial F(\tilde{\mathbf{v}}^{(k)})}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \bigg] \approx \frac{\partial F(\tilde{\mathbf{v}}^{(k)})}{\partial \theta}
\end{equation*}

where $\tilde{\mathbf{v}}^{(k)}$ denotes the $k$-th sample from a sampler chain, initialized by the training example $\mathbf{v}^{(n)}$.

This leaves us with the update rule (for a single training example):

\begin{equation*}
\Delta \theta = - \frac{\partial F(\mathbf{v}^{(n)})}{\partial \theta} + \frac{\partial F(\tilde{\mathbf{v}}^{(k)})}{\partial \theta}
\end{equation*}

Why not compute the expectation over all these steps? Instead of stochastically approximating this expectation by a single example (the final sample in the chain), we could use the following approximation to the expectation

\begin{equation*}
\mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \bigg[ \frac{\partial F(\tilde{\mathbf{v}}^{(k)})}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \bigg] \approx \frac{1}{k} \sum_{t=1}^{k} \frac{\partial F(\tilde{\mathbf{v}}^{(t)})}{\partial \theta}
\end{equation*}

I guess some reasons could be:

  • a more biased estimate (early samples in the chain are far from the model distribution; the more steps we let the sampler take, the closer the final sample is to unbiased)
  • computing the gradient is more expensive than sampling

Theoretical justification

Most, if not all, is straight up copied from bengio_2009.

Lemma 1: Consider a Gibbs chain $\tilde{\mathbf{v}}^{(1)} \to \tilde{\mathbf{h}}^{(1)} \to \tilde{\mathbf{v}}^{(2)} \to \tilde{\mathbf{h}}^{(2)} \to \dots$ starting at a data point, i.e. $\tilde{\mathbf{v}}^{(1)} = \mathbf{v}^{(n)}$ for some $n$. Then the log-likelihood at any step $k$ of the chain can be written

\begin{equation*}
\log p\big(\mathbf{v}^{(n)}\big) = \log \frac{p\big(\mathbf{v}^{(n)}\big)}{p \big( \tilde{\mathbf{v}}^{(k)} \big)} + \log p \big( \tilde{\mathbf{v}}^{(k)} \big)
\end{equation*}

and since this is true for any path,

\begin{equation*}
\log p \big( \mathbf{v}^{(n)} \big) = \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \Bigg[ \log \frac{p \big( \mathbf{v}^{(n)} \big)}{p \big( \tilde{\mathbf{v}}^{(k)} \big)} \ \Big| \ \mathbf{v}^{(n)} \Bigg] + \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \Big[ \log p \big( \tilde{\mathbf{v}}^{(k)} \big) \ \Big| \ \mathbf{v}^{(n)} \Big]
\end{equation*}

where the expectations are over Markov chain sample paths, conditioned on the starting sample $\mathbf{v}^{(n)}$.

The first equation is obvious, while the second equation requires rewriting

\begin{equation*}
\log p\big(\mathbf{v}^{(n)} \big) = \sum_{\tilde{\mathbf{v}}^{(k)}}^{} p \big( \tilde{\mathbf{v}}^{(k)} \mid \mathbf{v}^{(n)} \big) \log p \big( \mathbf{v}^{(n)} \big)
\end{equation*}

and substituting in the first equation.

Lemma 2: For any model $p(\mathbf{y})$ with parameters $\theta$,

\begin{equation*}
\mathbb{E} \Bigg[ \frac{\partial \log p(\mathbf{y})}{\partial \theta} \Bigg] = 0
\end{equation*}

when the expected value is taken according to $p(\mathbf{y})$ itself. To see this, write

\begin{equation*}
\begin{split}
  \mathbb{E} \Bigg[ \frac{\partial \log p(\mathbf{y})}{\partial \theta} \Bigg] &= \sum_{\mathbf{y}}^{} p(\mathbf{y}) \frac{\partial \log p(\mathbf{y})}{\partial \theta} \\
  &= \sum_{\mathbf{y}}^{} \frac{p(\mathbf{y})}{p(\mathbf{y})} \frac{\partial p(\mathbf{y})}{\partial \theta} \\
  &= \frac{\partial }{\partial \theta} \sum_{\mathbf{y}}^{} p(\mathbf{y}) \\
  &= \frac{\partial }{\partial \theta} 1 = 0
\end{split}
\end{equation*}

(Observe that for this to work with continuous variables we need to ensure that we're allowed to interchange the partial derivative and the integration symbol)

Theorem: Consider a Gibbs chain $\tilde{\mathbf{v}}^{(1)} \to \tilde{\mathbf{h}}^{(1)} \to \tilde{\mathbf{v}}^{(2)} \to \tilde{\mathbf{h}}^{(2)} \to \dots$ starting at a data point, i.e. $\tilde{\mathbf{v}}^{(1)} = \mathbf{v}^{(n)}$ for some $n$.

The log-likelihood gradient can then be written

\begin{equation*}
\frac{\partial \log p \big( \mathbf{v}^{(n)} \big)}{\partial \theta} = - \frac{\partial F \big( \mathbf{v}^{(n)} \big)}{\partial \theta} + \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \Bigg[ \frac{\partial F \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \Bigg] + \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \Bigg[ \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \Bigg]
\end{equation*}

and

\begin{equation*}
\lim_{k \to \infty} \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \Bigg[ \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \Bigg] = 0
\end{equation*}

Taking the derivative of the expression in Lemma 1, we get

\begin{equation*}
\begin{split}
  \frac{\partial \log p \big( \mathbf{v}^{(n)} \big)}{\partial \theta} &= \frac{\partial }{\partial \theta} \log \frac{p \big( \mathbf{v}^{(n)} \big)}{p \big( \tilde{\mathbf{v}}^{(k)} \big)} + \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \\
  &= \frac{\partial }{\partial \theta} \log \frac{\exp \Big( -F \big( \mathbf{v}^{(n)} \big) \Big)}{\exp \Big( - F \big( \tilde{\mathbf{v}}^{(k)} \big) \Big)} + \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \\
  &= - \frac{\partial F \big( \mathbf{v}^{(n)} \big)}{\partial \theta} + \frac{\partial F \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} + \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta}
\end{split}
\end{equation*}

Taking the expectations wrt. the Markov chain conditional on $\mathbf{v}^{(n)}$, we get

\begin{equation*}
\frac{\partial \log p\big( \mathbf{v}^{(n)} \big)}{\partial \theta} = - \frac{\partial F \big( \mathbf{v}^{(n)} \big)}{\partial \theta} + \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \Bigg[ \frac{\partial F \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \Bigg] + \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \Bigg[ \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \Bigg]
\end{equation*}

To show that the last term vanishes for $k \to \infty$, we use the assumed convergence of the chain, which can be written

\begin{equation*}
p\big( \tilde{\mathbf{V}}^{(k)} = \mathbf{v} \mid \tilde{\mathbf{V}}^{(1)} = \mathbf{v}^{(n)} \big) = p \big( \mathbf{v} \big) + \varepsilon_k(\mathbf{v})
\end{equation*}

with

\begin{equation*}
\sum_{\mathbf{v}}^{} \varepsilon_k(\mathbf{v}) = 0 \quad \text{and} \quad \lim_{k \to \infty} \varepsilon_k(\mathbf{v}) = 0
\end{equation*}

We can then rewrite the last expectation as

\begin{equation*}
\begin{split}
  \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \Bigg[ \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)}\Bigg]
  &= \sum_{\tilde{\mathbf{v}}^{(k)}}^{} p \big( \tilde{\mathbf{v}}^{(k)} \mid \mathbf{v}^{(n)} \big) \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \\
  &= \sum_{\tilde{\mathbf{v}}^{(k)}}^{} \Big( p \big( \tilde{\mathbf{v}}^{(k)} \big)  + \varepsilon_k \big( \tilde{\mathbf{v}}^{(k)} \big) \Big) \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \\
  &= \sum_{\tilde{\mathbf{v}}^{(k)}}^{} p \big( \tilde{\mathbf{v}}^{(k)} \big) \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} + \sum_{\tilde{\mathbf{v}}^{(k)}}^{} \varepsilon_k \big( \tilde{\mathbf{v}}^{(k)} \big) \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta}
\end{split}
\end{equation*}

Using Lemma 2, the first sum vanishes, and we're left with

\begin{equation*}
\begin{split}
  \left| \mathbb{E}_{\tilde{\mathbf{v}}^{(k)}} \Bigg[ \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \ \Big| \ \mathbf{v}^{(n)} \Bigg] \right| 
  &\le \sum_{\tilde{\mathbf{v}}^{(k)}}^{} \left| \varepsilon_k \big( \tilde{\mathbf{v}}^{(k)} \big) \right| \left| \frac{\partial \log p \big( \tilde{\mathbf{v}}^{(k)} \big)}{\partial \theta} \right| \\
  &\le \Bigg( N_{\mathbf{v}} \max_{\mathbf{v}} \left| \frac{\partial \log p \big( \mathbf{v} \big)}{\partial \theta} \right| \Bigg) \varepsilon_k
\end{split}
\end{equation*}

where $N_{\mathbf{v}}$ denotes the number of possible discrete configurations $\mathbf{V}$ can take on, and we've let

\begin{equation*}
\varepsilon_k = \max_{\tilde{\mathbf{v}}^{(k)}} \left| \varepsilon_k \big( \tilde{\mathbf{v}}^{(k)} \big) \right| \implies \lim_{k \to \infty} \varepsilon_k = 0
\end{equation*}

Hence this expectation goes to zero, completing our proof.

The proof above also tells us something important: the error CD-k makes by dropping the last expectation has an upper bound which depends linearly on the sampler error $\varepsilon_k$.

Therefore the mixing-rate of the chain is directly related to the error of CD-k.

(It is important to remember that errors are of course also introduced by the stochastic, single-sample approximation of the remaining expectation.)

Persistent Contrastive Divergence (PCD)

Persistent Contrastive Divergence (PCD) is obtained from the CD approximation by replacing the sample $\tilde{\mathbf{v}}^{(k)}$ by a sample from a Gibbs chain that is independent of the sample $\tilde{\mathbf{v}}^{(0)}$ from the training distribution.

This corresponds to standard CD without reinitializing the visible units of the Markov chain with a training sample each time we want to draw a sample $\tilde{\mathbf{v}}^{(k)}$ (see the sketch after the algorithm below).

\begin{algorithm*}[H]\label{alg:persistent-contrastive-divergence}
  \KwData{Observations of the visible units: $\{   \mathbf{v}^{(n)}, n = 1, \dots, N  \}$}
  \KwResult{Estimated parameters for an RBM: $(\mathbf{b}, \mathbf{c}, \mathbf{W})$}
  $c_j := 0$ for $j = 1, \dots, |\mathcal{V}|$ \\
  $b_{\mu} := 0$ for $\mu = 1, \dots, |\mathcal{H}|$ \\
  $W_{j\mu} \sim \mathcal{N}(0, 0.01)$ for $(j, \mu) \in \left\{ 1, \dots, |\mathcal{V}| \right\} \times \left\{ 1, \dots, |\mathcal{H}| \right\}$ \\
  \While{not converged}{
    $\Delta c_j := 0$ \\
    $\Delta b_{\mu} := 0$ \\
    $\Delta W_{j \mu} := 0$ \\
    \tcp{initialize the persistent chain once, before the loop over the data}
    $\hat{\mathbf{v}} := \mathbf{v}^{(n)}$ for some $n \in \left\{ 1, \dots, N \right\}$ \\
    \For{$n = 1, \dots, N$}{
      \tcp{sample using Gibbs sampling}
      \tcp{using final sample from previous chain as starting point}
      \For{$t = 1, \dots, k$}{
        $\hat{\mathbf{h}} \sim p(\mathbf{h} \mid \hat{\mathbf{v}})$ \\
        $\hat{\mathbf{v}} \sim p(\mathbf{v} \mid \hat{\mathbf{h}})$
      }
      \tcp{accumulate changes}
      $\Delta c_j \leftarrow \Delta c_j + v_j^{(n)} - \hat{v}_j$ \\
      $\Delta b_{\mu} \leftarrow \Delta b_{\mu} + p\big(H_{\mu} = 1 \mid \mathbf{v}^{(n)} \big) - p \big( H_{\mu} = 1 \mid  \hat{\mathbf{v}} \big)$ \\
      $\Delta W_{j \mu} \leftarrow \Delta W_{j \mu} + v_j^{(n)} p\big(H_{\mu} = 1 \mid \mathbf{v}^{(n)} \big) -  \hat{v}_j p \big( H_{\mu} = 1 \mid \hat{\mathbf{v}} \big)$ \\
    }
    \tcp{update the parameters of the RBM using average gradient}
    $c_j \leftarrow c_j + \frac{\Delta c_j}{N}$ \\
    $b_{\mu} \leftarrow b_{\mu} + \frac{\Delta b_{\mu}}{N}$ \\
    $W_{j \mu} \leftarrow W_{j \mu} + \frac{\Delta W_{j \mu}}{N}$ \\
  }
  \caption{\textbf{Persistent Contrastive Divergence (PCD-k)} with $k$ steps. Initialize next chain from final state of previous chain.}
\end{algorithm*}
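
Relative to the CD-k sketch above (itself an assumption of these notes), the change is small: the fantasy particles are initialized once and never reset to the data. A sketch:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def pcd_k(V, nh, k=1, lr=0.1, epochs=10, seed=0):
      # PCD-k: like cd_k, but the Gibbs chains persist across updates
      rng = np.random.default_rng(seed)
      N, nv = V.shape
      W = rng.normal(scale=0.01, size=(nv, nh))
      c, b = np.zeros(nv), np.zeros(nh)
      Vk = V.copy()                       # persistent chains, initialized ONCE
      for _ in range(epochs):
          ph_data = sigmoid(b + V @ W)
          for _ in range(k):              # continue the old chains; no reset
              H = (rng.random((N, nh)) < sigmoid(b + Vk @ W)).astype(float)
              Vk = (rng.random((N, nv)) < sigmoid(c + H @ W.T)).astype(float)
          ph_model = sigmoid(b + Vk @ W)
          W += lr * (V.T @ ph_data - Vk.T @ ph_model) / N
          c += lr * (V - Vk).mean(axis=0)
          b += lr * (ph_data - ph_model).mean(axis=0)
      return c, b, W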

Parallel Tempering (PT)

Parallel Tempering (PT) introduces supplementary Gibbs chains that sample from more and more smoothed replicas of the original distribution.

Formally, given an ordered set of $M$ temperatures $1 = T_1 < T_2 < \dots < T_M$, we define a set of $M$ Markov chains with stationary distributions

\begin{equation*}
p_r(\mathbf{v}, \mathbf{h}) = \frac{1}{Z_r} e^{- \frac{1}{T_r} E(\mathbf{v}, \mathbf{h})}
\end{equation*}

for $r = 1, \dots, M$ where $Z_r = \sum_{\mathbf{v}, \mathbf{h}}^{} e^{- \frac{1}{T_r} E(\mathbf{v}, \mathbf{h})}$ is the corresponding partition function, and $p_1$ is exactly the model distribution.

  • With increasing temperature, $p_r$ becomes more and more uniform
  • As $T_r \to \infty$ we have $p_r \to U$, i.e. the uniform distribution
    • Subsequent samples of the Gibbs chain are then independent of each other, so the stationary distribution is reached after just one sampling step

In each step of the algorithm, we run $k$ Gibbs sampling steps in each tempered Markov chain yielding samples $(\mathbf{v}_1, \mathbf{h}_1), \dots, (\mathbf{v}_M, \mathbf{h}_M)$.

Then, two neighboring Gibbs chains with temperatures $T_r$ and $T_{r - 1}$ may exchange "particles" $(\mathbf{v}_r, \mathbf{h}_r)$ and $(\mathbf{v}_{r - 1}, \mathbf{h}_{r - 1})$ with probability (based on Metropolis ratio),

\begin{equation*}
\min \left\{ 1, \frac{p_r(\mathbf{v}_{r - 1}, \mathbf{h}_{r - 1}) p_{r - 1}(\mathbf{v}_r, \mathbf{h}_r)}{p_r(\mathbf{v}_r, \mathbf{h}_r) p_{r - 1}(\mathbf{v}_{r - 1}, \mathbf{h}_{r - 1})} \right\}
\end{equation*}

which gives, for RBMs,

\begin{equation*}
\min \left\{ 1, \exp \Bigg[ \Bigg( \frac{1}{T_r} - \frac{1}{T_{r - 1}} \Bigg) \bigg( E(\mathbf{v}_r, \mathbf{h}_r) - E(\mathbf{v}_{r - 1}, \mathbf{h}_{r - 1}) \bigg) \Bigg] \right\}
\end{equation*}

After performing these "swaps" between chains (which increase the mixing rate), we take the sample $\mathbf{v}_1$ from the original chain with $T_1 = 1$ as the sample from the RBM distribution.

This procedure is repeated $L$ times, yielding the samples $\mathbf{v}_{1, 1}, \dots, \mathbf{v}_{1, L}$ used for the approximation of the expectation under the RBM distribution in the log-likelihood gradient. Usually $L$ is the number of samples in the batch of training data.

Compared to CD, PT produces a faster-mixing Markov chain, and thus a less biased gradient approximation, at the cost of extra computation.
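
The swap test itself is a one-liner given the energies of the two particles. A sketch (the argument names are assumptions):

  import numpy as np

  def swap_probability(E_r, E_r_minus_1, T_r, T_r_minus_1):
      # Metropolis probability of exchanging the particles of chains r and r-1
      return min(1.0, np.exp((1.0 / T_r - 1.0 / T_r_minus_1)
                             * (E_r - E_r_minus_1)))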

Estimating partition function

Bibliography

  • [ackley_1985] Ackley, Hinton & Sejnowski, A Learning Algorithm for Boltzmann Machines, Cognitive Science, 9(1), 147–169 (1985).
  • [Fischer_2015] Fischer, Training Restricted Boltzmann Machines, KI - Künstliche Intelligenz, 29(4), 441–444 (2015).
  • [bengio_2009] Bengio & Delalleau, Justifying and Generalizing Contrastive Divergence, Neural Computation, 21(6), 1601–1621 (2009).