RBM: Restricted Boltzmann Machines
Overview
- One visible layer of observed binary units

- One hidden layer of unobserved binary units

- Connections between every pair of nodes in different layers; "restricted" means there are no hidden-to-hidden or visible-to-visible connections, only connections between all visible-hidden pairs.

- Models maximum-entropy distributions; because of this, RBMs were originally motivated by Hinton as a "soft-constraint learner", intended to learn a distribution which satisfies the underlying constraints (see the Principle of Maximum Entropy) ackley_1985.
Notation
- $v$ and $v_i$ denote visible units
- $h$ and $h_j$ denote hidden units
- $b$ and $b_i$ will be used to denote the bias of the visible units
- $c$ and $c_j$ will be used to denote the bias of the hidden units
- $\hat{\mathbb{E}}[\cdot]$ denotes the empirical expectation
- $F(v)$ denotes the so-called free energy: $F(v) = - \log \sum_h e^{-E(v, h)}$
- $v^{(t)}$ and $h^{(t)}$ denote the $t$-th sample from a sampling chain for the visible and hidden variables, respectively
- $v_{-i}$ denotes the vector $v$ but excluding the entry at index $i$: $v_{-i} = (v_1, \dots, v_{i-1}, v_{i+1}, \dots, v_m)$
Factor graph view
A Boltzmann Machine is an energy-based model consisting of a set of hidden units $h$ and a set of visible units $v$, where by "units" we mean random variables taking on the values $h_j \in \{0, 1\}$ and $v_i \in \{0, 1\}$, respectively.

A Boltzmann Machine assumes the following joint probability distribution of the visible and hidden units:

$$p(v, h) = \frac{1}{Z} e^{-E(v, h)}$$

with

$$Z = \sum_{v, h} e^{-E(v, h)}$$

being the partition function (normalization factor) and the energy function

$$E(v, h) = - b^T v - c^T h - v^T W h - \frac{1}{2} v^T L v - \frac{1}{2} h^T J h$$

where $L$ and $J$ denote the coupling matrices for the visible-to-visible and hidden-to-hidden connections, respectively.
A Restricted Boltzmann Machine (RBM) is an energy-based model consisting of a set of hidden units $h$ and a set of visible units $v$, where by "units" we mean random variables taking on the values $h_j \in \{0, 1\}$ and $v_i \in \{0, 1\}$, respectively.

The restricted part of the name comes from the fact that there are no intra-layer connections ($L = J = 0$ above), so the hidden units are conditionally independent given the visible units, and vice versa, i.e.

$$p(h \mid v) = \prod_j p(h_j \mid v), \qquad p(v \mid h) = \prod_i p(v_i \mid h)$$

An RBM assumes the following joint probability distribution of the visible and hidden units:

$$p(v, h) = \frac{1}{Z} e^{-E(v, h)}$$

with

$$Z = \sum_{v, h} e^{-E(v, h)}$$

being the partition function (normalization factor) and the energy function

$$E(v, h) = - b_i v_i - c_j h_j - v_i W_{ij} h_j$$

implicitly summing over repeated indices.
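To make this concrete, here is a minimal NumPy sketch of the energy function (the names `energy`, `W`, `b`, `c` are my own, chosen to match the notation above):

```python
import numpy as np

def energy(v, h, W, b, c):
    """Energy E(v, h) = -b^T v - c^T h - v^T W h of an RBM."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

# Tiny example: 3 visible units, 2 hidden units.
rng = np.random.default_rng(42)
W = rng.normal(scale=0.01, size=(3, 2))  # visible-to-hidden weights
b = np.zeros(3)                          # visible biases
c = np.zeros(2)                          # hidden biases
v = np.array([1.0, 0.0, 1.0])
h = np.array([0.0, 1.0])
print(energy(v, h, W, b, c))
```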
Training
Log-likelihood using gradient ascent
Gradient of log-likelihood of energy-based models
RBMs assume the following model

$$p(v, h) = \frac{1}{Z} e^{-E(v, h)}$$

with

$$E(v, h) = - b_i v_i - c_j h_j - v_i W_{ij} h_j$$

where we're implicitly summing over repeated indices. The minus-minus signs might seem redundant (and they are), but they are convention, owing to the Boltzmann distribution found in physics. First we observe that

$$p(v) = \sum_h p(v, h) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$

Taking the log of this, we have

$$\log p(v) = \log \sum_h e^{-E(v, h)} - \log Z$$

Now suppose we're given a set of samples $\{ v^{(n)} \}_{n=1}^N$, then the likelihood (assuming i.i.d.) is given by

$$L(\theta) = \prod_{n=1}^N p(v^{(n)})$$

Taking the log of this expression, and using the above expression for $\log p(v)$, we have

$$\ell(\theta) = \sum_{n=1}^N \left( \log \sum_h e^{-E(v^{(n)}, h)} - \log Z \right)$$

Let $\theta$ denote any one of the parameters of the model; taking the partial derivative wrt. $\theta$ for the $n$-th term we have

$$\frac{\partial}{\partial \theta} \log p(v^{(n)}) = \frac{\sum_h e^{-E(v^{(n)}, h)} \left( - \frac{\partial E(v^{(n)}, h)}{\partial \theta} \right)}{\sum_{h'} e^{-E(v^{(n)}, h')}} - \frac{1}{Z} \frac{\partial Z}{\partial \theta}$$

The first term can be written as an expectation

$$\frac{\sum_h e^{-E(v^{(n)}, h)} \left( - \frac{\partial E(v^{(n)}, h)}{\partial \theta} \right)}{\sum_{h'} e^{-E(v^{(n)}, h')}} = - \mathbb{E}_{p(h \mid v^{(n)})} \left[ \frac{\partial E(v^{(n)}, h)}{\partial \theta} \right]$$

since on the LHS we're marginalizing over all $h$ and then normalizing wrt. the same distribution we just summed over.

For the second term recall that $Z = \sum_{v, h} e^{-E(v, h)}$, therefore

$$\frac{1}{Z} \frac{\partial Z}{\partial \theta} = - \sum_{v, h} \frac{e^{-E(v, h)}}{Z} \frac{\partial E(v, h)}{\partial \theta} = - \mathbb{E}_{p(v, h)} \left[ \frac{\partial E(v, h)}{\partial \theta} \right]$$

Substituting it all back into the partial derivative of the log-likelihood, we end up with

$$\frac{\partial \ell}{\partial \theta} = \sum_{n=1}^N \left( - \mathbb{E}_{p(h \mid v^{(n)})} \left[ \frac{\partial E(v^{(n)}, h)}{\partial \theta} \right] + \mathbb{E}_{p(v, h)} \left[ \frac{\partial E(v, h)}{\partial \theta} \right] \right)$$

where the expectations are all over the probability distribution defined by the model.
Since maximizing $\ell(\theta)$ is equivalent to maximizing $\frac{1}{N} \ell(\theta)$, we instead consider

$$\frac{1}{N} \frac{\partial \ell}{\partial \theta} = \frac{1}{N} \sum_{n=1}^N \left( - \mathbb{E}_{p(h \mid v^{(n)})} \left[ \frac{\partial E(v^{(n)}, h)}{\partial \theta} \right] + \mathbb{E}_{p(v, h)} \left[ \frac{\partial E(v, h)}{\partial \theta} \right] \right)$$

Dividing by $N$, observe that the first term can be written

$$\frac{1}{N} \sum_{n=1}^N - \mathbb{E}_{p(h \mid v^{(n)})} \left[ \frac{\partial E(v^{(n)}, h)}{\partial \theta} \right] = - \hat{\mathbb{E}}_v \left[ \mathbb{E}_{p(h \mid v)} \left[ \frac{\partial E(v, h)}{\partial \theta} \right] \right]$$

Giving us

$$\frac{1}{N} \frac{\partial \ell}{\partial \theta} = - \hat{\mathbb{E}}_v \left[ \mathbb{E}_{p(h \mid v)} \left[ \frac{\partial E(v, h)}{\partial \theta} \right] \right] + \mathbb{E}_{p(v, h)} \left[ \frac{\partial E(v, h)}{\partial \theta} \right]$$

Clearly the second term is almost always intractable, since we would have to sum over all possible $v$ and $h$, but the first term might also be intractable, as we would have to marginalize over all the hidden states $h$ with $v$ fixed to $v^{(n)}$.
When working with binary hidden variables, i.e. $h_j$ are all assumed to be Bernoulli random variables, then we have

$$\mathbb{E}_{p(h \mid v)} \left[ h_j \right] = p(h_j = 1 \mid v)$$

and since $E(v, h)$ is linear in each $h_j$, the inner conditional expectation can be computed in closed form.

In general, when the hiddens are not necessarily Bernoulli, we would also like to sample the hidden states $h$ to approximate the conditional expectation. It would make sense to sample some number $M$ of hidden states $h^{(m)} \sim p(h \mid v^{(n)})$ for each observed $v^{(n)}$, i.e.

$$\mathbb{E}_{p(h \mid v^{(n)})} \left[ \frac{\partial E(v^{(n)}, h)}{\partial \theta} \right] \approx \frac{1}{M} \sum_{m=1}^M \frac{\partial E(v^{(n)}, h^{(m)})}{\partial \theta}$$

In fact, as mentioned on p. 45 in Fischer_2015, this method (but letting $M = 1$) is often used; in the Bernoulli case, however, taking the expectation reduces the variance of our estimate compared to also sampling $h$.

You will often see the notation

$$\mathbb{E}_{\text{data}} \left[ \, \cdot \, \right] := \hat{\mathbb{E}}_v \left[ \mathbb{E}_{p(h \mid v)} \left[ \, \cdot \, \right] \right]$$

which we will use from now on. The subscript $v$ on $\hat{\mathbb{E}}_v$ was used to make it explicit that we're computing the empirical expectation over the visible units only, NOT the joint expectation obtained by also sampling $h$ for each of the observed visible units.
For the second term we also use the empirical expectation as an approximation: given samples $(v^{(s)}, h^{(s)}) \sim p(v, h)$,

$$\mathbb{E}_{p(v, h)} \left[ \frac{\partial E(v, h)}{\partial \theta} \right] \approx \frac{1}{S} \sum_{s=1}^S \frac{\partial E(v^{(s)}, h^{(s)})}{\partial \theta} =: \mathbb{E}_{\text{model}} \left[ \frac{\partial E(v, h)}{\partial \theta} \right]$$

Finally giving us a tractable approximation to the gradient of the log-likelihood:

$$\frac{1}{N} \frac{\partial \ell}{\partial \theta} \approx - \mathbb{E}_{\text{data}} \left[ \frac{\partial E(v, h)}{\partial \theta} \right] + \mathbb{E}_{\text{model}} \left[ \frac{\partial E(v, h)}{\partial \theta} \right]$$

Making the values the variable $\theta$ takes on explicit, we have

$$\frac{1}{N} \frac{\partial \ell}{\partial W_{ij}} \approx \mathbb{E}_{\text{data}} \left[ v_i h_j \right] - \mathbb{E}_{\text{model}} \left[ v_i h_j \right]$$

$$\frac{1}{N} \frac{\partial \ell}{\partial b_i} \approx \mathbb{E}_{\text{data}} \left[ v_i \right] - \mathbb{E}_{\text{model}} \left[ v_i \right]$$

$$\frac{1}{N} \frac{\partial \ell}{\partial c_j} \approx \mathbb{E}_{\text{data}} \left[ h_j \right] - \mathbb{E}_{\text{model}} \left[ h_j \right]$$

Observe that the signs have switched, which is simply due to the fact that $E(v, h)$ depends negatively on all the parameters $W_{ij}$, $b_i$ and $c_j$.

Now we can make use of any nice stochastic gradient ascent method.
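As a sketch of what one such ascent step looks like in practice (NumPy; variable names are my own, and single samples stand in for the two expectations):

```python
import numpy as np

def ascent_step(W, b, c, v_data, h_data, v_model, h_model, lr=0.1):
    """One stochastic gradient-ascent step on the log-likelihood,
    using single samples in place of E_data[.] and E_model[.]."""
    W += lr * (np.outer(v_data, h_data) - np.outer(v_model, h_model))
    b += lr * (v_data - v_model)
    c += lr * (h_data - h_model)
    return W, b, c
```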
Bernoulli RBM
In the case of a Bernoulli RBM, both the visible and hidden variables only take on the values $\{0, 1\}$. Therefore,

$$\frac{\partial E(v, h)}{\partial W_{ij}} = - v_i h_j, \qquad \frac{\partial E(v, h)}{\partial b_i} = - v_i, \qquad \frac{\partial E(v, h)}{\partial c_j} = - h_j$$

Hence, for the hidden units, we have

$$p(h_j = 1 \mid v) = \text{sigm} \Big( c_j + \sum_i v_i W_{ij} \Big)$$

and

$$\mathbb{E}_{p(h \mid v)} \left[ h_j \right] = p(h_j = 1 \mid v)$$

For the visible units we have

$$p(v_i = 1 \mid h) = \text{sigm} \Big( b_i + \sum_j W_{ij} h_j \Big)$$

and

$$\mathbb{E}_{p(v \mid h)} \left[ v_i \right] = p(v_i = 1 \mid h)$$

Thus, we end up with the gradients of the negative log-likelihood

$$- \frac{1}{N} \frac{\partial \ell}{\partial W_{ij}} \approx - \hat{\mathbb{E}}_v \left[ v_i \, p(h_j = 1 \mid v) \right] + \mathbb{E}_{\text{model}} \left[ v_i h_j \right]$$

$$- \frac{1}{N} \frac{\partial \ell}{\partial b_i} \approx - \hat{\mathbb{E}}_v \left[ v_i \right] + \mathbb{E}_{\text{model}} \left[ v_i \right]$$

$$- \frac{1}{N} \frac{\partial \ell}{\partial c_j} \approx - \hat{\mathbb{E}}_v \left[ p(h_j = 1 \mid v) \right] + \mathbb{E}_{\text{model}} \left[ h_j \right]$$
In these expressions, the first terms are indeed tractable but the second terms still pose a problem. Therefore we turn to MCMC methods for estimating the expectations in the second terms.
A good method would be (block) Gibbs sampling: since the units within one layer are conditionally independent given the other layer, we can sample all the visible or all the hidden units simultaneously. To obtain unbiased estimates we would then have to run the sampler for a large number of steps. Unfortunately this is expensive, which is where Contrastive Divergence comes in.
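For the Bernoulli conditionals just derived, one block Gibbs sweep can be sketched as follows (NumPy; all names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One block Gibbs sweep: sample all hiddens given v, then all visibles given h."""
    p_h = sigmoid(c + v @ W)                 # p(h_j = 1 | v), all j at once
    h = (rng.random(p_h.shape) < p_h) * 1.0  # sample the whole hidden layer
    p_v = sigmoid(b + W @ h)                 # p(v_i = 1 | h), all i at once
    v_new = (rng.random(p_v.shape) < p_v) * 1.0
    return v_new, h
```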
Expression for $p(v_i \mid v_{-i})$
Let $v_{-i}$ denote the vector $v$ but excluding the entry at index $i$, then

$$p(v_i \mid v_{-i}) = \frac{p(v)}{p(v_{-i})} = \frac{\sum_h e^{-E(v, h)}}{\sum_{v_i' \in \{0, 1\}} \sum_h e^{-E((v_i', v_{-i}), h)}}$$

where in the last step we simply marginalize over $h$.

For a Bernoulli RBM the sum over $h$ can be performed analytically, $\sum_h e^{-E(v, h)} = e^{b^T v} \prod_j \big( 1 + e^{c_j + \sum_k v_k W_{kj}} \big)$, so the numerator will be the exponential of the following three terms added together

$$b_i v_i, \qquad \sum_{k \neq i} b_k v_k, \qquad \sum_j \log \Big( 1 + e^{c_j + v_i W_{ij} + \sum_{k \neq i} v_k W_{kj}} \Big)$$

and the denominator will be

$$\sum_{v_i' \in \{0, 1\}} \sum_h e^{-E((v_i', v_{-i}), h)}$$

To be explicit, the denominator is

$$\sum_{v_i' \in \{0, 1\}} \exp \Big( b_i v_i' + \sum_{k \neq i} b_k v_k + \sum_j \log \big( 1 + e^{c_j + v_i' W_{ij} + \sum_{k \neq i} v_k W_{kj}} \big) \Big)$$

which we can write

$$\exp \Big( \sum_{k \neq i} b_k v_k \Big) \sum_{v_i' \in \{0, 1\}} \exp \Big( b_i v_i' + \sum_j \log \big( 1 + e^{c_j + v_i' W_{ij} + \sum_{k \neq i} v_k W_{kj}} \big) \Big)$$

Taking the logarithm (for readability's sake), we end up with

$$\log p(v_i \mid v_{-i}) = b_i v_i + \sum_{k \neq i} b_k v_k + \sum_j \log \big( 1 + e^{c_j + v_i W_{ij} + \sum_{k \neq i} v_k W_{kj}} \big) - \sum_{k \neq i} b_k v_k - \log \sum_{v_i' \in \{0, 1\}} \exp \Big( b_i v_i' + \sum_j \log \big( 1 + e^{c_j + v_i' W_{ij} + \sum_{k \neq i} v_k W_{kj}} \big) \Big)$$

Observe that the two $\sum_{k \neq i} b_k v_k$ terms in this expression cancel each other, thus

$$\log p(v_i \mid v_{-i}) = b_i v_i + \sum_j \log \big( 1 + e^{c_j + v_i W_{ij} + \sum_{k \neq i} v_k W_{kj}} \big) - \log \sum_{v_i' \in \{0, 1\}} \exp \Big( b_i v_i' + \sum_j \log \big( 1 + e^{c_j + v_i' W_{ij} + \sum_{k \neq i} v_k W_{kj}} \big) \Big)$$

Finally, taking the exponential again,

$$p(v_i \mid v_{-i}) = \frac{\exp \Big( b_i v_i + \sum_j \log \big( 1 + e^{c_j + v_i W_{ij} + \sum_{k \neq i} v_k W_{kj}} \big) \Big)}{\sum_{v_i' \in \{0, 1\}} \exp \Big( b_i v_i' + \sum_j \log \big( 1 + e^{c_j + v_i' W_{ij} + \sum_{k \neq i} v_k W_{kj}} \big) \Big)}$$
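In terms of the free energy $F(v)$ from the notation section, this conditional is cheap to evaluate, since only the entry $v_i$ changes. A minimal NumPy sketch (names are my own; for the Bernoulli RBM, $F(v) = -b^T v - \sum_j \log(1 + e^{c_j + (W^T v)_j})$):

```python
import numpy as np

def free_energy(v, W, b, c):
    """F(v) = -b^T v - sum_j log(1 + exp(c_j + (W^T v)_j)) for a Bernoulli RBM."""
    return -(b @ v) - np.sum(np.logaddexp(0.0, c + v @ W))

def p_vi_given_rest(v, i, W, b, c):
    """p(v_i = 1 | v_{-i}): compare the two completions of v at index i."""
    v1, v0 = v.copy(), v.copy()
    v1[i], v0[i] = 1.0, 0.0
    f1, f0 = free_energy(v1, W, b, c), free_energy(v0, W, b, c)
    return 1.0 / (1.0 + np.exp(f1 - f0))   # e^{-f1} / (e^{-f0} + e^{-f1})
```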
TODO Computing 
Gaussian RBM
Suppose now that the $v_i$ are continuous variables but the $h_j$ are still Bernoulli. We assume the following Hamiltonian / energy function

$$E(v, h) = \frac{(v_i - b_i)^2}{2 \sigma_i^2} - c_j h_j - \frac{v_i}{\sigma_i} W_{ij} h_j$$

where, as usual, we assume summation over the repeated indices, and $\sigma_i^2$ denotes the variance of $v_i$.

- We use the term $\frac{v_i}{\sigma_i}$ in the "covariance" term, as it makes sense to remove the variance in $v_i$ which does not arise due to its relationship with the hiddens.
Computing $p(v \mid h)$ and $p(v_i \mid h)$
First observe that

$$p(v_i \mid h) = \frac{p(v_i, h)}{p(h)}$$

And,

$$p(h) = \int p(v, h) \, dv$$

We know that

$$e^{-E(v, h)} = \prod_i \exp \Big( - \frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_j W_{ij} h_j \Big) \cdot \exp \Big( \sum_j c_j h_j \Big)$$

where we've separated the terms depending on $v$ and those depending only on $h$. Why? In the denominator, we're integrating over all possible values of $v$, hence any factor not dependent on $v$ we can move out of the integral. And since we find the same factor in the numerator, we can simply cancel the second term in the above expression. This leaves us with

$$p(v_i \mid h) = \frac{\exp \Big( - \frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_j W_{ij} h_j \Big)}{\int \exp \Big( - \frac{(v_i' - b_i)^2}{2 \sigma_i^2} + \frac{v_i'}{\sigma_i} \sum_j W_{ij} h_j \Big) \, dv_i'}$$

Now, writing out the term in the exponent,

$$- \frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_j W_{ij} h_j = - \frac{1}{2 \sigma_i^2} \Big( v_i^2 - 2 b_i v_i + b_i^2 - 2 \sigma_i v_i \sum_j W_{ij} h_j \Big)$$

For readability's sake, let

$$\mu_i = b_i + \sigma_i \sum_j W_{ij} h_j$$

Then we can "complete the square" in the first term, which gives us

$$- \frac{1}{2 \sigma_i^2} \Big( (v_i - \mu_i)^2 - \mu_i^2 + b_i^2 \Big)$$

Finally, substituting this back into our expression for $p(v_i \mid h)$, keeping in mind that $\exp \big( \frac{\mu_i^2 - b_i^2}{2 \sigma_i^2} \big)$ does not depend on $v_i$ and can therefore be pulled out of the integral, we get

$$p(v_i \mid h) = \frac{\exp \Big( - \frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big)}{\int \exp \Big( - \frac{(v_i' - \mu_i)^2}{2 \sigma_i^2} \Big) \, dv_i'}$$

where we've cancelled the exponentials of $\frac{\mu_i^2 - b_i^2}{2 \sigma_i^2}$ in the numerator and denominator, as mentioned above. But HOLD ON; these exponentials are simply unnormalized Gaussians with mean $\mu_i$!, hence

$$\int \exp \Big( - \frac{(v_i' - \mu_i)^2}{2 \sigma_i^2} \Big) \, dv_i' = \sqrt{2 \pi} \sigma_i$$

Thus,

$$p(v_i \mid h) = \frac{1}{\sqrt{2 \pi} \sigma_i} \exp \Big( - \frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big)$$

Or equivalently,

$$v_i \mid h \sim \mathcal{N} \Big( b_i + \sigma_i \sum_j W_{ij} h_j, \; \sigma_i^2 \Big)$$

And we now understand why we call the RBM resulting from this specific energy function a Gaussian RBM!

A final note is that

$$p(v \mid h) = \prod_i p(v_i \mid h)$$

due to the independence of the visible units conditioned on the hidden units, as usual in an RBM. Hence

$$p(v \mid h) = \prod_i \frac{1}{\sqrt{2 \pi} \sigma_i} \exp \Big( - \frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big)$$

Or, more numerically conveniently,

$$\log p(v \mid h) = - \sum_i \left( \frac{(v_i - \mu_i)^2}{2 \sigma_i^2} + \log \big( \sqrt{2 \pi} \sigma_i \big) \right)$$
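A minimal sketch of sampling $v \mid h$ in this Gaussian-visible case (NumPy; `sigma` holds the per-unit standard deviations $\sigma_i$, and all names are my own):

```python
import numpy as np

def sample_visible_gaussian(h, W, b, sigma, rng):
    """Sample v_i | h ~ N(b_i + sigma_i * sum_j W_ij h_j, sigma_i^2) for all i."""
    mu = b + sigma * (W @ h)  # per-unit conditional means
    return mu + sigma * rng.standard_normal(mu.shape)
```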
- What if the $h_j$ are also Gaussian?

What happens if we instead have the following energy function?

$$E(v, h) = \frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{(h_j - c_j)^2}{2 \tilde{\sigma}_j^2} - \frac{v_i}{\sigma_i} W_{ij} \frac{h_j}{\tilde{\sigma}_j}$$

Going back to our expression for the distribution of $v_i$ given $h$: the only dependence on $h_j$ arises from the multiplication with $W_{ij}$ in the energy function, hence the distribution of the visible units when the hiddens can also take on continuous values is simply

$$v_i \mid h \sim \mathcal{N} \Big( b_i + \sigma_i \sum_j W_{ij} \frac{h_j}{\tilde{\sigma}_j}, \; \sigma_i^2 \Big)$$

The only addition being the factor of $\frac{1}{\tilde{\sigma}_j}$ in the sum.
What about $p(h_j \mid v)$?

Looking at our derivation for $p(v_i \mid h)$, the derivation for the hiddens given the visibles is exactly the same, substituting $v_i$ with $h_j$ and $b_i$ with $c_j$, giving us

$$h_j \mid v \sim \mathcal{N} \Big( c_j + \tilde{\sigma}_j \sum_i \frac{v_i}{\sigma_i} W_{ij}, \; \tilde{\sigma}_j^2 \Big)$$
General expressions for both Bernoulli and Gaussians
If you've had a look at the derivations for the conditional probabilities $p(v \mid h)$ and $p(h \mid v)$ when $v$ and $h$ are Bernoulli or Gaussian, you might have observed that the expressions in the exponentials are very similar. In fact, we have the following: for the visibles,

$$p(v_i = 1 \mid h) = \text{sigm}(a_i) \quad \text{(Bernoulli)}, \qquad v_i \mid h \sim \mathcal{N}(a_i, \sigma_i^2) \quad \text{(Gaussian)}$$

with

$$a_i = b_i + \sigma_i \sum_j W_{ij} \frac{h_j}{\tilde{\sigma}_j}$$

and for the hiddens

$$p(h_j = 1 \mid v) = \text{sigm}(\hat{a}_j) \quad \text{(Bernoulli)}, \qquad h_j \mid v \sim \mathcal{N}(\hat{a}_j, \tilde{\sigma}_j^2) \quad \text{(Gaussian)}$$

with

$$\hat{a}_j = c_j + \tilde{\sigma}_j \sum_i \frac{v_i}{\sigma_i} W_{ij}$$

using the convention $\sigma_i = 1$ for Bernoulli visibles and $\tilde{\sigma}_j = 1$ for Bernoulli hiddens. If we want to implement an RBM, allowing the possibility of using both Bernoulli and Gaussian hidden and / or visible units, we can! We still need to make the appropriate corrections in the free-energy computation, i.e. when computing $F(v)$.
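As a sketch of what such a unit-type-agnostic implementation might look like (my own structure; `activation` stands for $a_i$ or $\hat{a}_j$ above, and `scale` for the corresponding standard deviation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_layer(activation, kind, scale, rng):
    """Sample a layer given its activation a (= bias + scaled weighted input).

    kind='bernoulli': unit = 1 with probability sigmoid(a)
    kind='gaussian' : unit ~ N(a, scale^2)
    """
    if kind == "bernoulli":
        return (rng.random(activation.shape) < sigmoid(activation)) * 1.0
    elif kind == "gaussian":
        return activation + scale * rng.standard_normal(activation.shape)
    raise ValueError(f"unknown unit type: {kind}")
```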
Contrastive Divergence (CD)
The idea of Contrastive Divergence is to approximate the gradient of the log-likelihood we saw earlier,

$$\frac{1}{N} \frac{\partial \ell}{\partial \theta} \approx - \mathbb{E}_{\text{data}} \left[ \frac{\partial E(v, h)}{\partial \theta} \right] + \mathbb{E}_{\text{model}} \left[ \frac{\partial E(v, h)}{\partial \theta} \right]$$

by the following procedure (CD-k):

- Initialize "gradients" $\Delta W_{ij} = \Delta b_i = \Delta c_j = 0$
- For $n = 1, \dots, N$:
  - Initialize the sampler with the training instance $v^{(0)} = v^{(n)}$ (instead of using a random initialization)
  - For $t = 0, \dots, k - 1$ do
    - Sample $h^{(t)} \sim p(h \mid v^{(t)})$, where entries are of course sampled independently of each other (due to the bipartite structure of an RBM)
    - Sample $v^{(t + 1)} \sim p(v \mid h^{(t)})$, where entries are of course sampled independently of each other (due to the bipartite structure of an RBM)
  - Accumulate the gradients, replacing $\mathbb{E}_{\text{model}} [\, \cdot \,]$ with the single sample $(v^{(k)}, h^{(k)})$.
- Update parameters:

$$W_{ij} \leftarrow W_{ij} + \eta \Delta W_{ij}, \qquad b_i \leftarrow b_i + \eta \Delta b_i, \qquad c_j \leftarrow c_j + \eta \Delta c_j$$

where $\eta$ is the learning rate.[1]

[1] This is basically the algorithm presented in Fischer_2015, but in my opinion this ought to be the average gradient, not the change summed up; it feels like there is a factor of $\frac{1}{N}$ missing here. Of course, one could just absorb this into the learning rate, as long as the number of samples in each batch is the same.
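Putting the pieces together, here is a sketch of one CD-k update for a Bernoulli RBM on a batch of training vectors (NumPy; names are my own, and, following the footnote above, I average over the batch rather than summing):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(V, W, b, c, k=1, lr=0.1, rng=None):
    """One CD-k update for a Bernoulli RBM. V has shape (batch, n_visible)."""
    if rng is None:
        rng = np.random.default_rng()
    n = V.shape[0]
    ph_data = sigmoid(c + V @ W)    # positive phase: p(h = 1 | v) on the data
    Vk = V.copy()
    for _ in range(k):              # negative phase: k block Gibbs steps
        ph = sigmoid(c + Vk @ W)
        Hk = (rng.random(ph.shape) < ph) * 1.0
        pv = sigmoid(b + Hk @ W.T)
        Vk = (rng.random(pv.shape) < pv) * 1.0
    ph_model = sigmoid(c + Vk @ W)  # use the expectation, not a sample
    W += lr * (V.T @ ph_data - Vk.T @ ph_model) / n
    b += lr * (V - Vk).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c
```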
Why not compute the expectation as one samples these $k$ steps; why only do it at the end of the chain? That is, when performing the approximation for the log-likelihood for a single training example $v^{(n)}$, we replace the model expectation by the single sample $(v^{(k)}, h^{(k)})$, i.e.

$$\mathbb{E}_{\text{model}} \left[ \frac{\partial E(v, h)}{\partial \theta} \right] \approx \frac{\partial E(v^{(k)}, h^{(k)})}{\partial \theta}$$

where $(v^{(t)}, h^{(t)})$ denotes the $t$-th sample from a sampler chain initialized by the training example $v^{(n)}$.

This leaves us with the update rule (for a single training example):

$$\theta \leftarrow \theta + \eta \left( - \mathbb{E}_{p(h \mid v^{(0)})} \left[ \frac{\partial E(v^{(0)}, h)}{\partial \theta} \right] + \frac{\partial E(v^{(k)}, h^{(k)})}{\partial \theta} \right)$$

Why not compute the expectation over all these steps? Instead of stochastically approximating this expectation by a single example (the final sample in the chain), we could use the following approximation to the expectation

$$\mathbb{E}_{\text{model}} \left[ \frac{\partial E(v, h)}{\partial \theta} \right] \approx \frac{1}{k} \sum_{t=1}^{k} \frac{\partial E(v^{(t)}, h^{(t)})}{\partial \theta}$$

I guess some reasons could be:

- a more biased estimate (the more steps we let the sampler take before recording a sample, the closer that sample is to the model distribution, so the early samples in the chain contribute more bias)
- it is more expensive to compute the gradient at every step than merely to sample
Theoretical justification
Most, if not all, of this section is straight-up copied from bengio_2009.
Lemma 1. Consider a Gibbs chain $v^{(0)} \Rightarrow h^{(0)} \Rightarrow v^{(1)} \Rightarrow h^{(1)} \Rightarrow \cdots$ starting at a data point, i.e. $v^{(0)} = v^{(n)}$ for some $n$. Then the log-likelihood at any step $t$ of the chain can be written

$$\log p(v^{(0)}) = \log \frac{p(v^{(0)})}{p(v^{(t)})} + \log p(v^{(t)})$$

and since this is true for any path,

$$\log p(v^{(0)}) = \mathbb{E} \left[ \log \frac{p(v^{(0)})}{p(v^{(t)})} \right] + \mathbb{E} \left[ \log p(v^{(t)}) \right]$$

where the expectations are over Markov chain sample paths, conditioned on the starting sample $v^{(0)}$.

The first equation is obvious, while the second follows by taking the expectation of the first over sample paths, the LHS being constant given $v^{(0)}$.
Lemma 2. For any model $p(x)$ with parameters $\theta$,

$$\mathbb{E} \left[ \frac{\partial \log p(x)}{\partial \theta} \right] = \int p(x) \frac{1}{p(x)} \frac{\partial p(x)}{\partial \theta} \, dx = \frac{\partial}{\partial \theta} \int p(x) \, dx = 0$$

when the expected value is taken according to $p(x)$.

(Observe that for this to work with continuous variables, we need to ensure that we're allowed to interchange the partial derivative and the integration symbol.)
Theorem. Consider a Gibbs chain $v^{(0)} \Rightarrow h^{(0)} \Rightarrow v^{(1)} \Rightarrow h^{(1)} \Rightarrow \cdots$ starting at a data point, i.e. $v^{(0)} = v^{(n)}$ for some $n$. The log-likelihood gradient can be written

$$\frac{\partial \log p(v^{(0)})}{\partial \theta} = - \frac{\partial F(v^{(0)})}{\partial \theta} + \mathbb{E} \left[ \frac{\partial F(v^{(t)})}{\partial \theta} \right] + \mathbb{E} \left[ \frac{\partial \log p(v^{(t)})}{\partial \theta} \right]$$

and

$$\mathbb{E} \left[ \frac{\partial \log p(v^{(t)})}{\partial \theta} \right] \to 0 \quad \text{as } t \to \infty$$

Proof. Since $\log p(v) = - F(v) - \log Z$, the $\log Z$ terms cancel in the ratio, so $\log \frac{p(v^{(0)})}{p(v^{(t)})} = F(v^{(t)}) - F(v^{(0)})$. Taking the derivative of the expression in Lemma 1, we get

$$\frac{\partial \log p(v^{(0)})}{\partial \theta} = - \frac{\partial F(v^{(0)})}{\partial \theta} + \frac{\partial F(v^{(t)})}{\partial \theta} + \frac{\partial \log p(v^{(t)})}{\partial \theta}$$

Taking the expectations wrt. the Markov chain conditional on $v^{(0)}$, we get

$$\frac{\partial \log p(v^{(0)})}{\partial \theta} = - \frac{\partial F(v^{(0)})}{\partial \theta} + \mathbb{E} \left[ \frac{\partial F(v^{(t)})}{\partial \theta} \right] + \mathbb{E} \left[ \frac{\partial \log p(v^{(t)})}{\partial \theta} \right]$$
To show that the last term vanishes for $t \to \infty$, we use the assumed convergence of the chain, which can be written

$$p(v^{(t)} = v \mid v^{(0)}) = p(v) + \varepsilon_t(v)$$

with

$$\varepsilon_t(v) \to 0 \quad \text{as } t \to \infty$$

We can then rewrite the last expectation as

$$\mathbb{E} \left[ \frac{\partial \log p(v^{(t)})}{\partial \theta} \right] = \sum_v p(v) \frac{\partial \log p(v)}{\partial \theta} + \sum_v \varepsilon_t(v) \frac{\partial \log p(v)}{\partial \theta}$$

Using Lemma 2, the first sum vanishes, and we're left with

$$\left| \sum_v \varepsilon_t(v) \frac{\partial \log p(v)}{\partial \theta} \right| \leq n_v \, \varepsilon_t \max_v \left| \frac{\partial \log p(v)}{\partial \theta} \right|$$

where $n_v$ denotes the number of possible discrete configurations $v$ can take on, and we've let $\varepsilon_t = \max_v | \varepsilon_t(v) |$.

Hence this expectation goes to zero as $t \to \infty$, completing our proof.
The proof above also tells us something important: the error due to the second expectation in CD-k has an upper bound which depends linearly on the sampler error $\varepsilon_t$. Therefore the mixing rate of the chain is directly related to the error of CD-k. (Important to remember that there are of course also errors introduced by the stochastic approximation to the first expectation.)
Persistent Contrastive Divergence (PCD)
Persistent Contrastive Divergence (PCD) is obtained from the CD approximation by replacing the sample $v^{(k)}$ by a sample from a Gibbs chain that is independent of the sample $v^{(n)}$ from the training distribution. This corresponds to standard CD without reinitializing the visible units of the Markov chain with a training sample each time we want to draw a sample $v^{(k)}$.
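A sketch of the difference in code terms (NumPy, toy data; all naming is my own): compared to the CD-k sketch above, the negative-phase chain state `V_persist` survives across updates instead of being reset to the current batch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_visible, n_hidden, batch, k, lr = 6, 4, 8, 1, 0.1
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
batches = [(rng.random((batch, n_visible)) < 0.5) * 1.0 for _ in range(10)]  # toy data

# The persistent chain: initialized once, never reset to the data.
V_persist = (rng.random((batch, n_visible)) < 0.5) * 1.0

for V_batch in batches:
    ph_data = sigmoid(c + V_batch @ W)  # positive phase uses the data
    for _ in range(k):                  # negative phase continues the old chain
        ph = sigmoid(c + V_persist @ W)
        H = (rng.random(ph.shape) < ph) * 1.0
        pv = sigmoid(b + H @ W.T)
        V_persist = (rng.random(pv.shape) < pv) * 1.0
    ph_model = sigmoid(c + V_persist @ W)
    W += lr * (V_batch.T @ ph_data - V_persist.T @ ph_model) / batch
    b += lr * (V_batch - V_persist).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
```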
Parallel Tempering (PT)
Parallel Tempering (PT) introduces supplementary Gibbs chains that sample from more and more smoothed replicas of the original distribution. Formally, given an ordered set of $M$ temperatures $1 = T_1 < T_2 < \dots < T_M$, we define a set of $M$ Markov chains with stationary distributions

$$p_r(v, h) = \frac{1}{Z_r} e^{-E(v, h) / T_r}, \qquad r = 1, \dots, M$$

where $Z_r = \sum_{v, h} e^{-E(v, h) / T_r}$ is the corresponding partition function, and $p_1$ is exactly the model distribution.

- With increasing temperature, $p_r$ becomes more and more uniform
- As $T_r \to \infty$, $p_r$ converges to the uniform distribution
- In that limit, subsequent samples of the Gibbs chain are independent of each other, thus the stationary distribution is reached after just one sampling step

In each step of the algorithm, we run $k$ Gibbs sampling steps in each tempered Markov chain, yielding samples $(v_1, h_1), \dots, (v_M, h_M)$. Then, two neighbouring Gibbs chains with temperatures $T_r$ and $T_{r - 1}$ may exchange "particles" $(v_r, h_r)$ and $(v_{r - 1}, h_{r - 1})$ with probability (based on the Metropolis ratio)

$$A = \min \left\{ 1, \frac{p_r(v_{r - 1}, h_{r - 1}) \, p_{r - 1}(v_r, h_r)}{p_r(v_r, h_r) \, p_{r - 1}(v_{r - 1}, h_{r - 1})} \right\}$$

which gives, for RBMs,

$$A = \min \left\{ 1, \exp \Big( \Big( \frac{1}{T_r} - \frac{1}{T_{r - 1}} \Big) \big( E(v_r, h_r) - E(v_{r - 1}, h_{r - 1}) \big) \Big) \right\}$$

After performing these "swaps" between chains (which increase the mixing rate), we take the sample $(v_1, h_1)$ from the original chain (the one with $T_1 = 1$) as the sample from the RBM distribution. This procedure is repeated $N$ times, yielding the samples used for the approximation of the expectation under the RBM distribution in the log-likelihood gradient. Usually $N$ is the number of samples in the batch of training data.

Compared to CD, PT produces a faster-mixing Markov chain, and thus a less biased gradient approximation, at the cost of extra computation.
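A sketch of the swap step between two neighbouring tempered chains (NumPy; `energy` is the RBM energy from earlier, the remaining names are my own):

```python
import numpy as np

def energy(v, h, W, b, c):
    """RBM energy E(v, h) = -b^T v - c^T h - v^T W h."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

def maybe_swap(states, r, temps, W, b, c, rng):
    """Metropolis swap of particles between chains r-1 and r (temps[0] == 1)."""
    (v_lo, h_lo), (v_hi, h_hi) = states[r - 1], states[r]
    dE = energy(v_hi, h_hi, W, b, c) - energy(v_lo, h_lo, W, b, c)
    d_beta = 1.0 / temps[r] - 1.0 / temps[r - 1]
    if rng.random() < min(1.0, np.exp(d_beta * dE)):
        states[r - 1], states[r] = states[r], states[r - 1]
```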
Estimating partition function
Bibliography
- [ackley_1985] Ackley, Hinton & Sejnowski, "A Learning Algorithm for Boltzmann Machines", Cognitive Science, 9(1), 147–169 (1985).
- [Fischer_2015] Fischer, "Training Restricted Boltzmann Machines", KI - Künstliche Intelligenz, 29(4), 441–444 (2015).
- [bengio_2009] Bengio & Delalleau, "Justifying and Generalizing Contrastive Divergence", Neural Computation, 21(6), 1601–1621 (2009).