RBM: Restricted Boltzmann Machines

Overview

• One visible layer of observed binary units
• One hidden layer of unobserved binary units
• Connections between every node in the different layers, i.e. restricted means no connections from hidden to hidden, nor from visible to visible, but connections between all hidden to visible pairs.

• Models Maximum Entropy distributions, and because of this, RBMs were originally motivated by Hinton as a "soft-constraint learner", where it was the intention that it would learn a distribution which satisfied the underlying constraints (see [BROKEN LINK: No match for fuzzy expression: *Principle%20of%20Maximum%20Entropy]).ackley_1985

Notation

• and denotes visible units
• and denotes hidden units
• and will used to denote bias of the visible units
• and will be used to denote the bias of the hidden units
• denotes the empirical expectation
• The so-called free-energy:

• and denotes the t-th sample from a sampling chain for the visible and hidden variables, respectively.
• denote the vector but excluding the entry at index :

Factor graph view

A Boltzmann Machine is an energy-based model consisting of a set of hidden units and a set of visible units , where by "units" we mean random variables, taking on the values and , respectively.

An Boltzmann Machine assumes the following joint probability distribution of the visible and hidden units:

with being the partition function (normalization factor) and the energy function

where and denote the convariance matrices for the visible and hidden units, respectively.

A Restricted Boltzmann Machine (RBM) is an energy-based model consisting of a set of hidden units and a set of visible units , whereby "units" we mean random variables, taking on the values and , respectively.

The restricted part of the name comes from the fact that we assume independence between the hidden units and the visible units, i.e.

An RBM assumes the following joint probability distribution of the visible and hidden units:

with being the partition function (normalization factor) and the energy function

implicitly summing over repeating indices.

Training

Notation

• Used in the derivation of

Gradient of log-likelihood of energy-based models

RBMs assumes the following model

with

The minus-minus signs might seem redundant (and they are), but is convention due to the Boltzmann distribution found in physics.

where we're implicitly summing over repeating indices. First we observe that

Taking the log of this, we have

Now suppose we're given a set of samples , then the likelihood (assuming i.i.d.) is given by

Taking the log of this expression, and using the above expression for , we have

Let , taking the partial derivative wrt. for the n-th term we have

The first term can be written as an expectation

since on the LHS we're marginalizing over all and then normalizing wrt. the same distribution we just summed over. For the second term recall that , therefore

Substituting it all back into the partial derivative of the log-likelihood, we end get

Where the expectations are all over the probability distribution defined by the model.

Since maximizing is equivalent to maximizing we instead consider

Dividing by , observe that the first term can be written

Giving us

Clearly the second term is almost always intractable since we would have to sum over all possible and , but the first term might also be intractable as we would have to marginalize over all the hidden states with fixed to .

When working with binary hidden variables, i.e. are all assumed to be Bernoulli random variables, then we have

In general, when the hiddens are not necessarily Bernoulli, we would also like to sample the hidden states to approximate the conditional expectation. It would make sense to sample some number of hidden states for each observed , i.e.

In fact, as mentioned on p.45 in Fischer_2015, this method (but letting ) is often used, but, in the Bernoulli case, taking the expectation reduces the variance of our estimate compared to also sampling .

You will often see the notation

which we will use from now on. The was used to make it explicit that we're computing the expectation over the visible units, NOT the joint expectation by sampling for each of the observed visible units.

For the second term we also use the empirical expectation as an approximation:

Finally giving us a tractable approximation to the gradient of the log-likelihood:

Making the variable takes on explicit, we have

Observe that the signs have switched, which is simply due to the fact that depends negatively on all the variables .

Now we can make use of any nice stochastic gradient ascent method.

Bernoulli RBM

In the case of a Bernoulli RBM, both the visible and hidden variables only take on the values . Therefore,

Hence, for the hidden units, we have

and

For the visible units we have

and

Thus, we end up with the gradients of the negative log-likelihood

In these expressions, the first terms are indeed tractable but the second terms still pose a problem. Therefore we turn to MCMC methods for estimating the expectations in the second terms.

A good method would (block) Gibbs sampling, since the visible and hidden units are independent, and we could thus sample all the visible or hidden units simultaneously. To obtain unbiased estimates we would then run the sampler for a large number of steps. Unfortunately this is expensive, which is where Contrastive Divergence comes in.

Expression for

Let denote the vector but excluding the entry at index , then

where in the last step we simply marginalize of .

The numerator will be the exponential of the following three terms added together

and the denominator will be

To be explicit, the denominator is

which we can write

Taking the logarithm (for readability's sake), we end up with

Hence,

Observe that the two last terms in these expression cancels each other, thus

Finally, taking the exponential again,

Gaussian RBM

Suppose now that are now continuous variables but are still bernoulli. We assume the following Hamiltonian / energy function

where, as usual, we assume summation over the repeated indices.

• denotes the variance of
• We use the term in the "covariance" term, as it makes sense to remove the variance in which does not arise due to its relationship with the hiddens.
Computing and

First observe that

And,

We know that

where we've separated the terms depending on and . Why? In the denominator, we're integrating over all possible values of , hence any factor not dependent on we can move out of the integral. And since we find the same factor in the numerator, we can simply cancel the second term in the above expression. This leaves us with

Now, writing out the term in the exponent,

Then we can "complete the square" in the first term, which gives us

Finally, subsituting this back into our expression for , keeping in mind that does not depend on and can therefore pulled out of the integral, we get

where we've cancelled the exponentials of due to the independence mentioned above. But HOLD ON; these exponentials are simply Gaussians with mean !, hence

Thus,

Or equivalently,

And we now understand why we call the resulting RBM of using this specific energy function a Gaussian RBM!

A final note is that

due to the independence the visible units conditioned on the hidden units, as usual in an RBM. Hence

Or more numerically more conveniently,

• What if also are Gaussian?

What happens if we instead have the following energy function

Going back to our expression for the distribution of

The only dependence on arises from the multiplication with in the energy function, hence the distribution of the visible units when the hidden can also take on continuous values, is simply

The only addition being the factor of in the sum.

Looking at our derivation for , the derivation for the hiddens given the visibles is exactly the same, substituting with and with , giving us

General expressions for both Bernoulli and Gaussians

If you've had a look at the derivations for the conditional probabilties and when and are Bernoulli or Gaussian, you might have observed that the expressions in the exponentials are very similar. In fact, we have the following

with

and for the hiddens

with

If we want to implement an RBM, allowing the possibility of using both Bernoulli and Gaussian hidden and / or visible units, we can!

Still need to make appropriate corrections in the free energy computation, i.e. when computing .

Contrastive Divergence (CD)

The idea of Contrastive Divergence is to approximate the gradient of the log-likelihood we saw earlier

by

2. For :
1. Initialize sampler with training instance (instead of using a random initialization)
2. For do
1. , where entries are of course independent of each other (due to biparite structure of an RBM)
2. , where entries are of course independent of each other (due to biparite structure of an RBM)
3. Compute gradients replacing with .
3. Update parameters:

where is the learning rate

1

This is basically the algorithm presented in Fischer_2015, but in my opinion this ought to be the average gradient, not the change summed up. Feel like there is a factor of missing here.

Of course, one could just absorb this into the learning rate, as long as the number of samples in each batch is the same.

Why not compute the expectation as one samples these steps, why only do it at the end of the chain?

That is, when performing the approximation for the log-likelihood for a single training example

by replacing the expectation by sampling , i.e.

where denotes the t-th sample from a sampler chain, initialized by the training example .

This leaves us with the update rule (for a single training example):

Why not compute the expectation over all these steps? Instead of stochastically approximating this expectation by a single example (the final sample in the chain), we could use the following approximation to the expectation

I guess some reasons could be:

• more biased estimate (the more steps we let the sampler take, the more likely our sample is to be unbiased)
• more expensive to compute gradient than sampling

Theoretical justification

Most, if not all, is straight up copied from bengio_2009.

Consider a Gibbs chain starting at a data point, i.e. for some . Then the log-likelihood at any step of the chain can be written

and since this is true for any path,

where the expectations are over Markov chain sample paths, conditioned on the starting sample .

The first equation is obvious, while the second equation requires rewriting

and substituting in the first equation.

For any model with parameters

when the expected value is taken accordin to .

(Observe that for this to work with continuous variables we need to ensure that we're allowed to interchange the partial derivative and the integration symbol)

Consider a Gibbs chain starting at a data point, i.e. for some .

The log-likelihood gradient can be written

and

Taking the derivative of the expression in Lemma 1, we get

Taking the expectations wrt. the Markov chain conditional on , we get

To show that the last term vanishes for , we use the assumed convergence of the chain, which can be written

with

We can then rewrite the last expectation as

Using Lemma 2, the first sum vanishes, and we're left with

where denotes the number of possible discrete configurations can take on, and we've let

Hence this expectation goes to zero, completing our proof.

The proof above also tells us something important: the error due to the second expectation in CD-k has an upper bound which depends linearly on the sampler-error .

Therefore the mixing-rate of the chain is directly related to the error of CD-k.

(Important to remember that there is of course also errors introduced in the stochastic approximation to the first part of the expectation too.)

Persistent Contrastive Divergence (PCD)

Persistent Contrastive Divergence (PCD) is obtained from CD approximation by replacing the sample by a sample from a Gibbs chain that is independent of the sample of the training distribution.

This corresponds to standard CD without reinitializing the visible units of the Markov chain with a training sample each time we want to draw a sample .

1

Parallel Tempering (PT)

Parallel Tempering (PT) introduces supplementary Gibbs chains that sample from more and more smoothed replicas of the original distribution.

Formally, given an ordered set of temperatures , and we define a set of Markov chains with stationary distributions

for where is the corresponding partition function, and has exactly the model distribution.

• With increasing temperature, becomes more and more equally distribution
• As we have , i.e. uniform distribution
• Subsequent samples of the Gibbs chain are independent of each other, thus stationary distribution is reached after just one sampling step

In each step of the algorithm, we run Gibbs sampling steps in each tempered Markov chain yielding samples .

Then, two neighboring Gibbs chains with temperatures and may exchange "particles" and with probability (based on Metropolis ratio),

which gives, for RBMs,

After performing these "swaps" between chains (which increase the mixing rate), we take the sample from the original chain with as the sample from RBM distribution.

This procedure is repeated times, yielding the samples used for the approximation of the expectation under the RBM distribution in the log-likelihood gradient. Usually is the number of samples in the batch of training data.

Compared to CD, PT produces more quickly mixing Markov chain, thus less biased gradient approximation, at the cost of computation.

Bibliography

• [ackley_1985] Ackley, Hinton, & Sejnowski, A Learning Algorithm for Boltzmann Machines*, Cognitive Science, 9(1), 147–169 (1985). link. doi.
• [Fischer_2015] Fischer, Training Restricted Boltzmann Machines, KI - Künstliche Intelligenz, 29(4), 441–444 (2015). link. doi.
• [bengio_2009] Bengio & Delalleau, Justifying and Generalizing Contrastive Divergence, Neural Computation, 21(6), 1601–1621 (2009). link. doi.