RBM: Restricted Boltzmann Machines
Overview
- One visible layer of observed binary units
- One hidden layer of unobserved binary units
- Connections between every pair of nodes in different layers; "restricted" means there are no hidden-to-hidden or visible-to-visible connections, but there are connections between all visible–hidden pairs.
- Models Maximum Entropy distributions, and because of this, RBMs were originally motivated by Hinton as a "soft-constraint learner", where the intention was that it would learn a distribution satisfying the underlying constraints (see the Principle of Maximum Entropy) ackley_1985.
Notation
- $v_i$ and $\mathbf{v}$ denote visible units
- $h_j$ and $\mathbf{h}$ denote hidden units
- $b_i$ and $\mathbf{b}$ will be used to denote the bias of the visible units
- $c_j$ and $\mathbf{c}$ will be used to denote the bias of the hidden units
- $w_{ij}$ and $W$ denote the weights between visible unit $i$ and hidden unit $j$
- $\hat{\mathbb{E}}$ denotes the empirical expectation, i.e. the average over the training samples
- $F(\mathbf{v})$ denotes the so-called free energy:
$$F(\mathbf{v}) = - \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$$
- $\mathbf{v}^{(t)}$ and $\mathbf{h}^{(t)}$ denote the t-th sample from a sampling chain for the visible and hidden variables, respectively.
- $\mathbf{v}_{-k}$ denotes the vector $\mathbf{v}$ but excluding the entry at index $k$:
$$\mathbf{v}_{-k} = (v_1, \dots, v_{k-1}, v_{k+1}, \dots, v_m)$$
Factor graph view
A Boltzmann Machine is an energy-based model consisting of a set of hidden units $\mathbf{H}$ and a set of visible units $\mathbf{V}$, where by "units" we mean random variables, taking on the values $\mathbf{h}$ and $\mathbf{v}$, respectively.
A Boltzmann Machine assumes the following joint probability distribution of the visible and hidden units:
$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$$
with $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ being the partition function (normalization factor) and the energy function
$$E(\mathbf{v}, \mathbf{h}) = - \mathbf{b}^T \mathbf{v} - \mathbf{c}^T \mathbf{h} - \mathbf{v}^T W \mathbf{h} - \frac{1}{2} \mathbf{v}^T L \mathbf{v} - \frac{1}{2} \mathbf{h}^T J \mathbf{h}$$
where $L$ and $J$ denote the (symmetric) visible–visible and hidden–hidden interaction matrices, respectively.
A Restricted Boltzmann Machine (RBM) is an energy-based model consisting of a set of hidden units $\mathbf{H}$ and a set of visible units $\mathbf{V}$, where by "units" we mean random variables, taking on the values $\mathbf{h}$ and $\mathbf{v}$, respectively.
The restricted part of the name comes from the fact that there are no intra-layer connections ($L = J = 0$), which makes the hidden units conditionally independent given the visible units, and vice versa, i.e.
$$p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}), \qquad p(\mathbf{v} \mid \mathbf{h}) = \prod_i p(v_i \mid \mathbf{h})$$
An RBM assumes the following joint probability distribution of the visible and hidden units:
$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$$
with $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ being the partition function (normalization factor) and the energy function
$$E(\mathbf{v}, \mathbf{h}) = - b_i v_i - c_j h_j - v_i w_{ij} h_j$$
implicitly summing over repeated indices.
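To make the definitions concrete, here is a minimal NumPy sketch of this energy function (the function and variable names are mine, not from any particular library):

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy E(v, h) = -b^T v - c^T h - v^T W h of a binary RBM."""
    return -b @ v - c @ h - v @ W @ h

# Tiny example: 3 visible units, 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
b, c = np.zeros(3), np.zeros(2)
v, h = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0])
print(rbm_energy(v, h, W, b, c))  # -E(v, h) is the unnormalized log-probability
```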
Training
Log-likelihood using gradient ascent
Notation
- $\theta$ denotes a generic model parameter, i.e. one of $w_{ij}$, $b_i$, $c_j$; used in the derivation of the gradient of the log-likelihood below.
Gradient of log-likelihood of energy-based models
RBMs assume the following model
$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$$
with
$$Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}, \qquad E(\mathbf{v}, \mathbf{h}) = - b_i v_i - c_j h_j - v_i w_{ij} h_j$$
where we're implicitly summing over repeated indices. (The minus-minus signs might seem redundant, and they are, but this is convention due to the Boltzmann distribution found in physics.)
First we observe that
$$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$$
Taking the log of this, we have
$$\log p(\mathbf{v}) = \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} - \log Z$$
Now suppose we're given a set of samples $\{\mathbf{v}^{(n)}\}_{n=1}^N$, then the likelihood (assuming i.i.d.) is given by
$$L(\theta) = \prod_{n=1}^{N} p(\mathbf{v}^{(n)})$$
Taking the log of this expression, and using the above expression for $\log p(\mathbf{v})$, we have
$$\ell(\theta) = \sum_{n=1}^{N} \left( \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}^{(n)}, \mathbf{h})} - \log Z \right)$$
Let $\ell_n$ denote the n-th term; taking the partial derivative wrt. $\theta$ for the n-th term we have
$$\frac{\partial \ell_n}{\partial \theta} = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v}^{(n)}, \mathbf{h})} \left( - \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \right)}{\sum_{\mathbf{h}} e^{-E(\mathbf{v}^{(n)}, \mathbf{h})}} - \frac{1}{Z} \frac{\partial Z}{\partial \theta}$$
The first term can be written as an expectation
$$\frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v}^{(n)}, \mathbf{h})} \left( - \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \right)}{\sum_{\mathbf{h}} e^{-E(\mathbf{v}^{(n)}, \mathbf{h})}} = - \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(n)})} \left[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \right]$$
since on the LHS we're marginalizing over all $\mathbf{h}$ and then normalizing wrt. the same distribution we just summed over. For the second term recall that $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$, therefore
$$\frac{1}{Z} \frac{\partial Z}{\partial \theta} = \sum_{\mathbf{v}, \mathbf{h}} \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z} \left( - \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right) = - \mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$$
Substituting it all back into the partial derivative of the log-likelihood, we end up with
$$\frac{\partial \ell_n}{\partial \theta} = - \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(n)})} \left[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \right] + \mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$$
where the expectations are all over the probability distributions defined by the model.
Since maximizing $\ell(\theta)$ is equivalent to maximizing $\frac{1}{N} \ell(\theta)$, we instead consider
$$\frac{1}{N} \frac{\partial \ell}{\partial \theta} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial \ell_n}{\partial \theta}$$
Dividing by $N$, observe that the first term can be written
$$- \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(n)})} \left[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \right] = - \hat{\mathbb{E}} \left[ \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \right]$$
Giving us
$$\frac{1}{N} \frac{\partial \ell}{\partial \theta} = - \hat{\mathbb{E}} \left[ \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \right] + \mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$$
Clearly the second term is almost always intractable, since we would have to sum over all possible $\mathbf{v}$ and $\mathbf{h}$, but the first term might also be intractable, as we would have to marginalize over all the hidden states $\mathbf{h}$ with $\mathbf{v}$ fixed to $\mathbf{v}^{(n)}$.
When working with binary hidden variables, i.e. the $h_j$ are all assumed to be Bernoulli random variables, then we have
$$\mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})} \left[ h_j \right] = p(h_j = 1 \mid \mathbf{v})$$
so the conditional expectation in the first term is tractable, since the energy is linear in each $h_j$ and the $h_j$ are conditionally independent given $\mathbf{v}$.
In general, when the hiddens are not necessarily Bernoulli, we would also like to sample the hidden states to approximate the conditional expectation. It would make sense to sample some number $M$ of hidden states for each observed $\mathbf{v}^{(n)}$, i.e.
$$\mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(n)})} \left[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \right] \approx \frac{1}{M} \sum_{m=1}^{M} \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h}^{(m)})}{\partial \theta}, \qquad \mathbf{h}^{(m)} \sim p(\mathbf{h} \mid \mathbf{v}^{(n)})$$
In fact, as mentioned on p.45 in Fischer_2015, this method (but with $M = 1$) is often used; in the Bernoulli case, however, taking the expectation analytically reduces the variance of our estimate compared to also sampling $\mathbf{h}$.
You will often see the notation
$$\left\langle \cdot \right\rangle_{\text{data}} = \hat{\mathbb{E}} \left[ \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})} \left[ \cdot \right] \right]$$
which we will use from now on. The subscript "data" is used to make it explicit that we're computing the (empirical) expectation over the visible units, NOT the joint expectation obtained by also sampling $\mathbf{h}$ for each of the observed visible units.
For the second term we also use the empirical expectation over samples from the model as an approximation:
$$\mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \approx \left\langle \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right\rangle_{\text{model}}$$
Finally giving us a tractable approximation to the gradient of the log-likelihood:
$$\frac{1}{N} \frac{\partial \ell}{\partial \theta} \approx - \left\langle \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right\rangle_{\text{data}} + \left\langle \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right\rangle_{\text{model}}$$
Making the values $\theta$ takes on explicit, we have
$$\frac{1}{N} \frac{\partial \ell}{\partial w_{ij}} \approx \left\langle v_i h_j \right\rangle_{\text{data}} - \left\langle v_i h_j \right\rangle_{\text{model}}, \qquad \frac{1}{N} \frac{\partial \ell}{\partial b_i} \approx \left\langle v_i \right\rangle_{\text{data}} - \left\langle v_i \right\rangle_{\text{model}}, \qquad \frac{1}{N} \frac{\partial \ell}{\partial c_j} \approx \left\langle h_j \right\rangle_{\text{data}} - \left\langle h_j \right\rangle_{\text{model}}$$
Observe that the signs have switched, which is simply due to the fact that $E$ depends negatively on all the variables, e.g. $\frac{\partial E}{\partial w_{ij}} = - v_i h_j$.
Now we can make use of any nice stochastic gradient ascent method.
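As a sketch of what such an update could look like in NumPy (assuming the moments $\langle \cdot \rangle$ have already been estimated; all names here are mine):

```python
import numpy as np

def gradient_ascent_step(W, b, c, data_stats, model_stats, lr=0.01):
    """One gradient-ascent step on the (approximate) log-likelihood.

    data_stats / model_stats are tuples (vh, v, h) holding the estimated
    moments <v h^T>, <v>, <h> under the data and model, respectively.
    """
    vh_d, v_d, h_d = data_stats
    vh_m, v_m, h_m = model_stats
    W += lr * (vh_d - vh_m)
    b += lr * (v_d - v_m)
    c += lr * (h_d - h_m)
    return W, b, c
```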
Bernoulli RBM
In the case of a Bernoulli RBM, both the visible and hidden variables only take on the values $\{0, 1\}$. Therefore,
$$\mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})} \left[ h_j \right] = p(h_j = 1 \mid \mathbf{v})$$
Hence, for the hidden units, we have
$$p(h_j = 1 \mid \mathbf{v}) = \operatorname{sigmoid}(c_j + v_i w_{ij})$$
and
$$\left\langle v_i h_j \right\rangle_{\text{data}} = \hat{\mathbb{E}} \left[ v_i \, p(h_j = 1 \mid \mathbf{v}) \right], \qquad \left\langle h_j \right\rangle_{\text{data}} = \hat{\mathbb{E}} \left[ p(h_j = 1 \mid \mathbf{v}) \right]$$
For the visible units we have
$$p(v_i = 1 \mid \mathbf{h}) = \operatorname{sigmoid}(b_i + w_{ij} h_j)$$
and
$$\left\langle v_i \right\rangle_{\text{data}} = \hat{\mathbb{E}} \left[ v_i \right]$$
Thus, we end up with the gradients of the negative log-likelihood
$$- \frac{1}{N} \frac{\partial \ell}{\partial w_{ij}} \approx - \hat{\mathbb{E}} \left[ v_i \, p(h_j = 1 \mid \mathbf{v}) \right] + \mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ v_i h_j \right], \qquad - \frac{1}{N} \frac{\partial \ell}{\partial b_i} \approx - \hat{\mathbb{E}} \left[ v_i \right] + \mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ v_i \right], \qquad - \frac{1}{N} \frac{\partial \ell}{\partial c_j} \approx - \hat{\mathbb{E}} \left[ p(h_j = 1 \mid \mathbf{v}) \right] + \mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ h_j \right]$$
In these expressions, the first terms are indeed tractable but the second terms still pose a problem. Therefore we turn to MCMC methods for estimating the expectations in the second terms.
A good method would be (block) Gibbs sampling: since the visible units are conditionally independent given the hidden units (and vice versa), we can sample all the visible or all the hidden units simultaneously. To obtain unbiased estimates we would then have to run the sampler for a large number of steps. Unfortunately this is expensive, which is where Contrastive Divergence comes in.
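A minimal sketch of one such block-Gibbs step for a Bernoulli RBM (names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One block-Gibbs step: sample all hiddens given v, then all visibles given h."""
    p_h = sigmoid(c + v @ W)                        # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + W @ h)                        # p(v_i = 1 | h)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h
```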
Expression for $p(v_k \mid \mathbf{v}_{-k}, \mathbf{h})$
Let $\mathbf{v}_{-k}$ denote the vector $\mathbf{v}$ but excluding the entry at index $k$, then
$$p(v_k \mid \mathbf{v}_{-k}, \mathbf{h}) = \frac{p(\mathbf{v}, \mathbf{h})}{p(\mathbf{v}_{-k}, \mathbf{h})} = \frac{p(\mathbf{v}, \mathbf{h})}{\sum_{v_k' \in \{0, 1\}} p(v_1, \dots, v_k', \dots, v_m, \mathbf{h})}$$
where in the last step we simply marginalize over $v_k$.
The numerator will be the exponential of the following three terms added together
$$b_i v_i, \qquad c_j h_j, \qquad v_i w_{ij} h_j$$
and the denominator will be
$$\frac{1}{Z} \sum_{v_k' \in \{0, 1\}} \exp \left( \sum_{i \neq k} b_i v_i + b_k v_k' + c_j h_j + \sum_{i \neq k} v_i w_{ij} h_j + v_k' w_{kj} h_j \right)$$
To be explicit, the denominator is
$$\frac{1}{Z} \left[ \exp \left( \sum_{i \neq k} b_i v_i + c_j h_j + \sum_{i \neq k} v_i w_{ij} h_j \right) + \exp \left( \sum_{i \neq k} b_i v_i + b_k + c_j h_j + \sum_{i \neq k} v_i w_{ij} h_j + w_{kj} h_j \right) \right]$$
which we can write
$$\frac{1}{Z} \exp \left( \sum_{i \neq k} b_i v_i + c_j h_j + \sum_{i \neq k} v_i w_{ij} h_j \right) \left( 1 + e^{b_k + w_{kj} h_j} \right)$$
Taking the logarithm (for readability's sake), we end up with
$$\log p(V_k = 1 \mid \mathbf{v}_{-k}, \mathbf{h}) = b_k + w_{kj} h_j + \sum_{i \neq k} b_i v_i + c_j h_j + \sum_{i \neq k} v_i w_{ij} h_j - \left( \sum_{i \neq k} b_i v_i + c_j h_j + \sum_{i \neq k} v_i w_{ij} h_j + \log \left( 1 + e^{b_k + w_{kj} h_j} \right) \right)$$
Observe that the matching terms in this expression cancel each other, thus
$$\log p(V_k = 1 \mid \mathbf{v}_{-k}, \mathbf{h}) = b_k + w_{kj} h_j - \log \left( 1 + e^{b_k + w_{kj} h_j} \right)$$
Finally, taking the exponential again,
$$p(V_k = 1 \mid \mathbf{v}_{-k}, \mathbf{h}) = \frac{e^{b_k + w_{kj} h_j}}{1 + e^{b_k + w_{kj} h_j}} = \operatorname{sigmoid}(b_k + w_{kj} h_j)$$
which does not depend on $\mathbf{v}_{-k}$ at all, i.e. $p(v_k \mid \mathbf{v}_{-k}, \mathbf{h}) = p(v_k \mid \mathbf{h})$.
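A quick brute-force check of this result, using hypothetical toy numbers of my own choosing:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Check p(V_k = 1 | v_{-k}, h) = sigmoid(b_k + sum_j w_kj h_j) by brute force.
rng = np.random.default_rng(1)
W, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)
h = np.array([1.0, 0.0])
v_rest = np.array([1.0, 0.0])  # values of v_0, v_1 we condition on; k = 2

def unnormalized(v):
    """exp(-E(v, h)); the partition function Z cancels in the ratio below."""
    return np.exp(b @ v + c @ h + v @ W @ h)

v_k1 = np.append(v_rest, 1.0)
v_k0 = np.append(v_rest, 0.0)
p = unnormalized(v_k1) / (unnormalized(v_k0) + unnormalized(v_k1))
assert np.isclose(p, sigmoid(b[2] + W[2] @ h))
```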
TODO Computing
Gaussian RBM
Suppose now that the $v_i$ are continuous variables while the $h_j$ are still Bernoulli. We assume the following Hamiltonian / energy function
$$E(\mathbf{v}, \mathbf{h}) = \frac{(v_i - b_i)^2}{2 \sigma_i^2} - c_j h_j - \frac{v_i}{\sigma_i} w_{ij} h_j$$
where, as usual, we assume summation over the repeated indices.
- $\sigma_i^2$ denotes the variance of $v_i$
- We use the term $\frac{v_i}{\sigma_i}$ in the "covariance" term, as it makes sense to remove the variance in $v_i$ which does not arise due to its relationship with the hiddens.
Computing $p(v_k \mid \mathbf{h})$ and $p(h_j \mid \mathbf{v})$
First observe that
$$p(\mathbf{v} \mid \mathbf{h}) = \prod_k p(v_k \mid \mathbf{h})$$
And,
$$p(v_k \mid \mathbf{h}) = \frac{p(v_k, \mathbf{h})}{p(\mathbf{h})} = \frac{p(v_k, \mathbf{h})}{\int p(v, \mathbf{h}) \, dv}$$
We know that
$$p(v_k, \mathbf{h}) \propto \exp \left( - \frac{(v_k - b_k)^2}{2 \sigma_k^2} + \frac{v_k}{\sigma_k} w_{kj} h_j \right) \exp \left( c_j h_j \right)$$
where we've separated the terms depending on $v_k$ from those depending on $\mathbf{h}$ alone. Why? In the denominator, we're integrating over all possible values of $v_k$, hence any factor not dependent on $v_k$ we can move out of the integral. And since we find the same factor in the numerator, we can simply cancel the second term in the above expression. This leaves us with
$$p(v_k \mid \mathbf{h}) = \frac{\exp \left( - \frac{(v_k - b_k)^2}{2 \sigma_k^2} + \frac{v_k}{\sigma_k} w_{kj} h_j \right)}{\int \exp \left( - \frac{(v - b_k)^2}{2 \sigma_k^2} + \frac{v}{\sigma_k} w_{kj} h_j \right) dv}$$
Now, writing out the term in the exponent,
$$- \frac{(v_k - b_k)^2}{2 \sigma_k^2} + \frac{v_k}{\sigma_k} w_{kj} h_j = - \frac{1}{2 \sigma_k^2} \left( v_k^2 - 2 v_k b_k + b_k^2 - 2 \sigma_k v_k w_{kj} h_j \right)$$
For readability's sake, let
$$\mu_k = b_k + \sigma_k w_{kj} h_j$$
Then we can "complete the square" in the first term, which gives us
$$- \frac{1}{2 \sigma_k^2} \left( (v_k - \mu_k)^2 + b_k^2 - \mu_k^2 \right) = - \frac{(v_k - \mu_k)^2}{2 \sigma_k^2} + \frac{\mu_k^2 - b_k^2}{2 \sigma_k^2}$$
Finally, substituting this back into our expression for $p(v_k \mid \mathbf{h})$, keeping in mind that the second term does not depend on $v_k$ and can therefore be pulled out of the integral, we get
$$p(v_k \mid \mathbf{h}) = \frac{\exp \left( - \frac{(v_k - \mu_k)^2}{2 \sigma_k^2} \right)}{\int \exp \left( - \frac{(v - \mu_k)^2}{2 \sigma_k^2} \right) dv}$$
where we've cancelled the exponentials of $\frac{\mu_k^2 - b_k^2}{2 \sigma_k^2}$ as just described. But HOLD ON; these exponentials are simply (unnormalized) Gaussians with mean $\mu_k$!, hence
$$\int \exp \left( - \frac{(v - \mu_k)^2}{2 \sigma_k^2} \right) dv = \sqrt{2 \pi} \sigma_k$$
Thus,
$$p(v_k \mid \mathbf{h}) = \frac{1}{\sqrt{2 \pi} \sigma_k} \exp \left( - \frac{(v_k - \mu_k)^2}{2 \sigma_k^2} \right)$$
Or equivalently,
$$V_k \mid \mathbf{h} \sim \mathcal{N} \left( b_k + \sigma_k w_{kj} h_j, \ \sigma_k^2 \right)$$
And we now understand why we call the RBM resulting from this specific energy function a Gaussian RBM!
A final note is that
$$p(\mathbf{v} \mid \mathbf{h}) = \prod_k p(v_k \mid \mathbf{h})$$
due to the independence of the visible units conditioned on the hidden units, as usual in an RBM. Hence
$$p(\mathbf{v} \mid \mathbf{h}) = \prod_k \frac{1}{\sqrt{2 \pi} \sigma_k} \exp \left( - \frac{(v_k - \mu_k)^2}{2 \sigma_k^2} \right)$$
Or, more conveniently from a numerical standpoint,
$$\log p(\mathbf{v} \mid \mathbf{h}) = - \sum_k \left( \frac{(v_k - \mu_k)^2}{2 \sigma_k^2} + \log \left( \sqrt{2 \pi} \sigma_k \right) \right)$$
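A sketch of sampling the visibles of such a Gaussian RBM, following the conditional just derived (names are mine):

```python
import numpy as np

def sample_visibles_gaussian(h, W, b, sigma, rng):
    """Sample v_k ~ N(b_k + sigma_k * sum_j w_kj h_j, sigma_k^2) for all k at once."""
    mean = b + sigma * (W @ h)
    return mean + sigma * rng.standard_normal(b.shape)
```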
- What if the $h_j$ are also Gaussian?
What happens if we instead have the following energy function
$$E(\mathbf{v}, \mathbf{h}) = \frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{(h_j - c_j)^2}{2 \tau_j^2} - \frac{v_i}{\sigma_i} w_{ij} \frac{h_j}{\tau_j}$$
where $\tau_j^2$ denotes the variance of $h_j$. Going back to our expression for the distribution of $v_k$ given $\mathbf{h}$: the only dependence on $\mathbf{h}$ arises from the multiplication with $\frac{h_j}{\tau_j}$ in the energy function, hence the distribution of the visible units, when the hiddens can also take on continuous values, is simply
$$V_k \mid \mathbf{h} \sim \mathcal{N} \left( b_k + \sigma_k w_{kj} \frac{h_j}{\tau_j}, \ \sigma_k^2 \right)$$
The only addition being the factor of $\frac{1}{\tau_j}$ in the sum.
What about $p(h_j \mid \mathbf{v})$?
Looking at our derivation for $p(v_k \mid \mathbf{h})$, the derivation for the hiddens given the visibles is exactly the same, substituting $b_k$ with $c_j$ and $\sigma_k$ with $\tau_j$, giving us
$$H_j \mid \mathbf{v} \sim \mathcal{N} \left( c_j + \tau_j \frac{v_i}{\sigma_i} w_{ij}, \ \tau_j^2 \right)$$
General expressions for both Bernoulli and Gaussians
If you've had a look at the derivations for the conditional probabilities $p(\mathbf{v} \mid \mathbf{h})$ and $p(\mathbf{h} \mid \mathbf{v})$ when the $v_i$ and $h_j$ are Bernoulli or Gaussian, you might have observed that the expressions in the exponentials are very similar. In fact, we have the following
$$p(V_k = 1 \mid \mathbf{h}) = \operatorname{sigmoid}(z_k) \quad \text{(Bernoulli)}, \qquad V_k \mid \mathbf{h} \sim \mathcal{N}(z_k, \sigma_k^2) \quad \text{(Gaussian)}$$
with
$$z_k = b_k + \sigma_k w_{kj} \bar{h}_j, \qquad \bar{h}_j = \begin{cases} h_j & \text{Bernoulli hiddens} \\ h_j / \tau_j & \text{Gaussian hiddens} \end{cases}$$
(and $\sigma_k = 1$ when the visibles are Bernoulli), and for the hiddens
$$p(H_j = 1 \mid \mathbf{v}) = \operatorname{sigmoid}(u_j) \quad \text{(Bernoulli)}, \qquad H_j \mid \mathbf{v} \sim \mathcal{N}(u_j, \tau_j^2) \quad \text{(Gaussian)}$$
with
$$u_j = c_j + \tau_j \bar{v}_i w_{ij}, \qquad \bar{v}_i = \begin{cases} v_i & \text{Bernoulli visibles} \\ v_i / \sigma_i & \text{Gaussian visibles} \end{cases}$$
(and $\tau_j = 1$ when the hiddens are Bernoulli).
If we want to implement an RBM, allowing the possibility of using both Bernoulli and Gaussian hidden and / or visible units, we can!
We still need to make the appropriate corrections in the free-energy computation, i.e. when computing $F(\mathbf{v})$.
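A sketch of how such a generic implementation could sample a layer, given its pre-activation ($z_k$ or $u_j$ above); the interface is my own invention, not from any library:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_layer(pre_activation, kind, scale, rng):
    """Sample a layer of units given their pre-activations.

    kind  : "bernoulli" -> each unit is 1 with probability sigmoid(pre_activation)
            "gaussian"  -> each unit is N(pre_activation, scale^2)
    scale : per-unit standard deviation (ignored for Bernoulli units)
    """
    if kind == "bernoulli":
        p = sigmoid(pre_activation)
        return (rng.random(p.shape) < p).astype(float)
    if kind == "gaussian":
        return pre_activation + scale * rng.standard_normal(pre_activation.shape)
    raise ValueError(f"unknown unit type: {kind}")
```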
Contrastive Divergence (CD)
The idea of Contrastive Divergence is to approximate the gradient of the log-likelihood we saw earlier
$$\frac{1}{N} \frac{\partial \ell}{\partial \theta} = - \hat{\mathbb{E}} \left[ \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \right] + \mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$$
by replacing the model expectation with an evaluation at a sample $\mathbf{v}^{(k)}$, obtained by running a Gibbs chain for only $k$ steps, initialized at a training example. The resulting algorithm (CD-k) is:
- Initialize "gradients"
- For :
- Initialize sampler with training instance (instead of using a random initialization)
- For do
- , where entries are of course independent of each other (due to biparite structure of an RBM)
- , where entries are of course independent of each other (due to biparite structure of an RBM)
- Compute gradients replacing with .
Update parameters:
where is the learning rate
This is basically the algorithm presented in Fischer_2015, but in my opinion this ought to be the average gradient, not the changes summed up. It feels like there is a factor of $\frac{1}{N}$ missing here.
Of course, one could just absorb this into the learning rate, as long as the number of samples in each batch is the same.
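Here is a minimal NumPy sketch of CD-k for a Bernoulli RBM along these lines, dividing by the batch size as discussed above (all names are mine, not from Fischer_2015):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(batch, W, b, c, k=1, lr=0.01, rng=None):
    """One CD-k update on a batch of binary visible vectors (rows of batch).

    Uses the *average* gradient over the batch, i.e. includes the 1/N factor
    discussed above.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = batch.shape[0]
    ph0 = sigmoid(c + batch @ W)   # p(h = 1 | v^(0)): data term (expectation, not a sample)
    v = batch
    for _ in range(k):             # k steps of block Gibbs sampling
        ph = sigmoid(c + v @ W)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(b + h @ W.T)
        v = (rng.random(pv.shape) < pv).astype(float)
    phk = sigmoid(c + v @ W)       # expectation of h given the final chain state v^(k)
    W += lr * (batch.T @ ph0 - v.T @ phk) / n
    b += lr * (batch - v).mean(axis=0)
    c += lr * (ph0 - phk).mean(axis=0)
    return W, b, c
```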
Why not compute the expectation as one samples along these steps; why only do it at the end of the chain?
That is, when performing the approximation of the log-likelihood gradient for a single training example
$$\frac{\partial \log p(\mathbf{v}^{(n)})}{\partial \theta} = - \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(n)})} \left[ \frac{\partial E(\mathbf{v}^{(n)}, \mathbf{h})}{\partial \theta} \right] + \mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$$
by replacing the model expectation by sampling $\mathbf{v}^{(k)}$, i.e.
$$\mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \approx \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(k)})} \left[ \frac{\partial E(\mathbf{v}^{(k)}, \mathbf{h})}{\partial \theta} \right]$$
where $\mathbf{v}^{(t)}$ denotes the t-th sample from a sampler chain, initialized by the training example $\mathbf{v}^{(0)} = \mathbf{v}^{(n)}$.
This leaves us with the update rule (for a single training example):
$$\Delta \theta = \eta \left( - \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(0)})} \left[ \frac{\partial E(\mathbf{v}^{(0)}, \mathbf{h})}{\partial \theta} \right] + \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(k)})} \left[ \frac{\partial E(\mathbf{v}^{(k)}, \mathbf{h})}{\partial \theta} \right] \right)$$
Why not compute the expectation over all these steps? Instead of stochastically approximating this expectation by a single example (the final sample in the chain), we could use the following approximation to the expectation
$$\mathbb{E}_{p(\mathbf{v}, \mathbf{h})} \left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \approx \frac{1}{k} \sum_{t=1}^{k} \mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}^{(t)})} \left[ \frac{\partial E(\mathbf{v}^{(t)}, \mathbf{h})}{\partial \theta} \right]$$
I guess some reasons could be:
- a more biased estimate (the more steps we let the sampler take, the closer the sample gets to being unbiased, so averaging over the chain includes the early, most biased samples)
- it is more expensive to compute the gradient at every step than to just sample
Theoretical justification
Most, if not all, of this is straight-up copied from bengio_2009.
Lemma 1. Consider a Gibbs chain $\mathbf{v}^{(1)}, \mathbf{h}^{(1)}, \mathbf{v}^{(2)}, \mathbf{h}^{(2)}, \dots$ starting at a data point, i.e. $\mathbf{v}^{(1)} = \mathbf{v}^{(n)}$ for some $n$. Then the log-likelihood at any step $t$ of the chain can be written
$$\log p(\mathbf{v}^{(1)}) = \log \frac{p(\mathbf{v}^{(1)})}{p(\mathbf{v}^{(t)})} + \log p(\mathbf{v}^{(t)})$$
and, since this is true for any sample path,
$$\log p(\mathbf{v}^{(1)}) = \mathbb{E} \left[ \log \frac{p(\mathbf{v}^{(1)})}{p(\mathbf{v}^{(t)})} \right] + \mathbb{E} \left[ \log p(\mathbf{v}^{(t)}) \right]$$
where the expectations are over Markov chain sample paths, conditioned on the starting sample $\mathbf{v}^{(1)}$.
The first equation is obvious, while the second equation requires rewriting
$$\mathbb{E} \left[ \log p(\mathbf{v}^{(1)}) \right] = \log p(\mathbf{v}^{(1)})$$
(the starting sample is fixed under the conditioning) and substituting it in the first equation after taking expectations.
Lemma 2. For any model $p(\mathbf{x})$ with parameters $\theta$,
$$\mathbb{E}_{p(\mathbf{x})} \left[ \frac{\partial \log p(\mathbf{x})}{\partial \theta} \right] = 0$$
when the expected value is taken according to $p(\mathbf{x})$ itself.
(Observe that for this to work with continuous variables we need to ensure that we're allowed to interchange the partial derivative and the integration symbol.)
Theorem. Consider a Gibbs chain starting at a data point, i.e. $\mathbf{v}^{(1)} = \mathbf{v}^{(n)}$ for some $n$.
The log-likelihood gradient can be written
$$\frac{\partial \log p(\mathbf{v}^{(1)})}{\partial \theta} = - \frac{\partial F(\mathbf{v}^{(1)})}{\partial \theta} + \mathbb{E} \left[ \frac{\partial F(\mathbf{v}^{(t)})}{\partial \theta} \right] + \mathbb{E} \left[ \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta} \right]$$
and
$$\mathbb{E} \left[ \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta} \right] \to 0 \quad \text{as} \quad t \to \infty$$
Proof. Taking the derivative of the expression in Lemma 1, and using $\log p(\mathbf{v}) = - F(\mathbf{v}) - \log Z$ (so that $\log Z$ cancels in the ratio), we get
$$\frac{\partial \log p(\mathbf{v}^{(1)})}{\partial \theta} = - \frac{\partial F(\mathbf{v}^{(1)})}{\partial \theta} + \frac{\partial F(\mathbf{v}^{(t)})}{\partial \theta} + \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta}$$
Taking the expectations wrt. the Markov chain conditioned on $\mathbf{v}^{(1)}$, we get the expression in the theorem.
To show that the last term vanishes for $t \to \infty$, we use the assumed convergence of the chain, which can be written
$$p(\mathbf{v}^{(t)} = \mathbf{v} \mid \mathbf{v}^{(1)}) = p(\mathbf{v}) + \epsilon_t(\mathbf{v})$$
with
$$\epsilon_t(\mathbf{v}) \to 0 \quad \text{as} \quad t \to \infty$$
We can then rewrite the last expectation as
$$\mathbb{E} \left[ \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta} \right] = \sum_{\mathbf{v}} p(\mathbf{v}) \frac{\partial \log p(\mathbf{v})}{\partial \theta} + \sum_{\mathbf{v}} \epsilon_t(\mathbf{v}) \frac{\partial \log p(\mathbf{v})}{\partial \theta}$$
Using Lemma 2, the first sum vanishes, and we're left with
$$\left| \mathbb{E} \left[ \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta} \right] \right| = \left| \sum_{\mathbf{v}} \epsilon_t(\mathbf{v}) \frac{\partial \log p(\mathbf{v})}{\partial \theta} \right| \leq N_{\mathbf{v}} \, \hat{\epsilon}_t \max_{\mathbf{v}} \left| \frac{\partial \log p(\mathbf{v})}{\partial \theta} \right|$$
where $N_{\mathbf{v}}$ denotes the number of possible discrete configurations $\mathbf{v}$ can take on, and we've let
$$\hat{\epsilon}_t = \max_{\mathbf{v}} \left| \epsilon_t(\mathbf{v}) \right|$$
Hence this expectation goes to zero as $t \to \infty$, completing our proof.
The proof above also tells us something important: the error CD-k incurs by dropping the last expectation has an upper bound which depends linearly on the sampler error $\hat{\epsilon}_k$.
Therefore the mixing rate of the chain is directly related to the error of CD-k.
(Important to remember that there are of course also errors introduced by the stochastic approximation of the first part of the expectation.)
Persistent Contrastive Divergence (PCD)
Persistent Contrastive Divergence (PCD) is obtained from the CD approximation by replacing the sample $\mathbf{v}^{(k)}$ by a sample from a Gibbs chain that is independent of the sample $\mathbf{v}^{(n)}$ from the training distribution.
This corresponds to standard CD without reinitializing the visible units of the Markov chain with a training sample each time we want to draw a sample $\mathbf{v}^{(k)}$; instead, the chain keeps running between parameter updates.
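A sketch of how this changes the CD-k update from before (again, all names are mine); the only difference is that the negative-phase chain state is carried between calls instead of being reset to the data:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(batch, chain_v, W, b, c, k=1, lr=0.01, rng=None):
    """One PCD update: the negative phase starts from the persistent chain
    state chain_v (carried over between updates), not from the data."""
    if rng is None:
        rng = np.random.default_rng()
    n = batch.shape[0]
    ph0 = sigmoid(c + batch @ W)   # positive (data) phase
    v = chain_v                    # persistent chain, NOT reinitialized
    for _ in range(k):
        ph = sigmoid(c + v @ W)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(b + h @ W.T)
        v = (rng.random(pv.shape) < pv).astype(float)
    phk = sigmoid(c + v @ W)
    W += lr * (batch.T @ ph0 - v.T @ phk) / n
    b += lr * (batch - v).mean(axis=0)
    c += lr * (ph0 - phk).mean(axis=0)
    return W, b, c, v              # return the updated chain state for the next call
```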
Parallel Tempering (PT)
Parallel Tempering (PT) introduces supplementary Gibbs chains that sample from more and more smoothed replicas of the original distribution.
Formally, given an ordered set of $M$ temperatures $1 = T_1 < T_2 < \dots < T_M$, we define a set of Markov chains with stationary distributions
$$p_r(\mathbf{v}, \mathbf{h}) = \frac{1}{Z_r} e^{- E(\mathbf{v}, \mathbf{h}) / T_r}$$
for $r = 1, \dots, M$, where $Z_r$ is the corresponding partition function, and $p_1$ is exactly the model distribution.
- With increasing temperature, $p_r$ becomes more and more uniformly distributed
- As $T_r \to \infty$ we have $p_r(\mathbf{v}, \mathbf{h}) \to \frac{1}{Z_r}$, i.e. the uniform distribution
- In that limit, subsequent samples of the Gibbs chain are independent of each other, thus the stationary distribution is reached after just one sampling step
In each step of the algorithm, we run $k$ Gibbs sampling steps in each tempered Markov chain, yielding samples $(\mathbf{v}_1, \mathbf{h}_1), \dots, (\mathbf{v}_M, \mathbf{h}_M)$.
Then, two neighboring Gibbs chains with temperatures $T_r$ and $T_{r - 1}$ may exchange "particles" $(\mathbf{v}_r, \mathbf{h}_r)$ and $(\mathbf{v}_{r - 1}, \mathbf{h}_{r - 1})$ with probability (based on the Metropolis ratio)
$$\min \left\{ 1, \ \frac{p_r(\mathbf{v}_{r - 1}, \mathbf{h}_{r - 1}) \, p_{r - 1}(\mathbf{v}_r, \mathbf{h}_r)}{p_r(\mathbf{v}_r, \mathbf{h}_r) \, p_{r - 1}(\mathbf{v}_{r - 1}, \mathbf{h}_{r - 1})} \right\}$$
which gives, for RBMs,
$$\min \left\{ 1, \ \exp \left( \left( \frac{1}{T_r} - \frac{1}{T_{r - 1}} \right) \left( E(\mathbf{v}_r, \mathbf{h}_r) - E(\mathbf{v}_{r - 1}, \mathbf{h}_{r - 1}) \right) \right) \right\}$$
After performing these "swaps" between chains (which increase the mixing rate), we take the sample $(\mathbf{v}_1, \mathbf{h}_1)$ from the original chain with $T_1 = 1$ as the sample from the RBM distribution.
This procedure is repeated $N$ times, yielding the samples used for the approximation of the expectation under the RBM distribution in the log-likelihood gradient. Usually $N$ is the number of samples in the batch of training data.
Compared to CD, PT produces a more quickly mixing Markov chain, and thus a less biased gradient approximation, at the cost of extra computation.
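The swap probability above is simple to compute, since the partition functions cancel in the Metropolis ratio; a small sketch (names are mine):

```python
import numpy as np

def swap_probability(E_r, E_prev, T_r, T_prev):
    """Metropolis probability of swapping the particles of the chains at
    temperatures T_r and T_prev, given the energies of their current states.
    The partition functions Z_r cancel, so only the energies are needed."""
    return min(1.0, float(np.exp((1.0 / T_r - 1.0 / T_prev) * (E_r - E_prev))))
```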
Estimating partition function
Bibliography
- [ackley_1985] Ackley, Hinton, & Sejnowski, A Learning Algorithm for Boltzmann Machines, Cognitive Science, 9(1), 147–169 (1985). link. doi.
- [Fischer_2015] Fischer, Training Restricted Boltzmann Machines, KI - Künstliche Intelligenz, 29(4), 441–444 (2015). link. doi.
- [bengio_2009] Bengio & Delalleau, Justifying and Generalizing Contrastive Divergence, Neural Computation, 21(6), 1601–1621 (2009). link. doi.