RBM: Restricted Boltzmann Machines

Overview

  • One visible layer of observed binary units $\mathbf{v} = (v_1, \dots, v_m)$
  • One hidden layer of unobserved binary units $\mathbf{h} = (h_1, \dots, h_n)$
  • Connections between every pair of nodes in different layers, i.e. restricted means no connections from hidden to hidden, nor from visible to visible, but connections between all visible-hidden pairs.

    rbm.png

  • Models Maximum Entropy distributions; because of this, RBMs were originally motivated by Hinton as a "soft-constraint learner", intended to learn a distribution which satisfies the underlying constraints (see Principle of Maximum Entropy) ackley_1985.

Notation

  • $\mathbf{v}$ and $v_i$ denote the visible units
  • $\mathbf{h}$ and $h_j$ denote the hidden units
  • $\mathbf{b}$ and $b_i$ will be used to denote the bias of the visible units
  • $\mathbf{c}$ and $c_j$ will be used to denote the bias of the hidden units
  • $\hat{\mathbb{E}}[\cdot]$ denotes the empirical expectation
  • The so-called free energy:

    $F(\mathbf{v}) := -\log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$

  • $\mathbf{v}^{(t)}$ and $\mathbf{h}^{(t)}$ denote the t-th sample from a sampling chain for the visible and hidden variables, respectively.
  • $\mathbf{h}_{-i}$ denotes the vector $\mathbf{h}$ but excluding the entry at index $i$:

    $\mathbf{h}_{-i} := (h_1, \dots, h_{i-1}, h_{i+1}, \dots, h_n)$

Factor graph view

A Boltzmann Machine is an energy-based model consisting of a set of hidden units $\mathbf{H} = (H_1, \dots, H_n)$ and a set of visible units $\mathbf{V} = (V_1, \dots, V_m)$, where by "units" we mean random variables, taking on the values $\mathbf{h}$ and $\mathbf{v}$, respectively.

A Boltzmann Machine assumes the following joint probability distribution of the visible and hidden units:

$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$

with $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ being the partition function (normalization factor) and the energy function

$E(\mathbf{v}, \mathbf{h}) = -\frac{1}{2} \mathbf{v}^T L \mathbf{v} - \frac{1}{2} \mathbf{h}^T J \mathbf{h} - \mathbf{v}^T W \mathbf{h} - \mathbf{b}^T \mathbf{v} - \mathbf{c}^T \mathbf{h}$

where $L$ and $J$ denote the matrices of lateral connection weights within the visible and hidden layers, respectively.

A Restricted Boltzmann Machine (RBM) is an energy-based model consisting of a set of hidden units $\mathbf{H} = (H_1, \dots, H_n)$ and a set of visible units $\mathbf{V} = (V_1, \dots, V_m)$, where by "units" we mean random variables, taking on the values $\mathbf{h}$ and $\mathbf{v}$, respectively.

The restricted part of the name comes from the fact that there are no connections within a layer, which makes the hidden units conditionally independent given the visible units, and vice versa, i.e.

$p(\mathbf{h} \mid \mathbf{v}) = \prod_{j=1}^n p(h_j \mid \mathbf{v}), \qquad p(\mathbf{v} \mid \mathbf{h}) = \prod_{i=1}^m p(v_i \mid \mathbf{h})$

An RBM assumes the following joint probability distribution of the visible and hidden units:

$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$

with $Z$ being the partition function (normalization factor) and the energy function

$E(\mathbf{v}, \mathbf{h}) = - b_i v_i - c_j h_j - v_i W_{ij} h_j$

implicitly summing over repeated indices.
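
To make this concrete, here is a minimal NumPy sketch of the energy computation (the function and variable names are my own, chosen to match the notation above):

  import numpy as np

  def energy(v, h, W, b, c):
      """E(v, h) = - b_i v_i - c_j h_j - v_i W_ij h_j (implicit sums).

      v: (m,) visible values, h: (n,) hidden values,
      W: (m, n) weights, b: (m,) visible biases, c: (n,) hidden biases.
      """
      return -(b @ v) - (c @ h) - (v @ W @ h)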

Training

Log-likelihood using gradient ascent

Notation

  • Used in the derivation of $p(v_i \mid \mathbf{h})$ for the Gaussian RBM:

    $s_i := \sigma_i \sum_j W_{ij} h_j$

Gradient of log-likelihood of energy-based models

An RBM assumes the following model

$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$

with

$E(\mathbf{v}, \mathbf{h}) = - b_i v_i - c_j h_j - v_i W_{ij} h_j$

The minus-minus signs might seem redundant (and they are), but this is convention due to the Boltzmann distribution found in physics.

where we're implicitly summing over repeated indices. First we observe that

$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$

Taking the log of this, we have

$\log p(\mathbf{v}) = \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} - \log Z$

Now suppose we're given a set of samples $\{ \mathbf{v}_n \}_{n=1}^N$, then the likelihood (assuming i.i.d.) is given by

$\mathcal{L} := \prod_{n=1}^N p(\mathbf{v}_n)$

Taking the log of this expression, and using the above expression for $p(\mathbf{v})$, we have

$\ell := \log \mathcal{L} = \sum_{n=1}^N \log p(\mathbf{v}_n) = \sum_{n=1}^N \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}_n, \mathbf{h})} - N \log Z$

Let $\theta$ denote any one of the parameters $W_{ij}, b_i, c_j$. Taking the partial derivative wrt. $\theta$ of the n-th term, we have

$\frac{\partial \log p(\mathbf{v}_n)}{\partial \theta} = - \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v}_n, \mathbf{h})} \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta}}{\sum_{\mathbf{h}} e^{-E(\mathbf{v}_n, \mathbf{h})}} + \frac{1}{Z} \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta}$

The first term can be written as an expectation

$\frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v}_n, \mathbf{h})} \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta}}{\sum_{\mathbf{h}} e^{-E(\mathbf{v}_n, \mathbf{h})}} = \sum_{\mathbf{h}} p(\mathbf{h} \mid \mathbf{v}_n) \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} = \mathbb{E}\left[ \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}_n \right]$

since on the LHS we're marginalizing over all $\mathbf{h}$ and then normalizing wrt. the same distribution we just summed over. For the second term recall that $p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$, therefore

$\frac{1}{Z} \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} = \sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} = \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$

Substituting it all back into the partial derivative of the log-likelihood, we end up with

$\frac{\partial \log p(\mathbf{v}_n)}{\partial \theta} = - \mathbb{E}\left[ \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}_n \right] + \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$

Where the expectations are all over the probability distribution defined by the model.

Since maximizing $\ell$ is equivalent to maximizing $\frac{1}{N} \ell$ we instead consider

$\frac{\partial \ell}{\partial \theta} = - \sum_{n=1}^N \mathbb{E}\left[ \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}_n \right] + N \, \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$

Dividing by $N$, observe that the first term can be written

$\frac{1}{N} \sum_{n=1}^N \mathbb{E}\left[ \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}_n \right] = \hat{\mathbb{E}}\left[ \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v} \right] \right]$

Giving us

$\frac{1}{N} \frac{\partial \ell}{\partial \theta} = - \hat{\mathbb{E}}\left[ \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v} \right] \right] + \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$

Clearly the second term is almost always intractable since we would have to sum over all possible $\mathbf{v}$ and $\mathbf{h}$, but the first term might also be intractable as we would have to marginalize over all the hidden states $\mathbf{h}$ with $\mathbf{v}$ fixed to $\mathbf{v}_n$.

When working with binary hidden variables, i.e. the $h_j$ are all assumed to be Bernoulli random variables, then we have

$\mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial W_{ij}} \,\middle|\, \mathbf{v} \right] = - v_i \, p(h_j = 1 \mid \mathbf{v}), \qquad \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial c_j} \,\middle|\, \mathbf{v} \right] = - p(h_j = 1 \mid \mathbf{v})$

In general, when the hiddens are not necessarily Bernoulli, we would also like to sample the hidden states $\mathbf{h}$ to approximate the conditional expectation. It would make sense to sample some number $M$ of hidden states $\mathbf{h}$ for each observed $\mathbf{v}_n$, i.e.

$\mathbb{E}\left[ \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}_n \right] \approx \frac{1}{M} \sum_{m=1}^M \frac{\partial E(\mathbf{v}_n, \mathbf{h}^{(m)})}{\partial \theta}, \qquad \mathbf{h}^{(m)} \sim p(\mathbf{h} \mid \mathbf{v}_n)$

In fact, as mentioned on p. 45 in Fischer_2015, this method (but letting $M = 1$) is often used, but, in the Bernoulli case, taking the expectation reduces the variance of our estimate compared to also sampling $\mathbf{h}$.

You will often see the notation

$\hat{\mathbb{E}}_{\mathbf{v}}\left[ \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v} \right] \right] := \frac{1}{N} \sum_{n=1}^N \mathbb{E}\left[ \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}_n \right]$

which we will use from now on. The subscript $\mathbf{v}$ is used to make it explicit that we're computing the expectation over the visible units, NOT the joint expectation obtained by also sampling $\mathbf{h}$ for each of the observed visible units.

For the second term we also use the empirical expectation as an approximation:

$\mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\partial E(\tilde{\mathbf{v}}_s, \tilde{\mathbf{h}}_s)}{\partial \theta}, \qquad (\tilde{\mathbf{v}}_s, \tilde{\mathbf{h}}_s) \sim p(\mathbf{v}, \mathbf{h})$

Finally giving us a tractable approximation to the gradient of the log-likelihood:

$\frac{1}{N} \frac{\partial \ell}{\partial \theta} \approx - \hat{\mathbb{E}}_{\mathbf{v}}\left[ \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v} \right] \right] + \frac{1}{S} \sum_{s=1}^{S} \frac{\partial E(\tilde{\mathbf{v}}_s, \tilde{\mathbf{h}}_s)}{\partial \theta}$

Making the values $\theta$ can take on explicit, we have

$\begin{aligned} \frac{1}{N} \frac{\partial \ell}{\partial W_{ij}} &\approx \hat{\mathbb{E}}_{\mathbf{v}}\big[ \mathbb{E}[ v_i h_j \mid \mathbf{v} ] \big] - \mathbb{E}[ v_i h_j ] \\ \frac{1}{N} \frac{\partial \ell}{\partial b_i} &\approx \hat{\mathbb{E}}_{\mathbf{v}}[ v_i ] - \mathbb{E}[ v_i ] \\ \frac{1}{N} \frac{\partial \ell}{\partial c_j} &\approx \hat{\mathbb{E}}_{\mathbf{v}}\big[ \mathbb{E}[ h_j \mid \mathbf{v} ] \big] - \mathbb{E}[ h_j ] \end{aligned}$

Observe that the signs have switched, which is simply due to the fact that $E(\mathbf{v}, \mathbf{h})$ depends negatively on all the variables $W_{ij}$, $b_i$ and $c_j$.

Now we can make use of any nice stochastic gradient ascent method.
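
As a sketch of what these estimates look like in code, here is a minimal NumPy version; it jumps ahead and uses the Bernoulli conditional $\mathbb{E}[h_j \mid \mathbf{v}] = \sigma(c_j + \sum_i v_i W_{ij})$ derived in the next section, and V_model stands for visible samples assumed to come from some MCMC procedure:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def grad_log_likelihood(V_data, V_model, W, b, c):
      """Approximate per-parameter gradients of (1/N) log-likelihood (Bernoulli RBM).

      V_data:  (N, m) observed visible configurations.
      V_model: (S, m) visible samples from the model (e.g. obtained via MCMC).
      Both terms use E[h | v] rather than sampled h, to reduce variance.
      """
      Ph_data = sigmoid(c + V_data @ W)    # (N, n): E[h | v] for each data point
      Ph_model = sigmoid(c + V_model @ W)  # (S, n): E[h | v] for each model sample

      dW = V_data.T @ Ph_data / len(V_data) - V_model.T @ Ph_model / len(V_model)
      db = V_data.mean(axis=0) - V_model.mean(axis=0)
      dc = Ph_data.mean(axis=0) - Ph_model.mean(axis=0)
      return dW, db, dc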

Bernoulli RBM

In the case of a Bernoulli RBM, both the visible and hidden variables only take on the values $\{0, 1\}$. Therefore,

$\mathbb{E}[ h_j \mid \mathbf{v} ] = p(h_j = 1 \mid \mathbf{v}), \qquad \mathbb{E}[ v_i \mid \mathbf{h} ] = p(v_i = 1 \mid \mathbf{h})$

Hence, for the hidden units, we have

$p(h_j = 1 \mid \mathbf{v}) = \sigma\Big( c_j + \sum_i v_i W_{ij} \Big), \qquad \text{where } \sigma(x) := \frac{1}{1 + e^{-x}}$

and

$\mathbb{E}[ h_j \mid \mathbf{v} ] = \sigma\Big( c_j + \sum_i v_i W_{ij} \Big)$

For the visible units we have

$p(v_i = 1 \mid \mathbf{h}) = \sigma\Big( b_i + \sum_j W_{ij} h_j \Big)$

and

$\mathbb{E}[ v_i \mid \mathbf{h} ] = \sigma\Big( b_i + \sum_j W_{ij} h_j \Big)$

Thus, we end up with the (approximate) gradients of the log-likelihood

$\begin{aligned} \frac{1}{N} \frac{\partial \ell}{\partial W_{ij}} &\approx \hat{\mathbb{E}}_{\mathbf{v}}\Big[ v_i \, \sigma\Big( c_j + \sum_k v_k W_{kj} \Big) \Big] - \mathbb{E}[ v_i h_j ] \\ \frac{1}{N} \frac{\partial \ell}{\partial b_i} &\approx \hat{\mathbb{E}}_{\mathbf{v}}[ v_i ] - \mathbb{E}[ v_i ] \\ \frac{1}{N} \frac{\partial \ell}{\partial c_j} &\approx \hat{\mathbb{E}}_{\mathbf{v}}\Big[ \sigma\Big( c_j + \sum_k v_k W_{kj} \Big) \Big] - \mathbb{E}[ h_j ] \end{aligned}$

In these expressions, the first terms are indeed tractable but the second terms still pose a problem. Therefore we turn to MCMC methods for estimating the expectations in the second terms.

A good method would be (block) Gibbs sampling: since the units within a layer are conditionally independent given the other layer, we can sample all the visible or all the hidden units simultaneously. To obtain unbiased estimates we would then run the sampler for a large number of steps. Unfortunately this is expensive, which is where Contrastive Divergence comes in.
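
A minimal NumPy sketch of one such block Gibbs sweep (the names are mine; rng is a NumPy random generator):

  import numpy as np

  rng = np.random.default_rng(0)

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def gibbs_sweep(v, W, b, c):
      """One block Gibbs step v -> h -> v' for a Bernoulli RBM."""
      p_h = sigmoid(c + v @ W)              # p(h_j = 1 | v), all j simultaneously
      h = (rng.random(p_h.shape) < p_h).astype(float)
      p_v = sigmoid(b + W @ h)              # p(v_i = 1 | h), all i simultaneously
      v_new = (rng.random(p_v.shape) < p_v).astype(float)
      return v_new, h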

Expression for $p(h_i = 1 \mid \mathbf{h}_{-i}, \mathbf{v})$

Let $\mathbf{h}_{-i}$ denote the vector $\mathbf{h}$ but excluding the entry at index $i$, then

$p(h_i = 1 \mid \mathbf{h}_{-i}, \mathbf{v}) = \frac{p(h_i = 1, \mathbf{h}_{-i}, \mathbf{v})}{p(\mathbf{h}_{-i}, \mathbf{v})} = \frac{p(h_i = 1, \mathbf{h}_{-i}, \mathbf{v})}{\sum_{h_i \in \{0, 1\}} p(h_i, \mathbf{h}_{-i}, \mathbf{v})}$

where in the last step we simply marginalize over $h_i$.

The numerator will be the exponential of the following three terms added together

$\sum_k b_k v_k, \qquad c_i + \sum_{l \neq i} c_l h_l, \qquad \sum_k v_k W_{ki} + \sum_k \sum_{l \neq i} v_k W_{kl} h_l$

and the denominator will be

$e^{-E(\mathbf{v}, (h_i = 1, \mathbf{h}_{-i}))} + e^{-E(\mathbf{v}, (h_i = 0, \mathbf{h}_{-i}))}$

To be explicit, the denominator is

$\exp\Big( \sum_k b_k v_k + c_i + \sum_{l \neq i} c_l h_l + \sum_k v_k W_{ki} + \sum_k \sum_{l \neq i} v_k W_{kl} h_l \Big) + \exp\Big( \sum_k b_k v_k + \sum_{l \neq i} c_l h_l + \sum_k \sum_{l \neq i} v_k W_{kl} h_l \Big)$

which we can write

$\exp(A) \Big( 1 + \exp\Big( c_i + \sum_k v_k W_{ki} \Big) \Big), \qquad \text{where } A := \sum_k b_k v_k + \sum_{l \neq i} c_l h_l + \sum_k \sum_{l \neq i} v_k W_{kl} h_l$

Taking the logarithm (for readability's sake), we end up with

$\log(\text{denominator}) = A + \log\Big( 1 + \exp\Big( c_i + \sum_k v_k W_{ki} \Big) \Big)$

Hence,

$\log p(h_i = 1 \mid \mathbf{h}_{-i}, \mathbf{v}) = \Big( A + c_i + \sum_k v_k W_{ki} \Big) - \Big( A + \log\Big( 1 + \exp\Big( c_i + \sum_k v_k W_{ki} \Big) \Big) \Big)$

Observe that the two $A$ terms in this expression cancel each other, thus

$\log p(h_i = 1 \mid \mathbf{h}_{-i}, \mathbf{v}) = c_i + \sum_k v_k W_{ki} - \log\Big( 1 + \exp\Big( c_i + \sum_k v_k W_{ki} \Big) \Big)$

Finally, taking the exponential again,

$p(h_i = 1 \mid \mathbf{h}_{-i}, \mathbf{v}) = \frac{\exp\big( c_i + \sum_k v_k W_{ki} \big)}{1 + \exp\big( c_i + \sum_k v_k W_{ki} \big)} = \sigma\Big( c_i + \sum_k v_k W_{ki} \Big)$

which does not depend on $\mathbf{h}_{-i}$ at all, consistent with the conditional independence of the hidden units given the visible units.

TODO Computing $p(\mathbf{v})$

$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$

$= \frac{1}{Z} e^{-F(\mathbf{v})}$

$F(\mathbf{v}) = -\log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$

$= - \sum_i b_i v_i - \sum_j \log\Big( 1 + \exp\Big( c_j + \sum_i v_i W_{ij} \Big) \Big)$
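
Assuming the expressions above, the Bernoulli free energy factorizes over the hidden units and is cheap to compute; a minimal NumPy sketch (np.logaddexp(0, x) is a numerically stable log(1 + e^x)):

  import numpy as np

  def free_energy(v, W, b, c):
      """F(v) = -log sum_h exp(-E(v, h)) for a Bernoulli RBM."""
      return -(b @ v) - np.sum(np.logaddexp(0.0, c + v @ W))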

Gaussian RBM

Suppose now that the visible units $v_i$ are continuous variables while the hidden units $h_j$ are still Bernoulli. We assume the following Hamiltonian / energy function

$E(\mathbf{v}, \mathbf{h}) = \frac{(v_i - b_i)^2}{2 \sigma_i^2} - c_j h_j - \frac{v_i}{\sigma_i} W_{ij} h_j$

where, as usual, we assume summation over the repeated indices.

  • $\sigma_i^2$ denotes the variance of $v_i$
  • We use the term $\frac{v_i}{\sigma_i}$ in the "covariance" term, as it makes sense to remove the variance in $v_i$ which does not arise due to its relationship with the hiddens.

Computing $p(v_i \mid \mathbf{h})$ and $p(h_j \mid \mathbf{v})$

First observe that

$p(v_i \mid \mathbf{h}) = p(v_i \mid \mathbf{v}_{-i}, \mathbf{h})$

And,

$p(v_i \mid \mathbf{v}_{-i}, \mathbf{h}) = \frac{p(\mathbf{v}, \mathbf{h})}{\int p(\mathbf{v}, \mathbf{h}) \, dv_i} = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{\int e^{-E(\mathbf{v}, \mathbf{h})} \, dv_i}$

We know that

$e^{-E(\mathbf{v}, \mathbf{h})} = \exp\Big( -\frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_j W_{ij} h_j \Big) \exp\Big( \sum_j c_j h_j + \sum_{k \neq i} \Big( -\frac{(v_k - b_k)^2}{2 \sigma_k^2} + \frac{v_k}{\sigma_k} \sum_j W_{kj} h_j \Big) \Big)$

where we've separated the factors depending on $v_i$ from those depending only on $\mathbf{v}_{-i}$ and $\mathbf{h}$. Why? In the denominator, we're integrating over all possible values of $v_i$, hence any factor not dependent on $v_i$ we can move out of the integral. And since we find the same factor in the numerator, we can simply cancel the second factor in the above expression. This leaves us with

$p(v_i \mid \mathbf{h}) = \frac{\exp\Big( -\frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_j W_{ij} h_j \Big)}{\int \exp\Big( -\frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_j W_{ij} h_j \Big) \, dv_i}$

Now, writing out the term in the exponent,

$-\frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_j W_{ij} h_j = -\frac{1}{2 \sigma_i^2} \Big( v_i^2 - 2 b_i v_i - 2 v_i \sigma_i \sum_j W_{ij} h_j + b_i^2 \Big)$

For readability's sake, let

$s_i := \sigma_i \sum_j W_{ij} h_j$

Then we can "complete the square" in the first term, which gives us

$-\frac{1}{2 \sigma_i^2} \Big( v_i^2 - 2 (b_i + s_i) v_i + b_i^2 \Big) = -\frac{(v_i - \mu_i)^2}{2 \sigma_i^2} + \frac{\mu_i^2 - b_i^2}{2 \sigma_i^2}, \qquad \text{where } \mu_i := b_i + s_i$

Finally, substituting this back into our expression for $p(v_i \mid \mathbf{h})$, keeping in mind that $\mu_i$ does not depend on $v_i$ and can therefore be pulled out of the integral, we get

$p(v_i \mid \mathbf{h}) = \frac{\exp\Big( -\frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big) \exp\Big( \frac{\mu_i^2 - b_i^2}{2 \sigma_i^2} \Big)}{\exp\Big( \frac{\mu_i^2 - b_i^2}{2 \sigma_i^2} \Big) \int \exp\Big( -\frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big) \, dv_i} = \frac{\exp\Big( -\frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big)}{\int \exp\Big( -\frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big) \, dv_i}$

where we've cancelled the exponentials of $\frac{\mu_i^2 - b_i^2}{2 \sigma_i^2}$ since they do not depend on $v_i$, as mentioned above. But HOLD ON; the remaining exponentials are simply unnormalized Gaussians with mean $\mu_i$!, hence

$\int \exp\Big( -\frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big) \, dv_i = \sqrt{2 \pi \sigma_i^2}$

Thus,

$p(v_i \mid \mathbf{h}) = \frac{1}{\sqrt{2 \pi \sigma_i^2}} \exp\Big( -\frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big)$

Or equivalently,

$v_i \mid \mathbf{h} \sim \mathcal{N}\Big( b_i + \sigma_i \sum_j W_{ij} h_j, \; \sigma_i^2 \Big)$

And we now understand why the RBM resulting from this specific energy function is called a Gaussian RBM!

A final note is that

$p(\mathbf{v} \mid \mathbf{h}) = \prod_{i=1}^m p(v_i \mid \mathbf{h})$

due to the independence of the visible units $v_i$ conditioned on the hidden units, as usual in an RBM. Hence

$p(\mathbf{v} \mid \mathbf{h}) = \prod_{i=1}^m \frac{1}{\sqrt{2 \pi \sigma_i^2}} \exp\Big( -\frac{(v_i - \mu_i)^2}{2 \sigma_i^2} \Big)$

Or, more conveniently numerically,

$\log p(\mathbf{v} \mid \mathbf{h}) = - \sum_{i=1}^m \Big( \frac{(v_i - \mu_i)^2}{2 \sigma_i^2} + \frac{1}{2} \log\big( 2 \pi \sigma_i^2 \big) \Big)$
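
Sampling the visible layer of a Gaussian RBM is then just a matter of drawing from independent Gaussians; a minimal NumPy sketch (names mine):

  import numpy as np

  rng = np.random.default_rng(0)

  def sample_v_given_h(h, W, b, sigma):
      """Draw v_i | h ~ N(b_i + sigma_i * sum_j W_ij h_j, sigma_i^2)."""
      mu = b + sigma * (W @ h)    # the mean mu_i derived above
      return mu + sigma * rng.standard_normal(mu.shape)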

  • What if the $h_j$ are also Gaussian?

    What happens if we instead have the following energy function

    $E(\mathbf{v}, \mathbf{h}) = \frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{(h_j - c_j)^2}{2 \tau_j^2} - \frac{v_i}{\sigma_i} W_{ij} \frac{h_j}{\tau_j}$

    Going back to our expression for the distribution of $v_i$,

    $p(v_i \mid \mathbf{h}) = \frac{\exp\Big( -\frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_j W_{ij} \frac{h_j}{\tau_j} \Big)}{\int \exp\Big( -\frac{(v_i - b_i)^2}{2 \sigma_i^2} + \frac{v_i}{\sigma_i} \sum_j W_{ij} \frac{h_j}{\tau_j} \Big) \, dv_i}$

    The only dependence on $h_j$ arises from the multiplication with $W_{ij}$ in the energy function; hence the distribution of the visible units, when the hidden units can also take on continuous values, is simply

    $v_i \mid \mathbf{h} \sim \mathcal{N}\Big( b_i + \sigma_i \sum_j W_{ij} \frac{h_j}{\tau_j}, \; \sigma_i^2 \Big)$

    The only addition being the factor of $\frac{1}{\tau_j}$ in the sum.

What about $p(h_j \mid \mathbf{v})$?

Looking at our derivation for $p(v_i \mid \mathbf{h})$, the derivation for the hiddens given the visibles is exactly the same, substituting $b_i$ with $c_j$ and $\frac{h_j}{\tau_j}$ with $\frac{v_i}{\sigma_i}$, giving us

$h_j \mid \mathbf{v} \sim \mathcal{N}\Big( c_j + \tau_j \sum_i \frac{v_i}{\sigma_i} W_{ij}, \; \tau_j^2 \Big)$

General expressions for both Bernoulli and Gaussians

If you've had a look at the derivations for the conditional probabilities $p(\mathbf{h} \mid \mathbf{v})$ and $p(\mathbf{v} \mid \mathbf{h})$ when $h_j$ and $v_i$ are Bernoulli or Gaussian, you might have observed that the expressions in the exponentials are very similar. In fact, we have the following

$p(v_i \mid \mathbf{h}) = \begin{cases} \operatorname{Bernoulli}\big( \sigma(z_i) \big) & v_i \text{ Bernoulli} \\ \mathcal{N}\big( z_i, \; \sigma_i^2 \big) & v_i \text{ Gaussian} \end{cases}$

with

$z_i := b_i + \sigma_i \sum_j W_{ij} \frac{h_j}{\tau_j} \qquad (\text{with } \sigma_i = 1, \; \tau_j = 1 \text{ for Bernoulli units})$

and for the hiddens

$p(h_j \mid \mathbf{v}) = \begin{cases} \operatorname{Bernoulli}\big( \sigma(r_j) \big) & h_j \text{ Bernoulli} \\ \mathcal{N}\big( r_j, \; \tau_j^2 \big) & h_j \text{ Gaussian} \end{cases}$

with

$r_j := c_j + \tau_j \sum_i \frac{v_i}{\sigma_i} W_{ij}$

So if we want to implement an RBM allowing both Bernoulli and Gaussian hidden and / or visible units, we can!

We still need to make the appropriate corrections in the free-energy computation, i.e. when computing $F(\mathbf{v})$.
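
A minimal sketch of what such a switch could look like, assuming unit variances ($\sigma_i = \tau_j = 1$) so that the total input is the same in both cases (the kind argument and helper names are mine):

  import numpy as np

  rng = np.random.default_rng(0)

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def sample_units(total_input, kind):
      """Sample a layer given its total input, e.g. total_input = c + v @ W.

      kind = "bernoulli": 0/1 samples with p = sigmoid(total_input).
      kind = "gaussian":  samples from N(total_input, 1).
      """
      if kind == "bernoulli":
          p = sigmoid(total_input)
          return (rng.random(p.shape) < p).astype(float)
      elif kind == "gaussian":
          return total_input + rng.standard_normal(total_input.shape)
      raise ValueError(f"unknown unit kind: {kind}")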

Contrastive Divergence (CD)

The idea of Contrastive Divergence is to approximate the gradient of the log-likelihood we saw earlier

$\frac{\partial \ell}{\partial \theta} = - \sum_{n=1}^N \mathbb{E}\left[ \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}_n \right] + N \, \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$

by

  1. Initialize the accumulated "gradients" $\Delta W = 0$, $\Delta \mathbf{b} = 0$, $\Delta \mathbf{c} = 0$
  2. For $n = 1, \dots, N$:
    1. Initialize the sampler with the training instance $\mathbf{v}_n$, i.e. $\mathbf{v}^{(0)} = \mathbf{v}_n$ (instead of using a random initialization)
    2. For $t = 1, \dots, k$ do
      1. $\mathbf{h}^{(t - 1)} \sim p(\mathbf{h} \mid \mathbf{v}^{(t - 1)})$, where the entries are of course sampled independently of each other (due to the bipartite structure of an RBM)
      2. $\mathbf{v}^{(t)} \sim p(\mathbf{v} \mid \mathbf{h}^{(t - 1)})$, where the entries are of course sampled independently of each other (due to the bipartite structure of an RBM)
    3. Compute the gradients for the n-th term, replacing $\mathbf{v}$ with $\mathbf{v}^{(k)}$ in the second expectation.
    4. $\Delta W_{ij} \mathrel{+}= v_i^{(0)} \, p(h_j = 1 \mid \mathbf{v}^{(0)}) - v_i^{(k)} \, p(h_j = 1 \mid \mathbf{v}^{(k)})$, and similarly for $\Delta b_i$ and $\Delta c_j$
  3. Update parameters:

    $\theta \leftarrow \theta + \eta \, \Delta \theta$

    where $\eta$ is the learning rate

This is basically the algorithm presented in Fischer_2015, but in my opinion this ought to be the average gradient, not the changes summed up; it feels like there is a factor of $\frac{1}{N}$ missing here.

Of course, one could just absorb this into the learning rate, as long as the number of samples in each batch is the same.
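
Putting the pieces together, a minimal NumPy sketch of one CD-k update on a batch, averaging the gradient over the batch as discussed above (all names are mine):

  import numpy as np

  rng = np.random.default_rng(0)

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def cd_k_update(V, W, b, c, k=1, lr=0.01):
      """One CD-k parameter update for a Bernoulli RBM on a data batch V of shape (N, m)."""
      V_neg = V.copy()                     # chains initialized at the data
      for _ in range(k):
          P_h = sigmoid(c + V_neg @ W)
          H = (rng.random(P_h.shape) < P_h).astype(float)
          P_v = sigmoid(b + H @ W.T)
          V_neg = (rng.random(P_v.shape) < P_v).astype(float)

      Ph_data = sigmoid(c + V @ W)         # E[h | v] under the data
      Ph_neg = sigmoid(c + V_neg @ W)      # E[h | v] under the chain samples

      N = len(V)
      W += lr * (V.T @ Ph_data - V_neg.T @ Ph_neg) / N
      b += lr * (V - V_neg).mean(axis=0)
      c += lr * (Ph_data - Ph_neg).mean(axis=0)
      return W, b, c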

Why not compute the expectation as one samples these $k$ steps; why only do it at the end of the chain?

That is, when performing the approximation of the gradient of the log-likelihood for a single training example $\mathbf{v}_n$

$\frac{\partial \log p(\mathbf{v}_n)}{\partial \theta} = - \mathbb{E}\left[ \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}_n \right] + \mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right]$

by replacing the second expectation by sampling $\mathbf{v}^{(k)}$, i.e.

$\mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \approx \mathbb{E}\left[ \frac{\partial E(\mathbf{v}^{(k)}, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}^{(k)} \right]$

where $\mathbf{v}^{(t)}$ denotes the t-th sample from a sampler chain, initialized by the training example $\mathbf{v}_n$.

This leaves us with the update rule (for a single training example):

$\theta \leftarrow \theta + \eta \left( - \mathbb{E}\left[ \frac{\partial E(\mathbf{v}_n, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}_n \right] + \mathbb{E}\left[ \frac{\partial E(\mathbf{v}^{(k)}, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}^{(k)} \right] \right)$

Why not compute the expectation over all these steps? Instead of stochastically approximating this expectation by a single example (the final sample in the chain), we could use the following approximation to the expectation

$\mathbb{E}\left[ \frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial \theta} \right] \approx \frac{1}{k} \sum_{t=1}^{k} \mathbb{E}\left[ \frac{\partial E(\mathbf{v}^{(t)}, \mathbf{h})}{\partial \theta} \,\middle|\, \mathbf{v}^{(t)} \right]$

I guess some reasons could be:

  • a more biased estimate (the more steps we let the sampler take, the closer the samples are to the model distribution, so averaging in the early samples increases the bias)
  • computing the gradient at every step is more expensive than just sampling

Theoretical justification

Most, if not all, is straight up copied from bengio_2009.

Consider a Gibbs chain $\mathbf{v}^{(0)}, \mathbf{h}^{(0)}, \mathbf{v}^{(1)}, \mathbf{h}^{(1)}, \dots$ starting at a data point, i.e. $\mathbf{v}^{(0)} = \mathbf{v}_n$ for some $n$. Then the log-likelihood at any step $t$ of the chain can be written

$\log p(\mathbf{v}^{(0)}) = \log p(\mathbf{v}^{(t)}) + \log \frac{p(\mathbf{v}^{(0)})}{p(\mathbf{v}^{(t)})}$

and since this is true for any path,

$\log p(\mathbf{v}^{(0)}) = \mathbb{E}\left[ \log p(\mathbf{v}^{(t)}) \right] + \mathbb{E}\left[ \log \frac{p(\mathbf{v}^{(0)})}{p(\mathbf{v}^{(t)})} \right]$

where the expectations are over Markov chain sample paths, conditioned on the starting sample $\mathbf{v}^{(0)} = \mathbf{v}_n$.

The first equation is obvious, while the second equation requires rewriting

$\mathbb{E}\left[ \log \frac{p(\mathbf{v}^{(0)})}{p(\mathbf{v}^{(t)})} \right] = \log p(\mathbf{v}^{(0)}) - \mathbb{E}\left[ \log p(\mathbf{v}^{(t)}) \right]$

and substituting in the first equation.

For any model $p(\mathbf{x})$ with parameters $\theta$

$\mathbb{E}\left[ \frac{\partial \log p(\mathbf{x})}{\partial \theta} \right] = 0$

when the expected value is taken according to $p(\mathbf{x})$.

$\mathbb{E}\left[ \frac{\partial \log p(\mathbf{x})}{\partial \theta} \right] = \sum_{\mathbf{x}} p(\mathbf{x}) \frac{1}{p(\mathbf{x})} \frac{\partial p(\mathbf{x})}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_{\mathbf{x}} p(\mathbf{x}) = \frac{\partial}{\partial \theta} 1 = 0$

(Observe that for this to work with continuous variables we need to ensure that we're allowed to interchange the partial derivative and the integration symbol)

Consider a Gibbs chain $\mathbf{v}^{(0)}, \mathbf{h}^{(0)}, \mathbf{v}^{(1)}, \mathbf{h}^{(1)}, \dots$ starting at a data point, i.e. $\mathbf{v}^{(0)} = \mathbf{v}_n$ for some $n$.

The log-likelihood gradient can be written

$\frac{\partial \log p(\mathbf{v}^{(0)})}{\partial \theta} = - \frac{\partial F(\mathbf{v}^{(0)})}{\partial \theta} + \mathbb{E}\left[ \frac{\partial F(\mathbf{v}^{(t)})}{\partial \theta} \right] + \mathbb{E}\left[ \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta} \right]$

and

$\mathbb{E}\left[ \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta} \right] \to 0 \quad \text{as } t \to \infty$

Taking the derivative of the expression in Lemma 1, noting that $\log \frac{p(\mathbf{v}^{(0)})}{p(\mathbf{v}^{(t)})} = - F(\mathbf{v}^{(0)}) + F(\mathbf{v}^{(t)})$ since the $\log Z$ terms cancel, we get

$\frac{\partial \log p(\mathbf{v}^{(0)})}{\partial \theta} = - \frac{\partial F(\mathbf{v}^{(0)})}{\partial \theta} + \frac{\partial F(\mathbf{v}^{(t)})}{\partial \theta} + \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta}$

Taking the expectations wrt. the Markov chain conditional on $\mathbf{v}^{(0)} = \mathbf{v}_n$, we get

$\frac{\partial \log p(\mathbf{v}^{(0)})}{\partial \theta} = - \frac{\partial F(\mathbf{v}^{(0)})}{\partial \theta} + \mathbb{E}\left[ \frac{\partial F(\mathbf{v}^{(t)})}{\partial \theta} \right] + \mathbb{E}\left[ \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta} \right]$

To show that the last term vanishes as $t \to \infty$, we use the assumed convergence of the chain, which can be written

$p(\mathbf{v}^{(t)} = \mathbf{v} \mid \mathbf{v}^{(0)}) = p(\mathbf{v}) + \epsilon_t(\mathbf{v})$

with

$\epsilon_t(\mathbf{v}) \to 0 \quad \text{as } t \to \infty$

We can then rewrite the last expectation as

$\mathbb{E}\left[ \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta} \right] = \sum_{\mathbf{v}} p(\mathbf{v}^{(t)} = \mathbf{v} \mid \mathbf{v}^{(0)}) \frac{\partial \log p(\mathbf{v})}{\partial \theta} = \sum_{\mathbf{v}} p(\mathbf{v}) \frac{\partial \log p(\mathbf{v})}{\partial \theta} + \sum_{\mathbf{v}} \epsilon_t(\mathbf{v}) \frac{\partial \log p(\mathbf{v})}{\partial \theta}$

Using Lemma 2, the first sum vanishes, and we're left with

$\left| \mathbb{E}\left[ \frac{\partial \log p(\mathbf{v}^{(t)})}{\partial \theta} \right] \right| = \left| \sum_{\mathbf{v}} \epsilon_t(\mathbf{v}) \frac{\partial \log p(\mathbf{v})}{\partial \theta} \right| \le N_{\mathbf{v}} \, \hat{\epsilon}_t \, \max_{\mathbf{v}} \left| \frac{\partial \log p(\mathbf{v})}{\partial \theta} \right|$

where $N_{\mathbf{v}}$ denotes the number of possible discrete configurations $\mathbf{v}$ can take on, and we've let

$\hat{\epsilon}_t := \max_{\mathbf{v}} \left| \epsilon_t(\mathbf{v}) \right|$

Hence this expectation goes to zero, completing our proof.

The proof above also tells us something important: the error due to the second expectation in CD-k has an upper bound which depends linearly on the sampler error $\hat{\epsilon}_k$.

Therefore the mixing-rate of the chain is directly related to the error of CD-k.

(Important to remember that there are of course also errors introduced in the stochastic approximation to the first part of the expectation too.)

Persistent Contrastive Divergence (PCD)

Persistent Contrastive Divergence (PCD) is obtained from the CD approximation by replacing the sample $\mathbf{v}^{(k)}$ by a sample from a Gibbs chain that is independent of the sample $\mathbf{v}_n$ from the training distribution.

This corresponds to standard CD without reinitializing the visible units of the Markov chain with a training sample each time we want to draw a sample $\mathbf{v}^{(k)}$.

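A minimal sketch of the corresponding update; compared to the CD-k sketch above, the only change is that the negative chain V_persist is carried across updates instead of being re-initialized at the data:

  import numpy as np

  rng = np.random.default_rng(0)

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def pcd_update(V, V_persist, W, b, c, k=1, lr=0.01):
      """One PCD update: V is the data batch, V_persist the persistent chain state."""
      for _ in range(k):
          P_h = sigmoid(c + V_persist @ W)
          H = (rng.random(P_h.shape) < P_h).astype(float)
          P_v = sigmoid(b + H @ W.T)
          V_persist = (rng.random(P_v.shape) < P_v).astype(float)

      Ph_data = sigmoid(c + V @ W)
      Ph_neg = sigmoid(c + V_persist @ W)

      W += lr * (V.T @ Ph_data - V_persist.T @ Ph_neg) / len(V)
      b += lr * (V - V_persist).mean(axis=0)
      c += lr * (Ph_data - Ph_neg).mean(axis=0)
      return W, b, c, V_persist            # hand the chain state to the next update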

Parallel Tempering (PT)

Parallel Tempering (PT) introduces supplementary Gibbs chains that sample from more and more smoothed replicas of the original distribution.

Formally, given an ordered set of $M$ temperatures $1 = T_1 < T_2 < \dots < T_M$, we define a set of $M$ Markov chains with stationary distributions

$p_r(\mathbf{v}, \mathbf{h}) = \frac{1}{Z_r} e^{- E(\mathbf{v}, \mathbf{h}) / T_r}$

for $r = 1, \dots, M$, where $Z_r$ is the corresponding partition function, and $p_1$ is exactly the model distribution.

  • With increasing temperature, $p_r$ becomes more and more uniform
  • As $T_r \to \infty$ we have $p_r(\mathbf{v}, \mathbf{h}) \to \text{const}$, i.e. the uniform distribution
    • Subsequent samples of the Gibbs chain are then independent of each other, thus the stationary distribution is reached after just one sampling step

In each step of the algorithm, we run $k$ Gibbs sampling steps in each tempered Markov chain, yielding samples $(\mathbf{v}_1, \mathbf{h}_1), \dots, (\mathbf{v}_M, \mathbf{h}_M)$.

Then, two neighboring Gibbs chains with temperatures $T_r$ and $T_{r-1}$ may exchange "particles" $(\mathbf{v}_r, \mathbf{h}_r)$ and $(\mathbf{v}_{r-1}, \mathbf{h}_{r-1})$ with probability (based on the Metropolis ratio)

$A\big( (\mathbf{v}_r, \mathbf{h}_r), (\mathbf{v}_{r-1}, \mathbf{h}_{r-1}) \big) = \min\left\{ 1, \; \frac{p_r(\mathbf{v}_{r-1}, \mathbf{h}_{r-1}) \, p_{r-1}(\mathbf{v}_r, \mathbf{h}_r)}{p_r(\mathbf{v}_r, \mathbf{h}_r) \, p_{r-1}(\mathbf{v}_{r-1}, \mathbf{h}_{r-1})} \right\}$

which gives, for RBMs,

$\min\left\{ 1, \; \exp\left( \left( \frac{1}{T_r} - \frac{1}{T_{r-1}} \right) \left( E(\mathbf{v}_r, \mathbf{h}_r) - E(\mathbf{v}_{r-1}, \mathbf{h}_{r-1}) \right) \right) \right\}$

After performing these "swaps" between chains (which increase the mixing rate), we take the sample $(\mathbf{v}_1, \mathbf{h}_1)$ from the original chain with $T_1 = 1$ as the sample from the RBM distribution.

This procedure is repeated $N$ times, yielding the samples $\tilde{\mathbf{v}}_1, \dots, \tilde{\mathbf{v}}_N$ used for the approximation of the expectation under the RBM distribution in the log-likelihood gradient. Usually $N$ is the number of samples in the batch of training data.

Compared to CD, PT produces a faster-mixing Markov chain, and thus a less biased gradient approximation, at the cost of extra computation.
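
A minimal sketch of the swap step (names mine; energies is assumed to hold $E(\mathbf{v}_r, \mathbf{h}_r)$ for each chain's current particle):

  import numpy as np

  rng = np.random.default_rng(0)

  def swap_probability(E_r, E_s, T_r, T_s):
      """Metropolis acceptance probability for swapping the particles of two
      tempered chains with particle energies E_r, E_s and temperatures T_r, T_s."""
      return min(1.0, float(np.exp((1.0 / T_r - 1.0 / T_s) * (E_r - E_s))))

  def try_swaps(particles, energies, temps):
      """Attempt a swap between every pair of neighbouring chains."""
      for r in range(len(temps) - 1):
          if rng.random() < swap_probability(energies[r], energies[r + 1],
                                             temps[r], temps[r + 1]):
              particles[r], particles[r + 1] = particles[r + 1], particles[r]
              energies[r], energies[r + 1] = energies[r + 1], energies[r]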

Estimating partition function

Bibliography

  • [ackley_1985] Ackley, Hinton & Sejnowski, A Learning Algorithm for Boltzmann Machines, Cognitive Science, 9(1), 147–169 (1985). link. doi.
  • [Fischer_2015] Fischer, Training Restricted Boltzmann Machines, KI - Künstliche Intelligenz, 29(4), 441–444 (2015). link. doi.
  • [bengio_2009] Bengio & Delalleau, Justifying and Generalizing Contrastive Divergence, Neural Computation, 21(6), 1601–1621 (2009). link. doi.