Gaussian Processes for Machine Learning

Introduction

A Gaussian process is a generalization of the Gaussian probability distribution to the continuous (function) case, where we handle the infinite number of function values by only ever considering a finite number of points.

Two common approaches to supervised learning:

  1. Restrict the class of functions that we consider
  2. Assign a prior probability to every possible function
    • But how do we even handle an infinite number of functions?

A Pictorial Introduction to Bayesian Modelling

1D regression

  • Here we specify a prior over functions, given by a particular Gaussian process which favours smooth functions
  • We assume a zero mean for every $x$, since we don't have any prior knowledge
  • We then add two data points $(x_1, y_1)$ and $(x_2, y_2)$ and only consider functions passing through these two points, ending up with the posterior in Figure 1.1(b)

figure1.1.png

Figure 1: From rasmussen2006gaussian

Regression

There are multiple ways to interpret a Gaussian process (GP)

Weight-space view

Notation

  • $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \dots, n\}$ is the training set of $n$ observations
  • $\mathbf{x}$ = input vector of dimension $D$
  • $y$ = scalar output/target
  • $X$ = the $D \times n$ design matrix with all $n$ input vectors as columns, i.e. $X \in \mathbb{R}^{D \times n}$

Standard Linear Model

Bayesian analysis of the standard linear regression model with Gaussian noise:

$f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}, \qquad y = f(\mathbf{x}) + \varepsilon$

  • Often a bias is included, but this can easily be implemented by adding an extra element to the input vector $\mathbf{x}$ whose value is always one
  • The noise assumption ($\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, i.i.d.) together with the model directly gives us the likelihood, i.e. the probability density of the observations given the parameters.

Because of (assumed) independence between samples:

$p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma_n^2}\lvert \mathbf{y} - X^\top \mathbf{w}\rvert^2\Big) = \mathcal{N}(X^\top \mathbf{w},\, \sigma_n^2 I)$

In the Bayesian formalism we need to specify a prior over the parameters, expressing our beliefs about the parameters before considering the observations. We simply put a zero-mean Gaussian with covariance matrix $\Sigma_p$ on the weights:

$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$

Inference in a Bayesian linear model is based on the posterior distribution over the weights:

$p(\mathbf{w} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)}$

where the normalization constant or marginal likelihood is given by

$p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}$

If we can obtain $p(\mathbf{w} \mid \mathbf{y}, X)$ without explicitly calculating this integral, this is what they call "integrating out the weights".

For now we ignore the marginal likelihood in the denominator, and then we have

$p(\mathbf{w} \mid X, \mathbf{y}) \propto \exp\Big(-\frac{1}{2\sigma_n^2}(\mathbf{y} - X^\top\mathbf{w})^\top(\mathbf{y} - X^\top\mathbf{w})\Big)\exp\Big(-\frac{1}{2}\mathbf{w}^\top \Sigma_p^{-1}\mathbf{w}\Big) \propto \exp\Big(-\frac{1}{2}(\mathbf{w} - \bar{\mathbf{w}})^\top A (\mathbf{w} - \bar{\mathbf{w}})\Big)$

where $\bar{\mathbf{w}} = \frac{1}{\sigma_n^2}A^{-1}X\mathbf{y}$. Notice how this has the form of a Gaussian posterior with mean $\bar{\mathbf{w}}$ and covariance matrix $A^{-1}$:

$p(\mathbf{w} \mid X, \mathbf{y}) \sim \mathcal{N}(\bar{\mathbf{w}},\, A^{-1})$

where $A = \sigma_n^{-2}XX^\top + \Sigma_p^{-1}$.

I suppose this is legitimate without having to compute the marginal likelihood, since we can easily obtain the normalization constant of a Gaussian distribution.
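
To make this concrete, here is a minimal numpy sketch of the posterior above on toy data of my own (the names sigma_n, Sigma_p, w_bar are mine, chosen to mirror $\sigma_n$, $\Sigma_p$, $\bar{\mathbf{w}}$; not code from the book):

import numpy as np

np.random.seed(0)

# toy 1-D data with a known slope, stored as a D x n design matrix (book convention)
n, D = 20, 1
X = np.random.uniform(-2, 2, size=(D, n))
w_true = np.array([0.7])
sigma_n = 0.3                                   # noise standard deviation
y = X.T @ w_true + sigma_n * np.random.randn(n)

Sigma_p = np.eye(D)                             # prior covariance on the weights

# A = sigma_n^{-2} X X^T + Sigma_p^{-1},  w_bar = sigma_n^{-2} A^{-1} X y   (eq. 2.8)
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, X @ y) / sigma_n**2  # posterior mean
post_cov = np.linalg.inv(A)                     # posterior covariance

print(w_bar, post_cov)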

For this model, and indeed for any Gaussian posterior, the mean of the posterior $\bar{\mathbf{w}}$ is also its mode, which is also called the maximum a posteriori (MAP) estimate of $\mathbf{w}$.

The meaning of the log prior and the MAP estimate substantially differs between Bayesian and non-Bayesian settings:

  • Bayesian $\Rightarrow$ the MAP estimate plays no special role
  • Non-Bayesian $\Rightarrow$ the negative log prior is sometimes used as a penalty term, and the MAP point is known as the penalized maximum likelihood estimate of the weights. In fact, this is what is known as ridge regression, because of the quadratic penalty term $\frac{1}{2}\mathbf{w}^\top\Sigma_p^{-1}\mathbf{w}$ from the log prior.

To make predictions for a test input $\mathbf{x}_*$, we average over all possible parameter values, weighted by their posterior probability. Thus the predictive distribution for $f_* \triangleq f(\mathbf{x}_*)$ is given by averaging (or rather, taking the expectation of $f(\mathbf{x}_*)$ over the weights $\mathbf{w}$) the output of all possible linear models w.r.t. the Gaussian posterior:

$p(f_* \mid \mathbf{x}_*, X, \mathbf{y}) = \int p(f_* \mid \mathbf{x}_*, \mathbf{w})\, p(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w} = \mathcal{N}\Big(\frac{1}{\sigma_n^2}\mathbf{x}_*^\top A^{-1} X \mathbf{y},\; \mathbf{x}_*^\top A^{-1}\mathbf{x}_*\Big)$

And even the predictive distribution is Gaussian! With a mean given by the posterior mean of the weights from eq. (2.8) multiplied by the new input. All hail Gauss!

The above step demonstrates a crucial part: we can consider all possible weights simply by "integrating out" or "marginalizing over" the weights.

Also note how the predictive variance is a quadratic form in the new test input, so the prediction becomes more uncertain as the magnitude of the test input grows.
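
A small, hedged sketch of the predictive distribution above on the same kind of toy setup as before (my own variable names, not the book's); the printout makes the growing-variance point visible:

import numpy as np

np.random.seed(0)
n, D, sigma_n = 20, 1, 0.3
X = np.random.uniform(-2, 2, size=(D, n))
y = X.T @ np.array([0.7]) + sigma_n * np.random.randn(n)
Sigma_p = np.eye(D)

A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
w_bar = A_inv @ X @ y / sigma_n**2

# predictive mean x_*^T w_bar and variance x_*^T A^{-1} x_* for each test input
x_star = np.linspace(-5, 5, 7).reshape(D, -1)          # test inputs as columns
pred_mean = x_star.T @ w_bar
pred_var = np.einsum('dm,de,em->m', x_star, A_inv, x_star)

# the predictive variance grows quadratically with the magnitude of x_*
print(np.c_[x_star.ravel(), pred_mean, pred_var])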

Projections of Inputs into Feature Space

  • We can overcome the restriction to linear functions by projecting the inputs into a higher-dimensional feature space, which e.g. allows us to implement polynomial regression (see the sketch after this list)
  • If the projections are independent of the parameters $\mathbf{w}$ => the model is still linear in the parameters and therefore analytically solvable/tractable!
  • Data which is not linearly separable in the original data space may become linearly separable in a high-dimensional feature space.
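
As a tiny illustration of "still linear in the parameters" (my own example, with the arbitrary feature map $\boldsymbol{\phi}(x) = (1, x, x^2, x^3)^\top$): fitting a cubic is just ordinary least squares on the projected features.

import numpy as np

np.random.seed(1)
x = np.linspace(-3, 3, 50)
y = np.sin(x) + 0.1 * np.random.randn(50)

# project the scalar input into feature space: phi(x) = (1, x, x^2, x^3)
Phi = np.vander(x, N=4, increasing=True)     # shape (50, 4)

# the model y ~ Phi @ w is still linear in w, so ordinary least squares applies
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)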

Function-space view

Our other approach is to consider inference directly in the function space, i.e. use a GP to describe a distribution over functions.

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP is completely specified by its mean function $m(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$. For a real process $f(\mathbf{x})$:

$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\big[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))\big]$

and we write the Gaussian process as

$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\big)$

Usually, for notational simplicity, we will take $m(\mathbf{x}) = 0$, though this need not be done.

Consistency requirement / marginalization property

A GP is defined as a collection of rvs $\Rightarrow$ a consistency requirement. This simply means that if the GP e.g. specifies $(y_1, y_2) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, then it must also specify $y_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$, where $\Sigma_{11}$ is the relevant submatrix of $\Sigma$; examining a larger set of variables does not change the distribution of the smaller set.

The definition does not exclude GPs with finite index sets (which would then just be Gaussian distributions).
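
A quick numerical illustration of the marginalization property (my own sketch, reusing the squared-exponential covariance that appears in the exercises below): the covariance block the GP assigns to a subset of points is exactly the kernel matrix built from that subset alone.

import numpy as np

def k(x, y, theta=(1.0, 1.0)):
    # squared exponential covariance, same form as used in the exercises below
    return theta[0] * np.exp(-0.5 * theta[1] * np.subtract.outer(x, y) ** 2)

x_big = np.array([-1.0, 0.0, 0.5, 2.0, 3.0])
x_small = x_big[:2]

K_big = k(x_big, x_big)        # joint covariance over five points
K_small = k(x_small, x_small)  # covariance over the first two points only

# the top-left 2x2 block of the larger matrix is exactly the smaller matrix
print(np.allclose(K_big[:2, :2], K_small))   # True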

Simple example

An example of a GP can be obtained from our Bayesian linear regression model $f(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^\top\mathbf{w}$ with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$:

$\mathbb{E}[f(\mathbf{x})] = \boldsymbol{\phi}(\mathbf{x})^\top\mathbb{E}[\mathbf{w}] = 0, \qquad \mathbb{E}[f(\mathbf{x})f(\mathbf{x}')] = \boldsymbol{\phi}(\mathbf{x})^\top\mathbb{E}[\mathbf{w}\mathbf{w}^\top]\boldsymbol{\phi}(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top\Sigma_p\,\boldsymbol{\phi}(\mathbf{x}')$

Ergo, $f(\mathbf{x})$ and $f(\mathbf{x}')$ are jointly Gaussian with zero mean and covariance given by $\boldsymbol{\phi}(\mathbf{x})^\top\Sigma_p\,\boldsymbol{\phi}(\mathbf{x}')$.
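
A small Monte Carlo sanity check of this covariance (my own sketch, with an arbitrary feature map $\boldsymbol{\phi}(x) = (1, x, x^2)^\top$ and an arbitrary diagonal $\Sigma_p$):

import numpy as np

np.random.seed(2)
phi = lambda x: np.stack([np.ones_like(x), x, x**2])   # feature map, shape (3, len(x))

x = np.array([-1.0, 0.0, 1.5])
Sigma_p = np.diag([1.0, 0.5, 0.25])

# analytic covariance: phi(x)^T Sigma_p phi(x')
K_analytic = phi(x).T @ Sigma_p @ phi(x)

# empirical covariance of function values from many weight draws w ~ N(0, Sigma_p)
W = np.random.multivariate_normal(np.zeros(3), Sigma_p, size=200000)  # (200000, 3)
F = W @ phi(x)                                                        # function values f(x_j)
K_empirical = np.cov(F, rowvar=False)

print(np.round(K_analytic, 2))
print(np.round(K_empirical, 2))   # should agree to roughly 2 decimals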

Predictions using noise-free observations

  • We know $\{(\mathbf{x}_i, f_i) \mid i = 1, \dots, n\}$, i.e. the observations are noise-free
  • $\mathbf{f}$ = vector of training outputs
  • $\mathbf{f}_*$ = vector of test outputs
  • The joint distribution of the training and test outputs according to the prior is then

$\begin{bmatrix}\mathbf{f} \\ \mathbf{f}_*\end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix}K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*)\end{bmatrix}\right)$

  • If there are $n$ training points and $n_*$ test points, then the matrix $K(X, X_*)$ is the $n \times n_*$ matrix of covariances evaluated at all pairs of training and test points, and similarly for $K(X, X)$, $K(X_*, X_*)$, etc.
  • Restrict this joint prior distribution to contain only those functions which agree with the observed data points
    • That is a lot of functions $\Rightarrow$ generating and rejecting them one by one is not efficient
    • In probabilistic terms this operation is simple: it is equivalent to conditioning the joint Gaussian prior distribution on the observations, which gives

$\mathbf{f}_* \mid X_*, X, \mathbf{f} \sim \mathcal{N}\big(K(X_*, X)K(X, X)^{-1}\mathbf{f},\; K(X_*, X_*) - K(X_*, X)K(X, X)^{-1}K(X, X_*)\big)$

  • Now we can sample values for $\mathbf{f}_*$ (corresponding to the test inputs $X_*$) from the joint posterior distribution by evaluating the mean and covariance matrix from the eqn. above (see the sketch below).
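
A minimal sketch of this conditioning step with plain numpy (using np.linalg.solve instead of an explicit inverse; the kernel and training points are my own toy choices, not from the book):

import numpy as np

def k(a, b, theta=(1.0, 1.0)):
    # squared exponential covariance, as in the exercises below
    return theta[0] * np.exp(-0.5 * theta[1] * np.subtract.outer(a, b) ** 2)

# noise-free training observations f at inputs X, plus a grid of test inputs X_*
X_train = np.array([-2.0, 0.0, 1.0])
f_train = np.sin(X_train)
X_star = np.linspace(-3, 3, 100)

K = k(X_train, X_train)
K_s = k(X_star, X_train)                       # K(X_*, X)
K_ss = k(X_star, X_star)                       # K(X_*, X_*)

# mean K(X_*,X) K(X,X)^{-1} f  and covariance K(X_*,X_*) - K(X_*,X) K(X,X)^{-1} K(X,X_*)
mean = K_s @ np.linalg.solve(K, f_train)
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)

# draw a few posterior functions (the jitter keeps the covariance numerically PSD)
samples = np.random.multivariate_normal(mean, cov + 1e-10 * np.eye(len(X_star)), size=3)
print(samples.shape)   # (3, 100)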

Appendix A: Stuff to know

Sampling from a Multivariate Gaussian

Assuming we have access to a Gaussian scalar generator.

We want samples $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, K)$ with arbitrary mean $\mathbf{m}$ and covariance matrix $K$.

  1. Compute the Cholesky decomposition (also known as the "matrix square root") $L$ of the positive definite symmetric covariance matrix $K = LL^\top$
    • $L$ is a lower triangular matrix
    • (Strictly, the standard Cholesky factorization needs $K$ to be positive definite, not merely positive semi-definite; see the Issues section below.)
  2. Generate $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, I)$ by multiple separate calls to the scalar Gaussian generator
  3. Compute $\mathbf{x} = \mathbf{m} + L\mathbf{u}$

    • This has the desired distribution, with mean $\mathbf{m}$ and covariance

    $\operatorname{cov}(\mathbf{x}) = L\,\mathbb{E}[\mathbf{u}\mathbf{u}^\top]\,L^\top = LL^\top = K$, by independence of the elements of $\mathbf{u}$

Cholesky Decomposition

  • Decomposes a symmetric, positive definite matrix $A$ into a product of a lower triangular matrix $L$ and its transpose:

    $A = LL^\top$

  • Useful for solving linear systems with a symmetric, positive definite coefficient matrix $A$ (sketched in code after this list):

    1. We want to solve $A\mathbf{x} = \mathbf{b}$ for $\mathbf{x}$
    2. Solve the triangular system $L\mathbf{z} = \mathbf{b}$ by forward substitution
    3. Solve the triangular system $L^\top\mathbf{x} = \mathbf{z}$ by backward substitution
    4. The solution is $\mathbf{x} = L^\top \backslash (L \backslash \mathbf{b})$
    • Here $A \backslash \mathbf{b}$ denotes the vector $\mathbf{x}$ which solves $A\mathbf{x} = \mathbf{b}$
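
One way this recipe looks in code, assuming scipy is available (a sketch with a hand-picked positive definite matrix, not a tuned implementation):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

A = np.array([[4.0, 2.0, 0.5],
              [2.0, 3.0, 1.0],
              [0.5, 1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 2.0, 3.0])

L = cholesky(A, lower=True)                    # A = L L^T
z = solve_triangular(L, b, lower=True)         # forward substitution:  L z = b
x = solve_triangular(L.T, z, lower=False)      # backward substitution: L^T x = z

print(np.allclose(A @ x, b))                   # True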

Complexity

  • Both forward and backward substitution require $n^2/2$ operations when $A$ is of size $n \times n$
  • The computation of the Cholesky factor $L$ is considered numerically extremely stable and takes time $n^3/6$

Forward/backward substitution

For a lower triangular matrix $L$ we can write $L\mathbf{x} = \mathbf{b}$ as follows:

$\begin{bmatrix} \ell_{11} & 0 & \cdots & 0 \\ \ell_{21} & \ell_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \ell_{n1} & \ell_{n2} & \cdots & \ell_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$

Notice that $x_1 = b_1 / \ell_{11}$, and so we can solve for $x_2$ by substituting $x_1$ in, i.e. forward substitution.

Doing this repeatedly leads to the full solution of $L\mathbf{x} = \mathbf{b}$, that is:

$x_i = \frac{1}{\ell_{ii}}\Big(b_i - \sum_{j=1}^{i-1} \ell_{ij}\, x_j\Big), \qquad i = 1, \dots, n$

For backward substitution we do the exact same thing, just in reverse order, since we are then dealing with an upper triangular matrix.
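
A bare-bones forward substitution, just to make the recursion explicit (a sketch; in practice you would call something like scipy.linalg.solve_triangular):

import numpy as np

def forward_substitution(L, b):
    """Solve L x = b for lower triangular L using the recursion above."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        # x_i = (b_i - sum_{j<i} l_ij x_j) / l_ii
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.5, 1.0, 1.5]])
b = np.array([2.0, 5.0, 3.0])
x = forward_substitution(L, b)
print(np.allclose(L @ x, b))   # True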

Appendix B: Derivations

Kernel

Declarations

  • $\mathcal{X}$ = input space
  • $\mathbb{R}^N$ = feature space
  • $\boldsymbol{\phi}: \mathcal{X} \to \mathbb{R}^N$ maps a $D$-dimensional input vector $\mathbf{x}$ into the $N$-dimensional feature space
  • $\Phi(X)$ = matrix with columns $\boldsymbol{\phi}(\mathbf{x})$ for all training inputs

Model

$f(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^\top\mathbf{w}$

where $\mathbf{w}$ is now of length $N$. The analysis is analogous to the standard linear model, except that everywhere $\Phi(X)$ is substituted for $X$ (see the sketch below).
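
A sketch of that substitution in code (my own feature map and toy data, not from the book): the weight posterior from the standard linear model goes through unchanged with $\Phi(X)$ in place of $X$.

import numpy as np

np.random.seed(3)

# feature map phi: R -> R^3 (my own choice), stacked as columns to form Phi(X)
phi = lambda x: np.stack([np.ones_like(x), x, x**2])

n, sigma_n = 30, 0.2
x = np.random.uniform(-2, 2, n)
y = 0.5 * x**2 - x + sigma_n * np.random.randn(n)

Phi = phi(x)                  # N x n, plays the role of the design matrix X
Sigma_p = np.eye(3)

# same posterior as in the standard linear model, with Phi(X) substituted for X
A = Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, Phi @ y) / sigma_n**2

print(w_bar)   # roughly (0, -1, 0.5), the coefficients used to generate y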

Appendix C: Exercises

Chapter 2: Regression

Ex. 1

First we import the stuff that needs to be imported

from __future__ import division

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

After having some issues, I found this article from which I used some of the code, so thanks a lot to Chris Fonnesbeck!

We start out by the following specification for our Gaussian Process:

$m(x) = 0, \qquad k(x, x') = \theta_0 \exp\Big(-\frac{\theta_1}{2}(x - x')^2\Big)$

def exponential_cov(x, y, theta):
    return theta[0] * np.exp(-0.5 * theta[1] * np.subtract.outer(x, y) ** 2)

Then, to compute the conditional probability for some new input $x_*$ conditioned on the previously seen inputs and their outputs, we use the following:

$\mathbf{f}_* \mid X_*, X, \mathbf{f} \sim \mathcal{N}\big(K(X_*, X)K(X, X)^{-1}\mathbf{f},\; K(X_*, X_*) - K(X_*, X)K(X, X)^{-1}K(X, X_*)\big)$

where, again, remember that $\mathbf{f}_*$ and $\mathbf{f}$ represent the test and training outputs, i.e. in the case of the training (already seen) inputs:

$\mathbf{f} = \big(f(x_1), \dots, f(x_n)\big)^\top$

def conditional(X_star, X, y, theta):
    B = exponential_cov(X_star, X, theta)         # K(X_*, X)
    C = exponential_cov(X, X, theta)              # K(X, X)
    A = exponential_cov(X_star, X_star, theta)    # K(X_*, X_*)

    mu = np.linalg.inv(C).dot(B.T).T.dot(y)       # K(X_*, X) K(X, X)^{-1} y
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))  # K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)

    return mu.squeeze(), sigma.squeeze()  # simply unwraps any 1-D arrays, so (1, 100) => (100,)

For now we set $\theta = (1, 1)$:

theta = (1, 1)
sigma_theta = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma_theta, capsize=0)
plt.ylim(-3, 3)

# some sampled priors
f1_prior = np.random.multivariate_normal(np.zeros(len(xpts)), exponential_cov(xpts, xpts, theta))
f2_prior = np.random.multivariate_normal(np.zeros(len(xpts)), exponential_cov(xpts, xpts, theta))
f3_prior = np.random.multivariate_normal(np.zeros(len(xpts)), exponential_cov(xpts, xpts, theta))
plt.plot(xpts, f1_prior)
plt.plot(xpts, f2_prior)
plt.plot(xpts, f3_prior)

gp_prior_article.png

We select an arbitrary starting point, say $x = 1.0$, and simply sample from a normal distribution with unit covariance. This is just so we have a "seen" sample to play with.

x = [1.]
y = [np.random.normal(scale=sigma_theta)]
np.array(y)

We now update our confidence band, given the point we just sampled, by using the covariance function to generate new point-wise confidence intervals, conditional on the value $y$ seen above.

At first glance it might look like conditional and predict_noiseless are the same function, but the difference is that predict_noiseless returns the mean and (noiseless) variance at each point, whereas conditional returns the mean and the entire covariance matrix over all the test points.

So if you want to draw function samples (as we do later on), you'll want to use conditional, while if you're simply after the mean line and its uncertainty at each evaluated point you use predict_noiseless.

sigma_1 = exponential_cov(x, x, theta)  # compute covariance for new point

def predict_noiseless(x, data, kernel, theta, sigma, f):
    K = [kernel(x, y, theta) for y in data]           # covariance between test input and training input
    sigma_inv = np.linalg.inv(sigma)                      # inverting the covariance matrix for training input
    K_dot_sigma_inv = np.dot(K, sigma_inv)
    y_pred = K_dot_sigma_inv.dot(f)                   # compute the predicted mean 
    sigma_new = kernel(x, x, theta) - K_dot_sigma_inv.dot(K)  # covariance of new point

    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 2000)
predictions = [predict_noiseless(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0, alpha=0.6)
plt.plot(x, y, "ro")
plt.plot(x_pred, y_pred, c="c")
plt.ylim(-3, 3)

gp_posterior_article_1.png

Notice that here we are predicting a mean $\mu_*$ and variance $\sigma_*^2$ at each test point, but for fun's sake we can sample some functions too!

$\mathbf{f}_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$

That is, we create a normal distribution of dimension equal to the number of test inputs and draw from it; the sample at index i then corresponds to $f(x_i)$ for the i-th test input, yah see? It's weird, yes, but it allows the sampled function values to be any real value!

# same as above
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0, alpha=0.3, label=r"$\sigma_*$")
plt.plot(x, y, "ro")
plt.plot(x_pred, y_pred, c="black", label=r"$\mu_*$")
plt.ylim(-3, 3)
plt.xlim(-3, 3)

# Some sampled functions
mu_star, K_star = conditional(x_pred, x, y, theta)       # x_pred conditioned on training input (x, y)
f1 = np.random.multivariate_normal(mu_star, K_star)
f2 = np.random.multivariate_normal(mu_star, K_star)
f3 = np.random.multivariate_normal(mu_star, K_star)
plt.plot(x_pred, f1, label="$f_*^{(1)}$")
plt.plot(x_pred, f2, label="$f_*^{(2)}$")
plt.plot(x_pred, f3, label="$f_*^{(3)}$")
plt.legend()

gp_article_function_draw.png

Issues

I had multiple issues with following the book:

  • The covariance matrix of the input matrix that I come up with is not positive definite => I can't use the Cholesky decomposition to draw samples, nor to circumvent having to compute the inverse. There is no guarantee for this to be the case either; the only guarantee is that for real data the covariance matrix is positive semi-definite, which still allows drawing from a Gaussian by basically replacing each $L$ step with the equivalent SVD step (where we use the product $U S^{1/2}$ instead of $L$).
  • What to transpose and so on in the computation of the mean vector and covariance matrix for the predictions I got wrong multiple times… IT'S HARD, MKAY?!
Some of my own deprecated / horrible work

To generate samples $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, K)$ with arbitrary mean $\mathbf{m}$ and covariance matrix $K$ using a scalar Gaussian generator, we proceed as follows:

  1. Compute the Cholesky decomposition (also known as the "matrix square root") $L$ of the positive-definite symmetric covariance matrix $K = LL^\top$, where $L$ is lower triangular.
  2. Generate $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, I)$ by multiple separate calls to the scalar Gaussian generator.
  3. Compute $\mathbf{x} = \mathbf{m} + L\mathbf{u}$, which has the desired distribution with:
    • Mean $\mathbf{m}$
    • Cov. $L\,\mathbb{E}[\mathbf{u}\mathbf{u}^\top]\,L^\top = LL^\top = K$ (by independence of the elements of $\mathbf{u}$)
def gaussian_draws(m, K, N=10000):
    """
    IMPORTANT: This version uses the Cholesky decomposition which ONLY works if the
    matrix `X` is positive definite symmetric, which is often not the case! A covariance
    matrix is only guaranteed to be SEMI-positive definite symmetric.
    """
    dim = m.shape[0]
    X = np.zeros((N, dim))

    L = np.linalg.cholesky(K)

    for i in range(N):
        u = np.random.normal(0, 1, dim)
        X[i, :] = m + np.dot(L, u)

    if N == 1:
        return X[0]  # return a single vector if only 1 sample is requested
    else:
        return X

m = np.array([2, 1, 0])
K = np.array([[2, -1, 0], [-1, 2, -1], [0, -1, 2]])  # this is positive-definite

X = gaussian_draws(m, K, 10)
X

Of course I could just have taken advantage of the fact that numpy can generate the entire sample matrix for me, buuut I dunno, felt like doing some work myself, though it's not much.

And if you want to know how we can draw from a positive semi-definite symmetric covariance matrix, check out this, which explains how you ought to use the SVD in this case.
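
For completeness, a sketch of that SVD variant (my own code following the idea above): for a symmetric PSD matrix $K = U S U^\top$, the factor $U S^{1/2}$ plays the role of $L$.

import numpy as np

def gaussian_draws_svd(m, K, N=10):
    """Draw N samples from N(m, K) where K is only positive SEMI-definite."""
    U, S, _ = np.linalg.svd(K)       # K symmetric PSD  =>  K = U diag(S) U^T
    A = U * np.sqrt(S)               # same as U @ diag(sqrt(S)); satisfies A A^T = K
    u = np.random.normal(size=(N, len(m)))
    return m + u @ A.T

# rank-deficient (hence only PSD) covariance, where np.linalg.cholesky would fail
K = np.array([[1.0, 1.0], [1.0, 1.0]])
X = gaussian_draws_svd(np.zeros(2), K, N=100000)
print(np.round(np.cov(X, rowvar=False), 2))   # approximately [[1, 1], [1, 1]]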

And then we need to be able to compute the squared exponential kernel function:

$k(\mathbf{x}_p, \mathbf{x}_q) = \exp\Big(-\frac{1}{l}\,\lVert\mathbf{x}_p - \mathbf{x}_q\rVert^2\Big)$

def squared_exponential(X, l=2):
    """
    DEPRECATED. Use exponential_cov instead. Much better.

    Computes the squared exponential (also known as Radial Basis function) of the 
    the matrix `X`.

    Assumes `X` to be of the shape (n, p) where n is the number of samples, and p is the
    number of random variables for which we have samples.
    """
    dim = X.shape[0]
    K = np.zeros((dim, dim))

    for p in range(dim):
        x_p = X[p]

        for q in range(p, dim):
            x_q = X[q]
            diff = np.subtract(x_p, x_q)

            # since the resulting covariance matrix will be symmetric we only need to compute
            # the triangular matrix, and then copy the values.
            se = np.exp(- (1 / l) * (diff ** 2).sum())
            K[p, q] = se
            # stop us from assigning to the diagonal entries
            # twice, being SLIGHTLY more performant, yey!
            if p != q:
                K[q, p] = se


    return K

squared_exponential(np.arange(3))

Ex. 2

Bibliography

  • [rasmussen2006gaussian] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.