Gaussian Processes for Machine Learning

Table of Contents


Gaussian processes is a generalization of Gaussian probability distribution to continuous case where we deal with these infinite number of functions by only considering a finite number of points.

Two common appraoches to deal with supervised learning:

  1. Restrict the class of functions that we consider
  2. Prior probability to every possible function
    • How do we compare a infinite number of functions?

A Pictorial Introduction to Bayesian Modelling

1D regression

  • Here we specify a prior over functions specified by a particular Gaussian process which favours smooth functions
  • We assume a 0 mean for every x , since we don't have any prior knowledge
  • We then add two data points $\mathcal{D} = \{(\mathbf{x}_1, y_1) , (\mathbf{x}_2, y_2)\}$ and only consider functions passing through these two points, ending up with the posterior in 1.1 b


Figure 1: From rasmussen2006gaussian


There are multiple ways to interpret a Gaussian process (GP)

Weight-space view


  • $\mathcal{D} = \{(\mathbf{x}_i, y_i | i = 1, \dots , n \}$ is $\mathcal{D}$ of $n$ observations
  • $\mathbf{x}$ = input vector of dimension $D$
  • $y$ = scalar output/target
  • $X$ = $D \times n$ design matrix with all $n$ input vectors as column vectors $\rightarrow$ $\mathcal{D} = (X, \mathbf{y})$

Standard Linear Model

Bayesian analysis of the standard linear regression model with Gaussian noise:

f(\mathbf{x}) = \mathbf{x}^T \mathbf{w}, \quad y = f(\mathbf{x}) + \varepsilon,
\quad \varepsilon \sim \mathcal{N} (0, \sigma^2_n)
  • Often a bias is included, but this can easily be implemented by adding a extra element to the input vector $\mathbf{x}$, whose value is always $1$
  • The noise assumption ($\varepsilon$ to be normally distributed) together with the model direcly gives us the likelihood, ergo the probability density of the observations given the parameters.

Because of (assumed) independence between samples:

p(\mathbf{y} | X, \mathbf{w}) = \overset{n}{\underset{i=1}{\prod}}
p(y_i | \mathbf{x}_i, \mathbf{w}_i) = \mathcal{N} (X^T \mathbf{w}, \sigma^2_n I)

In the Bayesian formalism we need to specify a prior over the parameters, expressing our beliefs about the parameters before considering the observations. We simply give a zero mean Gaussian with covariance matrix $\sum_p$ on the weights

\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)

Inference in a Bayesian linear model is based on the posterior distribution over the weights:

\text{posterior} = \frac{\text{likelihood} \times \text{prior}}
{\text{marginal likelihood}}, \quad
p(\mathbf{w} | \mathbf{y}, X) = \frac{p(\mathbf{y} | X, \mathbf{w}) p(\mathbf{w})} {p(\mathbf{y} | X)}

where the normalization constant or marginal likelihood is given by

p(\mathbf{y} | X) = \int p(\mathbf{y} | X, \mathbf{w}) p(\mathbf{w}) d\mathbf{w}

If we can obtain $p(\mathbf{y} | X)$ without explicitly calculating this integral, this is what they call "integrating out the weights".

For now we ignore the marginal distribution in the denominator, and then we have

p(\mathbf{w} | X, \mathbf{y}) & \propto p(\mathbf{y} | X, \mathbf{w}) p(\mathbf{w}) \\
& \propto exp \bigg( - \frac{1}{ 2 \sigma^2_n } \big(\mathbf{y} - X^T \mathbf{w})^T(\mathbf{y} - X^T \mathbf{w} \big) \bigg)
exp \bigg( - \frac{1}{2} \mathbf{w}^T \Sigma^{-1}_p \mathbf{w} \bigg) \\
& \propto exp \bigg( - \frac{1}{2} \big(\mathbf{w} - \mathbf{\overline{w}} \big)^T 
\Big( \frac{1}{\sigma^2_n} X X^T + \Sigma^{-1}_p \Big) 
\big(\mathbf{w} - \mathbf{\overline{w}} \big) \bigg) \\

where $\mathbf{\overline{w}} =  \sigma^{-2}_n (\sigma^{-2}_n X X^T + \Sigma^{-1}_p )^{-1} X \mathbf{y}$. Notice how this has the form of a Gaussian posterior with mean $\mathbf{\overline{w}}$ and covariance matrix $A^{-1}$:

p( \mathbf{w} | X, \mathbf{y} ) \sim \mathcal{N} \Big( \mathbf{\overline{w}} = \frac{1}{\sigma^2_n} A^{-1} X \mathbf{y}, \quad A^{-1} \Big)

where $A = \frac{1}{\sigma^2_n} X X^T + \Sigma^{-1}_p$.

I suppose this is legit without having to compute the marginal likelihood since we can easily obtain the normalization constant for a Gaussian Distribution.

For this model, and indeed for any Guassian posterior, the mean of the posterior $p( \mathbf{w} | \mathbf{y}, X)$ is also its mode, which is also called the maximum a posteriori (MAP) estimate of $\mathbf{w}$.

The meaning of the log prior and the MAP estimate substantially differs between Bayesian and non-Bayesian settings:

  • Bayesian $\rightarrow$ MAP estimate plays no special role
  • Non-Bayesian $\rightarrow$ negative prior is sometimes used as penalty term, and MAP point is known as the penalized maxium likelihood estimate of the weights. In fact, this is what is known as ridge regression, because of the quadratic term $\frac{1}{2} \mathbf{w}^T \Sigma^{-1}_p \mathbf{w}$ from the log prior.

To make predictions for some input $\mathbf{x_*}$, we average over all the parameter values, weighted by their posterior probability. Thus the predictive distribution for $f_* = f( \mathbf{x_*} )$, is given by averaging (or rather, taking the expectation for $f_*$ over the weights $w$) the output of all possible linear models w.r.t. the Gaussian posterior

p(f_* | \mathbf{x_*}, X, y) &= \int p(f_* | \mathbf{x_* }, \mathbf{w}) p(\mathbf{w} | X, y) d \mathbf{w} \\
&= \mathcal{N} \Big( \frac{1}{\sigma^2_n} \mathbf{x_*^T} A^{-1} X \mathbf{y}, \quad \mathbf{x_*}^T A^{-1} \mathbf{x_*} \Big) \\

And even the predictive distribution is Gaussian! With a mean given by the posterior mean of the weights from eq. (2.8) multiplied by the new input. All hail Gauss!

The above step demonstrates a crucial part: we can consider all possible weights simply by "integrating out" or "marginalizing over" the weights.

Also note how the predictive variance is a quadratic of the new input, and so the prediction will be more uncertain as the magnitude of test input grows.

Projections of Inputs into Feature Space

  • Can overcome linearity by projecting features into higher-dimensional space allowing us to implement polynomial regression
  • If projections are independent of parameters $\mathbf{w}$ => model is still linear in the parameters and therefore analytically solvable/tractable!
  • Data which is not linearly separable in the original data space may become linearly separable in a high dimensional feature space.

Function-space view

Our other approach is to consider inference directly in the function space, i.e. use a GP to describe a distribution over functions.

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP is completely specified by its mean function $m (\mathbf{x})$ and covariance function $K(\mathbf{x}, \mathbf{x'} )$ . For a real process $f(\mathbf{x})$

     m(\mathbf{x}) &= E[f(\mathbf{x})] \\
     k(\mathbf{x}, \mathbf{x'}) &= E[(f(\mathbf{x}) - m (\mathbf{x}))(f(\mathbf{x'}) - m (\mathbf{x'}))] \\

and we write the Gaussian process as

f(\mathbf{x}) \sim \mathcal{GP}\Big(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x^{'}}) \Big)

Usually, for notational simplicity we will take $m(\mathbf{x}) = 0$, though this need-not be done.

Consistency requirement / marginalization property

GP is defined as collection of rvs $\rightarrow$ a consistency requirement. This simply means that if the GP e.g. specifies $(y_1, y_2) \sim \mathcal{N}(\mathbf{\mu}, \Sigma)$, $\implies y_1 \sim \mathcal{N} (\mu_1, \Sigma_{11})$

The defintion does not exclude GPs with finite index sets (which would just be Gaussian distributions).

Simple example

Example of GP can be obtained from our Bayesian linear regression model $f(\mathbf{x}) = \psi(\mathbf{x}^T \mathbf{w})$ with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$.

E[f(\mathbf{x})] &= \psi(\mathbf{x})^T E[\mathbf{w}] = 0 \\
E[f(\mathbf{x}) f(\mathbf{x'})] &= \psi(\mathbf{x})^T E[\mathbf{w} \mathbf{w}^T] \psi(\mathbf{x'})
= \psi(\mathbf{x})^T \Sigma_p \psi(\mathbf{x'}) 

Ergo, $f(\mathbf{x})$ and $f(\mathbf{x'})$ are jointly Gaussian with zero mean and covariance given by $\psi(\mathbf{x})^T \Sigma_p \psi(\mathbf{x'})$.

Predictions using noise-free observations

  • We know $\{(\mathbf{x}_i , f_i) | i = 1, \dots , n\}$
  • $\mathbf{f}$ = vector of training outputs
  • $\mathbf{f_*}$ = vector of test outputs
  • Join distribution of training and test outputs according to the prior is then
\mathbf{f} \\
\sim \mathcal{N} \bigg( \mathbf{0}, 
K(X,X) & K(X, X_*) \\
K(X_*, X) & K(X_*, X_*) 
  • If $n$ training points and $n*$ test points, the matrix $K(X, X_*)$ = covariaces evaluated at all pairs of training and test points, and similarily for $K(X, X)$, etc.
  • Restrict this joint prior distribution to contain only those functions which agree with the observed data points
    • A lot of functions mate $\rightarrow$ not efficient
    • In probabilistic terms this operation is simple: equivalent of conditioning on the joint Gaussian prior distribution on the observations to give

\mathbf{f_*} | X_*, X, \mathbf{f} \sim &
\mathcal{N} \Big( K(X_*, X)K(X, X)^{-1} \mathbf{f}, \\
& K(X_*, X_*) - K(X_*, X)K(X,X)^{-1}K(X, X_*) \Big)
  • Now we can sample values for $\mathbf{f}_*$ (corresponding to test inputs $X_*$) from the joint posterior distribution by evaluating the mean and covariance matrix from the eqn. above.

Appendix A: Stuff to know

Sampling from a Multivariate Gaussian

Assuming we have access to a Gaussian scalar generator.

Want samples $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, K)$ with arbitrary mean $\mathbf{m}$ and covariance $K$.

  1. Compute Cholesky decomposition (also known as "matrix square root") $L$ of the positive (semi? not sure) definite symmetric covariance matrix $K = L L^T$
    • $L$ is lower triangular matrix
  2. Generate $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, I)$ by multiple separate calls to the scalar Gaussian generator
  3. Compute $\mathbf{x} = \mathbf{m} + L \mathbf{u}$
    • Has desired distribution with mean $m$ and covariance $L E[\mathbf{u}\mathbf{u}^T] L^T = LL^T = K$, by independence of $\mathbf{u}$

Cholesky Decomposition

  • Decomposes a symmetric, positive definite matrix $A$ into a product of a lower triangular matrix $L$ and its transpose:

A = LL^T
  • Useful for solving linear systems with symmetric, positive definite coefficient matrix $A$.
    1. Want to solve $Ax = b$ for $x$
    2. Solve the triangular system $Ly = b$ by forward substitution
    3. Solve triangular system $L^Tx = y$ by backward substition
    4. Solution is $x = L^T \text{\\} (L \text{\\} b)$
      • $A\text{\\}b$ = vector $x$ which solves $Ax = b$


  • Both forward and backward substitution requires $\frac{n^2}{2}$ operations, when $A$ is of size $n \times n$.
  • The computation of the Cholesky factor $L$ is considered numerically extremely stable and takes time $\frac{n^3}{6}$.

Forward/backward substitution

For a lower triangular matrix we can write $L \mathbf{x} = y$ as follows

l_{1,1} & & & & &= b_1 \\
l_{2,1} & + & l_{2,2} & & &= b_2 \\
\vdots & & \vdots & \ddots &  & \vdots \\
l_{m,1} & + & \dots & + & l_{m,m} &= b_m 

Notice that $l_{1,1} x_1 = b_1$, and so we can solve $l_{2,1} x_1 + l_{2,2} x_2 = b_2$ by substituting $l_{1,1}$ in, i.e. forward substitution.

Doing this repeatedly leads to the full solution for $L \mathbf{x} = y$, that is:

x_{2} &= {\frac {b_{2}-\ell _{2,1}x_{1}}{\ell _{2,2}}}, \\
x_{2} &= {\frac {b_{2}-\ell _{2,1}x_{1}}{\ell _{2,2}}}, \\
& \vdots \\
x_{m} &= {\frac {b_{m}-\sum _{i=1}^{m-1}\ell _{m,i}x_{i}}{\ell _{m,m}}}

For backward substitution we do the exact same thing, just in reverse, since we're then dealing with a upper triangular matrix.

Appendix B: Derivations



  • $\mathcal{R}^D$ = input space
  • $\mathcal{R}^N$ = feature space
  • $\phi(\mathbf{x}) : \mathcal{R}^D \rightarrow \mathcal{R}^N$
  • $\Phi(X)$ = matrix with columns $\psi(\mathbf{x})$ for all training inputs


f(\mathbf{x}) = \psi( \mathbf{x} )^T \mathbf{w}

where $\mathbf{w}$ is of length $N$. This is analogous to the standard linear model except everywhere $\Phi (X)$ is substituted for $X$.

Appendix C: Exercises

Chapter 2: Regression

Ex. 1

First we import the stuff that needs to be imported

    %matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

    from __future__ import division

After having some issues, I found this article from which I used some of the code, so thanks a lot to Chris Fonnesbeck!

We start out by the following specification for our Gaussian Process:

m(x) = 0, \qquad k(x, x') = \theta_0 \exp \Big( - \frac{\theta_1}{2} |x - x'|^2 \Big)
def exponential_cov(x, y, theta):
    return theta[0] * np.exp(-0.5 * theta[1] * np.subtract.outer(x, y) ** 2)

Then to compute the conditional probability for some new input $x$ conditioned on the previously send inputs with outputs, we use the following:

    \mathbf{f_*} | X_*, X, \mathbf{f} \sim &
    \mathcal{N} \Big( K(X_*, X)K(X, X)^{-1} \mathbf{f}, \\
    & K(X_*, X_*) - K(X_*, X)K(X,X)^{-1}K(X, X_*) \Big)

where, again, remember that $\mathbf{f}_*$ and $\mathbf{f}$ represents the test and training outputs, i.e in the case of training (already seen) inputs:

\mathbf{f} = 
f(x^{(1)}) \\
f(x^{(2)}) \\
\vdots \\
f(x^{(n)}) \\
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(n)} \\
def conditional(X_star, X, y, theta):
    B = exponential_cov(X_star, X, theta)
    C = exponential_cov(X, X, theta)
    A = exponential_cov(X_star, X_star, theta)

    mu = np.linalg.inv(C).dot(B.T)
    sigma = A -

    return mu.squeeze(), sigma.squeeze()  # simply unwraps any 1-D arrays, so (1, 100) => (100,)

For now we set $\theta_0 = 1, \quad \theta_1 = 1$:

theta = (1, 1)
sigma_theta = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma_theta, capsize=0)
    plt.ylim(-3, 3)

    # some sampled priors
    f1_prior = np.random.multivariate_normal(np.zeros(len(xpts)), exponential_cov(xpts, xpts, theta))
f2_prior = np.random.multivariate_normal(np.zeros(len(xpts)), exponential_cov(xpts, xpts, theta))
f3_prior = np.random.multivariate_normal(np.zeros(len(xpts)), exponential_cov(xpts, xpts, theta))
plt.plot(xpts, f1_prior)
plt.plot(xpts, f2_prior)
plt.plot(xpts, f3_prior)


We select an arbitrary starting point, say $x = 1$, and simply sample from a normal distribution with unit covariance. This is just for us to have a "seen" sample to play with.

x = [1.]
y = [np.random.normal(scale=sigma_theta)]

We now update our confidence band, given the point we just sampled, by using the covariance function to generate new point-wise confidence intervals, conditional on the value $(x_0, y_0)$ seen above.

At a first glance it might look like conditional and predict_noiseless are the same functions, but the difference is that predict_noiseless returns the mean and (noiseless) variance at each point where as conditional returns the mean and the entire covariance matrix for all the test points.

So if you want to draw function samples (as we do later on), you'll want to use the conditional, while if you're simply after the mean-line and it's uncertainty at each evaluated point you use the predict_noiseless.

sigma_1 = exponential_cov(x, x, theta)  # compute covariance for new point

def predict_noiseless(x, data, kernel, theta, sigma, f):
    K = [kernel(x, y, theta) for y in data]           # covariance between test input and training input
    sigma_inv = np.linalg.inv(sigma)                      # inverting the covariance matrix for training input
    K_dot_sigma_inv =, sigma_inv)
    y_pred =                   # compute the predicted mean 
    sigma_new = kernel(x, x, theta) -  # covariance of new point

    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 2000)
predictions = [predict_noiseless(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0, alpha=0.6)
plt.plot(x, y, "ro")
plt.plot(x_pred, y_pred, c="c")
plt.ylim(-3, 3)


Notice that here we are predicting a mean $m$ and covariance $k(x_*, X)$ at each test point, but for fun's sake we can sample some functions too!

\mathbf{f}_* \sim \mathcal{N} \Big( \mathbf{0}, K(X_*, X_*) \Big)

That is; we create a normal distribution of dimension equal to the number of test inputs, draw from this distribution, then the sample at index i corresponds to $f_* (i)$, yah see? It's weird, yes, but allows the indexed sample points to be any real value!

    # same as above
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0, alpha=0.3, label="$\sigma_*$")
plt.plot(x, y, "ro")
plt.plot(x_pred, y_pred, c="black", label="$\mu_*$")
plt.ylim(-3, 3)
    plt.xlim(-3, 3)

# Some sampled functions
mu_star, K_star = conditional(x_pred, x, y, theta)       # x_pred conditioned on training input (x, y)
f1 = np.random.multivariate_normal(mu_star, K_star)
f2 = np.random.multivariate_normal(mu_star, K_star)
f3 = np.random.multivariate_normal(mu_star, K_star)
plt.plot(x_pred, f1, label="$f_*^{(1)}$")
plt.plot(x_pred, f2, label="$f_*^{(2)}$")
plt.plot(x_pred, f3, label="$f_*^{(3)}$")



I had multiple issues with following the book:

  • Covariance matrix of the input matrix that I come up with is not positive definite => can't use Cholesky decomposition to draw samples, nor to circumvent having to compute the inverse. There is no guarantee for this to be the case either, but there is a guarantee that for a real matrix we will have a semi-positive definite covariance matrix, which allows for drawing from a Gaussian by basically replacing each L step with the equivalent SVD step (where we use the product $U \Sigma$ instead of $L$).
  • What to transpose and so on in the computation of the mean vector and covariance matrix for the predictions I got wrong multiple times… IT'S HARD, MKAY?!
Some of my own deprecated / horrible work

To generate samples $\mathbf{x} \sim \mathcal{N} (\mathbf{m}, K)$ with arbitrary mean $\mathbf{m}$ and covariance matrix $K$ using a scalar Gaussian generator, we proceed as follows:

  1. Compute the Cholesky decomposition (also known as the "matrix square root") $L$ of the positive-definite symmetric covariance matrix $K = LL^T$, where $L$ is lower triangular.
  2. Generate $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, I)$ by multiple separate calls to the scalar Gaussian generator.
  3. Compute $\mathbf{x} = \mathbf{m} + L \mathbf{u}$, which has the desired distribution with:
    • Mean $\mathbf{m}$
    • Cov. $L \mathcal(E)[\mathbf{u}\mathbf{u}^T] L^T = LL^T = K$ (by independence of the elements of $\mathbf{u}$)
def gaussian_draws(m, K, N=10000):
    IMPORTANT: This version uses the Cholesky decomposition which ONLY works if the
    matrix `X` is positive definite symmetric, which is often not the case! A covariance
    matrix is only guaranteed to be SEMI-positive definite symmetric.
    dim = m.shape[0]
    X = np.zeros((N, dim))

    L = np.linalg.cholesky(K)

    for i in range(N):
        u = np.random.normal(0, 1, dim)
        X[i, :] = m +, u)

    if N == 1:
        return X[0]  # return a single vector if only 1 sample is requested
        return X

m = np.array([2, 1, 0])
K = np.array([[2, -1, 0], [-1, 2, -1], [0, -1, 2]])  # this is positive-definite

X = gaussian_draws(m, K, 10)

Of course I could just have take advantage of the fact that numpy can generate the entire sample-matrix for me, buuut I dunno, felt like doing some work myself, though it's not much.

And if you want to know how we can draw from a semi-positive definite symmetric matrix, checkout this. Explains how you ought to use SVD in this case.

And then we need to able to compute the squared exponential kernel function:

cov \Big( f(\mathbf{x}_p ), f(\mathbf{x}_q) \Big) = k(\mathbf{x}_p, \mathbf{x}_q) = 
exp \Big(- \frac{1}{2} | \mathbf{x}_p - \mathbf{x}_q |^2 \Big)
def squared_exponential(X, l=2):
            DEPRECATED. Use exponential_cov instead. Much better.

    Computes the squared exponential (also known as Radial Basis function) of the 
    the matrix `X`.

    Assumes `X` to be of the shape (n, p) where n is the number of samples, and p is the
    number of random variables for which we have samples.
    dim = X.shape[0]
    K = np.zeros((dim, dim))

    for p in range(dim):
        x_p = X[p]

        for q in range(p, dim):
            x_q = X[q]
            diff = np.subtract(x_p, x_q)

            # since the resulting covariance matrix will be symmetric we only need to compute
            # the triangular matrix, and then copy the values.
            se = np.exp(- (1 / l) * (diff ** 2).sum())
            K[p, q] = se
            # stop us from assigning to the diagonal entries
            # twice, being SLIGHTLY more performant, yey!
            if p != q:
                K[q, p] = se

    return K


Ex. 2