Gaussian Processes for Machine Learning
Introduction
A Gaussian process is a generalization of the Gaussian probability distribution to the continuous case, where we deal with the resulting infinite number of function values by only ever considering a finite number of points.
Two common approaches to deal with supervised learning:
- Restrict the class of functions that we consider
- Give a prior probability to every possible function
- But then, how do we compare an infinite number of functions?
A Pictorial Introduction to Bayesian Modelling
1D regression
- Here we specify a prior over functions, given by a particular Gaussian process which favours smooth functions
- We assume a mean of 0 for every $x$, since we don't have any prior knowledge
- We then add two data points and only consider functions passing through these two points, ending up with the posterior in Figure 1.1(b)
Figure 1: From rasmussen2006gaussian
Regression
There are multiple ways to interpret a Gaussian process (GP)
Weight-space view
Notation
- $n$ = number of observations
- $\mathbf{x}$ = input vector of dimension $D$
- $y$ = scalar output/target
- $X$ = $D \times n$ design matrix with all $n$ input vectors as column vectors
Standard Linear Model
Bayesian analysis of the standard linear regression model with Gaussian noise:

$$f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w}, \qquad y = f(\mathbf{x}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$$
- Often a bias is included, but this can easily be implemented by adding an extra element to the input vector $\mathbf{x}$, whose value is always 1
- The noise assumption ($\varepsilon$ normally distributed with zero mean and variance $\sigma_n^2$) together with the model directly gives us the likelihood, ergo the probability density of the observations given the parameters.
Because of (assumed) independence between samples:

$$p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \mathcal{N}(X^\top \mathbf{w}, \sigma_n^2 I)$$
In the Bayesian formalism we need to specify a prior over the parameters, expressing our beliefs about the parameters before considering the observations. We simply put a zero-mean Gaussian prior with covariance matrix $\Sigma_p$ on the weights: $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$.
Inference in a Bayesian linear model is based on the posterior distribution over the weights:

$$p(\mathbf{w} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)}$$
where the normalization constant or marginal likelihood is given by

$$p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}$$
If we can obtain the posterior without explicitly calculating this integral, this is what they call "integrating out the weights".
For now we ignore the marginal distribution in the denominator, and then we have

$$p(\mathbf{w} \mid X, \mathbf{y}) \propto \exp\Big(-\tfrac{1}{2\sigma_n^2} (\mathbf{y} - X^\top \mathbf{w})^\top (\mathbf{y} - X^\top \mathbf{w})\Big) \exp\Big(-\tfrac{1}{2} \mathbf{w}^\top \Sigma_p^{-1} \mathbf{w}\Big) \propto \exp\Big(-\tfrac{1}{2} (\mathbf{w} - \bar{\mathbf{w}})^\top A (\mathbf{w} - \bar{\mathbf{w}})\Big)$$

where $A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}$. Notice how this has the form of a Gaussian posterior with mean $\bar{\mathbf{w}}$ and covariance matrix $A^{-1}$:

$$p(\mathbf{w} \mid X, \mathbf{y}) \sim \mathcal{N}(\bar{\mathbf{w}}, A^{-1})$$

where $\bar{\mathbf{w}} = \sigma_n^{-2} A^{-1} X \mathbf{y}$.
I suppose this is legit without having to compute the marginal likelihood since we can easily obtain the normalization constant for a Gaussian Distribution.
For this model, and indeed for any Gaussian posterior, the mean of the posterior is also its mode, which is also called the maximum a posteriori (MAP) estimate of $\mathbf{w}$.
The meaning of the log prior and the MAP estimate substantially differs between Bayesian and non-Bayesian settings:
- Bayesian: the MAP estimate plays no special role
- Non-Bayesian: the negative log prior is sometimes used as a penalty term, and the MAP point is known as the penalized maximum likelihood estimate of the weights. In fact, this is what is known as ridge regression, because of the quadratic penalty term coming from the log prior.
To make predictions for a test input $\mathbf{x}_*$, we average over all parameter values, weighted by their posterior probability. Thus the predictive distribution for $f_* = f(\mathbf{x}_*)$ is given by averaging (or rather, taking the expectation over the weights $\mathbf{w}$ w.r.t. the Gaussian posterior) the output of all possible linear models:

$$p(f_* \mid \mathbf{x}_*, X, \mathbf{y}) = \int p(f_* \mid \mathbf{x}_*, \mathbf{w})\, p(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w} = \mathcal{N}\Big(\sigma_n^{-2}\, \mathbf{x}_*^\top A^{-1} X \mathbf{y},\ \mathbf{x}_*^\top A^{-1} \mathbf{x}_*\Big)$$
And even the predictive distribution is Gaussian! With a mean given by the posterior mean of the weights from eq. (2.8) multiplied by the new input. All hail Gauss!
The above step demonstrates a crucial part: we can consider all possible weights simply by "integrating out" or "marginalizing over" the weights.
Also note how the predictive variance is a quadratic form in the test input, so the prediction will become more uncertain as the magnitude of the test input grows.
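Just to convince myself the formulas above work, here's a minimal numpy sketch of the weight-space posterior and predictive distribution. The synthetic 1D data, the noise level $\sigma_n$ and the prior covariance $\Sigma_p$ are all made-up assumptions for illustration.

import numpy as np

np.random.seed(0)

# synthetic 1-D data; the true weight and noise level are arbitrary choices
n = 20
X = np.random.uniform(-3, 3, size=(1, n))   # design matrix with inputs as columns (D = 1)
w_true = np.array([0.7])
sigma_n = 0.3
y = X.T.dot(w_true) + sigma_n * np.random.randn(n)

Sigma_p = np.eye(1)                          # prior covariance on the weights

# posterior over the weights: A = sigma_n^-2 X X^T + Sigma_p^-1, w_bar = sigma_n^-2 A^-1 X y
A = X.dot(X.T) / sigma_n ** 2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
w_bar = A_inv.dot(X).dot(y) / sigma_n ** 2

# predictive distribution at a test input x_*: mean x_*^T w_bar, variance x_*^T A^-1 x_*
x_star = np.array([2.5])
f_star_mean = x_star.dot(w_bar)
f_star_var = x_star.dot(A_inv).dot(x_star)
print(f_star_mean, f_star_var)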
Projections of Inputs into Feature Space
- Can overcome the limitation of linearity by projecting the inputs into a higher-dimensional feature space, allowing us to implement e.g. polynomial regression
- If projections are independent of parameters => model is still linear in the parameters and therefore analytically solvable/tractable!
- Data which is not linearly separable in the original data space may become linearly separable in a high dimensional feature space.
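As a small sketch of what such a projection looks like (the `phi` helper and the example inputs are my own assumptions for illustration):

import numpy as np

def phi(x, degree=3):
    # hypothetical polynomial feature map: (1, x, x^2, ..., x^degree)
    return np.array([x ** d for d in range(degree + 1)])

# Feature-space "design matrix" with phi(x_i) as columns; the model f(x) = phi(x)^T w
# is non-linear in x but still linear in the parameters w, so everything above carries over.
inputs = [-1.0, 0.5, 2.0]
Phi = np.column_stack([phi(x) for x in inputs])
print(Phi.shape)   # (4, 3): feature dimension x number of inputs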
Function-space view
Our other approach is to consider inference directly in the function space, i.e. use a GP to describe a distribution over functions.
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
A GP is completely specified by its mean function $m(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$. For a real process $f(\mathbf{x})$ we define

$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\big[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))\big]$$

and we write the Gaussian process as

$$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\big)$$
Usually, for notational simplicity, we will take $m(\mathbf{x}) = 0$, though this need not be done.
Consistency requirement / marginalization property
A GP is defined as a collection of random variables, which implies a consistency requirement, also known as the marginalization property. This simply means that if the GP e.g. specifies $(y_1, y_2) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, then it must also specify $y_1 \sim \mathcal{N}(\mu_1, \Sigma_{11})$, where $\Sigma_{11}$ is the relevant sub-matrix of $\Sigma$.
The definition does not exclude GPs with finite index sets (which would just be Gaussian distributions).
Simple example
An example of a GP can be obtained from our Bayesian linear regression model $f(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{w}$ with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$, for which

$$\mathbb{E}[f(\mathbf{x})] = \phi(\mathbf{x})^\top \mathbb{E}[\mathbf{w}] = 0, \qquad \mathbb{E}[f(\mathbf{x}) f(\mathbf{x}')] = \phi(\mathbf{x})^\top \mathbb{E}[\mathbf{w}\mathbf{w}^\top]\, \phi(\mathbf{x}') = \phi(\mathbf{x})^\top \Sigma_p\, \phi(\mathbf{x}')$$

Ergo, $f(\mathbf{x})$ and $f(\mathbf{x}')$ are jointly Gaussian with zero mean and covariance given by $\phi(\mathbf{x})^\top \Sigma_p\, \phi(\mathbf{x}')$.
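To make this concrete, here's a small sketch where the feature map and $\Sigma_p$ are arbitrary choices of mine: it builds the induced covariance matrix with entries $\phi(x)^\top \Sigma_p\, \phi(x')$ and draws function values from the resulting GP prior.

import numpy as np

def phi(x):
    # arbitrary feature map for illustration: 1, x, x^2
    return np.array([1.0, x, x ** 2])

Sigma_p = np.eye(3)                                   # prior covariance on the weights
xs = np.linspace(-2, 2, 50)
Phi = np.column_stack([phi(x) for x in xs])           # shape (3, 50)

# covariance induced by the Bayesian linear model: K[i, j] = phi(x_i)^T Sigma_p phi(x_j)
K = Phi.T.dot(Sigma_p).dot(Phi)

# drawing w ~ N(0, Sigma_p) and computing f = Phi^T w ...
w = np.random.multivariate_normal(np.zeros(3), Sigma_p)
f_from_weights = Phi.T.dot(w)

# ... gives function values with the same distribution as drawing f directly from N(0, K)
jitter = 1e-10 * np.eye(len(xs))                      # numerical stability: K is rank-deficient
f_from_gp = np.random.multivariate_normal(np.zeros(len(xs)), K + jitter)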
Predictions using noise-free observations
- We know the training outputs exactly (no noise), i.e. we observe $\{(\mathbf{x}_i, f_i) \mid i = 1, \ldots, n\}$
- $\mathbf{f}$ = vector of training outputs
- $\mathbf{f}_*$ = vector of test outputs
- Joint distribution of training and test outputs according to the prior is then
  $$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0},\ \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)$$
- If there are $n$ training points and $n_*$ test points, the matrix $K(X, X_*)$ = the $n \times n_*$ matrix of covariances evaluated at all pairs of training and test points, and similarly for $K(X, X)$, $K(X_*, X_*)$, etc.
- Restrict this joint prior distribution to contain only those functions which agree with the observed data points
- That's a lot of functions, mate; rejecting them one by one is not efficient
In probabilistic terms this operation is simple: it is equivalent to conditioning the joint Gaussian prior distribution on the observations, to give

$$\mathbf{f}_* \mid X_*, X, \mathbf{f} \sim \mathcal{N}\big( K(X_*, X) K(X, X)^{-1} \mathbf{f},\ K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*) \big)$$
- Now we can sample values for $\mathbf{f}_*$ (corresponding to the test inputs $X_*$) from the joint posterior distribution by evaluating the mean and covariance matrix from the eqn. above.
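Here's a minimal sketch of that conditioning step. The squared-exponential kernel, the training points and the jitter terms are assumptions for illustration; I use a Cholesky solve instead of an explicit matrix inverse, which is the numerically sensible thing to do.

import numpy as np

def kernel(a, b):
    # squared-exponential covariance with unit hyperparameters (an assumption for this sketch)
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2)

# noise-free training observations (made-up points for illustration)
X_train = np.array([-2.0, 0.0, 1.5])
f_train = np.sin(X_train)
X_test = np.linspace(-3, 3, 100)

K = kernel(X_train, X_train)
K_s = kernel(X_test, X_train)       # covariances between test and training inputs
K_ss = kernel(X_test, X_test)

# condition the joint Gaussian prior on the observations
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(X_train)))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_train))
mean_post = K_s.dot(alpha)
v = np.linalg.solve(L, K_s.T)
cov_post = K_ss - v.T.dot(v)

# draw three function samples from the posterior
samples = np.random.multivariate_normal(mean_post, cov_post + 1e-10 * np.eye(len(X_test)), size=3)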
Appendix A: Stuff to know
Sampling from a Multivariate Gaussian
Assuming we have access to a Gaussian scalar generator.
Want samples $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, K)$ with arbitrary mean $\mathbf{m}$ and covariance matrix $K$.
- Compute the Cholesky decomposition $K = LL^\top$ (also known as the "matrix square root") of the positive (semi? not sure) definite symmetric covariance matrix $K$
- $L$ is a lower triangular matrix
- Generate $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, I)$ by multiple separate calls to the scalar Gaussian generator
- Compute $\mathbf{x} = \mathbf{m} + L\mathbf{u}$
- $\mathbf{x}$ has the desired distribution with mean $\mathbf{m}$ and covariance $L\,\mathbb{E}[\mathbf{u}\mathbf{u}^\top]\,L^\top = LL^\top = K$, by independence of the elements of $\mathbf{u}$
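A compact numpy version of this recipe (essentially a vectorized form of the `gaussian_draws` function further down; the example `m` and `K` are just for illustration):

import numpy as np

m = np.array([2.0, 1.0, 0.0])
K = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])   # symmetric and positive definite, so Cholesky works

L = np.linalg.cholesky(K)            # K = L L^T
u = np.random.randn(3)               # one call to the scalar Gaussian generator per dimension
x = m + L.dot(u)                     # x ~ N(m, K)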
Cholesky Decomposition
Decomposes a symmetric, positive definite matrix $A$ into a product of a lower triangular matrix $L$ and its transpose:

$$A = LL^\top$$

- Useful for solving linear systems $A\mathbf{x} = \mathbf{b}$ with a symmetric, positive definite coefficient matrix $A$.
- Want to solve $A\mathbf{x} = \mathbf{b}$ for $\mathbf{x}$
- Solve the triangular system $L\mathbf{z} = \mathbf{b}$ by forward substitution
- Solve the triangular system $L^\top \mathbf{x} = \mathbf{z}$ by backward substitution
- Solution is $\mathbf{x} = L^\top \backslash (L \backslash \mathbf{b})$
- $A \backslash \mathbf{b}$ = the vector $\mathbf{x}$ which solves $A\mathbf{x} = \mathbf{b}$
Complexity
- Both forward and backward substitution require $n^2/2$ operations each, when $L$ is of size $n \times n$.
- The computation of the Cholesky factor $L$ is considered numerically extremely stable and takes time $n^3/6$.
Forward/backward substitution
For a lower triangular matrix $L$ we can write the system $L\mathbf{z} = \mathbf{b}$ as follows:

$$\begin{bmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$

Notice that $l_{11} z_1 = b_1$, i.e. $z_1 = b_1 / l_{11}$, and so we can solve the next row by substituting $z_1$ in, i.e. forward substitution.
Doing this repeatedly leads to the full solution for $\mathbf{z}$, that is:

$$z_i = \frac{1}{l_{ii}} \Big( b_i - \sum_{j=1}^{i-1} l_{ij} z_j \Big), \qquad i = 1, \ldots, n$$

For backward substitution we do the exact same thing, just in reverse order, since we're then dealing with an upper triangular matrix.
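A small sketch of both substitutions in plain numpy (the example system is made up); together with the Cholesky factor they solve $A\mathbf{x} = \mathbf{b}$ exactly as described above.

import numpy as np

def forward_substitution(L, b):
    """Solve L z = b for a lower triangular L, one row at a time from the top."""
    n = len(b)
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i].dot(z[:i])) / L[i, i]
    return z

def backward_substitution(U, z):
    """Solve U x = z for an upper triangular U, one row at a time from the bottom."""
    n = len(z)
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (z[i] - U[i, i + 1:].dot(x[i + 1:])) / U[i, i]
    return x

# solving A x = b via the Cholesky factor
A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric, positive definite
b = np.array([1.0, 2.0])
L = np.linalg.cholesky(A)
x = backward_substitution(L.T, forward_substitution(L, b))
print(np.allclose(A.dot(x), b))          # True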
Appendix B: Derivations
Kernel
Declarations
- $\mathbb{R}^D$ = input space
- $\mathbb{R}^N$ = feature space, into which the inputs are mapped by the function $\phi(\mathbf{x})$
- $\Phi(X)$ = matrix with columns $\phi(\mathbf{x}_i)$ for all training inputs
Model
$$f(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{w}$$

where $\mathbf{w}$ is of length $N$. This is analogous to the standard linear model, except that everywhere $\Phi(X)$ is substituted for $X$.
Appendix C: Exercises
Chapter 2: Regression
Ex. 1
First we import the stuff that needs to be imported
%matplotlib inline
from __future__ import division

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
After having some issues, I found this article from which I used some of the code, so thanks a lot to Chris Fonnesbeck!
We start out by the following specification for our Gaussian Process:
def exponential_cov(x, y, theta):
    return theta[0] * np.exp(-0.5 * theta[1] * np.subtract.outer(x, y) ** 2)
Then to compute the conditional probability for some new inputs conditioned on the previously seen inputs and their outputs, we use the following:

$$p(\mathbf{f}_* \mid \mathbf{f}) = \mathcal{N}\big( K(X_*, X) K(X, X)^{-1} \mathbf{f},\ K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*) \big)$$

where, again, remember that $\mathbf{f}_*$ and $\mathbf{f}$ represent the test and training outputs, i.e. $\mathbf{f} = \mathbf{y}$ in the case of noise-free training (already seen) inputs:
def conditional(X_star, X, y, theta):
    B = exponential_cov(X_star, X, theta)
    C = exponential_cov(X, X, theta)
    A = exponential_cov(X_star, X_star, theta)
    mu = np.linalg.inv(C).dot(B.T).T.dot(y)
    sigma = A - B.dot(np.linalg.inv(C).dot(B.T))
    return mu.squeeze(), sigma.squeeze()  # simply unwraps any 1-D arrays, so (1, 100) => (100,)
For now we set $\theta = (1, 1)$:
theta = (1, 1)
sigma_theta = exponential_cov(0, 0, theta)
xpts = np.arange(-3, 3, step=0.01)
plt.errorbar(xpts, np.zeros(len(xpts)), yerr=sigma_theta, capsize=0)
plt.ylim(-3, 3)

# some sampled priors
f1_prior = np.random.multivariate_normal(np.zeros(len(xpts)), exponential_cov(xpts, xpts, theta))
f2_prior = np.random.multivariate_normal(np.zeros(len(xpts)), exponential_cov(xpts, xpts, theta))
f3_prior = np.random.multivariate_normal(np.zeros(len(xpts)), exponential_cov(xpts, xpts, theta))
plt.plot(xpts, f1_prior)
plt.plot(xpts, f2_prior)
plt.plot(xpts, f3_prior)
We select an arbitrary starting point, say $x = 1$, and simply sample its output from a normal distribution with unit variance. This is just for us to have a "seen" sample to play with.
x = [1.]
y = [np.random.normal(scale=sigma_theta)]
np.array(y)
We now update our confidence band, given the point we just sampled, by using the covariance function to generate new point-wise confidence intervals, conditional on the value seen above.
At a first glance it might look like `conditional` and `predict_noiseless` are the same function, but the difference is that `predict_noiseless` returns the mean and (noiseless) variance at each point, whereas `conditional` returns the mean and the entire covariance matrix for all the test points. So if you want to draw function samples (as we do later on), you'll want to use `conditional`, while if you're simply after the mean line and its uncertainty at each evaluated point you use `predict_noiseless`.
sigma_1 = exponential_cov(x, x, theta)  # compute covariance for new point

def predict_noiseless(x, data, kernel, theta, sigma, f):
    K = [kernel(x, y, theta) for y in data]  # covariance between test input and training input
    sigma_inv = np.linalg.inv(sigma)         # inverting the covariance matrix for training input
    K_dot_sigma_inv = np.dot(K, sigma_inv)
    y_pred = K_dot_sigma_inv.dot(f)          # compute the predicted mean
    sigma_new = kernel(x, x, theta) - K_dot_sigma_inv.dot(K)  # covariance of new point
    return y_pred, sigma_new

x_pred = np.linspace(-3, 3, 2000)
predictions = [predict_noiseless(i, x, exponential_cov, theta, sigma_1, y) for i in x_pred]
y_pred, sigmas = np.transpose(predictions)
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0, alpha=0.6)
plt.plot(x, y, "ro")
plt.plot(x_pred, y_pred, c="c")
plt.ylim(-3, 3)
Notice that here we are predicting a mean and covariance at each test point, but for fun's sake we can sample some functions too!
That is, we create a normal distribution of dimension equal to the number of test inputs and draw from this distribution; the sample at index `i` then corresponds to the function value at test input `x_pred[i]`, yah see? It's weird, yes, but it allows the sampled points to take any real value!
# same as above
plt.errorbar(x_pred, y_pred, yerr=sigmas, capsize=0, alpha=0.3, label="$\sigma_*$")
plt.plot(x, y, "ro")
plt.plot(x_pred, y_pred, c="black", label="$\mu_*$")
plt.ylim(-3, 3)
plt.xlim(-3, 3)

# Some sampled functions
mu_star, K_star = conditional(x_pred, x, y, theta)  # x_pred conditioned on training input (x, y)
f1 = np.random.multivariate_normal(mu_star, K_star)
f2 = np.random.multivariate_normal(mu_star, K_star)
f3 = np.random.multivariate_normal(mu_star, K_star)
plt.plot(x_pred, f1, label="$f_*^{(1)}$")
plt.plot(x_pred, f2, label="$f_*^{(2)}$")
plt.plot(x_pred, f3, label="$f_*^{(3)}$")
plt.legend()
Issues
I had multiple issues with following the book:
- The covariance matrix of the input matrix that I come up with is not positive definite => can't use the Cholesky decomposition to draw samples, nor to circumvent having to compute the inverse. There is no guarantee for this to be the case either, but there is a guarantee that for a real matrix we will have a positive semi-definite covariance matrix, which allows for drawing from a Gaussian by basically replacing each $L$ step with the equivalent SVD step (where we use the product $U\sqrt{S}$ instead of $L$); see the sketch after this list.
- What to transpose and so on in the computation of the mean vector and covariance matrix for the predictions I got wrong multiple times… IT'S HARD, MKAY?!
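For what it's worth, here's a tiny sketch of that SVD workaround; the singular covariance matrix is made up just to show a case where `np.linalg.cholesky` would fail.

import numpy as np

def psd_gaussian_draw(m, K):
    """Draw from N(m, K) when K is only positive semi-definite (Cholesky may fail)."""
    U, S, _ = np.linalg.svd(K)          # for symmetric PSD K we have K = U diag(S) U^T
    u = np.random.randn(len(m))
    return m + U.dot(np.sqrt(S) * u)    # use U sqrt(S) in place of the Cholesky factor L

# a singular (rank-deficient) covariance matrix, where np.linalg.cholesky would raise an error
m = np.zeros(2)
K = np.array([[1.0, 1.0],
              [1.0, 1.0]])
print(psd_gaussian_draw(m, K))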
Some of my own deprecated / horrible work
To generate samples $\mathbf{x} \sim \mathcal{N}(\mathbf{m}, K)$ with arbitrary mean $\mathbf{m}$ and covariance matrix $K$ using a scalar Gaussian generator, we proceed as follows:
- Compute the Cholesky decomposition $K = LL^\top$ (also known as the "matrix square root") of the positive-definite symmetric covariance matrix $K$, where $L$ is lower triangular.
- Generate $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, I)$ by multiple separate calls to the scalar Gaussian generator.
- Compute $\mathbf{x} = \mathbf{m} + L\mathbf{u}$, which has the desired distribution with:
  - Mean $\mathbf{m}$
  - Cov. $L\,\mathbb{E}[\mathbf{u}\mathbf{u}^\top]\,L^\top = LL^\top = K$ (by independence of the elements of $\mathbf{u}$)
def gaussian_draws(m, K, N=10000):
    """
    IMPORTANT: This version uses the Cholesky decomposition, which ONLY works if the
    matrix `K` is positive-definite symmetric, which is often not the case!
    A covariance matrix is only guaranteed to be positive SEMI-definite symmetric.
    """
    dim = m.shape[0]
    X = np.zeros((N, dim))
    L = np.linalg.cholesky(K)
    for i in range(N):
        u = np.random.normal(0, 1, dim)
        X[i, :] = m + np.dot(L, u)
    if N == 1:
        return X[0]  # return a single vector if only 1 sample is requested
    else:
        return X

m = np.array([2, 1, 0])
K = np.array([[2, -1, 0], [-1, 2, -1], [0, -1, 2]])  # this is positive-definite
X = gaussian_draws(m, K, 10)
X
Of course I could just have taken advantage of the fact that `numpy` can generate the entire sample matrix for me, buuut I dunno, felt like doing some work myself, though it's not much.
And if you want to know how we can draw from a positive semi-definite symmetric matrix, check out this. It explains how you ought to use the SVD in this case.
And then we need to be able to compute the squared exponential kernel function:
def squared_exponential(X, l=2):
    """
    DEPRECATED. Use exponential_cov instead. Much better.

    Computes the squared exponential (also known as Radial Basis Function) kernel of the
    matrix `X`. Assumes `X` to be of the shape (n, p) where n is the number of samples,
    and p is the number of random variables for which we have samples.
    """
    dim = X.shape[0]
    K = np.zeros((dim, dim))
    for p in range(dim):
        x_p = X[p]
        for q in range(p, dim):
            x_q = X[q]
            diff = np.subtract(x_p, x_q)
            # since the resulting covariance matrix will be symmetric we only need to compute
            # the upper triangle, and then copy the values.
            se = np.exp(-(1 / l) * (diff ** 2).sum())
            K[p, q] = se
            # stop us from assigning to the diagonal entries
            # twice, being SLIGHTLY more performant, yey!
            if p != q:
                K[q, p] = se
    return K

squared_exponential(np.arange(3))
Ex. 2
Bibliography
- [rasmussen2006gaussian] Rasmussen & Williams, Gaussian Processes for Machine Learning, MIT Press (2006).