# Notes on: Brochu, E., Cora, V. M., & de Freitas, N. (2010): A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning

## Notation

- "Exploitation" is the notion of "greedy / local" optimization
- Nonlinear function over compact set . This is often a loss-function (we're formulating this as a maximization problem, in which case we would maximize instead)
- is a
*sequence*of samples - is the observations of input-output pairs
- and
- denotes the
**acquisition function** The predictive distribution is given by:

where

- Sometimes we will use the notation and for and

## Bayesian Optimization

- **Acquisition function** is a function to be maximized to obtain the next sample. It depends on the previous samples:

$$x_{t+1} = \operatorname{argmax}_{x} \; u(x \mid \mathcal{D}_{1:t})$$

- **Posterior distribution over objective** is a distribution over the objective function, which we are *sampling* from
- *Sampling* means that we can balance "exploitation" ("greedy" optimization) and "exploration"
    - Useful if $f$ has multiple optima

The posterior is given by Bayes' rule:

$$P(f \mid \mathcal{D}_{1:t}) \propto P(\mathcal{D}_{1:t} \mid f)\, P(f)$$

### Priors over functions

The Bayesian optimization method will converge to the optimum if:

- the acquisition function is *continuous* and *approximately minimizes the risk* (the expected deviation from the global minimum at a fixed point $x$)
- conditional variance converges to zero (or an appropriate minimum value in the presence of noise) if and only if the distance to the nearest observation is zero
- the objective is continuous
- the prior is homogeneous
- the optimization is independent of the $m$-th differences

It turns out that a Gaussian Process (GP) is very well suited for such a task, that is:

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big)$$

In this case, the predictive distribution is given by:

$$P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1}) = \mathcal{N}\big(\mu_t(x_{t+1}),\, \sigma_t^2(x_{t+1})\big)$$

where

$$\mu_t(x_{t+1}) = \mathbf{k}^\top \mathbf{K}^{-1} f_{1:t}, \qquad \sigma_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - \mathbf{k}^\top \mathbf{K}^{-1} \mathbf{k}$$

with $\mathbf{k} = [k(x_{t+1}, x_1), \dots, k(x_{t+1}, x_t)]^\top$ and $\mathbf{K}$ the kernel matrix with entries $K_{ij} = k(x_i, x_j)$.

Hence $\mu_t$ and $\sigma_t^2$ are sufficient statistics for the predictive posterior distribution $P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1})$.
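As a sanity check, the predictive equations above can be sketched in a few lines of numpy. The squared-exponential kernel, unit length scale, and jitter term here are my choices for illustration, not fixed by the paper:

```python
import numpy as np

def se_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel k(x, x') = exp(-0.5 (x - x')^2 / l^2), 1-D inputs.
    d = np.subtract.outer(a, b)
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, f_train, x_star, length_scale=1.0, jitter=1e-10):
    # mu(x*) = k^T K^{-1} f_{1:t},  sigma^2(x*) = k(x*, x*) - k^T K^{-1} k
    K = se_kernel(x_train, x_train, length_scale) + jitter * np.eye(len(x_train))
    k = se_kernel(x_train, x_star, length_scale)          # shape (t, m)
    alpha = np.linalg.solve(K, f_train)
    v = np.linalg.solve(K, k)
    mu = k.T @ alpha
    var = np.diag(se_kernel(x_star, x_star, length_scale)) - np.einsum('ij,ij->j', k, v)
    return mu, var
```

At a training input the posterior mean reproduces the observed value with (near) zero variance; far from all data it reverts to the zero prior mean with prior variance $k(x, x) = 1$.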

### Covariance function

- Determines smoothness properties of samples drawn from the GP

The **Matérn kernel** incorporates a smoothness parameter $\nu$ to permit greater flexibility in modelling functions:

$$k(x_i, x_j) = \frac{2^{1-\nu}}{\Gamma(\nu)} \big(\sqrt{2\nu}\,\|x_i - x_j\|\big)^{\nu}\, K_{\nu}\big(\sqrt{2\nu}\,\|x_i - x_j\|\big)$$

where

- $\Gamma(\cdot)$ is the Gamma function
- $K_{\nu}(\cdot)$ is the modified Bessel function of order $\nu$

Note that as $\nu \to \infty$, the **Matérn kernel** reduces to the squared exponential kernel.

$\nu$ will often be used as a hyperparameter.
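The general form needs Bessel-function evaluations, but for half-integer $\nu$ the Matérn kernel has simple closed forms (standard results, not the paper's notation; function names are mine). A stdlib-only sketch, illustrating convergence toward the squared exponential as $\nu$ grows:

```python
import math

def matern(r, nu, ell=1.0):
    # Closed forms of the Matérn kernel for half-integer smoothness nu,
    # with r = ||x - x'|| and length scale ell.
    if nu == 0.5:
        return math.exp(-r / ell)
    if nu == 1.5:
        a = math.sqrt(3) * r / ell
        return (1 + a) * math.exp(-a)
    if nu == 2.5:
        a = math.sqrt(5) * r / ell
        return (1 + a + a * a / 3) * math.exp(-a)
    raise ValueError("only nu in {0.5, 1.5, 2.5} implemented here")

def squared_exponential(r, ell=1.0):
    # The nu -> infinity limit of the Matérn kernel.
    return math.exp(-0.5 * (r / ell) ** 2)
```

Evaluating at $r = 1$ shows the kernel values approach the squared-exponential value monotonically as $\nu$ increases through $1/2, 3/2, 5/2$.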

### Acquisition functions

We would like to choose $u$ such that *high* acquisition corresponds to potentially:

- High prediction (large $\mu(x)$)
- High uncertainty (large $\sigma^2(x)$)

The **Maximum Probability of Improvement (MPI)** or **P-algorithm** (since the utility is the probability of improvement) is defined:

$$\mathrm{PI}(x) = P\big(f(x) \ge f(x^+)\big) = \Phi\left( \frac{\mu(x) - f(x^+)}{\sigma(x)} \right)$$

where

- $\Phi(\cdot)$ is the *normal cumulative distribution function*
- $x^+ = \operatorname{argmax}_{x_i \in x_{1:t}} f(x_i)$ is the incumbent

A drawback of **MPI** is that it is *purely exploitative*: points that have a high probability of being infinitesimally greater than $f(x^+)$ are preferred over points that offer potentially larger gains but with greater uncertainty. As a result, a modification is to add a trade-off parameter $\xi \ge 0$:

$$\mathrm{PI}(x) = P\big(f(x) \ge f(x^+) + \xi\big) = \Phi\left( \frac{\mu(x) - f(x^+) - \xi}{\sigma(x)} \right)$$

Scheduling is sometimes used for $\xi$, starting at a fairly high value to drive *exploration* and then decreasing it later to drive *exploitation*.

"It is difficult to choose $\xi$ in MPI, since too large a value makes the search excessively global, while too small a value makes the search excessively greedy."
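A minimal stdlib sketch of the PI criterion with the trade-off parameter $\xi$ (function names are mine; the $\sigma = 0$ convention is an implementation choice):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def probability_of_improvement(mu, sigma, f_best, xi=0.0):
    # PI(x) = Phi((mu(x) - f(x+) - xi) / sigma(x)); larger xi -> more exploration.
    if sigma == 0:
        return float(mu - f_best - xi > 0)
    return norm_cdf((mu - f_best - xi) / sigma)
```

With $\mu(x) = f(x^+)$ and $\xi = 0$ the criterion is exactly $1/2$; increasing $\xi$ lowers it, penalizing marginal improvements.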

One would wish the criterion took not only the *probability* of improvement into account, but also the *magnitude* of the improvement a point can potentially yield.

Since the expected deviation from the true (unknown) global optimum is expensive to compute, we instead use the **improvement function**:

$$I(x) = \max\big\{0,\; f_{t+1}(x) - f(x^+)\big\}$$

We can then consider the *expected* improvement, and choose the next sample accordingly, i.e.:

$$x_{t+1} = \operatorname{argmax}_{x} \; \mathbb{E}\big[ I(x) \mid \mathcal{D}_{1:t} \big]$$

The distribution we're taking the expectation over is the normal posterior characterized by $\mu(x)$ and $\sigma^2(x)$, under which the improvement $I$ has the normal density

$$\frac{1}{\sqrt{2\pi}\,\sigma(x)} \exp\left( - \frac{\big(\mu(x) - f(x^+) - I\big)^2}{2\sigma^2(x)} \right)$$

The **expected improvement** can then be evaluated analytically (by integrating out $I$, which I find slightly weird since it also depends on $x$), yielding:

$$\mathrm{EI}(x) = \begin{cases} \big(\mu(x) - f(x^+)\big)\,\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}, \qquad Z = \frac{\mu(x) - f(x^+)}{\sigma(x)}$$

where

- $\Phi(\cdot)$ and $\phi(\cdot)$ denote the CDF and PDF, respectively, of the standard normal distribution

Observe that if $\sigma(x) \to 0$ while $\mu(x) > f(x^+)$, then

$$\mathrm{EI}(x) \to \mu(x) - f(x^+)$$

which shows how this criterion takes the *magnitude* of the increase into account.
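The closed form above translates directly into code (stdlib only; function names are mine):

```python
import math

def norm_pdf(z):
    # Standard normal PDF.
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, f_best):
    # EI(x) = (mu - f+) Phi(Z) + sigma phi(Z), with EI = 0 when sigma = 0.
    if sigma == 0:
        return 0.0
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)
```

At $\mu = f(x^+)$ the criterion reduces to $\sigma\,\phi(0) \approx 0.399\,\sigma$, and as $\sigma \to 0$ with $\mu > f(x^+)$ it approaches $\mu - f(x^+)$, matching the limit noted above.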

This is a **myopic** method, in the sense that we're only optimizing a *single step* ahead (i.e. $t + 1$). We could also do the same thing for $t + k$ for some $k > 1$; it turns out it is indeed possible to obtain analytical expressions for these too.

### Confidence Bound criteria

- Component in Sequential Design for Optimization (SDO)

Given a random function model, SDO selects points for evaluation based on the lower confidence bound of the prediction:

$$\mathrm{LCB}(x) = \mu(x) - \kappa\,\sigma(x)$$

Since we're *maximizing*, we're interested in the *upper* confidence bound:

$$\mathrm{UCB}(x) = \mu(x) + \kappa\,\sigma(x)$$

- $\kappa \ge 0$ is a parameter to be specified by the user

We can cast the Bayesian optimization problem as a multi-armed bandit, with the *acquisition* being the *instantaneous regret*

$$r(x) = f(x^\star) - f(x)$$

The goal is then to minimize the cumulative regret:

$$R_T = \sum_{t=1}^{T} r(x_t) = T f(x^\star) - \sum_{t=1}^{T} f(x_t)$$

where $T$ is the number of iterations.

For **upper confidence bound** selection with $\tau_t = 2 \log\big( t^{d/2 + 2} \pi^2 / 3\delta \big)$ and the hyperparameter $\nu > 0$, we define

$$\mathrm{GP\text{-}UCB}(x) = \mu(x) + \sqrt{\nu \tau_t}\,\sigma(x)$$

With $\nu = 1$ and $\delta \in (0, 1)$, it can be shown that with high probability this method has *no regret*, i.e.

$$\lim_{T \to \infty} \frac{R_T}{T} = 0$$

i.e. the average regret converges to zero.

- This implies a lower bound on the convergence rate (or equivalently, an upper bound on the "convergence time")
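A stdlib sketch of the GP-UCB criterion with the $\tau_t$ schedule above (the `d`, `delta`, and `nu` defaults are illustrative, not prescribed):

```python
import math

def gp_ucb(mu, sigma, t, d=1, delta=0.1, nu=1.0):
    # tau_t = 2 log(t^(d/2 + 2) * pi^2 / (3 delta)): the exploration weight
    # grows slowly with the iteration count t.
    tau_t = 2 * math.log(t ** (d / 2 + 2) * math.pi ** 2 / (3 * delta))
    return mu + math.sqrt(nu * tau_t) * sigma
```

The schedule makes the uncertainty bonus increase with $t$, so unexplored (high-$\sigma$) regions keep being revisited even late in the run.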

### Maximizing the acquisition function

- To find the next sample point, we need to maximize the acquisition function $u(x)$
- One can use the deterministic, derivative-free optimizer `DIRECT` (Jones et al., 1993: *Lipschitzian optimization without the Lipschitz constant*)
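Implementing `DIRECT` is beyond a few lines, but its role here — a global, derivative-free inner-loop maximizer of the (cheap) acquisition function over a box — can be mimicked with simple random search (a stand-in, not the actual algorithm; names are mine):

```python
import random

def maximize_acquisition(acq, bounds, n_samples=2000, seed=0):
    # Cheap stand-in for DIRECT: random search over the box `bounds`,
    # a list of (lo, hi) pairs, one per input dimension.
    rng = random.Random(seed)
    best_x, best_u = None, float("-inf")
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        u = acq(x)
        if u > best_u:
            best_x, best_u = x, u
    return best_x, best_u
```

This is affordable precisely because $u$ is cheap to evaluate, unlike the expensive objective $f$ itself.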

### Noise

- Easy enough to incorporate: add $\sigma_{\mathrm{noise}}^2 I$ to the kernel matrix $\mathbf{K}$ and derive the predictive distribution from the GP as before
- Change the definition of the *incumbent* in the PI and EI formulations to use the distribution at the sample points instead of the maximum observation, and let the incumbent be the point with the highest *expected* value (over the noise):

$$x^+ = \operatorname{argmax}_{x_i \in x_{1:t}} \mu(x_i)$$
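That incumbent choice is a one-liner (`posterior_mean` is a stand-in for the GP mean $\mu$, evaluated at the sampled points):

```python
def incumbent_under_noise(x_samples, posterior_mean):
    # With noisy observations, pick the sampled point with the highest
    # posterior *mean* rather than the highest raw observation.
    return max(x_samples, key=posterior_mean)
```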