# Notes on: Brochu, E., Cora, V. M., & de Freitas, N. (2010): A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning

## Notation

• "Exploitation" is the notion of "greedy" / local optimization
• Nonlinear objective $f$ over a compact set $\mathcal{A} \subset \mathbb{R}^d$. This is often a cost function; the tutorial treats the maximization problem $\max_{x \in \mathcal{A}} f(x)$ (for a cost function we would minimize instead)
• $x_{1:t} = \{x_1, \dots, x_t\}$ is a sequence of samples
• $\mathcal{D}_{1:t} = \{x_{1:t}, f(x_{1:t})\}$ are the observations of input-output pairs
• $u(\cdot)$ denotes the acquisition function
• The predictive distribution is given by $P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1}) = \mathcal{N}\big(\mu_t(x_{t+1}), \sigma_t^2(x_{t+1})\big)$, where $\mu_t$ and $\sigma_t^2$ are the posterior mean and variance. Sometimes we will use the notation $\mu(x)$ and $\sigma^2(x)$ for $\mu_t(x)$ and $\sigma_t^2(x)$

## Bayesian Optimization

• Acquisition function is a function to be maximized to obtain the next sample point. It depends on the previous samples: $x_{t+1} = \operatorname{argmax}_x u(x \mid \mathcal{D}_{1:t})$
• Posterior distribution over the objective is a distribution over the objective function, which we are sampling from
• Sampling means that we can balance "exploitation" ("greedy" optimization) and "exploration"
• Useful if $f$ has multiple optima
• Given by Bayes' rule: $P(f \mid \mathcal{D}_{1:t}) \propto P(\mathcal{D}_{1:t} \mid f)\, P(f)$

### Priors over functions

A Bayesian optimization method will converge to the optimum if:

1. the acquisition function is continuous and approximately minimizes the risk (expected deviation from the global minimum at a fixed point $x$)
2. the conditional variance converges to zero (or an appropriate minimum value in the presence of noise) if and only if the distance to the nearest observation is zero
3. the objective is continuous
4. the prior is homogeneous
5. the optimization is independent of the $m$-th differences

Turns out that a Gaussian Process (GP) is very well suited for such a task, that is: $f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big)$. In this case, the predictive distribution is given by

$$P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1}) = \mathcal{N}\big(\mu_t(x_{t+1}), \sigma_t^2(x_{t+1})\big)$$

where

$$\mu_t(x_{t+1}) = \mathbf{k}^\top K^{-1} f_{1:t}, \qquad \sigma_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - \mathbf{k}^\top K^{-1} \mathbf{k}$$

with $\mathbf{k} = [k(x_{t+1}, x_1), \dots, k(x_{t+1}, x_t)]^\top$ and $K = [k(x_i, x_j)]_{i,j}$ the kernel matrix. Hence $\mu_t$ and $\sigma_t^2$ are sufficient statistics for the predictive posterior distribution $P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1})$.
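A minimal sketch of these GP predictive equations in NumPy, assuming a zero-mean GP with a squared-exponential kernel (the function names are my own; in practice one would use a Cholesky solve rather than an explicit inverse):

```python
import numpy as np

def se_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel k(a, b) between two 1-D point sets."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(x_new, x_obs, f_obs, kernel=se_kernel):
    """Predictive mean and variance of a zero-mean GP at the points x_new."""
    K = kernel(x_obs, x_obs)                       # t x t kernel matrix
    k = kernel(x_obs, x_new)                       # t x m cross-covariances
    K_inv = np.linalg.inv(K + 1e-10 * np.eye(len(x_obs)))  # jitter for stability
    mu = k.T @ K_inv @ f_obs                       # mu(x) = k^T K^{-1} f_{1:t}
    # sigma^2(x) = k(x, x) - k^T K^{-1} k, evaluated per query point
    var = kernel(x_new, x_new).diagonal() - np.einsum('ij,ik,kj->j', k, K_inv, k)
    return mu, var
```

At an observed point, the posterior mean interpolates the observation and the variance collapses to (near) zero, as condition 2 of the convergence criteria requires.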

### Covariance function

• Determines smoothness properties of samples of $f$

The Matérn kernel incorporates a smoothness parameter $\nu$ to permit greater flexibility in modelling functions:

$$k(x_i, x_j) = \frac{1}{2^{\nu - 1} \Gamma(\nu)} \big(2 \sqrt{\nu}\, \lVert x_i - x_j \rVert\big)^{\nu}\, H_{\nu}\big(2 \sqrt{\nu}\, \lVert x_i - x_j \rVert\big)$$

where

• $\Gamma(\cdot)$ is the Gamma function
• $H_{\nu}(\cdot)$ is the Bessel function of order $\nu$

Note that as $\nu \to \infty$, the Matérn kernel reduces to the squared exponential kernel. $\nu$ will often be used as a hyperparameter.
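A sketch of this form of the Matérn kernel as a function of the distance $r = \lVert x_i - x_j \rVert$, using SciPy's modified Bessel function `kv` for $H_\nu$ (unit length scale assumed; `matern` is my own helper name):

```python
import numpy as np
from scipy.special import gamma, kv  # Gamma function and modified Bessel K_nu

def matern(r, nu):
    """Matern kernel of smoothness nu, as a function of distance r (array-like)."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    z = 2.0 * np.sqrt(nu) * r
    out = np.ones_like(r)                      # k(0) = 1 by continuity
    nz = z > 0                                 # kv diverges at 0, so handle separately
    out[nz] = (z[nz] ** nu) * kv(nu, z[nz]) / (2.0 ** (nu - 1) * gamma(nu))
    return out
```

As a sanity check, for $\nu = 1/2$ this form reduces to the exponential kernel $\exp(-\sqrt{2}\, r)$.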

### Acquisition functions

Would like to choose $u$ such that a high acquisition value corresponds to potentially

• High prediction
• High uncertainty

The Maximum Probability of Improvement (MPI) or P-algorithm (since the utility is the probability of improvement) is defined:

$$\mathrm{PI}(x) = P\big(f(x) \ge f(x^+)\big) = \Phi\left(\frac{\mu(x) - f(x^+)}{\sigma(x)}\right)$$

where

• $\Phi(\cdot)$ is the normal cumulative distribution function
• $x^+ = \operatorname{argmax}_{x_i \in x_{1:t}} f(x_i)$ is the incumbent

Drawback with MPI is that it's purely exploitative. Points that have a high probability of being infinitesimally greater than $f(x^+)$ are preferred over points that offer potentially larger gains but with greater uncertainty. As a result, a modification is to add a trade-off parameter $\xi \ge 0$:

$$\mathrm{PI}(x) = P\big(f(x) \ge f(x^+) + \xi\big) = \Phi\left(\frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}\right)$$

Scheduling is sometimes used for $\xi$, starting at a fairly high value to drive exploration and then decreasing it later to drive exploitation.
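A small sketch of the PI acquisition with the trade-off parameter $\xi$ (the function name is my own; `scipy.stats.norm.cdf` supplies $\Phi$):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI(x) = Phi((mu(x) - f(x+) - xi) / sigma(x)); xi trades off exploration."""
    sigma = np.maximum(sigma, 1e-12)   # avoid division by zero at observed points
    return norm.cdf((mu - f_best - xi) / sigma)
```

With $\xi = 0$ and $\mu(x) = f(x^+)$ this gives exactly $1/2$; for a point whose mean lies below the incumbent, larger $\sigma$ raises PI, which is the exploration effect.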

"It's difficult to choose $\xi$ in MPI, since too large a value makes the search excessively global, while too small a value makes the search excessively greedy."

One would wish that it not only took the probability of improvement into account, but also the magnitude of the improvement a point can potentially yield.

But since it's expensive to compute the expected deviation (a norm between two points) directly, we instead use the improvement function:

$$I(x) = \max\big\{0,\, f_{t+1}(x) - f(x^+)\big\}$$

We can then consider the expected improvement, and choose the next point accordingly, i.e.:

$$x = \operatorname{argmax}_x\, \mathbb{E}\big[\max\{0,\, f_{t+1}(x) - f(x^+)\} \mid \mathcal{D}_{1:t}\big]$$

The distribution we're taking the expectation over is the normal posterior distribution characterized by $\mu(x)$ and $\sigma^2(x)$, which has the normal density

$$\frac{1}{\sqrt{2\pi}\,\sigma(x)} \exp\left(-\frac{\big(\mu(x) - f(x^+) - I\big)^2}{2\sigma^2(x)}\right)$$

The expected improvement can then be evaluated analytically (by integrating out $I$, which I find slightly weird since it also depends on $x$), yielding:

$$\mathrm{EI}(x) = \begin{cases} \big(\mu(x) - f(x^+)\big)\,\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}, \qquad Z = \frac{\mu(x) - f(x^+)}{\sigma(x)}$$

where

• $\Phi(\cdot)$ and $\phi(\cdot)$ denote the CDF and PDF, respectively, of the standard normal distribution

Observe that if $\sigma(x) > 0$, then $\mathrm{EI}(x) = \sigma(x)\big(Z\,\Phi(Z) + \phi(Z)\big)$, which shows why this takes "more" into account the magnitude of the increase, not just its probability.
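The analytic EI formula can be sketched as follows (the helper name is my own; the $\sigma(x) = 0$ case returns 0, matching the case split above):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI(x) = (mu - f+) Phi(Z) + sigma phi(Z) with Z = (mu - f+)/sigma; 0 when sigma = 0."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    ei = np.zeros_like(mu)                 # sigma = 0 entries stay at EI = 0
    pos = sigma > 0
    z = (mu[pos] - f_best) / sigma[pos]
    ei[pos] = (mu[pos] - f_best) * norm.cdf(z) + sigma[pos] * norm.pdf(z)
    return ei
```

Note that at $\mu(x) = f(x^+)$ the expected improvement equals $\sigma(x)\,\phi(0)$, i.e. it is strictly positive whenever there is uncertainty.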

This is a myopic method, in the sense that we're only considering the improvement over a single step (i.e. $t + 1$). We could also do the same thing for $t + k$ for some $k > 1$. Turns out it is indeed possible to obtain analytical expressions for these too.

### Confidence Bound criteria

• Component in Sequential Design for Optimization (SDO)
• Given a random function model, SDO selects points for evaluation based on the lower confidence bound of the prediction: $\mathrm{LCB}(x) = \mu(x) - \kappa \sigma(x)$
• Since we're maximizing, we're interested in the upper confidence bound: $\mathrm{UCB}(x) = \mu(x) + \kappa \sigma(x)$
• $\kappa \ge 0$ is a parameter to be specified by the user
• Can cast the Bayesian optimization problem as a multi-armed bandit, with the acquisition being the instantaneous regret $r(x) = f(x^\star) - f(x)$
• Goal is then to minimize the cumulative regret: $R_T = \sum_{t=1}^{T} r(x_t) = T f(x^\star) - \sum_{t=1}^{T} f(x_t)$, where $T$ is the number of iterations.

• Upper confidence bound selection: with $\kappa_t = \sqrt{\nu \tau_t}$ and the hyperparameter $\nu > 0$, we define $\mathrm{GP\text{-}UCB}(x) = \mu(x) + \kappa_t \sigma(x)$
• With $\nu = 1$ and $\tau_t = 2 \log\big(t^{d/2 + 2} \pi^2 / 3\delta\big)$, it can be shown that with high probability this method has no regret, i.e. $\lim_{T \to \infty} R_T / T = 0$, i.e. the average regret converges to zero.
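A sketch of GP-UCB with the $\kappa_t$ schedule above (the function name and default $\delta$ are my own):

```python
import numpy as np

def gp_ucb(mu, sigma, t, d, delta=0.1, nu=1.0):
    """GP-UCB(x) = mu(x) + kappa_t sigma(x), with kappa_t = sqrt(nu * tau_t)
    and tau_t = 2 log(t^(d/2 + 2) pi^2 / (3 delta))."""
    tau_t = 2.0 * np.log(t ** (d / 2.0 + 2.0) * np.pi ** 2 / (3.0 * delta))
    kappa = np.sqrt(nu * tau_t)
    return mu + kappa * sigma
```

Since $\tau_t$ grows with $t$, the exploration bonus $\kappa_t \sigma(x)$ increases over time, which is what forces continued exploration and underlies the no-regret guarantee.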

• This implies a lower bound on the convergence rate (or equivalently, an upper bound on the "convergence time")

### Maximizing the acquisition function

• To find the next sample point, we need to maximize the acquisition function $u(x \mid \mathcal{D}_{1:t})$
• One can use the deterministic, derivative-free optimizer `DIRECT` (Jones et al., 1993: "Lipschitzian optimization without the Lipschitz constant")
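`DIRECT` itself isn't sketched here; as a simple stand-in, a random multistart over L-BFGS-B (a common practical alternative, with gradients of the acquisition approximated numerically) could look like this (names are my own):

```python
import numpy as np
from scipy.optimize import minimize

def argmax_acquisition(acq, bounds, n_restarts=20, seed=0):
    """Maximize acq: R^d -> R over a box via random multistart L-BFGS-B.
    bounds is a sequence of (low, high) pairs, one per dimension."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, float)                 # shape (d, 2)
    best_x, best_val = None, -np.inf
    starts = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_restarts, len(bounds)))
    for x0 in starts:
        # minimize the negated acquisition = maximize the acquisition
        res = minimize(lambda x: -acq(x), x0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x, best_val
```

Multistart is needed because the acquisition surface is typically multimodal; a single local search could get stuck on a poor mode.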

### Noise

• Easy enough to incorporate by adding $\sigma_{\text{noise}}^2 I$ to the kernel matrix $K$ and deriving the predictive distribution from the GP as before
• Change the definition of the incumbent in the PI and EI formulations to use the distribution at the sample points, instead of the maximum observation, and let the incumbent be the point with the highest expected value (over the noise): $\mu^+ = \max_{x_i \in x_{1:t}} \mu(x_i)$
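A sketch of the noisy-GP predictive equations, with $\sigma_{\text{noise}}^2 I$ added to the kernel matrix (squared-exponential kernel with unit variance assumed; names are my own):

```python
import numpy as np

def noisy_gp_posterior(x_new, x_obs, y_obs, length_scale=1.0, noise_var=0.1):
    """GP posterior with observation noise: sigma_noise^2 I added to the kernel matrix."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)
    K = k(x_obs, x_obs) + noise_var * np.eye(len(x_obs))   # noisy kernel matrix
    kx = k(x_obs, x_new)
    K_inv = np.linalg.inv(K)
    mu = kx.T @ K_inv @ y_obs
    var = 1.0 - np.einsum('ij,ik,kj->j', kx, K_inv, kx)    # k(x, x) = 1 here
    return mu, var

# Incumbent under noise: highest *expected* value among the sampled points,
# mu_plus = max_i mu(x_i), rather than the maximum (noisy) observation.
```

Note that with noise the posterior mean no longer interpolates the observations exactly; it shrinks toward the prior mean, which is why the incumbent is taken from $\mu$ rather than from the raw $y$ values.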