Notes on: Brochu, E., Cora, V. M., & de Freitas, N. (2010): A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning
Notation
- "Exploitation" is the notion of "greedy / local" optimization
- Nonlinear function $f : \mathcal{A} \to \mathbb{R}$ over compact set $\mathcal{A} \subseteq \mathbb{R}^d$. This is often a loss function (we're formulating this as a maximization problem, in which case we would maximize $-f$ instead)
- $x_{1:t}$ is a sequence of samples
- $\mathcal{D}_{1:t} = \{x_{1:t}, f(x_{1:t})\}$ is the observations of input-output pairs
- $u$ denotes the acquisition function
The predictive distribution is given by $P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1})$, where $f_{t+1} = f(x_{t+1})$
- Sometimes we will use the notation $\mu$ and $\sigma^2$ for $\mu_t(x_{t+1})$ and $\sigma_t^2(x_{t+1})$
Bayesian Optimization
Acquisition function is a function to be maximized to obtain the next sample. It depends on the previous samples:
$$x_{t+1} = \operatorname{argmax}_x \; u(x \mid \mathcal{D}_{1:t})$$
- Posterior distribution over objective is a distribution over the objective function, which we are sampling from
- Sampling means that we can balance the "exploitation" ("greedy" optimization) and "exploration"
- Useful if $f$ has multiple optima
- Useful if $f$ is expensive to evaluate
Given by Bayes' rule:
$$P(f \mid \mathcal{D}_{1:t}) \propto P(\mathcal{D}_{1:t} \mid f)\, P(f)$$
Priors over functions
Bayesian optimization will converge to the optimum if:
- the acquisition function $u$ is continuous and approximately minimizes the risk (the expected deviation from the global minimum at a fixed point $x$)
- the conditional variance converges to zero (or to the appropriate minimum value in the presence of noise) if and only if the distance to the nearest observation is zero
- the objective is continuous
- the prior is homogeneous
- the optimization is independent of the $m$-th differences
Turns out that a Gaussian Process (GP) is very well suited for such a task, that is:
$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big)$$
In this case, the predictive distribution is given by:
$$P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1}) = \mathcal{N}\big(\mu_t(x_{t+1}), \sigma_t^2(x_{t+1})\big)$$
where
$$\mu_t(x_{t+1}) = \mathbf{k}^\top K^{-1} f_{1:t}, \qquad \sigma_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - \mathbf{k}^\top K^{-1} \mathbf{k}$$
with $\mathbf{k} = [k(x_{t+1}, x_1), \ldots, k(x_{t+1}, x_t)]^\top$ and $K = [k(x_i, x_j)]_{i,j}$ the kernel matrix. Hence $\mu_t$ and $\sigma_t^2$ are sufficient statistics for the predictive posterior distribution $P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1})$.
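The predictive equations above can be sketched in a few lines of NumPy. This is a minimal sketch with my own names (`gp_posterior`, `sq_exp`), assuming a zero-mean prior; a real implementation would use a Cholesky factorization rather than an explicit inverse:

```python
import numpy as np

def gp_posterior(X, y, x_new, k, noise=0.0):
    """Posterior mean and variance of a zero-mean GP at a test point x_new.

    X: (t, d) observed inputs, y: (t,) observed values, k: kernel function.
    """
    K = np.array([[k(a, b) for b in X] for a in X]) + noise * np.eye(len(X))
    k_vec = np.array([k(a, x_new) for a in X])  # the vector k from the notes
    K_inv = np.linalg.inv(K)                    # Cholesky would be preferred in practice
    mu = k_vec @ K_inv @ y
    var = k(x_new, x_new) - k_vec @ K_inv @ k_vec
    return mu, var

# Squared-exponential kernel as a simple choice of covariance function.
sq_exp = lambda a, b: np.exp(-0.5 * np.sum((np.asarray(a) - np.asarray(b)) ** 2))
```

At an observed point (with zero noise) the posterior mean reproduces the observation and the variance collapses to zero, matching the convergence condition on the conditional variance above.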
Covariance function
- Determines smoothness properties of samples of $f$
The Matérn kernel incorporates a smoothness parameter $\nu$ to permit greater flexibility in modelling functions:
$$k(x_i, x_j) = \frac{1}{2^{\nu - 1} \Gamma(\nu)} \big(2 \sqrt{\nu}\, \lVert x_i - x_j \rVert\big)^{\nu} H_{\nu}\big(2 \sqrt{\nu}\, \lVert x_i - x_j \rVert\big)$$
where
- $\Gamma(\cdot)$ is the Gamma function
- $H_{\nu}(\cdot)$ is the Bessel function of order $\nu$
Note that as $\nu \to \infty$, the Matérn kernel reduces to the squared exponential kernel.
$\nu$ will often be used as a hyperparameter.
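The kernel as written above can be evaluated directly with SciPy's Gamma and Bessel functions ($H_\nu$ corresponds to the modified Bessel function of the second kind, `scipy.special.kv`). A sketch with unit length-scale, `matern` being my own name:

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function of the second kind

def matern(xi, xj, nu=2.5):
    """Matérn kernel as written in the notes, with unit length-scale:
    k = 1/(2^(nu-1) Gamma(nu)) * (2 sqrt(nu) r)^nu * K_nu(2 sqrt(nu) r)."""
    r = np.linalg.norm(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float))
    if r == 0.0:
        return 1.0  # limit of the expression as r -> 0
    arg = 2.0 * np.sqrt(nu) * r
    return (1.0 / (2 ** (nu - 1) * gamma(nu))) * arg ** nu * kv(nu, arg)
```

For $\nu = 1/2$ this collapses to a plain exponential kernel, the roughest member of the family.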
Acquisition functions
Would like to choose $u$ such that a high acquisition value corresponds to potentially:
- High prediction
- High uncertainty
The Maximum Probability of Improvement (MPI) or P-algorithm (since the utility is the probability of improvement) is defined:
$$\mathrm{PI}(x) = P\big(f(x) \ge f(x^+)\big) = \Phi\left(\frac{\mu(x) - f(x^+)}{\sigma(x)}\right)$$
where
- $\Phi(\cdot)$ is the normal cumulative distribution function
- $x^+ = \operatorname{argmax}_{x_i \in x_{1:t}} f(x_i)$ is the incumbent
Drawback with MPI is that it's purely exploitative: points that have a high probability of being infinitesimally greater than $f(x^+)$ are preferred over points that offer potentially larger gains but with greater uncertainty. As a result, a modification is to add a trade-off parameter $\xi \ge 0$:
$$\mathrm{PI}(x) = \Phi\left(\frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}\right)$$
Scheduling is sometimes used for $\xi$, starting at a fairly high value to drive exploration and then decreasing it later to drive exploitation.
"It's difficult to choose $\xi$ in MPI, since too large a $\xi$ makes the search excessively global, while too small makes the search excessively greedy."
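The modified PI above is one line of code given the posterior mean and standard deviation at a candidate point (`probability_of_improvement` is my own name):

```python
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.0):
    """PI(x) = Phi((mu(x) - f(x+) - xi) / sigma(x)).

    xi = 0 recovers plain MPI; larger xi pushes the search toward exploration.
    """
    return norm.cdf((mu - f_best - xi) / sigma)
```

With $\mu(x) = f(x^+)$ and $\xi = 0$, PI is exactly $1/2$; raising $\xi$ strictly lowers PI everywhere, which is the global/greedy trade-off the quote describes.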
One would wish it not only took the probability of improvement into account, but also the magnitude of the improvement a point can potentially yield. Since directly minimizing the expected deviation from the true maximum is expensive (it requires looking ahead over future evaluations), we instead use the improvement function:
$$I(x) = \max\big\{0,\; f_{t+1}(x) - f(x^+)\big\}$$
We can then consider the expected improvement, and choose the next sample accordingly, i.e.:
$$x_{t+1} = \operatorname{argmax}_x \; \mathbb{E}\big[I(x) \mid \mathcal{D}_{1:t}\big]$$
The distribution we're taking the expectation over is the normal posterior distribution characterized by $\mu(x)$ and $\sigma^2(x)$; the improvement $I$ then has the normal density
$$\frac{1}{\sqrt{2\pi}\,\sigma(x)} \exp\left(-\frac{\big(\mu(x) - f(x^+) - I\big)^2}{2\sigma^2(x)}\right)$$
The expected improvement can then be evaluated analytically (by integrating out $f_{t+1}$, which I find slightly weird since it also depends on $x$), yielding:
$$\mathrm{EI}(x) = \begin{cases} \big(\mu(x) - f(x^+)\big)\,\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases} \qquad Z = \frac{\mu(x) - f(x^+)}{\sigma(x)}$$
where $\Phi(\cdot)$ and $\phi(\cdot)$ denote the CDF and PDF, respectively, of the standard normal distribution
Observe that if $\mu(x) > f(x^+)$, then $\mathrm{EI}(x)$ grows with $\mu(x) - f(x^+)$, which shows how this takes the magnitude of the increase into account, not just its probability.
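The closed-form EI above translates directly to code (`expected_improvement` is my own name):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI(x) = (mu - f+) Phi(Z) + sigma phi(Z) with Z = (mu - f+)/sigma,
    and EI(x) = 0 in the deterministic case sigma = 0."""
    if sigma == 0.0:
        return 0.0
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```

Note that even at $\mu(x) = f(x^+)$, EI is $\sigma(x)\,\phi(0) > 0$ whenever $\sigma(x) > 0$, so uncertain points retain value, unlike under plain MPI.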
This is a myopic method, in the sense that we're only considering the improvement over a single step (i.e. $f_{t+1}$). We could also do the same thing for $f_{t+k}$ for some $k > 1$. Turns out it is indeed possible to obtain analytical expressions for these too.
Confidence Bound criteria
- Component in Sequential Design for Optimization (SDO)
Given a random function model, SDO selects points for evaluation based on the lower confidence bound of the prediction site:
$$\mathrm{LCB}(x) = \mu(x) - \kappa\,\sigma(x)$$
Since we're maximizing, we're interested in the upper confidence bound:
$$\mathrm{UCB}(x) = \mu(x) + \kappa\,\sigma(x)$$
$\kappa \ge 0$ is a parameter to be specified by the user
Can cast the Bayesian optimization problem as a multi-armed bandit, with the acquisition being the instantaneous regret
$$r(x) = f(x^\star) - f(x)$$
Goal is then:
$$\min \sum_{t=1}^{T} r(x_t) = \max \sum_{t=1}^{T} f(x_t)$$
where $T$ is the number of iterations.
Upper confidence bound selection with the hyperparameter $\delta \in (0, 1)$: we define
$$\mathrm{GP\text{-}UCB}(x) = \mu(x) + \sqrt{\nu \tau_t}\, \sigma(x)$$
With $\nu = 1$ and
$$\tau_t = 2 \log\left(\frac{t^{d/2 + 2}\, \pi^2}{3\delta}\right)$$
it can be shown that with high probability this method has no regret, i.e.
$$\lim_{T \to \infty} \frac{R_T}{T} = 0, \qquad R_T = \sum_{t=1}^{T} r(x_t)$$
i.e. the average regret converges to zero.
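The GP-UCB criterion with the $\tau_t$ schedule above is a one-liner (`gp_ucb` is my own name; $t$ is the iteration count and $d$ the input dimension):

```python
import numpy as np

def gp_ucb(mu, sigma, t, d, delta=0.1, nu=1.0):
    """GP-UCB(x) = mu(x) + sqrt(nu * tau_t) * sigma(x), with the schedule
    tau_t = 2 log(t^(d/2 + 2) * pi^2 / (3 delta)) from the notes."""
    tau_t = 2.0 * np.log(t ** (d / 2.0 + 2.0) * np.pi ** 2 / (3.0 * delta))
    return mu + np.sqrt(nu * tau_t) * sigma
```

Since $\tau_t$ grows with $t$, the exploration bonus on uncertain points increases over time, which is what drives the no-regret guarantee.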
- This implies a lower bound on the convergence rate (or equivalently, an upper bound on the convergence time)
Maximizing the acquisition function
- To find the next sample point, we need to maximize the acquisition function $u$
- One can use the deterministic, derivative-free optimizer DIRECT (Jones et al., 1993, "Lipschitzian optimization without the Lipschitz constant")
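The notes name DIRECT for this inner optimization (modern SciPy ships it as `scipy.optimize.direct`); as a simple stand-in sketch, not the paper's method, random-restart L-BFGS-B over the box also works. `maximize_acquisition` is my own name:

```python
import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(u, bounds, n_starts=25, seed=0):
    """Maximize an acquisition function u over a box via L-BFGS-B
    from several random starting points. bounds: list of (low, high)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    best_x, best_val = None, -np.inf
    for _ in range(n_starts):
        x0 = rng.uniform(lo, hi)
        res = minimize(lambda x: -u(x), x0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x, best_val
```

Multistart matters here because acquisition surfaces are typically multimodal: a single local ascent can get stuck on a minor bump, which is exactly why a global method like DIRECT is attractive.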
Noise
- Easy enough to incorporate by adding $\sigma_{\text{noise}}^2 I$ to the kernel matrix $K$ and deriving the predictive distribution from the GP as before
Change the definition of the incumbent in the PI and EI formulations to use the distribution at the sample points, instead of the maximum observed value, and let the incumbent be the point with the highest expected value (over the noise):
$$x^+ = \operatorname{argmax}_{x_i \in x_{1:t}} \mu(x_i)$$
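Both ideas can be sketched together: add the noise term to the kernel matrix, then pick the incumbent as the sampled point with the highest posterior mean. `noisy_incumbent` and `sq_exp` are my own names, again with a zero-mean GP and explicit inverse for brevity:

```python
import numpy as np

def noisy_incumbent(X, y, k, noise_var):
    """Incumbent under noise: the sampled point with the highest posterior
    mean, computed with sigma_noise^2 I added to the kernel matrix."""
    K = np.array([[k(a, b) for b in X] for a in X]) + noise_var * np.eye(len(X))
    K_inv = np.linalg.inv(K)
    mus = np.array([np.array([k(a, x) for a in X]) @ K_inv @ y for x in X])
    i = int(np.argmax(mus))
    return X[i], mus[i]

# Squared-exponential kernel as a simple choice of covariance function.
sq_exp = lambda a, b: np.exp(-0.5 * np.sum((np.asarray(a) - np.asarray(b)) ** 2))
```

With zero noise this recovers the usual incumbent $\operatorname{argmax}_i f(x_i)$; with noise, the posterior means are smoothed, so a single lucky noisy observation no longer dominates the choice of incumbent.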