Notes on: Brochu, E., Cora, V. M., & de Freitas, N. (2010): A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning


Notation

  • "Exploitation" is the notion of "greedy / local" optimization
  • Nonlinear function $f$ over a compact set $\mathcal{A} \subseteq \mathbb{R}^d$. This is often a loss function (we formulate this as a maximization problem, in which case we would maximize $-f$ instead)
  • $x_{1:t}$ is a sequence of samples
  • $\mathcal{D}_{1:t} = \{x_{1:t}, f(x_{1:t})\}$ denotes the observed input-output pairs
  • $\mathbf{K} = \big[k(x_i, x_j)\big]_{i,j=1}^{t}$ is the kernel matrix and $\mathbf{k} = \big[k(x_{t+1}, x_1), \dots, k(x_{t+1}, x_t)\big]$ the vector of covariances with a new point $x_{t+1}$
  • $u(\cdot)$ denotes the acquisition function
  • The predictive distribution is given by:

    $P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1}) = \mathcal{N}\!\left(\mu_t(x_{t+1}),\, \sigma_t^2(x_{t+1})\right)$

    where

    $\mu_t(x_{t+1}) = \mathbf{k}^{\top} \mathbf{K}^{-1} f_{1:t}$
    $\sigma_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - \mathbf{k}^{\top} \mathbf{K}^{-1} \mathbf{k}$

  • Sometimes we will use the notation $\mu(\cdot)$ and $\sigma^2(\cdot)$ for $\mu_t(\cdot)$ and $\sigma_t^2(\cdot)$

Bayesian Optimization

  • The acquisition function is a function to be maximized to obtain the next sample; it depends on the previous samples (a minimal loop sketch follows this list):

    $x_{t+1} = \operatorname{argmax}_{x}\; u(x \mid \mathcal{D}_{1:t})$

  • Posterior distribution over objective is a distribution over the objective function, which we are sampling from
    • Sampling means that we can balance the "exploitation" ("greedy" optimization) and "exploration"
      • Useful if $f$ has multiple optima
    • Given by Bayes' rule:

      $P(f \mid \mathcal{D}_{1:t}) \propto P(\mathcal{D}_{1:t} \mid f)\, P(f)$
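
As a rough illustration of the loop described above (not the paper's implementation): the sketch below assumes a toy 1-D objective and a placeholder, purely exploratory acquisition; the real surrogate and acquisition choices (GP posterior, PI/EI/UCB) are covered in the sections that follow.

```python
import numpy as np

def objective(x):
    # Toy "expensive" black-box objective (1-D, for illustration only).
    return -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

def acquisition(candidates, X):
    # Placeholder acquisition u(x | D): prefer points far from previous
    # samples (pure exploration). PI, EI and UCB are defined later on.
    return np.min(np.abs(candidates[:, None] - X[None, :]), axis=1)

X = np.array([0.1, 0.9])        # samples x_{1:t}
y = objective(X)                # observations, D_{1:t} = (X, y)

for t in range(10):
    candidates = np.linspace(0.0, 1.0, 201)
    # x_{t+1} = argmax_x u(x | D_{1:t})
    x_next = candidates[np.argmax(acquisition(candidates, X))]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print("incumbent x+:", X[np.argmax(y)], "f(x+):", y.max())
```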

Priors over functions

A Bayesian optimization method will converge to the optimum if:

  1. the acquisition function $u$ is continuous and approximately minimizes the risk (the expected deviation from the global minimum at a fixed point $x$)
  2. conditional variance converges to zero (or appropriate minimum value in the presence of noise) if and only if the distance to the nearest observation is zero
  3. the objective is continuous
  4. the prior is homogeneous
  5. the optimization is independent of the m-th differences

It turns out that a Gaussian process (GP) prior is very well suited to this task, that is:

$f(x) \sim \mathcal{GP}\!\left(m(x),\, k(x, x')\right)$

In this case, the predictive distribution is given by:

$P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1}) = \mathcal{N}\!\left(\mu_t(x_{t+1}),\, \sigma_t^2(x_{t+1})\right)$

where

$\mu_t(x_{t+1}) = \mathbf{k}^{\top} \mathbf{K}^{-1} f_{1:t}$
$\sigma_t^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - \mathbf{k}^{\top} \mathbf{K}^{-1} \mathbf{k}$

Hence $\mu_t(\cdot)$ and $\sigma_t^2(\cdot)$ are sufficient statistics for the predictive posterior distribution $P(f_{t+1} \mid \mathcal{D}_{1:t}, x_{t+1})$.
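
A minimal sketch of these predictive equations, assuming a noise-free squared exponential kernel with unit length scale (the kernel choice, helper names, and toy data are my own, not the paper's):

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    # Squared exponential covariance k(x, x') for 1-D inputs.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_new, X, f, jitter=1e-10):
    # mu_t(x_new)      = k^T K^{-1} f_{1:t}
    # sigma_t^2(x_new) = k(x_new, x_new) - k^T K^{-1} k
    K = sq_exp_kernel(X, X) + jitter * np.eye(len(X))   # kernel matrix K
    k = sq_exp_kernel(X, x_new)                          # covariance vector(s) k
    K_inv_k = np.linalg.solve(K, k)
    mu = K_inv_k.T @ f
    var = sq_exp_kernel(x_new, x_new).diagonal() - np.sum(k * K_inv_k, axis=0)
    return mu, np.maximum(var, 0.0)

X = np.array([0.1, 0.4, 0.9])
f = np.sin(3 * X)                          # toy noise-free observations f_{1:t}
mu, var = gp_posterior(np.linspace(0, 1, 5), X, f)
print(mu)
print(var)
```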

Covariance function

  • Determines smoothness properties of samples drawn from the GP

The Matérn kernel incorporates a smoothness parameter $\nu$ to permit greater flexibility in modelling functions:

$k(x_i, x_j) = \frac{1}{2^{\nu - 1}\,\Gamma(\nu)} \left(2\sqrt{\nu}\,\lVert x_i - x_j\rVert\right)^{\nu} H_{\nu}\!\left(2\sqrt{\nu}\,\lVert x_i - x_j\rVert\right)$

where $\Gamma(\cdot)$ is the Gamma function and $H_{\nu}(\cdot)$ is the Bessel function of order $\nu$.

Note that as $\nu \to \infty$, the Matérn kernel reduces to the squared exponential kernel.

$\nu$ will often be used as a hyperparameter.
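
A sketch of this kernel, under the assumption that $H_{\nu}$ is the modified Bessel function of the second kind (SciPy's `kv`) and that the length scale is one; the function name and the handling of $r = 0$ are my own:

```python
import numpy as np
from scipy.special import gamma, kv

def matern(r, nu):
    # k(r) = (2 sqrt(nu) r)^nu * H_nu(2 sqrt(nu) r) / (2^(nu-1) Gamma(nu)),
    # with H_nu taken to be the modified Bessel function of the second kind.
    r = np.asarray(r, dtype=float)
    arg = 2.0 * np.sqrt(nu) * np.where(r == 0.0, 1.0, r)  # dummy value at r = 0
    k = arg ** nu * kv(nu, arg) / (2.0 ** (nu - 1) * gamma(nu))
    # The kernel's limit as r -> 0 is 1, so patch those entries explicitly.
    return np.where(r == 0.0, 1.0, k)

r = np.array([0.0, 0.1, 0.5, 1.0])
print(matern(r, nu=0.5))   # small nu: rough, exponential-kernel-like decay
print(matern(r, nu=2.5))   # larger nu: smoother samples
```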

Acquisition functions

We would like the acquisition function to be such that a high acquisition value corresponds to potentially:

  • High prediction
  • High uncertainty

The Maximum Probability of Improvement (MPI) or P-algorithm (since the utility is the probability of improvement) is defined:

$PI(x) = P\!\left(f(x) \geq f(x^+)\right) = \Phi\!\left(\frac{\mu(x) - f(x^+)}{\sigma(x)}\right)$

where

  • $\Phi(\cdot)$ is the normal cumulative distribution function
  • the incumbent

    $x^+ = \operatorname{argmax}_{x_i \in x_{1:t}} f(x_i)$

A drawback of MPI is that it is purely exploitative: points that have a high probability of being infinitesimally greater than $f(x^+)$ are preferred over points that offer potentially larger gains but with greater uncertainty. A common modification is therefore to add a trade-off parameter $\xi \geq 0$:

$PI(x) = P\!\left(f(x) \geq f(x^+) + \xi\right) = \Phi\!\left(\frac{\mu(x) - f(x^+) - \xi}{\sigma(x)}\right)$

Scheduling is sometimes used for $\xi$, starting at a fairly high value to drive exploration and then decreasing it later to drive exploitation.

"It's difficult in choosing brochu10_tutor_bayes_optim_expen_cost_03d96c52df5ea6a510607e2260e0745d11e4bd85.png in MPI, since too large brochu10_tutor_bayes_optim_expen_cost_03d96c52df5ea6a510607e2260e0745d11e4bd85.png makes the search excessively global, while too small makes search excessively greedy.

One would wish the acquisition not only took the probability of improvement into account, but also the magnitude of the improvement a point can potentially yield.

But since directly minimizing the expected deviation (norm) from the true maximum is expensive to compute, we instead use the improvement function:

$I(x) = \max\!\left\{0,\; f_{t+1}(x) - f(x^+)\right\}$

We can then consider the expected improvement, and choose $x$ accordingly, i.e.:

$x = \operatorname{argmax}_{x}\; \mathbb{E}\!\left[\max\{0,\, f_{t+1}(x) - f(x^+)\} \mid \mathcal{D}_{1:t}\right]$

The distribution we're taking the expectation over is the normal posterior distribution characterized by $\mu(x)$ and $\sigma^2(x)$, which has the normal density

$\frac{1}{\sqrt{2\pi}\,\sigma(x)} \exp\!\left(-\frac{\left(\mu(x) - f(x^+) - I\right)^2}{2\sigma^2(x)}\right)$

The expected improvement can then be evaluated analytically by integrating out $I$ (which I found slightly weird since $I$ also depends on $x$; but $I$ is just the integration variable for the improvement value, and the dependence on $x$ enters through $\mu(x)$ and $\sigma(x)$ in the density), yielding:

$EI(x) = \begin{cases} \left(\mu(x) - f(x^+)\right)\Phi(Z) + \sigma(x)\,\phi(Z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}, \qquad Z = \frac{\mu(x) - f(x^+)}{\sigma(x)}$

where

  • $\Phi(\cdot)$ and $\phi(\cdot)$ denote the CDF and PDF, respectively, of the standard normal distribution

Observe that if $\Phi(Z) \approx 1$ (improvement is almost certain), then

$EI(x) \approx \mu(x) - f(x^+)$

which shows how EI takes the magnitude of the potential increase into account, not just its probability.
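
A sketch of the analytic EI expression above on two made-up candidates: one almost certain to improve by a hair, one uncertain but with room for a large gain (numbers are illustrative only):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # EI(x) = (mu - f(x+)) Phi(Z) + sigma phi(Z),  Z = (mu - f(x+)) / sigma,
    # and EI(x) = 0 where sigma(x) = 0.
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    safe_sigma = np.where(sigma > 0, sigma, 1.0)
    z = (mu - f_best) / safe_sigma
    ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)

# Candidate 1: tiny but near-certain gain; candidate 2: uncertain, big upside.
print(expected_improvement(mu=[0.501, 0.45], sigma=[1e-6, 0.3], f_best=0.5))
```

EI prefers the uncertain candidate here, which is exactly the behaviour plain MPI lacks.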

This is a myopic method, in the sense that we're minimizing the difference over a single step (i.e. $f_{t+1}$). We could also do the same thing for $f_{t+k}$ for some $k > 1$; it turns out that analytical expressions can be obtained for these too.

Confidence Bound criteria

  • Component in Sequential Design for Optimization (SDO)
  • Given a random function model, SDO selects points for evaluation based on the lower confidence bound of the prediction:

    $LCB(x) = \mu(x) - \kappa\,\sigma(x)$

  • Since we're maximizing, we're interested in the upper confidence bound:

    $UCB(x) = \mu(x) + \kappa\,\sigma(x)$

  • $\kappa$ is a parameter to be specified by the user
  • Can cast Bayesian optimization problem as multi-armed bandit, with acquisition being the instantaneous regret

    $r(x) = f(x^{\star}) - f(x)$

  • Goal is then to minimize the cumulative regret:

    $R_T = \sum_{t=1}^{T} r(x_t) = \sum_{t=1}^{T} \left( f(x^{\star}) - f(x_t) \right)$

    where $T$ is the number of iterations.

  • For upper confidence bound selection with $\kappa = \sqrt{\nu \tau_t}$ and the hyperparameter $\nu > 0$, we define (see the sketch after this list):

    $\text{GP-UCB}(x) = \mu(x) + \sqrt{\nu \tau_t}\,\sigma(x)$

  • With $\nu = 1$ and $\tau_t = 2 \log\!\left(t^{d/2 + 2}\, \pi^2 / 3\delta\right)$, it can be shown that with high probability this method has no regret, i.e.

    $\lim_{T \to \infty} R_T / T = 0$

    i.e. the average regret converges to zero.

  • This implies a lower bound on the convergence rate (or equivalently, an upper bound on the "convergence time")
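
A sketch of GP-UCB with the schedule for $\tau_t$ quoted above; the input dimension $d$, the value $\delta = 0.1$, and the candidate means/deviations are illustrative assumptions, not values from the paper:

```python
import numpy as np

def gp_ucb(mu, sigma, t, d=1, nu=1.0, delta=0.1):
    # kappa_t = sqrt(nu * tau_t), tau_t = 2 log(t^(d/2 + 2) pi^2 / (3 delta))
    tau_t = 2.0 * np.log(t ** (d / 2.0 + 2.0) * np.pi ** 2 / (3.0 * delta))
    return mu + np.sqrt(nu * tau_t) * sigma

mu = np.array([0.20, 0.50, 0.45])
sigma = np.array([0.05, 0.01, 0.30])
for t in (1, 10, 100):          # kappa_t grows slowly with t, favouring exploration
    print(t, gp_ucb(mu, sigma, t))
```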

Maximizing the acquisition function

Noise

  • Easy enough to incorporate by adding $\sigma^2_{\text{noise}} I$ to the kernel matrix $\mathbf{K}$ and deriving the predictive distribution from the GP
  • Change the definition of the incumbent in the PI and EI formulations to use the distribution at the sample points, instead of the maximum, and let the incumbent be the point with the highest expected value over the noise (a sketch follows this list):

    $\mu^+ = \operatorname{argmax}_{x_i \in x_{1:t}} \mu(x_i)$
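
A sketch of both changes, assuming a squared exponential kernel and a made-up noise variance: $\mathbf{K}$ gets $\sigma^2_{\text{noise}} I$ added, and the incumbent is taken as the sample point with the highest posterior mean:

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def noisy_gp_mean(x_new, X, y, noise_var=0.01):
    # Posterior mean with K replaced by K + sigma_noise^2 I.
    K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))
    k = sq_exp_kernel(X, x_new)
    return k.T @ np.linalg.solve(K, y)

rng = np.random.default_rng(0)
X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.sin(3 * X) + 0.1 * rng.normal(size=X.shape)   # noisy observations

# Incumbent under noise: mu+ = argmax over sample points of the posterior mean.
mu_at_samples = noisy_gp_mean(X, X, y)
print("incumbent:", X[np.argmax(mu_at_samples)], "mu+:", mu_at_samples.max())
```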