Notes on: Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002): Finite-time analysis of the multiarmed bandit problem

  • regret is the loss due to the fact that the globally optimal policy is not followed at all times
  • a policy or allocation strategy, A, is an algorithm which chooses the next machine to play based on the sequence of past plays and obtained rewards
  • i is the index of a gambling machine
  • n denotes the n-th play (time step)
  • a superscript *, as in μ*, denotes the corresponding quantity of the optimal machine
  • X_{i,n} denotes the reward yielded by machine i at step n, i.i.d. according to an unknown law with unknown expectation μ_i (for the i-th gambling machine)
  • T_i(n) is the number of times machine i has been played by the policy A during the first n plays
  • μ_i is the expected reward of the i-th machine
  • μ* = max_i μ_i denotes the maximum expected reward, i.e. the expected reward of the (on average) "best" machine
  • Δ_i = μ* − μ_i is the expected loss per play of machine i relative to the best machine
  • x̄_i is the average reward obtained from machine i
  • n_j is the number of times we've played machine j so far



Multi-armed bandit problem

  • A K-armed bandit problem is defined by random variables (representing reward) X_{i,n} for 1 ≤ i ≤ K and n ≥ 1
  • Successive plays of the i-th machine are assumed to be i.i.d. according to an unknown law with unknown expectation μ_i
  • Rewards across machines are independent (but not necessarily identically distributed), i.e. X_{i,s} and X_{j,t} are independent for i ≠ j and each s, t ≥ 1
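This setup can be sketched as a small simulator. Bernoulli rewards are an illustrative assumption here (the class name and μ values are mine; the theorem below only requires support in [0, 1]):

```python
import random

class BernoulliBandit:
    """K-armed bandit: machine i yields reward 1 with probability mu[i], else 0.

    Successive pulls of the same machine are i.i.d.; machines are independent.
    """
    def __init__(self, mu, seed=0):
        self.mu = list(mu)              # unknown expected rewards mu_i
        self.rng = random.Random(seed)

    def pull(self, i):
        """One play of machine i; returns its random reward."""
        return 1.0 if self.rng.random() < self.mu[i] else 0.0

bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(2)   # one play of machine 2, reward in {0.0, 1.0}
```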

Regret of policy A after n plays is defined by

μ* · n − Σ_{j=1}^{K} μ_j · E[T_j(n)]

or, equivalently (using Σ_{j=1}^{K} E[T_j(n)] = n),

Σ_{j=1}^{K} (μ* − μ_j) · E[T_j(n)]

Thus, regret is the expected loss due to the fact that the policy does not always play the best machine.

In the paper they've actually written the equation as

μ* · n − μ_j · Σ_{j=1}^{K} E[T_j(n)]

which really seems to be weird notation (the summation index j also appears outside the sum), and thus I have corrected it above.

E[T_j(n)] denotes the expected number of times policy A plays machine j after a total number of n "rounds" / steps.

Therefore we're basically saying that the regret of a policy is the difference between playing the optimal machine at every step (i.e. playing it n out of n times, for an expected reward of μ* · n) and the "average run" of our policy A, where the expected reward the policy accumulates from machine j is simply the machine's expected reward μ_j times the expected number of times policy A chooses to play machine j in n steps, E[T_j(n)].
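As a minimal sketch of this definition, the following computes regret from hypothetical expected play counts (the μ values and counts are made up for illustration):

```python
def expected_regret(mu, expected_plays):
    """Regret = mu* . n - sum_j mu_j . E[T_j(n)]
             = sum_j (mu* - mu_j) . E[T_j(n)]."""
    mu_star = max(mu)                   # expected reward of the best machine
    n = sum(expected_plays)             # total number of plays
    return mu_star * n - sum(m * t for m, t in zip(mu, expected_plays))

# Example: 100 plays split 10 / 20 / 70 across machines with mu = (0.2, 0.5, 0.7)
r = expected_regret([0.2, 0.5, 0.7], [10, 20, 70])
```

Note that a policy playing only the best machine (or any machine tied with it) has regret zero, matching the definition.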

Upper Confidence Bound

lai85_asymp_effic_adapt_alloc_rules found, for specific families of reward distributions, policies satisfying

E[T_j(n)] ≤ ( 1 / D(p_j ‖ p*) + o(1) ) · ln n

where o(1) → 0 as n → ∞ and D(p_j ‖ p*) is the Kullback-Leibler divergence between the reward density p_j of a suboptimal machine and the reward density p* of the optimal machine.
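A sketch of the asymptotic play count, assuming Bernoulli reward distributions so that D(p_j ‖ p*) has a closed form (the parameter values are illustrative):

```python
import math

def bernoulli_kl(p, q):
    """Kullback-Leibler divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12                               # clamp away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Asymptotically, an optimal policy plays a suboptimal machine j
# roughly ln(n) / D(p_j || p*) times in n plays.
n = 100_000
p_j, p_star = 0.4, 0.5
plays = math.log(n) / bernoulli_kl(p_j, p_star)
```

The closer p_j is to p*, the smaller the divergence, and thus the more plays are needed to distinguish the suboptimal machine from the optimal one.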

Main result

For all K > 1, if policy UCB1 is run on K machines having arbitrary reward distributions P_1, …, P_K with support in [0, 1], then its expected regret after any number n of plays is at most

8 · Σ_{i: μ_i < μ*} (ln n / Δ_i) + (1 + π²/3) · Σ_{j=1}^{K} Δ_j

where μ_1, …, μ_K are the expected values of P_1, …, P_K and Δ_i = μ* − μ_i.
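A sketch evaluating this bound numerically, for hypothetical μ values of my choosing:

```python
import math

def ucb1_regret_bound(mu, n):
    """Upper bound on UCB1's expected regret after n plays:
    8 * sum_{i: mu_i < mu*} ln(n) / Delta_i + (1 + pi^2/3) * sum_j Delta_j."""
    mu_star = max(mu)
    deltas = [mu_star - m for m in mu]        # Delta_i = mu* - mu_i
    log_term = 8 * sum(math.log(n) / d for d in deltas if d > 0)
    const_term = (1 + math.pi ** 2 / 3) * sum(deltas)
    return log_term + const_term

bound = ucb1_regret_bound([0.2, 0.5, 0.7], n=10_000)
```

The bound grows only logarithmically in n, which is the point of the theorem: the fraction of plays wasted on suboptimal machines vanishes as n grows.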



The proof can be found in the "Proofs" section in auer02.

Initialization: Play each machine once.

Loop: Play machine j that maximizes x̄_j + √(2 ln n / n_j), where x̄_j is the average reward obtained from machine j, n_j is the number of times machine j has been played so far, and n is the overall number of plays done so far.

This gives us a trade-off between exploration and maximization, and thus, by the theorem above, we obtain convergence towards an optimal strategy through a balance between the two.
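The two steps above can be sketched in a few lines of Python; the Bernoulli reward model and horizon are illustrative assumptions, not part of the algorithm:

```python
import math, random

def ucb1(pull, K, horizon):
    """UCB1: play each machine once, then always play the machine j that
    maximizes x_bar_j + sqrt(2 ln n / n_j)."""
    counts = [0] * K          # n_j: plays of machine j so far
    means = [0.0] * K         # x_bar_j: average reward of machine j
    for n in range(1, horizon + 1):
        if n <= K:
            j = n - 1         # initialization: play each machine once
        else:
            j = max(range(K),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(n) / counts[i]))
        r = pull(j)
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]   # incremental average update
    return counts, means

# Hypothetical Bernoulli rewards for illustration
rng = random.Random(0)
mu = [0.2, 0.5, 0.7]
counts, means = ucb1(lambda j: 1.0 if rng.random() < mu[j] else 0.0,
                     K=3, horizon=5000)
```

The √(2 ln n / n_j) term shrinks for frequently played machines and grows (slowly) with total plays, so rarely played machines are periodically revisited, which is exactly the exploration/maximization balance discussed above.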

  • [lai85_asymp_effic_adapt_alloc_rules] Tze Leung Lai & Herbert Robbins, Asymptotically Efficient Adaptive Allocation Rules, Advances in Applied Mathematics, 6(1), 4-22 (1985). link. doi.
  • [auer02] Peter Auer, Nicolò Cesa-Bianchi & Paul Fischer, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47(2/3), 235-256 (2002). link. doi.