Notes on: Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002): Finite-time analysis of the multiarmed bandit problem
Notation
- regret is the loss due to the fact that the globally optimal policy is not followed at all times
- a policy, or allocation strategy, $A$, is an algorithm which chooses the next machine to play based on the sequence of past plays and obtained rewards
- $i$ is the index of a gambling machine
- $n$ denotes the $n$-th time played
- $q^*$, where $q$ is some quantity, denotes the value of $q$ for the optimal machine
- $X_{i,n}$ denotes the reward yielded by machine $i$ at step $n$, i.i.d. according to an unknown law with unknown expectation $\mu_i$ (for the $i$-th gambling machine)
- $T_i(n)$ is the number of times machine $i$ has been played by the policy $A$ during the first $n$ plays
- $\mu_i$ is the expected reward of the $i$-th machine
- $\mu^*$ denotes the maximum expected reward, i.e. the expected reward of the (on average) "best" machine
- $\bar{x}_i$ is the average reward obtained from machine $i$
- $n_i$ is the number of times we've played machine $i$ so far
Overview
Stuff
Multi-armed bandit problem
- A $K$-armed bandit problem is defined by random variables $X_{i,n}$ (representing reward) for $1 \le i \le K$ and $n \ge 1$
- Successive plays of the $i$-th machine are assumed to be i.i.d. according to an unknown law with unknown expectation $\mu_i$
- Rewards across machines are independent (but not necessarily identically distributed), i.e. $X_{i,s}$ and $X_{j,t}$ are independent for each $1 \le i < j \le K$ and $s, t \ge 1$
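To make the setup concrete, here is a minimal sketch (not from the paper) of a $K$-armed bandit with Bernoulli rewards; the class name `BernoulliBandit` and its interface are illustrative assumptions, reused in the UCB1 sketch further down.

```python
import random

class BernoulliBandit:
    """Illustrative K-armed bandit: machine i pays 1 with probability mu[i], else 0."""

    def __init__(self, mu, seed=None):
        self.mu = list(mu)            # unknown expected rewards mu_i, one per machine
        self.K = len(self.mu)
        self.rng = random.Random(seed)

    def play(self, i):
        """Play machine i and return an i.i.d. reward X_{i,n} with support in [0, 1]."""
        return 1.0 if self.rng.random() < self.mu[i] else 0.0
```

Rewards are drawn independently within and across machines, matching the assumptions above, and lie in $[0, 1]$ as required by the theorem below.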
The regret of policy $A$ after $n$ plays is defined by

$$\mu^* n - \sum_{j=1}^{K} \mu_j \, \mathbb{E}[T_j(n)]$$

or, equivalently (since $\sum_{j=1}^{K} T_j(n) = n$),

$$\sum_{j=1}^{K} (\mu^* - \mu_j) \, \mathbb{E}[T_j(n)]$$
Thus, regret is the expected loss due to the fact that the policy does not always play the best machine.
In the paper they've actually written the equation as

$$\mu^* n - \mu_j \sum_{j=1}^{K} \mathbb{E}[T_j(n)]$$

which really seems to be weird notation (the index $j$ appears both outside and inside the sum), and thus I have corrected it above.
Here $\mathbb{E}[T_j(n)]$ denotes the expected number of times policy $A$ plays machine $j$ after a total number of $n$ "rounds" / steps.
Therefore we're basically saying that the regret of a policy is defined to be the difference between playing the optimal machine at every step (i.e. playing it all $n$ times) and the "average run" of our policy $A$, where the average reward the policy accumulates from each of the $K$ machines is simply the machine's expected reward $\mu_j$ times the average number of times policy $A$ chooses to play machine $j$ in $n$ steps.
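As a quick sanity check (with made-up numbers, not from the paper), both expressions for the regret agree on a toy instance:

```python
# Toy instance: K = 3 machines, n = 100 plays (all values made up).
mu = [0.9, 0.5, 0.4]             # expected rewards mu_j; machine 0 is optimal
expected_T = [70.0, 20.0, 10.0]  # hypothetical E[T_j(n)]; they sum to n = 100
n = sum(expected_T)
mu_star = max(mu)

# Definition: mu* n - sum_j mu_j E[T_j(n)]
regret_def = mu_star * n - sum(m * t for m, t in zip(mu, expected_T))

# Gap form: sum_j (mu* - mu_j) E[T_j(n)]
regret_gap = sum((mu_star - m) * t for m, t in zip(mu, expected_T))

print(regret_def, regret_gap)  # both are 13.0
```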
Upper Confidence Bound
lai85_asymp_effic_adapt_alloc_rules found, for specific families of reward distributions, policies satisfying

$$\mathbb{E}[T_j(n)] \le \left( \frac{1}{D(p_j \,\|\, p^*)} + o(1) \right) \ln n$$

where $o(1) \to 0$ as $n \to \infty$, and $D(p_j \,\|\, p^*)$ is the Kullback-Leibler divergence between the reward density $p_j$ of a suboptimal machine $j$ and the reward density $p^*$ of the optimal machine.
Main result
For all $K > 1$, if policy UCB1 is run on $K$ machines having arbitrary reward distributions $P_1, \ldots, P_K$ with support in $[0, 1]$, then its expected regret after any number $n$ of plays is at most

$$\left[ 8 \sum_{i : \mu_i < \mu^*} \frac{\ln n}{\Delta_i} \right] + \left( 1 + \frac{\pi^2}{3} \right) \left( \sum_{j=1}^{K} \Delta_j \right)$$

where $\mu_1, \ldots, \mu_K$ are the expected values of $P_1, \ldots, P_K$ and $\Delta_i := \mu^* - \mu_i$.
Equivalently, for each suboptimal machine $i$ (i.e. with $\mu_i < \mu^*$) we have $\mathbb{E}[T_i(n)] \le \frac{8 \ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$, so every suboptimal machine is played only $O(\ln n)$ times in expectation.
The proof can be found in the "Proofs" section in auer02.
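The bound itself is easy to evaluate; below is a small helper (illustrative, with placeholder $\mu$ values) that transcribes the theorem's expression.

```python
import math

def ucb1_regret_bound(mu, n):
    """Upper bound on UCB1's expected regret after n plays, per the theorem above."""
    mu_star = max(mu)
    deltas = [mu_star - m for m in mu]               # Delta_i = mu* - mu_i
    log_term = 8 * sum(math.log(n) / d for d in deltas if d > 0)
    const_term = (1 + math.pi ** 2 / 3) * sum(deltas)
    return log_term + const_term

print(ucb1_regret_bound([0.9, 0.5, 0.4], n=10_000))  # grows only logarithmically in n
```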
Initialization: Play each machine once.
Loop: Play the machine $j$ that maximizes $\bar{x}_j + \sqrt{\frac{2 \ln n}{n_j}}$, where $n$ is the overall number of plays done so far.
This gives us a trade-off between exploration and maximization, and thus, from the theorem, we obtain convergence towards an optimal strategy through a balance of the two.
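Putting the two steps together, here is a minimal sketch of the strategy in Python, assuming a bandit object with `K` machines and a `play(i)` method such as the illustrative `BernoulliBandit` above (both assumptions, not from the paper):

```python
import math

def ucb1(bandit, total_plays):
    """Run the strategy above: play each machine once, then repeatedly play the
    machine with the highest upper confidence index."""
    K = bandit.K
    counts = [0] * K    # n_j: number of times machine j has been played so far
    sums = [0.0] * K    # cumulative reward obtained from machine j

    # Initialization: play each machine once.
    for j in range(K):
        sums[j] += bandit.play(j)
        counts[j] += 1

    # Loop: play the machine maximizing x_bar_j + sqrt(2 ln n / n_j).
    for n in range(K, total_plays):
        def index(j):
            x_bar = sums[j] / counts[j]               # average reward x_bar_j
            return x_bar + math.sqrt(2 * math.log(n) / counts[j])
        j = max(range(K), key=index)
        sums[j] += bandit.play(j)
        counts[j] += 1

    return counts, sums
```

With the theorem in mind, `counts[j]` for each suboptimal machine should grow only logarithmically in `total_plays`.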
- [lai85_asymp_effic_adapt_alloc_rules] Tze Leung Lai & Herbert Robbins, Asymptotically Efficient Adaptive Allocation Rules, Advances in Applied Mathematics, 6(1), 4-22 (1985).
- [auer02] Peter Auer, Nicolò Cesa-Bianchi & Paul Fischer, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47(2/3), 235-256 (2002).