# Notes on: Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002): Finite-time analysis of the multiarmed bandit problem

## Notation

• regret is the loss due to the fact that the globally optimal policy is not followed at all times
• a policy, or allocation strategy, $A$ is an algorithm which chooses the next machine to play based on the sequence of past plays and obtained rewards
• $i$ is the index of a gambling machine, $i \in \{1, \dots, K\}$
• $n$ denotes the $n$-th play
• a superscript $*$ denotes the corresponding quantity of the optimal machine, e.g. $\mu^*$
• $X_{i,n}$ denotes the reward yielded by machine $i$ at step $n$, i.i.d. according to an unknown law with unknown expectation $\mu_i$ (for the $i$-th gambling machine)
• $T_i(n)$ is the number of times machine $i$ has been played by the policy during the first $n$ plays
• $\mu_i$ is the expected reward of the $i$-th machine
• $\mu^* = \max_{1 \le i \le K} \mu_i$ denotes the maximum expected reward, i.e. the expected reward of the (on average) "best" machine
• $\bar{x}_i$ is the average reward obtained from machine $i$
• $n_i$ is the number of times we've played machine $i$ so far

## Stuff

### Multi-armed bandit problem

• A K-armed bandit problem is defined by random variables $X_{i,n}$ (representing reward) for $1 \le i \le K$ and $n \ge 1$
• Successive plays of the $i$-th machine are assumed to be i.i.d. according to an unknown law with unknown expectation $\mu_i$
• Rewards across machines are independent (but not necessarily identically distributed), i.e. $X_{i,s}$ and $X_{j,t}$ are independent for $i \ne j$ (a minimal simulation sketch follows below)
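To make the setup concrete, here's a minimal simulation of such an environment, assuming Bernoulli rewards (the class and names are my own, not from the paper):

```python
import random

class BernoulliBandit:
    """K-armed bandit where machine i pays reward 1 w.p. means[i], else 0."""

    def __init__(self, means):
        self.means = means  # the unknown expectations mu_i

    def play(self, i):
        # Successive plays of machine i are i.i.d. Bernoulli(mu_i),
        # independent of the rewards of the other machines.
        return 1.0 if random.random() < self.means[i] else 0.0

bandit = BernoulliBandit([0.3, 0.5, 0.7])  # K = 3, mu* = 0.7
print(bandit.play(2))  # one play of machine 2 -> 0.0 or 1.0
```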

The regret of policy $A$ after $n$ plays is defined by

$$\mu^* n - \sum_{j=1}^{K} \mu_j \, \mathbb{E}[T_j(n)]$$

or, equivalently,

$$\sum_{j=1}^{K} (\mu^* - \mu_j) \, \mathbb{E}[T_j(n)]$$

Thus, regret is the expected loss due to the fact that the policy does not always play the best machine.

In the paper they've actually written the equation as

$$\mu^* n - \mu_j \sum_{j=1}^{K} \mathbb{E}[T_j(n)]$$

which really seems to be weird notation ($\mu_j$ sits outside the sum over $j$), and thus I have corrected it above. $\mathbb{E}[T_j(n)]$ denotes the expected number of times policy $A$ plays machine $j$ during a total number of $n$ "rounds" / plays.

Therefore we're basically saying that the regret of a policy $A$ is defined to be the difference between playing the optimal machine at every step (i.e. playing it all $n$ times) and the "average run" of our policy $A$, where the average reward the policy accumulates from each of the machines is simply the machine's expected reward $\mu_j$ times the expected number of times $\mathbb{E}[T_j(n)]$ the policy chooses to play machine $j$ in $n$ plays.
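Spelled out as code (my own sketch; `mus` and `expected_plays` stand in for $\mu_j$ and $\mathbb{E}[T_j(n)]$):

```python
def regret(mus, expected_plays):
    """Regret after n = sum(expected_plays) plays:
    mu* * n - sum_j mu_j * E[T_j(n)]."""
    mu_star = max(mus)
    n = sum(expected_plays)
    return mu_star * n - sum(mu * t for mu, t in zip(mus, expected_plays))

# E.g. mus = [0.3, 0.5, 0.7] and E[T_j(100)] = [10, 20, 70]:
# 0.7 * 100 - (0.3*10 + 0.5*20 + 0.7*70) = 70 - 62 = 8
print(regret([0.3, 0.5, 0.7], [10, 20, 70]))  # ≈ 8.0
```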

### Upper Confidence Bound

lai85_asymp_effic_adapt_alloc_rules found, for specific families of reward distributions, policies satisfying

$$\mathbb{E}[T_j(n)] \le \left( \frac{1}{D(p_j \| p^*)} + o(1) \right) \ln n$$

where $o(1) \to 0$ as $n \to \infty$ and $D(p_j \| p^*)$ is the Kullback-Leibler divergence between the reward density $p_j$ of a suboptimal machine $j$ and the reward density $p^*$ of the optimal machine.
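For intuition, with Bernoulli rewards $D(p_j \| p^*)$ has a simple closed form; a small sketch (mine, not from the paper) of the resulting leading constant:

```python
from math import log

def kl_bernoulli(p, q):
    """KL divergence D(Ber(p) || Ber(q))."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

# Leading constant of the Lai & Robbins bound on E[T_j(n)] / ln(n)
# for a machine with mean 0.5 when the best machine has mean 0.7:
print(1 / kl_bernoulli(0.5, 0.7))  # ≈ 11.47
```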

### Main result

For all $K > 1$, if policy UCB1 is run on $K$ machines having arbitrary reward distributions $P_1, \dots, P_K$ with support in $[0, 1]$, then its expected regret after any number $n$ of plays is at most

$$\left[ 8 \sum_{i : \mu_i < \mu^*} \frac{\ln n}{\Delta_i} \right] + \left( 1 + \frac{\pi^2}{3} \right) \left( \sum_{j=1}^{K} \Delta_j \right)$$

where $\mu_1, \dots, \mu_K$ are the expected values of $P_1, \dots, P_K$ and $\Delta_i = \mu^* - \mu_i$.

Equivalently, for each suboptimal machine $i$ the expected number of plays satisfies

$$\mathbb{E}[T_i(n)] \le \frac{8 \ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$$

The proof can be found in the "Proofs" section in auer02.
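To get a feel for the bound, a quick numeric check (my own sketch; the means are arbitrary):

```python
from math import log, pi

def ucb1_regret_bound(mus, n):
    """Theorem 1 bound: 8 * sum_{i: mu_i < mu*} ln(n)/Delta_i
    plus (1 + pi^2/3) * sum_j Delta_j."""
    mu_star = max(mus)
    deltas = [mu_star - mu for mu in mus]
    log_term = 8 * sum(log(n) / d for d in deltas if d > 0)
    const_term = (1 + pi**2 / 3) * sum(deltas)
    return log_term + const_term

# The bound grows only logarithmically in n:
for n in (100, 10_000, 1_000_000):
    print(n, round(ucb1_regret_bound([0.3, 0.5, 0.7], n), 1))
```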

Initialization: Play each machine once.

Loop: Play machine $j$ that maximizes

$$\bar{x}_j + \sqrt{\frac{2 \ln n}{n_j}}$$

where $n$ is the overall number of plays done so far. This gives us a trade-off between exploration and exploitation (reward maximization), and thus, from the theorem, we obtain convergence towards an optimal strategy through a balance between the two.
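Putting the two steps together, a minimal UCB1 sketch (assuming a `play(i)` function returning rewards in $[0, 1]$, like the Bernoulli bandit above; the function name and signature are my own):

```python
import math

def ucb1(play, K, n_total):
    """Run UCB1 for n_total plays; play(i) returns a reward in [0, 1].
    Returns the empirical means xbar_j and the play counts n_j."""
    counts = [0] * K    # n_j: number of times machine j has been played
    means = [0.0] * K   # xbar_j: average reward obtained from machine j

    # Initialization: play each machine once.
    for j in range(K):
        means[j] = play(j)
        counts[j] = 1

    # Loop: play the machine maximizing xbar_j + sqrt(2 ln n / n_j),
    # where n is the overall number of plays done so far.
    for n in range(K, n_total):
        j = max(range(K),
                key=lambda i: means[i] + math.sqrt(2 * math.log(n) / counts[i]))
        r = play(j)
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]  # incremental average
    return means, counts
```

E.g. `ucb1(bandit.play, 3, 10_000)` with the Bernoulli bandit from above should concentrate most of its plays on machine 2, the one with $\mu^* = 0.7$.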