# Notes on: Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002): Finite-time analysis of the multiarmed bandit problem

## Table of Contents

## Notation

- **regret** is the loss due to the fact that the globally optimal policy is not followed at all times
- a **policy**, or **allocation strategy**, $A$, is an algorithm which chooses the next machine to play based on the sequence of past plays and obtained rewards
- $i$ is the index of a gambling machine
- $n$ denotes the $n$-th play
- a superscript $*$ marks a quantity of the optimal machine, e.g. $\mu^*$
- $X_{i,n}$ denotes the reward yielded by machine $i$ at step $n$, i.i.d. according to an unknown law with unknown expectation $\mu_i$ (for the $i$-th gambling machine)
- $T_i(n)$ is the number of times machine $i$ has been played by the policy during the first $n$ plays
- $\mu_i$ is the expected reward of the $i$-th machine
- $\mu^* = \max_{1 \le i \le K} \mu_i$ denotes the maximum expected reward, i.e. the expected reward of the (on average) "best" machine
- $\bar{x}_i$ is the average reward obtained from machine $i$
- $n_i$ is the number of times we've played machine $i$ so far

## Overview

## Stuff

### Multi-armed bandit problem

- A $K$-armed bandit problem is defined by random variables $X_{i,n}$ (representing reward) for $1 \le i \le K$ and $n \ge 1$
- Successive plays of the $i$-th machine are assumed to be **i.i.d.** according to an unknown law with unknown expectation $\mu_i$
- Rewards across machines are **independent** (but not necessarily identically distributed), i.e. $X_{i,s}$ and $X_{j,t}$ are independent for $i \ne j$
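As a concrete (hypothetical) instance, a $K$-armed bandit with Bernoulli rewards can be sketched in a few lines; the means in `mu` below are made up purely for illustration and are not from the paper:

```python
import random

# Toy K-armed bandit: machine i yields i.i.d. Bernoulli rewards with
# mean mu[i], unknown to the player. These means are made-up examples.
mu = [0.3, 0.5, 0.8]  # machine 2 is optimal, so mu* = 0.8

def pull(i, rng=random):
    """Play machine i once; draws are independent across machines and steps."""
    return 1.0 if rng.random() < mu[i] else 0.0
```

Each call to `pull(i)` is one play of machine $i$, so the i.i.d.-per-machine and independence-across-machines assumptions above are satisfied by construction.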

**Regret of policy $A$** after $n$ plays is defined by

$$\mu^* n - \sum_{j=1}^{K} \mu_j \, \mathbb{E}[T_j(n)]$$

or, equivalently (since $\sum_{j=1}^{K} \mathbb{E}[T_j(n)] = n$),

$$\sum_{j=1}^{K} (\mu^* - \mu_j) \, \mathbb{E}[T_j(n)]$$

Thus, **regret** is the *expected loss* due to the fact that the policy does not always play the best machine.

In the paper they've actually written the equation as

$$\mu^* n - \mu_j \sum_{j=1}^{K} \mathbb{E}[T_j(n)]$$

which really seems to be weird notation (the factor $\mu_j$ sits outside the sum over $j$), and thus I have corrected it above.

$\mathbb{E}[T_j(n)]$ denotes the *expected number of times* policy $A$ plays machine $j$ after a total number of $n$ "rounds" / steps.

Therefore we're basically saying that the regret of a policy is the difference between playing the optimal machine at every step (i.e. playing it $n$ out of $n$ times) and the "average run" of our policy $A$, where the average reward the policy accumulates from machine $j$ is simply the machine's expected reward $\mu_j$ times the expected number of times $\mathbb{E}[T_j(n)]$ the policy chooses to play machine $j$ in $n$ steps.
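To make the two definitions concrete, here is a small numerical check on a hypothetical 3-machine instance; the expected rewards and the expected play counts `E_T` are made up for the example:

```python
# Hypothetical instance: expected rewards and expected play counts after
# n = 100 plays. The two regret formulas should give the same number.
mu = [0.3, 0.5, 0.8]        # expected rewards, so mu* = 0.8
E_T = [10.0, 20.0, 70.0]    # stand-ins for E[T_j(n)], summing to n
n = sum(E_T)
mu_star = max(mu)

# mu* n - sum_j mu_j E[T_j(n)]
regret_v1 = mu_star * n - sum(m * t for m, t in zip(mu, E_T))
# sum_j (mu* - mu_j) E[T_j(n)]
regret_v2 = sum((mu_star - m) * t for m, t in zip(mu, E_T))
assert abs(regret_v1 - regret_v2) < 1e-9  # the two forms agree
```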

### Upper Confidence Bound

[lai85_asymp_effic_adapt_alloc_rules] found, for specific families of reward distributions, policies satisfying

$$\mathbb{E}[T_j(n)] \le \left( \frac{1}{D(p_j \,\|\, p^*)} + o(1) \right) \ln n$$

where $o(1) \to 0$ as $n \to \infty$, and $D(p_j \,\|\, p^*)$ is the Kullback-Leibler divergence between the reward density $p_j$ of a suboptimal machine and the reward density $p^*$ of the optimal machine.

### Main result

For all $K > 1$, if policy UCB1 is run on $K$ machines having arbitrary reward distributions $P_1, \dots, P_K$ with support in $[0, 1]$, then its expected regret after any number $n$ of plays is at most

$$\left[ 8 \sum_{i : \mu_i < \mu^*} \frac{\ln n}{\Delta_i} \right] + \left( 1 + \frac{\pi^2}{3} \right) \left( \sum_{j=1}^{K} \Delta_j \right)$$

where $\mu_1, \dots, \mu_K$ are the expected values of $P_1, \dots, P_K$ and $\Delta_i = \mu^* - \mu_i$.

Equivalently, since the regret equals $\sum_i \Delta_i \, \mathbb{E}[T_i(n)]$, the bound says that each suboptimal machine $i$ satisfies

$$\mathbb{E}[T_i(n)] \le \frac{8 \ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$$
The proof can be found in the "Proofs" section of [auer02].

**Initialization:** Play each machine once.

**Loop:** Play machine $j$ that maximizes

$$\bar{x}_j + \sqrt{\frac{2 \ln n}{n_j}}$$

where $\bar{x}_j$ is the average reward obtained from machine $j$, $n_j$ is the number of times machine $j$ has been played so far, and $n$ is the overall number of plays done so far.

This gives us a trade-off between exploration and exploitation (maximization), and thus, from the theorem, we obtain convergence towards an optimal strategy through balancing the two.
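The two steps above translate almost directly into code. A minimal sketch follows; the Bernoulli reward machines and the `pull` interface are assumptions made for this demo, not part of the paper:

```python
import math
import random

def ucb1(pull, K, horizon):
    """Minimal UCB1 sketch: pull(i) returns a reward in [0, 1] for machine i."""
    counts = [0] * K      # n_j: times machine j has been played
    sums = [0.0] * K      # total reward obtained from machine j
    # Initialization: play each machine once.
    for j in range(K):
        counts[j] += 1
        sums[j] += pull(j)
    # Loop: play the machine maximizing avg reward + sqrt(2 ln n / n_j).
    for n in range(K, horizon):
        j = max(range(K),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2 * math.log(n) / counts[j]))
        counts[j] += 1
        sums[j] += pull(j)
    return counts

# Demo on made-up Bernoulli machines: the optimal machine (index 2)
# should end up being played most often.
random.seed(42)
mu = [0.2, 0.5, 0.8]
counts = ucb1(lambda i: float(random.random() < mu[i]), K=3, horizon=2000)
```

Note how the exploration bonus $\sqrt{2 \ln n / n_j}$ shrinks as a machine accumulates plays, so under-sampled machines keep getting revisited while the empirically best machine dominates in the long run.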

- [lai85_asymp_effic_adapt_alloc_rules] T. L. Lai & Herbert Robbins, Asymptotically Efficient Adaptive Allocation Rules, *Advances in Applied Mathematics*, **6(1)**, 4-22 (1985).
- [auer02] Peter Auer, Nicolò Cesa-Bianchi & Paul Fischer, Finite-time Analysis of the Multiarmed Bandit Problem, *Machine Learning*, **47(2/3)**, 235-256 (2002).