# Bandits

## Sequential decision problems

• Reward/feedback type:
• Bandit
• Only observe information/feedback for the action taken.
• Full information
• Observe information/feedback for all the actions.
• Partial information
• Observe information/feedback for a subset (potentially sparse) of possible actions.
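As a toy illustration of the three feedback regimes (the 5-action setup and the observed subset are our illustrative choices, not from any particular reference):

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.uniform(size=5)  # one reward per action this round
chosen = 2                     # the action the learner actually takes

# Bandit feedback: only the chosen action's reward is revealed.
bandit_obs = {chosen: rewards[chosen]}

# Full information: every action's reward is revealed.
full_obs = {a: rewards[a] for a in range(len(rewards))}

# Partial information: rewards for a (possibly sparse) subset of actions.
subset = {1, 2, 4}             # hypothetical observed subset
partial_obs = {a: rewards[a] for a in subset}
```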

## Bandit book lattimore2018bandit

### Notation

• $\nu = (P_a : a \in \mathcal{A})$ denotes a stochastic bandit
• $\Delta_a = \mu^* - \mu_a$ is the suboptimality gap or action gap or immediate regret of action $a$
• $T_a(t) = \sum_{s=1}^{t} \mathbb{1}\{A_s = a\}$ is the number of times action $a$ was chosen by the learner at the end of round $t$

In general, this is random, since the action taken at time $t$ is a random variable depending on all previous observations and actions.
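A minimal sketch of the counting notation, computing the gaps $\Delta_a$ and counts $T_a(t)$ for a hypothetical 3-armed instance and trajectory:

```python
import numpy as np

mu = np.array([0.5, 0.9, 0.7])   # true mean rewards (illustrative)
gaps = mu.max() - mu             # suboptimality gaps Delta_a

actions = [0, 1, 1, 2, 1, 0, 1]  # A_1, ..., A_7: a hypothetical trajectory

def count(a, t):
    """T_a(t): number of times action a was chosen up to and including round t."""
    return sum(1 for s in range(t) if actions[s] == a)
```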

### 2. Foundations of Probability

#### 2.2 σ-algebras and knowledge

• Collection of sequential observations (rvs.)
• The set of measurable maps from histories to actions then characterizes all the possible ways the learner can respond.
• I.e. it suffices to concentrate on the σ-algebra generated by the observations
• Reminder: $(X_t)_t$ is adapted to the filtration $(\mathcal{F}_t)_t$ if $X_t$ is $\mathcal{F}_t$-measurable for each $t$.

### 4. Stochastic Bandits

#### 4.5 Decomposing regret

For any policy $\pi$ and stochastic bandit environment $\nu$ with $\mathcal{A}$ finite or countable and horizon $n$, the regret $R_n$ of policy $\pi$ in $\nu$ satisfies

$$R_n = \sum_{a \in \mathcal{A}} \Delta_a \, \mathbb{E}[T_a(n)]$$
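The decomposition can be checked numerically on a single sample path, since $n \mu^* - \sum_{t} \mu_{A_t} = \sum_{a} \Delta_a T_a(n)$ holds deterministically; the uniform-random policy below is just a way to generate a trajectory:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.2, 0.5, 0.8])   # illustrative mean rewards
gaps = mu.max() - mu
n = 1000

A = rng.integers(0, 3, size=n)   # uniformly random actions A_1..A_n
T = np.bincount(A, minlength=3)  # T_a(n) for each arm a

# Pseudo-regret computed two ways; the decomposition says they agree.
regret_direct = n * mu.max() - mu[A].sum()
regret_decomp = (gaps * T).sum()
```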

### 5. Concentration of Measure

#### Exercises

##### 5.3

Eq. (5.2):

Eq. (5.4):

which means

Then the ratio between the two rates:

Then if , we have and is dominated by

Alternatively,

then

so,

As a result, on the lower-bound of the ratio is monotonically increasing in . On , the lower-bound of the ratio decreases monotonically in . For the increase, we can for certain say that

We know for , the factor completely dominates. For , we can use the fact that

which means that

But for the question of when , i.e. Eq. (5.2) is best, and , i.e. Eq. (5.4) is best, observe that if

which means Eq. (5.4) is best (BUT, so it doesn't actually provide us with any more information!).

Moreover, in the case , we know for certain that and thus Eq. (5.4) is the best.

We can refine our lower-bound of when by using the slightly tighter lower-bound for , where we have

and we can solve for RHS = 1, or equiv.,

which is equivalent to the quadratic

which has solutions

Therefore the solutions have a non-zero imaginary component, i.e. this doesn't provide us with any more insight.

But, we can consider when this quadratic is and which gives us

### 6. The Explore-then-Commit Algorithm

#### Exercises

##### 6.1 subGaussian empirical estimates

Let $\pi$ be the policy of ETC and $(P_i)_{i=1}^{k}$ be the 1-subGaussian distributions associated with the arms.

Provide a fully rigorous proof of the claim that

is . You should only use the definitions and the interaction protocol, which states that

Recall that

• The mean reward for arm $i$ is estimated as

$$\hat{\mu}_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^{t} \mathbb{1}\{A_s = i\} X_s$$

where

$$T_i(t) = \sum_{s=1}^{t} \mathbb{1}\{A_s = i\}$$

is the number of times action $i$ has been played after round $t$.

• The ETC policy for parameter $m$, which we denote $\pi_{\mathrm{ETC}}(m)$, is defined by

$$A_t = \begin{cases} (t \bmod k) + 1 & \text{if } t \le mk \\ \operatorname{argmax}_i \hat{\mu}_i(mk) & \text{if } t > mk \end{cases}$$

First we note that it is sufficient to show that each $\hat{\mu}_i(mk) - \mu_i$ is subGaussian for all $i$ and that they're independent, since we can decompose the expression as:

and by Lemma 5.4 (c) (which we need to prove!) a sum of two independent subGaussian rvs is subGaussian [Easy to prove though (Y)]

[I guess we need the interaction protocol to actually prove that we can consider the mean-estimates as independent?]
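A minimal runnable sketch of ETC itself (the Gaussian rewards and the two-armed instance are our illustrative choices; Gaussian with unit variance is 1-subGaussian):

```python
import numpy as np

def etc(m, mu, n, rng):
    """Explore-then-commit: play each arm m times, then commit to the
    empirically best arm for the remaining n - m*k rounds.
    Returns (total reward collected, index of the arm committed to)."""
    k = len(mu)
    total = 0.0
    means = np.zeros(k)
    for a in range(k):                     # exploration phase
        x = rng.normal(mu[a], 1.0, size=m)
        means[a] = x.mean()
        total += x.sum()
    best = int(np.argmax(means))           # commit phase
    total += rng.normal(mu[best], 1.0, size=n - m * k).sum()
    return total, best

rng = np.random.default_rng(0)
reward, best = etc(m=100, mu=[0.0, 1.0], n=1000, rng=rng)
```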

### 11. The Exp3 Algorithm

#### Notation

• $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid A_1, X_1, \dots, A_{t-1}, X_{t-1}]$ denotes conditional expectation given the history up to time $t$

• IS estimator of $x_{t,i}$ is

$$\hat{X}_{t,i} = \frac{\mathbb{1}\{A_t = i\} X_t}{P_{t,i}}$$

which satisfies

$$\mathbb{E}_t[\hat{X}_{t,i}] = x_{t,i}$$

i.e. unbiased estimator. Note that in this case the conditional variance is

$$\mathbb{V}_t[\hat{X}_{t,i}] = x_{t,i}^2 \, \frac{1 - P_{t,i}}{P_{t,i}}$$

• Alternative (unbiased) estimator

$$\hat{X}_{t,i} = 1 - \frac{\mathbb{1}\{A_t = i\}(1 - X_t)}{P_{t,i}}$$

which is more easily seen by rearranging

$$1 - \hat{X}_{t,i} = \frac{\mathbb{1}\{A_t = i\}(1 - X_t)}{P_{t,i}}$$

i.e. RHS is an estimate of the "loss" $1 - x_{t,i}$. Note that in this case the conditional variance is

$$\mathbb{V}_t[\hat{X}_{t,i}] = (1 - x_{t,i})^2 \, \frac{1 - P_{t,i}}{P_{t,i}}$$

which is small whenever $x_{t,i}$ is close to 1.
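The unbiasedness of both estimators can be verified exactly by summing over the possible draws of $A_t$ (the distribution $P$ and rewards $x$ below are arbitrary illustrative numbers):

```python
import numpy as np

P = np.array([0.2, 0.3, 0.5])   # sampling distribution P_{t,.} at round t
x = np.array([0.1, 0.6, 0.9])   # adversary's rewards x_{t,.}
k = len(P)

# E_t[hat X_{t,i}] computed exactly: condition on each possible draw A_t = a.
#   estimator 1 given A_t = a:  1{a == i} * x[a] / P[i]
#   estimator 2 given A_t = a:  1 - 1{a == i} * (1 - x[a]) / P[i]
est1 = [sum(P[a] * ((a == i) * x[a] / P[i]) for a in range(k))
        for i in range(k)]
est2 = [sum(P[a] * (1 - (a == i) * (1 - x[a]) / P[i]) for a in range(k))
        for i in range(k)]
```

Both expectations recover $x_{t,i}$ exactly, since the indicator kills every term except $a = i$.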

#### 11.3 The Exp3 algorithm

• Exp3 = Exponential-weight algorithm for Exploration and Exploitation
• The idea is to construct unbiased estimators for the rewards from each arm by the use of IS.
• Exp3 is a particular way of obtaining a distribution $P_t = (P_{t,1}, \dots, P_{t,k})$ over the arms at time $t$:

$$P_{t,i} = \frac{\exp\big( \eta \hat{S}_{t-1,i} \big)}{\sum_{j=1}^{k} \exp\big( \eta \hat{S}_{t-1,j} \big)}$$

where

$$\hat{S}_{t,i} = \sum_{s=1}^{t} \hat{X}_{s,i}$$

where

$$\hat{X}_{s,i} = 1 - \frac{\mathbb{1}\{A_s = i\}(1 - X_s)}{P_{s,i}}$$

where $x_{t,i}$ is the reward at time-step $t$ chosen for arm $i$ by the adversary (and $X_s = x_{s,A_s}$ is the observed reward).

• $\eta$ is referred to as the learning rate
• In the simple case $\eta$ will be chosen depending on the number of arms $k$ and the horizon $n$

Let

• $\pi$ be the policy of Exp3 with learning rate $\eta = \sqrt{\log(k) / (nk)}$

Then

$$R_n(\pi) \le 2 \sqrt{n k \log(k)}$$

For any arm $i$ define

$$R_{n,i} = \sum_{t=1}^{n} x_{t,i} - \mathbb{E}\left[ \sum_{t=1}^{n} X_t \right]$$

which is the expected regret relative to using action $i$ in all the rounds.

We will bound $R_{n,i}$ for all arms $i$, including the optimal arm.
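A compact runnable sketch of Exp3 with the loss-based estimator (the adversary that always pays arm 1 is our illustrative choice):

```python
import numpy as np

def exp3(x, eta, rng):
    """Exp3 on a reward matrix x of shape (n, k): x[t, i] is the adversary's
    reward for arm i at round t. Returns the sequence of chosen arms."""
    n, k = x.shape
    S = np.zeros(k)                      # cumulative estimates hat S_{t,i}
    arms = np.empty(n, dtype=int)
    for t in range(n):
        w = np.exp(eta * (S - S.max()))  # shift by max for numerical stability
        P = w / w.sum()
        a = rng.choice(k, p=P)
        arms[t] = a
        # loss-based estimator: hat X_i = 1 for i != a, 1 - (1 - x_a)/P_a for i = a
        S += 1.0
        S[a] -= (1.0 - x[t, a]) / P[a]
    return arms

rng = np.random.default_rng(0)
n, k = 2000, 3
x = np.zeros((n, k))
x[:, 1] = 1.0                            # arm 1 always pays 1
eta = np.sqrt(np.log(k) / (n * k))
arms = exp3(x, eta, rng)
```

With this gap, the regret bound above implies Exp3 concentrates its plays on arm 1.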

### 18. Contextual Bandits

For any , it holds that

Recall that

i.e.

### 19. Linear Stochastic Bandits

#### 19.3 Regret Analysis

Let $V_0$ be positive definite and $a_1, \dots, a_n$ be a sequence of vectors with $\lVert a_t \rVert_2 \le L$ for all $t$, and let $V_t = V_0 + \sum_{s=1}^{t} a_s a_s^\top$.

Then

$$\sum_{t=1}^{n} \min\left\{ 1, \lVert a_t \rVert_{V_{t-1}^{-1}}^2 \right\} \le 2 \log\left( \frac{\det V_n}{\det V_0} \right)$$

In the final inequality we have the following argument.

Operator norm of the matrix is going to be bounded by , since

where in the third inequality we've made use of the Cauchy–Schwarz inequality and the assumption that .

The factor of 2 shows up because we do

and each of these terms are bounded by due to the vectors being in .
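The elliptical potential bound can be sanity-checked numerically (the dimensions and the unit-ball vectors below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 4, 200, 1.0
V = np.eye(d)                            # V_0 positive definite
logdet0 = np.linalg.slogdet(V)[1]

total = 0.0
for _ in range(n):
    a = rng.normal(size=d)
    a *= L / np.linalg.norm(a)           # enforce ||a_t||_2 <= L
    total += min(1.0, a @ np.linalg.solve(V, a))   # ||a_t||^2_{V_{t-1}^{-1}}
    V += np.outer(a, a)                  # V_t = V_{t-1} + a_t a_t^T

logdetn = np.linalg.slogdet(V)[1]
```

Here `total` is the left-hand side of the lemma and `2 * (logdetn - logdet0)` the right-hand side.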

### Errata

• p. 79: in point 3), there's a sum without any terms…

## Linear bandits

### Contextual bandits

#### Batched bandits

• han20_sequen_batch_learn_finit_action
• Analysis of both adversarial and stochastic contexts (both with stochastic rewards)
• Finite number of arms
• Number of batches is allowed to scale wrt. , and for the relevant results, they provide lower-bounds on the number of batches necessary to achieve the desired regret.
• Analysis is done by analysing the prediction error of a linear regression model which has seen observations, where is the index of the batch. From this we can deduce the regret of following the predictions based on these previous observations rather than the online version.
• Algorithm presented is essentially LinUCB?
• Lower-bound on the regret of any sequential batched algorithm for the case of han20_sequen_batch_learn_finit_action

where

• is the number of arms
• is dimensionality of
• Reward is modelled as

where .

• is horizon
• is number of batches
• denotes the algorithm, which consists of a "time-grid" (with each "interval" in the grid corresponding to a single batch, so that ) and the learning algorithm/policy which maps previous observations to the next action to take.
• is a universal constant independent of
• Stochastic contexts
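A rough sketch of the batched protocol (this is our hypothetical greedy variant for illustration, not the paper's exact algorithm): fit a ridge regression only at batch boundaries, and act on the stale estimate within each batch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, M = 3, 5, 1000, 4                    # dims, arms, horizon, batches
theta = rng.normal(size=d)                    # unknown parameter (illustrative)
grid = np.linspace(0, T, M + 1).astype(int)   # uniform time grid t_0 < ... < t_M

X, y = [], []                                 # data from finished batches only
theta_hat = np.zeros(d)
for j in range(M):
    for t in range(grid[j], grid[j + 1]):
        # stochastic contexts: K fresh random arm features each round
        arms = rng.normal(size=(K, d))
        a = arms[np.argmax(arms @ theta_hat)]  # greedy w.r.t. the last fit
        X.append(a)
        y.append(a @ theta + rng.normal())     # linear reward + noise
    # refit ridge regression only at the batch boundary
    A = np.array(X)
    b = np.array(y)
    theta_hat = np.linalg.solve(A.T @ A + np.eye(d), A.T @ b)
```

The point of the sketch: within a batch the policy cannot react to its own observations, which is exactly the source of the extra regret the lower bound quantifies.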