Bandits
Sequential decision problems
- Reward/feedback type:
  - Bandit
    - Only observe information/feedback for the action taken.
  - Full information
    - Observe information/feedback for all the actions.
  - Partial information
    - Observe information/feedback for a subset (potentially sparse) of possible actions.
- Bandit
Bandit book lattimore2018bandit
Notation
- $\nu = (P_a : a \in \mathcal{A})$ denotes a stochastic bandit.
- $\Delta_a = \mu^* - \mu_a$ is the suboptimality gap or action gap or immediate regret of action $a$.
- $T_a(t) = \sum_{s=1}^{t} \mathbb{1}\{A_s = a\}$ is the number of times action $a$ was chosen by the learner by the end of round $t$.
  - In general, this is random, since the action $A_t$ taken at time $t$ is a random variable depending on all previous observations and actions.
2. Foundations of Probability
2.2 σ-algebras and knowledge
- $X_1, X_2, \ldots, X_t$: collection of sequential observations (rvs.).
- The set of maps (random variables) that are $\sigma(X_1, \ldots, X_t)$-measurable then characterizes all the possible ways the learner can respond to the first $t$ observations.
- I.e. it suffices to concentrate on the σ-algebra $\mathcal{F}_t = \sigma(X_1, \ldots, X_t)$ generated by the observations.
- Reminder: a process $(Z_t)_t$ is adapted to a filtration $(\mathcal{F}_t)_t$ if $Z_t$ is $\mathcal{F}_t$-measurable for each $t$.
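  - Example (mine, not from the book): with $\mathcal{F}_t = \sigma(X_1, \ldots, X_t)$, the running empirical mean $\hat{\mu}_t = \frac{1}{t} \sum_{s=1}^{t} X_s$ is a measurable function of $X_1, \ldots, X_t$ alone, hence $\mathcal{F}_t$-measurable for each $t$, so $(\hat{\mu}_t)_t$ is adapted to $(\mathcal{F}_t)_t$.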
3. Markov Chains
4. Stochastic Bandits
4.5 Decomposing regret
For any policy $\pi$ and stochastic bandit environment $\nu$ with $\mathcal{A}$ finite or countable and horizon $n \in \mathbb{N}$, the regret $R_n$ of policy $\pi$ in $\nu$ satisfies
$$R_n = \sum_{a \in \mathcal{A}} \Delta_a \, \mathbb{E}[T_a(n)].$$
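As a sanity check (my own sketch, not from the book), the decomposition can be verified numerically on a small Bernoulli bandit; the arm means and the deliberately naive uniformly random policy are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.6, 0.7])     # arbitrary Bernoulli arm means (assumption)
k, n, n_runs = len(mu), 1000, 2000
Delta = mu.max() - mu              # suboptimality gaps Delta_a

regret_direct = 0.0                # estimate of R_n = n * mu^* - E[sum of rewards]
ET = np.zeros(k)                   # running estimate of E[T_a(n)]
for _ in range(n_runs):
    A = rng.integers(k, size=n)            # a deliberately naive uniform policy
    X = rng.binomial(1, mu[A])             # Bernoulli rewards of the played arms
    regret_direct += (n * mu.max() - X.sum()) / n_runs
    ET += np.bincount(A, minlength=k) / n_runs

regret_decomposed = float(Delta @ ET)      # sum_a Delta_a * E[T_a(n)]
print(regret_direct, regret_decomposed)    # the two numbers should roughly agree
```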
5. Concentration of Measure
Exercises
5.3
Eq. (5.2):
Eq. (5.4):
which means
Then the ratio between the two rates:
Then if , we have and is dominated by
Alternatively,
then
so,
As a result, on , the lower-bound of the ratio is monotonically increasing in . On , the lower-bound of the ratio decreases monotonically in . For the increase, we can say for certain that
We know that for , the factor completely dominates. For , we can use the fact that
which means that
But for the question of when , i.e. Eq. (5.2) is best, and , i.e. Eq. (5.4) is best, observe that if
which means Eq. (5.4) is best (BUT, so it doesn't actually provide us with any more information!).
Moreover, in the case , we know for certain that and thus Eq. (5.4) is the best.
We can refine our lower-bound of when by using the slightly tighter lower-bound for , where we have
and we can solve for RHS = 1, or equiv.,
which is equivalent to the quadratic
which has solutions
Therefore the solutions have a non-zero imaginary component, i.e. they are not real, and so this doesn't provide us with any more insight.
But, we can consider when this quadratic is and which gives us
6. The Explore-then-Commit Algorithm
Exercises
6.1 subGaussian empirical estimates
Let $\pi$ be the policy of ETC and let $P_1, \ldots, P_k$ be the 1-subGaussian distributions associated with the $k$ arms.
Provide a fully rigorous proof of the claim that
$\hat{\mu}_a(mk) - \mu_a - \bigl(\hat{\mu}_1(mk) - \mu_1\bigr)$
is $\sqrt{2/m}$-subGaussian. You should only use the definitions and the interaction protocol, which states that the conditional law of the reward $X_t$, given $A_1, X_1, \ldots, A_{t-1}, X_{t-1}, A_t$, is $P_{A_t}$.
Recall that the mean reward for arm $a$ is estimated as
$$\hat{\mu}_a(t) = \frac{1}{T_a(t)} \sum_{s=1}^{t} \mathbb{1}\{A_s = a\} X_s,$$
where
$$T_a(t) = \sum_{s=1}^{t} \mathbb{1}\{A_s = a\}$$
is the number of times action $a$ has been played after round $t$.
The ETC policy for parameter $m$, which we denote $\pi_m$, is defined by
$$A_t = \begin{cases} (t \bmod k) + 1 & \text{if } t \le mk, \\ \operatorname*{argmax}_a \hat{\mu}_a(mk) & \text{if } t > mk. \end{cases}$$
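A minimal simulation sketch of the policy just defined (my own illustration; the Gaussian rewards, the arm means and the choice of $m$ are arbitrary placeholders):

```python
import numpy as np

def etc(mu, m, n, rng):
    """Explore-then-commit: play each of the k arms m times (round-robin),
    then commit to the arm with the highest empirical mean for the rest."""
    k = len(mu)
    sums, counts = np.zeros(k), np.zeros(k)
    committed, total = None, 0.0
    for t in range(n):
        if t < m * k:
            a = t % k                                   # exploration phase: round-robin
        else:
            if committed is None:                       # commit once, after m*k rounds
                committed = int(np.argmax(sums / counts))
            a = committed
        x = rng.normal(mu[a], 1.0)                      # Gaussian rewards are 1-subGaussian
        sums[a] += x
        counts[a] += 1
        total += x
    return n * max(mu) - total                          # realised regret of this run

rng = np.random.default_rng(1)
mu = [0.0, -0.1]                                        # two arms with gap 0.1 (assumption)
print(np.mean([etc(mu, m=50, n=2000, rng=rng) for _ in range(200)]))
```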
First we note that it is sufficient to show that $\hat{\mu}_a(mk) - \mu_a$ is $\frac{1}{\sqrt{m}}$-subGaussian for all $a$ and that these quantities are independent across arms, since we can decompose the expression as
$$\hat{\mu}_a(mk) - \mu_a - \bigl(\hat{\mu}_1(mk) - \mu_1\bigr) = \bigl(\hat{\mu}_a(mk) - \mu_a\bigr) - \bigl(\hat{\mu}_1(mk) - \mu_1\bigr),$$
and by Lemma 5.4 (c) (which we need to prove!) a sum of two independent subGaussian rvs is again subGaussian [Easy to prove though (Y)]
[I guess we need the interaction protocol to actually prove that we can consider the mean-estimates as independent?]
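For reference, a sketch of the MGF argument behind Lemma 5.4 (c) in the independent case (my own write-up of the standard argument):

```latex
% Sketch: if X is sigma_1-subGaussian, Y is sigma_2-subGaussian, and X and Y
% are independent, then X + Y is sqrt(sigma_1^2 + sigma_2^2)-subGaussian.
\begin{align*}
  \mathbb{E}\bigl[e^{\lambda (X + Y)}\bigr]
    &= \mathbb{E}\bigl[e^{\lambda X}\bigr] \, \mathbb{E}\bigl[e^{\lambda Y}\bigr]
       && \text{(independence)} \\
    &\le e^{\lambda^2 \sigma_1^2 / 2} \, e^{\lambda^2 \sigma_2^2 / 2}
       && \text{(subGaussian MGF bound for each term)} \\
    &= e^{\lambda^2 (\sigma_1^2 + \sigma_2^2) / 2}
       \qquad \text{for all } \lambda \in \mathbb{R}.
\end{align*}
```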
7. The Upper Confidence Bound (UCB) Algorithm
Notation
Optimism principle
11. The Exp3 Algorithm
Notation
$\mathbb{E}_t[\cdot]$ denotes the conditional expectation given the history up to time $t$, i.e. $\mathbb{E}_t[\cdot] = \mathbb{E}[\,\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1}]$.
The importance-sampling (IS) estimator of $x_{ta}$ is
$$\hat{X}_{ta} = \frac{\mathbb{1}\{A_t = a\}}{P_{ta}} X_t,$$
which satisfies
$$\mathbb{E}_t[\hat{X}_{ta}] = x_{ta},$$
i.e. it is an unbiased estimator. Note that in this case
$$\mathbb{V}_t[\hat{X}_{ta}] = \frac{x_{ta}^2 (1 - P_{ta})}{P_{ta}}.$$
An alternative (unbiased) estimator is
$$\hat{X}_{ta} = 1 - \frac{\mathbb{1}\{A_t = a\}}{P_{ta}} (1 - X_t),$$
which is more easily seen to be unbiased by rearranging to
$$1 - \hat{X}_{ta} = \frac{\mathbb{1}\{A_t = a\}}{P_{ta}} (1 - X_t),$$
i.e. the RHS is an IS estimate of the "loss" $y_{ta} = 1 - x_{ta}$. Note that in this case
$$\mathbb{V}_t[\hat{X}_{ta}] = \frac{(1 - x_{ta})^2 (1 - P_{ta})}{P_{ta}}.$$
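A quick numerical sanity check (my own, with arbitrary choices of $P$ and $x$) that both estimators are unbiased and that their variances differ as stated above:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.2, 0.3, 0.5])      # arbitrary sampling distribution over 3 arms (assumption)
x = np.array([0.9, 0.4, 0.1])      # arbitrary rewards chosen by the adversary for this round
N = 1_000_000                      # Monte Carlo samples of the action A ~ P

A = rng.choice(len(P), size=N, p=P)
X = x[A]                           # observed reward of the played arm

for a in range(len(P)):
    ind = (A == a).astype(float)
    est_reward = ind / P[a] * X                  # IS estimator of x[a]
    est_loss = 1.0 - ind / P[a] * (1.0 - X)      # loss-based alternative estimator
    print(a, x[a], est_reward.mean(), est_loss.mean(),   # means should all roughly agree
          est_reward.var(), est_loss.var())              # variances differ as derived above
```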
Overview
- Adversarial bandits
11.3 The Exp3 algorithm
- Exp3 = Exponential-weight algorithm for Exploration and Exploitation
- The idea is to construct unbiased estimators for the rewards from each arm by the use of IS.
Exp3 is a particular way of obtaining a distribution over the arms at time $t$, $P_t = (P_{t1}, \ldots, P_{tk})$:
$$P_{ta} = \frac{\exp\bigl(\eta \hat{S}_{t-1, a}\bigr)}{\sum_{b=1}^{k} \exp\bigl(\eta \hat{S}_{t-1, b}\bigr)},$$
where
$$\hat{S}_{ta} = \sum_{s=1}^{t} \hat{X}_{sa},$$
where
$$\hat{X}_{sa} = 1 - \frac{\mathbb{1}\{A_s = a\}}{P_{sa}} (1 - X_s),$$
where $X_s = x_{s A_s}$ and $x_{ta}$ is the reward at time-step $t$ chosen for arm $a$ by the adversary.
- $\eta > 0$ is referred to as the learning rate.
- In the simplest case $\eta$ will be chosen depending on the number of arms $k$ and the horizon $n$.
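A minimal sketch of the update just described, using the loss-based estimator (my own illustration; the reward table is a placeholder, and $\eta$ is set as in the theorem below):

```python
import numpy as np

def exp3(x, eta, rng):
    """Exp3 on a fixed (adversarial) reward table x of shape (n, k), using the
    loss-based importance-weighted estimator X_hat = 1 - 1{A=a}/P * (1 - X)."""
    n, k = x.shape
    S_hat = np.zeros(k)                          # cumulative reward estimates S_hat
    total = 0.0
    for t in range(n):
        w = np.exp(eta * (S_hat - S_hat.max()))  # subtract max for numerical stability
        P = w / w.sum()                          # sampling distribution over the arms
        a = rng.choice(k, p=P)
        X = x[t, a]
        total += X
        S_hat += 1.0                             # the "1 - ..." estimator adds 1 to every arm ...
        S_hat[a] -= (1.0 - X) / P[a]             # ... and subtracts the loss term for the played arm
    return total

rng = np.random.default_rng(0)
n, k = 5000, 4
x = rng.uniform(size=(n, k))                     # arbitrary reward table in [0, 1] (assumption)
eta = np.sqrt(2 * np.log(k) / (n * k))
reward = exp3(x, eta, rng)
print(x.sum(axis=0).max() - reward)              # regret relative to the best fixed arm in hindsight
```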
Let $x \in [0, 1]^{n \times k}$ and let $\pi$ be the policy of Exp3 with learning rate $\eta = \sqrt{2 \log(k) / (n k)}$. Then
$$R_n(\pi, x) \le \sqrt{2 n k \log(k)}.$$
For any arm $a$ define
$$R_{na} = \sum_{t=1}^{n} x_{ta} - \mathbb{E}\left[\sum_{t=1}^{n} X_t\right],$$
which is the expected regret relative to using action $a$ in all the rounds.
We will bound $R_{na}$ for all $a$, including the optimal arm.
18. Contextual Bandits
For any , it holds that
Recall that
i.e.
19. Linear Stochastic Bandits
19.3 Regret Analysis
Let $V_0 \in \mathbb{R}^{d \times d}$ be positive definite and $a_1, \ldots, a_n \in \mathbb{R}^d$ be a sequence of vectors with $\|a_t\|_2 \le L$ for all $t$, and set $V_t = V_0 + \sum_{s=1}^{t} a_s a_s^\top$.
Then
$$\sum_{t=1}^{n} \min\Bigl\{1, \|a_t\|_{V_{t-1}^{-1}}^2\Bigr\} \le 2 \log\left(\frac{\det V_n}{\det V_0}\right).$$
In the final inequality we have the following argument.
The operator norm of the matrix is going to be bounded by , since
where in the third inequality we've made use of the Cauchy–Schwarz inequality and the assumption that .
The factor of 2 shows up because we do
and each of these terms is bounded by due to the vectors being in .
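A small numerical check of the determinant bound in the lemma above (my own sketch; the dimension, $V_0 = I$, and the random unit-norm vectors are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 5, 200, 1.0
V = np.eye(d)                                    # V_0 = I (assumption)
lhs = 0.0
logdet0 = np.linalg.slogdet(V)[1]
for _ in range(n):
    a = rng.normal(size=d)
    a *= L / np.linalg.norm(a)                   # enforce ||a_t||_2 <= L
    lhs += min(1.0, a @ np.linalg.solve(V, a))   # ||a_t||^2 in the V_{t-1}^{-1} norm
    V += np.outer(a, a)                          # V_t = V_{t-1} + a_t a_t^T
rhs = 2 * (np.linalg.slogdet(V)[1] - logdet0)    # 2 * log(det V_n / det V_0)
print(lhs, rhs)                                  # lhs <= rhs should hold
```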
20. Confidence Bounds for Least Squares Estimators
20.1
Errata
- p. 79: in point 3), there's a sum without any terms…
Linear bandits
Contextual bandits
Batched bandits
- han20_sequen_batch_learn_finit_action
- Analysis of both adversarial and stochastic contexts (both with stochastic rewards)
- Finite number of arms
- The number of batches is allowed to scale with the horizon, and for the relevant results they provide lower-bounds on the number of batches necessary to achieve the desired regret.
- Analysis is done by analysing the prediction error of a linear regression model which has only seen the observations from the previous batches (i.e. up to the end of the batch before the current one). From this we can deduce the regret of following the predictions based on these previous observations rather than the online version.
- Algorithm presented is essentially LinUCB?
- Adversarial contexts
Lower-bound on the regret of any sequential batched algorithm for the case of adversarial contexts han20_sequen_batch_learn_finit_action
where
- is the number of arms
- is dimensionality of
Reward is modelled as
where .
- is horizon
- is number of batches
- denotes the algorithm which consists of a "time-grid" (with each "interval" in the grid corresponding to a single batch, so that ) and the learning algorithm/policy which maps previous observations to the next action to take.
- is a universal constant independent of
- Stochastic contexts
- Upper-bound on regret for what's essentially LinUCB han20_sequen_batch_learn_finit_action
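As a rough illustration of the batched structure described above: the sketch below is my own generic batched LinUCB-style rule that refits ridge-regression statistics only at the boundaries of a fixed time grid; it is not the paper's exact algorithm, and the grid, the confidence width `beta`, the noise level, and the Gaussian contexts are all placeholder choices.

```python
import numpy as np

def batched_linucb(theta, grid, K, beta, noise, rng):
    """LinUCB-style play where the ridge-regression statistics are refreshed
    only at the end of each batch (each interval of the time grid)."""
    d = len(theta)
    V, b = np.eye(d), np.zeros(d)                # ridge statistics from completed batches
    V_new, b_new = np.zeros((d, d)), np.zeros(d) # statistics accumulated within the current batch
    regret, t = 0.0, 0
    for end in grid:                             # one iteration per batch
        theta_hat = np.linalg.solve(V, b)        # estimate from previous batches only
        V_inv = np.linalg.inv(V)
        while t < end:
            ctx = rng.normal(size=(K, d))        # placeholder stochastic contexts, one per arm
            ucb = ctx @ theta_hat + beta * np.sqrt(np.einsum('kd,de,ke->k', ctx, V_inv, ctx))
            a = int(np.argmax(ucb))              # optimistic arm choice
            x = ctx[a] @ theta + noise * rng.normal()
            V_new += np.outer(ctx[a], ctx[a])
            b_new += x * ctx[a]
            regret += (ctx @ theta).max() - ctx[a] @ theta
            t += 1
        V, b = V + V_new, b + b_new              # new statistics only become usable after the batch
        V_new, b_new = np.zeros((d, d)), np.zeros(d)
    return regret

rng = np.random.default_rng(0)
d, K, T, M = 5, 10, 2000, 4
theta = rng.normal(size=d) / np.sqrt(d)          # placeholder unknown parameter
grid = np.linspace(T / M, T, M, dtype=int)       # placeholder uniform time grid
print(batched_linucb(theta, grid, K, beta=1.0, noise=0.1, rng=rng))
```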