Bandits
Sequential decision problems
- Reward/feedback type:
- Bandit
- Only observe information/feedback for the action taken.
- Full information
- Observe information/feedback for all the actions.
- Partial information
- Observe information/feedback for a subset (potentially sparse) of possible actions.
- Bandit
Bandit book lattimore2018bandit
Notation
$\nu$ denotes a stochastic bandit.
$\Delta_a$ is the suboptimality gap, or action gap, or immediate regret, of action $a$.
$T_a(n)$ is the number of times action $a$ was chosen by the learner at the end of round $n$.
In general, this is random, since the action taken at time $t$ is a random variable depending on all previous observations and actions.
2
2.2 σ-algebras and knowledge
- Collection of sequential observations (rvs.)
- The set of maps
such that
is measurable then characterizes all the possible ways the learner can respond.
- I.e. it suffices to concentrate on the σ-algebra
- Reminder: a process $(X_t)_t$ is adapted to a filtration $(\mathcal{F}_t)_t$ if $X_t$ is $\mathcal{F}_t$-measurable for each $t$ (see the bandit example below).
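For example, in the bandit setting one natural choice (my notation, not necessarily the symbols used in these notes) is the history filtration:
\begin{align*}
\mathcal{F}_t &= \sigma\big( A_1, X_1, \dots, A_t, X_t \big) && \text{(information available after $t$ rounds)} \\
X_t &\ \text{is } \mathcal{F}_t\text{-measurable} && \text{(the reward process is adapted to } (\mathcal{F}_t)_t \text{)} \\
A_t &\ \text{is } \mathcal{F}_{t-1}\text{-measurable} && \text{(a deterministic policy uses only past information)}
\end{align*}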
3
Markov Chains
4
4.5 Decomposing regret
For any policy $\pi$ and stochastic bandit environment $\nu$ with $\mathcal{A}$ finite or countable and horizon $n \in \mathbb{N}$, the regret $R_n$ of policy $\pi$ in $\nu$ satisfies
\begin{equation*}
R_n = \sum_{a \in \mathcal{A}} \Delta_a \, \mathbb{E} \big[ T_a(n) \big]
\end{equation*}
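As a quick numerical illustration (my own sketch, not from the book), the following estimates both sides of the decomposition by Monte Carlo for a toy Bernoulli bandit under a uniformly random policy; the arm means and horizon are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

means = np.array([0.5, 0.6, 0.3])   # Bernoulli arm means (toy example, my choice)
gaps = means.max() - means          # suboptimality gaps Delta_a
n, n_runs = 1000, 2000              # horizon and number of Monte Carlo runs

total_reward = 0.0
pull_counts = np.zeros(len(means))

for _ in range(n_runs):
    actions = rng.integers(len(means), size=n)   # uniformly random policy
    rewards = rng.random(n) < means[actions]     # Bernoulli rewards
    total_reward += rewards.sum()
    pull_counts += np.bincount(actions, minlength=len(means))

lhs = n * means.max() - total_reward / n_runs    # R_n = n * mu_star - E[sum_t X_t]
rhs = np.sum(gaps * pull_counts / n_runs)        # sum_a Delta_a * E[T_a(n)]
print(f"regret via definition    : {lhs:.2f}")
print(f"regret via decomposition : {rhs:.2f}")   # the two agree up to Monte Carlo noise
```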
5.
Exercises
5.3
Eq. (5.2):

Eq. (5.4):

which means

Then the ratio between the two rates:


Then if , we have
and
is dominated by
Alternatively,

then

so,

As a result, on the lower-bound of the ratio is monotonically increasing in
. On
, the lower-bound of the ratio decreases monotonically in
. For the increase, we can for certain say that
We know for , the
factor completely dominates. For
, we can use the fact that

which means that

But for the question of when , i.e. Eq. (5.2) is best, and
, i.e. Eq. (5.4) is best, observe that if

which means Eq. (5.4) is best (BUT, so it doesn't actually provide us with any more information!).
Moreover, in the case , we know for certain that
and thus Eq. (5.4) is the best.
We can refine our lower-bound of when by using the slightly tighter lower-bound for
, where we have

and we can solve for RHS = 1, or equiv.,

which is equivalent to the quadratic

which has solutions

Therefore the solutions have a non-zero imaginary component, i.e. they are not real, so this doesn't provide us with any more insight.
But, we can consider when this quadratic is and
which gives us
6. The Explore-then-Commit Algorithm
Exercises
6.1 subGaussian empirical estimates
Let be the policy of ETC and
be the 1-subGaussian distributions associated with the
arms.
Provide a fully rigorous proof of the claim that

is . You should only use the definitions and the interaction protocol, which states that
Recall that the mean reward for arm $i$ is estimated as
\begin{equation*}
\hat{\mu}_i(t) = \frac{1}{T_i(t)} \sum_{s = 1}^{t} \mathbb{1}\{A_s = i\} X_s
\end{equation*}
where $T_i(t) = \sum_{s = 1}^{t} \mathbb{1}\{A_s = i\}$ is the number of times action $i$ has been played after round $t$.
The ETC policy for parameter $m$ (the number of exploration rounds per arm) is defined by
\begin{equation*}
A_t =
\begin{cases}
(t \bmod k) + 1 & \text{if } t \le m k \\
\arg\max_i \hat{\mu}_i(m k) & \text{otherwise}
\end{cases}
\end{equation*}
First we note that it is sufficient to show that is
for all
and that they're independent, since we can decompose the expression as:

and by Lemma 5.4 (c) (which we need to prove!) a sum of two subGaussian rvs is again subGaussian [Easy to prove though (Y)]
[I guess we need the interaction protocol to actually prove that we can consider the mean-estimates as independent?]
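For reference, a minimal sketch of the ETC policy recalled above (my own illustration; the Gaussian reward simulation and the specific arm means are assumptions for the demo, not part of the exercise):

```python
import numpy as np

def etc(means, m, n, rng=None):
    """Explore-then-commit: play each arm m times round-robin, then commit
    to the empirically best arm for the remaining rounds.
    Rewards are simulated as unit-variance Gaussians (an assumption for this demo)."""
    rng = rng or np.random.default_rng()
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    total_reward = 0.0
    commit_arm = None
    for t in range(1, n + 1):
        if t <= m * k:
            a = (t - 1) % k                                # round-robin exploration
        else:
            if commit_arm is None:
                commit_arm = int(np.argmax(sums / counts)) # empirical means after mk rounds
            a = commit_arm
        x = rng.normal(means[a], 1.0)                      # 1-subGaussian reward
        counts[a] += 1
        sums[a] += x
        total_reward += x
    regret = n * max(means) - total_reward
    return regret

print(etc(means=[0.0, 0.5], m=25, n=2000))
```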
7. The Upper Confidence Bound (UCB) Algorithm
Notation
Optimism principle
11. The Exp3 Algorithm
Notation
$\mathbb{E}_t[\cdot]$ denotes the conditional expectation given the history up to time $t - 1$ (everything observed before $A_t$ is chosen).
The IS estimator of $x_{t, i}$ is
\begin{equation*}
\hat{X}_{t, i} = \frac{\mathbb{1}\{A_t = i\}}{P_{t, i}} X_t
\end{equation*}
which satisfies $\mathbb{E}_t[\hat{X}_{t, i}] = x_{t, i}$, i.e. it is an unbiased estimator. Note that in this case
Alternative (unbiased) estimator
\begin{equation*}
\hat{X}_{t, i} = 1 - \frac{\mathbb{1}\{A_t = i\}}{P_{t, i}} (1 - X_t)
\end{equation*}
which is more easily seen by rearranging to
\begin{equation*}
1 - \hat{X}_{t, i} = \frac{\mathbb{1}\{A_t = i\}}{P_{t, i}} (1 - X_t)
\end{equation*}
i.e. the RHS is an estimate of the "loss" $1 - x_{t, i}$. Note that in this case
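The unbiasedness claim is a one-line conditional-expectation computation (written in the notation above):
\begin{align*}
\mathbb{E}_t\big[ \hat{X}_{t, i} \big]
  = \mathbb{E}_t\bigg[ \frac{\mathbb{1}\{A_t = i\}}{P_{t, i}} X_t \bigg]
  = \sum_{j} P_{t, j} \, \frac{\mathbb{1}\{j = i\}}{P_{t, i}} \, x_{t, j}
  = \frac{P_{t, i}}{P_{t, i}} \, x_{t, i}
  = x_{t, i}
\end{align*}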
Overview
- Adversarial bandits
11.3 The Exp3 algorithm
- Exp3 = Exponential-weight algorithm for Exploration and Exploitation
- The idea is to construct unbiased estimators for the rewards from each arm by the use of IS.
Exp3 is a particular way of obtaining a distribution $P_{t, i}$ for each arm $i$ at time $t$:
\begin{equation*}
P_{t, i} = \frac{\exp \big( \eta \hat{S}_{t - 1, i} \big)}{\sum_{j} \exp \big( \eta \hat{S}_{t - 1, j} \big)}
\end{equation*}
where $\hat{S}_{t, i} = \sum_{s = 1}^{t} \hat{X}_{s, i}$, where $\hat{X}_{s, i}$ is the IS estimator defined in the notation above, where $x_{t, i}$ is the reward at time-step $t$ chosen for arm $i$ by the adversary.
$\eta$ is referred to as the learning rate.
- In the simple case $\eta$ will be chosen depending on the number of arms and the horizon.
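Putting the estimator and the sampling rule together, a minimal sketch of Exp3 (my own illustration; the fixed Bernoulli reward table standing in for the adversary and the particular choice of learning rate are assumptions for the demo):

```python
import numpy as np

def exp3(rewards, eta, rng=None):
    """Run Exp3 on an (n, k) table of adversarially chosen rewards in [0, 1].
    Uses the IS estimator X_hat_{t,i} = 1{A_t = i} / P_{t,i} * X_t."""
    rng = rng or np.random.default_rng()
    n, k = rewards.shape
    S_hat = np.zeros(k)                      # cumulative reward estimates S_hat_{t,i}
    total_reward = 0.0
    for t in range(n):
        logits = eta * S_hat
        P = np.exp(logits - logits.max())    # softmax over estimated rewards, stabilised
        P /= P.sum()
        a = rng.choice(k, p=P)               # sample arm A_t ~ P_t
        x = rewards[t, a]
        total_reward += x
        S_hat[a] += x / P[a]                 # IS update; unchosen arms implicitly get 0
    return total_reward

# Toy "adversary": a fixed Bernoulli reward table where arm 1 is better on average.
rng = np.random.default_rng(0)
n, k = 10_000, 2
x = (rng.random((n, k)) < np.array([0.4, 0.6])).astype(float)
eta = np.sqrt(np.log(k) / (n * k))           # a common choice of learning rate (assumed here)
print("Exp3 reward:", exp3(x, eta, rng), " best single arm:", x.sum(axis=0).max())
```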
Let $\pi$ be the policy of Exp3 with learning rate $\eta$. Then

For any arm $i$ define
\begin{equation*}
R_{n, i} = \sum_{t = 1}^{n} x_{t, i} - \mathbb{E} \bigg[ \sum_{t = 1}^{n} X_t \bigg]
\end{equation*}
which is the expected regret relative to using action $i$ in all the rounds.
We will bound $R_{n, i}$ for all $i$, including the optimal arm.
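For a fixed (oblivious) reward sequence, bounding every $R_{n, i}$ suffices because the regret is the worst case over comparator arms:
\begin{equation*}
R_n = \max_{i} \bigg( \sum_{t = 1}^{n} x_{t, i} - \mathbb{E} \bigg[ \sum_{t = 1}^{n} X_t \bigg] \bigg) = \max_{i} R_{n, i}
\end{equation*}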
18. Contextual Bandits
For any , it holds that

Recall that

i.e.

19. Linear Stochastic Bandits
19.3 Regret Analysis
Let $V_0$ be positive definite and $a_1, \dots, a_n \in \mathbb{R}^d$ be a sequence of vectors with $\| a_t \|_2 \le L$ for all $t \in [n]$. Then

In the final inequality we have the following argument.
The operator norm of the matrix is going to be bounded by
, since

where in the third inequality we've made use of the Cauchy-Schwarz inequality and the assumption that .
The factor of 2 shows up because we do

and each of these terms is bounded by due to the vectors being in .
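To make the statement concrete, a small numerical check (my own sketch; I'm assuming the standard "elliptical potential" form $\sum_{t} \min\big(1, \|a_t\|^2_{V_{t-1}^{-1}}\big) \le 2 \log \frac{\det V_n}{\det V_0}$ with $V_t = V_0 + \sum_{s \le t} a_s a_s^\top$, since the displayed inequality was not preserved above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 5, 200, 2.0

V = np.eye(d)                                    # V_0 = I (positive definite)
logdet_V0 = np.linalg.slogdet(V)[1]

lhs = 0.0
for _ in range(n):
    a = rng.normal(size=d)
    a *= min(1.0, L / np.linalg.norm(a))         # enforce ||a_t||_2 <= L
    lhs += min(1.0, a @ np.linalg.solve(V, a))   # min(1, ||a_t||^2 in the V_{t-1}^{-1} norm)
    V += np.outer(a, a)                          # V_t = V_{t-1} + a_t a_t^T

rhs = 2 * (np.linalg.slogdet(V)[1] - logdet_V0)  # 2 log(det V_n / det V_0)
print(f"sum of potentials = {lhs:.3f} <= {rhs:.3f} = 2 log(det V_n / det V_0)")
```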
20.
20.1
Errata
- p. 79: in point 3), there's a sum without any terms…
Linear bandits
Contextual bandits
Batched bandits
- han20_sequen_batch_learn_finit_action
- Analysis of both adversarial and stochastic contexts (both with stochastic rewards)
- Finite number of arms
- The number of batches is allowed to scale wrt. the horizon, and for the relevant results, they provide lower-bounds on the number of batches necessary to achieve the desired regret.
- Analysis is done by analysing the prediction error of a linear regression model which has only seen the observations from the previous batches. From this we can deduce the regret of following the predictions based on these previous observations rather than the online version (see the sketch after this list).
- Algorithm presented is essentially LinUCB?
- Adversarial contexts
Lower-bound on the regret of any sequential batched algorithm for the case of
han20_sequen_batch_learn_finit_action
where
is the number of arms
is dimensionality of
Reward is modelled as
where
.
is horizon
is number of batches
denotes the algorithm which consists of a "time-grid"
(with each "interval" in the grid corresponding to a single batch, so that
) and
the learning algorithm/policy which maps previous observations to next action to take.
is a universal constant independent of
- Stochastic contexts
- Upper-bound on regret for what's essentially LinUCB han20_sequen_batch_learn_finit_action
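As referenced above, a minimal sketch of the kind of batched algorithm described here (my own reading of "essentially LinUCB with updates only at batch boundaries"; all names, the confidence radius `beta`, and the regularisation `lam` are assumptions, not the paper's notation):

```python
import numpy as np

def batched_linucb(contexts, theta, batch_ends, beta=1.0, lam=1.0, noise=0.1, rng=None):
    """LinUCB-style policy that only refits its least-squares estimate at the
    end of each batch; within a batch it acts on stale statistics.
    contexts: array of shape (T, K, d) with the K action features per round."""
    rng = rng or np.random.default_rng()
    T, K, d = contexts.shape
    V = lam * np.eye(d)                 # regularised Gram matrix (stale within a batch)
    b = np.zeros(d)
    V_live, b_live = V.copy(), b.copy() # statistics accumulated during the current batch
    theta_hat = np.zeros(d)
    total_reward = 0.0
    for t in range(T):
        X = contexts[t]                                   # (K, d) candidate features
        widths = np.sqrt(np.einsum("kd,dk->k", X, np.linalg.solve(V, X.T)))
        a = int(np.argmax(X @ theta_hat + beta * widths)) # optimistic action choice
        r = X[a] @ theta + noise * rng.normal()           # linear reward model (assumed)
        total_reward += r
        V_live += np.outer(X[a], X[a])
        b_live += r * X[a]
        if t + 1 in batch_ends:                           # batch boundary: refit the regression
            V, b = V_live.copy(), b_live.copy()
            theta_hat = np.linalg.solve(V, b)
    return total_reward

rng = np.random.default_rng(0)
T, K, d = 2000, 5, 3
ctx = rng.normal(size=(T, K, d))
theta_star = rng.normal(size=d)
grid = {250, 500, 1000, 2000}                             # a coarse "time grid" of batch ends
print(batched_linucb(ctx, theta_star, grid, rng=rng))
```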