Bandits

Table of Contents

Sequential decision problems

  • Reward/feedback type:
    • Bandit
      • Only observe information/feedback for the action taken.
    • Full information
      • Observe information/feedback for all the actions.
    • Partial information
      • Observe information/feedback for a subset (potentially sparse) of possible actions.

Bandit book lattimore2018bandit

Notation

  • $\nu = \big( P_a: a \in \mathcal{A} \big)$ denotes a stochastic bandit
  • $\Delta_a(\nu) = \mu^*(\nu) - \mu_a(\nu)$ is the suboptimality gap or action gap or immediate regret of action $a$
  • Number of times action $a$ was chosen by the learner at the end of round $t$

    \begin{equation*}
T_a(t) = \sum_{s = 1}^{t} \1{A_s = a}
\end{equation*}

    In general, this is random, since the action taken at time $s \le t$ is a random variable depending on all previous observations and actions.

2. Foundations of Probability

2.2 σ-algebras and knowledge

  • Collection of sequential observations (rvs.) $X_{1:t} = X_1, \dots, X_t$
  • The set of maps $f: \mathbb{R}^t \to \mathbb{R}$ such that $f \circ X_{1:t}$ is measurable then characterizes all the possible ways the learner can respond to what it has observed so far.
    • I.e. it suffices to concentrate on the σ-algebra $\mathcal{F}_t = \sigma \big( X_{1:t} \big) \subseteq \mathcal{F}$
  • Reminder: $\big( X_t \big)_{t = 1}^n$ is adapted to filtration $\mathbb{F} = \big( \mathcal{F}_t \big)_{t = 1}^n$ if $X_t$ is $\mathcal{F}_t \text{-measurable}$ for each $t \in [n]$.

3. Stochastic Processes and Markov Chains

Markov Chains


4. Stochastic Bandits

4.5 Decomposing regret

For any policy $\pi$ and stochastic bandit environment $\nu$ with $\mathcal{A}$ finite or countable and horizon $n \in \mathbb{N}$, the regret $R_n$ of policy $\pi$ in $\nu$ satisfies

\begin{equation*}
R_n = \sum_{a \in \mathcal{A}} \Delta_a \mathbb{E} \big[ T_a(n) \big]
\end{equation*}
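
As a quick sanity check of this decomposition, here is a small simulation sketch (the Bernoulli arms and the uniformly random policy below are arbitrary choices for illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
mus = np.array([0.5, 0.4, 0.3])   # hypothetical Bernoulli arm means
n, n_runs = 1_000, 2_000          # horizon and number of Monte Carlo runs
k = len(mus)
gaps = mus.max() - mus            # suboptimality gaps Delta_a

total_reward = np.zeros(n_runs)
pull_counts = np.zeros((n_runs, k))
for r in range(n_runs):
    actions = rng.integers(0, k, size=n)        # a uniformly random policy
    rewards = rng.random(n) < mus[actions]      # Bernoulli rewards
    total_reward[r] = rewards.sum()
    pull_counts[r] = np.bincount(actions, minlength=k)  # T_a(n)

regret = n * mus.max() - total_reward.mean()            # R_n
decomposed = (gaps * pull_counts.mean(axis=0)).sum()    # sum_a Delta_a E[T_a(n)]
print(regret, decomposed)  # should agree up to Monte Carlo error
```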

5. Concentration of Measure

Exercises

5.3

Eq. (5.2):

\begin{equation*}
\mathbb{P} \big( | \hat{\mu} - \mu| \ge \varepsilon \big) \le \frac{\sigma^2}{n \varepsilon^2}
\end{equation*}

Eq. (5.4):

\begin{equation*}
\mathbb{P} \big( \hat{\mu} \ge \mu + \varepsilon \big) \le \sqrt{\frac{\sigma^2}{2 \pi n \varepsilon^2}} \exp \bigg( - \frac{n \varepsilon^2}{2 \sigma^2} \bigg)
\end{equation*}

which means

\begin{equation*}
\mathbb{P} \big( \left| \hat{\mu} - \mu \right| \ge \varepsilon \big) \le \sqrt{\frac{2 \sigma^2}{\pi n \varepsilon^2}} \exp \bigg( - \frac{n \varepsilon^2}{2 \sigma^2} \bigg)
\end{equation*}

Then the ratio between the two rates:

\begin{equation*}
r := \frac{\frac{\sigma^2}{n \varepsilon^2}}{\sqrt{\frac{2 \sigma^2}{\pi n \varepsilon^2}} \exp \big( - \frac{n \varepsilon^2}{2 \sigma^2} \big)} = \frac{\sigma^2}{n \varepsilon^2} \cdot \frac{\sqrt{\pi n} \varepsilon}{\sqrt{2} \sigma} \exp \bigg( \frac{n \varepsilon^2}{2 \sigma^2} \bigg) = \frac{1}{\varepsilon} \cdot \frac{1}{\sqrt{n}} \cdot \sqrt{\frac{\pi \sigma^2}{2}} \exp \bigg( \frac{n \varepsilon^2}{2 \sigma^2} \bigg)
\end{equation*}

Then if $\varepsilon > 1$, we have $\varepsilon \le \varepsilon^2$, so the $\frac{1}{\varepsilon}$ factor is dominated by $\exp \bigg( \frac{n \varepsilon^2}{2 \sigma^2} \bigg)$.

Alternatively,

\begin{equation*}
\alpha(\varepsilon, n) := \sqrt{\frac{n \varepsilon^2}{2 \sigma^2}}
\end{equation*}

then

\begin{equation*}
\big(1 + \alpha^2(\varepsilon, n) \big) \le \exp \big( \alpha^2(\varepsilon, n) \big)
\end{equation*}

so,

\begin{equation*}
r = \frac{\sqrt{\pi}}{2} \frac{1}{\alpha(\varepsilon, n)} \exp \big( \alpha^2(\varepsilon, n) \big) \ge \frac{\sqrt{\pi}}{2} \frac{1}{\alpha(\varepsilon, n)} \big( 1 + \alpha^2(\varepsilon, n) \big) \ge \frac{\sqrt{\pi}}{2} \bigg( \frac{1}{\alpha(\varepsilon, n)} + \alpha(\varepsilon, n) \bigg)
\end{equation*}

As a result, the lower bound $\frac{\sqrt{\pi}}{2} \big( \frac{1}{\alpha} + \alpha \big)$ on the ratio is monotonically increasing in $\alpha$ on $[1, \infty)$ and monotonically decreasing on $(0, 1)$.

For $\alpha \in [1, \infty)$, the $\exp$ factor completely dominates. For $\alpha \in (0, 1)$, we can use the fact that

\begin{equation*}
\alpha^2 \le \big( 1 + \alpha^2 \big) \le \exp( \alpha^2 )
\end{equation*}

which means that

\begin{equation*}
r \ge \frac{\sqrt{\pi}}{2} \frac{1}{\alpha} \cdot \alpha^2 = \frac{\sqrt{\pi}}{2} \alpha
\end{equation*}

But for the question of when $r < 1$, i.e. Eq. (5.2) is best, and $r > 1$, i.e. Eq. (5.4) is best, observe that

\begin{equation*}
\alpha \ge \frac{2}{\sqrt{\pi}} \quad \implies \quad r \ge 1
\end{equation*}

which means Eq. (5.4) is best (BUT, $\frac{2}{\sqrt{\pi}} \approx 1.13 \not\in (0, 1)$, so within the range where this lower bound was derived it doesn't actually provide us with any more information!).

Moreover, in the case $\alpha > 1$, we know for certain that $r \ge 1$ and thus Eq. (5.4) is the best.

We can refine our lower-bound of when $r \ge 1$ by using the slightly tighter lower-bound for $\exp$, where we have

\begin{equation*}
r \ge \frac{\sqrt{\pi}}{2} \bigg( \frac{1}{\alpha} + \alpha \bigg)
\end{equation*}

and we can solve for RHS = 1, or equiv.,

\begin{equation*}
\frac{1}{\alpha} + \alpha = \frac{2}{\sqrt{\pi}}
\end{equation*}

which is equivalent to the quadratic

\begin{equation*}
\alpha^2 - \frac{2}{\sqrt{\pi}} \alpha + 1 = 0
\end{equation*}

which has solutions

\begin{equation*}
\alpha = \frac{1}{\sqrt{\pi}} \pm \sqrt{\underbrace{\frac{1}{\pi} - 1}_{< 0}}
\end{equation*}

Therefore the roots are complex (the discriminant is negative), i.e. the quadratic has no real solutions. Since its leading coefficient is positive, $\alpha^2 - \frac{2}{\sqrt{\pi}} \alpha + 1 > 0$ for every real $\alpha$, which is equivalent to $\frac{1}{\alpha} + \alpha > \frac{2}{\sqrt{\pi}}$ for all $\alpha > 0$. Hence $r \ge \frac{\sqrt{\pi}}{2} \big( \frac{1}{\alpha} + \alpha \big) > 1$ for every $\alpha > 0$, so Eq. (5.4) is always the tighter bound.
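
A quick numerical sanity check of this conclusion (the grid of $(\sigma, \varepsilon, n)$ values below is arbitrary):

```python
import numpy as np

def chebyshev_bound(sigma, eps, n):
    # Eq. (5.2)
    return sigma**2 / (n * eps**2)

def gaussian_bound(sigma, eps, n):
    # two-sided version of Eq. (5.4)
    return np.sqrt(2 * sigma**2 / (np.pi * n * eps**2)) * np.exp(-n * eps**2 / (2 * sigma**2))

# arbitrary grid of (sigma, eps, n); the ratio r = (5.2)/(5.4) should exceed 1 everywhere
for sigma in [0.5, 1.0, 2.0]:
    for eps in [0.05, 0.1, 0.5]:
        for n in [10, 100, 1000]:
            r = chebyshev_bound(sigma, eps, n) / gaussian_bound(sigma, eps, n)
            assert r > 1.0, (sigma, eps, n, r)
print("Eq. (5.4) is tighter on every grid point")
```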

6. The Explore-then-Commit Algorithm

Exercises

6.1 subGaussian empirical estimates

Let $\pi$ be the policy of ETC and $P_1, \dots, P_k$ be the 1-subGaussian distributions associated with the $k$ arms.

Provide a fully rigorous proof of the claim that

\begin{equation*}
\hat{\mu}_i(mk) - \mu_i - \hat{\mu}_1(mk) + \mu_1
\end{equation*}

is $\sqrt{2 / m} \text{-subGaussian}$. You should only use the definitions and the interaction protocol, which states that

  1. $\mathbb{P}\big(A_t \in \cdot \mid A_1, X_1, \dots, A_{t - 1}, X_{t - 1}\big) \overset{\text{a.s.}}{=} \pi \big( \cdot \mid A_1, X_1, \dots, A_{t - 1}, X_{t - 1} \big)$
  2. $\mathbb{P} \big( X_t \in \cdot \mid A_1, X_1, \dots, A_{t - 1}, X_{t - 1}, A_t \big) \overset{\text{a.s.}}{=} P_{A_t}(\cdot)$

Recall that

  • The mean reward for arm $i$ is estimated as

    \begin{equation*}
\hat{\mu}_i(t) := \frac{1}{T_i(t)} \sum_{s=1}^{t} \1{ A_s = i} X_s
\end{equation*}

    where

    \begin{equation*}
T_i(t) = \sum_{s=1}^{t} \1{A_s = i}
\end{equation*}

    is the number of times action $i$ has been played by the end of round $t$.

  • The ETC policy for parameter $m \in \mathbb{N}$, which we denote $\pi_m$, is defined by

    \begin{equation*}
A_t := 
\begin{cases}
  (t \ \mathrm{mod} \ k) + 1, & \text{if } t \le mk \\
  \underset{i}{\argmax}\ \hat{\mu}_i(mk), & \text{if }t > mk
\end{cases}
\end{equation*}

First we note that it is sufficient to show that $(\hat{\mu}_i(mk) - \mu_i)$ is $\frac{1}{\sqrt{m}} \text{-subGaussian}$ for all $i$ and that they're independent, since we can decompose the expression as:

\begin{equation*}
\big( \hat{\mu}_i(mk) - \mu_i \big) - \big( \hat{\mu}_1(mk) - \mu_1 \big)
\end{equation*}

and by Lemma 5.4 (c) (which we need to prove!) the sum of two independent subGaussian rvs with parameters $\sigma_1$ and $\sigma_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2} \text{-subGaussian}$, which here gives $\sqrt{1 / m + 1 / m} = \sqrt{2 / m}$ [Easy to prove though (Y)]

[I guess we need the interaction protocol to actually prove that we can consider the mean-estimates as independent?]
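
For reference, a minimal simulation sketch of the ETC policy recalled above (the Gaussian arms and their means are hypothetical choices; Gaussian noise with unit variance is 1-subGaussian):

```python
import numpy as np

def etc(m, mus, n, rng):
    """Explore-then-Commit: play each of the k arms m times round-robin,
    then commit to the arm with the largest empirical mean mu_hat_i(mk)."""
    k = len(mus)
    sums, counts = np.zeros(k), np.zeros(k)
    rewards = np.empty(n)
    committed = None
    for t in range(1, n + 1):
        if t <= m * k:
            a = t % k                          # round-robin (0-based version of (t mod k) + 1)
        else:
            if committed is None:              # freeze the choice made at time mk
                committed = int(np.argmax(sums / counts))
            a = committed
        x = rng.normal(mus[a], 1.0)            # Gaussian rewards: 1-subGaussian
        sums[a] += x
        counts[a] += 1
        rewards[t - 1] = x
    return rewards

rng = np.random.default_rng(0)
mus = np.array([0.0, -0.1])                    # hypothetical means, gap Delta = 0.1
n, m = 10_000, 100
print(n * mus.max() - etc(m, mus, n, rng).sum())   # realized regret
```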

7. The Upper Confidence Bound (UCB) Algorithm

Notation

Optimism principle

11. The Exp3 Algorithm

Notation

  • Conditional expectation given history up to time $t$

    \begin{equation*}
\mathbb{E}_t \big[ \wc \big] = \mathbb{E} \big[ \wc \mid A_1, X_1, \dots, A_{t - 1}, X_{t - 1} \big]
\end{equation*}
  • IS estimator of $x_{t, i}$ is

    \begin{equation*}
\hat{X}_{t, i} = \frac{\1 \left\{ A_t = i \right\}}{P_{t, i}} X_t
\end{equation*}

    which satisfies

    \begin{equation*}
\mathbb{E}_t \big[ \hat{X}_{t, i} \big] = x_{t, i}
\end{equation*}

    i.e. it is an unbiased estimator (a small numerical check is sketched after this list). Note that in this case

    \begin{equation*}
\hat{X}_{t, i} \in [0, \infty)
\end{equation*}
  • Alternative (unbiased) estimator

    \begin{equation*}
\hat{X}_{t, i} = 1 - \frac{\1 \left\{ A_t = i \right\}}{P_{t, i}} \big( 1 - X_t \big)
\end{equation*}

    which is more easily seen by rearranging

    \begin{equation*}
1 - \hat{X}_{t, i} = \frac{\1 \left\{ A_t = i  \right\}}{P_{t, i}} \big( 1 - X_t \big)
\end{equation*}

    i.e. RHS is an estimate of the "loss" $(1 - x_{t, i})$. Note that in this case

    \begin{equation*}
\hat{X}_{t, i} \in (-\infty, 1]
\end{equation*}
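
A small Monte Carlo check of the unbiasedness of both estimators (the sampling distribution $P_t$ and the reward vector $x_t$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
P_t = np.array([0.2, 0.3, 0.5])   # arbitrary sampling distribution over k = 3 arms
x_t = np.array([0.9, 0.4, 0.1])   # arbitrary adversarial rewards in [0, 1]
k, n_samples = len(P_t), 200_000

A = rng.choice(k, size=n_samples, p=P_t)   # A_t ~ P_t
X = x_t[A]                                 # observed reward X_t = x_{t, A_t}

for i in range(k):
    ind = (A == i).astype(float)               # indicator 1{A_t = i}
    is_est = ind / P_t[i] * X                  # reward-based IS estimator
    loss_est = 1.0 - ind / P_t[i] * (1.0 - X)  # loss-based estimator
    print(x_t[i], is_est.mean(), loss_est.mean())  # all three should be close
```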

Overview

  • Adversarial bandits

11.3 The Exp3 algorithm

  • Exp3 = Exponential-weight algorithm for Exploration and Exploitation
  • The idea is to construct unbiased estimators for the rewards from each arm by the use of IS.
  • Exp3 is a particular way of obtaining a distribution $P_{t, i}$ over the arms $i$ at each time $t$ (a minimal code sketch is given after this list):

    \begin{equation*}
P_{t, i} := \frac{\exp \big( \eta \hat{S}_{t - 1, i} \big)}{\sum_{j = 1}^{k} \exp \big( \eta \hat{S}_{t - 1, j} \big)}
\end{equation*}

    where

    \begin{equation*}
\hat{S}_{t, i} = \sum_{s = 1}^{t} \hat{X}_{s, i}
\end{equation*}

    where

    \begin{equation*}
\hat{X}_{t, i} := \frac{\1 \left\{ A_t = i \right\}}{P_{t, i}} x_{t, i}
\end{equation*}

    where $x_{t, i}$ is the reward at time-step $t$ chosen for arm $i$ by the adversary.

  • $\eta$ is referred to as the learning rate
    • In the simple case it will be chosen depending on the number of arms $k$ and the horizon $n$
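
A minimal sketch of Exp3 as described above, assuming an oblivious adversary given by a fixed reward table (the table, horizon and seed below are arbitrary choices):

```python
import numpy as np

def exp3(x, eta, rng):
    """Run Exp3 on a reward table x of shape (n, k) with rewards in [0, 1]."""
    n, k = x.shape
    S_hat = np.zeros(k)                           # cumulative IS estimates S_hat_{t-1, i}
    total = 0.0
    for t in range(n):
        w = np.exp(eta * (S_hat - S_hat.max()))   # subtract max for numerical stability
        P = w / w.sum()                           # P_{t, i}
        a = rng.choice(k, p=P)                    # A_t ~ P_t
        total += x[t, a]
        S_hat[a] += x[t, a] / P[a]                # only the played arm's estimate changes
    return total

rng = np.random.default_rng(0)
n, k = 10_000, 5
x = rng.random((n, k))                 # hypothetical (oblivious) adversarial rewards
eta = np.sqrt(np.log(k) / (n * k))     # the learning rate used in the bound below
regret = x.sum(axis=0).max() - exp3(x, eta, rng)
print(regret, 2 * np.sqrt(n * k * np.log(k)))   # realized regret vs. the bound
```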

Let

  • $x \in [0, 1]^{n \times k}$
  • $\pi$ be the policy of Exp3 with learning rate

    \begin{equation*}
\eta = \sqrt{\frac{\log (k)}{nk}}
\end{equation*}

Then

\begin{equation*}
R_n(\pi, x) \le 2 \sqrt{nk \log (k)}
\end{equation*}

For any arm $i$ define

\begin{equation*}
R_{n, i} = \sum_{t = 1}^{n} x_{t, i} - \mathbb{E} \bigg[ \sum_{t = 1}^{n} X_t \bigg]
\end{equation*}

which is the expected regret relative to using action $i$ in all the rounds.

We will bound $R_{n, i}$ for all $i$, including the optimal arm.

18. Contextual Bandits

For any $m^{\ast} \in [M]$, it holds that

\begin{equation*}
\sum_{t = 1}^{n} \tilde{X}_{t m^{\ast}} - \sum_{t = 1}^{n} \sum_{m = 1}^{M} Q_{tm} \tilde{X}_{tm} \le \frac{\log (M)}{\eta} + \frac{\eta}{2} \sum_{t = 1}^{n} \sum_{m = 1}^{M} Q_{tm} \big( 1 - \tilde{X}_{tm} \big)^2
\end{equation*}

Recall that

\begin{equation*}
\tilde{X}_t = E^{(t)} \hat{X}_t
\end{equation*}

i.e.

\begin{equation*}
\tilde{X}_{t, m} = \sum_{i = 1}^{k} E^{(t)}_{m, i} \hat{X}_{t, i}
\end{equation*}
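
In code this is just a matrix-vector product; a tiny sketch with hypothetical sizes for the number of experts $M$ and arms $k$:

```python
import numpy as np

rng = np.random.default_rng(0)
M, k = 4, 3                           # hypothetical numbers of experts and arms
E_t = rng.dirichlet(np.ones(k), M)    # E^{(t)}: each row is an expert's distribution over arms
X_hat = rng.random(k)                 # IS reward estimates hat{X}_{t, i}, one per arm
X_tilde = E_t @ X_hat                 # tilde{X}_{t, m} = sum_i E^{(t)}_{m, i} hat{X}_{t, i}
print(X_tilde.shape)                  # (M,)
```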

19. Linear Stochastic Bandits

19.3 Regret Analysis

Let $V_0$ be positive definite and $x_1, \dots, x_n \in \mathbb{R}^d$ be a sequence of vectors with $\norm{x_t} \le L < \infty$ for all $t \in [n]$.

Then

\begin{equation*}
\sum_{t = 1}^{n} \bigg( 1 \wedge \norm{x_t}_{V_{t - 1}^{-1}}^2 \bigg) \le 2 \log \bigg( \frac{\det V_n}{\det V_0} \bigg) \le 2 d \log \bigg( \frac{\tr(V_0) + n L^2}{d \det^{1 / d}(V_0)} \bigg)
\end{equation*}

In the final inequality we have the following argument: the operator norm of $V_t - V_0 = \sum_{s = 1}^{t} x_s x_s^T$ is bounded by $n L^2$, since

\begin{equation*}
\begin{split}
  \sup_{y: \norm{y} \le 1} y^T (V_t - V_0) y &= \sup_{y: \norm{y} \le 1} \sum_{s = 1}^{t} y^T x_s x_s^T y \\
  &= \sup_{y: \norm{y} \le 1} \sum_{s = 1}^{t} (x_s^T y)^T (x_s^T y) \\
  &= \sup_{y: \norm{y} \le 1} \sum_{s = 1}^{t} \left| \left\langle x_s, y \right\rangle \right|^2 \\
  &\le \sup_{y: \norm{y} \le 1} \sum_{s = 1}^{t} \norm{x_s}_2^2 \norm{y}_2^2 \\
  &\le \sum_{s = 1}^{t} \norm{x_s}_2^2 \\
  &\le \sum_{s = 1}^{t} L^2 \\
  &= t L^2 \le n L^2
\end{split}
\end{equation*}

where the first inequality uses the Cauchy-Schwarz inequality, the second uses $\norm{y}_2 \le 1$, and the third uses the assumption $\norm{x_s}_2 \le L$. The same computation bounds the trace, $\tr(V_n) \le \tr(V_0) + n L^2$, and combining this with AM-GM on the eigenvalues of $V_n$, i.e. $\det(V_n) \le \big( \tr(V_n) / d \big)^d$, gives the final inequality.
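
A numerical sanity check of the first inequality (the vectors $x_t$ and the choice $V_0 = I$ are arbitrary; this only illustrates the statement, it is not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 5, 200, 2.0
V = np.eye(d)                                   # V_0 = I (positive definite)
log_det_V0 = np.linalg.slogdet(V)[1]

lhs = 0.0
for _ in range(n):
    x = rng.normal(size=d)
    x *= L / max(L, np.linalg.norm(x))          # enforce ||x_t|| <= L
    lhs += min(1.0, x @ np.linalg.solve(V, x))  # 1 ∧ ||x_t||^2 in the V_{t-1}^{-1} norm
    V += np.outer(x, x)                         # V_t = V_{t-1} + x_t x_t^T

rhs = 2 * (np.linalg.slogdet(V)[1] - log_det_V0)
print(lhs, rhs)                                 # lhs <= rhs should hold
```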

The factor of 2 shows up because we do

\begin{equation*}
\norm{\tilde{\theta}_t - \theta_{\ast}} \le \norm{\tilde{\theta}_t - \theta_{t - 1}} + \norm{\theta_{t - 1} - \theta_{\ast}} 
\end{equation*}

and each of these terms is bounded by $\sqrt{\beta_t}$ due to the vectors being in $\mathcal{C}_t$.

20. Confidence Bounds for Least Squares Estimators

20.1

Errata

  • p. 79: in point 3), there's a sum without any terms…

Linear bandits

Contextual bandits

Batched bandits

  • han20_sequen_batch_learn_finit_action
    • Analysis of both adversarial and stochastic contexts (both with stochastic rewards)
    • Finite number of arms
    • Number of batches $M$ is allowed to scale wrt. $T$, and for the relevant results, they provide lower-bounds on the number of batches necessary to achieve the desired regret.
    • Analysis is done by analysing the prediction error of a linear regression model which has seen $\lfloor (m - 1) \times \frac{T}{M} \rfloor$ observations, where $m$ is the index of the batch. From this we can deduce the regret of following the predictions based on these previous observations rather than the online version.
    • Algorithm presented is essentially LinUCB?
    • Adversarial contexts
      • Lower-bound on the regret of any sequential batched algorithm for the case of $K = 2$ han20_sequen_batch_learn_finit_action

        \begin{equation*}
\sup_{\theta^{\ast}: \norm{\theta^{\ast}}_2 \le 1} \mathbb{E} \big[ R_T(\mathrm{Alg}) \big] \ge c \cdot \Bigg( \sqrt{d T} + \bigg( \frac{T \sqrt{d}}{M} \land \frac{T}{\sqrt{M}} \bigg) \Bigg)
\end{equation*}

        where

        • $K$ is the number of arms
        • $d$ is dimensionality of $\theta$
        • Reward is modelled as

          \begin{equation*}
r_t = \left\langle \theta^{\ast}, a_t \right\rangle + \xi_t
\end{equation*}

          where $\xi_t \in \mathrm{SG}(1)$.

        • $T$ is horizon
        • $M$ is number of batches
        • $\mathrm{Alg} = (\mathcal{T}, \pi)$ denotes the algorithm which consists of a "time-grid" $\mathcal{T}$ (with each "interval" in the grid corresponding to a single batch, so that $\abs{\mathcal{T}} = M$) and $\pi$ the learning algorithm/policy which maps previous observations to next action to take.
        • $c > 0$ is a universal constant independent of $(T, M, d)$
    • Stochastic contexts

Kernel bandits

UCB

LinRel

SupLinRel