# Notes on: Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018): A Tutorial on Thompson Sampling

## Notation

- $x_t$ denotes the action taken at time $t$; in the bandit setting the actions are the arms $k = 1, \dots, K$
- $\theta_1, \dots, \theta_K$ are the probabilities of *success* for the corresponding action / arm
- $G = (V, E)$ denotes a graph with vertices $V$ and edges $E$
- $\mathbb{H}_{t-1} = (x_1, y_1, \dots, x_{t-1}, y_{t-1})$ denotes the historical data of actions and observations
- $r_t = r(y_t)$ denotes the reward obtained by taking *action* $x_t$ leading to *observed outcome* $y_t$
- *Per-period regret* is the difference between the expected reward using the *optimal* action and the reward received by the current action. In the $K$-armed bandit problem: $\text{regret}_t(\theta) = \max_k \theta_k - \theta_{x_t}$
- $q_\theta(\cdot \mid x_t)$ is the conditional distribution of the *system* (i.e. what we want to learn)

## Overview

- Useful for *online decision problems*

## $K$-armed bandit

- Suppose there are $K$ actions, and when played, any action yields either a success or a failure.
- Action $k \in \{1, \dots, K\}$ produces a success with probability $\theta_k \in [0, 1]$.
- The success probabilities $\theta_1, \dots, \theta_K$ are unknown to the agent, but are fixed over time, and can therefore be learned by experimentation.
- Objective: maximize the cumulative number of successes over $T$ periods, where $T$ is relatively large compared to the number of arms $K$ (see the sketch below).
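
A minimal sketch of Thompson sampling on this Bernoulli bandit, assuming independent $\text{Beta}(1, 1)$ priors over each $\theta_k$ (the values of `K`, `T`, and the priors are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 3, 1000                      # illustrative sizes, not from the paper
theta = rng.uniform(size=K)         # true (unknown) success probabilities

# Independent Beta(1, 1) priors over each theta_k;
# alpha/beta accumulate observed successes/failures.
alpha = np.ones(K)
beta = np.ones(K)

successes = 0
for t in range(T):
    theta_hat = rng.beta(alpha, beta)      # sample an estimate for every arm
    k = int(np.argmax(theta_hat))          # play the arm that looks best
    reward = int(rng.random() < theta[k])  # Bernoulli(theta_k) outcome
    alpha[k] += reward                     # conjugate posterior update
    beta[k] += 1 - reward
    successes += reward

print(f"successes: {successes}/{T}, best arm rate: {theta.max():.2f}")
```

Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior update is just a counter increment, which is what makes this example so compact.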

## Thompson sampling

- Also known as *posterior sampling* and *probability matching*.
- Methods such as $\epsilon$-greedy attempt to balance exploration and exploitation by choosing a random action with probability $\epsilon$ and a greedy step with probability $1 - \epsilon$.
- This is *inefficient*, since $\epsilon$-greedy samples actions uniformly regardless of how unlikely they are to be optimal (see the sketch below); Thompson sampling attempts to address this.
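
A tiny sketch of the $\epsilon$-greedy rule described above (a generic illustration, not code from the paper; `theta_hat` stands for the current point estimates of the arms):

```python
import numpy as np

rng = np.random.default_rng(1)

def eps_greedy_action(theta_hat: np.ndarray, eps: float) -> int:
    """Pick a uniformly random arm w.p. eps, else the best-looking arm."""
    if rng.random() < eps:
        # Exploration is uniform: arms that are almost surely suboptimal
        # get sampled just as often as promising ones.
        return int(rng.integers(len(theta_hat)))
    return int(np.argmax(theta_hat))

# Made-up point estimates for three arms
print(eps_greedy_action(np.array([0.2, 0.7, 0.5]), eps=0.1))
```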

The difference between a *greedy* algorithm and Thompson sampling is that a greedy algorithm uses the expectation of $\theta$ under the current posterior to obtain its estimate $\hat{\theta}$, while Thompson sampling *samples* the estimate $\hat{\theta}$ from the posterior.
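
To make the contrast concrete, here is a sketch under made-up Beta posterior parameters (not from the paper): the greedy rule collapses each arm's posterior to its mean, while Thompson sampling draws $\hat{\theta}$ from the posterior, so an uncertain arm with some chance of being optimal still gets played occasionally.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up Beta posteriors over three arms after some data:
# arm 1 has a lower posterior mean than arm 0 but far more uncertainty.
alpha = np.array([60.0, 3.0, 10.0])
beta = np.array([40.0, 3.0, 30.0])

# Greedy: collapse the posterior to its mean and act on that.
theta_greedy = alpha / (alpha + beta)
print("greedy always plays:", int(np.argmax(theta_greedy)))

# Thompson sampling: draw theta_hat from the posterior and act on the draw,
# so uncertain-but-plausibly-optimal arms still get played sometimes.
for _ in range(5):
    theta_hat = rng.beta(alpha, beta)
    print("TS plays:", int(np.argmax(theta_hat)))
```

The greedy print is always arm 0 here, while the Thompson sampling loop mostly plays arm 0 but occasionally arm 1, which is exactly the exploration the section describes.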