Notes on: Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018): A Tutorial on Thompson Sampling
Notation
- $x_1, x_2, \ldots \in \mathcal{X}$ are the actions taken by the agent
- $\theta_1, \ldots, \theta_K$ are the probabilities of success for the corresponding action / arm
- $G = (V, E)$ denotes a graph with vertices $V$ and edges $E$
- $\mathbb{H}_{t-1} = (x_1, y_1, \ldots, x_{t-1}, y_{t-1})$ denotes the historical data up to time $t$
- $r_t = r(y_t)$ denotes the reward obtained by taking action $x_t$ leading to observed outcome $y_t$
- Per-period regret is the difference between the expected reward under the optimal action and the reward received under the current action. In the $K$-armed bandit problem: $\text{regret}_t(\theta) = \max_k \theta_k - \theta_{x_t}$
- $q_\theta(\cdot \mid x_t)$ is the conditional distribution of the system (i.e. what we want to learn)
Overview
- Thompson sampling is useful for online decision problems, where actions are taken sequentially and one must balance exploiting current knowledge with exploring to gather information that may improve future performance
K-armed bandit
- Suppose there are $K$ actions, and when played, any action yields either a success or a failure.
- Action $k \in \{1, \ldots, K\}$ produces a success with probability $\theta_k \in [0, 1]$.
- The success probabilities $\theta_1, \ldots, \theta_K$ are unknown to the agent, but are fixed over time, and can therefore be learned by experimentation
- Objective: maximize the number of successes over $T$ periods, where $T$ is relatively large compared to the number of arms $K$ (a minimal simulation of this setup is sketched below)
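
A minimal sketch of this environment, assuming made-up values for the number of arms `K` and the success probabilities `theta` (these are illustrative choices, not values from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                                  # number of arms (assumed for illustration)
theta = np.array([0.45, 0.55, 0.60])   # true success probabilities, unknown to the agent

def pull(arm: int) -> int:
    """Play an arm; return 1 on success, 0 on failure."""
    return int(rng.random() < theta[arm])

# The agent only ever observes the 0/1 outcomes of the arms it plays, and must
# learn which arm has the highest theta by experimentation.
```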
Thompson sampling
- Also known as posterior sampling and probability matching
- Methods such as $\epsilon$-greedy attempt to balance exploration and exploitation by choosing a random action with probability $\epsilon$ and the greedy action with probability $1 - \epsilon$ (sketched below)
- Inefficient since it will sample actions regardless of how unlikely they are to be optimal; Thompson sampling attempts to address this
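
A sketch of the $\epsilon$-greedy rule for the Bernoulli bandit, using empirical success rates as the greedy estimates; the value of `epsilon` and the arm count `K` are placeholder assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

K = 3                      # number of arms (assumed for illustration)
epsilon = 0.1              # exploration probability (assumed hyperparameter)
successes = np.zeros(K)    # per-arm success counts observed so far
failures = np.zeros(K)     # per-arm failure counts observed so far

def select_action_eps_greedy() -> int:
    if rng.random() < epsilon:
        # Explore: pick an arm uniformly at random, regardless of how unlikely
        # it is to be optimal -- the inefficiency noted above.
        return int(rng.integers(K))
    # Exploit: act greedily w.r.t. the empirical success rates.
    estimates = successes / np.maximum(successes + failures, 1)
    return int(np.argmax(estimates))
```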
The difference between a greedy algorithm and Thompson sampling is that a greedy algorithm would use the expectation of $\theta$ under the current posterior, i.e. act greedily with respect to $\hat{\theta} = \mathbb{E}[\theta \mid \mathbb{H}_{t-1}]$, while Thompson sampling instead draws its estimate $\hat{\theta}$ as a sample from the posterior over $\theta$.
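
A minimal sketch of this contrast for the Bernoulli bandit, assuming independent Beta(1, 1) priors over each $\theta_k$ (the prior choice and arm count are assumptions for illustration, though the Beta-Bernoulli model is the one used in the tutorial's examples):

```python
import numpy as np

rng = np.random.default_rng(2)

K = 3                # number of arms (assumed for illustration)
alpha = np.ones(K)   # Beta posterior parameters: prior 1 + observed successes
beta = np.ones(K)    # Beta posterior parameters: prior 1 + observed failures

def select_action_greedy() -> int:
    # Greedy: plug in the posterior *mean* of theta and maximize.
    theta_hat = alpha / (alpha + beta)
    return int(np.argmax(theta_hat))

def select_action_thompson() -> int:
    # Thompson sampling: *sample* theta_hat from the posterior, then maximize.
    theta_hat = rng.beta(alpha, beta)
    return int(np.argmax(theta_hat))

def update(arm: int, reward: int) -> None:
    # Conjugate Beta-Bernoulli update for the arm that was played.
    alpha[arm] += reward
    beta[arm] += 1 - reward
```

Under the greedy rule the agent can lock onto whichever arm happens to have the highest posterior mean early on, whereas sampling from the posterior keeps every arm that is still plausibly optimal in play.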