# Notes on: Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018): A Tutorial on Thompson Sampling

## Notation

• $\mathcal{X}$: the set of actions
• $\theta_1, \dots, \theta_K$: the probabilities of success for the corresponding action / arm
• $G = (V, E)$ denotes a graph with vertices $V$ and edges $E$
• $\mathbb{H}_{t-1} = (x_1, y_1, \dots, x_{t-1}, y_{t-1})$: historical data
• $r_t = r(y_t)$ denotes the reward obtained by taking action $x_t$ leading to observed outcome $y_t$
• Per-period regret is the difference between the expected reward using the optimal action and the reward received by the current action. In the $K$-armed bandit problem: $\text{regret}_t(\theta) = \max_k \theta_k - \theta_{x_t}$
• $q_\theta(\cdot \mid x_t)$ is the conditional distribution of the system (i.e. what we want to learn)
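The per-period regret definition above can be made concrete with a tiny helper; the arm probabilities used here are illustrative, not from the tutorial:

```python
def per_period_regret(theta, chosen_arm):
    """Per-period regret for a Bernoulli bandit:
    regret_t = max_k theta_k - theta_{x_t}."""
    return max(theta) - theta[chosen_arm]

# Hypothetical arms with success probabilities 0.1, 0.3, 0.6.
print(per_period_regret([0.1, 0.3, 0.6], chosen_arm=1))  # picked the 0.3 arm
print(per_period_regret([0.1, 0.3, 0.6], chosen_arm=2))  # picked the optimal arm
```

Choosing the optimal arm gives zero per-period regret; any other choice incurs the gap to the best arm.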

## Overview

• Useful for online decision problems

## K-armed bandit

• Suppose there are $K$ actions, and when played, any action yields either a success or a failure.
• Action $k \in \{1, \dots, K\}$ produces a success with probability $\theta_k$.
• The success probabilities $\theta_1, \dots, \theta_K$ are unknown to the agent, but are fixed over time, and can therefore be learned by experimentation.
• Objective: maximize the cumulative number of successes over $T$ periods, where $T$ is relatively large compared to the number of arms $K$.

## Thompson sampling

• Also known as posterior sampling and probability matching
• Methods such as $\epsilon$-greedy attempt to balance exploration and exploitation by choosing a random action with probability $\epsilon$ and the greedy action with probability $1 - \epsilon$
• This is inefficient since it samples actions regardless of how unlikely they are to be optimal; Thompson sampling attempts to address this

The difference between a greedy algorithm and Thompson sampling is that a greedy algorithm uses the expectation of $\theta$ under the current posterior as its estimate $\hat{\theta}$, while Thompson sampling draws $\hat{\theta}$ as a sample from the posterior.