Notes on: Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018): A Tutorial on Thompson Sampling

Notation

  • $k \in \{1, \dots, K\}$ indexes the actions / arms
  • $\theta_1, \dots, \theta_K$ are the probabilities of success for the corresponding actions / arms
  • $G = (V, E)$ denotes a graph with vertices $V$ and edges $E$
  • Historical data

    $\mathbb{H}_{t-1} = (x_1, y_1, \dots, x_{t-1}, y_{t-1})$

  • $r_t = r(y_t)$ denotes the reward obtained by taking action $x_t$, leading to observed outcome $y_t$
  • Per-period regret is the difference between the expected reward of the optimal action and the expected reward of the action actually chosen. In the armed bandit problem (a small computation sketch follows this list):

    $\text{regret}_t(\theta) = \max_k \theta_k - \theta_{x_t}$

  • $q_\theta(\cdot \mid x_t)$ is the conditional distribution of the system's outcome given the chosen action (i.e. what we want to learn)
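
As a worked example of the per-period regret formula above (a minimal sketch; the probabilities in `theta` and the chosen arm are made-up values, not from the tutorial):

```python
import numpy as np

# Example values for the true (unknown) success probabilities theta_1, ..., theta_K.
theta = np.array([0.9, 0.8, 0.7])

def per_period_regret(theta, action):
    """regret_t(theta) = max_k theta_k - theta_{x_t} for the armed bandit."""
    return theta.max() - theta[action]

print(per_period_regret(theta, action=2))  # 0.9 - 0.7 = 0.2
```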

Overview

  • Useful for online decision problems

Armed bandit

  • Suppose there are $K$ actions, and when played, any action yields either a success or a failure.
  • Action $k \in \{1, \dots, K\}$ produces a success with probability $\theta_k \in [0, 1]$.
  • The success probabilities $\theta_1, \dots, \theta_K$ are unknown to the agent, but are fixed over time, and can therefore be learned by experimentation.
  • Objective: maximize the number of successes over $T$ periods, where $T$ is relatively large compared to the number of arms $K$ (a minimal environment sketch follows this list).
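
A minimal environment sketch of this setup (the class name `BernoulliBandit` and the example success probabilities are my own, not from the tutorial):

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli bandit: arm k succeeds with a fixed, unknown probability theta_k."""

    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta)
        self.rng = np.random.default_rng(seed)

    def pull(self, k):
        # Return 1 (success) with probability theta_k, otherwise 0 (failure).
        return int(self.rng.random() < self.theta[k])

bandit = BernoulliBandit([0.9, 0.8, 0.7])
print([bandit.pull(0) for _ in range(5)])  # mostly successes, since theta_0 = 0.9
```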

Thompson sampling

  • Also known as posterior sampling and probability matching
  • Methods such as $\epsilon$-greedy attempt to balance exploration and exploitation by choosing a random action with probability $\epsilon$ and taking the greedy action with probability $1 - \epsilon$ (see the sketch after this list)
    • This is inefficient, since it samples actions regardless of how unlikely they are to be optimal; Thompson sampling attempts to address this
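
A sketch of the $\epsilon$-greedy selection rule described above (assuming `estimated_means` holds the agent's current point estimates of the success probabilities; the function name is mine):

```python
import numpy as np

def epsilon_greedy_action(estimated_means, epsilon, rng):
    """With probability epsilon explore uniformly at random, otherwise act greedily."""
    if rng.random() < epsilon:
        # Explore: any arm may be chosen, no matter how unlikely it is to be optimal.
        return int(rng.integers(len(estimated_means)))
    # Exploit: the arm with the highest current point estimate.
    return int(np.argmax(estimated_means))

rng = np.random.default_rng(0)
print(epsilon_greedy_action(np.array([0.6, 0.5, 0.4]), epsilon=0.1, rng=rng))
```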

(Pseudocode from the tutorial comparing the greedy algorithm with Thompson sampling.)

The difference between a greedy algorithm and Thompson sampling lies in how the parameter estimate $\hat{\theta}$ is obtained from the current posterior over $\theta$: a greedy algorithm uses the posterior expectation, $\hat{\theta} = \mathbb{E}[\theta \mid \mathbb{H}_{t-1}]$, whereas Thompson sampling draws $\hat{\theta}$ as a random sample from the posterior.
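
A minimal Beta-Bernoulli sketch of this difference, using the conjugate Beta posterior from the tutorial's armed-bandit example (function names and priors here are mine): the greedy rule plugs in the posterior mean as $\hat{\theta}$, while Thompson sampling draws $\hat{\theta}$ from the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# One Beta(alpha_k, beta_k) posterior per arm; Beta(1, 1) is a uniform prior.
alpha = np.ones(3)
beta = np.ones(3)

def greedy_action(alpha, beta):
    # Greedy: use the posterior mean alpha / (alpha + beta) as the estimate theta_hat.
    theta_hat = alpha / (alpha + beta)
    return int(np.argmax(theta_hat))

def thompson_action(alpha, beta):
    # Thompson sampling: sample theta_hat from each arm's Beta posterior (probability matching).
    theta_hat = rng.beta(alpha, beta)
    return int(np.argmax(theta_hat))

def update(alpha, beta, k, reward):
    # Conjugate update after observing a success (1) or failure (0) on arm k.
    alpha[k] += reward
    beta[k] += 1 - reward
```

When run for many rounds against a bandit environment (e.g. the sketch above), the Thompson rule keeps occasionally trying arms whose posteriors still plausibly contain the largest success probability, while the greedy rule can lock onto a suboptimal arm early.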

Approximations

Langevin Monte Carlo