Notes on: Kocsis, L., & Szepesvári, C. (2006): Bandit Based Monte-Carlo Planning

Notation

  • $\mathcal{S}$ is the state space
  • $\mathcal{A}$ is the action space
  • $\mathcal{P}$ is the transition probabilities
  • $\mathcal{R}$ is the reward space
  • $\gamma$ is the discount factor for future reward
  • Markov Decision Process is defined by the 5-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
  • episode is a sequence of state-action-reward triplets
  • $s$ is a state
  • $d$ is a depth
  • $n$ denotes the number of episodes
  • $Q_t(s, a, d)$ is estimated value of action $a$ at state $s$ at depth $d$ and time $t$
  • $N_{s,d}(t)$ is the number of times state $s$ has been visited up to time $t$ at depth $d$
  • $N_{s,a,d}(t)$ is the number of times action $a$ was selected when state $s$ has been visited, up to time $t$ at depth $d$ (see the bookkeeping sketch after this list)
  • failure probability refers to the probability of the estimated optimal machine / arm not being equal to the "real" optimal machine / arm
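
A minimal sketch of the bookkeeping this notation implies, in Python. This is my own structuring, not from the paper; `NodeStats` and its field names are invented for illustration, and states are assumed hashable.

```python
from collections import defaultdict

class NodeStats:
    """Per-(state, depth) statistics, mirroring the notation above:
    Q_t(s, a, d), N_{s,d}(t), N_{s,a,d}(t)."""

    def __init__(self):
        self.n_visits = 0                    # N_{s,d}(t): visits to this (state, depth)
        self.n_action = defaultdict(int)     # N_{s,a,d}(t): times action a was selected here
        self.q_value = defaultdict(float)    # Q_t(s, a, d): running mean of observed payoffs

    def update(self, action, payoff):
        """Incrementally update the running average for `action` with `payoff`."""
        self.n_visits += 1
        self.n_action[action] += 1
        n = self.n_action[action]
        self.q_value[action] += (payoff - self.q_value[action]) / n

# One table entry per encountered (state, depth) pair.
stats = defaultdict(NodeStats)   # key: (state, depth)
```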

Overview

  • Introduces a new algorithm, UCT, which is the UCB1 algorithm described in [auer02] adapted to Monte-Carlo planning
  • For finite-horizon or discounted MDPs, the algorithm is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling

The UCT algorithm

Rollout-based planning

  • Builds a lookahead tree by repeatedly sampling episodes from the initial state
  • Rollout-based algorithms allow us to keep track of estimates of the actions' values at the sampled states encountered in earlier episodes; if some state is re-encountered, the estimated action-values can be used to bias the choice of which action to follow, potentially speeding up convergence of the value estimates (see the sketch after this list)
  • Thus, for domains where the set of successor states concentrates to a few states only, rollout-based algorithms implementing selective sampling might have an advantage over other methods
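
Roughly, a rollout-based planner can be organised like the following sketch. This is my own skeleton, not the paper's pseudocode; `select_action`, `sample_transition`, and `update_stats` are hypothetical hooks standing in for the planner-specific parts (e.g. UCT's UCB1-based selection described below).

```python
def rollout_based_plan(initial_state, n_episodes, horizon, sample_transition,
                       select_action, update_stats, gamma=1.0):
    """Repeatedly sample episodes from `initial_state`, reusing statistics
    gathered in earlier episodes to bias action selection at revisited states."""
    for _ in range(n_episodes):
        trajectory = []              # (state, depth, action, reward) tuples of this episode
        state = initial_state
        for depth in range(horizon):
            action = select_action(state, depth)             # biased by earlier episodes
            next_state, reward = sample_transition(state, action)
            trajectory.append((state, depth, action, reward))
            state = next_state
        # Propagate the (discounted) return back along the episode.
        ret = 0.0
        for state, depth, action, reward in reversed(trajectory):
            ret = reward + gamma * ret
            update_stats(state, depth, action, ret)
```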

Stochastic bandit problems and UCB1

This algorithm uses UCB1 in the internal nodes to select the actions to be sampled next.

UCB1 keeps track of average rewards $\bar{X}_{i,T_i(t-1)}$ for all the gambling machines / arms and chooses the machine / arm with the best upper confidence bound:

$$I_t = \arg\max_{i \in \{1, \dots, K\}} \left\{ \bar{X}_{i,T_i(t-1)} + c_{t-1,\,T_i(t-1)} \right\}$$

where $T_i(t)$ is the number of times arm $i$ has been played up to time $t$, and $c_{t,s}$ is a bias sequence chosen to be

$$c_{t,s} = \sqrt{\frac{2 \ln t}{s}}$$

The bias sequence is such that if the rewards $X_{it}$ were i.i.d. then the inequalities

$$P\left(\bar{X}_{is} \ge \mu_i + c_{t,s}\right) \le t^{-4}, \qquad P\left(\bar{X}_{is} \le \mu_i - c_{t,s}\right) \le t^{-4}$$

would be satisfied. This follows from Hoeffding's inequality.
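
A small sketch of UCB1 arm selection as I read the formulas above (my own code, not from the paper); with the default `c = sqrt(2)` the index reduces to the bias sequence $\sqrt{2 \ln t / s}$.

```python
import math
import random

def ucb1_select(q_values, counts, total_count, c=math.sqrt(2)):
    """Pick the arm maximising the UCB1 index  X̄_i + c * sqrt(ln t / T_i(t)).

    q_values[i]  -- average reward of arm i so far (X̄_i)
    counts[i]    -- number of times arm i has been played (T_i(t))
    total_count  -- total number of plays so far (t)
    """
    # Play each arm once before applying the bound (T_i = 0 would divide by zero).
    untried = [i for i, n in enumerate(counts) if n == 0]
    if untried:
        return random.choice(untried)
    return max(
        range(len(q_values)),
        key=lambda i: q_values[i] + c * math.sqrt(math.log(total_count) / counts[i]),
    )
```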

Algorithm

  • Treats the action selection problem as a separate multi-armed bandit for every (explored) internal node
  • Arms / machines correspond to actions, and the payoffs to the cumulated (discounted) rewards of the paths originating at the node.

[Figure: pseudocode of the UCT algorithm, reproduced from the paper.]
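
My reading of the algorithm as a compact Python sketch: UCB1 is applied at every explored (state, depth) node, with untried actions sampled first, which plays the role of the Monte-Carlo rollout below the tree. The class and its structure are mine, not the paper's pseudocode; it assumes a generative model `sample_transition(state, action) -> (next_state, reward)` and hashable states.

```python
import math
import random
from collections import defaultdict

class UCT:
    """Sketch of UCT: UCB1 action selection at every explored (state, depth) node."""

    def __init__(self, actions, sample_transition, horizon, cp=1.0, gamma=1.0):
        self.actions = actions                      # available actions (assumed state-independent here)
        self.sample_transition = sample_transition  # generative model: (s, a) -> (s', r)
        self.horizon = horizon
        self.cp = cp                                # exploration constant (C_p in the notes)
        self.gamma = gamma
        self.n_state = defaultdict(int)             # N_{s,d}
        self.n_sa = defaultdict(int)                # N_{s,a,d}
        self.q = defaultdict(float)                 # Q(s, a, d)

    def search(self, root_state, n_episodes):
        for _ in range(n_episodes):
            self._episode(root_state, 0)
        # Recommend the most-sampled action at the root.
        return max(self.actions, key=lambda a: self.n_sa[(root_state, a, 0)])

    def _episode(self, state, depth):
        """Sample one episode from `state`, returning the (discounted) return."""
        if depth >= self.horizon:
            return 0.0
        action = self._select(state, depth)
        next_state, reward = self.sample_transition(state, action)
        ret = reward + self.gamma * self._episode(next_state, depth + 1)
        # Update counts and the running-average value estimate.
        self.n_state[(state, depth)] += 1
        self.n_sa[(state, action, depth)] += 1
        n = self.n_sa[(state, action, depth)]
        self.q[(state, action, depth)] += (ret - self.q[(state, action, depth)]) / n
        return ret

    def _select(self, state, depth):
        """UCB1 over the actions at this (state, depth); untried actions first."""
        untried = [a for a in self.actions if self.n_sa[(state, a, depth)] == 0]
        if untried:
            return random.choice(untried)
        t = self.n_state[(state, depth)]
        return max(
            self.actions,
            key=lambda a: self.q[(state, a, depth)]
            + 2 * self.cp * math.sqrt(math.log(t) / self.n_sa[(state, a, depth)]),
        )
```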

Assumptions

  • There is a single machine / arm which is optimal
  • Expected values of the averages $\bar{X}_{in}$ converge
  • $\mu_{in} = E\left[\bar{X}_{in}\right]$ and $\mu_i = \lim_{n \to \infty} \mu_{in}$
  • $\delta_{in} = \mu_{in} - \mu_i$
  • Assume the tail inequalities for UCB1 described earlier to be satisfied for $c_{t,s} = 2 C_p \sqrt{\frac{\ln t}{s}}$ with some constant $C_p > 0$ (written out below)
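
For concreteness, my reconstruction of the assumed tail condition, written with the drifting means $\mu_{is} = E[\bar{X}_{is}]$ in place of the i.i.d. means used earlier:

$$P\left(\bar{X}_{is} \ge \mu_{is} + c_{t,s}\right) \le t^{-4}, \qquad P\left(\bar{X}_{is} \le \mu_{is} - c_{t,s}\right) \le t^{-4}, \qquad \text{with } c_{t,s} = 2 C_p \sqrt{\frac{\ln t}{s}}.$$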

Main result

Consider a finite-horizon MDP with rewards scaled to lie in the $[0, 1]$ interval. Let the horizon of the MDP be $D$, and the number of actions per state be $K$.

Consider algorithm UCT such that the bias terms of UCB1 are multiplied by $D$, the MDP horizon. Then the bias of the estimated expected payoff, $\bar{X}_n$, is $O\left(\frac{K D \log n + K^D}{n}\right)$.

Further, the failure probability at the root converges to zero at a polynomial rate as the number of episodes $n$ grows to infinity.

The fact that the failure probability of the root converges to zero at a polynomial rate is pretty neat.

There's also a theorem in the paper which specifies what this bound actually is.

  • [auer02] Peter Auer, Nicolò Cesa-Bianchi & Paul Fischer, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47(2/3), 235-256 (2002).