# Notes on: Kocsis, L., & Szepesvári, C. (2006): Bandit Based Monte-Carlo Planning

## Notation

- $S$ is the **state space**
- $A$ is the **action space**
- $P$ is the **transition probabilities**
- $R$ is the **reward space**
- $\gamma$ is the discount factor for future reward
- A **Markov Decision Process** is defined by the 5-tuple $(S, A, P, R, \gamma)$
- An **episode** is a sequence of state-action-reward triplets $(s_t, a_t, r_t)$
- $s$ is a **state**
- $d$ is a **depth**
- $n$ denotes the number of episodes
- $Q_t(s, a, d)$ is the **estimated value** of action $a$ at state $s$ at depth $d$ and time $t$
- $N_{s,d}(t)$ is the number of times state $s$ has been visited up to time $t$ at depth $d$
- $N_{s,a,d}(t)$ is the number of times action $a$ was selected when state $s$ has been visited, up to time $t$ at depth $d$
- The **failure probability** refers to the probability of the estimated optimal machine / arm **not** being equal to the "real" optimal machine / arm

## Overview

- Introduces a new algorithm, **UCT**, which is the **UCB1** algorithm described in auer02 adapted to Monte-Carlo planning
- For finite-horizon or discounted MDPs, the algorithm is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling

## The UCT algorithm

### Rollout-based planning

- Builds a lookahead tree by repeatedly sampling episodes from the initial state
- Rollout-based algorithms allow us to keep track of estimates of the actions' values at the sampled states encountered in earlier episodes; if some state is *reencountered*, the estimated action-values can be used to *bias* the choice of which action to follow, potentially speeding up convergence of the value estimates
- Thus, for domains where the set of successor states concentrates on a few states only, rollout-based algorithms implementing selective sampling might have an advantage over other methods

### Stochastic bandit problems and UCB1

This algorithm uses **UCB1** in the internal nodes to select the actions to be sampled next.

**UCB1** keeps track of the average rewards $\overline{X}_{i,T_i(t-1)}$ for all the gambling machines / arms and chooses the machine / arm with the best *upper confidence bound*:

$$I_t = \operatorname*{arg\,max}_{i \in \{1,\dots,K\}} \left\{ \overline{X}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)} \right\}$$

where $c_{t,s}$ is a bias sequence chosen to be

$$c_{t,s} = \sqrt{\frac{2 \ln t}{s}}$$

The bias sequence is such that if $X_{i1}, \dots, X_{is}$ were i.i.d. then the inequalities

$$P\left(\overline{X}_{is} \ge \mu_i + c_{t,s}\right) \le t^{-4}, \qquad P\left(\overline{X}_{is} \le \mu_i - c_{t,s}\right) \le t^{-4}$$

were satisfied. This follows from Hoeffding's inequality.
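As a concrete illustration, here is a minimal UCB1 loop in Python. This is a sketch, not the paper's code; the `ucb1_select` helper and the two Bernoulli arms are made up for this example:

```python
import math
import random

def ucb1_select(counts, sums, t):
    """Pick the arm maximizing average reward + sqrt(2 ln t / n_i).

    counts[i]: times arm i was played; sums[i]: total reward from arm i.
    Arms that have never been played are tried first.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(
        range(len(counts)),
        key=lambda i: sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]),
    )

# Toy run on two hypothetical Bernoulli arms with success probabilities 0.2 and 0.8:
probs = [0.2, 0.8]
counts, sums = [0, 0], [0.0, 0.0]
random.seed(0)
for t in range(1, 1001):
    i = ucb1_select(counts, sums, t)
    reward = 1.0 if random.random() < probs[i] else 0.0
    counts[i] += 1
    sums[i] += reward
# After 1000 rounds the better arm should have received most of the plays.
```

Note how exploration is driven purely by the confidence term: as an arm's play count $n_i$ grows, its bonus shrinks and the average reward dominates the selection.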

### Algorithm

- Treats the *action selection problem* as a separate multi-armed bandit for every (explored) internal node
- Arms / machines correspond to actions, and the payoffs to the cumulated (discounted) rewards of the paths originating at the nodes
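The per-node bandit idea can be sketched as follows. This is a simplified illustration, not the paper's pseudocode: the `env.step(state, action) -> (next_state, reward)` interface and `ToyEnv` are hypothetical, and a fixed exploration constant `c` stands in for the paper's horizon-scaled bias term:

```python
import math
import random

class UCTNode:
    """Per-state bandit statistics for one explored node of the lookahead tree."""
    def __init__(self, n_actions):
        self.visits = 0
        self.action_visits = [0] * n_actions
        self.action_values = [0.0] * n_actions   # running mean of sampled returns
        self.children = {}                       # (action, next_state) -> UCTNode

def uct_search(env, state, n_actions, n_episodes=300, depth=10, c=1.4):
    """Run UCT from `state` and return the most-visited root action."""
    root = UCTNode(n_actions)

    def rollout(s, d):
        # Default policy: uniform-random actions until the depth limit.
        total = 0.0
        for _ in range(d):
            s, r = env.step(s, random.randrange(n_actions))
            total += r
        return total

    def simulate(node, s, d):
        if d == 0:
            return 0.0
        node.visits += 1
        untried = [a for a in range(n_actions) if node.action_visits[a] == 0]
        if untried:                          # play each arm once before UCB kicks in
            a = random.choice(untried)
        else:                                # UCB1 over this node's actions
            a = max(range(n_actions),
                    key=lambda i: node.action_values[i]
                    + c * math.sqrt(math.log(node.visits) / node.action_visits[i]))
        s2, r = env.step(s, a)
        key = (a, s2)
        if key not in node.children:         # expand one new node, then roll out
            node.children[key] = UCTNode(n_actions)
            ret = r + rollout(s2, d - 1)
        else:
            ret = r + simulate(node.children[key], s2, d - 1)
        node.action_visits[a] += 1           # incremental mean update of the payoff
        node.action_values[a] += (ret - node.action_values[a]) / node.action_visits[a]
        return ret

    for _ in range(n_episodes):
        simulate(root, state, depth)
    return max(range(n_actions), key=lambda a: root.action_visits[a])

# Toy check on a made-up static environment where action 1 always pays 1:
class ToyEnv:
    def step(self, state, action):
        return state, float(action == 1)

random.seed(1)
best = uct_search(ToyEnv(), state=0, n_actions=2)
```

Each node's payoff samples are exactly the cumulated rewards of the episode suffix starting there, which is why a bandit per node is the natural fit.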

### Assumptions

- There is a single machine / arm which is optimal
- Expected values of the averages converge: $\mu_{in} = \mathbb{E}[\overline{X}_{in}]$ and $\mu_i = \lim_{n \to \infty} \mu_{in}$
- The drift is $\delta_{in} = \mu_{in} - \mu_i$
- Assume the tail inequalities for **UCB1** described earlier to be satisfied for $c_{t,s} = 2 C_p \sqrt{\frac{\ln t}{s}}$ with some constant $C_p > 0$

### Main result

Consider a finite-horizon MDP with rewards scaled to lie in the $[0, 1]$ interval. Let the horizon of the MDP be $D$, and the number of actions per state be $K$.

Consider algorithm **UCT** such that the bias terms of **UCB1** are multiplied by $D$, the MDP horizon. Then the bias of the estimated expected payoff, $\overline{X}_n$, is $O(KD \log(n) / n)$.

Further, the **failure probability** at the root converges to zero at a *polynomial rate* as the number of episodes grows to infinity.

The fact that the **failure probability** of the root converges to zero at a polynomial rate is pretty neat.

There's also a theorem in the paper which specifies what this bound actually is.