# Notes on: Kitchen, A., & Benedetti, M. (2018): ExIt-OOS: Towards Learning from Planning in Imperfect Information Games

## 1 Notation

- **infoset**: a group of states which are indistinguishable to a player (in this case, states with distinguishable *history* are in different infosets; basically taking the "state" to be the entire history)
- **targets**: refers to "labeled examples", i.e. if a simulation is "targeted", it means that we will use it as a training example
- **OSS**: Online Outcome Sampling
- **ExIt**: Expert Iteration
- \(\mathbf{p}_a\) is the probability of taking action \(a\) given the infoset encoding \(I_s\) at state \(s\)
- \(f_{\theta}(I_s)\) is the model with parameters \(\theta\)
- \(l\) is the "loss"; in this case, the KL divergence between the OSS expert-provided target and \(f_{\theta}(I_s)\)
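The loss above can be sketched directly from these definitions. A minimal example, assuming the target and model policy are plain probability vectors over actions (the names `target_p` and `model_p`, and the `eps` smoothing term, are my own, not from the paper):

```python
import numpy as np

def kl_loss(target_p, model_p, eps=1e-12):
    """KL(target || model): divergence of the model's policy f_theta(I_s)
    from the OSS expert's target distribution over actions."""
    target_p = np.asarray(target_p, dtype=float)
    model_p = np.asarray(model_p, dtype=float)
    # eps guards against log(0) for zero-probability actions.
    return float(np.sum(target_p * np.log((target_p + eps) / (model_p + eps))))
```

For example, `kl_loss(p, p)` is essentially 0, and the loss grows as the model's policy drifts from the expert target.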

## 2 Online Outcome Sampling

- See lanctot2014search for info.
- Sampling algorithm that uses regret matching to minimize the counterfactual regret locally at each infoset in a game tree
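The regret-matching step used locally at each infoset can be sketched as follows. This is the standard regret-matching rule (play in proportion to positive cumulative regret), not code from the paper; the function name and array representation are my own:

```python
import numpy as np

def regret_matching(cumulative_regrets):
    """Return a policy over actions proportional to positive cumulative regret.

    `cumulative_regrets` is a vector with one entry per action at this infoset.
    """
    positive = np.maximum(np.asarray(cumulative_regrets, dtype=float), 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # If no action has positive regret, fall back to the uniform policy.
    return np.full(len(positive), 1.0 / len(positive))
```

For instance, with regrets `[2.0, 0.0, -1.0]` only the first action has positive regret, so it gets all the probability mass; with all regrets non-positive the rule returns uniform play.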

## 3 Expert Iteration

## 4 Bibliography

- [lanctot2014search] Lanctot, Lisý & Bowling (2014). Search in Imperfect Information Games using Online Monte Carlo Counterfactual Regret Minimization. In: AAAI Workshop on Computer Poker and Imperfect Information.