# Notes on: Kitchen, A., & Benedetti, M. (2018): ExIt-OOS: towards learning from planning in imperfect information games

## 1 Notation

• infoset: a group of states that are indistinguishable to a player (here, states with distinguishable histories belong to different infosets; effectively the "state" is the entire history)
• "targets" refer to labeled examples, i.e. if a simulation is "targeted", its result will be used as a training example
• OSS: Online Outcome Sampling
• ExIt: Expert Iteration
• $$\mathbf{p}_a$$ is the probability of taking action $$a$$ given the infoset encoding $$I_s$$ at state $$s$$
• $$f_{\theta}(I_s)$$ is the model with parameters $$\theta$$
• $$l$$ is the "loss"; in this case, the KL divergence between the OSS expert provided target and $$f_{\theta}(I_s)$$
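The loss above can be made concrete with a small sketch. The function name and the use of NumPy are my own choices, not from the paper; it just computes the KL divergence between the OSS expert's target distribution and the model's predicted action distribution:

```python
import numpy as np

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for discrete action distributions.

    `target` is the expert-provided distribution over actions at an
    infoset; `predicted` is the model output f_theta(I_s). `eps` guards
    against log(0) for zero-probability actions.
    """
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum(target * (np.log(target + eps) - np.log(predicted + eps))))
```

The divergence is zero when the model matches the expert exactly and positive otherwise, which is what makes it usable as a training loss.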

## 2 Online Outcome Sampling

• See lanctot2014search for info.
• Sampling algorithm that uses regret matching to minimize the counterfactual regret locally at each infoset in a game tree
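The regret-matching step mentioned above is a standard CFR component; a minimal sketch (my own code, not the paper's) of how cumulative regrets at an infoset map to a strategy:

```python
import numpy as np

def regret_matching(regrets):
    """Map cumulative counterfactual regrets at an infoset to a strategy.

    Actions with positive regret are played in proportion to that regret;
    if no action has positive regret, fall back to a uniform strategy.
    """
    regrets = np.asarray(regrets, dtype=float)
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(regrets), 1.0 / len(regrets))
```

Minimizing this local regret at every infoset is what drives the sampled strategies toward an approximate equilibrium.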

## 4 Bibliography

• [lanctot2014search] Lanctot, Lisý & Bowling, Search in Imperfect Information Games using Online Monte Carlo Counterfactual Regret Minimization, in: AAAI Workshop on Computer Poker and Imperfect Information (2014)