Notes on: Kitchen, A., & Benedetti, M. (2018): ExIt-OOS: Towards Learning from Planning in Imperfect Information Games

1 Notation

  • infoset: a group of states that are indistinguishable to a player (here, states with distinguishable histories belong to different infosets; effectively the "state" is taken to be the entire history)
  • "targets" refers to labeled examples: if a simulation is "targeted", its result will be used as a training example
  • OSS: Online Outcome Sampling
  • ExIt: Expert Iteration
  • \(\mathbf{p}_a\) is the probability of taking action \(a\) given the infoset encoding \(I_s\) at state \(s\)
  • \(f_{\theta}(I_s)\) is the model with parameters \(\theta\)
  • \(l\) is the loss; here, the KL divergence between the target policy provided by the OSS expert and \(f_{\theta}(I_s)\)
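The loss above can be written out concretely. A minimal sketch of computing \(l = \mathrm{KL}(\mathbf{p} \,\|\, f_{\theta}(I_s))\) between an expert-provided target policy and the model's predicted policy (the function name and the epsilon smoothing are my own, not from the paper):

```python
import numpy as np

def kl_loss(target_p, model_p, eps=1e-12):
    """KL divergence between the OSS expert's target policy (target_p)
    and the model's predicted policy f_theta(I_s) (model_p).
    eps guards against log(0) for zero-probability actions."""
    target_p = np.asarray(target_p, dtype=float)
    model_p = np.asarray(model_p, dtype=float)
    return float(np.sum(target_p * (np.log(target_p + eps)
                                    - np.log(model_p + eps))))

# Identical distributions give (near-)zero loss.
uniform = np.array([0.25, 0.25, 0.25, 0.25])
print(kl_loss(uniform, uniform))  # ~0.0
```

Note the KL divergence is asymmetric: the target distribution goes in the first argument, so the model is penalized for putting low probability on actions the expert favors.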

2 Online Outcome Sampling

  • See lanctot2014search for info.
  • A sampling algorithm that uses regret matching to minimize counterfactual regret locally at each infoset in the game tree
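The regret-matching step referred to above is simple to state: at each infoset, play actions in proportion to their positive cumulative counterfactual regret, falling back to uniform when no regret is positive. A sketch (not the paper's implementation):

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Current policy at an infoset under regret matching:
    probabilities proportional to positive cumulative regret,
    uniform if no action has positive regret."""
    cumulative_regret = np.asarray(cumulative_regret, dtype=float)
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # No positive regret anywhere: play uniformly at random.
    n = len(cumulative_regret)
    return np.full(n, 1.0 / n)

print(regret_matching([2.0, -1.0, 1.0]))  # [2/3, 0, 1/3]
```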

3 Expert Iteration
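Tying the pieces together: Expert Iteration alternates between a planning expert (here, OSS) that produces target policies at visited infosets, and an apprentice \(f_{\theta}\) trained to imitate those targets via the loss \(l\). A heavily simplified sketch of one such cycle; `oss_expert` is a random stand-in for the actual search, and the tabular dictionary stands in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def oss_expert(infoset_key, n_actions=3):
    """Stand-in for OSS search: returns a target policy p for the infoset.
    A real expert would run targeted simulations from this infoset."""
    probs = rng.random(n_actions)
    return probs / probs.sum()

# Tabular stand-in for f_theta: infoset encoding -> predicted policy.
apprentice = {}

for _ in range(10):                       # targeted simulations
    infoset_key = int(rng.integers(0, 4)) # sampled infoset I_s
    target = oss_expert(infoset_key)      # expert-provided training target
    old = apprentice.get(infoset_key, np.full(3, 1.0 / 3.0))
    # Imitation update nudging the apprentice toward the expert's target
    # (a real implementation would take a gradient step on the KL loss l).
    apprentice[infoset_key] = 0.9 * old + 0.1 * target
```

The key idea is that the expert's search output, not the game outcome alone, supervises the apprentice, so each iteration distills planning into the policy.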

4 Bibliography

  • [lanctot2014search] Lanctot, Lisý & Bowling (2014), Search in Imperfect Information Games using Online Monte Carlo Counterfactual Regret Minimization, AAAI Workshop on Computer Poker and Imperfect Information.