# Notes on: Lanctot, M., Lisý, V., & Bowling, M. (2014): Search in imperfect information games using online Monte Carlo counterfactual regret minimization

## 1 Notation

• Chance $$c$$ refers to the fixed stochastic strategy of the environment (e.g. which card comes up on the flop in Texas Hold'em)
• $$h \in H$$ denotes a history
• $$i \in N$$ indexes the players, and $$P(h)$$ denotes the player whose turn it is at $$h$$
• $$z \in Z \subseteq H$$ denotes a terminal history, which is a full game from start to end
• $$u_i(z)$$ denotes the payoff for player $$i$$ from the terminal history $$z$$
• $$a \in \mathcal{A}$$ denotes the action space and $$A(I)$$ denotes the set of possible actions given infoset $$I \in \mathcal{I}_i$$
• $$ha$$ denotes the history $$h$$ followed by action $$a$$, and $$h \sqsubset h'$$ is notation for "$$h'$$ is a successor of $$h$$"

\begin{equation*} h' = ha \end{equation*}
• Probability of terminal history under $$\sigma$$

\begin{equation*} \pi^{\sigma}(z) = \prod_{i \in N \cup \{c\}} \pi_i^{\sigma}(z) \end{equation*}

where

\begin{equation*} \pi_i^{\sigma}(z) := \prod_{ha \sqsubset z, P(h) = i} \sigma_i \Big( I(h), a \Big) \end{equation*}

I.e. the probability of a terminal history factors into per-player contributions: $$\pi_i^{\sigma}(z)$$ is the product of the probabilities of player $$i$$'s actions, taken at every prefix $$h$$ of $$z$$ where it is $$i$$'s turn

• $$\sigma_{-i}$$ and $$\pi_{-i}^\sigma$$ refer to the strategies of player $$i$$'s opponents and to the product of the opponents' and chance's action probabilities, respectively
• Behavioral strategy for player $$i$$ is

\begin{equation*} \begin{split} \sigma_i: \quad & \mathcal{I}_i \to \mathcal{P} (\mathcal{A}) \\ & I \mapsto \sigma_i(I) := p \Big( A(I) \Big) \end{split} \end{equation*}
• Counterfactual value:

\begin{equation*} v_i(I, \sigma) = \sum_{(h, z) \in Z_I}^{} \pi_{-i}^{\sigma}(h) \ \pi^{\sigma}(h, z) \ u_i(z) \end{equation*}

where

\begin{equation*} Z_I = \left\{ (h, z) \mid z \in Z,\ h \in I, \ h \sqsubset z \right\} \end{equation*}
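The reach-probability factorisation above can be checked on a single hand-built terminal history. A minimal Python sketch; the game, infosets, and strategy numbers are all hypothetical, not taken from the paper:

```python
# Reach probabilities on one terminal history of a toy game.
sigma = {
    ("c", "root"): {"H": 0.5, "T": 0.5},        # chance's fixed strategy
    (1, "I1"):     {"bet": 0.7, "check": 0.3},  # player 1 at infoset I1
    (2, "I2"):     {"call": 0.4, "fold": 0.6},  # player 2 at infoset I2
}

# Terminal history z: chance plays H, player 1 bets, player 2 calls.
z = [("c", "root", "H"), (1, "I1", "bet"), (2, "I2", "call")]

def pi(z, players):
    """Product of action probabilities along z contributed by `players`."""
    p = 1.0
    for player, infoset, action in z:
        if player in players:
            p *= sigma[(player, infoset)][action]
    return p

pi_full = pi(z, {"c", 1, 2})   # pi^sigma(z)
pi_1 = pi(z, {1})              # pi_1^sigma(z)
pi_minus_1 = pi(z, {"c", 2})   # pi_{-1}^sigma(z): opponents and chance

# The factorisation pi^sigma(z) = pi_1^sigma(z) * pi_{-1}^sigma(z) holds:
assert abs(pi_full - pi_1 * pi_minus_1) < 1e-12
assert abs(pi_full - 0.5 * 0.7 * 0.4) < 1e-12
```

Note how $$\pi_{-1}^\sigma$$ folds chance's probabilities in with the opponent's, matching the convention above.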

## 2 Equilibria and exploitability

An $$\epsilon \text{-Nash equilibrium}$$, $$\sigma$$, is an ordered tuple $$(\sigma_1, \dots, \sigma_N)$$ such that for any player $$i$$, the benefit of switching from $$\sigma_i$$ to any alternative $$\sigma_i'$$ is at most $$\epsilon$$, i.e.

\begin{equation*} \Big( \max_{\textcolor{red}{\sigma_i'} \in \Sigma_i} \underset{\sigma_1, \dots, \textcolor{red}{\sigma_i'}, \dots, \sigma_N}{\mathbb{E}} [u_i(z)] \Big) - \underset{\sigma_1, \dots, \textcolor{green}{\sigma_i} , \dots, \sigma_N}{\mathbb{E}} [u_i(z)] \le \epsilon \end{equation*}

If $$\epsilon = 0$$, it's called a Nash equilibrium.
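The condition is easy to check exhaustively in a small matrix game, since a best response can always be taken to be pure. A minimal sketch in matching pennies (a standard zero-sum example, not from the paper):

```python
# Matching pennies: player 1 gets +1 if the choices match, -1 otherwise.
U1 = {("H", "H"): 1, ("H", "T"): -1, ("T", "H"): -1, ("T", "T"): 1}

def expected_u1(p1, p2):
    """E[u_1(z)] under mixed strategies p1 and p2."""
    return sum(p1[a] * p2[b] * U1[(a, b)] for a in "HT" for b in "HT")

def best_response_gain(p1, p2):
    """Max benefit player 1 can get by deviating from p1 (a pure
    best response suffices in a matrix game)."""
    current = expected_u1(p1, p2)
    best = max(expected_u1({x: 1.0 if x == a else 0.0 for x in "HT"}, p2)
               for a in "HT")
    return best - current

uniform = {"H": 0.5, "T": 0.5}
# Against the uniform profile no deviation helps, so epsilon = 0:
print(best_response_gain(uniform, uniform))  # 0.0
```

Since the gain is $$0$$, the uniform profile is a Nash equilibrium of matching pennies.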

A constant-sum game is a game such that

\begin{equation*} \sum_{i=1}^{N} u_i(z) = k, \quad \forall z \in Z \end{equation*}

for some constant $$k$$; it is zero-sum when $$k = 0$$.

The exploitability of a profile $$(\sigma_1, \dots, \sigma_N)$$ is defined

\begin{equation*} \epsilon_{\sigma} = \sum_{i=1}^{N} \max_{\sigma_i' \in \Sigma_i} \Big( \underset{\sigma_1, \dots, \sigma_i' , \dots, \sigma_N}{\mathbb{E}} [u_i(z)] \Big) \end{equation*}

i.e. the sum of each player's best-response value against the others' strategies; since the $$\mathbb{E}[u_i(z)]$$ terms under $$\sigma$$ itself sum to zero in a zero-sum game, this equals the total gain available from deviating, and $$\epsilon_\sigma = 0$$ exactly when $$\sigma$$ is a Nash equilibrium.
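The formula can be evaluated directly in matching pennies. A minimal sketch (again a hypothetical toy example, not from the paper): the uniform profile has exploitability $$0$$, while a biased strategy can be punished:

```python
# Exploitability of a profile in matching pennies (zero-sum: u2 = -u1).
U1 = {("H", "H"): 1, ("H", "T"): -1, ("T", "H"): -1, ("T", "T"): 1}
U2 = {(b, a): -U1[(a, b)] for (a, b) in U1}  # player 2's payoff table

def best_response_value(payoff, opp):
    """Expected payoff of the best pure response to mixed strategy opp."""
    return max(sum(opp[b] * payoff[(a, b)] for b in "HT") for a in "HT")

def exploitability(p1, p2):
    # Sum over players of the value of best-responding to the others.
    return best_response_value(U1, p2) + best_response_value(U2, p1)

uniform = {"H": 0.5, "T": 0.5}
biased = {"H": 0.8, "T": 0.2}
assert exploitability(uniform, uniform) == 0.0          # Nash equilibrium
assert abs(exploitability(biased, uniform) - 0.6) < 1e-9  # punishable
```

The biased player 1 is exploited by always playing tails: player 2 then mismatches $$80\%$$ of the time, for a best-response value of $$0.6$$.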