# Notes on: Mehta, P., Bukov, M., Wang, C., Day, A. G. R., Richardson, C., Fisher, C. K., & Schwab, D. J. (2018): A high-bias, low-variance introduction to machine learning for physicists

## 1 Notation

• $$\text{Tr}_{\mathbf{x}} p(\mathbf{x})$$ denotes the "trace of $$p$$", i.e. integrating over all $$\mathbf{x}$$ if $$\mathbf{x}$$ is continuous, or summing over all $$\mathbf{x}$$ if discrete

## 2 Variational methods and Mean-field Theory (MFT)

### 2.1 Notation

• $$Z_p$$ denotes the partition function of the distribution $$p$$
• Spin configuration $$\mathbf{s}$$ specifies the values $$s_i$$ of the spins at every lattice site
• Energy of configuration $$\mathbf{s}$$:

\begin{equation*} E(\mathbf{s}, \mathbf{J}) = - \frac{1}{2} \sum_{i, j}^{} J_{ij} s_i s_j - \sum_{i}^{} h_i s_i \end{equation*}

where

• $$h_i$$ is the local magnetic field acting on the spin $$s_i$$
• $$J_{ij}$$ is the interaction strength / coupling between spins $$s_i$$ and $$s_j$$
• Probability of system being in spin configuration $$\mathbf{s}$$ at "temperature" $$\beta^{-1}$$:

\begin{equation*} \begin{split} p( \mathbf{s} \mid \mathbf{J}) &= \frac{1}{Z_p(\mathbf{J})} e^{- \beta E(\mathbf{s}, \mathbf{J})} \\ Z_p(\mathbf{J}) &= \sum_{\left\{ s_i = \pm 1 \right\}}^{} e^{- \beta E(\mathbf{s}, \mathbf{J})} \end{split} \end{equation*}
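The Boltzmann distribution above can be computed exactly for a small system by brute-force enumeration. A minimal sketch (the couplings, fields, system size, and $$\beta$$ here are arbitrary illustrative choices, not values from the paper):

```python
import itertools
import numpy as np

def energy(s, J, h):
    """E(s, J) = -1/2 sum_ij J_ij s_i s_j - sum_i h_i s_i"""
    return -0.5 * s @ J @ s - h @ s

rng = np.random.default_rng(0)
n = 3                        # small enough to enumerate all 2^n configurations
J = rng.normal(size=(n, n))
J = (J + J.T) / 2            # symmetric couplings J_ij = J_ji
np.fill_diagonal(J, 0.0)     # no self-coupling
h = rng.normal(size=n)       # local fields
beta = 1.0                   # inverse "temperature"

# All spin configurations {s_i = ±1}
configs = np.array(list(itertools.product([-1, 1], repeat=n)))
weights = np.exp([-beta * energy(s, J, h) for s in configs])
Z = weights.sum()            # partition function Z_p(J)
p = weights / Z              # Boltzmann probabilities p(s | J)
```

The exhaustive sum over $$2^n$$ configurations is only feasible for tiny $$n$$; for larger systems one resorts to sampling or to the variational/mean-field methods discussed in this section.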

## 3 Energy based models: Maximum Entropy (MaxEnt) Principle, Generative models, and Boltzmann learning

### 3.1 Maximum entropy models: the simplest energy-based generative models

• Shannon entropy of a distribution is defined

\begin{equation*} S_p = - \text{Tr}_{\mathbf{x}} p(\mathbf{x}) \log p(\mathbf{x}) \end{equation*}
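For a discrete distribution the trace is a plain sum, which makes the entropy a one-liner. A small sketch (using the convention $$0 \log 0 = 0$$):

```python
import numpy as np

def shannon_entropy(p):
    """S_p = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                     # drop zero-probability states
    return -np.sum(nz * np.log(nz))

# The uniform distribution over 4 states has the maximal entropy log 4
S_uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])
```

Here `S_uniform` equals $$\log 4 \approx 1.386$$, the largest value any 4-state distribution can attain, consistent with the MaxEnt principle below.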
• Suppose we have a set of functions $$\left\{ f_i(\mathbf{x}) \right\}$$ whose average value we want to fix to some observed values $$\left\langle f_i \right\rangle_{\text{obs}}$$
• Principle of Maximum Entropy states that we should choose the distribution with largest $$S_p$$ subject to constraints that $$\left\langle f_i \right\rangle_{\text{model}} = \left\langle f_i \right\rangle_{\text{obs}}$$ and $$\text{Tr}_{\mathbf{x}} p(\mathbf{x}) = 1$$; using Lagrange multipliers this becomes

\begin{equation*} \mathcal{L}[p] = - S_p + \sum_{i}^{} \lambda_i \Bigg( \left\langle f_i \right\rangle_{\text{obs}} - \int d \mathbf{x} \ f_i(\mathbf{x}) p(\mathbf{x}) \Bigg) + \gamma \Bigg( 1 - \int d \mathbf{x} \ p(\mathbf{x}) \Bigg) \end{equation*}

where

• the second term enforces that the model averages equal the observed averages
• the third term enforces that $$p$$ is normalized, i.e. a valid probability distribution
• Taking the functional derivative and setting it to zero gives

\begin{equation*} 0 = \frac{\delta \mathcal{L}}{\delta p} = \big( \log p(\mathbf{x}) + 1 \big) - \sum_{i}^{} \lambda_i f_i(\mathbf{x}) - \gamma \end{equation*}

which gives us the general form of the maximum entropy distribution:

\begin{equation*} p(\mathbf{x}) = \frac{1}{Z} e^{\sum_i \lambda_i f_i(\mathbf{x})} \end{equation*}
• Maximum entropy distribution is just the usual Boltzmann distribution with energy

\begin{equation*} E(\mathbf{x}) = - \sum_{i}^{} \lambda_i f_i(\mathbf{x}) \end{equation*}
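In practice the multipliers $$\lambda_i$$ are chosen so the model averages match the observed ones. A hedged sketch of the simplest case (a single spin $$s = \pm 1$$ with one constraint on $$\langle s \rangle$$; the target value 0.6 is an arbitrary illustrative choice): here $$p(s) \propto e^{\lambda s}$$, so $$\langle s \rangle_{\text{model}} = \tanh(\lambda)$$ and the constraint can be inverted in closed form.

```python
import numpy as np

obs_mean = 0.6                 # target <s>_obs (assumed value for illustration)
lam = np.arctanh(obs_mean)     # closed-form Lagrange multiplier: tanh(λ) = <s>_obs

s = np.array([-1, 1])
w = np.exp(lam * s)            # unnormalized MaxEnt weights e^{λ s}
p = w / w.sum()                # p(s) = e^{λ s} / Z
model_mean = (p * s).sum()     # <s>_model, should reproduce obs_mean
```

With many constraints or many variables there is no closed form, and the $$\lambda_i$$ are fit iteratively (e.g. by gradient ascent on the log-likelihood), which is the Boltzmann learning discussed later in this section.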