Notes on: Mehta, P., Bukov, M., Wang, C., Day, A. G. R., Richardson, C., Fisher, C. K., & Schwab, D. J. (2018): A high-bias, low-variance introduction to machine learning for physicists

Table of Contents

1 Notation

  • \(\text{Tr}_{\mathbf{x}} p(\mathbf{x})\) denotes the "trace of \(p\)": an integral over all \(\mathbf{x}\) if \(\mathbf{x}\) is continuous, or a sum over all \(\mathbf{x}\) if it is discrete

2 Variational methods and Mean-field Theory (MFT)

2.1 Notation

  • \(Z_p\) denotes the partition function of the distribution \(p\)
  • Spin configuration \(\mathbf{s}\) specifies the values \(s_i\) of the spins at every lattice site
  • Energy of configuration \(\mathbf{s}\):

    \begin{equation*} E(\mathbf{s}, \mathbf{J}) = - \frac{1}{2} \sum_{i, j}^{} J_{ij} s_i s_j - \sum_{i}^{} h_i s_i \end{equation*}


    • \(h_i\) is the local magnetic field acting on the spin \(s_i\)
    • \(J_{ij}\) is the interaction strength / coupling between spins \(s_i\) and \(s_j\)
  • Probability of system being in spin configuration \(\mathbf{s}\) at "temperature" \(\beta^{-1}\):

    \begin{equation*} \begin{split} p( \mathbf{s} \mid \mathbf{J}) &= \frac{1}{Z_p(\mathbf{J})} e^{- \beta E(\mathbf{s}, \mathbf{J})} \\ Z_p(\mathbf{J}) &= \sum_{\left\{ s_i = \pm 1 \right\}}^{} e^{- \beta E(\mathbf{s}, \mathbf{J})} \end{split} \end{equation*}
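    The energy and Boltzmann probability above can be sketched for a small system by brute-force enumeration of all \(2^N\) spin configurations; this is a minimal illustration (the function names and the 3-spin chain are my own choices, not from the paper) and is only feasible for small \(N\):

    ```python
    import itertools
    import numpy as np

    def ising_energy(s, J, h):
        """E(s, J) = -1/2 sum_ij J_ij s_i s_j - sum_i h_i s_i."""
        return -0.5 * s @ J @ s - h @ s

    def boltzmann_probability(s, J, h, beta=1.0):
        """p(s | J) with Z_p(J) computed by summing over all 2^N configurations."""
        n = len(s)
        configs = np.array(list(itertools.product([-1, 1], repeat=n)))
        Z = sum(np.exp(-beta * ising_energy(c, J, h)) for c in configs)
        return np.exp(-beta * ising_energy(s, J, h)) / Z

    # Example: 3 spins on a chain, nearest-neighbour coupling J_ij = 1, no field
    J = np.array([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
    h = np.zeros(3)
    p_aligned = boltzmann_probability(np.array([1, 1, 1]), J, h, beta=0.5)
    ```

    With ferromagnetic couplings \(J_{ij} > 0\), aligned configurations have lower energy and hence higher probability than anti-aligned ones.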

2.2 Variational mean-field theory for the Ising model

3 Energy based models: Maximum Entropy (MaxEnt) Principle, Generative models, and Boltzmann learning

3.1 Maximum entropy models: the simplest energy-based generative models

  • Shannon entropy of a distribution \(p\) is defined as

    \begin{equation*} S_p = - \text{Tr}_{\mathbf{x}} p(\mathbf{x}) \log p(\mathbf{x}) \end{equation*}
  • Suppose we have a set of functions \(\{ f_i(\mathbf{x}) \}\) whose average value we want to fix to some observed values \(\left\langle f_i \right\rangle_{\text{obs}}\)
  • Principle of Maximum Entropy states that we should choose the distribution with largest \(S_p\) subject to constraints that \(\left\langle f_i \right\rangle_{\text{model}} = \left\langle f_i \right\rangle_{\text{obs}}\) and \(\text{Tr}_{\mathbf{x}} p(\mathbf{x}) = 1\); using Lagrange multipliers this becomes

    \begin{equation*} \mathcal{L}[p] = - S_p + \sum_{i}^{} \lambda_i \Bigg( \left\langle f_i \right\rangle_{\text{obs}} - \int d \mathbf{x} \ f_i(\mathbf{x}) p(\mathbf{x}) \Bigg) + \gamma \Bigg( 1 - \int d \mathbf{x} \ p(\mathbf{x}) \Bigg) \end{equation*}


    • second term enforces equality of the model and observed averages
    • third term enforces that \(p\) is normalized, i.e. a probability distribution
  • Taking the functional derivative and setting it to zero,

    \begin{equation*} 0 = \frac{\delta \mathcal{L}}{\delta p} = \big( \log p(\mathbf{x}) + 1 \big) - \sum_{i}^{} \lambda_i f_i(\mathbf{x}) - \gamma \end{equation*}

    which gives us the general form of the maximum entropy distribution:

    \begin{equation*} p(\mathbf{x}) = \frac{1}{Z} e^{\sum_i \lambda_i f_i(\mathbf{x})} \end{equation*}
  • Maximum entropy distribution is just the usual Boltzmann distribution with energy

    \begin{equation*} E(\mathbf{x}) = - \sum_{i}^{} \lambda_i f_i(\mathbf{x}) \end{equation*}
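  • The MaxEnt recipe above can be checked on the simplest case: a single spin \(x \in \{-1, +1\}\) with one constraint function \(f(x) = x\). The maximum entropy form is \(p(x) \propto e^{\lambda x}\), so \(\left\langle x \right\rangle_{\text{model}} = \tanh \lambda\) and the multiplier matching \(\left\langle x \right\rangle_{\text{obs}} = m\) is \(\lambda = \tanh^{-1} m\). A minimal sketch (this worked example and its function name are my own, not from the paper):

    ```python
    import numpy as np

    def maxent_spin(m_obs):
        """MaxEnt distribution for one spin with <x>_model = m_obs constrained.

        Since <x> = tanh(lambda), the Lagrange multiplier can be read off
        analytically as lambda = arctanh(m_obs).
        """
        lam = np.arctanh(m_obs)
        x = np.array([-1.0, 1.0])
        w = np.exp(lam * x)          # unnormalized weights e^{lambda x}
        p = w / w.sum()              # p(x) = e^{lambda x} / Z
        return lam, p

    m_obs = 0.4
    lam, p = maxent_spin(m_obs)
    model_mean = p[1] - p[0]         # <x>_model = p(+1) - p(-1)
    ```

    For constraints without a closed-form multiplier, \(\lambda_i\) would instead be found numerically by solving \(\left\langle f_i \right\rangle_{\text{model}} = \left\langle f_i \right\rangle_{\text{obs}}\), which is the Boltzmann-learning problem discussed in this section.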