Notes on: Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012): Foundations of machine learning

Table of Contents

  • 1 Notation
  • 2 PAC Learning

1 Notation

  • \(\mathcal{X}\) denotes the input space
  • \(\mathcal{Y}\) denotes the output space
  • \(c: \mathcal{X} \to \mathcal{Y}\) denotes a concept
  • \(C\) denotes a class of concepts
  • \(h \in H\) denotes a hypothesis
  • \(D\) denotes the underlying distribution
  • \(S = \big( x_1, \dots, x_m \big)\) denotes a sample of \(m\) points drawn i.i.d. according to \(D\), with labels \(\big( c(x_1), \dots, c(x_m) \big)\) for some concept \(c \in C\)
  • \(n\) denotes the maximal cost of the computational representation of any \(x \in \mathcal{X}\)
  • \(\text{size}(c)\) denotes the maximal cost of the computational representation of \(c \in C\)

2 PAC Learning

A concept class \(C\) is said to be PAC-learnable if there exists an algorithm \(\mathcal{A}\) and a polynomial function \(\text{poly}(\cdot, \cdot, \cdot, \cdot)\) such that for any \(\varepsilon > 0\) and \(\delta > 0\), for all distributions \(D\) on \(\mathcal{X}\), and for any target concept \(c \in C\), the following holds

\begin{equation*} \underset{S \sim D^m}{\Pr} \big[ R \big( h_S \big) \le \varepsilon \big] \ge 1 - \delta \end{equation*}

where \(h_S\) denotes the hypothesis returned by \(\mathcal{A}\) when trained on the labeled sample \(S\) and \(R(h_S) = \Pr_{x \sim D} \big[ h_S(x) \neq c(x) \big]\) is its generalization error, for any sample size \(m\) satisfying

\begin{equation*} m \ge \text{poly} \big( 1 / \varepsilon, 1 / \delta, n, \text{size}(c) \big) \end{equation*}

If \(\mathcal{A}\) further runs in time \(\text{poly} \big( 1 / \varepsilon, 1 / \delta, n, \text{size}(c) \big)\), then \(C\) is said to be efficiently PAC-learnable. When such an algorithm \(\mathcal{A}\) exists, it is called a PAC-learning algorithm for \(C\).
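
The book's first concrete instance of this definition is the class of axis-aligned rectangles in \(\mathbb{R}^2\), learned by returning the tightest rectangle containing the positive points, with sample complexity \(m \ge (4 / \varepsilon) \ln (4 / \delta)\). Below is a minimal Python sketch of that learner; the specific target rectangle, the uniform distribution on \([0,1]^2\), and the Monte Carlo error estimate are illustrative assumptions, not part of the notes.

  import numpy as np

  rng = np.random.default_rng(0)

  # Illustrative target concept: an axis-aligned rectangle in [0, 1]^2,
  # given as (x_min, x_max, y_min, y_max).
  TARGET = (0.2, 0.7, 0.3, 0.9)

  def concept(points, rect=TARGET):
      """c(x): True inside the rectangle, False outside."""
      x_lo, x_hi, y_lo, y_hi = rect
      return ((points[:, 0] >= x_lo) & (points[:, 0] <= x_hi) &
              (points[:, 1] >= y_lo) & (points[:, 1] <= y_hi))

  def tightest_rectangle(points, labels):
      """The learner A: smallest axis-aligned rectangle containing all positives."""
      pos = points[labels]
      if len(pos) == 0:                       # no positive point: empty hypothesis
          return (1.0, 0.0, 1.0, 0.0)
      return (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())

  eps, delta = 0.1, 0.05
  m = int(np.ceil(4 / eps * np.log(4 / delta)))   # book's bound: m >= (4/eps) ln(4/delta)

  failures, trials = 0, 500
  for _ in range(trials):
      S = rng.uniform(size=(m, 2))                # S ~ D^m, with D uniform on [0, 1]^2
      h_S = tightest_rectangle(S, concept(S))
      test = rng.uniform(size=(200_000, 2))       # Monte Carlo estimate of R(h_S)
      R_hS = np.mean(concept(test, h_S) != concept(test))
      failures += R_hS > eps

  print(f"m = {m}, empirical Pr[R(h_S) > eps] = {failures / trials:.3f} (<= delta = {delta})")

Because the tightest fit can only err by false negatives, the empirical failure rate typically comes out well below \(\delta\); the point is only to see the \((\varepsilon, \delta)\) guarantee in action.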

We often call

  • \(1 - \delta\) the confidence
  • \(1 - \varepsilon\) the accuracy

A concept class \(C\) is thus PAC-learnable if the hypothesis returned by the algorithm after observing a number of points polynomial in \(1 / \varepsilon\) and \(1 / \delta\) is approximately correct (with error at most \(\varepsilon\)) with high probability (at least \(1 - \delta\)), which justifies the PAC terminology.
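
As a quick numeric check of what polynomial in \(1 / \varepsilon\) and \(1 / \delta\) buys, instantiating the rectangle bound assumed in the sketch above at 90% accuracy and 95% confidence gives

\begin{equation*} m \ge \frac{4}{\varepsilon} \ln \frac{4}{\delta} = \frac{4}{0.1} \ln \frac{4}{0.05} = 40 \ln 80 \approx 176 \end{equation*}

so on the order of two hundred labeled points already suffice for that guarantee.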