# Notes on: Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012): Foundations of machine learning

## 1 Notation

• $$\mathcal{X}$$ denotes the input space
• $$\mathcal{Y}$$ denotes the output space
• $$c: \mathcal{X} \to \mathcal{Y}$$ denotes a concept
• $$C$$ denotes a class of concepts
• $$h \in H$$ denotes a hypothesis
• $$D$$ denotes the underlying distribution
• $$S = \big( x_1, \dots, x_m \big)$$ are i.i.d. drawn according to $$D$$ with labels $$\big( c(x_1), \dots, c(x_m) \big)$$ for some concept $$c \in C$$
• $$\text{size}(c)$$ denotes the maximal cost of the computational representation of $$c \in C$$
• $$n$$ denotes the maximal cost of the computational representation of any $$x \in \mathcal{X}$$
• $$R(h)$$ denotes the generalization error of $$h$$, i.e. $$R(h) = \underset{x \sim D}{\rm{Pr}} \big[ h(x) \neq c(x) \big]$$
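
As a minimal sketch of this setup in code, assuming a hypothetical threshold concept on $$\mathcal{X} = [0, 1]$$ with $$\mathcal{Y} = \{0, 1\}$$ and $$D$$ uniform (all names and choices here are illustrative, not from the book):

```python
import numpy as np

def c(x: float, threshold: float = 0.5) -> int:
    """A concept c: X -> Y, here a simple threshold function on [0, 1]."""
    return int(x >= threshold)

def draw_sample(m: int, rng: np.random.Generator):
    """Draw S = (x_1, ..., x_m) i.i.d. according to D (uniform on [0, 1])
    together with the labels (c(x_1), ..., c(x_m))."""
    xs = rng.uniform(0.0, 1.0, size=m)
    labels = np.array([c(x) for x in xs])
    return xs, labels

rng = np.random.default_rng(0)
S, y = draw_sample(m=10, rng=rng)
print(S, y)
```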

## 2 PAC Learning

A concept class $$C$$ is said to be PAC-learnable if there exists an algorithm $$\mathcal{A}$$ and a polynomial function $$\text{poly}(\cdot, \cdot, \cdot, \cdot)$$ such that for any $$\varepsilon > 0$$ and $$\delta > 0$$, for all distributions $$D$$ on $$\mathcal{X}$$, and for any target concept $$c \in C$$, the following holds

\begin{equation*} \underset{S \sim D^m}{\rm{Pr}} \big[ R \big( h_S \big) \le \varepsilon \big] \ge 1 - \delta \end{equation*}

for any sample size $$m$$ satisfying

\begin{equation*} m \ge \text{poly} \big( 1 / \varepsilon, 1 / \delta, n, \text{size}(c) \big) \end{equation*}

If $$\mathcal{A}$$ further runs in time $$\text{poly} \big( 1 / \varepsilon, 1 / \delta, n, \text{size}(c) \big)$$, then $$C$$ is said to be efficiently PAC-learnable. When such an algorithm $$\mathcal{A}$$ exists, it is called a PAC-learning algorithm for $$C$$.
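
As a concrete instance, the book shows that axis-aligned rectangles in the plane are PAC-learnable by the algorithm that returns the tightest rectangle containing the positive points, with sample complexity $$m \ge (4 / \varepsilon) \ln (4 / \delta)$$. Below is a minimal empirical sketch of that example; the uniform distribution $$D$$, the particular target rectangle, and the Monte Carlo error estimate are illustrative choices, not part of the book's analysis:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target concept: an axis-aligned rectangle R*; c(x) = 1 iff x lies in R*.
# The specific corners are an illustrative choice.
target = (0.2, 0.7, 0.3, 0.9)  # (x_lo, x_hi, y_lo, y_hi)

def c(points, rect):
    """Evaluate the rectangle concept on an (n, 2) array of points."""
    x_lo, x_hi, y_lo, y_hi = rect
    return ((points[:, 0] >= x_lo) & (points[:, 0] <= x_hi) &
            (points[:, 1] >= y_lo) & (points[:, 1] <= y_hi))

def learn(S, labels):
    """Return the tightest axis-aligned rectangle containing the positive
    points (an empty rectangle, labeling everything 0, if none were drawn)."""
    pos = S[labels]
    if len(pos) == 0:
        return (1.0, 0.0, 1.0, 0.0)
    return (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())

def generalization_error(h, rect, n_test=100_000):
    """Monte Carlo estimate of R(h_S) = Pr_{x ~ D}[h_S(x) != c(x)]."""
    X = rng.uniform(0.0, 1.0, size=(n_test, 2))
    return float(np.mean(c(X, h) != c(X, rect)))

eps, delta = 0.1, 0.05
m = int(np.ceil((4 / eps) * np.log(4 / delta)))  # sample size from the bound
failures = 0
trials = 100
for _ in range(trials):
    S = rng.uniform(0.0, 1.0, size=(m, 2))
    h = learn(S, c(S, target))
    if generalization_error(h, target) > eps:
        failures += 1
print(f"m = {m}, empirical Pr[R(h_S) > eps] = {failures / trials:.2f}"
      f" (bound requires <= delta = {delta})")
```

Since the tightest-fit hypothesis is always contained in the target rectangle, its errors are all false negatives, which is what the book's covering argument exploits to obtain the bound above.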

We often call

• $$1 - \delta$$ the confidence
• $$1 - \varepsilon$$ the accuracy

A concept class $$C$$ is thus PAC-learnable if the hypothesis returned by the algorithm after observing a number of points polynomial in $$1 / \varepsilon$$ and $$1 / \delta$$ is approximately correct (with error at most $$\varepsilon$$) with high probability (at least $$1 - \delta$$), which justifies the PAC terminology.
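
To make the quantifiers concrete, here is a worked instance of the rectangle bound above, with illustrative values $$\varepsilon = 0.1$$ and $$\delta = 0.05$$ (accuracy $$0.9$$, confidence $$0.95$$):

\begin{equation*} m \ge \frac{4}{\varepsilon} \ln \frac{4}{\delta} = 40 \ln 80 \approx 175.3, \end{equation*}

so $$m = 176$$ labeled points suffice for that concept class at this accuracy and confidence.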