# Theory

## Overview

This document covers the more "pure" side of Machine Learning, e.g. attempts at establishing bounds on convergence and generalization.

Most of this is built upon the Probably Approximately Correct (PAC) framework, which attempts to bound the generalization error of models.

Good sources, where I've gotten a lot of this from: Shalev-Shwartz & Ben-David [shalev2014understanding].

## Notation

• $h : \mathcal{X} \to \mathcal{Y}$, for a hypothesis $h$ mapping the domain $\mathcal{X}$ to the label set $\mathcal{Y}$
• $h(x)$ denotes the result of applying $h$ to $x$

• $L_{\mathcal{D}, f}(h) := \mathbb{P}_{x \sim \mathcal{D}} \big[ h(x) \ne f(x) \big]$ denotes the true risk of $h$, where $\mathcal{D}$ is the distribution, and $f$ the labelling-function
• $\delta$ denotes the probability of getting a nonrepresentative sample, and we say $1 - \delta$ is the confidence parameter of our prediction
• $\epsilon$ denotes the accuracy parameter, dealing with the fact that we cannot guarantee perfect label prediction: $L_{\mathcal{D}, f}(h) > \epsilon$ is a failure of the learner, while $L_{\mathcal{D}, f}(h) \le \epsilon$ means we view the output of the learner as approximately correct
• $S = \big( (x_1, y_1), \dots, (x_m, y_m) \big)$ denotes an instance of the training set
• Empirical Risk Minimization rule is denoted:

$$\mathrm{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\arg\min}\ L_S(h)$$

i.e. the set of hypotheses which minimize the empirical risk $L_S(h)$.

**Empirical Risk Minimization (ERM)**: minimize the empirical loss.

A risk function is the expected loss of a classifier $h$ wrt. the probability distribution $\mathcal{D}$ over $Z$, namely

$$L_{\mathcal{D}}(h) := \underset{z \sim \mathcal{D}}{\mathbb{E}} \big[ \ell(h, z) \big]$$

Empirical risk is the expected loss over a given sample $S = (z_1, \dots, z_m)$, namely,

$$L_S(h) := \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i)$$
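As a concrete (if toy) illustration, here is a minimal ERM computation over a finite class of threshold classifiers with the 0-1 loss; all names and constants below are my own choices, not anything from the text:

```python
import random

# A minimal sketch of the ERM rule over a finite hypothesis class:
# threshold classifiers h_t(x) = 1[x >= t] with the 0-1 loss.

def empirical_risk(h, sample):
    """L_S(h): average 0-1 loss of h over the sample S."""
    return sum(h(x) != y for x, y in sample) / len(sample)

# Finite hypothesis class: 11 thresholds on [0, 1].
thresholds = [t / 10 for t in range(11)]
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]

# Realizable data: labels come from the true threshold 0.5, which is in H.
random.seed(0)
sample = [(x, int(x >= 0.5)) for x in (random.random() for _ in range(100))]

# ERM: return (one of) the hypotheses minimising the empirical risk.
erm_risk, erm_index = min(
    (empirical_risk(h, sample), i) for i, h in enumerate(hypotheses)
)
```

Since the labels are generated by a threshold that lies in the class, the setting is realizable and the ERM hypothesis attains zero empirical risk.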

### Finite Hypothesis Classes

• A simple restriction in an attempt to avoid "overfitting" is to bound the size of the hypothesis class; i.e. consider a restricted hypothesis class $\mathcal{H}$ with $|\mathcal{H}| < \infty$

For now we make the following assumption (Realizability):

There exists $h^* \in \mathcal{H}$ such that

$$L_{\mathcal{D}, f}(h^*) = 0$$

That is, there exists an optimal hypothesis in the restricted hypothesis class which obtains the global minimum true risk, i.e. minimum loss over the full data distribution.

Hence, we would like to upper bound the probability of sampling an $m$-tuple of instances which leads to failure. That is, find an upper bound for

$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big)$$

Then let

$$\mathcal{H}_B = \{ h \in \mathcal{H} : L_{\mathcal{D}, f}(h) > \epsilon \}$$

i.e. all the "bad" hypotheses, such that the probability of failure of the learner is greater than $\epsilon$.

Further, let

$$M = \{ S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0 \}$$

i.e. $M$ is the set of all training instances which would "mislead" us into believing one of the wrong hypotheses ($h \in \mathcal{H}_B$) is "good" (i.e. has zero average loss on the training data).

With our Realizability Assumption (i.e. that $L_S(h_S) = 0$), the event

$$L_{\mathcal{D}, f}(h_S) > \epsilon$$

can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$. That is,

$$\{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \subseteq M$$

which means that

$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big) \le \mathcal{D}^m(M)$$

Further, we can rewrite

$$M = \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \}$$

Hence,

$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big) \le \mathcal{D}^m \Big( \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \} \Big)$$

To summarize: the probability of the learner failing in an $\epsilon$-sense is bounded above by the probability of the learner being misled, i.e. some hypothesis in the "bad" class $\mathcal{H}_B$ having minimal (zero) training error.

And because of how probability-measures work, probabilities of finite unions can be bounded by

$$\mathcal{D}^m \Big( \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \} \Big) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m \big( \{ S|_x : L_S(h) = 0 \} \big)$$

Suppose now that $h \in \mathcal{H}_B$. By definition,

$$L_S(h) = 0 \iff h(x_i) = f(x_i) \quad \forall i \in [m]$$

and due to the i.i.d. sampling of the training set, the joint probability of this occurring is then

$$\mathcal{D}^m \big( \{ S|_x : L_S(h) = 0 \} \big) = \prod_{i=1}^{m} \mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big)$$

Further, for each individual sample in the training set we have

$$\mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big) = 1 - L_{\mathcal{D}, f}(h) \le 1 - \epsilon$$

where the last inequality follows from the fact that $h \in \mathcal{H}_B$, so $L_{\mathcal{D}, f}(h) > \epsilon$.

$\mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big)$ simply denotes the probability of $h$ correctly labelling $x_i$ over the entire true distribution $\mathcal{D}$. Since these equalities are indicator variables, thus binary, this corresponds to the expectation

$$\mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big) = \underset{x \sim \mathcal{D}}{\mathbb{E}} \big[ \mathbb{1}_{[h(x) = f(x)]} \big] = 1 - \underset{x \sim \mathcal{D}}{\mathbb{E}} \big[ \mathbb{1}_{[h(x) \ne f(x)]} \big]$$

and since $L_{\mathcal{D}, f}(h) = \mathbb{E}_{x \sim \mathcal{D}} \big[ \mathbb{1}_{[h(x) \ne f(x)]} \big]$ by definition, we have

$$\mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big) = 1 - L_{\mathcal{D}, f}(h)$$

which is the second equality in the above expression.

Finally, observing that for $\epsilon \in [0, 1]$, we have

$$1 - \epsilon \le e^{-\epsilon}$$

we obtain

$$\mathcal{D}^m \big( \{ S|_x : L_S(h) = 0 \} \big) \le (1 - \epsilon)^m \le e^{-\epsilon m}$$

Combining this with the sum over $\mathcal{H}_B$ we saw earlier, we conclude that

$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big) \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}$$

Let

• $\mathcal{H}$ be a finite hypothesis class
• $\delta \in (0, 1)$, $\epsilon > 0$, and $m \in \mathbb{N}$ such that

$$m \ge \frac{\log \big( |\mathcal{H}| / \delta \big)}{\epsilon}$$

Then, for any labeling function $f$, and any distribution $\mathcal{D}$, for which the realizability assumption holds, over the choice of an i.i.d. sample $S$ of size $m$, we have that for every ERM hypothesis, $h_S$,

$$L_{\mathcal{D}, f}(h_S) \le \epsilon \quad \text{with probability at least } 1 - \delta$$

That is, the $\mathrm{ERM}_{\mathcal{H}}$ rule over a finite hypothesis class will be probably (with confidence $1 - \delta$) approximately (up to error $\epsilon$) correct (PAC).

## Probably Approximately Correct (PAC)

A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the property:

• for every $\epsilon, \delta \in (0, 1)$
• for every distribution $\mathcal{D}$ over $\mathcal{X}$
• for every labeling function $f : \mathcal{X} \to \{0, 1\}$

if the realizability assumption holds wrt. $\mathcal{H}$, $\mathcal{D}$, and $f$, then when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis $h$ s.t.

$$L_{\mathcal{D}, f}(h) \le \epsilon \quad \text{with probability at least } 1 - \delta$$

We will often refer to $m_{\mathcal{H}}$ as the sample complexity of learning $\mathcal{H}$.

• $\epsilon$ defines how far the output classifier can be from the optimal one
• $\delta$ is how likely the classifier is to meet the accuracy requirement

Every finite hypothesis class is PAC learnable with sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{\log \big( |\mathcal{H}| / \delta \big)}{\epsilon} \right\rceil$$
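The finite-class sample complexity is easy to evaluate numerically; a small sketch (the function name is ours, natural logarithm assumed):

```python
import math

# Sample size sufficient for a finite class in the realizable case:
#   m_H(eps, delta) <= ceil( log(|H| / delta) / eps )
def pac_sample_complexity(h_size, eps, delta):
    return math.ceil(math.log(h_size / delta) / eps)

# E.g. |H| = 1000 hypotheses, accuracy eps = 0.1, confidence 1 - delta = 0.95:
m = pac_sample_complexity(1000, eps=0.1, delta=0.05)
```

For $|\mathcal{H}| = 1000$, $\epsilon = 0.1$, $\delta = 0.05$ this gives $m = 100$: the required sample size grows only logarithmically in $|\mathcal{H}|$ and $1 / \delta$, but linearly in $1 / \epsilon$.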

### Agnostic PAC learning - relaxing the realizability assumption

A hypothesis class $\mathcal{H}$ is agnostic PAC learnable wrt. a set $Z$ and a loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}_+$, if there exists a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and an algorithm with the property

• for every $\epsilon, \delta \in (0, 1)$
• for every distribution $\mathcal{D}$ over $Z$

when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples drawn from $\mathcal{D}$, the algorithm returns $h \in \mathcal{H}$ such that

$$L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon \quad \text{with probability at least } 1 - \delta$$

where

$$L_{\mathcal{D}}(h) = \underset{z \sim \mathcal{D}}{\mathbb{E}} \big[ \ell(h, z) \big]$$

## Uniform convergence

A training sample $S$ is called $\epsilon$-representative wrt. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$, if

$$\forall h \in \mathcal{H}, \quad \big| L_S(h) - L_{\mathcal{D}}(h) \big| \le \epsilon$$

Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative.

Then, any output of $\mathrm{ERM}_{\mathcal{H}}(S)$, namely, any

$$h_S \in \underset{h \in \mathcal{H}}{\arg\min}\ L_S(h)$$

satisfies

$$L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$$

From Lemma 4.2 we know that the $\mathrm{ERM}_{\mathcal{H}}$ rule is an agnostic PAC-learner if $S$ is $\frac{\epsilon}{2}$-representative with probability at least $1 - \delta$!

This motivates the following definition of Uniform Convergence:

We say that a hypothesis class $\mathcal{H}$ has the uniform convergence property (wrt. domain $Z$ and loss $\ell$) if there exists a function $m_{\mathcal{H}}^{UC} : (0, 1)^2 \to \mathbb{N}$ such that for

• every $\epsilon, \delta \in (0, 1)$
• every probability distribution $\mathcal{D}$ over $Z$

if $S$ is a sample of $m \ge m_{\mathcal{H}}^{UC}(\epsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then $S$ is $\epsilon$-representative, with at least probability $1 - \delta$.

Let $\theta_1, \dots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$,

$$\mathbb{E}[\theta_i] = \mu \quad \text{and} \quad \mathbb{P}[a \le \theta_i \le b] = 1$$

Then, for any $\epsilon > 0$,

$$\mathbb{P} \left[ \left| \frac{1}{m} \sum_{i=1}^{m} \theta_i - \mu \right| > \epsilon \right] \le 2 \exp \left( - \frac{2 m \epsilon^2}{(b - a)^2} \right)$$
Using Hoeffding's Inequality, one can prove the following:

Let $\mathcal{H}$ be a finite hypothesis class, let $Z$ be a domain, and let $\ell : \mathcal{H} \times Z \to [0, 1]$ be a loss function.

Then, $\mathcal{H}$ enjoys the uniform convergence property with sample complexity

$$m_{\mathcal{H}}^{UC}(\epsilon, \delta) \le \left\lceil \frac{\log \big( 2 |\mathcal{H}| / \delta \big)}{2 \epsilon^2} \right\rceil$$

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon / 2, \delta) \le \left\lceil \frac{2 \log \big( 2 |\mathcal{H}| / \delta \big)}{\epsilon^2} \right\rceil$$

Even though Corollary 4.6 only holds for finite hypothesis classes, in practice it might also provide us with upper bounds for "infinite" hypothesis classes, due to parameters most often being represented using floating point representations, which consist of 64 bits.

That is, if we're computationally approximating some $h_\theta$, there are only $2^{64}$ different values $\theta$ can actually take on. If we then have $d$ parameters, so that $|\mathcal{H}| \le 2^{64 d}$, we can use Corollary 4.6 to obtain the following bound:

$$m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{2 \log \big( 2 \cdot 2^{64 d} / \delta \big)}{\epsilon^2} \right\rceil \le \left\lceil \frac{128 d + 2 \log(2 / \delta)}{\epsilon^2} \right\rceil$$

This is often referred to as the discretization trick.
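The discretization trick is straightforward to evaluate; a sketch (function name ours, natural logarithm, using the exact $2 \log(2 |\mathcal{H}| / \delta) / \epsilon^2$ form rather than the looser $128 d$ relaxation):

```python
import math

# Discretization trick: a class parameterised by d 64-bit floats has at
# most |H| = 2^(64 d) members, so the agnostic-PAC bound for finite
# classes, m <= ceil( 2 log(2 |H| / delta) / eps^2 ), applies.
def discretized_sample_complexity(d, eps, delta, bits=64):
    log_h = bits * d * math.log(2)  # log |H| = 64 d log 2
    return math.ceil(2 * (log_h + math.log(2 / delta)) / eps ** 2)

m = discretized_sample_complexity(d=10, eps=0.1, delta=0.05)
```

Note how the bound is dominated by the $d \cdot \text{bits}$ term: confidence $\delta$ only enters logarithmically.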

Let $A$ be any learning algorithm for the task of binary classification wrt. the 0-1 loss over a domain $\mathcal{X}$. Let $m$ be any number smaller than $|\mathcal{X}| / 2$, representing a training set size.

Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:

1. There exists a function $f : \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$
2. With probability of at least $\frac{1}{7}$ over the choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S)) \ge \frac{1}{8}$

The No-Free-Lunch theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner.

Let $C \subseteq \mathcal{X}$ such that $|C| = 2m$.

We will prove this using the following intuition:

• Any learning algorithm that observes only half of the instances in $C$ has no information on what should be the labels of the rest of the instances in $C$
• Therefore, there exists a "reality", that is some target function $f$, that would contradict the labels that $A(S)$ predicts on the unobserved instances in $C$

Since $|C| = 2m$ by assumption, there are $T = 2^{2m}$ unique mappings $C \to \{0, 1\}$. Denote these $f_1, \dots, f_T$. For each $f_i$, let $\mathcal{D}_i$ be a distribution over $C \times \{0, 1\}$ defined by

$$\mathcal{D}_i \big( \{ (x, y) \} \big) = \begin{cases} 1 / |C| & \text{if } y = f_i(x) \\ 0 & \text{otherwise} \end{cases}$$

Then we have

$$\max_{i \in [T]} \underset{S \sim \mathcal{D}_i^m}{\mathbb{E}} \big[ L_{\mathcal{D}_i}(A(S)) \big] \ge \frac{1}{4}$$

We can combine Lemma lemma:no-free-lunch-theorem-intermediate-lemma-1 with Lemma lemma:shalev2014understanding-B1, from which we observe

$$\mathbb{P} \left[ L_{\mathcal{D}}(A(S)) \ge \frac{1}{8} \right] \ge \frac{\mathbb{E} \big[ L_{\mathcal{D}}(A(S)) \big] - 1/8}{1 - 1/8} \ge \frac{1/4 - 1/8}{7/8} = \frac{1}{7}$$

where we've used $0 \le L_{\mathcal{D}}(A(S)) \le 1$. Hence,

$$\mathbb{P} \left[ L_{\mathcal{D}}(A(S)) \ge \frac{1}{8} \right] \ge \frac{1}{7}$$

concluding our proof.

For every algorithm $A$, that receives a training set of $m$ examples from $C \times \{0, 1\}$, there exists a function $f : C \to \{0, 1\}$ and a distribution $\mathcal{D}$ over $C \times \{0, 1\}$, such that $L_{\mathcal{D}}(f) = 0$ and

$$\underset{S \sim \mathcal{D}^m}{\mathbb{E}} \big[ L_{\mathcal{D}}(A(S)) \big] \ge \frac{1}{4}$$

Let $A$ be some algorithm that receives a training set $S$ of $m$ examples from $C \times \{0, 1\}$ and returns a function $A(S)$, i.e.

$$A : \big( C \times \{0, 1\} \big)^m \to \{0, 1\}^C$$

There are $k = (2m)^m$ possible sequences of $m$ samples from $C$. Let

$$S_1, \dots, S_k$$

denote these sequences, and

$$S_j^i = \big( (x_1, f_i(x_1)), \dots, (x_m, f_i(x_m)) \big) \quad \text{for } S_j = (x_1, \dots, x_m)$$

i.e. $S_j^i$ denotes the sequence containing the instances in $S_j$ labeled by $f_i$.

If the distribution is $\mathcal{D}_i$, as defined above, then the possible training sets $A$ can receive are $S_1^i, \dots, S_k^i$, and all these training sets have the same probability of being sampled. Therefore,

$$\underset{S \sim \mathcal{D}_i^m}{\mathbb{E}} \big[ L_{\mathcal{D}_i}(A(S)) \big] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i} \big( A(S_j^i) \big)$$

Using the fact that the maximum is larger than the average, and the average is larger than the minimum, we have

$$\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i} \big( A(S_j^i) \big) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big)$$

The above is saying that the maximum (over labelings $f_i$) of the expected risk achieved by $A$ is bounded below by the minimum (over sample sequences $S_j$) of the risk averaged over all labelings.

Now, fix some $j \in [k]$, denote $S_j = (x_1, \dots, x_m)$ and let $v_1, \dots, v_p$ be the examples in $C$ that do not appear in $S_j$, i.e.

$$\{ v_1, \dots, v_p \} = C \setminus \{ x_1, \dots, x_m \}$$

Clearly $p \ge m$, since we're assuming $|C| = 2m$ and $|S_j| = m$, and $p$ is smallest exactly when the $m$ samples are all distinct.

Therefore, for every function $h : C \to \{0, 1\}$ and every $i \in [T]$ we have

$$L_{\mathcal{D}_i}(h) = \frac{1}{2m} \sum_{x \in C} \mathbb{1}_{[h(x) \ne f_i(x)]} \ge \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \ne f_i(v_r)]} \ge \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \ne f_i(v_r)]}$$

Hence,

$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \frac{1}{2p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} \ge \frac{1}{2} \min_{r \in [p]} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]}$$

Next, fix some $r \in [p]$.

We can partition all the functions in $f_1, \dots, f_T$ into $T / 2$ disjoint pairs, where for a pair $(f_i, f_{i'})$ we have

$$f_i(c) \ne f_{i'}(c) \iff c = v_r$$

Since such a pair must have $S_j^i = S_j^{i'}$, it follows that

$$\mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} + \mathbb{1}_{[A(S_j^{i'})(v_r) \ne f_{i'}(v_r)]} = 1$$

which yields

$$\frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} = \frac{1}{2}$$

Substituting this back into the inequality above gives

$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$

and this into the max-min chain gives

$$\max_{i \in [T]} \underset{S \sim \mathcal{D}_i^m}{\mathbb{E}} \big[ L_{\mathcal{D}_i}(A(S)) \big] \ge \frac{1}{4}$$

Hence, there exists a labeling $f_i$ and distribution $\mathcal{D}_i$ (with $L_{\mathcal{D}_i}(f_i) = 0$) such that

$$\underset{S \sim \mathcal{D}_i^m}{\mathbb{E}} \big[ L_{\mathcal{D}_i}(A(S)) \big] \ge \frac{1}{4}$$

for every algorithm $A$.
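The averaging argument above can be verified exhaustively on a tiny domain; a sketch in which the "learner" memorizes its sample and predicts 0 elsewhere (any other fixed rule gives the same lower bound):

```python
from itertools import product

# Brute-force check of the No-Free-Lunch averaging argument.
# C has 2m points; enumerate all T = 2^(2m) labelings f_i and all
# k = (2m)^m unlabeled sequences S_j, then verify that even the most
# favourable sample leaves an average risk of at least 1/4.
m = 2
C = list(range(2 * m))
labelings = list(product([0, 1], repeat=len(C)))   # all f_i : C -> {0, 1}
sequences = list(product(C, repeat=m))             # all S_j

def learner(sample):
    """Memorise the training labels; predict 0 on unseen points."""
    seen = dict(sample)
    return lambda x: seen.get(x, 0)

def true_risk(h, f):
    """L_{D_i}(h): 0-1 risk under the uniform distribution on C."""
    return sum(h(x) != f[x] for x in C) / len(C)

min_avg_risk = min(
    sum(true_risk(learner([(x, f[x]) for x in seq]), f) for f in labelings)
    / len(labelings)
    for seq in sequences
)
```

The minimum over samples of the labeling-averaged risk comes out to exactly $1/4$, attained when all $m$ training points are distinct; repeated points only make things worse.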

We decompose the error of an $\mathrm{ERM}_{\mathcal{H}}$ predictor into two components as follows.

Let $h_S$ be an $\mathrm{ERM}_{\mathcal{H}}$ hypothesis. Then, we can write

$$L_{\mathcal{D}}(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}, \quad \text{where} \quad \epsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h), \quad \epsilon_{\mathrm{est}} = L_{\mathcal{D}}(h_S) - \epsilon_{\mathrm{app}}$$

where

• $\epsilon_{\mathrm{app}}$ denotes the approximation error, which is the minimum possible risk achievable by a predictor in the given hypothesis class
• $\epsilon_{\mathrm{est}}$ denotes the estimation error, which is the difference between the best achievable error and the error achieved by $h_S$

## The VC-Dimension

Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{0, 1\}$, and let $C = \{c_1, \dots, c_m\} \subseteq \mathcal{X}$.

The restriction of $\mathcal{H}$ to $C$ is the set of functions $C \to \{0, 1\}$ which can be derived from $\mathcal{H}$. That is,

$$\mathcal{H}_C = \big\{ \big( h(c_1), \dots, h(c_m) \big) : h \in \mathcal{H} \big\}$$

A hypothesis class $\mathcal{H}$ shatters a finite set $C \subseteq \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of all functions $C \to \{0, 1\}$.

That is, if $|\mathcal{H}_C| = 2^{|C|}$.

Shattering is basically a way of stating

• $\mathcal{H}_C$ contains all functions mapping from the subset $C$ to the set of binary labels $\{0, 1\}$
• If the hypothesis class we're trying to learn can realize all possible functions from $C$ to $\{0, 1\}$ for arbitrarily large $C$, this indicates that it will be difficult (if not impossible) to learn, due to the fact that if $|C| = 2m$ and $C$ is shattered by $\mathcal{H}$, i.e. $\mathcal{H}_C$ is the set of all possible mappings $C \to \{0, 1\}$, then by the No-Free-Lunch theorem $\mathcal{H}$ is not PAC-learnable from $m$ samples, since we cannot satisfy arbitrary $(\epsilon, \delta)$ bounds simply by adjusting the number of samples.

The VC-dimension of a hypothesis class $\mathcal{H}$, denoted $\mathrm{VCdim}(\mathcal{H})$, is the maximal size of a set $C \subseteq \mathcal{X}$ that can be shattered by $\mathcal{H}$.

If $\mathcal{H}$ can shatter sets of arbitrarily large size, we say that $\mathcal{H}$ has infinite VC-dimension.

Let $\mathcal{H}$ be a class of infinite VC-dimension. Then, $\mathcal{H}$ is not PAC-learnable.
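Shattering can be checked by brute force for small sets; a sketch using threshold classifiers (an illustrative class of our choosing):

```python
# Brute-force shattering check: restrict a hypothesis class H to a
# finite set C and test whether |H_C| = 2^|C|.
def restriction(hypotheses, C):
    """H_C: the set of distinct behaviours of H on C."""
    return {tuple(h(x) for x in C) for h in hypotheses}

def shatters(hypotheses, C):
    return len(restriction(hypotheses, C)) == 2 ** len(C)

# Threshold functions h_t(x) = 1[x >= t] over a grid of thresholds.
thresholds = [t / 10 for t in range(-10, 21)]
H = [lambda x, t=t: int(x >= t) for t in thresholds]

shatters_one_point = shatters(H, [0.5])        # both labelings reachable
shatters_two_points = shatters(H, [0.3, 0.7])  # labeling (1, 0) unreachable
```

A single point can be shattered but no pair of points can, consistent with the threshold class having VC-dimension 1.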

## Generalization

Here we will consider generalization on classification tasks using PAC-learning theory.

### Notation

• $(x_i, y_i)$ samples of inputs and outputs
• $\mathcal{D}$ distribution on labelled data
• $S \sim \mathcal{D}^m$ denotes $m$ i.i.d. samples of data points from $\mathcal{D}$ (often just write $S$, and have the $\mathcal{D}^m$ be implied)
• $\mathcal{H}$ denotes the set of classifiers / hypotheses which we consider to explain the data
• $\ell \big( h, (x, y) \big)$ denotes the loss of classifier $h$ on point $(x, y)$
• $L_S(h)$ denotes the average loss of classifier $h$ on the sample-set $S$, i.e.

$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell \big( h, (x_i, y_i) \big)$$

• $L_{\mathcal{D}}(h)$ denotes the average loss over the entire distribution $\mathcal{D}$:

$$L_{\mathcal{D}}(h) = \underset{(x, y) \sim \mathcal{D}}{\mathbb{E}} \big[ \ell \big( h, (x, y) \big) \big]$$

• Generalization error $L_{\mathcal{D}}(h) - L_S(h)$

### Generalization under ERM

The objective is

$$\min_{h \in \mathcal{H}} L_S(h)$$

i.e. the $h \in \mathcal{H}$ which best explains the dataset; this is Empirical Risk Minimization, since we're minimizing the empirical loss.

We then define the generalization error as follows:

The generalization error is defined

$$\Delta_S(h) := L_{\mathcal{D}}(h) - L_S(h)$$

which measures how well hypothesis $h$ generalizes to a distribution $\mathcal{D}$ when its performance on a sample $S$ drawn from $\mathcal{D}$ is known.

### Generalization Theory

• Formalize the notion of complexity of a hypothesis class

Let

• $\mathcal{H}$ be the hypothesis class
• $S = (x_1, \dots, x_m)$ be i.i.d. samples from $\mathcal{D}$
• And then the coefficients $\sigma = (\sigma_1, \dots, \sigma_m)$ be i.i.d. with $\mathbb{P}[\sigma_i = +1] = \mathbb{P}[\sigma_i = -1] = \frac{1}{2}$

Then the Rademacher Complexity of $\mathcal{H}$ on a distribution $\mathcal{D}$ is defined as

$$\mathcal{R}_m(\mathcal{H}) = \underset{S, \sigma}{\mathbb{E}} \left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \right]$$

The $\sigma_i$ simply flip the sign of the classification (we're assuming $h(x) \in \{-1, +1\}$ for all $h \in \mathcal{H}$), i.e. $\sigma_i = -1$ simply changes the classification of $x_i$ to the opposite one.

This therefore corresponds to flipping the labels of half the datapoints randomly (in expectation) and retaining the labels of the other half. We choose the $h \in \mathcal{H}$ which correlates best with this random relabeling (if half are classified as positive and half as negative, and then for this specific $\sigma$ we just happen to flip all the ones which are positive to negative, then we get our $\sup$).

I struggle a bit with interpreting the Rademacher Complexity intuitively. Why would this indicate high complexity of the hypothesis class?

• Because we can consider ….

For a given loss function $\ell$, we have that the generalization error of all hypotheses $h \in \mathcal{H}$ on a sample $S$ of $m$ i.i.d. samples drawn from a distribution $\mathcal{D}$ satisfies

$$L_{\mathcal{D}}(h) \le L_S(h) + 2 \mathcal{R}_m(\ell \circ \mathcal{H}) + O \left( \sqrt{\frac{\log(1 / \delta)}{m}} \right)$$

for all $h \in \mathcal{H}$, with probability $1 - \delta$.

Hence, the generalization error of a hypothesis can be upper bounded by the Rademacher complexity.
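The (empirical) Rademacher complexity of a simple class can be estimated by Monte Carlo; a sketch with $\pm 1$-valued threshold classifiers and made-up constants:

```python
import random

# Monte Carlo estimate of the empirical Rademacher complexity
#   R_S(H) = E_sigma [ sup_{h in H} (1/m) sum_i sigma_i h(x_i) ]
# for threshold classifiers on a fixed sample.
random.seed(0)
m = 50
xs = [random.random() for _ in range(m)]

thresholds = [t / 20 for t in range(21)]
H = [lambda x, t=t: 2 * int(x >= t) - 1 for t in thresholds]  # h(x) in {-1, +1}

def sup_correlation(sigma):
    """sup over H of the correlation between sigma and the predictions."""
    return max(sum(s * h(x) for s, x in zip(sigma, xs)) / m for h in H)

trials = 500
rademacher_estimate = sum(
    sup_correlation([random.choice([-1, 1]) for _ in range(m)])
    for _ in range(trials)
) / trials
```

For richer classes (more ways to chase the random signs $\sigma$) the supremum is systematically larger, which is exactly why this quantity serves as a complexity measure.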

This is more a sketch of the proof.

Suppose for a random sample $S$ the generalization error is high. Then

1. Split $S$ into sets $S_1$ and $S_2$ randomly, with the sets being of equal size
2. For a given $h$ (picked independently of $S$), consider $L_{S_1}(h)$ and $L_{S_2}(h)$
3. For large $m$, we have $L_{S_2}(h) \approx L_{\mathcal{D}}(h)$ and thus

$$L_{\mathcal{D}}(h) - L_S(h) \approx L_{S_2}(h) - L_{S_1}(h)$$

(with $S_2$ being sort of like our "test" set and $S_1$ the "training" set). Thus

$$\sup_{h \in \mathcal{H}} \big( L_{\mathcal{D}}(h) - L_S(h) \big) \approx \sup_{h \in \mathcal{H}} \big( L_{S_2}(h) - L_{S_1}(h) \big)$$

4. Since $S_1$ and $S_2$ are picked randomly, we can just consider $S_1$ as the first half of the samples and then the difference reduces to

$$\sup_{h \in \mathcal{H}} \frac{2}{m} \sum_{i=1}^{m} \sigma_i \, \ell \big( h, (x_i, y_i) \big), \quad \text{where } \sigma_i = \begin{cases} +1 & \text{if } x_i \in S_2 \\ -1 & \text{if } x_i \in S_1 \end{cases}$$

5. Thus we have

$$\underset{S}{\mathbb{E}} \left[ \sup_{h \in \mathcal{H}} \big( L_{\mathcal{D}}(h) - L_S(h) \big) \right] \lesssim 2 \mathcal{R}_m(\ell \circ \mathcal{H})$$

where the factor of $2$ is simply due to our definition of Rademacher complexity involving $\frac{1}{m}$ rather than $\frac{2}{m}$. We also leave out the "concentration term" (I believe this depends on the "variance" of the underlying distribution).

Let

• $X_1, \dots, X_m$ be a sequence of independent random variables taking on values in a set $A$
• $c_1, \dots, c_m$ be a sequence of positive real constants

If $f : A^m \to \mathbb{R}$ satisfies

$$\sup_{x_1, \dots, x_m, \, x_i'} \big| f(x_1, \dots, x_i, \dots, x_m) - f(x_1, \dots, x_i', \dots, x_m) \big| \le c_i$$

for $1 \le i \le m$, then

$$\mathbb{P} \big[ f(X_1, \dots, X_m) - \mathbb{E} \big[ f(X_1, \dots, X_m) \big] \ge \epsilon \big] \le \exp \left( - \frac{2 \epsilon^2}{\sum_{i=1}^{m} c_i^2} \right)$$

Let

• $\mathcal{D}$ be a fixed distribution over the space $Z$
• $\delta \in (0, 1)$ be a given constant
• $\mathcal{F}$ be a set of bounded functions, wlog. we in particular assume $g(z) \in [0, 1]$ for all $g \in \mathcal{F}$
• $S = (z_1, \dots, z_m)$ be a sequence of i.i.d. samples from $\mathcal{D}$

Then, with probability at least $1 - \delta$ over the draw of $S$, for every function $f \in \mathcal{F}$,

$$\underset{z \sim \mathcal{D}}{\mathbb{E}} \big[ f(z) \big] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + 2 \mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\log(1 / \delta)}{2 m}}$$

We have

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] = \frac{1}{m} \sum_{i=1}^{m} f(z_i) + \left( \underset{z \sim \mathcal{D}}{\mathbb{E}}[f] - \frac{1}{m} \sum_{i=1}^{m} f(z_i) \right)$$

so

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + \sup_{g \in \mathcal{F}} \left( \underset{z \sim \mathcal{D}}{\mathbb{E}}[g] - \frac{1}{m} \sum_{i=1}^{m} g(z_i) \right)$$

Consider letting

$$\Phi(S) := \sup_{g \in \mathcal{F}} \left( \underset{z \sim \mathcal{D}}{\mathbb{E}}[g] - \frac{1}{m} \sum_{i=1}^{m} g(z_i) \right)$$

Let $S' = (z_1, \dots, z_i', \dots, z_m)$ with $z_i'$ being identically distributed as $z_i$. Then

$$\big| \Phi(S) - \Phi(S') \big| \le \sup_{g \in \mathcal{F}} \frac{\big| g(z_i') - g(z_i) \big|}{m}$$

and, finally, since the functions in $\mathcal{F}$ are bounded and we assume $g(z) \in [0, 1]$ for all $g \in \mathcal{F}$, we can wlog. assume

$$\big| g(z_i') - g(z_i) \big| \le 1$$

Thus,

$$\big| \Phi(S) - \Phi(S') \big| \le \frac{1}{m}$$

Using McDiarmid's inequality (with $c_i = \frac{1}{m}$), this implies that

$$\mathbb{P} \Big[ \Phi(S) - \underset{S}{\mathbb{E}} \big[ \Phi(S) \big] \ge \epsilon \Big] \le \exp \big( - 2 m \epsilon^2 \big)$$

One issue: we don't yet know how to compute $\mathbb{E}_S[\Phi(S)]$!

First let's solve for $\epsilon$ such that the above probability is $\delta$:

$$\exp(-2 m \epsilon^2) = \delta \iff \epsilon = \sqrt{\frac{\log(1 / \delta)}{2 m}}$$

And since

$$\mathbb{P} \Big[ \Phi(S) - \underset{S}{\mathbb{E}} \big[ \Phi(S) \big] < \epsilon \Big] \ge 1 - \delta$$

which is what we want. Thus, with probability at least $1 - \delta$, we have deduced that

$$\Phi(S) \le \underset{S}{\mathbb{E}} \big[ \Phi(S) \big] + \sqrt{\frac{\log(1 / \delta)}{2 m}}$$

and thus, substituting into our original bound on $\mathbb{E}_{\mathcal{D}}[f]$,

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + \underset{S}{\mathbb{E}} \big[ \Phi(S) \big] + \sqrt{\frac{\log(1 / \delta)}{2 m}}$$

with probability at least $1 - \delta$.

We're now going to make use of a classic symmetrization technique. Observe that due to the independence of the samples

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[g] = \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \left[ \frac{1}{m} \sum_{i=1}^{m} g(z_i') \right]$$

and

$$\frac{1}{m} \sum_{i=1}^{m} g(z_i) = \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \left[ \frac{1}{m} \sum_{i=1}^{m} g(z_i) \right]$$

The reason why we're writing $\mathbb{E}_{S' \sim \mathcal{D}^m}$ rather than just $\mathbb{E}_{S'}$ is because, if we're being rigorous, we're really taking the expectation over the $2m$-dimensional product-space.

Therefore, we can write

$$\underset{S}{\mathbb{E}} \big[ \Phi(S) \big] = \underset{S}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \underset{S'}{\mathbb{E}} \left[ \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right] \right]$$

Moreover, we can use the fact that $\sup$ is a convex function, therefore Jensen's inequality tells us that

$$\underset{S}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \underset{S'}{\mathbb{E}} \left[ \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right] \right] \le \underset{S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right]$$

Since $(z_i, z_i')$ and $(z_i', z_i)$ are identically distributed, swapping any subset of the pairs leaves the expectation unchanged; the above is then, for any fixed signs $s_i \in \{-1, +1\}$,

$$\underset{S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right] = \underset{S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} s_i \big( g(z_i') - g(z_i) \big) \right]$$

This is where the famous Rademacher random variables $\sigma_i \in \{-1, +1\}$, each with probability $\frac{1}{2}$, come in: note that

$$\underset{S, S'}{\mathbb{E}} \big[ u(S, S') \big] = \underset{\sigma, S, S'}{\mathbb{E}} \big[ u_\sigma(S, S') \big]$$

for any integrable function $u$, where $u_\sigma$ denotes $u$ with $z_i$ and $z_i'$ swapped whenever $\sigma_i = -1$. In particular,

$$\underset{S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right] = \underset{\sigma, S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \big( g(z_i') - g(z_i) \big) \right] \le 2 \underset{\sigma, S}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \, g(z_i) \right]$$

That is, we have now proven that

$$\underset{S}{\mathbb{E}} \big[ \Phi(S) \big] \le 2 \mathcal{R}_m(\mathcal{F})$$

Finally, combining it all:

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + 2 \mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\log(1 / \delta)}{2 m}} \quad \text{with probability at least } 1 - \delta$$

QED boys.

For each $S$, we have the empirical Rademacher complexity

$$\hat{\mathcal{R}}_S(\mathcal{F}) = \underset{\sigma}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \, g(z_i) \right]$$

First notice that $\hat{\mathcal{R}}_S(\mathcal{F})$, viewed as a function of $S$, satisfies the necessary conditions of McDiarmid's inequality (changing a single $z_i$ changes it by at most $\frac{1}{m}$), and so we immediately have

$$\mathbb{P} \big[ \mathcal{R}_m(\mathcal{F}) - \hat{\mathcal{R}}_S(\mathcal{F}) \ge \epsilon \big] \le \exp \big( - 2 m \epsilon^2 \big)$$

for any $\epsilon > 0$. Again we can solve for the RHS being less than $\delta'$, giving us

$$\mathcal{R}_m(\mathcal{F}) \le \hat{\mathcal{R}}_S(\mathcal{F}) + \sqrt{\frac{\log(1 / \delta')}{2 m}} \quad \text{with probability at least } 1 - \delta'$$

If we denote by $\delta'$ the constant obtained for our earlier application of McDiarmid's inequality to bound the difference $\Phi(S) - \mathbb{E}_S[\Phi(S)]$, we then have

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + 2 \hat{\mathcal{R}}_S(\mathcal{F}) + 3 \sqrt{\frac{\log(1 / \delta')}{2 m}}$$

with probability at least $1 - 2 \delta'$. Setting $\delta' = \delta / 2$, we get

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + 2 \hat{\mathcal{R}}_S(\mathcal{F}) + 3 \sqrt{\frac{\log(2 / \delta)}{2 m}}$$

with probability at least $1 - \delta$, as wanted.

#### PAC-Bayesian generalization bounds

• Now consider a Bayesian approach where we have a prior $P$ on the hypothesis class and use the data to arrive at a posterior distribution $Q$
##### Notation
• $P$ is a prior distribution on the hypothesis class $\mathcal{H}$
• $Q$ is the posterior distribution over the hypothesis class $\mathcal{H}$
• $\mathrm{KL}(Q \| P)$ denotes the KL-divergence of $Q$ and $P$
##### Stuff
• Prediction for input $x$ is to pick a random hypothesis $h \sim Q$ and predict $h(x)$
• Gives rise to the error

$$L_{\mathcal{D}}(Q) := \underset{h \sim Q}{\mathbb{E}} \big[ L_{\mathcal{D}}(h) \big]$$

Consider a distribution $\mathcal{D}$ on the data. Let

• $P$ be a prior distribution over hypothesis class $\mathcal{H}$

Then with probability $\ge 1 - \delta$, on an i.i.d. sample $S$ of size $m$ from $\mathcal{D}$, for all distributions $Q$ over $\mathcal{H}$ (which could possibly depend on $S$), we have that

$$L_{\mathcal{D}}(Q) \le L_S(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(m / \delta)}{2 (m - 1)}}$$
This is saying that the generalization error is bounded above by the square root of the KL-divergence of the distributions (plus some terms that arise from the "concentration bounds").

From this theorem, in order to minimize the error on the real distribution , we should try to simultaneously minimize the empirical error as well as the KL-divergence between the posterior and the prior.
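The gap term of the bound is easy to compute for discrete priors and posteriors; a numeric sketch with made-up distributions over a toy four-element hypothesis class:

```python
import math

# Numeric sketch of the PAC-Bayes gap term
#   sqrt( (KL(Q || P) + log(m / delta)) / (2 (m - 1)) )
# for discrete prior P and posterior Q; distributions are illustrative.
def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def pac_bayes_gap(q, p, m, delta):
    return math.sqrt(
        (kl_divergence(q, p) + math.log(m / delta)) / (2 * (m - 1))
    )

prior = [0.25, 0.25, 0.25, 0.25]   # uniform prior over 4 hypotheses
posterior = [0.7, 0.1, 0.1, 0.1]   # posterior concentrated after seeing data
gap = pac_bayes_gap(posterior, prior, m=1000, delta=0.05)
```

The gap grows with $\mathrm{KL}(Q \| P)$ (posteriors far from the prior pay a price) and shrinks as $m$ grows, matching the trade-off described above.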

Observe that for fixed $h$, using a standard Hoeffding inequality, we have that

$$\underset{S \sim \mathcal{D}^m}{\mathbb{P}} \big[ \Delta(h) \ge \epsilon \big] \le e^{-2 m \epsilon^2}, \quad \text{where } \Delta(h) := L_{\mathcal{D}}(h) - L_S(h)$$

which means the probability that the generalization error of a given hypothesis exceeds a given constant is exponentially small. Thus, with very high probability, the generalization error is bounded by a small constant. Roughly, this says that $\Delta(h)$ behaves like a univariate gaussian. Using concentration bounds, this further implies that

$$\underset{S}{\mathbb{E}} \big[ e^{2 (m - 1) \Delta(h)^2} \big] \le m$$

and therefore, with high probability (at least $1 - \delta$, by Markov's inequality) over $S$,

$$\underset{h \sim P}{\mathbb{E}} \big[ e^{2 (m - 1) \Delta(h)^2} \big] \le \frac{m}{\delta}$$

Then consider the expression (which can be obtained by working backwards from the statement of the claim)

$$2 (m - 1) \Big( \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h) \big] \Big)^2 \le 2 (m - 1)\, \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h)^2 \big]$$

where the inequality is by convexity of squares. This in turn is now

$$2 (m - 1)\, \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h)^2 \big] = \underset{h \sim Q}{\mathbb{E}} \big[ 2 (m - 1) \Delta(h)^2 \big]$$

where the expectation on the right will shortly be bounded via Jensen's Inequality along with the concavity of $\log$. Also, we have, writing $g(h) := 2 (m - 1) \Delta(h)^2$,

$$\underset{h \sim Q}{\mathbb{E}} \big[ g(h) \big] = \underset{h \sim Q}{\mathbb{E}} \left[ \log \left( \frac{Q(h)}{P(h)} \cdot \frac{P(h)}{Q(h)}\, e^{g(h)} \right) \right] = \mathrm{KL}(Q \| P) + \underset{h \sim Q}{\mathbb{E}} \left[ \log \left( \frac{P(h)}{Q(h)}\, e^{g(h)} \right) \right] \le \mathrm{KL}(Q \| P) + \log \underset{h \sim P}{\mathbb{E}} \big[ e^{g(h)} \big]$$

where the last step is a standard trick; using the KL-divergence term to switch the distribution over which the expectation is taken (same as used in Importance Sampling), followed by Jensen's inequality.

Hence,

$$2 (m - 1) \Big( \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h) \big] \Big)^2 \le \mathrm{KL}(Q \| P) + \log \underset{h \sim P}{\mathbb{E}} \big[ e^{2 (m - 1) \Delta(h)^2} \big]$$

Now, substituting in what we found earlier,

$$\underset{h \sim P}{\mathbb{E}} \big[ e^{2 (m - 1) \Delta(h)^2} \big] \le \frac{m}{\delta} \quad \text{with probability at least } 1 - \delta$$

since $P$ is independent of $Q$ (and of $S$). This then implies (again, from what we found above)

$$2 (m - 1) \Big( \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h) \big] \Big)^2 \le \mathrm{KL}(Q \| P) + \log \frac{m}{\delta}$$

Combining the two previous expressions and solving for $\mathbb{E}_{h \sim Q}[\Delta(h)] = L_{\mathcal{D}}(Q) - L_S(Q)$,

$$L_{\mathcal{D}}(Q) - L_S(Q) \le \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(m / \delta)}{2 (m - 1)}}$$

as claimed!

## Appendix

### Measure Concentration

Let $Z$ be a rv. that takes values in $[a, b]$. Assume that $\mathbb{E}[Z] = 0$. Then, for any $\lambda > 0$,

$$\mathbb{E} \big[ e^{\lambda Z} \big] \le \exp \left( \frac{\lambda^2 (b - a)^2}{8} \right)$$

This also implies (via a Chernoff argument, optimizing over $\lambda$) that for every $\epsilon > 0$:

$$\mathbb{P} \big[ Z \ge \epsilon \big] \le \exp \left( - \frac{2 \epsilon^2}{(b - a)^2} \right)$$
# Bibliography

• [shalev2014understanding] Shalev-Shwartz & Ben-David, Understanding machine learning: From theory to algorithms, Cambridge university press (2014).