Theory
Overview
This document covers a lot of the more "pure" sides of Machine Learning, e.g. attempts at establishing bounds on convergence.
Most of this is built upon the Probably Approximately Correct (PAC) framework, which attempts to bound the generalization error of models.
Good sources, where I've gotten a lot of this from: Shalev-Shwartz & Ben-David, Understanding Machine Learning: From Theory to Algorithms [shalev2014understanding].
Notation
- $h(x)$, for a hypothesis $h$ and an instance $x$, denotes the result of applying $h$ to $x$
- $L_{\mathcal{D}, f}(h)$ denotes the true risk of $h$, where $\mathcal{D}$ is the distribution, and $f$ the labelling-function
- $\delta$ denotes the probability of getting a nonrepresentative sample, and we say $1 - \delta$ is the confidence parameter of our prediction
- $\epsilon$ denotes the accuracy parameter, dealing with the fact that we cannot guarantee perfect label prediction: $L_{\mathcal{D}, f}(h_S) > \epsilon$ is a failure of the learner, while $L_{\mathcal{D}, f}(h_S) \le \epsilon$ means we view the output of the learner as approximately correct
- $S$ denotes an instance of the training set
- The Empirical Risk Minimization rule is denoted $\mathrm{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\arg\min}\ L_S(h)$, i.e. the set of hypotheses which minimize the empirical risk $L_S(h)$.
Empirical Risk Minimization (ERM)
Standard Empirical Risk Minimization: minimize the empirical loss.
A risk function is the expected loss of classifier $h$ wrt. the probability distribution $\mathcal{D}$ over $Z$, namely
$$L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}} \big[ \ell(h, z) \big]$$
Empirical risk is the expected loss over a given sample $S = (z_1, \dots, z_m)$, namely,
$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i)$$
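To make the ERM rule concrete, here is a minimal sketch in Python, assuming a toy finite class of 1-D threshold classifiers and the 0-1 loss (the class, the data-generating threshold, and all constants are made up for illustration):

```python
import numpy as np

def empirical_risk(h, X, y):
    """Average 0-1 loss L_S(h) of hypothesis h on the sample (X, y)."""
    return np.mean(h(X) != y)

def erm(hypotheses, X, y):
    """Return a hypothesis in `hypotheses` minimizing the empirical risk L_S(h)."""
    risks = [empirical_risk(h, X, y) for h in hypotheses]
    best = int(np.argmin(risks))
    return hypotheses[best], risks[best]

# Toy finite class: threshold classifiers h_t(x) = 1[x >= t] for a grid of thresholds.
thresholds = np.linspace(0.0, 1.0, 101)
hypotheses = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]

# Synthetic realizable data: the true labelling function is a threshold at 0.3.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200)
y = (X >= 0.3).astype(int)

h_S, train_risk = erm(hypotheses, X, y)
print("empirical risk of the ERM hypothesis:", train_risk)
```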
Finite Hypothesis Classes
- A simple restriction, in an attempt to avoid "overfitting", is to bound the size of the hypothesis class; i.e. consider a restricted hypothesis class $\mathcal{H}$ with $|\mathcal{H}| < \infty$, and apply the $\mathrm{ERM}_{\mathcal{H}}$ rule over it.
For now we make the following assumption (the Realizability Assumption):
There exists $h^* \in \mathcal{H}$ such that $L_{\mathcal{D}, f}(h^*) = 0$.
That is, there exists an optimal hypothesis in the restricted hypothesis class $\mathcal{H}$ which obtains the global minimum true risk, i.e. minimum loss over the full data distribution.
Hence, we would like to upper bound the probability of sampling an m-tuple of instances which leads to failure. That is, find an upper bound for
$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big)$$
Then let
$$\mathcal{H}_B = \{ h \in \mathcal{H} : L_{\mathcal{D}, f}(h) > \epsilon \}$$
i.e. all the "bad" hypotheses, whose true risk is greater than $\epsilon$.
Further, let
$$M = \{ S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0 \}$$
i.e. $M$ is the set of all training instances which would "mislead" us into believing one of the wrong hypotheses ($h \in \mathcal{H}_B$) is "good" (i.e. has zero average loss on the training data).
With our Realizability Assumption (i.e. that $L_S(h_S) = 0$), the event $L_{\mathcal{D}, f}(h_S) > \epsilon$ can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$. That is,
$$\{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \subseteq M$$
which means that
$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big) \le \mathcal{D}^m(M)$$
Further, we can rewrite
$$M = \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \}$$
Hence,
$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big) \le \mathcal{D}^m \Big( \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \} \Big)$$
To summarize: the probability of the learner failing in an $\epsilon$ sense is bounded above by the probability of the learner being misled, i.e. of some hypothesis in the "bad" class having zero training error.
And because of how probability measures work, probabilities of finite unions can be bounded by the sum (the union bound), so
$$\mathcal{D}^m \Big( \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \} \Big) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m \big( \{ S|_x : L_S(h) = 0 \} \big)$$
Suppose now that $h \in \mathcal{H}_B$. By definition, $L_S(h) = 0$ if and only if $h(x_i) = f(x_i)$ for all $i = 1, \dots, m$, and due to the i.i.d. sampling of the training set, the joint probability of this occurring is then
$$\mathcal{D}^m \big( \{ S|_x : L_S(h) = 0 \} \big) = \prod_{i=1}^{m} \mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big)$$
Further, for each individual sample in the training set we have
$$\mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big) = 1 - L_{\mathcal{D}, f}(h) \le 1 - \epsilon$$
where the last inequality follows from the fact that $h \in \mathcal{H}_B$, so $L_{\mathcal{D}, f}(h) > \epsilon$.
$\mathcal{D}(\{ x : h(x) = f(x) \})$ simply denotes the probability of this $h$ correctly labelling an $x$ drawn from the true distribution $\mathcal{D}$. Since these equalities are indicator variables, thus binary, this probability corresponds to the expectation
$$\mathbb{E}_{x \sim \mathcal{D}} \big[ \mathbb{1}_{[h(x) = f(x)]} \big] = 1 - \mathbb{E}_{x \sim \mathcal{D}} \big[ \mathbb{1}_{[h(x) \ne f(x)]} \big]$$
and since $L_{\mathcal{D}, f}(h) = \mathbb{E}_{x \sim \mathcal{D}} \big[ \mathbb{1}_{[h(x) \ne f(x)]} \big]$ by definition, we have $1 - L_{\mathcal{D}, f}(h)$, which is the second equality in the above expression.
Finally, observing that for $\epsilon \in [0, 1]$ we have $1 - \epsilon \le e^{-\epsilon}$, we obtain
$$\mathcal{D}^m \big( \{ S|_x : L_S(h) = 0 \} \big) \le (1 - \epsilon)^m \le e^{-\epsilon m}$$
Combining this with the sum over the $h \in \mathcal{H}_B$ we saw earlier, we conclude that
$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big) \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}$$
Let $\mathcal{H}$ be a finite hypothesis class, and let $\epsilon, \delta \in (0, 1)$ and $m \in \mathbb{N}$ be
such that
$$m \ge \frac{\log \big( |\mathcal{H}| / \delta \big)}{\epsilon}$$
Then, for any labeling function $f$, and any distribution $\mathcal{D}$, for which the realizability assumption holds, over the choice of an i.i.d. sample $S$ of size $m$, we have that for every ERM hypothesis, $h_S$,
$$L_{\mathcal{D}, f}(h_S) \le \epsilon \quad \text{with probability at least } 1 - \delta$$
That is, the $\mathrm{ERM}_{\mathcal{H}}$ rule over a finite hypothesis class will be probably (with confidence $1 - \delta$) approximately (up to error $\epsilon$) correct (PAC).
Probably Approximately Correct (PAC)
A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the property:
- for every $\epsilon, \delta \in (0, 1)$
- for every distribution $\mathcal{D}$ over $\mathcal{X}$
- for every labeling function $f : \mathcal{X} \to \{0, 1\}$
if the realizability assumption holds wrt. $\mathcal{H}$, $\mathcal{D}$, and $f$, then when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis $h$ s.t.
$$L_{\mathcal{D}, f}(h) \le \epsilon \quad \text{with probability at least } 1 - \delta$$
We will often refer to $m_{\mathcal{H}}$ as the sample complexity of learning $\mathcal{H}$.
- $\epsilon$ defines how far the output classifier can be from the optimal one
- $1 - \delta$ is how likely the classifier is to meet the accuracy requirement
Every finite hypothesis class is PAC learnable with sample complexity
$$m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{\log \big( |\mathcal{H}| / \delta \big)}{\epsilon} \right\rceil$$
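As a small sanity check of this bound, the following sketch computes a sufficient sample size for a hypothetical finite class (the numbers are arbitrary):

```python
import math

def sample_complexity_realizable(num_hypotheses, epsilon, delta):
    """Sufficient sample size for a finite class under realizability:
    m >= log(|H| / delta) / epsilon (the bound stated above)."""
    return math.ceil(math.log(num_hypotheses / delta) / epsilon)

# e.g. 101 threshold hypotheses, accuracy 0.05, confidence 0.95:
print(sample_complexity_realizable(101, epsilon=0.05, delta=0.05))  # -> 153
```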
Agnostic PAC learning - relaxing the realizability assumption
A hypothesis class $\mathcal{H}$ is agnostic PAC learnable wrt. a set $Z$ and a loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}_{+}$, if there exists a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and an algorithm with the property:
- for every $\epsilon, \delta \in (0, 1)$
- for every distribution $\mathcal{D}$ over $Z$
when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples drawn from $\mathcal{D}$, the algorithm returns $h$ such that
$$L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon \quad \text{with probability at least } 1 - \delta$$
where $L_{\mathcal{D}}(h) = \mathbb{E}_{z \sim \mathcal{D}} \big[ \ell(h, z) \big]$.
Uniform convergence
A training sample $S$ is called $\epsilon$-representative wrt. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$, if
$$\big| L_S(h) - L_{\mathcal{D}}(h) \big| \le \epsilon \quad \text{for all } h \in \mathcal{H}$$
Assume that a training set is $\frac{\epsilon}{2}$-representative.
Then, any output of $\mathrm{ERM}_{\mathcal{H}}(S)$, namely, any $h_S \in \arg\min_{h \in \mathcal{H}} L_S(h)$, satisfies
$$L_{\mathcal{D}}(h_S) \le L_S(h_S) + \frac{\epsilon}{2} \le L_S(h) + \frac{\epsilon}{2} \le L_{\mathcal{D}}(h) + \epsilon \quad \text{for every } h \in \mathcal{H}$$
From Lemma 4.2 we know that an $\mathrm{ERM}_{\mathcal{H}}$ rule is an agnostic PAC-learner if $S$ is $\frac{\epsilon}{2}$-representative with probability at least $1 - \delta$!
This motivates the following definition of Uniform Convergence:
We say that a hypothesis class $\mathcal{H}$ has the uniform convergence property (wrt. domain $Z$ and loss $\ell$) if there exists a function $m_{\mathcal{H}}^{UC} : (0, 1)^2 \to \mathbb{N}$ such that for
- every $\epsilon, \delta \in (0, 1)$
- every probability distribution $\mathcal{D}$ over $Z$
if $S$ is a sample of $m \ge m_{\mathcal{H}}^{UC}(\epsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then $S$ is $\epsilon$-representative, with probability at least $1 - \delta$.
(Hoeffding's Inequality) Let $\theta_1, \dots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$. Then, for any $\epsilon > 0$,
$$\mathbb{P} \left[ \left| \frac{1}{m} \sum_{i=1}^{m} \theta_i - \mu \right| > \epsilon \right] \le 2 \exp \left( - \frac{2 m \epsilon^2}{(b - a)^2} \right)$$
Using Hoeffding's Inequality, one can prove the following:
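A quick Monte Carlo check of Hoeffding's inequality, assuming Bernoulli(0.5) variables (so $a = 0$, $b = 1$); the constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
m, epsilon, mu = 100, 0.1, 0.5
trials = 100_000

# Empirical probability that the sample mean deviates from mu by more than epsilon.
samples = rng.binomial(1, mu, size=(trials, m))
deviations = np.abs(samples.mean(axis=1) - mu)
empirical_prob = np.mean(deviations > epsilon)

# Hoeffding's bound with a = 0, b = 1.
hoeffding_bound = 2 * np.exp(-2 * m * epsilon**2 / (1 - 0)**2)

print(f"empirical P(|mean - mu| > eps) = {empirical_prob:.4f}")
print(f"Hoeffding bound                = {hoeffding_bound:.4f}")
```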
Let $\mathcal{H}$ be a finite hypothesis class, let $Z$ be a domain, and let $\ell : \mathcal{H} \times Z \to [0, 1]$ be a loss function.
Then, $\mathcal{H}$ enjoys the uniform convergence property with sample complexity
$$m_{\mathcal{H}}^{UC}(\epsilon, \delta) \le \left\lceil \frac{\log \big( 2 |\mathcal{H}| / \delta \big)}{2 \epsilon^2} \right\rceil$$
Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity
$$m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon / 2, \delta) \le \left\lceil \frac{2 \log \big( 2 |\mathcal{H}| / \delta \big)}{\epsilon^2} \right\rceil$$
Even though Corollary 4.6 only holds for finite hypothesis classes, in practice it may also provide us with upper bounds for "infinite" hypothesis classes, since parameters are most often represented using floating point numbers, which consist of 64 bits.
That is, if we're computationally approximating some parameter $\theta \in \mathbb{R}$, there are only $2^{64}$ different values $\theta$ can actually take on. If we then have $d$ parameters, the class is effectively finite with $|\mathcal{H}| \le 2^{64 d}$, and we can use Corollary 4.6 to obtain the following bound:
$$m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{2 \log \big( 2 \cdot 2^{64 d} / \delta \big)}{\epsilon^2} \right\rceil = \left\lceil \frac{128\, d \log 2 + 2 \log (2 / \delta)}{\epsilon^2} \right\rceil$$
This is often referred to as the discretization trick.
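A small sketch of the discretization-trick bound, assuming 64-bit parameters and treating the class as finite of size $2^{64 d}$ (the numbers fed in are purely illustrative):

```python
import math

def agnostic_sample_complexity(num_hypotheses, epsilon, delta):
    """Agnostic PAC sample complexity for a finite class (Corollary 4.6):
    m >= 2 * log(2|H| / delta) / epsilon^2 suffices."""
    return math.ceil(2 * math.log(2 * num_hypotheses / delta) / epsilon**2)

def discretization_trick(d, epsilon, delta, bits=64):
    """Bound for a class parametrized by d floating-point parameters,
    treated as a finite class of size 2**(bits * d)."""
    log_H = bits * d * math.log(2)  # log|H| = 64 d log 2, computed in log-space
    return math.ceil(2 * (log_H + math.log(2 / delta)) / epsilon**2)

print(discretization_trick(d=10, epsilon=0.1, delta=0.05))
```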
Bias-Complexity Tradeoff
(No-Free-Lunch) Let $A$ be any learning algorithm for the task of binary classification wrt. the 0-1 loss over a domain $\mathcal{X}$.
Let $m$ be any number smaller than $|\mathcal{X}| / 2$, representing a training set size.
Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:
- There exists a function $f : \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$
- With probability of at least $\frac{1}{7}$ over the choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}} \big( A(S) \big) \ge \frac{1}{8}$
The No-Free-Lunch theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner.
Let $C \subseteq \mathcal{X}$ be such that $|C| = 2m$.
We will prove this using the following intuition:
- Any learning algorithm that observes only half of the instances in $C$ has no information on what should be the labels of the rest of the instances in $C$
- Therefore, there exists a "reality", that is some target function $f$, that would contradict the labels that $A(S)$ predicts on the unobserved instances in $C$
Since $|C| = 2m$ by assumption, there are $T = 2^{2m}$ unique mappings $C \to \{0, 1\}$. Denote these $f_1, \dots, f_T$. For each $f_i$, let $\mathcal{D}_i$ be a distribution over $C \times \{0, 1\}$ defined by
$$\mathcal{D}_i \big( \{ (x, y) \} \big) = \begin{cases} 1 / |C| & \text{if } y = f_i(x) \\ 0 & \text{otherwise} \end{cases}$$
Then we have (see the intermediate lemma below)
$$\max_{i \in [T]} \mathbb{E}_{S \sim \mathcal{D}_i^m} \big[ L_{\mathcal{D}_i} \big( A(S) \big) \big] \ge \frac{1}{4}$$
We can combine Lemma lemma:no-free-lunch-theorem-intermediate-lemma-1 with Lemma lemma:shalev2014understanding-B1 (see the Appendix), from which we observe
$$\mathbb{P}_{S \sim \mathcal{D}^m} \Big[ L_{\mathcal{D}} \big( A(S) \big) \ge \frac{1}{8} \Big] \ge \frac{\mathbb{E}_{S \sim \mathcal{D}^m} \big[ L_{\mathcal{D}} \big( A(S) \big) \big] - \frac{1}{8}}{1 - \frac{1}{8}}$$
where we've used $a = \frac{1}{8}$ and the fact that the 0-1 loss takes values in $[0, 1]$. Hence, taking $\mathcal{D}$ to be the maximizing $\mathcal{D}_i$,
$$\mathbb{P}_{S \sim \mathcal{D}^m} \Big[ L_{\mathcal{D}} \big( A(S) \big) \ge \frac{1}{8} \Big] \ge \frac{\frac{1}{4} - \frac{1}{8}}{\frac{7}{8}} = \frac{1}{7}$$
concluding our proof.
(Intermediate lemma) For every algorithm $A$ that receives a training set of $m$ examples from $C \times \{0, 1\}$, there exists a function $f : C \to \{0, 1\}$ and a distribution $\mathcal{D}$ over $C \times \{0, 1\}$, such that $L_{\mathcal{D}}(f) = 0$ and
$$\mathbb{E}_{S \sim \mathcal{D}^m} \big[ L_{\mathcal{D}} \big( A(S) \big) \big] \ge \frac{1}{4}$$
Let $A$ be some algorithm that receives a training set of $m$ examples from $C \times \{0, 1\}$ and returns a function $A(S) : C \to \{0, 1\}$, i.e.
$$A : \big( C \times \{0, 1\} \big)^m \to \{0, 1\}^C$$
There are $k = (2m)^m$ possible sequences of $m$ samples from $C$. Let $S_1, \dots, S_k$ denote these sequences, and
$$S_j^i = \big( (x_1, f_i(x_1)), \dots, (x_m, f_i(x_m)) \big) \quad \text{for } S_j = (x_1, \dots, x_m)$$
i.e. $S_j^i$ denotes the sequence containing the instances in $S_j$ labeled by $f_i$.
If the distribution is $\mathcal{D}_i$, as defined above, then the possible training sets $A$ can receive are $S_1^i, \dots, S_k^i$, and all these training sets have the same probability of being sampled. Therefore,
$$\mathbb{E}_{S \sim \mathcal{D}_i^m} \big[ L_{\mathcal{D}_i} \big( A(S) \big) \big] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i} \big( A(S_j^i) \big)$$
Using the fact that the maximum is larger than the average, and the average is larger than the minimum, we have
$$\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i} \big( A(S_j^i) \big) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big)$$
The above equation is saying that the maximum (over the possible "realities" $f_i$) of the risk expected over the possible training sequences achieved by $A$ is greater than the minimum (over the training sequences $S_j$) of the risk averaged over the possible $f_i$.
Now, fix some $j \in [k]$, denote $S_j = (x_1, \dots, x_m)$, and let $v_1, \dots, v_p$ be the examples in $C$ that do not appear in $S_j$, i.e.
$$\{ v_1, \dots, v_p \} = C \setminus \{ x_1, \dots, x_m \}$$
Clearly $p \ge m$, since we're assuming $|C| = 2m$ and $|S_j| \le m$, and therefore the "best" case scenario is if we get $m$ distinct samples.
Therefore, for every function $h : C \to \{0, 1\}$ and every $i$ we have
$$L_{\mathcal{D}_i}(h) = \frac{1}{2m} \sum_{x \in C} \mathbb{1}_{[h(x) \ne f_i(x)]} \ge \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \ne f_i(v_r)]} \ge \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \ne f_i(v_r)]}$$
Hence,
$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} = \frac{1}{2p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} \ge \frac{1}{2} \min_{r \in [p]} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]}$$
Next, fix some $r \in [p]$.
We can partition all the functions $f_1, \dots, f_T$ into $T / 2$ disjoint pairs, where for a pair $(f_i, f_{i'})$ we have
$$f_i(c) \ne f_{i'}(c) \iff c = v_r$$
Since such a pair must have $S_j^i = S_j^{i'}$, it follows that
$$\mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} + \mathbb{1}_{[A(S_j^{i'})(v_r) \ne f_{i'}(v_r)]} = 1$$
which yields
$$\frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} = \frac{1}{2}$$
Substituting this back into the previous inequality gives
$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \frac{1}{4}$$
and this into the max-average-min chain gives
$$\max_{i \in [T]} \mathbb{E}_{S \sim \mathcal{D}_i^m} \big[ L_{\mathcal{D}_i} \big( A(S) \big) \big] \ge \frac{1}{4}$$
Hence, there exists a distribution $\mathcal{D}_i$ (with labeling function $f_i$, so $L_{\mathcal{D}_i}(f_i) = 0$) such that
$$\mathbb{E}_{S \sim \mathcal{D}_i^m} \big[ L_{\mathcal{D}_i} \big( A(S) \big) \big] \ge \frac{1}{4}$$
for every algorithm $A$.
We decompose the error of an $\mathrm{ERM}_{\mathcal{H}}$ predictor into two components as follows.
Let $h_S$ be an $\mathrm{ERM}_{\mathcal{H}}$ hypothesis. Then, we can write
$$L_{\mathcal{D}}(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}, \quad \text{where} \quad \epsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h), \quad \epsilon_{\mathrm{est}} = L_{\mathcal{D}}(h_S) - \epsilon_{\mathrm{app}}$$
- $\epsilon_{\mathrm{app}}$ denotes the approximation error, which is the minimum possible risk achievable by a predictor in the given hypothesis class $\mathcal{H}$
- $\epsilon_{\mathrm{est}}$ denotes the estimation error, which is the difference between the best achievable error and the error achieved by $h_S$
The VC-Dimension
Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{0, 1\}$, and let $C = \{ c_1, \dots, c_m \} \subseteq \mathcal{X}$.
The restriction of $\mathcal{H}$ to $C$ is the set of functions $C \to \{0, 1\}$ which can be derived from $\mathcal{H}$. That is,
$$\mathcal{H}_C = \big\{ \big( h(c_1), \dots, h(c_m) \big) : h \in \mathcal{H} \big\}$$
A hypothesis class $\mathcal{H}$ shatters a finite set $C \subset \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of all functions $C \to \{0, 1\}$.
That is, if $|\mathcal{H}_C| = 2^{|C|}$.
Shattering is basically a way of stating:
- The restriction $\mathcal{H}_C$ contains all functions mapping from the subset $C$ to the set of binary labels $\{0, 1\}$
- If the hypothesis class we're trying to learn contains all possible functions from $C$ to $\{0, 1\}$, then we say $C$ is shattered, and this indicates that it will be difficult (if not impossible) to learn: if $|C| = 2m$ and $C$ is shattered by $\mathcal{H}$, i.e. $\mathcal{H}_C$ is the set of all possible mappings $C \to \{0, 1\}$, then by the No-Free-Lunch theorem a sample of size $m$ does not suffice; and if such shattered sets exist for every $m$, then $\mathcal{H}$ is not PAC-learnable, since we cannot satisfy arbitrary bounds simply by adjusting the number of samples.
The VC-dimension of a hypothesis class $\mathcal{H}$, denoted $\mathrm{VCdim}(\mathcal{H})$, is the maximal size of a set $C \subset \mathcal{X}$ that can be shattered by $\mathcal{H}$.
If $\mathcal{H}$ can shatter sets of arbitrarily large size, we say that $\mathcal{H}$ has infinite VC-dimension.
Let $\mathcal{H}$ be a class of infinite VC-dimension. Then, $\mathcal{H}$ is not PAC-learnable.
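A brute-force sketch of the restriction, shattering, and VC-dimension definitions above, using a toy finite class of 1-D threshold classifiers (whose VC-dimension is 1); the search is exponential in the set size and only meant for illustration:

```python
import itertools
import numpy as np

def restriction(hypotheses, C):
    """H_C: the set of distinct label patterns the class induces on the points in C."""
    return {tuple(int(h(c)) for c in C) for h in hypotheses}

def shatters(hypotheses, C):
    """C is shattered iff the restriction contains all 2^|C| labelings."""
    return len(restriction(hypotheses, C)) == 2 ** len(C)

def vc_dimension(hypotheses, domain, max_size=4):
    """Brute-force VC dimension over a finite domain (toy sizes only)."""
    vc = 0
    for k in range(1, max_size + 1):
        if any(shatters(hypotheses, C) for C in itertools.combinations(domain, k)):
            vc = k
    return vc

# Toy class: threshold classifiers h_t(x) = 1[x >= t].
thresholds = np.linspace(0.0, 1.0, 21)
hypotheses = [lambda x, t=t: x >= t for t in thresholds]
domain = list(np.linspace(0.05, 0.95, 10))
print(vc_dimension(hypotheses, domain))  # expected: 1
```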
Generalization
Here we will consider generalization on classification tasks using PAC-learning theory.
Notation
- $S = \{ (x_i, y_i) \}_{i=1}^{m}$: samples of inputs and outputs
- $\mathcal{D}$: distribution on labelled data
- $S \sim \mathcal{D}^m$ denotes $m$ i.i.d. samples of data points $(x_i, y_i)$ from $\mathcal{D}$ (often we just write $S \sim \mathcal{D}$, and have the $m$ be implied)
- $\mathcal{H}$ denotes the set of classifiers / hypotheses $h$ which we consider to explain the data
- $\ell \big( h, (x, y) \big)$ denotes the loss of classifier $h$ on point $(x, y)$
- $L_S(h)$ denotes the average loss of classifier $h$ on the sample-set $S$, i.e. $L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell \big( h, (x_i, y_i) \big)$
- $L_{\mathcal{D}}(h)$ denotes the average loss over the entire distribution $\mathcal{D}$: $L_{\mathcal{D}}(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell \big( h, (x, y) \big) \big]$
Generalization error
Generalization under ERM
The objective is
$$h_S \in \underset{h \in \mathcal{H}}{\arg\min}\ L_S(h)$$
i.e. the $h \in \mathcal{H}$ which best explains the dataset; this is Empirical Risk Minimization, since we're minimizing the empirical loss.
We then define the generalization error as follows:
The generalization error is defined as
$$\Delta_S(h) = L_{\mathcal{D}}(h) - L_S(h)$$
which measures how well hypothesis $h$ generalizes to a distribution $\mathcal{D}$ when its performance on a sample $S$ drawn from $\mathcal{D}$ is known.
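A minimal sketch of estimating this gap empirically, assuming a noisy 1-D threshold problem and approximating $L_{\mathcal{D}}(h_S)$ with a large fresh sample (all constants are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Noisy 1-D threshold problem: true threshold 0.4, 10% label noise."""
    X = rng.uniform(0.0, 1.0, size=n)
    y = ((X >= 0.4) ^ (rng.uniform(size=n) < 0.1)).astype(int)
    return X, y

thresholds = np.linspace(0.0, 1.0, 201)

def empirical_risk(t, X, y):
    return np.mean((X >= t).astype(int) != y)

# ERM on a small training sample S.
X_train, y_train = sample(50)
t_S = thresholds[np.argmin([empirical_risk(t, X_train, y_train) for t in thresholds])]

# Approximate the true risk L_D(h_S) with a large fresh sample.
X_big, y_big = sample(200_000)
L_S = empirical_risk(t_S, X_train, y_train)
L_D = empirical_risk(t_S, X_big, y_big)
print(f"L_S(h_S) = {L_S:.3f}, L_D(h_S) ~ {L_D:.3f}, generalization error ~ {L_D - L_S:.3f}")
```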
Generalization Theory
- Generalization theory attempts to upper bound the true loss $L_{\mathcal{D}}(h)$ using the empirical loss $L_S(h)$ together with an upper bound on the generalization error $\Delta_S(h)$
Rademacher Complexity
- Formalizes a notion of complexity of a hypothesis class $\mathcal{H}$
Let
- $\mathcal{H}$ be the hypothesis class
- $S = \{ z_1, \dots, z_m \}$ be $m$ i.i.d. samples from $\mathcal{D}$
- And then the coefficients $\sigma_1, \dots, \sigma_m$ be i.i.d. Rademacher variables, i.e. $\mathbb{P}(\sigma_i = +1) = \mathbb{P}(\sigma_i = -1) = \frac{1}{2}$
Then the Rademacher Complexity of $\mathcal{H}$ on a distribution $\mathcal{D}$ is defined as
$$\mathcal{R}_m(\mathcal{H}) = \mathbb{E}_{S, \sigma} \left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i\, h(z_i) \right]$$
The $\sigma_i$ simply flip the sign of the classification (we're assuming $h(z) \in \{-1, +1\}$ for all $z$), i.e. change the classification of $z_i$ to the opposite one.
This therefore corresponds to randomly flipping the labels of (in expectation) half the datapoints and retaining the labels of the other half. We then choose the $h$ which correlates best with this random relabeling (if half are classified as positive and half as negative, and for this specific $\sigma$ we just happen to flip all the ones which are positive to negative, then we attain the $\sup$).
I struggle a bit with interpreting the Rademacher Complexity intuitively. Why would this indicate high complexity of the hypothesis class?
- Because we can consider ….
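A rough Monte Carlo estimate of the empirical Rademacher complexity on a fixed sample, comparing a small class (1-D thresholds) against a very rich class (many random labelings of the sample); this is only an illustration of the definition, with arbitrary constants:

```python
import numpy as np

def empirical_rademacher(predictions, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i h(z_i) ].

    `predictions` is a (num_hypotheses, m) array with entries in {-1, +1}:
    row k holds h_k(z_1), ..., h_k(z_m) on the fixed sample."""
    rng = np.random.default_rng(seed)
    _, m = predictions.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1, 1], size=m)          # Rademacher signs
        total += np.max(predictions @ sigma) / m     # sup over hypotheses
    return total / num_draws

rng = np.random.default_rng(1)
m = 50
X = rng.uniform(0, 1, size=m)

# Small class: 1-D thresholds -> correlates poorly with random signs.
small = np.array([np.where(X >= t, 1, -1) for t in np.linspace(0, 1, 20)])
# Rich class: many random labelings of the same sample -> correlates well.
rich = rng.choice([-1, 1], size=(2000, m))

print("thresholds:      ", empirical_rademacher(small))
print("random labelings:", empirical_rademacher(rich))
```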
For a given loss function $\ell$, we have that the generalization error of every hypothesis $h \in \mathcal{H}$, on a sample $S$ of $m$ i.i.d. samples drawn from a distribution $\mathcal{D}$, satisfies
$$L_{\mathcal{D}}(h) - L_S(h) \le 2\, \mathcal{R}_m(\ell \circ \mathcal{H}) + \mathcal{O} \left( \sqrt{\frac{\log (1 / \delta)}{m}} \right)$$
for all $h \in \mathcal{H}$, with probability $1 - \delta$.
Hence, the generalization error $\Delta_S(h)$ of a hypothesis $h \in \mathcal{H}$ can be upper bounded by the Rademacher complexity.
This is more a sketch of the proof.
Suppose for a random sample $S$ the generalization error is high. Then
- Split $S$ into sets $S_1$ and $S_2$ randomly, with the sets being of equal size
- For a given $h$ (picked independently of $S$), consider $L_{S_1}(h)$ and $L_{S_2}(h)$
For large $m$, we have $L_{S_2}(h) \approx L_{\mathcal{D}}(h)$ and thus
$$L_{\mathcal{D}}(h) - L_{S_1}(h) \approx L_{S_2}(h) - L_{S_1}(h)$$
(with $S_2$ being sort of like our "test" set and $S_1$ the "training" set). Thus
$$\sup_{h \in \mathcal{H}} \big( L_{\mathcal{D}}(h) - L_{S_1}(h) \big) \approx \sup_{h \in \mathcal{H}} \big( L_{S_2}(h) - L_{S_1}(h) \big)$$
Since $S_1$ and $S_2$ are picked randomly, we can just consider $S_1$ as the first half of the samples $S = \{ z_1, \dots, z_m \}$, and then the difference reduces to
$$L_{S_2}(h) - L_{S_1}(h) = \frac{2}{m} \left( \sum_{i > m / 2} \ell(h, z_i) - \sum_{i \le m / 2} \ell(h, z_i) \right)$$
which has the same distribution as attaching a random sign to each term. Thus we have
$$\mathbb{E}_S \left[ \sup_{h \in \mathcal{H}} \big( L_{\mathcal{D}}(h) - L_S(h) \big) \right] \lesssim 2\, \mathcal{R}_m(\ell \circ \mathcal{H})$$
where the factor of $2$ is simply due to our definition of Rademacher complexity involving $\frac{1}{m}$ rather than $\frac{2}{m}$.
We also leave out the "concentration term" (I believe this depends on the "variance" of the underlying distribution).
(McDiarmid's Inequality) Let
- $x_1, \dots, x_m$ be a sequence of independent random variables taking on values in a set $A$
- $c_1, \dots, c_m$ be a sequence of positive real constants
If $\varphi : A^m \to \mathbb{R}$ satisfies
$$\sup_{x_1, \dots, x_m, x_i'} \big| \varphi(x_1, \dots, x_i, \dots, x_m) - \varphi(x_1, \dots, x_i', \dots, x_m) \big| \le c_i$$
for $1 \le i \le m$, then
$$\mathbb{P} \Big[ \big| \varphi(x_1, \dots, x_m) - \mathbb{E} \big[ \varphi(x_1, \dots, x_m) \big] \big| \ge \epsilon \Big] \le 2 \exp \left( \frac{-2 \epsilon^2}{\sum_{i=1}^{m} c_i^2} \right)$$
Let
- $\mathcal{D}$ be a fixed distribution over the space $Z$
- $\delta \in (0, 1)$ be a given constant
- $\mathcal{F}$ be a set of bounded functions; wlog. we in particular assume $\mathcal{F} \subseteq \left\{ f : Z \to [0, 1] \right\}$
- $S = \{ z_1, \dots, z_m \}$ be a sequence of i.i.d. samples from $\mathcal{D}$
Then, with probability at least $1 - \delta$ over the draw of $S$, for every function $f \in \mathcal{F}$,
$$\mathbb{E}_{z \sim \mathcal{D}} \big[ f(z) \big] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + 2\, \mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\log (1 / \delta)}{2 m}}$$
We have
$$\mathbb{E}_{z \sim \mathcal{D}} \big[ f(z) \big] = \hat{\mathbb{E}}_S[f] + \Big( \mathbb{E}_{\mathcal{D}}[f] - \hat{\mathbb{E}}_S[f] \Big), \quad \text{where } \hat{\mathbb{E}}_S[f] = \frac{1}{m} \sum_{i=1}^{m} f(z_i)$$
so
$$\mathbb{E}_{\mathcal{D}}[f] \le \hat{\mathbb{E}}_S[f] + \sup_{g \in \mathcal{F}} \Big( \mathbb{E}_{\mathcal{D}}[g] - \hat{\mathbb{E}}_S[g] \Big)$$
Consider letting
$$\varphi(S) = \varphi(z_1, \dots, z_m) = \sup_{g \in \mathcal{F}} \Big( \mathbb{E}_{\mathcal{D}}[g] - \hat{\mathbb{E}}_S[g] \Big)$$
Let $S' = \{ z_1, \dots, z_i', \dots, z_m \}$, with $z_i'$ being identically distributed as $z_i$. Then
$$\big| \varphi(S) - \varphi(S') \big| \le \sup_{g \in \mathcal{F}} \big| \hat{\mathbb{E}}_{S'}[g] - \hat{\mathbb{E}}_S[g] \big| = \sup_{g \in \mathcal{F}} \frac{\big| g(z_i) - g(z_i') \big|}{m}$$
and, finally, since the $g \in \mathcal{F}$ are bounded and we assume $g(z) \in [0, 1]$ for all $z$, we can wlog. assume
$$\big| \varphi(S) - \varphi(S') \big| \le \frac{1}{m}$$
Thus, using (the one-sided version of) McDiarmid's inequality, this implies that
$$\mathbb{P} \Big[ \varphi(S) - \mathbb{E}_S \big[ \varphi(S) \big] \ge \epsilon \Big] \le \exp \big( - 2 m \epsilon^2 \big)$$
One issue: we don't yet know how to compute $\mathbb{E}_S[\varphi(S)]$!
First let's solve for $\epsilon$ such that the above probability is $\delta$:
$$\exp \big( - 2 m \epsilon^2 \big) = \delta \iff \epsilon = \sqrt{\frac{\log (1 / \delta)}{2 m}}$$
And since
$$\mathbb{P} \Big[ \varphi(S) - \mathbb{E}_S \big[ \varphi(S) \big] < \epsilon \Big] \ge 1 - \delta$$
this is what we want. Thus, with probability at least $1 - \delta$, we have deduced that
$$\varphi(S) \le \mathbb{E}_S \big[ \varphi(S) \big] + \sqrt{\frac{\log (1 / \delta)}{2 m}}$$
and thus, substituting into our original bound on $\mathbb{E}_{\mathcal{D}}[f]$,
$$\mathbb{E}_{\mathcal{D}}[f] \le \hat{\mathbb{E}}_S[f] + \mathbb{E}_S \big[ \varphi(S) \big] + \sqrt{\frac{\log (1 / \delta)}{2 m}}$$
with probability at least $1 - \delta$.
We're now going to make use of a classic symmetrization technique. Observe that, due to the independence of the samples, for a "ghost sample" $S' = \{ z_1', \dots, z_m' \}$ drawn i.i.d. from $\mathcal{D}$ independently of $S$,
$$\mathbb{E}_{\mathcal{D}}[g] = \mathbb{E}_{S'} \big[ \hat{\mathbb{E}}_{S'}[g] \big]$$
The reason why we're writing $\mathbb{E}_{S'}$ rather than just $\mathbb{E}_{S}$ is because, if we're being rigorous, we're really taking the expectation over the 2m-dimensional product-space.
Therefore, we can write
$$\mathbb{E}_S \big[ \varphi(S) \big] = \mathbb{E}_S \left[ \sup_{g \in \mathcal{F}} \Big( \mathbb{E}_{S'} \big[ \hat{\mathbb{E}}_{S'}[g] \big] - \hat{\mathbb{E}}_S[g] \Big) \right] = \mathbb{E}_S \left[ \sup_{g \in \mathcal{F}} \mathbb{E}_{S'} \Big[ \hat{\mathbb{E}}_{S'}[g] - \hat{\mathbb{E}}_S[g] \Big] \right]$$
Moreover, we can use the fact that $\sup$ is a convex function, therefore Jensen's inequality tells us that
$$\mathbb{E}_S \big[ \varphi(S) \big] \le \mathbb{E}_{S, S'} \left[ \sup_{g \in \mathcal{F}} \Big( \hat{\mathbb{E}}_{S'}[g] - \hat{\mathbb{E}}_S[g] \Big) \right]$$
Since
$$\hat{\mathbb{E}}_{S'}[g] - \hat{\mathbb{E}}_S[g] = \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big)$$
the above is then
$$\mathbb{E}_S \big[ \varphi(S) \big] \le \mathbb{E}_{S, S'} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right]$$
This is where the famous Rademacher random variables $\sigma_i \in \{ -1, +1 \}$, each with probability $\frac{1}{2}$, come in: note that, since $z_i$ and $z_i'$ are i.i.d., swapping them does not change the joint distribution, so
$$\mathbb{E}_{z_i, z_i'} \big[ u(z_i, z_i') \big] = \mathbb{E}_{z_i, z_i', \sigma_i} \big[ u(z_i, z_i')\, \mathbb{1}_{[\sigma_i = 1]} + u(z_i', z_i)\, \mathbb{1}_{[\sigma_i = -1]} \big]$$
for any integrable function $u$. In particular,
$$\mathbb{E}_{S, S'} \left[ \sup_{g} \frac{1}{m} \sum_{i} \big( g(z_i') - g(z_i) \big) \right] = \mathbb{E}_{S, S', \sigma} \left[ \sup_{g} \frac{1}{m} \sum_{i} \sigma_i \big( g(z_i') - g(z_i) \big) \right] \le 2\, \mathbb{E}_{S, \sigma} \left[ \sup_{g} \frac{1}{m} \sum_{i} \sigma_i\, g(z_i) \right] = 2\, \mathcal{R}_m(\mathcal{F})$$
That is, we have now proven that
$$\mathbb{E}_S \big[ \varphi(S) \big] \le 2\, \mathcal{R}_m(\mathcal{F})$$
Finally, combining it all:
$$\mathbb{E}_{z \sim \mathcal{D}} \big[ f(z) \big] \le \hat{\mathbb{E}}_S[f] + 2\, \mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\log (1 / \delta)}{2 m}} \quad \text{with probability at least } 1 - \delta$$
QED boys.
For each sample $S$, define the empirical Rademacher complexity
$$\hat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i\, f(z_i) \right]$$
so that $\mathcal{R}_m(\mathcal{F}) = \mathbb{E}_S \big[ \hat{\mathcal{R}}_S(\mathcal{F}) \big]$; we then also have, with probability at least $1 - \delta$, for every $f \in \mathcal{F}$,
$$\mathbb{E}_{z \sim \mathcal{D}} \big[ f(z) \big] \le \hat{\mathbb{E}}_S[f] + 2\, \hat{\mathcal{R}}_S(\mathcal{F}) + 3 \sqrt{\frac{\log (2 / \delta)}{2 m}}$$
First notice that $\hat{\mathcal{R}}_S(\mathcal{F})$, viewed as a function of $S$, satisfies the necessary conditions of McDiarmid's inequality (with $c_i = \frac{1}{m}$, since $f \in [0, 1]$), and so we immediately have
$$\mathbb{P} \Big[ \mathcal{R}_m(\mathcal{F}) - \hat{\mathcal{R}}_S(\mathcal{F}) \ge \epsilon \Big] \le \exp \big( - 2 m \epsilon^2 \big)$$
for any $\epsilon > 0$. Again we can solve for the RHS being less than $\frac{\delta}{2}$, giving us
$$\epsilon = \sqrt{\frac{\log (2 / \delta)}{2 m}}$$
If we denote $\delta_1$ the constant obtained for our earlier application of McDiarmid's inequality to bound the difference $\varphi(S) - \mathbb{E}_S[\varphi(S)]$, we then have
$$\mathbb{E}_{z \sim \mathcal{D}} \big[ f(z) \big] \le \hat{\mathbb{E}}_S[f] + 2 \left( \hat{\mathcal{R}}_S(\mathcal{F}) + \sqrt{\frac{\log (2 / \delta)}{2 m}} \right) + \sqrt{\frac{\log (1 / \delta_1)}{2 m}}$$
with probability at least $1 - \delta_1 - \frac{\delta}{2}$. Setting $\delta_1 = \frac{\delta}{2}$, we get
$$\mathbb{E}_{z \sim \mathcal{D}} \big[ f(z) \big] \le \hat{\mathbb{E}}_S[f] + 2\, \hat{\mathcal{R}}_S(\mathcal{F}) + 3 \sqrt{\frac{\log (2 / \delta)}{2 m}}$$
with probability at least $1 - \delta$, as wanted.
PAC-Bayesian generalization bounds
- Now consider a Bayesian approach where we have a prior on the hypothesis class and use the data to arrive at a posterior distribution
Notation
- $P$ is a prior distribution on the hypothesis class $\mathcal{H}$
- $Q$ is a posterior distribution over the hypothesis class $\mathcal{H}$
- $D(Q \| P)$ denotes the KL-divergence of $Q$ and $P$
Stuff
- Prediction for input $x$ is to pick a random hypothesis $h \in \mathcal{H}$ according to $Q$ and predict $h(x)$
This gives rise to the error
$$L_{\mathcal{D}}(Q) = \mathbb{E}_{h \sim Q} \big[ L_{\mathcal{D}}(h) \big], \qquad L_S(Q) = \mathbb{E}_{h \sim Q} \big[ L_S(h) \big]$$
Consider a distribution $\mathcal{D}$ on the data. Let $P$ be a prior distribution over hypothesis class $\mathcal{H}$, and let $\delta \in (0, 1)$.
Then with probability at least $1 - \delta$ over an i.i.d. sample $S$ of size $m$ from $\mathcal{D}$, for all distributions $Q$ over $\mathcal{H}$ (which could possibly depend on $S$), we have that
$$L_{\mathcal{D}}(Q) \le L_S(Q) + \sqrt{\frac{D(Q \| P) + \ln (m / \delta)}{2 (m - 1)}}$$
This is saying that the generalization error $L_{\mathcal{D}}(Q) - L_S(Q)$ is bounded above by the square root of the KL-divergence between the two distributions (plus some terms that arise from the "concentration bounds").
From this theorem, in order to minimize the error on the real distribution $\mathcal{D}$, we should try to simultaneously minimize the empirical error as well as the KL-divergence between the posterior and the prior.
Observe that for a fixed $h$, using a standard Hoeffding's inequality applied to $\Delta(h) = L_{\mathcal{D}}(h) - L_S(h)$, we have that
$$\mathbb{P}_S \big[ \Delta(h) \ge \epsilon \big] \le e^{- 2 m \epsilon^2}$$
which means the probability that the generalization error of a given hypothesis exceeds a given constant is exponentially small. Thus, with very high probability, the generalization error is bounded by a small constant. Roughly, this says that $\Delta(h)$ behaves like a univariate gaussian with variance of order $\frac{1}{m}$. Using concentration bounds, this further implies that
$$\mathbb{E}_S \Big[ e^{2 (m - 1) \Delta(h)^2} \Big] \le m$$
and therefore, with high probability over $S$,
$$\mathbb{E}_{h \sim P} \Big[ e^{2 (m - 1) \Delta(h)^2} \Big] \le \frac{m}{\delta}$$
Then consider the expression (which can be obtained by working backwards from the statement of the claim)
$$2 (m - 1) \Big( \mathbb{E}_{h \sim Q} \big[ \Delta(h) \big] \Big)^2 \le 2 (m - 1)\, \mathbb{E}_{h \sim Q} \big[ \Delta(h)^2 \big]$$
where the inequality is by convexity of squares. This in turn is now
$$2 (m - 1)\, \mathbb{E}_{h \sim Q} \big[ \Delta(h)^2 \big] = \mathbb{E}_{h \sim Q} \Big[ \ln e^{2 (m - 1) \Delta(h)^2} \Big] = D(Q \| P) + \mathbb{E}_{h \sim Q} \left[ \ln \left( \frac{P(h)}{Q(h)} e^{2 (m - 1) \Delta(h)^2} \right) \right] \le D(Q \| P) + \ln \mathbb{E}_{h \sim Q} \left[ \frac{P(h)}{Q(h)} e^{2 (m - 1) \Delta(h)^2} \right]$$
where the last inequality uses Jensen's Inequality along with the concavity of $\ln$. Also, we have
$$\mathbb{E}_{h \sim Q} \left[ \frac{P(h)}{Q(h)} e^{2 (m - 1) \Delta(h)^2} \right] = \mathbb{E}_{h \sim P} \Big[ e^{2 (m - 1) \Delta(h)^2} \Big]$$
where the last step is a standard trick; using the KL-divergence term to switch the distribution over which the expectation is taken (same as used in Importance Sampling).
Hence,
$$2 (m - 1) \Big( \mathbb{E}_{h \sim Q} \big[ \Delta(h) \big] \Big)^2 \le D(Q \| P) + \ln \mathbb{E}_{h \sim P} \Big[ e^{2 (m - 1) \Delta(h)^2} \Big]$$
Now, substituting in what we found earlier,
$$\mathbb{E}_S \Big[ \mathbb{E}_{h \sim P} \big[ e^{2 (m - 1) \Delta(h)^2} \big] \Big] = \mathbb{E}_{h \sim P} \Big[ \mathbb{E}_S \big[ e^{2 (m - 1) \Delta(h)^2} \big] \Big] \le m$$
since $P$ is independent of $S$. This then implies (again, from what we found above, via Markov's inequality) that with probability at least $1 - \delta$ over $S$,
$$\mathbb{E}_{h \sim P} \Big[ e^{2 (m - 1) \Delta(h)^2} \Big] \le \frac{m}{\delta}$$
Combining the two previous expressions,
$$2 (m - 1) \Big( \mathbb{E}_{h \sim Q} \big[ \Delta(h) \big] \Big)^2 \le D(Q \| P) + \ln \frac{m}{\delta} \quad \Longrightarrow \quad L_{\mathcal{D}}(Q) - L_S(Q) \le \sqrt{\frac{D(Q \| P) + \ln (m / \delta)}{2 (m - 1)}}$$
as claimed!
Appendix
Measure Concentration
Let $Z$ be a rv. that takes values in $[0, 1]$. Assume that $\mathbb{E}[Z] = \mu$. Then, for any $a \in (0, 1)$,
$$\mathbb{P}[Z > 1 - a] \ge \frac{\mu - (1 - a)}{a}$$
This also implies that for every $a \in (0, 1)$:
$$\mathbb{P}[Z > a] \ge \frac{\mu - a}{1 - a} \ge \mu - a$$
Bibliography
- [shalev2014understanding] Shalev-Shwartz, S. & Ben-David, S., Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press (2014).