Theory
Overview
This document covers a lot of the more "pure" sides of Machine Learning, e.g. attempts at establishing bounds on convergence.
Most of this is built upon the Probably Approximately Correct (PAC) framework, which attempts to bound the generalization error of models.
Good sources, where I've gotten a lot of this from: see the bibliography at the end, in particular shalev2014understanding.
Notation
- $h(x)$, for a hypothesis $h$ and an instance $x$, denotes the result of applying $h$ to $x$
- $L_{(\mathcal{D}, f)}(h)$ denotes the true risk of $h$, where $\mathcal{D}$ is the distribution, and $f$ the labelling-function
- $\delta$ denotes the probability of getting a nonrepresentative sample, and we say $1 - \delta$ is the confidence parameter of our prediction
- $\epsilon$ denotes the accuracy parameter, dealing with the fact that we cannot guarantee perfect label prediction: $L_{(\mathcal{D}, f)}(h) > \epsilon$ is a failure of the learner, while $L_{(\mathcal{D}, f)}(h) \le \epsilon$ means we view the output of the learner as approximately correct
- $S$ denotes an instance of the training set
The Empirical Risk Minimization rule is denoted:

$$\mathrm{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, L_S(h)$$

i.e. the set of hypotheses which minimize the empirical risk $L_S(h)$.

Empirical Risk Minimization (ERM): minimize the empirical loss $L_S(h)$ over $h \in \mathcal{H}$.

A risk function is the expected loss of a classifier $h$ wrt. the probability distribution $\mathcal{D}$ over $\mathcal{X}$, namely

$$L_{(\mathcal{D}, f)}(h) = \underset{x \sim \mathcal{D}}{\mathbb{P}}\big[ h(x) \ne f(x) \big]$$

Empirical risk is the average loss over a given sample $S = (x_1, \ldots, x_m)$, namely,

$$L_S(h) = \frac{\big| \{ i \in [m] : h(x_i) \ne f(x_i) \} \big|}{m}$$
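To make the ERM rule concrete, here is a minimal Python sketch (my own illustration, not from the source): it picks, from a finite class of threshold classifiers (an assumed toy class), the hypothesis with minimal empirical 0-1 loss on a synthetic realizable sample.

```python
import numpy as np

def empirical_risk(h, X, y):
    """Average 0-1 loss of hypothesis h on the sample (X, y)."""
    return np.mean(h(X) != y)

def erm(hypothesis_class, X, y):
    """Return a hypothesis in the class minimizing the empirical risk L_S(h)."""
    return min(hypothesis_class, key=lambda h: empirical_risk(h, X, y))

# Illustrative finite class: threshold classifiers h_t(x) = 1[x >= t]
thresholds = np.linspace(0.0, 1.0, 101)
hypothesis_class = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]

# Synthetic realizable data: the true labelling function is a threshold at 0.3
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=50)
y = (X >= 0.3).astype(int)

h_S = erm(hypothesis_class, X, y)
print("empirical risk of ERM hypothesis:", empirical_risk(h_S, X, y))
```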
Finite Hypothesis Classes
- A simple restriction, in an attempt to avoid "overfitting", is to bound the size of the hypothesis class; i.e. consider a restricted, finite hypothesis class $\mathcal{H}$
For now we make the following assumption:
There exists $h^* \in \mathcal{H}$ such that $L_{(\mathcal{D}, f)}(h^*) = 0$.
That is, there exists an optimal hypothesis in the restricted hypothesis class which obtains the global minimum true risk, i.e. minimum loss over the full data distribution.
Hence, we would like to upper bound the probability of sampling an m-tuple of instances which leads to failure. That is, find an upper bound for

$$\mathcal{D}^m\big( \{ S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon \} \big)$$

Then let

$$\mathcal{H}_B = \{ h \in \mathcal{H} : L_{(\mathcal{D}, f)}(h) > \epsilon \}$$

i.e. all the "bad" hypotheses, such that the probability of failure of the learner is greater than $\epsilon$.

Further, let

$$M = \{ S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0 \}$$

i.e. $M$ is the set of all training instances which would "mislead" us into believing one of the wrong hypotheses ($h \in \mathcal{H}_B$) is "good" (i.e. has zero average loss on the training data).
With our Realizability Assumption (i.e. that $L_S(h_S) = 0$), the event $L_{(\mathcal{D}, f)}(h_S) > \epsilon$ can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$. That is,

$$\{ S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon \} \subseteq M$$

which means that

$$\mathcal{D}^m\big( \{ S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon \} \big) \le \mathcal{D}^m(M)$$

Further, we can rewrite

$$M = \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \}$$

Hence,

$$\mathcal{D}^m\big( \{ S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon \} \big) \le \mathcal{D}^m\Big( \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \} \Big)$$
To summarize: the probability of the learner failing in an $\epsilon$ sense is bounded above by the probability of sampling a misleading training set, i.e. one on which some hypothesis in the "bad" class $\mathcal{H}_B$ attains zero training error.
And because of how probability-measures work (the union bound), probabilities of finite unions can be bounded by

$$\mathcal{D}^m\Big( \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \} \Big) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m\big( \{ S|_x : L_S(h) = 0 \} \big)$$
Suppose now that $L_S(h) = 0$ for some fixed $h \in \mathcal{H}_B$. By definition, this means $h(x_i) = f(x_i)$ for every instance $x_i$ in the training set,
and due to the i.i.d. sampling of the training set, the joint probability of this occurring is then

$$\mathcal{D}^m\big( \{ S|_x : L_S(h) = 0 \} \big) = \prod_{i=1}^{m} \mathcal{D}\big( \{ x_i : h(x_i) = f(x_i) \} \big)$$

Further, for each individual sample in the training set we have

$$\mathcal{D}\big( \{ x_i : h(x_i) = f(x_i) \} \big) = 1 - L_{(\mathcal{D}, f)}(h) \le 1 - \epsilon$$

where the last inequality follows from the fact that $h \in \mathcal{H}_B$.
$\mathcal{D}(\{ x_i : h(x_i) = f(x_i) \})$ simply denotes the probability of $h$ correctly labelling $x_i$ over the entire true distribution $\mathcal{D}$. Since the events $h(x_i) = f(x_i)$ are indicator variables, thus binary, this probability corresponds to the expectation

$$\underset{x \sim \mathcal{D}}{\mathbb{E}}\big[ \mathbb{1}[h(x) = f(x)] \big] = 1 - L_{(\mathcal{D}, f)}(h)$$

and since $L_{(\mathcal{D}, f)}(h) > \epsilon$ by definition of $\mathcal{H}_B$, we have $1 - L_{(\mathcal{D}, f)}(h) \le 1 - \epsilon$,
which is the second inequality in the above expression.
Finally, observing that $1 - \epsilon \le e^{-\epsilon}$ for $\epsilon \ge 0$, we have

$$\mathcal{D}\big( \{ x_i : h(x_i) = f(x_i) \} \big) \le e^{-\epsilon}$$

and we obtain

$$\mathcal{D}^m\big( \{ S|_x : L_S(h) = 0 \} \big) \le (1 - \epsilon)^m \le e^{-\epsilon m}$$

Combining this with the sum over the bad hypotheses $\mathcal{H}_B$ we saw earlier, we conclude that

$$\mathcal{D}^m\big( \{ S|_x : L_{(\mathcal{D}, f)}(h_S) > \epsilon \} \big) \le |\mathcal{H}_B| \, e^{-\epsilon m} \le |\mathcal{H}| \, e^{-\epsilon m}$$
Let
- $\mathcal{H}$ be a finite hypothesis class
and let $\delta \in (0, 1)$, $\epsilon > 0$ and $m \in \mathbb{N}$ be such that

$$m \ge \frac{\log\big( |\mathcal{H}| / \delta \big)}{\epsilon}$$

Then, for any labeling function $f$, and any distribution $\mathcal{D}$, for which the realizability assumption holds, with probability at least $1 - \delta$ over the choice of an i.i.d. sample $S$ of size $m$, we have that for every ERM hypothesis, $h_S$,

$$L_{(\mathcal{D}, f)}(h_S) \le \epsilon$$

That is, the $\mathrm{ERM}_{\mathcal{H}}$ rule over a finite hypothesis class will be probably (with confidence $1 - \delta$) approximately (up to error $\epsilon$) correct (PAC).
Probably Approximately Correct (PAC)
A hypothesis class $\mathcal{H}$ is PAC learnable if there exists a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the property:
- for every $\epsilon, \delta \in (0, 1)$
- for every distribution $\mathcal{D}$ over $\mathcal{X}$
- for every labeling function $f : \mathcal{X} \to \{0, 1\}$
if the realizability assumption holds wrt. $\mathcal{H}$, $\mathcal{D}$, and $f$, then when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis $h$ s.t., with probability at least $1 - \delta$, $L_{(\mathcal{D}, f)}(h) \le \epsilon$.
We will often refer to $m_{\mathcal{H}}$ as the sample complexity of learning $\mathcal{H}$.
- $\epsilon$ defines how far the output classifier can be from the optimal one
- $1 - \delta$ is how likely the classifier is to meet the accuracy requirement
Every finite hypothesis class is PAC learnable with sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{\log\big( |\mathcal{H}| / \delta \big)}{\epsilon} \right\rceil$$
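As a quick numeric illustration of this bound (a sketch; the particular values of $|\mathcal{H}|$, $\epsilon$, $\delta$ are arbitrary assumptions):

```python
import math

def sample_complexity_realizable(h_size, epsilon, delta):
    """Upper bound on the sample complexity of a finite class in the realizable case:
    m >= log(|H| / delta) / epsilon."""
    return math.ceil(math.log(h_size / delta) / epsilon)

# e.g. |H| = 10^6 hypotheses, accuracy 0.01, confidence 0.99
print(sample_complexity_realizable(10**6, epsilon=0.01, delta=0.01))  # 1843 examples
```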
Agnostic PAC learning - relaxing the realizability assumption
A hypothesis class $\mathcal{H}$ is agnostic PAC learnable wrt. a set $Z$ and a loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}_{+}$, if there exists a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and an algorithm with the property
- for every $\epsilon, \delta \in (0, 1)$
- for every distribution $\mathcal{D}$ over $Z$
when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples drawn from $\mathcal{D}$, the algorithm returns $h \in \mathcal{H}$ such that, with probability at least $1 - \delta$,

$$L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$$

where

$$L_{\mathcal{D}}(h) = \underset{z \sim \mathcal{D}}{\mathbb{E}}\big[ \ell(h, z) \big]$$
Uniform convergence
A training sample $S$ is called $\epsilon$-representative wrt. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$, if

$$\forall h \in \mathcal{H}, \quad \big| L_S(h) - L_{\mathcal{D}}(h) \big| \le \epsilon$$

Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative.
Then, any output of $\mathrm{ERM}_{\mathcal{H}}(S)$, namely, any

$$h_S \in \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, L_S(h)$$

satisfies

$$L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$$

From Lemma 4.2 we know that the $\mathrm{ERM}_{\mathcal{H}}$ rule is an agnostic PAC-learner if $S$ is $\frac{\epsilon}{2}$-representative with probability at least $1 - \delta$!
This motivates the following definition of Uniform Convergence:
We say that a hypothesis class $\mathcal{H}$ has the uniform convergence property (wrt. domain $Z$ and loss $\ell$) if there exists a function $m_{\mathcal{H}}^{UC} : (0, 1)^2 \to \mathbb{N}$ such that for
- every $\epsilon, \delta \in (0, 1)$
- every probability distribution $\mathcal{D}$ over $Z$
if $S$ is a sample of $m \ge m_{\mathcal{H}}^{UC}(\epsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then $S$ is $\epsilon$-representative with probability at least $1 - \delta$.
Let $\theta_1, \ldots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$.
Then, for any $\epsilon > 0$,

$$\mathbb{P}\left[ \left| \frac{1}{m} \sum_{i=1}^{m} \theta_i - \mu \right| > \epsilon \right] \le 2 \exp\left( \frac{-2 m \epsilon^2}{(b - a)^2} \right)$$
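A quick Monte Carlo sanity check of Hoeffding's inequality (a sketch; the Bernoulli variables, mean, and deviation are assumed values, and here $b - a = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
m, mu, eps, trials = 100, 0.3, 0.1, 20_000

# Empirical probability that the sample mean deviates from mu by more than eps
means = rng.binomial(1, mu, size=(trials, m)).mean(axis=1)
empirical = np.mean(np.abs(means - mu) > eps)

# Hoeffding bound: 2 * exp(-2 * m * eps^2 / (b - a)^2) with (b - a) = 1
bound = 2 * np.exp(-2 * m * eps**2)
print(f"empirical deviation prob: {empirical:.4f}, Hoeffding bound: {bound:.4f}")
```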
Using Hoeffding's Inequality, one can prove the following:
Let $\mathcal{H}$ be a finite hypothesis class, let $Z$ be a domain, and let $\ell : \mathcal{H} \times Z \to [0, 1]$ be a loss function.
Then, $\mathcal{H}$ enjoys the uniform convergence property with sample complexity

$$m_{\mathcal{H}}^{UC}(\epsilon, \delta) \le \left\lceil \frac{\log\big( 2 |\mathcal{H}| / \delta \big)}{2 \epsilon^2} \right\rceil$$

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon / 2, \delta) \le \left\lceil \frac{2 \log\big( 2 |\mathcal{H}| / \delta \big)}{\epsilon^2} \right\rceil$$
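And correspondingly for the agnostic case (again a sketch with assumed values, using the constants from the corollary above):

```python
import math

def sample_complexity_agnostic(h_size, epsilon, delta):
    """Upper bound for agnostic PAC learning of a finite class with ERM:
    m >= 2 * log(2 |H| / delta) / epsilon^2."""
    return math.ceil(2 * math.log(2 * h_size / delta) / epsilon**2)

# Same |H|, epsilon, delta as the realizable example above
print(sample_complexity_agnostic(10**6, epsilon=0.01, delta=0.01))  # 382277 examples, vs. 1843 in the realizable case
```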
Even though Corollary 4.6 only holds for finite hypothesis classes, in practice it might also provide us with upper bounds for "infinite" hypothesis classes, since parameters are most often represented using floating point numbers, which consist of 64 bits.
That is, if we're computationally approximating some real-valued parameter, there are only $2^{64}$ different values it can actually take on. If we then have $d$ parameters, the effective hypothesis class has size at most $2^{64 d}$, and we can use Corollary 4.6 to obtain the following bound:

$$m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{2 \log\big( 2 \cdot 2^{64 d} / \delta \big)}{\epsilon^2} \right\rceil = \left\lceil \frac{128\, d \log 2 + 2 \log(2 / \delta)}{\epsilon^2} \right\rceil$$
This is often referred to as the discretization trick.
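A rough computation of the resulting bound (a sketch; it assumes 64-bit floating point parameters, so $|\mathcal{H}| \le 2^{64 d}$, plugged into the agnostic bound of Corollary 4.6):

```python
import math

def discretized_sample_complexity(d, epsilon, delta):
    """Agnostic bound applied to d parameters stored as 64-bit floats, i.e. |H| <= 2^(64 d):
    m >= 2 * log(2 * 2^(64 d) / delta) / epsilon^2 = (128 d ln 2 + 2 ln(2 / delta)) / epsilon^2."""
    return math.ceil((128 * d * math.log(2) + 2 * math.log(2 / delta)) / epsilon**2)

print(discretized_sample_complexity(d=1000, epsilon=0.1, delta=0.01))
```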
Bias-Complexity Tradeoff
Let $A$ be any learning algorithm for the task of binary classification wrt. the 0-1 loss over a domain $\mathcal{X}$. Let $m$ be any number smaller than $|\mathcal{X}| / 2$, representing a training set size.
Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:
- There exists a function $f : \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$
- With probability of at least $1/7$ over the choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S)) \ge 1/8$
The No-Free-Lunch theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner.
Let $C \subseteq \mathcal{X}$ such that $|C| = 2m$.
We will prove this using the following intuition:
- Any learning algorithm that observes only half of the instances in $C$ has no information on what should be the labels of the rest of the instances in $C$
- Therefore, there exists a "reality", that is some target function $f$, that would contradict the labels that $A(S)$ predicts on the unobserved instances in $C$
Since $|C| = 2m$ by assumption, there are $T = 2^{2m}$ unique mappings $C \to \{0, 1\}$. Denote these $f_1, \ldots, f_T$. For each $i$, let $\mathcal{D}_i$ be a distribution over $C \times \{0, 1\}$ defined by

$$\mathcal{D}_i\big( \{ (x, y) \} \big) = \begin{cases} 1 / |C| & \text{if } y = f_i(x) \\ 0 & \text{otherwise} \end{cases}$$
Then we have

$$\max_{i \in [T]} \underset{S \sim \mathcal{D}_i^m}{\mathbb{E}}\big[ L_{\mathcal{D}_i}(A(S)) \big] \ge \frac{1}{4}$$
We can combine the intermediate lemma below with Lemma B.1 of shalev2014understanding (see the Measure Concentration appendix), from which we observe
where we've used . Hence,
concluding our proof.
For every algorithm $A'$, that receives a training set of $m$ examples from $C \times \{0, 1\}$, there exists a function $f : C \to \{0, 1\}$ and a distribution $\mathcal{D}$ over $C \times \{0, 1\}$, such that $L_{\mathcal{D}}(f) = 0$ and

$$\underset{S \sim \mathcal{D}^m}{\mathbb{E}}\big[ L_{\mathcal{D}}(A'(S)) \big] \ge \frac{1}{4}$$

Let $A'$ be some algorithm that receives a training set of $m$ examples from $C \times \{0, 1\}$ and returns a function $A'(S) : C \to \{0, 1\}$.
There are $k = (2m)^m$ possible sequences of $m$ unlabeled examples from $C$. Denote these $S_1, \ldots, S_k$,
and, for a sequence $S_j = (x_1, \ldots, x_m)$, let

$$S_j^i = \big( (x_1, f_i(x_1)), \ldots, (x_m, f_i(x_m)) \big)$$

i.e. $S_j^i$ denotes the sequence containing the instances in $S_j$ labeled by $f_i$.
If the distribution is $\mathcal{D}_i$, as defined above, then the possible training sets $A'$ can receive are $S_1^i, \ldots, S_k^i$, and all these training sets have the same probability of being sampled. Therefore,

$$\underset{S \sim \mathcal{D}_i^m}{\mathbb{E}}\big[ L_{\mathcal{D}_i}(A'(S)) \big] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}\big( A'(S_j^i) \big)$$
Using the fact that the maximum is larger than the average, and the average is larger than the minimum, we have

$$\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}\big( A'(S_j^i) \big) \ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}\big( A'(S_j^i) \big) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}\big( A'(S_j^i) \big) \ge \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}\big( A'(S_j^i) \big)$$

The above is saying that the maximum (over labeling functions $f_i$) of the expected risk achieved by $A'$ is lower bounded by the minimum (over training sequences $S_j$) of the average risk over all labeling functions.
Now, fix some $j \in [k]$, denote $S_j = (x_1, \ldots, x_m)$, and let $v_1, \ldots, v_p$ be the examples in $C$ that do not appear in $S_j$, i.e.

$$\{ v_1, \ldots, v_p \} = C \setminus \{ x_1, \ldots, x_m \}$$

Clearly $p \ge m$, since we're assuming $|C| = 2m$ and $|S_j| = m$, and therefore the "best" case scenario is if we get $m$ distinct samples.
Therefore, for every function $h : C \to \{0, 1\}$ and every $i$ we have

$$L_{\mathcal{D}_i}(h) = \frac{1}{2m} \sum_{x \in C} \mathbb{1}\big[ h(x) \ne f_i(x) \big] \ge \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}\big[ h(v_r) \ne f_i(v_r) \big] \ge \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}\big[ h(v_r) \ne f_i(v_r) \big]$$

Hence,

$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}\big( A'(S_j^i) \big) \ge \frac{1}{2p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\big[ A'(S_j^i)(v_r) \ne f_i(v_r) \big] \ge \frac{1}{2} \min_{r \in [p]} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\big[ A'(S_j^i)(v_r) \ne f_i(v_r) \big]$$
Next, fix some $r \in [p]$.
We can partition all the functions in $\{ f_1, \ldots, f_T \}$ into $T / 2$ disjoint pairs, where for a pair $(f_i, f_{i'})$ we have

$$f_i(c) \ne f_{i'}(c) \iff c = v_r$$

Since for such a pair we must have $S_j^i = S_j^{i'}$, it follows that

$$\mathbb{1}\big[ A'(S_j^i)(v_r) \ne f_i(v_r) \big] + \mathbb{1}\big[ A'(S_j^{i'})(v_r) \ne f_{i'}(v_r) \big] = 1$$

which yields

$$\frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\big[ A'(S_j^i)(v_r) \ne f_i(v_r) \big] = \frac{1}{2}$$
Substituting this back into the previous inequality gives

$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}\big( A'(S_j^i) \big) \ge \frac{1}{4}$$

and this into the inequality before that gives

$$\max_{i \in [T]} \underset{S \sim \mathcal{D}_i^m}{\mathbb{E}}\big[ L_{\mathcal{D}_i}(A'(S)) \big] \ge \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i}\big( A'(S_j^i) \big) \ge \frac{1}{4}$$

Hence, there exists an $i$ (i.e. a labeling function $f_i$ with $L_{\mathcal{D}_i}(f_i) = 0$ and corresponding distribution $\mathcal{D}_i$) such that

$$\underset{S \sim \mathcal{D}_i^m}{\mathbb{E}}\big[ L_{\mathcal{D}_i}(A'(S)) \big] \ge \frac{1}{4}$$

for every algorithm $A'$.
We decompose the error of an $\mathrm{ERM}_{\mathcal{H}}$ predictor into two components as follows.
Let $h_S$ be an $\mathrm{ERM}_{\mathcal{H}}$ hypothesis. Then, we can write

$$L_{\mathcal{D}}(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}$$

where
- $\epsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ denotes the approximation error, which is the minimum possible risk achievable by a predictor in the given hypothesis class
- $\epsilon_{\mathrm{est}} = L_{\mathcal{D}}(h_S) - \epsilon_{\mathrm{app}}$ denotes the estimation error, which is the difference between the best achievable error and the error achieved by $h_S$
The VC-Dimension
Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{0, 1\}$, and let $C = \{ c_1, \ldots, c_m \} \subseteq \mathcal{X}$.
The restriction of $\mathcal{H}$ to $C$ is the set of functions from $C$ to $\{0, 1\}$ which can be derived from $\mathcal{H}$. That is,

$$\mathcal{H}_C = \big\{ \big( h(c_1), \ldots, h(c_m) \big) : h \in \mathcal{H} \big\}$$

A hypothesis class $\mathcal{H}$ shatters a finite set $C \subset \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of all functions from $C$ to $\{0, 1\}$.
That is, if $|\mathcal{H}_C| = 2^{|C|}$.
Shattering is basically a way of stating:
- $\mathcal{H}_C$ contains all functions mapping from the subset $C$ to the set of binary labels $\{0, 1\}$
- If the hypothesis class we're trying to learn can shatter arbitrarily large sets, this indicates that it will be difficult (if not impossible) to learn: if $|C| = 2m$ and $C$ is shattered by $\mathcal{H}$, i.e. $\mathcal{H}_C$ is the set of all possible mappings $C \to \{0, 1\}$, then (by the No-Free-Lunch theorem) $\mathcal{H}$ is not PAC-learnable, since we cannot satisfy arbitrary $(\epsilon, \delta)$ bounds simply by adjusting the number of samples.
The VC-dimension of a hypothesis class $\mathcal{H}$, denoted $\mathrm{VCdim}(\mathcal{H})$, is the maximal size of a set $C \subset \mathcal{X}$ that can be shattered by $\mathcal{H}$.
If $\mathcal{H}$ can shatter sets of arbitrarily large size, we say that $\mathcal{H}$ has infinite VC-dimension.
Let $\mathcal{H}$ be a class of infinite VC-dimension. Then, $\mathcal{H}$ is not PAC-learnable.
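As a concrete illustration of shattering (my own example, not from the source; the threshold class is an assumed toy class with VC-dimension 1), one can check by brute force whether a finite set $C$ is shattered by comparing the restriction $\mathcal{H}_C$ against all $2^{|C|}$ labelings:

```python
from itertools import product

def restriction(hypotheses, C):
    """H_C: the set of labelings of C realized by some hypothesis in the class."""
    return {tuple(h(c) for c in C) for h in hypotheses}

def shatters(hypotheses, C):
    """C is shattered iff H_C contains all 2^|C| binary labelings of C."""
    return restriction(hypotheses, C) == set(product([0, 1], repeat=len(C)))

# Threshold classifiers on the reals: h_t(x) = 1[x >= t]
thresholds = [lambda x, t=t: int(x >= t) for t in range(-5, 6)]

print(shatters(thresholds, [0.5]))       # True: a single point can be shattered
print(shatters(thresholds, [0.5, 1.5]))  # False: no threshold labels 0.5 as 1 and 1.5 as 0
```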
Generalization
Here we will consider generalization on classification tasks using PAC-learning theory.
Notation
- $(x, y)$: samples of inputs and outputs
- $\mathcal{D}$: distribution on labelled data
- $S \sim \mathcal{D}^m$ denotes $m$ i.i.d. samples of data points from $\mathcal{D}$ (often we just write $S$, and have the $m$ be implied)
- $\mathcal{H}$ denotes the set of classifiers / hypotheses which we consider to explain the data
- $\ell\big(h, (x, y)\big)$ denotes the loss of classifier $h$ on point $(x, y)$
$L_S(h)$ denotes the average loss of classifier $h$ on the sample-set $S$, i.e.

$$L_S(h) = \frac{1}{m} \sum_{(x, y) \in S} \ell\big( h, (x, y) \big)$$

$L_{\mathcal{D}}(h)$ denotes the average loss over the entire distribution $\mathcal{D}$:

$$L_{\mathcal{D}}(h) = \underset{(x, y) \sim \mathcal{D}}{\mathbb{E}}\big[ \ell\big( h, (x, y) \big) \big]$$
Generalization error
Generalization under ERM
The objective is

$$h_S = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}}\, L_S(h)$$

i.e. the $h \in \mathcal{H}$ which best explains the dataset, which is Empirical Risk Minimization since we're minimizing the empirical loss.
We then define the generalization error as follows:
The generalization error is defined as

$$L_{\mathcal{D}}(h) - L_S(h)$$

which measures how well a hypothesis $h$ generalizes to a distribution $\mathcal{D}$ when its performance on a sample $S$ drawn from $\mathcal{D}$ is known.
Generalization Theory
- Generalization theory attempts to upper bound the generalization error $L_{\mathcal{D}}(h) - L_S(h)$
Rademacher Complexity
- Formalizes the notion of complexity of a hypothesis class
Let
- $\mathcal{H}$ be the hypothesis class
- $S = \{ z_1, \ldots, z_m \}$ be i.i.d. samples from $\mathcal{D}$
And then let the coefficients $\sigma_1, \ldots, \sigma_m$ be i.i.d. Rademacher random variables, taking values in $\{ -1, +1 \}$ each with probability $1/2$.
Then the Rademacher Complexity of $\mathcal{H}$ on a distribution $\mathcal{D}$ is defined as

$$\mathcal{R}_m(\mathcal{H}) = \underset{S \sim \mathcal{D}^m,\ \sigma}{\mathbb{E}}\left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i\, h(z_i) \right]$$

The $\sigma_i$ simply flip the sign of the classification (we're assuming $h(z_i) \in \{ -1, +1 \}$ for all $i$), i.e. multiplying by $\sigma_i = -1$ changes the classification of $z_i$ to the opposite one.
This therefore corresponds to flipping the labels of half the datapoints randomly and retaining the labels of the other half. We choose the $h$ which correlates well with this random relabeling (if half are classified as positive and half as negative, and for this specific $\sigma$ we just happen to flip all the ones which are positive to negative, then we get our supremum).
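A Monte Carlo estimate of the (empirical) Rademacher complexity may help build intuition; this is a sketch under assumptions not in the source: $\pm 1$-valued threshold classifiers on synthetic data, with the complexity estimated for a fixed sample by averaging the supremum over random sign draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(predictions, n_draws=2000):
    """Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ].

    predictions: array of shape (num_hypotheses, m) with entries in {-1, +1},
    one row per hypothesis evaluated on the fixed sample."""
    num_h, m = predictions.shape
    sigmas = rng.choice([-1, 1], size=(n_draws, m))   # Rademacher sign variables
    correlations = sigmas @ predictions.T / m         # shape (n_draws, num_h)
    return correlations.max(axis=1).mean()            # sup over h, averaged over sigma

m = 50
X = rng.uniform(-1, 1, size=m)

# A "small" class of 5 thresholds vs. a "richer" class of 200 thresholds
small = np.array([np.where(X >= t, 1, -1) for t in np.linspace(-1, 1, 5)])
rich  = np.array([np.where(X >= t, 1, -1) for t in np.linspace(-1, 1, 200)])

print("small class:", empirical_rademacher(small))
print("rich class: ", empirical_rademacher(rich))
```

A richer class has more hypotheses available to correlate with the random signs, so its estimate is (weakly) larger.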
I struggle a bit with interpreting the Rademacher Complexity intuitively. Why would this indicate high complexity of the hypothesis class?
- Because we can consider ….
For a given loss function $\ell$, we have that the generalization error $L_{\mathcal{D}}(h) - L_S(h)$, on a sample of $m$ i.i.d. samples drawn from a distribution $\mathcal{D}$, is upper bounded in terms of the Rademacher complexity $\mathcal{R}_m(\mathcal{H})$ (up to constant factors and a concentration term),
for all $h \in \mathcal{H}$, with probability $1 - \delta$.
Hence, the generalization error of a hypothesis can be upper bounded by the Rademacher complexity.
This is more a sketch of the proof.
Suppose for a random sample the generalization error is high. Then
- Split into sets and randomly, with sets being of equal size
- For a given (picked independently of ), consider and
For large , we have and thus
(with being sort of like our "test" set and the "training" set). Thus
Since and are picked randomly, we can just consider as the first half of the samples and then the difference reduces to
Thus we have
where the factor of is simply due to our definition of Rademacher complexity involving . We also leave out the "concentration term" (I believe this depends on the "variance" of the underlying distribution)
Let
- $X_1, \ldots, X_m$ be a sequence of independent random variables taking on values in a set $A$
- $c_1, \ldots, c_m$ be a sequence of positive real constants
If $f : A^m \to \mathbb{R}$ satisfies

$$\sup_{x_1, \ldots, x_m, x_i' \in A} \big| f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m) \big| \le c_i$$

for all $i \in [m]$, then

$$\mathbb{P}\Big[ \big| f(X_1, \ldots, X_m) - \mathbb{E}\big[ f(X_1, \ldots, X_m) \big] \big| > \epsilon \Big] \le 2 \exp\left( \frac{-2 \epsilon^2}{\sum_{i=1}^{m} c_i^2} \right)$$
Let
- $\mathcal{D}$ be a fixed distribution over the space $Z$
- $\delta \in (0, 1)$ be a given constant
- $\mathcal{F}$ be a set of bounded functions, wlog. we in particular assume $f(z) \in [0, 1]$ for all $f \in \mathcal{F}$ and $z \in Z$
- $S = (z_1, \ldots, z_m)$ be a sequence of i.i.d. samples from $\mathcal{D}$
Then, with probability at least $1 - \delta$ over the draw of $S$, for every function $f \in \mathcal{F}$,
We have
so
Consider letting
Let with being identically distributed as . Then
and, finally, since are bounded and we assume are all bounded, we can wlog. assume
Thus,
Using McDiarmid's inequality, this implies that
One issue: don't yet know how to compute !
First let's solve for such that the above probability is :
And since
which is what we want. Thus, with probability at least , we have deduced that
and thus, substituting into our original bound on ,
with probability at least .
We're now going to make use of a classic symmetrization technique. Observe that due to the independence of the samples
and
The reason why we're writing rather than just is because, if we're being rigorous, we're really taking the expectation over the 2m-dimensional product-space.
Therefore, we can write
Moreover, we can use the fact that is a convex function, therefore Jensen's inequality tells us that
Since
the above is then
This is where the famous Rademacher random variables $\sigma_i \in \{ -1, +1 \}$, each value taken with probability $1/2$, come in: note that
for any integrable function . In particular,
That is, we have now proven that
Finally, combining it all:
QED boys.
For each , we have
First notice that satisfies the necessary conditions of McDiarmid's inequality, and so we immediately have
for any . Again we can solve for RHS being less than , giving us
If we denote the constant obtained for our earlier application of McDiarmid's inequality to bound the difference , we then have
with probability at least . Setting , we get
with probability at least , as wanted.
PAC-Bayesian generalization bounds
- Now consider a Bayesian approach where we have a prior on the hypothesis class and use data to arrive at a posterior distribution
Notation
- $P$ is a prior distribution on the hypothesis class $\mathcal{H}$
- $Q$ is a posterior distribution over the hypothesis class $\mathcal{H}$
- $\mathrm{KL}(Q \,\|\, P)$ denotes the KL-divergence of $Q$ and $P$
Stuff
- Prediction for an input $x$ is to pick a random hypothesis $h$ according to $Q$ and predict $h(x)$
This gives rise to the error

$$L_{\mathcal{D}}(Q) = \underset{h \sim Q}{\mathbb{E}}\big[ L_{\mathcal{D}}(h) \big]$$
Consider a distribution $\mathcal{D}$ on the data. Let
- $P$ be a prior distribution over the hypothesis class $\mathcal{H}$
Then with probability $1 - \delta$, on an i.i.d. sample $S$ of size $m$ from $\mathcal{D}$, for all distributions $Q$ over $\mathcal{H}$ (which could possibly depend on $S$), we have that
This is saying that the generalization error is bounded above by the square root of the KL-divergence of the distributions (plus some terms that arise from the "concentration bounds").
From this theorem, in order to minimize the error on the real distribution $\mathcal{D}$, we should try to simultaneously minimize the empirical error as well as the KL-divergence between the posterior and the prior.
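To get a feel for how the terms trade off, here is a small numeric sketch using one common McAllester-style form of the PAC-Bayes bound, $L_{\mathcal{D}}(Q) \le L_S(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \ln(m / \delta)}{2 (m - 1)}}$ (an assumption on my part, since the exact constants are not reproduced above), with a finite hypothesis class so the KL-divergence can be computed exactly:

```python
import numpy as np

def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions over a finite hypothesis class."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def pac_bayes_bound(emp_loss_q, q, p, m, delta):
    """One common McAllester-style form:
    L_D(Q) <= L_S(Q) + sqrt( (KL(Q||P) + ln(m / delta)) / (2 (m - 1)) )."""
    return emp_loss_q + np.sqrt((kl_divergence(q, p) + np.log(m / delta)) / (2 * (m - 1)))

n_hyp, m, delta = 100, 1000, 0.05
p = np.full(n_hyp, 1.0 / n_hyp)   # uniform prior
q = np.zeros(n_hyp)
q[0] = 1.0                        # posterior concentrated on a single hypothesis
emp_loss = 0.05                   # assumed empirical loss under Q

print(pac_bayes_bound(emp_loss, q, p, m, delta))  # looser the farther Q moves from P
```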
Observe that for a fixed $h$, using a standard Hoeffding's inequality, we have that

$$\underset{S \sim \mathcal{D}^m}{\mathbb{P}}\big[ L_{\mathcal{D}}(h) - L_S(h) > \epsilon \big] \le e^{-2 m \epsilon^2}$$

which means that the probability that the generalization error of a given hypothesis exceeds a given constant is exponentially small. Thus, with very high probability, the generalization error is bounded by a small constant. Roughly, this says that it behaves like a univariate Gaussian. Using concentration bounds, this further implies that
and therefore, with high probability over ,
Then consider the expression (which can be obtained by working backwards from the statement of the claim)
where the inequality is by convexity of squares. This in turn is now
where the last inequality uses Jensen's Inequality along with the concavity of . Also, we have
where the last step is a standard trick; using the KL-divergence term to switch the distribution over which expectation is taken (same as used in Importance Sampling).
Hence,
Now, substituting this into what we found earlier
since is independent of . This then implies (again, from what we found above)
Combining the two previous expressions
as claimed!
Appendix
Measure Concentration
Let $Z$ be a rv. that takes values in $[0, 1]$. Assume that $\mathbb{E}[Z] = \mu$. Then, for any $a \in (0, 1)$,

$$\mathbb{P}[Z > 1 - a] \ge \frac{\mu - (1 - a)}{a}$$

This also implies that for every $a \in (0, 1)$:

$$\mathbb{P}[Z > 1 - a] \ge \mu - (1 - a)$$
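A quick Monte Carlo check of this lemma (a sketch; the Beta distribution is just an arbitrary choice of a $[0, 1]$-valued random variable with known mean):

```python
import numpy as np

rng = np.random.default_rng(0)

# Z in [0, 1] with known mean mu; Beta(2, 3) has mean 2/5
z = rng.beta(2, 3, size=1_000_000)
mu, a = 2 / 5, 0.8

lhs = np.mean(z > 1 - a)          # P[Z > 1 - a]
rhs = (mu - (1 - a)) / a          # lower bound from the lemma
print(f"P[Z > {1 - a:.1f}] = {lhs:.3f} >= {rhs:.3f}")
```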
Bibliography
- [shalev2014understanding] Shalev-Shwartz & Ben-David, Understanding machine learning: From theory to algorithms, Cambridge University Press (2014).