# Theory

## Overview

This document covers the more "pure" side of Machine Learning, e.g. attempts at establishing bounds on convergence and generalization.

Most of this is built upon the Probably Approximately Correct (PAC) framework, which attempts to bound the generalization error of models.

Good sources, where I've gotten a lot of this from: Shalev-Shwartz & Ben-David [shalev2014understanding].

## Notation

• $h : \mathcal{X} \to \mathcal{Y}$, for a hypothesis $h$ mapping the domain $\mathcal{X}$ to the label set $\mathcal{Y}$
• $h(x)$ denotes the result of applying $h$ to $x$

• $L_{\mathcal{D}, f}(h) := \mathbb{P}_{x \sim \mathcal{D}} \big[ h(x) \ne f(x) \big]$ denotes the true risk of $h$, where $\mathcal{D}$ is the distribution, and $f$ the labelling-function
• $\delta$ denotes the probability of getting a nonrepresentative sample, and we say $1 - \delta$ is the confidence parameter of our prediction
• $\epsilon$ denotes the accuracy parameter, dealing with the fact that we cannot guarantee perfect label prediction: $L_{\mathcal{D}, f}(h) > \epsilon$ is a failure of the learner, while $L_{\mathcal{D}, f}(h) \le \epsilon$ means we view the output of the learner as approximately correct
• $S = \big( (x_1, y_1), \dots, (x_m, y_m) \big)$ denotes an instance of the training set
• Empirical Risk Minimization rule is denoted:

$$\mathrm{ERM}_{\mathcal{H}}(S) \in \underset{h \in \mathcal{H}}{\arg\min}\ L_S(h)$$

i.e. the set of hypotheses which minimize the empirical risk $L_S(h)$.

**Empirical Risk Minimization (ERM)**: minimize the empirical loss.

A risk function is the expected loss of a classifier $h$ wrt. the probability distribution $\mathcal{D}$ over $Z$, namely

$$L_{\mathcal{D}}(h) := \underset{z \sim \mathcal{D}}{\mathbb{E}} \big[ \ell(h, z) \big]$$

Empirical risk is the expected loss over a given sample $S = (z_1, \dots, z_m)$, namely,

$$L_S(h) := \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i)$$
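As a concrete (if toy) illustration, here is a minimal ERM computation over a finite class of threshold classifiers with the 0-1 loss; all names and constants below are my own choices, not anything from the text:

```python
import random

# A minimal sketch of the ERM rule over a finite hypothesis class:
# threshold classifiers h_t(x) = 1[x >= t] with the 0-1 loss.

def empirical_risk(h, sample):
    """L_S(h): average 0-1 loss of h over the sample S."""
    return sum(h(x) != y for x, y in sample) / len(sample)

# Finite hypothesis class: 11 thresholds on [0, 1].
thresholds = [t / 10 for t in range(11)]
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]

# Realizable data: labels come from the true threshold 0.5, which is in H.
random.seed(0)
sample = [(x, int(x >= 0.5)) for x in (random.random() for _ in range(100))]

# ERM: return (one of) the hypotheses minimising the empirical risk.
erm_risk, erm_index = min(
    (empirical_risk(h, sample), i) for i, h in enumerate(hypotheses)
)
```

Since the labels are generated by a threshold that lies in the class, the setting is realizable and the ERM hypothesis attains zero empirical risk.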

### Finite Hypothesis Classes

• A simple restriction in an attempt to avoid "overfitting" is to bound the size of the hypothesis class; i.e. consider a restricted hypothesis class $\mathcal{H}$ with $|\mathcal{H}| < \infty$

For now we make the following assumption (Realizability):

There exists $h^* \in \mathcal{H}$ such that

$$L_{\mathcal{D}, f}(h^*) = 0$$

That is, there exists an optimal hypothesis in the restricted hypothesis class which obtains the global minimum true risk, i.e. minimum loss over the full data distribution.

Hence, we would like to upper bound the probability of sampling an $m$-tuple of instances which leads to failure. That is, find an upper bound for

$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big)$$

Then let

$$\mathcal{H}_B = \{ h \in \mathcal{H} : L_{\mathcal{D}, f}(h) > \epsilon \}$$

i.e. all the "bad" hypotheses, such that the probability of failure of the learner is greater than $\epsilon$.

Further, let

$$M = \{ S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0 \}$$

i.e. $M$ is the set of all training instances which would "mislead" us into believing one of the wrong hypotheses ($h \in \mathcal{H}_B$) is "good" (i.e. has zero average loss on the training data).

With our Realizability Assumption (i.e. that $L_S(h_S) = 0$), the event

$$L_{\mathcal{D}, f}(h_S) > \epsilon$$

can only happen if for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$. That is,

$$\{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \subseteq M$$

which means that

$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big) \le \mathcal{D}^m(M)$$

Further, we can rewrite

$$M = \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \}$$

Hence,

$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big) \le \mathcal{D}^m \Big( \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \} \Big)$$

To summarize: the probability of the learner failing in an $\epsilon$-sense is bounded above by the probability of the learner being misled, i.e. some hypothesis in the "bad" class $\mathcal{H}_B$ having minimal (zero) training error.

And because of how probability-measures work, probabilities of finite unions can be bounded by

$$\mathcal{D}^m \Big( \bigcup_{h \in \mathcal{H}_B} \{ S|_x : L_S(h) = 0 \} \Big) \le \sum_{h \in \mathcal{H}_B} \mathcal{D}^m \big( \{ S|_x : L_S(h) = 0 \} \big)$$

Suppose now that $h \in \mathcal{H}_B$. By definition,

$$L_S(h) = 0 \iff h(x_i) = f(x_i) \quad \forall i \in [m]$$

and due to the i.i.d. sampling of the training set, the joint probability of this occurring is then

$$\mathcal{D}^m \big( \{ S|_x : L_S(h) = 0 \} \big) = \prod_{i=1}^{m} \mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big)$$

Further, for each individual sample in the training set we have

$$\mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big) = 1 - L_{\mathcal{D}, f}(h) \le 1 - \epsilon$$

where the last inequality follows from the fact that $h \in \mathcal{H}_B$, so $L_{\mathcal{D}, f}(h) > \epsilon$.

$\mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big)$ simply denotes the probability of $h$ correctly labelling $x_i$ over the entire true distribution $\mathcal{D}$. Since these equalities are indicator variables, thus binary, this corresponds to the expectation

$$\mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big) = \underset{x \sim \mathcal{D}}{\mathbb{E}} \big[ \mathbb{1}_{[h(x) = f(x)]} \big] = 1 - \underset{x \sim \mathcal{D}}{\mathbb{E}} \big[ \mathbb{1}_{[h(x) \ne f(x)]} \big]$$

and since $L_{\mathcal{D}, f}(h) = \mathbb{E}_{x \sim \mathcal{D}} \big[ \mathbb{1}_{[h(x) \ne f(x)]} \big]$ by definition, we have

$$\mathcal{D} \big( \{ x_i : h(x_i) = f(x_i) \} \big) = 1 - L_{\mathcal{D}, f}(h)$$

which is the second equality in the above expression.

Finally, observing that for $\epsilon \in [0, 1]$, we have

$$1 - \epsilon \le e^{-\epsilon}$$

we obtain

$$\mathcal{D}^m \big( \{ S|_x : L_S(h) = 0 \} \big) \le (1 - \epsilon)^m \le e^{-\epsilon m}$$

Combining this with the sum over $\mathcal{H}_B$ we saw earlier, we conclude that

$$\mathcal{D}^m \big( \{ S|_x : L_{\mathcal{D}, f}(h_S) > \epsilon \} \big) \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}$$

Let

• $\mathcal{H}$ be a finite hypothesis class
• $\delta \in (0, 1)$, $\epsilon > 0$, and $m \in \mathbb{N}$ such that

$$m \ge \frac{\log \big( |\mathcal{H}| / \delta \big)}{\epsilon}$$

Then, for any labeling function $f$, and any distribution $\mathcal{D}$, for which the realizability assumption holds, over the choice of an i.i.d. sample $S$ of size $m$, we have that for every ERM hypothesis, $h_S$,

$$L_{\mathcal{D}, f}(h_S) \le \epsilon \quad \text{with probability at least } 1 - \delta$$

That is, the $\mathrm{ERM}_{\mathcal{H}}$ rule over a finite hypothesis class will be probably (with confidence $1 - \delta$) approximately (up to error $\epsilon$) correct (PAC).

## Probably Approximately Correct (PAC)

A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the property:

• for every $\epsilon, \delta \in (0, 1)$
• for every distribution $\mathcal{D}$ over $\mathcal{X}$
• for every labeling function $f : \mathcal{X} \to \{0, 1\}$

if the realizability assumption holds wrt. $\mathcal{H}$, $\mathcal{D}$, and $f$, then when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis $h$ s.t.

$$L_{\mathcal{D}, f}(h) \le \epsilon \quad \text{with probability at least } 1 - \delta$$

We will often refer to $m_{\mathcal{H}}$ as the sample complexity of learning $\mathcal{H}$.

• $\epsilon$ defines how far the output classifier can be from the optimal one
• $\delta$ is how likely the classifier is to meet the accuracy requirement

Every finite hypothesis class is PAC learnable with sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{\log \big( |\mathcal{H}| / \delta \big)}{\epsilon} \right\rceil$$
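The finite-class sample complexity is easy to evaluate numerically; a small sketch (the function name is ours, natural logarithm assumed):

```python
import math

# Sample size sufficient for a finite class in the realizable case:
#   m_H(eps, delta) <= ceil( log(|H| / delta) / eps )
def pac_sample_complexity(h_size, eps, delta):
    return math.ceil(math.log(h_size / delta) / eps)

# E.g. |H| = 1000 hypotheses, accuracy eps = 0.1, confidence 1 - delta = 0.95:
m = pac_sample_complexity(1000, eps=0.1, delta=0.05)
```

For $|\mathcal{H}| = 1000$, $\epsilon = 0.1$, $\delta = 0.05$ this gives $m = 100$: the required sample size grows only logarithmically in $|\mathcal{H}|$ and $1 / \delta$, but linearly in $1 / \epsilon$.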

### Agnostic PAC learning - relaxing the realizability assumption

A hypothesis class $\mathcal{H}$ is agnostic PAC learnable wrt. a set $Z$ and a loss function $\ell : \mathcal{H} \times Z \to \mathbb{R}_+$, if there exists a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and an algorithm with the property

• for every $\epsilon, \delta \in (0, 1)$
• for every distribution $\mathcal{D}$ over $Z$

when running the learning algorithm on $m \ge m_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. examples drawn from $\mathcal{D}$, the algorithm returns $h \in \mathcal{H}$ such that

$$L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon \quad \text{with probability at least } 1 - \delta$$

where

$$L_{\mathcal{D}}(h) = \underset{z \sim \mathcal{D}}{\mathbb{E}} \big[ \ell(h, z) \big]$$

## Uniform convergence

A training sample $S$ is called $\epsilon$-representative wrt. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$, if

$$\forall h \in \mathcal{H}, \quad \big| L_S(h) - L_{\mathcal{D}}(h) \big| \le \epsilon$$

Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative.

Then, any output of $\mathrm{ERM}_{\mathcal{H}}(S)$, namely, any

$$h_S \in \underset{h \in \mathcal{H}}{\arg\min}\ L_S(h)$$

satisfies

$$L_{\mathcal{D}}(h_S) \le \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$$

From Lemma 4.2 we know that the $\mathrm{ERM}_{\mathcal{H}}$ rule is an agnostic PAC-learner if $S$ is $\frac{\epsilon}{2}$-representative with probability at least $1 - \delta$!

This motivates the following definition of Uniform Convergence:

We say that a hypothesis class $\mathcal{H}$ has the uniform convergence property (wrt. domain $Z$ and loss $\ell$) if there exists a function $m_{\mathcal{H}}^{UC} : (0, 1)^2 \to \mathbb{N}$ such that for

• every $\epsilon, \delta \in (0, 1)$
• every probability distribution $\mathcal{D}$ over $Z$

if $S$ is a sample of $m \ge m_{\mathcal{H}}^{UC}(\epsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then $S$ is $\epsilon$-representative, with at least probability $1 - \delta$.

Let $\theta_1, \dots, \theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$,

$$\mathbb{E}[\theta_i] = \mu \quad \text{and} \quad \mathbb{P}[a \le \theta_i \le b] = 1$$

Then, for any $\epsilon > 0$,

$$\mathbb{P} \left[ \left| \frac{1}{m} \sum_{i=1}^{m} \theta_i - \mu \right| > \epsilon \right] \le 2 \exp \left( - \frac{2 m \epsilon^2}{(b - a)^2} \right)$$
Using Hoeffding's Inequality, one can prove the following:

Let $\mathcal{H}$ be a finite hypothesis class, let $Z$ be a domain, and let $\ell : \mathcal{H} \times Z \to [0, 1]$ be a loss function.

Then, $\mathcal{H}$ enjoys the uniform convergence property with sample complexity

$$m_{\mathcal{H}}^{UC}(\epsilon, \delta) \le \left\lceil \frac{\log \big( 2 |\mathcal{H}| / \delta \big)}{2 \epsilon^2} \right\rceil$$

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity

$$m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathcal{H}}^{UC}(\epsilon / 2, \delta) \le \left\lceil \frac{2 \log \big( 2 |\mathcal{H}| / \delta \big)}{\epsilon^2} \right\rceil$$

Even though Corollary 4.6 only holds for finite hypothesis classes, in practice it might also provide us with upper bounds for "infinite" hypothesis classes, due to parameters most often being represented using floating point representations, which consist of 64 bits.

That is, if we're computationally approximating some $h_\theta$, there are only $2^{64}$ different values $\theta$ can actually take on. If we then have $d$ parameters, so that $|\mathcal{H}| \le 2^{64 d}$, we can use Corollary 4.6 to obtain the following bound:

$$m_{\mathcal{H}}(\epsilon, \delta) \le \left\lceil \frac{2 \log \big( 2 \cdot 2^{64 d} / \delta \big)}{\epsilon^2} \right\rceil \le \left\lceil \frac{128 d + 2 \log(2 / \delta)}{\epsilon^2} \right\rceil$$

This is often referred to as the discretization trick.
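The discretization trick is straightforward to evaluate; a sketch (function name ours, natural logarithm, using the exact $2 \log(2 |\mathcal{H}| / \delta) / \epsilon^2$ form rather than the looser $128 d$ relaxation):

```python
import math

# Discretization trick: a class parameterised by d 64-bit floats has at
# most |H| = 2^(64 d) members, so the agnostic-PAC bound for finite
# classes, m <= ceil( 2 log(2 |H| / delta) / eps^2 ), applies.
def discretized_sample_complexity(d, eps, delta, bits=64):
    log_h = bits * d * math.log(2)  # log |H| = 64 d log 2
    return math.ceil(2 * (log_h + math.log(2 / delta)) / eps ** 2)

m = discretized_sample_complexity(d=10, eps=0.1, delta=0.05)
```

Note how the bound is dominated by the $d \cdot \text{bits}$ term: confidence $\delta$ only enters logarithmically.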

Let $A$ be any learning algorithm for the task of binary classification wrt. the 0-1 loss over a domain $\mathcal{X}$. Let $m$ be any number smaller than $|\mathcal{X}| / 2$, representing a training set size.

Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:

1. There exists a function $f : \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$
2. With probability of at least $\frac{1}{7}$ over the choice of $S \sim \mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S)) \ge \frac{1}{8}$

The No-Free-Lunch theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner.

Let $C \subseteq \mathcal{X}$ such that $|C| = 2m$.

We will prove this using the following intuition:

• Any learning algorithm that observes only half of the instances in $C$ has no information on what should be the labels of the rest of the instances in $C$
• Therefore, there exists a "reality", that is some target function $f$, that would contradict the labels that $A(S)$ predicts on the unobserved instances in $C$

Since $|C| = 2m$ by assumption, there are $T = 2^{2m}$ unique mappings $C \to \{0, 1\}$. Denote these $f_1, \dots, f_T$. For each $f_i$, let $\mathcal{D}_i$ be a distribution over $C \times \{0, 1\}$ defined by

$$\mathcal{D}_i \big( \{ (x, y) \} \big) = \begin{cases} 1 / |C| & \text{if } y = f_i(x) \\ 0 & \text{otherwise} \end{cases}$$

Then we have

$$\max_{i \in [T]} \underset{S \sim \mathcal{D}_i^m}{\mathbb{E}} \big[ L_{\mathcal{D}_i}(A(S)) \big] \ge \frac{1}{4}$$

We can combine Lemma lemma:no-free-lunch-theorem-intermediate-lemma-1 with Lemma lemma:shalev2014understanding-B1, from which we observe

$$\mathbb{P} \left[ L_{\mathcal{D}}(A(S)) \ge \frac{1}{8} \right] \ge \frac{\mathbb{E} \big[ L_{\mathcal{D}}(A(S)) \big] - 1/8}{1 - 1/8} \ge \frac{1/4 - 1/8}{7/8} = \frac{1}{7}$$

where we've used $0 \le L_{\mathcal{D}}(A(S)) \le 1$. Hence,

$$\mathbb{P} \left[ L_{\mathcal{D}}(A(S)) \ge \frac{1}{8} \right] \ge \frac{1}{7}$$

concluding our proof.

For every algorithm $A$, that receives a training set of $m$ examples from $C \times \{0, 1\}$, there exists a function $f : C \to \{0, 1\}$ and a distribution $\mathcal{D}$ over $C \times \{0, 1\}$, such that $L_{\mathcal{D}}(f) = 0$ and

$$\underset{S \sim \mathcal{D}^m}{\mathbb{E}} \big[ L_{\mathcal{D}}(A(S)) \big] \ge \frac{1}{4}$$

Let $A$ be some algorithm that receives a training set $S$ of $m$ examples from $C \times \{0, 1\}$ and returns a function $A(S)$, i.e.

$$A : \big( C \times \{0, 1\} \big)^m \to \{0, 1\}^C$$

There are $k = (2m)^m$ possible sequences of $m$ samples from $C$. Let

$$S_1, \dots, S_k$$

denote these sequences, and

$$S_j^i = \big( (x_1, f_i(x_1)), \dots, (x_m, f_i(x_m)) \big) \quad \text{for } S_j = (x_1, \dots, x_m)$$

i.e. $S_j^i$ denotes the sequence containing the instances in $S_j$ labeled by $f_i$.

If the distribution is $\mathcal{D}_i$, as defined above, then the possible training sets $A$ can receive are $S_1^i, \dots, S_k^i$, and all these training sets have the same probability of being sampled. Therefore,

$$\underset{S \sim \mathcal{D}_i^m}{\mathbb{E}} \big[ L_{\mathcal{D}_i}(A(S)) \big] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i} \big( A(S_j^i) \big)$$

Using the fact that the maximum is larger than the average, and the average is larger than the minimum, we have

$$\max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i} \big( A(S_j^i) \big) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big)$$

The above is saying that the maximum (over labelings $f_i$) of the expected risk achieved by $A$ is bounded below by the minimum (over sample sequences $S_j$) of the risk averaged over all labelings.

Now, fix some $j \in [k]$, denote $S_j = (x_1, \dots, x_m)$ and let $v_1, \dots, v_p$ be the examples in $C$ that do not appear in $S_j$, i.e.

$$\{ v_1, \dots, v_p \} = C \setminus \{ x_1, \dots, x_m \}$$

Clearly $p \ge m$, since we're assuming $|C| = 2m$ and $|S_j| = m$, and $p$ is smallest exactly when the $m$ samples are all distinct.

Therefore, for every function $h : C \to \{0, 1\}$ and every $i \in [T]$ we have

$$L_{\mathcal{D}_i}(h) = \frac{1}{2m} \sum_{x \in C} \mathbb{1}_{[h(x) \ne f_i(x)]} \ge \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \ne f_i(v_r)]} \ge \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \ne f_i(v_r)]}$$

Hence,

$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \frac{1}{2p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} \ge \frac{1}{2} \min_{r \in [p]} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]}$$

Next, fix some $r \in [p]$.

We can partition all the functions in $f_1, \dots, f_T$ into $T / 2$ disjoint pairs, where for a pair $(f_i, f_{i'})$ we have

$$f_i(c) \ne f_{i'}(c) \iff c = v_r$$

Since such a pair must have $S_j^i = S_j^{i'}$, it follows that

$$\mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} + \mathbb{1}_{[A(S_j^{i'})(v_r) \ne f_{i'}(v_r)]} = 1$$

which yields

$$\frac{1}{T} \sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \ne f_i(v_r)]} = \frac{1}{2}$$

Substituting this back into the inequality above gives

$$\frac{1}{T} \sum_{i=1}^{T} L_{\mathcal{D}_i} \big( A(S_j^i) \big) \ge \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$

and this into the max-min chain gives

$$\max_{i \in [T]} \underset{S \sim \mathcal{D}_i^m}{\mathbb{E}} \big[ L_{\mathcal{D}_i}(A(S)) \big] \ge \frac{1}{4}$$

Hence, there exists a labeling $f_i$ and distribution $\mathcal{D}_i$ (with $L_{\mathcal{D}_i}(f_i) = 0$) such that

$$\underset{S \sim \mathcal{D}_i^m}{\mathbb{E}} \big[ L_{\mathcal{D}_i}(A(S)) \big] \ge \frac{1}{4}$$

for every algorithm $A$.
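The averaging argument above can be verified exhaustively on a tiny domain; a sketch in which the "learner" memorizes its sample and predicts 0 elsewhere (any other fixed rule gives the same lower bound):

```python
from itertools import product

# Brute-force check of the No-Free-Lunch averaging argument.
# C has 2m points; enumerate all T = 2^(2m) labelings f_i and all
# k = (2m)^m unlabeled sequences S_j, then verify that even the most
# favourable sample leaves an average risk of at least 1/4.
m = 2
C = list(range(2 * m))
labelings = list(product([0, 1], repeat=len(C)))   # all f_i : C -> {0, 1}
sequences = list(product(C, repeat=m))             # all S_j

def learner(sample):
    """Memorise the training labels; predict 0 on unseen points."""
    seen = dict(sample)
    return lambda x: seen.get(x, 0)

def true_risk(h, f):
    """L_{D_i}(h): 0-1 risk under the uniform distribution on C."""
    return sum(h(x) != f[x] for x in C) / len(C)

min_avg_risk = min(
    sum(true_risk(learner([(x, f[x]) for x in seq]), f) for f in labelings)
    / len(labelings)
    for seq in sequences
)
```

The minimum over samples of the labeling-averaged risk comes out to exactly $1/4$, attained when all $m$ training points are distinct; repeated points only make things worse.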

We decompose the error of an $\mathrm{ERM}_{\mathcal{H}}$ predictor into two components as follows.

Let $h_S$ be an $\mathrm{ERM}_{\mathcal{H}}$ hypothesis. Then, we can write

$$L_{\mathcal{D}}(h_S) = \epsilon_{\mathrm{app}} + \epsilon_{\mathrm{est}}, \quad \text{where} \quad \epsilon_{\mathrm{app}} = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h), \quad \epsilon_{\mathrm{est}} = L_{\mathcal{D}}(h_S) - \epsilon_{\mathrm{app}}$$

where

• $\epsilon_{\mathrm{app}}$ denotes the approximation error, which is the minimum possible risk achievable by a predictor in the given hypothesis class
• $\epsilon_{\mathrm{est}}$ denotes the estimation error, which is the difference between the best achievable error and the error achieved by $h_S$

## The VC-Dimension

Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{0, 1\}$, and let $C = \{c_1, \dots, c_m\} \subseteq \mathcal{X}$.

The restriction of $\mathcal{H}$ to $C$ is the set of functions $C \to \{0, 1\}$ which can be derived from $\mathcal{H}$. That is,

$$\mathcal{H}_C = \big\{ \big( h(c_1), \dots, h(c_m) \big) : h \in \mathcal{H} \big\}$$

A hypothesis class $\mathcal{H}$ shatters a finite set $C \subseteq \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of all functions $C \to \{0, 1\}$.

That is, if $|\mathcal{H}_C| = 2^{|C|}$.

Shattering is basically a way of stating

• $\mathcal{H}_C$ contains all functions mapping from the subset $C$ to the set of binary labels $\{0, 1\}$
• If the hypothesis class we're trying to learn can realize all possible functions from $C$ to $\{0, 1\}$ for arbitrarily large $C$, this indicates that it will be difficult (if not impossible) to learn, due to the fact that if $|C| = 2m$ and $C$ is shattered by $\mathcal{H}$, i.e. $\mathcal{H}_C$ is the set of all possible mappings $C \to \{0, 1\}$, then by the No-Free-Lunch theorem $\mathcal{H}$ is not PAC-learnable from $m$ samples, since we cannot satisfy arbitrary $(\epsilon, \delta)$ bounds simply by adjusting the number of samples.

The VC-dimension of a hypothesis class $\mathcal{H}$, denoted $\mathrm{VCdim}(\mathcal{H})$, is the maximal size of a set $C \subseteq \mathcal{X}$ that can be shattered by $\mathcal{H}$.

If $\mathcal{H}$ can shatter sets of arbitrarily large size, we say that $\mathcal{H}$ has infinite VC-dimension.

Let $\mathcal{H}$ be a class of infinite VC-dimension. Then, $\mathcal{H}$ is not PAC-learnable.
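Shattering can be checked by brute force for small sets; a sketch using threshold classifiers (an illustrative class of our choosing):

```python
# Brute-force shattering check: restrict a hypothesis class H to a
# finite set C and test whether |H_C| = 2^|C|.
def restriction(hypotheses, C):
    """H_C: the set of distinct behaviours of H on C."""
    return {tuple(h(x) for x in C) for h in hypotheses}

def shatters(hypotheses, C):
    return len(restriction(hypotheses, C)) == 2 ** len(C)

# Threshold functions h_t(x) = 1[x >= t] over a grid of thresholds.
thresholds = [t / 10 for t in range(-10, 21)]
H = [lambda x, t=t: int(x >= t) for t in thresholds]

shatters_one_point = shatters(H, [0.5])        # both labelings reachable
shatters_two_points = shatters(H, [0.3, 0.7])  # labeling (1, 0) unreachable
```

A single point can be shattered but no pair of points can, consistent with the threshold class having VC-dimension 1.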

## Generalization

Here we will consider generalization on classification tasks using PAC-learning theory.

### Notation

• $(x_i, y_i)$ samples of inputs and outputs
• $\mathcal{D}$ distribution on labelled data
• $S \sim \mathcal{D}^m$ denotes $m$ i.i.d. samples of data points from $\mathcal{D}$ (often just write $S$, and have the $\mathcal{D}^m$ be implied)
• $\mathcal{H}$ denotes the set of classifiers / hypotheses which we consider to explain the data
• $\ell \big( h, (x, y) \big)$ denotes the loss of classifier $h$ on point $(x, y)$
• $L_S(h)$ denotes the average loss of classifier $h$ on the sample-set $S$, i.e.

$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell \big( h, (x_i, y_i) \big)$$

• $L_{\mathcal{D}}(h)$ denotes the average loss over the entire distribution $\mathcal{D}$:

$$L_{\mathcal{D}}(h) = \underset{(x, y) \sim \mathcal{D}}{\mathbb{E}} \big[ \ell \big( h, (x, y) \big) \big]$$

• Generalization error $L_{\mathcal{D}}(h) - L_S(h)$

### Generalization under ERM

The objective is

$$\min_{h \in \mathcal{H}} L_S(h)$$

i.e. the $h \in \mathcal{H}$ which best explains the dataset; this is Empirical Risk Minimization, since we're minimizing the empirical loss.

We then define the generalization error as follows:

The generalization error is defined

$$\Delta_S(h) := L_{\mathcal{D}}(h) - L_S(h)$$

which measures how well hypothesis $h$ generalizes to a distribution $\mathcal{D}$ when its performance on a sample $S$ drawn from $\mathcal{D}$ is known.

### Generalization Theory

• Formalize the notion of complexity of a hypothesis class

Let

• $\mathcal{H}$ be the hypothesis class
• $S = (x_1, \dots, x_m)$ be i.i.d. samples from $\mathcal{D}$
• And then the coefficients $\sigma = (\sigma_1, \dots, \sigma_m)$ be i.i.d. with $\mathbb{P}[\sigma_i = +1] = \mathbb{P}[\sigma_i = -1] = \frac{1}{2}$

Then the Rademacher Complexity of $\mathcal{H}$ on a distribution $\mathcal{D}$ is defined as

$$\mathcal{R}_m(\mathcal{H}) = \underset{S, \sigma}{\mathbb{E}} \left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \right]$$

The $\sigma_i$ simply flip the sign of the classification (we're assuming $h(x) \in \{-1, +1\}$ for all $h \in \mathcal{H}$), i.e. $\sigma_i = -1$ simply changes the classification of $x_i$ to the opposite one.

This therefore corresponds to flipping the labels of half the datapoints randomly (in expectation) and retaining the labels of the other half. We choose the $h \in \mathcal{H}$ which correlates best with this random relabeling (if half are classified as positive and half as negative, and then for this specific $\sigma$ we just happen to flip all the ones which are positive to negative, then we get our $\sup$).

I struggle a bit with interpreting the Rademacher Complexity intuitively. Why would this indicate high complexity of the hypothesis class?

• Because we can consider ….

For a given loss function $\ell$, we have that the generalization error of all hypotheses $h \in \mathcal{H}$ on a sample $S$ of $m$ i.i.d. samples drawn from a distribution $\mathcal{D}$ satisfies

$$L_{\mathcal{D}}(h) \le L_S(h) + 2 \mathcal{R}_m(\ell \circ \mathcal{H}) + O \left( \sqrt{\frac{\log(1 / \delta)}{m}} \right)$$

for all $h \in \mathcal{H}$, with probability $1 - \delta$.

Hence, the generalization error of a hypothesis can be upper bounded by the Rademacher complexity.
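The (empirical) Rademacher complexity of a simple class can be estimated by Monte Carlo; a sketch with $\pm 1$-valued threshold classifiers and made-up constants:

```python
import random

# Monte Carlo estimate of the empirical Rademacher complexity
#   R_S(H) = E_sigma [ sup_{h in H} (1/m) sum_i sigma_i h(x_i) ]
# for threshold classifiers on a fixed sample.
random.seed(0)
m = 50
xs = [random.random() for _ in range(m)]

thresholds = [t / 20 for t in range(21)]
H = [lambda x, t=t: 2 * int(x >= t) - 1 for t in thresholds]  # h(x) in {-1, +1}

def sup_correlation(sigma):
    """sup over H of the correlation between sigma and the predictions."""
    return max(sum(s * h(x) for s, x in zip(sigma, xs)) / m for h in H)

trials = 500
rademacher_estimate = sum(
    sup_correlation([random.choice([-1, 1]) for _ in range(m)])
    for _ in range(trials)
) / trials
```

For richer classes (more ways to chase the random signs $\sigma$) the supremum is systematically larger, which is exactly why this quantity serves as a complexity measure.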

This is more a sketch of the proof.

Suppose for a random sample $S$ the generalization error is high. Then

1. Split $S$ into sets $S_1$ and $S_2$ randomly, with the sets being of equal size
2. For a given $h$ (picked independently of $S$), consider $L_{S_1}(h)$ and $L_{S_2}(h)$
3. For large $m$, we have $L_{S_2}(h) \approx L_{\mathcal{D}}(h)$ and thus

$$L_{\mathcal{D}}(h) - L_S(h) \approx L_{S_2}(h) - L_{S_1}(h)$$

(with $S_2$ being sort of like our "test" set and $S_1$ the "training" set). Thus

$$\sup_{h \in \mathcal{H}} \big( L_{\mathcal{D}}(h) - L_S(h) \big) \approx \sup_{h \in \mathcal{H}} \big( L_{S_2}(h) - L_{S_1}(h) \big)$$

4. Since $S_1$ and $S_2$ are picked randomly, we can just consider $S_1$ as the first half of the samples and then the difference reduces to

$$\sup_{h \in \mathcal{H}} \frac{2}{m} \sum_{i=1}^{m} \sigma_i \, \ell \big( h, (x_i, y_i) \big), \quad \text{where } \sigma_i = \begin{cases} +1 & \text{if } x_i \in S_2 \\ -1 & \text{if } x_i \in S_1 \end{cases}$$

5. Thus we have

$$\underset{S}{\mathbb{E}} \left[ \sup_{h \in \mathcal{H}} \big( L_{\mathcal{D}}(h) - L_S(h) \big) \right] \lesssim 2 \mathcal{R}_m(\ell \circ \mathcal{H})$$

where the factor of $2$ is simply due to our definition of Rademacher complexity involving $\frac{1}{m}$ rather than $\frac{2}{m}$. We also leave out the "concentration term" (I believe this depends on the "variance" of the underlying distribution).

Let

• $X_1, \dots, X_m$ be a sequence of independent random variables taking on values in a set $A$
• $c_1, \dots, c_m$ be a sequence of positive real constants

If $f : A^m \to \mathbb{R}$ satisfies

$$\sup_{x_1, \dots, x_m, \, x_i'} \big| f(x_1, \dots, x_i, \dots, x_m) - f(x_1, \dots, x_i', \dots, x_m) \big| \le c_i$$

for $1 \le i \le m$, then

$$\mathbb{P} \big[ f(X_1, \dots, X_m) - \mathbb{E} \big[ f(X_1, \dots, X_m) \big] \ge \epsilon \big] \le \exp \left( - \frac{2 \epsilon^2}{\sum_{i=1}^{m} c_i^2} \right)$$

Let

• $\mathcal{D}$ be a fixed distribution over the space $Z$
• $\delta \in (0, 1)$ be a given constant
• $\mathcal{F}$ be a set of bounded functions, wlog. we in particular assume $g(z) \in [0, 1]$ for all $g \in \mathcal{F}$
• $S = (z_1, \dots, z_m)$ be a sequence of i.i.d. samples from $\mathcal{D}$

Then, with probability at least $1 - \delta$ over the draw of $S$, for every function $f \in \mathcal{F}$,

$$\underset{z \sim \mathcal{D}}{\mathbb{E}} \big[ f(z) \big] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + 2 \mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\log(1 / \delta)}{2 m}}$$

We have

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] = \frac{1}{m} \sum_{i=1}^{m} f(z_i) + \left( \underset{z \sim \mathcal{D}}{\mathbb{E}}[f] - \frac{1}{m} \sum_{i=1}^{m} f(z_i) \right)$$

so

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + \sup_{g \in \mathcal{F}} \left( \underset{z \sim \mathcal{D}}{\mathbb{E}}[g] - \frac{1}{m} \sum_{i=1}^{m} g(z_i) \right)$$

Consider letting

$$\Phi(S) := \sup_{g \in \mathcal{F}} \left( \underset{z \sim \mathcal{D}}{\mathbb{E}}[g] - \frac{1}{m} \sum_{i=1}^{m} g(z_i) \right)$$

Let $S' = (z_1, \dots, z_i', \dots, z_m)$ with $z_i'$ being identically distributed as $z_i$. Then

$$\big| \Phi(S) - \Phi(S') \big| \le \sup_{g \in \mathcal{F}} \frac{\big| g(z_i') - g(z_i) \big|}{m}$$

and, finally, since the functions in $\mathcal{F}$ are bounded and we assume $g(z) \in [0, 1]$ for all $g \in \mathcal{F}$, we can wlog. assume

$$\big| g(z_i') - g(z_i) \big| \le 1$$

Thus,

$$\big| \Phi(S) - \Phi(S') \big| \le \frac{1}{m}$$

Using McDiarmid's inequality (with $c_i = \frac{1}{m}$), this implies that

$$\mathbb{P} \Big[ \Phi(S) - \underset{S}{\mathbb{E}} \big[ \Phi(S) \big] \ge \epsilon \Big] \le \exp \big( - 2 m \epsilon^2 \big)$$

One issue: we don't yet know how to compute $\mathbb{E}_S[\Phi(S)]$!

First let's solve for $\epsilon$ such that the above probability is $\delta$:

$$\exp(-2 m \epsilon^2) = \delta \iff \epsilon = \sqrt{\frac{\log(1 / \delta)}{2 m}}$$

And since

$$\mathbb{P} \Big[ \Phi(S) - \underset{S}{\mathbb{E}} \big[ \Phi(S) \big] < \epsilon \Big] \ge 1 - \delta$$

which is what we want. Thus, with probability at least $1 - \delta$, we have deduced that

$$\Phi(S) \le \underset{S}{\mathbb{E}} \big[ \Phi(S) \big] + \sqrt{\frac{\log(1 / \delta)}{2 m}}$$

and thus, substituting into our original bound on $\mathbb{E}_{\mathcal{D}}[f]$,

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + \underset{S}{\mathbb{E}} \big[ \Phi(S) \big] + \sqrt{\frac{\log(1 / \delta)}{2 m}}$$

with probability at least $1 - \delta$.

We're now going to make use of a classic symmetrization technique. Observe that due to the independence of the samples

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[g] = \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \left[ \frac{1}{m} \sum_{i=1}^{m} g(z_i') \right]$$

and

$$\frac{1}{m} \sum_{i=1}^{m} g(z_i) = \underset{S' \sim \mathcal{D}^m}{\mathbb{E}} \left[ \frac{1}{m} \sum_{i=1}^{m} g(z_i) \right]$$

The reason why we're writing $\mathbb{E}_{S' \sim \mathcal{D}^m}$ rather than just $\mathbb{E}_{S'}$ is because, if we're being rigorous, we're really taking the expectation over the $2m$-dimensional product-space.

Therefore, we can write

$$\underset{S}{\mathbb{E}} \big[ \Phi(S) \big] = \underset{S}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \underset{S'}{\mathbb{E}} \left[ \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right] \right]$$

Moreover, we can use the fact that $\sup$ is a convex function, therefore Jensen's inequality tells us that

$$\underset{S}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \underset{S'}{\mathbb{E}} \left[ \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right] \right] \le \underset{S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right]$$

Since $(z_i, z_i')$ and $(z_i', z_i)$ are identically distributed, swapping any subset of the pairs leaves the expectation unchanged; the above is then, for any fixed signs $s_i \in \{-1, +1\}$,

$$\underset{S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right] = \underset{S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} s_i \big( g(z_i') - g(z_i) \big) \right]$$

This is where the famous Rademacher random variables $\sigma_i \in \{-1, +1\}$, each with probability $\frac{1}{2}$, come in: note that

$$\underset{S, S'}{\mathbb{E}} \big[ u(S, S') \big] = \underset{\sigma, S, S'}{\mathbb{E}} \big[ u_\sigma(S, S') \big]$$

for any integrable function $u$, where $u_\sigma$ denotes $u$ with $z_i$ and $z_i'$ swapped whenever $\sigma_i = -1$. In particular,

$$\underset{S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \big( g(z_i') - g(z_i) \big) \right] = \underset{\sigma, S, S'}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \big( g(z_i') - g(z_i) \big) \right] \le 2 \underset{\sigma, S}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \, g(z_i) \right]$$

That is, we have now proven that

$$\underset{S}{\mathbb{E}} \big[ \Phi(S) \big] \le 2 \mathcal{R}_m(\mathcal{F})$$

Finally, combining it all:

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + 2 \mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\log(1 / \delta)}{2 m}} \quad \text{with probability at least } 1 - \delta$$

QED boys.

For each $S$, we have the empirical Rademacher complexity

$$\hat{\mathcal{R}}_S(\mathcal{F}) = \underset{\sigma}{\mathbb{E}} \left[ \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \, g(z_i) \right]$$

First notice that $\hat{\mathcal{R}}_S(\mathcal{F})$, viewed as a function of $S$, satisfies the necessary conditions of McDiarmid's inequality (changing a single $z_i$ changes it by at most $\frac{1}{m}$), and so we immediately have

$$\mathbb{P} \big[ \mathcal{R}_m(\mathcal{F}) - \hat{\mathcal{R}}_S(\mathcal{F}) \ge \epsilon \big] \le \exp \big( - 2 m \epsilon^2 \big)$$

for any $\epsilon > 0$. Again we can solve for the RHS being less than $\delta'$, giving us

$$\mathcal{R}_m(\mathcal{F}) \le \hat{\mathcal{R}}_S(\mathcal{F}) + \sqrt{\frac{\log(1 / \delta')}{2 m}} \quad \text{with probability at least } 1 - \delta'$$

If we denote by $\delta'$ the constant obtained for our earlier application of McDiarmid's inequality to bound the difference $\Phi(S) - \mathbb{E}_S[\Phi(S)]$, we then have

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + 2 \hat{\mathcal{R}}_S(\mathcal{F}) + 3 \sqrt{\frac{\log(1 / \delta')}{2 m}}$$

with probability at least $1 - 2 \delta'$. Setting $\delta' = \delta / 2$, we get

$$\underset{z \sim \mathcal{D}}{\mathbb{E}}[f] \le \frac{1}{m} \sum_{i=1}^{m} f(z_i) + 2 \hat{\mathcal{R}}_S(\mathcal{F}) + 3 \sqrt{\frac{\log(2 / \delta)}{2 m}}$$

with probability at least $1 - \delta$, as wanted.

#### PAC-Bayesian generalization bounds

• Now consider a Bayesian approach where we have a prior $P$ on the hypothesis class and use the data to arrive at a posterior distribution $Q$
##### Notation
• $P$ is a prior distribution on the hypothesis class $\mathcal{H}$
• $Q$ is the posterior distribution over the hypothesis class $\mathcal{H}$
• $\mathrm{KL}(Q \| P)$ denotes the KL-divergence of $Q$ and $P$
##### Stuff
• Prediction for input $x$ is to pick a random hypothesis $h \sim Q$ and predict $h(x)$
• Gives rise to the error

$$L_{\mathcal{D}}(Q) := \underset{h \sim Q}{\mathbb{E}} \big[ L_{\mathcal{D}}(h) \big]$$

Consider a distribution $\mathcal{D}$ on the data. Let

• $P$ be a prior distribution over hypothesis class $\mathcal{H}$

Then with probability $\ge 1 - \delta$, on an i.i.d. sample $S$ of size $m$ from $\mathcal{D}$, for all distributions $Q$ over $\mathcal{H}$ (which could possibly depend on $S$), we have that

$$L_{\mathcal{D}}(Q) \le L_S(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(m / \delta)}{2 (m - 1)}}$$
This is saying that the generalization error is bounded above by the square root of the KL-divergence of the distributions (plus some terms that arise from the "concentration bounds").

From this theorem, in order to minimize the error on the real distribution , we should try to simultaneously minimize the empirical error as well as the KL-divergence between the posterior and the prior.
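The gap term of the bound is easy to compute for discrete priors and posteriors; a numeric sketch with made-up distributions over a toy four-element hypothesis class:

```python
import math

# Numeric sketch of the PAC-Bayes gap term
#   sqrt( (KL(Q || P) + log(m / delta)) / (2 (m - 1)) )
# for discrete prior P and posterior Q; distributions are illustrative.
def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as probability lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def pac_bayes_gap(q, p, m, delta):
    return math.sqrt(
        (kl_divergence(q, p) + math.log(m / delta)) / (2 * (m - 1))
    )

prior = [0.25, 0.25, 0.25, 0.25]   # uniform prior over 4 hypotheses
posterior = [0.7, 0.1, 0.1, 0.1]   # posterior concentrated after seeing data
gap = pac_bayes_gap(posterior, prior, m=1000, delta=0.05)
```

The gap grows with $\mathrm{KL}(Q \| P)$ (posteriors far from the prior pay a price) and shrinks as $m$ grows, matching the trade-off described above.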

Observe that for fixed $h$, using a standard Hoeffding inequality, we have that

$$\underset{S \sim \mathcal{D}^m}{\mathbb{P}} \big[ \Delta(h) \ge \epsilon \big] \le e^{-2 m \epsilon^2}, \quad \text{where } \Delta(h) := L_{\mathcal{D}}(h) - L_S(h)$$

which means the probability that the generalization error of a given hypothesis exceeds a given constant is exponentially small. Thus, with very high probability, the generalization error is bounded by a small constant. Roughly, this says that $\Delta(h)$ behaves like a univariate gaussian. Using concentration bounds, this further implies that

$$\underset{S}{\mathbb{E}} \big[ e^{2 (m - 1) \Delta(h)^2} \big] \le m$$

and therefore, with high probability (at least $1 - \delta$, by Markov's inequality) over $S$,

$$\underset{h \sim P}{\mathbb{E}} \big[ e^{2 (m - 1) \Delta(h)^2} \big] \le \frac{m}{\delta}$$

Then consider the expression (which can be obtained by working backwards from the statement of the claim)

$$2 (m - 1) \Big( \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h) \big] \Big)^2 \le 2 (m - 1)\, \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h)^2 \big]$$

where the inequality is by convexity of squares. This in turn is now

$$2 (m - 1)\, \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h)^2 \big] = \underset{h \sim Q}{\mathbb{E}} \big[ 2 (m - 1) \Delta(h)^2 \big]$$

where the expectation on the right will shortly be bounded via Jensen's Inequality along with the concavity of $\log$. Also, we have, writing $g(h) := 2 (m - 1) \Delta(h)^2$,

$$\underset{h \sim Q}{\mathbb{E}} \big[ g(h) \big] = \underset{h \sim Q}{\mathbb{E}} \left[ \log \left( \frac{Q(h)}{P(h)} \cdot \frac{P(h)}{Q(h)}\, e^{g(h)} \right) \right] = \mathrm{KL}(Q \| P) + \underset{h \sim Q}{\mathbb{E}} \left[ \log \left( \frac{P(h)}{Q(h)}\, e^{g(h)} \right) \right] \le \mathrm{KL}(Q \| P) + \log \underset{h \sim P}{\mathbb{E}} \big[ e^{g(h)} \big]$$

where the last step is a standard trick; using the KL-divergence term to switch the distribution over which the expectation is taken (same as used in Importance Sampling), followed by Jensen's inequality.

Hence,

$$2 (m - 1) \Big( \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h) \big] \Big)^2 \le \mathrm{KL}(Q \| P) + \log \underset{h \sim P}{\mathbb{E}} \big[ e^{2 (m - 1) \Delta(h)^2} \big]$$

Now, substituting in what we found earlier,

$$\underset{h \sim P}{\mathbb{E}} \big[ e^{2 (m - 1) \Delta(h)^2} \big] \le \frac{m}{\delta} \quad \text{with probability at least } 1 - \delta$$

since $P$ is independent of $Q$ (and of $S$). This then implies (again, from what we found above)

$$2 (m - 1) \Big( \underset{h \sim Q}{\mathbb{E}} \big[ \Delta(h) \big] \Big)^2 \le \mathrm{KL}(Q \| P) + \log \frac{m}{\delta}$$

Combining the two previous expressions and solving for $\mathbb{E}_{h \sim Q}[\Delta(h)] = L_{\mathcal{D}}(Q) - L_S(Q)$,

$$L_{\mathcal{D}}(Q) - L_S(Q) \le \sqrt{\frac{\mathrm{KL}(Q \| P) + \log(m / \delta)}{2 (m - 1)}}$$

as claimed!

## Appendix

### Measure Concentration

Let $Z$ be a rv. that takes values in $[a, b]$. Assume that $\mathbb{E}[Z] = 0$. Then, for any $\lambda > 0$,

$$\mathbb{E} \big[ e^{\lambda Z} \big] \le \exp \left( \frac{\lambda^2 (b - a)^2}{8} \right)$$

This also implies (via a Chernoff argument, optimizing over $\lambda$) that for every $\epsilon > 0$:

$$\mathbb{P} \big[ Z \ge \epsilon \big] \le \exp \left( - \frac{2 \epsilon^2}{(b - a)^2} \right)$$
# Bibliography

• [shalev2014understanding] Shalev-Shwartz & Ben-David, Understanding machine learning: From theory to algorithms, Cambridge university press (2014).