Theory

Overview

This document covers much of the more "pure" side of Machine Learning, e.g. attempts at establishing bounds on convergence.

Most of this is built upon the Probably Approximately Correct (PAC) framework, which attempts to bound the generalization error of models.

Good sources, from which I've taken a lot of this material (see in particular shalev2014understanding in the Bibliography):

Notation

  • theory_993ed47d90a1cb78c82e050cdfa1d9ddaa53fa5f.png, for a theory_f56b556b5aef28b769c52c4c0bf2d3213e67eb2e.png
  • theory_227764adbf4cc41f9eaf4b7341bf03ad6038d0c3.png denotes the result of applying theory_bd761427f8cbfcd62c38e87c358ae3292efa26f1.png to theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png,

    theory_3cf73158d81cbf87826643f5d906e98b2f725387.png

  • theory_daa5f0bebefcdcd767fd4c6501fe66870b471866.png denotes the true risk of theory_6b76bf5bb36f0dd5fc5620005b858b74a44a60b6.png, where theory_336cc4816ec0737847752cbcbdf529fc618cae33.png is the distribution, and theory_93369077affb352dbdb97e8b3182fd50784f2b14.png the labelling-function
  • theory_18f46e29ae5df6d58883632ac9a73c6a3ea7e102.png denotes the probability of getting a nonrepresentative sample, and we say theory_8ae4247d4033208c4bb17bc3a9eaa2ab29ee3966.png is the confidence parameter of our prediction
  • theory_224ae917d74dc2133d4403064c971bf562d4db50.png denotes the accuracy parameter, dealing with the fact that we cannot guarantee perfect label prediction: theory_c2e9938bc12958b727e02eedb5e6b726ff11982b.png is a failure of the learner, while theory_26f465031111cc9290beececcb2bef55c20b2302.png means we view the output of the learner as approximately correct
  • theory_c1239494c4d5d84765c926e1c68f1ec81783719c.png denotes an instance of the training set
  • Empirical Risk Minimization rule is denoted:

    theory_ec03f365510464e51236f178b4ff611bf954925b.png

    i.e. the set of hypotheses which minimize the empirical risk theory_e236cb24a267157fbfeff492908e624d2cd2ab1e.png.

Empirical Risk Minimization (ERM)

Standard Empirical Risk Minimization: minimize the empirical loss

theory_91d25949bf89ca2d5bd99096487581a84467c11d.png

A risk function is the expected loss of classifier theory_12c5d44e37b0c0f6f7a8a81278f297359791bfdb.png wrt. the probability distribution theory_336cc4816ec0737847752cbcbdf529fc618cae33.png over theory_ca2bcc80309e331f71dd2c1e1f4788cbc048c242.png, namely

theory_4e1978edd6f87b0fdd571a2e06bee602a58a4c64.png

Empirical risk is the expected loss over a given sample theory_3a64bcd9ab09031b2960a828e1ffb1b2d3a9df4a.png, namely,

theory_6e0157fed09f6d767ec0d04418bd4606cd849c80.png
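
To make these definitions concrete, here is a minimal Python sketch (my own illustration, not from the source) of the empirical risk under the 0-1 loss and the ERM rule over a finite hypothesis class, with hypotheses represented as plain callables:

# Empirical risk and ERM over a finite hypothesis class (sketch, 0-1 loss assumed).
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], int]
Sample = Sequence[Tuple[float, int]]

def empirical_risk(h: Hypothesis, S: Sample) -> float:
    """L_S(h): the fraction of sample points that h labels incorrectly."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H: List[Hypothesis], S: Sample) -> Hypothesis:
    """ERM_H(S): a hypothesis in H minimizing the empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Toy usage: threshold classifiers on the real line.
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.25, 0.5, 0.75)]
S = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
print(empirical_risk(erm(H, S), S))  # 0.0, since the threshold at 0.5 fits S perfectly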

Finite Hypothesis Classes

  • Simple restriction in attempt to avoid "overfitting" is to bound the size of the hypotheses; i.e. consider a restricted hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png

For now we make the following assumption:

There exists theory_bad2d3f092fda591382621f706d152e30182dd32.png such that

theory_1b41b6074f0b34606d60d0e2d45a8e6edbb76bfb.png

That is, there exists an optimal hypothesis in the restricted hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png which obtains the global minimum true risk, i.e. minimum loss over the full data distribution.

Hence, we would like to upper bound the probability of sampling an m-tuple of instances which leads to failure. That is, find an upper bound for

theory_40edefe0de4248ccc27a0749a4e3c1110cada313.png

Then let

theory_1f1070004bd4f04ae6842be514c2d66a58421220.png

i.e. all the "bad" hypotheses, such that the probability of failure of the learner is greater than theory_224ae917d74dc2133d4403064c971bf562d4db50.png.

Further, let

theory_0d459c07b39e26e2edc7559a85d571a5cd9ff386.png

i.e. theory_83bc72bfcb75deba5f65643ddbce6fa2cf599fa4.png is the set of all training instances which would "mislead" us into believing one of the wrong hypotheses (theory_3285bd73e5f1b134f7be55b3caf38ba19d4d4b37.png) is "good" (i.e. has zero average loss on the training data).

With our Realizability Assumption (i.e. that theory_816f64e94959460e1aca54309d4d7516964c9085.png), then

theory_2ee4d48b3a25ad3eeeecc2098f683d883bdbcba7.png

can only happen if for some theory_3285bd73e5f1b134f7be55b3caf38ba19d4d4b37.png we have theory_9ff705bcb974098a884f09b5d08911d86514feea.png. That is,

theory_79a54c2622488acf41e633510d7b0a6ea211cdbb.png

which means that

theory_57faf9762f874c33eab8aa01c41f380dc6dec06c.png

Further, we can rewrite

theory_9a22c021a8d7a86658aa883b45033921837e8c16.png

Hence,

theory_0f529f0c6cf5559af0e51b22ddc5082f582224c8.png

To summarize: the probability of the learner failing in an theory_224ae917d74dc2133d4403064c971bf562d4db50.png sense is bounded above by the probability of the learner being misled, i.e. returning a hypothesis with minimal training error that lies in the "bad" hypothesis class.

By sub-additivity of probability measures (the union bound), the probability of a finite union can be bounded as

theory_5012f5a435598938259801dafeae282056ba32d1.png

Suppose now that theory_3285bd73e5f1b134f7be55b3caf38ba19d4d4b37.png. By definition,

theory_c105cda2b494041ed796545303162ba53f216f0a.png

and due to the i.i.d. sampling of the training set, the joint probability of this occurring is then

theory_1c193235f5fb0e834727ea37ad509fd85ab47dc9.png

Further, for each individual sample in the training set we have

theory_749beecd8199ee5f41a4b1d5419d1c9ba5953cbd.png

where the last inequality follows from the fact that theory_3285bd73e5f1b134f7be55b3caf38ba19d4d4b37.png.

theory_63c60d983fabddc5423ed001ab8436abb84aa5ac.png

simply denotes the probability of this theory_3285bd73e5f1b134f7be55b3caf38ba19d4d4b37.png correctly labelling the theory_d64bfec44a431bb2b6c8c10ecf785039df3255ae.png over the entire true distribution theory_336cc4816ec0737847752cbcbdf529fc618cae33.png. Since these equalities are indicator (hence binary) variables, this probability corresponds to the expectation

theory_c8f4dd055d8efcbf1d533e96f49d09588cf07daa.png

and since theory_8bd718a589106f7e064b08934238b3ce83024e05.png by definition, we have

theory_f3db5eead0f97cd15db48e80c329059dbe37c990.png

which is the second equality in the above expression.

Finally, observing that for theory_d402086a34ea5a67dbbf6f855d57b4a438956cb7.png, we have

theory_974cc14b1d938726d9396db37286a0ecb595f42f.png

we obtain

theory_f0f57ad963ec11cd5cb08c145c371cf5e47fe8e5.png

Combining this with the sum over the theory_3285bd73e5f1b134f7be55b3caf38ba19d4d4b37.png we saw earlier, we conclude that

theory_8df330e4ff6eedd3c69da0a0fb05c3d1b7c79604.png

Let

  • theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png be a finite hypothesis class
  • theory_e1fcbb043f60614dac2be9156eefa91ef32c3dad.png
  • theory_d402086a34ea5a67dbbf6f855d57b4a438956cb7.png
  • theory_f56b556b5aef28b769c52c4c0bf2d3213e67eb2e.png such that

    theory_b475315349914732241db1b307f53c9c3b148090.png

Then, for any labeling function theory_93369077affb352dbdb97e8b3182fd50784f2b14.png, and any distribution theory_336cc4816ec0737847752cbcbdf529fc618cae33.png, for which the realizability assumption holds, over the choice of an i.i.d. sample theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png of size theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png, we have that for every ERM hypothesis, theory_6b76bf5bb36f0dd5fc5620005b858b74a44a60b6.png,

theory_98e12ac0c048a4d015461961ef2e668411f276ba.png

That is, the theory_bd761427f8cbfcd62c38e87c358ae3292efa26f1.png rule over a finite hypothesis class will be probably (with confidence theory_273b96c81a3d1608ab1385ff559c8f2983abfe56.png) approximately (up to error theory_224ae917d74dc2133d4403064c971bf562d4db50.png) correct (PAC).

Probably Approximately Correct (PAC)

A hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png is PAC learnable if there exists a function theory_02bae84cb5d797aaa0af683434ca2debb515e761.png and a learning algorithm with the following property:

  • for every theory_4a9ddd053bbd9006890a7028109222d82062363d.png
  • for every distribution theory_336cc4816ec0737847752cbcbdf529fc618cae33.png over theory_8c8d4992aa385c22d79f66db8edced75a661b728.png
  • for every labeling function theory_bff14d6087c38be800fa323c324dc240064c2812.png

if the realizability assumption holds wrt. theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png, theory_336cc4816ec0737847752cbcbdf529fc618cae33.png, and theory_93369077affb352dbdb97e8b3182fd50784f2b14.png, then when running the learning algorithm on theory_5f72efb5fc4749c7d8e56085a4b301e3677b4726.png i.i.d. examples generated by theory_336cc4816ec0737847752cbcbdf529fc618cae33.png and labeled by theory_93369077affb352dbdb97e8b3182fd50784f2b14.png, the algorithm returns a hypothesis theory_5dc59e8e67581e3469d6fd42edc34acb3f86c387.png s.t.

theory_005d577b410efb6310e2d4cce7a9a30822f2f8b5.png

We will often refer to theory_ee48943266f111265344c439d7c9e9c50329f8e0.png as the sample complexity of learning theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png.

  • theory_224ae917d74dc2133d4403064c971bf562d4db50.png defines how far the output classifier can be from the optimal one
  • theory_18f46e29ae5df6d58883632ac9a73c6a3ea7e102.png is how likely the classifier is to meet the accuracy requirement

Every finite hypothesis class is PAC learnable with sample complexity

theory_3337edaad9c8e13ad91d2bda3e62ffe8c7d1285f.png
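
As a rough worked example (my own, assuming the sample complexity above takes the standard realizable-case form m >= ceil(log(|H| / delta) / epsilon)), the helper below simply evaluates that expression and shows how mildly it grows with |H|:

# Realizable-case sample complexity for a finite class (assumed form, see note above).
import math

def sample_complexity_realizable(H_size: int, epsilon: float, delta: float) -> int:
    return math.ceil(math.log(H_size / delta) / epsilon)

print(sample_complexity_realizable(H_size=1000, epsilon=0.1, delta=0.05))   # 100
print(sample_complexity_realizable(H_size=10**6, epsilon=0.1, delta=0.05))  # 169: only logarithmic in |H|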

Agnostic PAC learning - relaxing realization assumption

A hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png is agnostic PAC learnable wrt a set theory_ca2bcc80309e331f71dd2c1e1f4788cbc048c242.png and a loss function theory_f993f95b626f13c2816c3d5205c38cab37dc2cf8.png, if there exists a function theory_02bae84cb5d797aaa0af683434ca2debb515e761.png and an algorithm with the property

  • for every theory_4a9ddd053bbd9006890a7028109222d82062363d.png
  • for every distribution theory_336cc4816ec0737847752cbcbdf529fc618cae33.png over theory_ca2bcc80309e331f71dd2c1e1f4788cbc048c242.png

when running the learning algorithm on theory_5f72efb5fc4749c7d8e56085a4b301e3677b4726.png i.i.d. examples drawn from theory_336cc4816ec0737847752cbcbdf529fc618cae33.png, the algorithm returns theory_12c5d44e37b0c0f6f7a8a81278f297359791bfdb.png such that

theory_56501e2099483702d2e22f576e6d9620e9e04a3a.png

where

theory_0d569108565cb9b9f33fd35cc6a7f0135f43eb1a.png

Uniform convergence

A training sample theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png is called theory_151a827f8eabc1f60bd9293133912bc6da9c8bf1.png wrt. domain theory_ca2bcc80309e331f71dd2c1e1f4788cbc048c242.png, hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png, loss function theory_4cd082f9db7ea80260b7e962dd0e38070da0d1a2.png, and distribution theory_336cc4816ec0737847752cbcbdf529fc618cae33.png, if

theory_c2c967d955a9910ffae4eb846951ffe2279e283d.png

Assume that a training set is theory_0bb11250269a8e895f215053005084699955a642.png.

Then, any output of theory_a9eb042a960985c10dc36ba2b2e41ce4f211d8e0.png, namely, any

theory_3cf73158d81cbf87826643f5d906e98b2f725387.png

satisfies

theory_01a26951ca79007a05013bef505f50e3d2997740.png

From Lemma 4.2 we know that a theory_2dde67040fd82829c352fca3f779c8a1d9b63e89.png rule theory_6b76bf5bb36f0dd5fc5620005b858b74a44a60b6.png is an agnostic PAC-learner if theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png is theory_151a827f8eabc1f60bd9293133912bc6da9c8bf1.png with probability theory_273b96c81a3d1608ab1385ff559c8f2983abfe56.png!

This motivates the following definition of Uniform Convergence:

We say that a hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png has the uniform convergence property (wrt. domain theory_ca2bcc80309e331f71dd2c1e1f4788cbc048c242.png and loss theory_4cd082f9db7ea80260b7e962dd0e38070da0d1a2.png) if there exists a function theory_a7a25a0827bacbc39dc9a4411c7167d0ff1efb01.png such that for

  • every theory_4a9ddd053bbd9006890a7028109222d82062363d.png
  • every probability distribution theory_336cc4816ec0737847752cbcbdf529fc618cae33.png over theory_ca2bcc80309e331f71dd2c1e1f4788cbc048c242.png

if theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png is a sample of theory_8bda733ad9ff71ff69c9082bc5018504a923ce03.png examples drawn i.i.d. according to theory_336cc4816ec0737847752cbcbdf529fc618cae33.png, then theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png is theory_151a827f8eabc1f60bd9293133912bc6da9c8bf1.png, with at least probability theory_273b96c81a3d1608ab1385ff559c8f2983abfe56.png.

Let theory_ba6808972b5aaadc81df99a9f42c8bf9d168ba22.png be a sequence of i.i.d. random variables and assume that for all i,

theory_171853a03787022002ffa20a532ccf5d10b9037b.png

Then, for any theory_d402086a34ea5a67dbbf6f855d57b4a438956cb7.png,

theory_676496b4e120f33f1fc3c4d630da35916561423b.png
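
As a quick sanity check of this inequality (my own sketch, assuming the usual two-sided form for i.i.d. variables bounded in [a, b]), one can compare the empirical deviation probability against the bound by simulation:

# Monte Carlo check of Hoeffding's inequality (assumed two-sided form):
#   P[ |mean - mu| > eps ] <= 2 * exp(-2 * m * eps^2 / (b - a)^2)
import math
import random

def hoeffding_check(m: int = 50, eps: float = 0.1, trials: int = 20_000) -> None:
    a, b, mu = 0.0, 1.0, 0.5                       # theta_i ~ Uniform[0, 1], so mu = 0.5
    deviations = 0
    for _ in range(trials):
        mean = sum(random.uniform(a, b) for _ in range(m)) / m
        if abs(mean - mu) > eps:
            deviations += 1
    bound = 2 * math.exp(-2 * m * eps ** 2 / (b - a) ** 2)
    print(f"empirical: {deviations / trials:.4f}  <=  bound: {bound:.4f}")

hoeffding_check()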

Using Hoeffding's Inequality, one can prove the following:

Let theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png be a finite hypothesis class, let theory_ca2bcc80309e331f71dd2c1e1f4788cbc048c242.png be a domain, and let theory_9c6d8ebeb0e75e3d138dd899699e125062a11aa9.png be a loss function.

Then, theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png enjoys the uniform convergence property with sample complexity

theory_e8a9906b92b855e073bdacdbb1c61b1bd0389cca.png

Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity

theory_158140220afff191f735c9216720ca392de6a8d9.png

Even though Corollary 4.6 only holds for finite hypothesis classes, in practice it can also provide us with upper bounds for "infinite" hypothesis classes, since parameters are most often represented using floating-point representations, which consist of 64 bits.

That is, if we're computationally approximating some theory_576aa72a0ce49c85eb263d8893e965029ef45a5b.png, there are only theory_13771c0108a3341f9e5397c8fcde6c212d8c09b0.png different values theory_3379210dafd739839723c827682a6f9ae39ad754.png can actually take on. If we then have theory_f4e5566de1c75b7700b562b91a80d2308cc66d12.png parameters, we can use Corollary 4.6 to obtain the following bound:

theory_8e8a05ce7af3b46ade2af04f07296b40e59a0b01.png

This is often referred to as the discretization trick.
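
As a worked instance (my own numbers, assuming the agnostic-case bound of Corollary 4.6 takes the common form m <= ceil(2 * log(2|H| / delta) / epsilon^2)): with d parameters each stored as a 64-bit float we have |H| <= 2^(64 d), so log |H| <= 64 d log 2 and the resulting bound is linear in d:

# Discretization-trick bound (sketch; the 2/eps^2 constant is the usual agnostic-case
# form and should be treated as an assumption about the corollary above).
import math

def discretized_sample_complexity(d: int, eps: float, delta: float) -> int:
    log_H = 64 * d * math.log(2)                   # log |H| for 2^(64 d) hypotheses
    return math.ceil(2 * (log_H + math.log(2 / delta)) / eps ** 2)

print(discretized_sample_complexity(d=10, eps=0.1, delta=0.05))  # 89461: linear in d, not in |H|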

Bias-Complexity Tradeoff

Let theory_e46729bc781c25bbc7120ee2892cc1c0215af7da.png be any learning algorithm for the task of binary classification wrt. the 0-1 loss over a domain theory_8c8d4992aa385c22d79f66db8edced75a661b728.png. Let theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png be any number smaller than theory_1f71684e08451f57a49e59b8e2ec0be67be8b157.png, representing a training set size.

Then, there exists a distribution theory_336cc4816ec0737847752cbcbdf529fc618cae33.png over theory_b04a0fa8421f0c2fb4387ee7c69c08705f7d8a4a.png such that:

  1. There exists a function theory_a41f07bd24bdb1c755021c7a73aeafde8d2bd7ae.png with theory_7393152d1833f9c8cb807ccf6bda1505dfdc2b4c.png
  2. With probability of at least theory_8fb97f9ed813d2daccbed2ac6c340a5ccfe3afa1.png over the choice of theory_31a116a275ce184e84e0eda389a27deba34c945f.png we have that

    theory_92bb8333e54dbda158be4b4343215ce9d23eb570.png

The No-Free-Lunch theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner.

Let theory_87a40c4f1365d4a69911b4a10c353938349e4f08.png such that theory_ca87d51155b60f29d30ec6612d75b5643393b570.png.

We will prove this using the following intuition:

  • Any learning algorithm that observes only half of the instances in theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png has no information on what should be the labels of the rest of the instances in theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png
  • Therefore, there exists a "reality", that is some target function theory_93369077affb352dbdb97e8b3182fd50784f2b14.png, that would contradict the labels that theory_f6d69a1ee785e18420ff0d6ef46f6bfe885ff538.png predicts on the unobserved instances in theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png

Since theory_ca87d51155b60f29d30ec6612d75b5643393b570.png by assumption, there are theory_fbbb98a8c8022d988a6bb40df4f13e192272b0a8.png unique mappings theory_0b8b0a60370264bc07ede13ac6e3096f7f9e168c.png. Denote these theory_168efd2aacfcb1152676fbedd17aa768b4a566ea.png. For each theory_8996d3648ae824690559c2960e2f71b3c35c0908.png, let theory_4f9b55612fe5cbaacdfc32571571f215c20607d3.png be a distribution over theory_fbbace385bbbaa44643e7ddca1e9f57bed6a1b09.png defined by

theory_7acd7e1e10852c94d5c786ed959080232268926e.png

Then we have

theory_69510e9b08955bfb0e4b49a2202be890eabca38d.png

We can combine the intermediate lemma above with Lemma B.1 of shalev2014understanding, from which we observe

theory_19f197ba580472b164abe1b11fc9a34283675cdd.png

where we've used theory_1540b8a94abab1a501da69e68c230b5557f55eee.png. Hence,

theory_00b46872ce6962eedec07aea3841a79cfa1418c5.png

concluding our proof.

For every algorithm theory_6ecabf46b855c7cbc7b0c57a13b6f1559d122040.png that receives a training set of theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png examples from theory_b04a0fa8421f0c2fb4387ee7c69c08705f7d8a4a.png, there exists a function theory_bff14d6087c38be800fa323c324dc240064c2812.png and a distribution theory_336cc4816ec0737847752cbcbdf529fc618cae33.png over theory_b04a0fa8421f0c2fb4387ee7c69c08705f7d8a4a.png, such that theory_7393152d1833f9c8cb807ccf6bda1505dfdc2b4c.png and

theory_921f5bc02650e65ac62712e778f119d9f96c4987.png

Let theory_e46729bc781c25bbc7120ee2892cc1c0215af7da.png be some algorithm that receives a training set of theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png examples from theory_fbbace385bbbaa44643e7ddca1e9f57bed6a1b09.png and returns a function theory_1cab8db7fdc5f9c8c2a7fac5eba307c6660c4e43.png, i.e.

theory_56a7cb0779af10e2f280759b685a5987147e7294.png

There are theory_85b3dda1727d0d7f4f57a1011ea5194045678ade.png possible sequences of theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png samples from theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png. Let

theory_a1c19a403dc89cde70a7e9660f45e7949c80b837.png

and

theory_0392a8cc03ec35b2f0732333830ce6eefdeed6d8.png

i.e. theory_613f0709a86062a37bc16320a541a71823270258.png denotes the sequences containing instances in theory_b70a53064cbcd7290052e360607435cae90aac3e.png labeled by theory_8996d3648ae824690559c2960e2f71b3c35c0908.png.

If the distribution is theory_4f9b55612fe5cbaacdfc32571571f215c20607d3.png, as defined above, then the possible training sets theory_e46729bc781c25bbc7120ee2892cc1c0215af7da.png can receive are theory_b075fcb2a1c294e758f64b8a12b550a39435871b.png, and all these training sets have the same probability of being sampled. Therefore,

theory_463605a26b65ceade8b3ecf8d587091551d059fe.png

Using the fact that the maximum is at least the average, and the average is at least the minimum, we have

theory_1eaf68478b61c1bc49ac877d9e486d95ba3c9f76.png

The above equation is saying that the maximum risk expected over possible rearrangements of theory_b70a53064cbcd7290052e360607435cae90aac3e.png achieved by theory_96578bc7677fa6e4ce9334969f9d5ca689d2784c.png is greater than the minimum of the expected risk over possible theory_8996d3648ae824690559c2960e2f71b3c35c0908.png using the optimal arrangement ordering of theory_b70a53064cbcd7290052e360607435cae90aac3e.png.

Now, fix some theory_63b218c5014d5654690bf98e02671c6ad54e82d9.png, and denote theory_71b4726ffdf729e9bada79bb13c6c7d49f28dee1.png and let theory_8ceec0f90cd92522bc9f89feaa709e765d91c5f5.png be the examples in theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png that do not appear in theory_b70a53064cbcd7290052e360607435cae90aac3e.png, i.e.

theory_0547ca1432e2c67fc260d1054a3169a3cf5f5790.png

Clearly theory_dd6786c9350955b8a3321bc355ccbd1b25d7d4e9.png, since we're assuming theory_ca87d51155b60f29d30ec6612d75b5643393b570.png and theory_cc255e94f0cc223df8fb12e1a6039abecd281834.png, and therefore the "best" case scenario is if we get theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png distinct samples.

Therefore, for every function theory_f823f5e4b92798a4c115ccc26cdfe2b7d4912d69.png and every theory_016a05e5db0dfc1ccab604a0e59e44145f436205.png we have

theory_cd7c348183951b36b00d3159f2723efdc8ed11ca.png

Hence,

theory_817221f8ab64f6d0b7e47a00f4462d5498fd22b1.png

Next, fix some theory_dca54a6ec0bc18bd6922f35121ca72a3012fd065.png.

We can partition all the functions in theory_168efd2aacfcb1152676fbedd17aa768b4a566ea.png into theory_cbe161389c5489607075d44877002b45ff1d25b3.png disjoint pairs, where for a pair theory_20cea8414a4a96aac22e3ee31fbcb53eb2a7f2be.png we have

theory_ef3ab65597e73f4574458f9d8046b0da8ed1258e.png

Since such a pair must have theory_c51b704168036d616de40bc93d8cbc62278efdf7.png, it follows that

theory_8b4be5b098fcef619a688c28a35aa03db9d0f5e1.png

which yields

theory_e2b651f23b543cce82ac80706640a4f38996ecd2.png

Substituting this back into

theory_f04b62103cf21a9535a0e62c79ccd7b3ad241ed3.png

and this into

theory_979c6fe8bd57997e780e66bb8599c73bb0b7e4d6.png

Hence, there exists a sample such that

theory_3335f333134fcf5662e6051bb279691eda21bb14.png

for every algorithm theory_6ecabf46b855c7cbc7b0c57a13b6f1559d122040.png.

We decompose the error of an theory_d9bfdfeefb0254ad5809596fc17968f30a45ad23.png predictor into two components as follows.

Let theory_6b76bf5bb36f0dd5fc5620005b858b74a44a60b6.png be an theory_d9bfdfeefb0254ad5809596fc17968f30a45ad23.png hypothesis. Then, we can write

theory_eb271c5693b8327ab71b4b6a448f093c1c2580a1.png

where

theory_9b90d50b099a4a6f4ce0c523d014048d7a6f89e6.png

  • theory_99d163094ee08b0568e433b051ef16d85d948b85.png denotes the approximation error, which is the minimum possible risk achievable by a predictor in the given hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png
  • theory_646dd9c513acea35d55cb068dd75062b5a108361.png denotes the estimation error, which is the difference between the best achievable error and the error achieved by theory_6b76bf5bb36f0dd5fc5620005b858b74a44a60b6.png

The VC-Dimension

Let theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png be a class of functions from theory_8c8d4992aa385c22d79f66db8edced75a661b728.png to theory_e7832931ac33d553a5f34521db0fecba78af1b81.png, and let theory_9b8b9eb8f633374dc268a584e27a9f7cb1b15ffe.png.

The restriction of theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png to theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png is the set of functions theory_f823f5e4b92798a4c115ccc26cdfe2b7d4912d69.png which can be derived from theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png. That is,

theory_6544e80a6d1f3297f7668a960d98612eff1bc5f0.png

A hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png shatters a finite set theory_87a40c4f1365d4a69911b4a10c353938349e4f08.png if the restriction of theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png to theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png is the set of all functions theory_f823f5e4b92798a4c115ccc26cdfe2b7d4912d69.png.

That is, if theory_7d592674d45c9665a6195f392f95dab2ad6f7044.png.

Shattering is basically a way of stating the following:

  • The restriction of theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png to theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png contains all functions mapping from the subset theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png to the set of binary labels
  • If the hypothesis class we're trying to learn contains all possible functions from theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png to theory_e7832931ac33d553a5f34521db0fecba78af1b81.png, we say theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png is shattered. This indicates that the class will be difficult (if not impossible) to learn: if theory_c3f5f88eeecc1a18923d375a91d4a183cc6c9497.png and theory_9ccec48caa132706c57c7e0e22c7558bf8f1a48b.png is shattered by theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png, i.e. theory_062f0f1857b87f8399661c146355f0df93286d7b.png is the set of all possible mappings theory_ecd1378e8f4faad6d7d1b0f14f20b6ba0a7f765d.png, then theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png is not PAC-learnable, since we cannot satisfy arbitrary bounds simply by increasing the number of samples (a brute-force check of shattering is sketched below)
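
Here is the brute-force check referred to above (my own sketch): for a finite class given as explicit functions, shattering of a set C can be tested by comparing the restriction of the class to C against all 2^|C| possible labelings.

# Brute-force shattering check for a finite hypothesis class over a finite set C.
from itertools import product
from typing import Callable, Iterable, Sequence

def shatters(H: Iterable[Callable[[int], int]], C: Sequence[int]) -> bool:
    restriction = {tuple(h(x) for x in C) for h in H}     # H restricted to C
    all_labelings = set(product((0, 1), repeat=len(C)))   # every g: C -> {0, 1}
    return restriction == all_labelings

# Toy usage: threshold functions shatter any single point but no two-point set,
# so their VC-dimension is 1.
thresholds = [lambda x, t=t: int(x >= t) for t in range(-3, 4)]
print(shatters(thresholds, [0]))     # True
print(shatters(thresholds, [0, 1]))  # False: the labeling (1, 0) is unachievable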

The VC-dimension of a hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png denoted theory_ec146944b934369ff40d2398ca0a435089708381.png, is the maximal size of a set theory_87a40c4f1365d4a69911b4a10c353938349e4f08.png that can be shattered by theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png.

If theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png can shatter sets of arbitrarily large size, we say that theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png has infinite VC-dimension.

Let theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png be a class of infinite VC-dimension. Then, theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png is not PAC-learnable.

Generalization

Here we will consider generalization on classification tasks using PAC-learning theory.

Notation

Here we use quite a lot of notation from the PAC-literature.

  • theory_d38f0f12e0d25e670a9c8493a78bc55599f77659.png samples of inputs and outputs
  • theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png distribution on labelled data
  • theory_7697eb63b72dd0933f32825b32514620ccf11e1b.png denotes theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png i.i.d. samples of data points theory_14ed2be9bd9a0889cad501cc416a19d0b0fd85bb.png from theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png (often we just write theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png, with theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png implied)
  • theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png denotes the set of classifiers / hypotheses theory_5dc59e8e67581e3469d6fd42edc34acb3f86c387.png which we consider to explain the data
  • theory_2192bdc1bb45dd6838ff9d7daa36e60da15fe873.png denotes the loss of classifier theory_12c5d44e37b0c0f6f7a8a81278f297359791bfdb.png on point theory_0175212be7f24a1a4ebcc0fe80017b066cc38413.png
  • theory_cd9a4a2b8f20fb799e3efc3ca23a2e08aad06cb4.png denotes the average loss of classifier theory_5dc59e8e67581e3469d6fd42edc34acb3f86c387.png on the sample-set theory_032dbbb34df1ac8ba667915d91c456258307eabc.png, i.e.

    theory_5a27c8c3c8c5aa47a6d58a17da3ca3f2d5feda1c.png

  • theory_1e92ca2aa504b67c4f3429713bf422493ebd3847.png denotes the average loss over the entire distribution theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png:

    theory_8dcff0aea393766903370a8e1310b8ef8f41955b.png

  • Generalization error

    theory_81cbe57e0175744e33fa2a71286198e652577b95.png

Generalization under ERM

The objective is

theory_81136f9ed0a117d1d6b23251c4de7b59016acdc4.png

i.e. the theory_5dc59e8e67581e3469d6fd42edc34acb3f86c387.png which best explains the dataset; this is Empirical Risk Minimization, since we're minimizing the empirical loss.

We then define the generalization error as follows:

The generalization error is defined

theory_81cbe57e0175744e33fa2a71286198e652577b95.png

which measures how well hypothesis theory_5dc59e8e67581e3469d6fd42edc34acb3f86c387.png generalizes to a distribution theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png when its performance on a sample theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png drawn from theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png is known.
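
As a small illustration (my own sketch with a made-up toy distribution, assuming the generalization error is the gap between the true loss and the empirical loss), one can estimate it for a fixed h by comparing the loss on the training sample with the loss on a large fresh sample from D:

# Estimating Delta_S(h) = L_D(h) - L_S(h) by Monte Carlo (toy distribution, 0-1 loss).
import random

def sample_from_D(n: int):
    """Toy D: x ~ Uniform[0, 1], y = 1[x >= 0.5] with 10% label noise."""
    S = []
    for _ in range(n):
        x = random.random()
        y = int(x >= 0.5)
        if random.random() < 0.1:
            y = 1 - y
        S.append((x, y))
    return S

def zero_one_loss(h, S) -> float:
    return sum(1 for x, y in S if h(x) != y) / len(S)

h = lambda x: int(x >= 0.4)          # a fixed hypothesis, chosen independently of S
S_train = sample_from_D(50)          # the observed sample S
S_fresh = sample_from_D(200_000)     # large fresh sample, a proxy for L_D(h)
print(zero_one_loss(h, S_fresh) - zero_one_loss(h, S_train))  # estimate of Delta_S(h)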

Generalization Theory

Rademacher Complexity

  • Formalize notion of complexity of a hypothesis theory_5dc59e8e67581e3469d6fd42edc34acb3f86c387.png

Let

  • theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png is the hypothesis class
  • theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png is theory_bb36f78c93ce14e842592f72fe9260b5b01375e6.png i.i.d. samples from theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png
  • And then the coefficients

    theory_93b86d2b30d7c0a7760dd817f7206dba35c08f35.png

Then the Rademacher Complexity of theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png on a distribution theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png is defined as

theory_033acedef02b1be04b8bedb1840df4cdb7cead85.png

The theory_23a41686415bf15002851802933b520d67febc2d.png simply flip the sign of the classification (we're assuming theory_36f219f564f9fe050a29e29f228f9898f99fa83f.png for all theory_f78e53d727e82c8e43205bb55f2368f6dc0affa3.png), i.e. they change the classification of theory_5dc59e8e67581e3469d6fd42edc34acb3f86c387.png to the opposite one.

This therefore corresponds to randomly flipping the labels of half the datapoints and retaining the labels of the other half. We choose the theory_12c5d44e37b0c0f6f7a8a81278f297359791bfdb.png which correlates well with this random relabeling (if half are classified as positive and half as negative, and for this specific theory_5dc59e8e67581e3469d6fd42edc34acb3f86c387.png we just happen to flip all the ones which are positive to negative, then we get our theory_ec49f31ce688125a33a0751efe77b73d2081f98c.png).
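
To make this concrete, here is a Monte Carlo estimate of the empirical Rademacher complexity of a finite class on a fixed sample (my own sketch; note that the normalization 1/m used below may differ by a constant factor from the 2m-based definition above):

# Monte Carlo estimate of the empirical Rademacher complexity of a finite class H
# on a fixed sample xs: E_sigma[ sup_h (1/m) * sum_i sigma_i h(x_i) ].
import random

def empirical_rademacher(H, xs, n_sigma: int = 5_000) -> float:
    m = len(xs)
    total = 0.0
    for _ in range(n_sigma):
        sigma = [random.choice((-1, 1)) for _ in range(m)]
        # sup over h of the correlation between h's predictions and the random signs
        total += max(sum(s * h(x) for s, x in zip(sigma, xs)) / m for h in H)
    return total / n_sigma

# Toy usage: a handful of +/-1-valued threshold classifiers.
H = [lambda x, t=t: 1 if x >= t else -1 for t in (0.2, 0.4, 0.6, 0.8)]
xs = [random.random() for _ in range(50)]
print(empirical_rademacher(H, xs))   # small for such a simple (low-complexity) class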

I struggle a bit with interpreting the Rademacher Complexity intuitively. Why would this indicate high complexity of the hypothesis class?

  • Because we can consider ….

For a given loss function theory_2192bdc1bb45dd6838ff9d7daa36e60da15fe873.png, we have that the generalization error of all hypotheses theory_12c5d44e37b0c0f6f7a8a81278f297359791bfdb.png on a sample theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png of theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png i.i.d. samples drawn from a distribution theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png, is

theory_e5acaf162eff9f2652ba082bb97d69a3dda5584e.png

for all theory_0809b466cd9de765ec30e362bd8646103cb497eb.png, with probability theory_689890d6c79234ad05b0c296449463b8cf6313f2.png.

Hence, the generalization error theory_09b7bca6dcf03d083e8eb99cfbe11909f9aed00e.png of a hypothesis theory_12c5d44e37b0c0f6f7a8a81278f297359791bfdb.png can be upper bounded by the Rademacher complexity.

This is more a sketch of the proof.

Suppose for a random sample theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png the generalization error is high. Then

  1. Split theory_032dbbb34df1ac8ba667915d91c456258307eabc.png into sets theory_d628cc96f9875171d7e1e6e93e1912621b319d2b.png and theory_79463e3ac84fc19e2ffdf1cc595c89299b9c5ea1.png randomly, with the sets being of equal size
  2. For a given theory_14a40f189f2ce698341c03cd5c099a337431c49f.png (picked independently of theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png), consider theory_5f50bbda2d2a7b49de15bb38c81937c30bf5a59f.png and theory_a19ac0a1f8e614477b96daba170d7793aff2981d.png
  3. For large theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png, we have theory_f6d98f75f1a9a3ae2823b450c14a63062ad5d3ca.png and thus

    theory_8b6e5ada1dd3347f9ce33ad40c409f79b55ffaae.png

    (with theory_79463e3ac84fc19e2ffdf1cc595c89299b9c5ea1.png being sort of like our "test" set and theory_d628cc96f9875171d7e1e6e93e1912621b319d2b.png the "training" set). Thus

    theory_22f087fa34686b338654ffe493eaab2047b13dbd.png

  4. Since theory_d628cc96f9875171d7e1e6e93e1912621b319d2b.png and theory_79463e3ac84fc19e2ffdf1cc595c89299b9c5ea1.png are picked randomly, we can just consider theory_d628cc96f9875171d7e1e6e93e1912621b319d2b.png as the first half of the samples theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png and then the difference reduces to

    theory_e087c5ed5e188edc68ddb44cfcbedc2645e5ccc9.png

  5. Thus we have

    theory_2f12c7bdea8e31305feddfb2c2006bf1084e9f26.png

    where the factor of theory_a29e34be17b10a19ddb747289b0aec4da0fa854e.png is simply due to our definition of Rademacher complexity involving theory_bb36f78c93ce14e842592f72fe9260b5b01375e6.png. We also leave out the "concentration term" (I believe this depends on the "variance" of the underlying distribution)

PAC-Bayesian generalization bounds

  • Now consider a Bayesian approach, where we have a prior on the hypothesis class and use the data to arrive at a posterior distribution

Notation

  • theory_a9090c77ce9916955c745920bf8f134a0932d59d.png is a prior distribution on the hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png
  • theory_9fe46e6ccfd065fb50f2eeb9cbdc65835e9ec321.png is a posterior distribution over the hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png
  • theory_6456725e1e28b66b22980624ac8f4d2cadf691dc.png denotes the KL-divergence of theory_9fe46e6ccfd065fb50f2eeb9cbdc65835e9ec321.png and theory_a9090c77ce9916955c745920bf8f134a0932d59d.png

Stuff

  • Prediction for input theory_7a84c9a383f9772338016d101ccc096be06af784.png is to pick a random hypothesis theory_12c5d44e37b0c0f6f7a8a81278f297359791bfdb.png according to theory_9fe46e6ccfd065fb50f2eeb9cbdc65835e9ec321.png and predict theory_38770ebdb5d1972b50334e0b8124f38af437764e.png
  • This gives rise to the error

    theory_3428c02397a7eaee8a1b8e6dd320c46095186078.png

Consider a distribution theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png on the data. Let

  • theory_a9090c77ce9916955c745920bf8f134a0932d59d.png be a prior distribution over hypothesis class theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png
  • theory_0809b466cd9de765ec30e362bd8646103cb497eb.png

Then with probability theory_e859e8bc7ebecd1bd181fd9204b8c6771e90c2f1.png, on an i.i.d. sample theory_7b703c58dcb928b4999ba63a4882de3329bf40ca.png of size theory_2e87ddd81a8341ac85c5b5d94608b54b9a890a8d.png from theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png, for all distributions theory_9fe46e6ccfd065fb50f2eeb9cbdc65835e9ec321.png over theory_991879c7fa97073b972673f091ed54ad80ca2ecc.png (which could possibly depend on theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png), we have that

theory_2880a408722f9fafc1ff7dd9b1a4db019aca4a82.png

This is saying that the generalization error is bounded above by the square root of the KL-divergence of the distributions (plus some terms that arise from the "concentration bounds").

From this theorem, in order to minimize the error on the real distribution theory_a9d4a02ab56ca60e657921091aa6c4c80c08e8f7.png, we should try to simultaneously minimize the empirical error as well as the KL-divergence between the posterior and the prior.
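
To make this tradeoff concrete, here is a small sketch (my own; the exact logarithmic term and constants vary between statements of the theorem, so treat them as assumptions rather than as the bound above) that evaluates the right-hand side of a McAllester-style PAC-Bayes bound for discrete P and Q over a finite hypothesis class:

# Right-hand side of a McAllester-style PAC-Bayes bound (assumed constants):
#   L_D(Q) <= L_S(Q) + sqrt( (KL(Q || P) + ln(m / delta)) / (2 * (m - 1)) )
import math

def kl_divergence(Q, P) -> float:
    """KL(Q || P) for discrete distributions given as lists of probabilities."""
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

def pac_bayes_rhs(empirical_loss: float, Q, P, m: int, delta: float) -> float:
    penalty = (kl_divergence(Q, P) + math.log(m / delta)) / (2 * (m - 1))
    return empirical_loss + math.sqrt(penalty)

P = [0.25, 0.25, 0.25, 0.25]        # prior over 4 hypotheses
Q = [0.70, 0.10, 0.10, 0.10]        # data-dependent posterior
print(pac_bayes_rhs(empirical_loss=0.08, Q=Q, P=P, m=1000, delta=0.05))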

Observe that for a fixed theory_5dc59e8e67581e3469d6fd42edc34acb3f86c387.png, using a standard Hoeffding's inequality, we have that

theory_7735ca4ab8a3dc3c0669dd3b360d82133659b4f9.png

which means the probability that the generalization error of a given hypothesis exceeds a given constant is exponentially small. Thus, with very high probability, the generalization error is bounded by a small constant. Roughly, this says that theory_7caa76eb6d3d012669f8c521b30d6197e5848789.png behaves like a univariate Gaussian. Using concentration bounds, this further implies that

theory_d10d876a7c3e8d4bcaa4950673cab0b8f46cc3b0.png

and therefore, with high probability over theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png,

theory_4403db6d6c29cce5bd8c4d063d805558c9c117fc.png

Then consider the expression (which can be obtained by working backwards from the statement of the claim)

theory_2ba951290babdd20ed0ab3eb1149f2405140a176.png

where the inequality is by convexity of squares. This in turn is now

theory_77523e4f1b689ba1aabbfeb4d6557229a9d841ae.png

where the last inequality uses Jensen's Inequality along with the concavity of theory_e82aae4aea050466d40f9f52f55d7ba840cebca8.png. Also, we have

theory_0c20784f67f5582dafe7d27cc7699a727273ebfb.png

where the last step is a standard trick: using the KL-divergence term to switch the distribution over which the expectation is taken (the same trick as is used in Importance Sampling).

Hence,

theory_db705867e67ea0573b54d5fec990e3b443e39ebc.png

Now, substituting this into what we found earlier

theory_8424b1bda441fb53bc663f6da09e2a0cf3ca70c1.png

since theory_a9090c77ce9916955c745920bf8f134a0932d59d.png is independent of theory_e1606e7346bacb31c433e48ab9a27150bcbdc1f8.png. This then implies (again, from what we found above)

theory_6356744b284941d1216f84ee6cf1680354527390.png

Combining the two previous expressions

theory_fbaa3c9427dd94549f307a07217a9a7cbdff38d2.png

as claimed!

Appendix

Measure Concentration

Let theory_ca2bcc80309e331f71dd2c1e1f4788cbc048c242.png be a random variable that takes values in theory_5cb819dbdaa11557a460fe04a5ef95eede596daa.png. Assume that theory_bdc0a6c93aee244ef73ef1800641d1f83bfb27ec.png. Then, for any theory_f669321c14c029084bfaa68010166589e530f2ac.png,

theory_90da0639be9a09a7f865cebce88f068dfe1a76a5.png

This also implies that for every theory_f669321c14c029084bfaa68010166589e530f2ac.png:

theory_14e455c80846e52d9332e4224bdbfed0265ecda8.png
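
A quick Monte Carlo check (my own sketch, assuming the statement above is the reverse-Markov style bound of Lemma B.1 in shalev2014understanding, i.e. P[X > 1 - a] >= (mu - (1 - a)) / a for X in [0, 1] with mean mu and a in (0, 1)):

# Simulation check of the assumed reverse-Markov bound for a bounded random variable.
import random

def check_reverse_markov(a: float = 0.5, trials: int = 200_000) -> None:
    # Toy X in [0, 1]: X = max(U1, U2) for U1, U2 ~ Uniform[0, 1], so E[X] = 2/3.
    samples = [max(random.random(), random.random()) for _ in range(trials)]
    mu = sum(samples) / trials
    empirical = sum(1 for x in samples if x > 1 - a) / trials
    lower_bound = (mu - (1 - a)) / a
    print(f"P[X > {1 - a}] ~ {empirical:.3f}  >=  {lower_bound:.3f}")

check_reverse_markov()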

Bibliography

  • [shalev2014understanding] Shalev-Shwartz, S. & Ben-David, S., Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press (2014).