Probability & Statistics

Notation

  • probability_and_statistics_76b571658194dba7df95d99f157fd065b37baca6.png denotes the distribution of some random variable probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png
  • probability_and_statistics_6f82ddac501ce5b054fc3b57a1f3227a1635189b.png means convergence in probability for random elements probability_and_statistics_b8f2ec52633cebb1dd2ce4e5e7a81b4f1c142376.png, i.e.

    probability_and_statistics_740f77040e95c0745b000d45a1a6b2d791ac4f35.png

  • probability_and_statistics_63e40071feef04c1a02b6c24c537d86b96f40aad.png means convergence in distribution for random elements probability_and_statistics_b8f2ec52633cebb1dd2ce4e5e7a81b4f1c142376.png, i.e. probability_and_statistics_8a208b3eeac6c41afad48817500336a0be683bab.png in distribution if probability_and_statistics_da631b4f803a124bebb45cc9c50f2db5ee76fac5.png for all continuity points probability_and_statistics_2a61cee9c707231c73a1fcb6992bc4bb552044fe.png of probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png.
  • probability_and_statistics_bab1945a639aeb9a9625103ea54556b913a9334e.png means almost sure convergence for random elements probability_and_statistics_b8f2ec52633cebb1dd2ce4e5e7a81b4f1c142376.png

Theorems

Chebyshev Inequality

Let probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png (integrable) be a random variable with finite expected value probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and finite non-zero variance probability_and_statistics_356e0d9e7374e9995c8e63ce904f486552077872.png.

Then for any real number probability_and_statistics_f302e185a658106944d9873b0657e9a5bbfeb360.png,

probability_and_statistics_939f58b3404f99a76a6c6f12875ca3539bac2b43.png

Let probability_and_statistics_f071eab395d7012d199994f5f1863b98777ef5b9.png be a measure space, and let probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png be an extended real-valued measurable function defined on probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png.

Then for any real number probability_and_statistics_982d0368cdc0774aadd4157289b17796ac829613.png and probability_and_statistics_5172298f91093fb3163a6a36b56a5603fc6f6cd7.png

probability_and_statistics_283f037f7ac6522fc36558e5d6f443ac064dc8a8.png

Or more generally, if probability_and_statistics_c4f480233088a134e88f2426541b2f00ca318b55.png is an extended real-valued measurable function, nonnegative and nondecreasing on the range of probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png, then

probability_and_statistics_ac73652455286b782dee62241403690f2d3fc154.png

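To relate the probabilistic and measure-theoretic statements: take the measure space to be the underlying probability space, apply the measure-theoretic bound to the centered random variable (the random variable minus its expected value) with exponent 2, and choose the threshold to be a multiple of the standard deviation. The integral on the right-hand side is then bounded by the variance, which recovers the probabilistic form.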

Markov Inequality

If probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png is a nonnegative random variable and probability_and_statistics_ec5459e04a0145c81ea18a81b490c9126c10c1bf.png, then

probability_and_statistics_c08ed99d10cb4f80a59616e5545e8a95f6add864.png
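As a sanity check, both inequalities can be verified exactly on a small example. The sketch below assumes the standard forms P(X >= a) <= E[X] / a for Markov and P(|X - E[X]| >= t) <= Var[X] / t^2 for Chebyshev (the latter is the k-sigma form with t = k * sigma). The fair die and the helper names `markov_holds` and `chebyshev_holds` are illustrative choices, not part of the notes.

```python
from fractions import Fraction

# Fair six-sided die: outcomes 1..6, each with probability 1/6.
outcomes = range(1, 7)
p = Fraction(1, 6)

mean = sum(p * x for x in outcomes)               # E[X]
var = sum(p * (x - mean) ** 2 for x in outcomes)  # Var[X]

def markov_holds(a):
    """Check P(X >= a) <= E[X] / a for a threshold a > 0."""
    tail = sum(p for x in outcomes if x >= a)
    return tail <= mean / a

def chebyshev_holds(t):
    """Check P(|X - E[X]| >= t) <= Var[X] / t^2 for a threshold t > 0."""
    tail = sum(p for x in outcomes if abs(x - mean) >= t)
    return tail <= var / t ** 2

markov_ok = all(markov_holds(a) for a in range(1, 10))
chebyshev_ok = all(chebyshev_holds(Fraction(k, 2)) for k in range(1, 10))
print(mean, var, markov_ok, chebyshev_ok)
```

Exact rational arithmetic is used so that the comparisons are not affected by floating-point rounding.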

Law of large numbers

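There are mainly two laws of large numbers: the weak law and the strong law, which strengthens convergence in probability to almost sure convergence.

The weak law states that sample averages converge in probability to the mean, i.e.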

probability_and_statistics_223649aba07339cb5e35b94f3b43a8f8caf90175.png

or equivalently,

probability_and_statistics_1e049eca2f0d754c576d85af5b38b7a077ed3cec.png

Often the assumption of finite variance is made, but this is not necessary; it only makes the proof easier. Large or infinite variance makes the convergence slower, but the Law of Large Numbers still holds as long as the expected value is finite.

Central Limit Theorem

Suppose probability_and_statistics_8560f3130639e8ddd39f6a5019f01247b3e22327.png is a sequence of i.i.d. random variables with probability_and_statistics_8d742f828a62919cc14468d7b8aa3522f0092692.png and probability_and_statistics_0a1362ea4124dc0f960a6c4e16686a49003539ac.png. Then,

probability_and_statistics_29fee17b9a56a664db968e0eabfe0c7d4883efb0.png

where probability_and_statistics_19d744da1c5786e39c49ca4bb507065370d14872.png means convergence in distribution, i.e. that the cumulative distribution functions of probability_and_statistics_4f3e2f50ea65559a29fe580e5c9c7f9ae25688b4.png converge pointwise to the cdf of the probability_and_statistics_1f681812e9c1e4e0058d69f2960307238ef08524.png distribution: for any real number probability_and_statistics_01f7b7127bdc73de1b82efda8d5a045166ffb9fa.png,

probability_and_statistics_ac278c2f5eb7131dbaf90f8f03f97000da5a23d7.png
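A small simulation can make the statement concrete. The sketch below is only illustrative: it assumes Uniform(0, 1) summands, divides by the standard deviation so that the limit is the standard normal (rather than a normal with the variance of the summands), and uses a fixed seed; `standardized_mean` and `normal_cdf` are hypothetical helper names.

```python
import math
import random

random.seed(0)

def standardized_mean(n):
    """sqrt(n) * (sample mean - mu) / sigma for n Uniform(0, 1) draws."""
    mu, sigma = 0.5, math.sqrt(1 / 12)
    xs = [random.random() for _ in range(n)]
    return math.sqrt(n) * (sum(xs) / n - mu) / sigma

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

reps, n = 4000, 50
draws = [standardized_mean(n) for _ in range(reps)]

# Compare the empirical CDF of the standardized means with the normal CDF.
gaps = []
for z in (-1.0, 0.0, 1.0):
    empirical = sum(d <= z for d in draws) / reps
    gaps.append(abs(empirical - normal_cdf(z)))
    print(z, round(empirical, 3), round(normal_cdf(z), 3))

max_gap = max(gaps)
```

The empirical and limiting values should be close but not identical, since both the number of summands and the number of repetitions are finite.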

Jensen's Inequality

If

  • probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png is a convex function.
  • probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png is a rv.

Then probability_and_statistics_8b51d0876f55f11fba347dd4b711bc05a944464b.png.

Further, if probability_and_statistics_8cf373a6f87d085d7c046b5b9ae7c488db95973d.png (probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png is strictly convex), then probability_and_statistics_05c094e3d6ea292a57b75262a18726f788f72d8b.png "with probability 1".

If we instead have:

  • probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png is a concave function

Then probability_and_statistics_aede0a5151366465531aaf309f644da85a6695c6.png. This is the one we need for the derivation of the EM algorithm.

To get a visual intuition, here is an image from Wikipedia: jensens_inequality.png
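Both directions of the inequality can also be checked numerically. The sketch below assumes the usual statements E[f(X)] >= f(E[X]) for convex f and E[f(X)] <= f(E[X]) for concave f, and uses a fair die with f(x) = x^2 and f(x) = log(x) as illustrative choices.

```python
import math
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)
mean = sum(prob * x for x in values)

# Convex case, f(x) = x^2: E[f(X)] - f(E[X]) should be nonnegative.
e_square = sum(prob * x ** 2 for x in values)
convex_gap = e_square - mean ** 2

# Concave case, f(x) = log(x): f(E[X]) - E[f(X)] should be nonnegative.
e_log = sum(float(prob) * math.log(x) for x in values)
concave_gap = math.log(float(mean)) - e_log

print(convex_gap, concave_gap)
```

For f(x) = x^2 the gap is exactly the variance of X, which is one way to remember the direction of the inequality.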

Continuous Mapping Theorem

Let probability_and_statistics_5d689f8304a70fb660ad6e0c1d1aca55c6405e9d.png, probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png be random elements defined on a metric space probability_and_statistics_391ba25320bcd9d67b1952507e2350803f994ba7.png.

Suppose a function probability_and_statistics_71f2ed4f8e02e20382a5e76d2a17479636a0f336.png (where probability_and_statistics_cb8865ca89058a083e8ab7f2aac9d0d9c57dcaee.png is another metric space) has the set of discontinuity points probability_and_statistics_cf111ad9f58bdc6ed23da48abb27270722c9122b.png such that probability_and_statistics_fa3672bf05324d688aeec6b0d010059fe6ebafb6.png.

Then,

  1. probability_and_statistics_707f12bd55775fb848c3a0488951517d89ebe761.png
  2. probability_and_statistics_052925c9cdfafefc75e66267be3f410b4746b429.png
  3. probability_and_statistics_85eddc2371cd0b303fa0ef65ae53faad4f864f49.png

We start with convergence in distribution, for which we will need a particular statement from the

Slutsky's Lemma

Let probability_and_statistics_5d689f8304a70fb660ad6e0c1d1aca55c6405e9d.png, probability_and_statistics_5ab5546f94de57667111f3a27e369a3c809a7b85.png be sequences of scalar/vector/matrix random elements.

If probability_and_statistics_767bc0da6b46e811005aecc340a329e590bcb1f8.png and probability_and_statistics_f4640df252b2fe03e95014c2490da0a1219521a9.png, where probability_and_statistics_f4cb67c3046f8cf86ce57c6e7e8d2baaba5fcfea.png is a constant, then

  • probability_and_statistics_b55bb2771014e0d8a833c76315cd0a4f1179892e.png
  • probability_and_statistics_a26c99a1d2a2e4a2024014f9332c5c41233c7179.png
  • probability_and_statistics_d07cc7adb2f5804b57b8ad965c18f2ea68675120.png

This follows from the fact that if:

probability_and_statistics_b932bb0a405177c1cc7c31f37dd93d0d21eeb271.png

then the joint vector probability_and_statistics_4097f5c33b045b6b19fdffe8e179d794420b62b6.png converges in distribution to probability_and_statistics_ab698b4180ccf93a7f2f6e8d327d73106c98f4a4.png, i.e.

probability_and_statistics_fbc30cf4b85c76d8cc0859184d0317149553f877.png

Next we simply apply the Continuous Mapping Theorem, recognizing the functions

  • probability_and_statistics_b55002805d6ce6c4fb2193f0872e23b2f6e999af.png
  • probability_and_statistics_993f68939c8306780aa48b58099dbf5d2c4dc0a4.png
  • probability_and_statistics_b8945ead0d5211a55683dee98c9b920dc9faa9ec.png

as continuous (for the last mapping to be continuous, probability_and_statistics_bb81403cb60ae4e207645fc7ca64fe389ba7aa73.png has to be invertible).

Definitions

Probability space

A probability space is a measure space probability_and_statistics_39381b4b66a794c69f418f6af558c753002f06f9.png such that the measure of the whole space is equal to one, i.e.

  • probability_and_statistics_c78149fe826553ad47f6b160861906066538c7ea.png is the sample space (some arbitrary non-empty set)
  • probability_and_statistics_ff549cdd76c6683a0c20055822a5a03a2f524e82.png is a σ-algebra over probability_and_statistics_c78149fe826553ad47f6b160861906066538c7ea.png, i.e. the set of possible events
  • probability_and_statistics_d3fa34d1c4a76e180898ed3f9fc7e52bad0d47e2.png is the probability measure, i.e. a measure such that

    probability_and_statistics_f729f6b948834db019e09d6b944537e83af5e06c.png

Random measure

Let probability_and_statistics_a1e2c57ce3a3d2af123967adb15ba31a234a00fc.png be a separable metric space (e.g. probability_and_statistics_83034389c37802feed72ca0f76bd9001048a983d.png) and let probability_and_statistics_fae3eb5f942052606531fc35e2661af0335f6a79.png be its Borel σ-algebra.

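probability_and_statistics_229f2d5e252097bbb2ad0376b3dd088009b2afad.png is a transition kernel from a probability space probability_and_statistics_6fa50f70b96b0fda78f6e3bbdb4e520318175244.png to probability_and_statistics_09717fa32ad85661d114a682afbfc96bc9190ed2.png if

  • for every fixed Borel set, the mapping from sample points to the value the kernel assigns to that set is measurable, and
  • for every fixed sample point, the mapping from Borel sets to the value the kernel assigns to them is a measure.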

We say a transition kernel is locally finite if the measures

probability_and_statistics_19f542949698254e2fed8c92a53074d43962d834.png

satisfy probability_and_statistics_c018397ab7f7c265570e4c87d6d55408f8e7b253.png for all bounded measurable sets probability_and_statistics_c8044fd0c6fec5e750d499ec72e0a1f923a6ad8b.png and for all probability_and_statistics_84e2dca1f625626311815a4a31fa84c2e17d4d12.png except for some zero-measure set (under probability_and_statistics_a0e42a3969013d94a065630de2ab62bb3d124b8f.png).

A random measure probability_and_statistics_229f2d5e252097bbb2ad0376b3dd088009b2afad.png is an (a.s.) locally finite transition kernel from an (abstract) probability space probability_and_statistics_6fa50f70b96b0fda78f6e3bbdb4e520318175244.png to probability_and_statistics_09717fa32ad85661d114a682afbfc96bc9190ed2.png.

Let probability_and_statistics_229f2d5e252097bbb2ad0376b3dd088009b2afad.png be a random measure on the measurable space probability_and_statistics_56cb33e0a77f13ccb23cd53d93838e923d4f9cfb.png and denote the expected value of a random element probability_and_statistics_a4242ffc7d6e6fc3298db18c548b41b6309e0a68.png with probability_and_statistics_16779a84a64b23033e769a67377a1ae7f5426a1e.png.

The intensity measure of probability_and_statistics_229f2d5e252097bbb2ad0376b3dd088009b2afad.png is defined as

probability_and_statistics_564aadc288a0656b3fe8ece5a3de2a26091032ec.png

So it's a non-random measure which sends an element probability_and_statistics_f75be59687ef2da76b04e88af02585ab68efc56c.png of the σ-algebra probability_and_statistics_4cc68ebefc13f65a9f085f468fdf9abda339a257.png to the expected value of probability_and_statistics_bea086a068e2d855d6a6b407e7da06fe0f5ad402.png, which is well defined since probability_and_statistics_bea086a068e2d855d6a6b407e7da06fe0f5ad402.png is a measurable function, i.e. a random variable.

Poisson process

A Poisson process is a generalized notion of a Poisson random variable.

A point process probability_and_statistics_53f1e85ad97d85e1d44c21eeddd7b49e5c1cf97a.png is a (general) Poisson process with intensity probability_and_statistics_cdcd5e2526eab592b865153e93442d1697a4ef60.png if it has the two following properties:

  1. The number of points in a bounded Borel set probability_and_statistics_86504a08fe587f7765c3c6e108c08ed5edee64f8.png is a Poisson random variable with mean probability_and_statistics_cdaa120a7b0f640c7d6d313c940fae69ecf79a06.png.
    • In other words, denote the total number of points located in probability_and_statistics_86504a08fe587f7765c3c6e108c08ed5edee64f8.png by probability_and_statistics_e02ffcd1eafa0876ba1203d0445177be82635a2e.png, then the probability of random variable probability_and_statistics_e02ffcd1eafa0876ba1203d0445177be82635a2e.png being equal to probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png is given by

      probability_and_statistics_642ff3aa1da671258ecf5ddb2d770cb275378ce3.png

  2. The numbers of points in probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png disjoint Borel sets form probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png independent random variables.

The Radon measure probability_and_statistics_cdcd5e2526eab592b865153e93442d1697a4ef60.png maintains its previous interpretation of being the expected number of points probability_and_statistics_53f1e85ad97d85e1d44c21eeddd7b49e5c1cf97a.png located in the bounded region probability_and_statistics_86504a08fe587f7765c3c6e108c08ed5edee64f8.png, namely

probability_and_statistics_a2c53813d739c83bcce09ea418f5f1852fc79ed4.png
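The two defining properties can be illustrated with a simulation of the simplest case. The sketch below assumes a homogeneous Poisson process on the interval [0, 10] with constant rate 2, so that the intensity measure of an interval is the rate times its length; the points are generated from exponential inter-arrival times. The names `sample_points` and `count_in`, the rate, the intervals, and the seed are all illustrative assumptions.

```python
import random

random.seed(1)

def sample_points(rate, horizon):
    """Points of a homogeneous Poisson process on [0, horizon]."""
    points, t = [], random.expovariate(rate)
    while t <= horizon:
        points.append(t)
        t += random.expovariate(rate)
    return points

def count_in(points, a, b):
    """Number of points falling in the interval [a, b)."""
    return sum(a <= x < b for x in points)

rate, horizon, reps = 2.0, 10.0, 5000
counts_b1, counts_b2 = [], []
for _ in range(reps):
    pts = sample_points(rate, horizon)
    counts_b1.append(count_in(pts, 0.0, 3.0))   # expected count: 2 * 3
    counts_b2.append(count_in(pts, 5.0, 6.0))   # expected count: 2 * 1

mean_b1 = sum(counts_b1) / reps
mean_b2 = sum(counts_b2) / reps
# Counts in disjoint sets are independent, so their covariance should be near 0.
cov = sum((x - mean_b1) * (y - mean_b2)
          for x, y in zip(counts_b1, counts_b2)) / reps
print(mean_b1, mean_b2, cov)
```

A near-zero covariance is of course only a necessary symptom of independence, not a proof of it.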

Cox process

Let probability_and_statistics_229f2d5e252097bbb2ad0376b3dd088009b2afad.png be a random measure.

A random measure probability_and_statistics_284e544994dfc6017bf26731df55f59aeb4c884b.png is called a Cox process directed by probability_and_statistics_229f2d5e252097bbb2ad0376b3dd088009b2afad.png, if probability_and_statistics_e9305368d31ae5444526567d26f42e962e2965dd.png is a Poisson process with intensity measure probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png.

Here probability_and_statistics_e9305368d31ae5444526567d26f42e962e2965dd.png is the conditional distribution of probability_and_statistics_284e544994dfc6017bf26731df55f59aeb4c884b.png, given probability_and_statistics_ac4950cb287d723efa49062fa95156a1602201e8.png.

Degeneracy of probability distributions

A degenerate probability distribution is a probability distribution with support only on a lower-dimensional space.

E.g. if the degenerate distribution is univariate (involving a single random variable), then it's a deterministic distribution, only taking on a single value.

Characteristic function

Let probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png be a random variable with density probability_and_statistics_9b9582bebf53d3b622e3bdb42734eda43d8da43d.png and cumulative distribution function probability_and_statistics_e0fbcbd2b5d12e821c3ad91d7dfcd8ae995d3845.png.

Then the characteristic function is defined as the Fourier transform of probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png:

probability_and_statistics_9aa80a357c71b649442192cc2bed7161c5bb008b.png

Examples

10 dice: you want at least one 4, one 3, one 2, one 1

Consider the number of misses before a first "hit". When we say "hit", we refer to throwing something in the set probability_and_statistics_ef30f1d70efb7f0c8bbaa59c7a0fea1006ae97e0.png.

Then we can do as follows:

  • probability_and_statistics_e4dc5a2e9812a317cef01d7dd76116034a90d841.png is the # of misses before our first "hit"
  • probability_and_statistics_41cbd01f8d5c9d23da7dffb77d9b9cb54a9381bf.png is the # of misses between our first "hit" and second "hit"

We then observe the following:

  • probability_and_statistics_ba4045607cd5baec5180b6d29cba86d9d67c12f1.png since if we have more than probability_and_statistics_c0264f6ff12ab035691e301aaa2c2bc4570b59b0.png misses before our first hit, we'll never be able to get all the events in our target set
  • probability_and_statistics_bd72210d945089aeb447b602cca53936a6b2107b.png
  • probability_and_statistics_799a60e66b287ad7cd52710587ddcc1e5031229a.png
  • probability_and_statistics_126fc08bc39f4faa48dd8c6bf4f2c0a5954bc79f.png

We also observe that there are probability_and_statistics_30ce0129e01b093bfaafe7f29545ad0fa22f1458.png, probability_and_statistics_b23dc6a7f15e5627e83a04fbf67b200e98177bdf.png, probability_and_statistics_d3b35b9bb6a95a530a52b82dd7459693a0d4dfe3.png and probability_and_statistics_48d2827c20cd687d4d96417ddb44d1dc61fd8455.png ways of missing for each of the respective target events.

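from sympy import Sum, factorial, symbols

# In this sum, r_i - 1 is the number of misses preceding the i-th new "hit",
# so every r_i starts at 1 and the r_i together use at most 10 throws.
r_1, r_2, r_3, r_4 = symbols("r_1 r_2 r_3 r_4", nonnegative=True, integer=True)

# Before the i-th new hit there are 2, 3, 4, 5 ways to miss, respectively.
s = Sum(
    Sum(
        Sum(
            Sum(
                (2**(r_1 - 1) * 3**(r_2 - 1) * 4**(r_3 - 1) * 5**(r_4 - 1))
                / 6**(r_1 + r_2 + r_3 + r_4),
                (r_4, 1, 10 - r_1 - r_2 - r_3)
            ), (r_3, 1, 9 - r_1 - r_2)
        ), (r_2, 1, 8 - r_1)
    ), (r_1, 1, 7)
)

# The four target values can be hit for the first time in any of 4! orders.
res = factorial(4) * s.doit()
print(res.evalf())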
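As a cross-check on the nested sum, the same probability can be computed by inclusion-exclusion over the target values that fail to appear. This is a different technique from the one used above, sketched here under the assumption of fair six-sided dice; the function name `prob_all_targets` and its parameters are hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_all_targets(n_dice=10, n_targets=4, sides=6):
    """P(each of n_targets faces appears among n_dice rolls), by inclusion-exclusion."""
    return sum(
        (-1) ** k * comb(n_targets, k) * Fraction(sides - k, sides) ** n_dice
        for k in range(n_targets + 1)
    )

p = prob_all_targets()
print(float(p))
```

With only 4 dice the event forces one die per target value, so `prob_all_targets(4)` reduces to 4!/6^4, which is a quick way to convince oneself that the formula is set up correctly.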

Probability "metrics" / divergences

Notation

Overview of the "metrics" / divergences

These definitions are taken from arjovsky17_wasser_gan

The Total Variation (TV) distance

probability_and_statistics_15f8f063db44a896552c2da6e3463251f2a15c97.png

Or, using slightly different notation, the total variation between two probability measures probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png is given by

probability_and_statistics_ce00f6843462b6b1d623de0977b0b834648e38b8.png

These two notions are completely equivalent. One can see this by observing that any discrepancy between probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png is "accounted for twice", since probability_and_statistics_22e4910a19de34caa248de2945f8fac08e39a699.png, which is why we get the factor of probability_and_statistics_e409955968971d39838a300ee5ac843f7def4c98.png in front. We can then gather all probability_and_statistics_46bb2460b54adf6ecdb59225d2987f80531ea4cc.png where probability_and_statistics_7ef4ed582a31cdd7f5571926018a06bb136501dc.png into a subset probability_and_statistics_b04f842ecfc630d02275c6f56253f0458526593e.png, choosing probability_and_statistics_2a61cee9c707231c73a1fcb6992bc4bb552044fe.png or its complement probability_and_statistics_72cedcf0228893c0327bd06991462b53b85462b8.png such that the difference is positive when probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png is the first term. Then we end up with the probability_and_statistics_c8f8b18a73a20338ab63f6bd975065ec79aefe1a.png seen above.
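The equivalence can be checked by brute force on a small discrete example. The sketch below assumes the two forms are the supremum over events of the absolute difference in probability and half the sum of absolute differences of the probability mass functions; the distributions `mu` and `nu` on three points are made up for illustration.

```python
from itertools import combinations

mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.2, "b": 0.3, "c": 0.5}
points = sorted(mu)

# Half the L1 distance between the probability mass functions.
tv_half_l1 = 0.5 * sum(abs(mu[x] - nu[x]) for x in points)

def mass(dist, event):
    """Probability that dist assigns to a finite event."""
    return sum(dist[x] for x in event)

# Supremum of |mu(A) - nu(A)| over all events A (here: all subsets).
tv_sup = max(
    abs(mass(mu, event) - mass(nu, event))
    for size in range(len(points) + 1)
    for event in combinations(points, size)
)
print(tv_half_l1, tv_sup)
```

The maximizing event is exactly the set of points where one distribution puts more mass than the other, matching the argument in the paragraph above.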

The Jensen-Shannon (JS) divergence

probability_and_statistics_5c5ff5270fdafb55835d64775de128ef7baf2004.png

where probability_and_statistics_5e7bc7c0100d52fce39b5278475a397d32c65cfb.png is the mixture probability_and_statistics_1d048f2e6862b1d24c893619df538c432da09e74.png.

This divergence is symmetrical and always defined because we can choose probability_and_statistics_d8c261b64618f9b3bb256477cc9325e92c09184f.png.
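To see why the mixture makes the divergence always defined, consider two discrete distributions with disjoint supports. The sketch below assumes the common normalization with a factor 1/2 on each KL term, which may differ from the formula above by a constant factor; `kl` and `js` are illustrative helper names.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if p has mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def js(p, q):
    """Jensen-Shannon divergence, using the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0]
q = [0.0, 1.0]
print(kl(p, q), js(p, q), js(q, p))
```

The mixture puts mass wherever either distribution does, so neither KL term can divide by zero, while the plain KL divergence between the two is infinite here.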

The Earth-Mover (EM) distance or Wasserstein-1

probability_and_statistics_35253b6ed399e5773c99117195a4cede7ba654e2.png

where probability_and_statistics_631f6e3db940e5975d362f1bde03784e7878b2ca.png denotes the set of all joint distributions probability_and_statistics_88b5585f30165dcab1c1a3fc9a8dbc4b0436bbe5.png whose marginals are respectively probability_and_statistics_0f1ae73ace2b0c4e5ae7aa070fedca56b6c08b91.png and probability_and_statistics_9de319e0219ab5c5963d28c8ecf13a148f9348a9.png.

Intuitively, probability_and_statistics_88b5585f30165dcab1c1a3fc9a8dbc4b0436bbe5.png indicates how much "mass" must be transported from probability_and_statistics_2a61cee9c707231c73a1fcb6992bc4bb552044fe.png to probability_and_statistics_bb81403cb60ae4e207645fc7ca64fe389ba7aa73.png in order to transform the distribution probability_and_statistics_0f1ae73ace2b0c4e5ae7aa070fedca56b6c08b91.png into the distribution probability_and_statistics_9de319e0219ab5c5963d28c8ecf13a148f9348a9.png. The EM distance then is the "cost" of the optimal transport plan.
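In one dimension the optimal transport cost has a simple closed form, which makes for an easy worked example. The sketch below relies on a standard fact that is not stated in these notes: on the real line, the Wasserstein-1 distance equals the integral of the absolute difference of the two CDFs. It further assumes both distributions live on the same equally spaced grid; `wasserstein_1d` is a hypothetical helper.

```python
from itertools import accumulate

def wasserstein_1d(p, q, spacing=1.0):
    """W1 between two distributions on the same equally spaced grid."""
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return spacing * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))

# All mass at grid position 0 versus all mass at grid position 3.
point_a = [1.0, 0.0, 0.0, 0.0]
point_b = [0.0, 0.0, 0.0, 1.0]

# Two half-masses, shifted by one grid step each.
spread_a = [0.5, 0.5, 0.0]
spread_b = [0.0, 0.5, 0.5]

print(wasserstein_1d(point_a, point_b), wasserstein_1d(spread_a, spread_b))
```

For the two point masses the cost is just the distance between them, since the whole unit of mass has to travel that far.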

Kullback-Leibler divergence

Definition

The Kullback-Leibler (KL) divergence

probability_and_statistics_6676f6b4d3d6bb4ec321bdfc19226ef597872061.png

where both probability_and_statistics_9d88af76045e722a4957638fefe95149ba0f2ccc.png and probability_and_statistics_048f6c98e916ed8890f70a26012e2f994b6f56f5.png are assumed to be absolutely continuous, and therefore admit densities, with respect to the same measure probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png defined on probability_and_statistics_15e882c4dc9f1a3d88edb27e9d05f348c9acedc3.png.

The KL divergence is famously asymmetric and possibly infinite / undefined when there are points s.t. probability_and_statistics_d072cdb0ac4e498f601841630f01a7af5105538c.png and probability_and_statistics_bd856473b8ed8d720e8c5ea8c52c009d90f08c5a.png.

probability_and_statistics_090fc60806a21f238b45ddae3a5c6fc8d934995f.png

probability_and_statistics_6913838719b4627918abba268e01b98f180519c7.png

where in the inequality we have used Jensen's inequality together with the fact that probability_and_statistics_2c1f31769b603e4f75ba9b88063933554544b450.png is a convex function.

Why?

The Kullback-Leibler divergence from a probability distribution probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png to probability_and_statistics_a0e42a3969013d94a065630de2ab62bb3d124b8f.png, denoted probability_and_statistics_b859ede0953aba24e6ce231a1e98b6b0ccc146ab.png, is a measure of the information gained when one revises one's beliefs from the prior distribution probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png to the posterior distribution probability_and_statistics_a0e42a3969013d94a065630de2ab62bb3d124b8f.png. In other words, it is the amount of information lost when probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png is used to approximate probability_and_statistics_a0e42a3969013d94a065630de2ab62bb3d124b8f.png.

Most importantly, the KL-divergence can be written

probability_and_statistics_5c3c7458772951986a158f5c2e40539a87c26ecd.png

where probability_and_statistics_c4705c72c454812b91c92c54f4966bca9e56a7f6.png is the optimal parameter and probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png is the one we vary to approximate probability_and_statistics_9217f1c2563bcbd94d5b4bcb7f5e303df6aab72c.png. The second term in the equation above is the only one which depends on the "unknown" parameter probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png (probability_and_statistics_c4705c72c454812b91c92c54f4966bca9e56a7f6.png is fixed, since this is the parameter we assume probability_and_statistics_b346b8eaf5913ed81b033dacc1d6568c142bbb4a.png to take on). Now suppose we have probability_and_statistics_53f1e85ad97d85e1d44c21eeddd7b49e5c1cf97a.png samples probability_and_statistics_4379f005c713a1722e2441620d255f9e096ae143.png from probability_and_statistics_b346b8eaf5913ed81b033dacc1d6568c142bbb4a.png; the negative log-likelihood for some parametrizable distribution probability_and_statistics_9974194a836f8c06efda1c78a55ab184c6c2d746.png is then given by

probability_and_statistics_deb09fd5ead37dd035dd53708523b34ba787d44a.png

By the Law of Large Numbers, we have

probability_and_statistics_f9bc2afae2b8a7011f568aa376e00e06b8fd7962.png

where probability_and_statistics_25e275d015724e807323a86de51e8ffa4e66c927.png denotes the expectation over the probability density probability_and_statistics_b346b8eaf5913ed81b033dacc1d6568c142bbb4a.png. But this is exactly the second term in the KL-divergence! Hence, minimizing the KL-divergence between probability_and_statistics_321066684cb7ebf577b9cbd698b835d9c6a60a6c.png and probability_and_statistics_2cce4a02a065f9611a08049224de9d1373d349c5.png is equivalent to minimizing the negative log-likelihood, or equivalently, maximizing the likelihood!
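The argument can be made concrete for a Bernoulli family, where everything is available in closed form. The sketch below is an illustration under stated assumptions: the "true" parameter 0.3, the grid of candidate parameters, and the function names `bernoulli_kl` and `expected_nll` are all made up; the expected negative log-likelihood stands in for the large-sample limit of the average negative log-likelihood.

```python
import math

def bernoulli_kl(theta_star, theta):
    """KL(p_theta_star || p_theta) for Bernoulli distributions."""
    return (theta_star * math.log(theta_star / theta)
            + (1 - theta_star) * math.log((1 - theta_star) / (1 - theta)))

def expected_nll(theta_star, theta):
    """-E_{x ~ p_theta_star}[log p_theta(x)], the limit of the average NLL."""
    return -(theta_star * math.log(theta)
             + (1 - theta_star) * math.log(1 - theta))

theta_star = 0.3
grid = [i / 100 for i in range(1, 100)]

best_kl = min(grid, key=lambda t: bernoulli_kl(theta_star, t))
best_nll = min(grid, key=lambda t: expected_nll(theta_star, t))

# The two objectives differ by a constant (the entropy of p_theta_star),
# so the spread of their difference over the grid should be numerically zero.
gaps = [expected_nll(theta_star, t) - bernoulli_kl(theta_star, t) for t in grid]
spread = max(gaps) - min(gaps)
print(best_kl, best_nll, spread)
```

Because the two objectives differ only by a term that does not depend on the varied parameter, they share the same minimizer, which is the point of the derivation above.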

Wasserstein metric

Let probability_and_statistics_0b45dad47edd15d6cb929ba568b05cfdd4963684.png be a metric space for which every probability measure on probability_and_statistics_4041476b998bc5a6ca4d7b8f4880e883a5a433a8.png is a Radon measure.

For probability_and_statistics_0f9e5afb032bf938ac874766e8e20ec557118937.png, let probability_and_statistics_cf5ff7433a32b914fc2e55e68544ee973b97c1a5.png denote the collection of all probability measures probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png on probability_and_statistics_4041476b998bc5a6ca4d7b8f4880e883a5a433a8.png with finite p-th moment (expectation of rv. to the p-th power) for some probability_and_statistics_8e6a1d1d852556dd4175325553aaca64f10d18ab.png,

probability_and_statistics_f7eba5bf9a6aa411f35c97cc06d97d26f1d3553c.png

Then the p-th Wasserstein distance between two probability measures probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png in probability_and_statistics_cf5ff7433a32b914fc2e55e68544ee973b97c1a5.png is defined as

probability_and_statistics_0fd681d25ef114ccab912604ae089952abeda5f5.png

where probability_and_statistics_71fdfb3dbd0bfb9a56d2df63f51306d73d81f6fe.png denotes the collection of all measures on probability_and_statistics_3e525fba042292e96ec4e5a32ccc0d81f2db8139.png with marginals probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png on the first and second factors respectively (i.e. all possible "joint distributions").

Or equivalently,

probability_and_statistics_bef7e6b8099ae5a29a143a88e0edaecdf8e571af.png

Intuitively, if each distribution is viewed as a unit amount of "dirt" piled on the metric space probability_and_statistics_4041476b998bc5a6ca4d7b8f4880e883a5a433a8.png, the metric is the minimum "cost" of turning one pile into the other, which is assumed to be the amount of dirt that needs to be moved times the distance it has to be moved.

Because of this analogy, the metric is sometimes known as the "earth mover's distance".

Using the dual representation of probability_and_statistics_f93561f2f2ccb7dcc8c5c96fec9ab9a92f3743a3.png, when probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png have bounded support:

probability_and_statistics_f146d3d2fba3dcb0e2bab27d3eeff88d0d0c1a9a.png

where probability_and_statistics_43ae4bbf19d2a7a43e46d8a4bc17dfeea7470ffe.png denotes the minimal Lipschitz constant for probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png.

Compare this with the definition of the Radon metric (metric induced by distance between two measures):

probability_and_statistics_45d158ad787a1cea33dca56289be067e8e125283.png

If the metric probability_and_statistics_01aac6f51c6e4bb7092c0bd6b155a3467d0a25f3.png is bounded by some constant probability_and_statistics_a4680369a3334fa4e273bbd161e5ca60a5d7c8a5.png, then

probability_and_statistics_ffbb43087beb9cb89b713fc9be9c336a1e8fc6a4.png

and so convergence in the Radon metric implies convergence in the Wasserstein metric, but not vice versa.

Observe that we can write the duality as

probability_and_statistics_012fa0e7f2dd0cf0b5f0a8df6799eca7e77c23bc.png

where probability_and_statistics_49dab3e5d26ffea4ff50c2e994bfc76c151b2832.png denotes that the supremum is taken over all functions which are 1-Lipschitz, i.e. Lipschitz continuous with Lipschitz constant 1.

Integral Probability Metric

Let probability_and_statistics_a0e42a3969013d94a065630de2ab62bb3d124b8f.png and probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png be probability distributions, and probability_and_statistics_0107cb57f587212f32e488764cfac6481341adff.png be some space of real-valued functions. Further, let

probability_and_statistics_406c5ae7192f2616787d97df3ba5478839174db0.png

When probability_and_statistics_0107cb57f587212f32e488764cfac6481341adff.png is sufficiently "large", then

probability_and_statistics_5aef840beb21aacd2c33025ae518749c112c1715.png

We then say that probability_and_statistics_0107cb57f587212f32e488764cfac6481341adff.png together with probability_and_statistics_fd9feb3ac7b8c5f70ba6a38d738424e369f5e1a9.png defines an integral probability metric (IPM).

Stein's method

Absolutely fantastic description of it: https://statweb.stanford.edu/~souravc/beam-icm-trans.pdf

Notation

  • probability_and_statistics_3bd1f26e899259191c15988046ebcac426ab307c.png any random variable
  • probability_and_statistics_819394bc33557d54ecc2b0d62dfae8541b0d6a97.png std. Gaussian random variable

Overview

  • Technique for proving generalized central limit theorems
  • Ordinary central limit theorem: if probability_and_statistics_8560f3130639e8ddd39f6a5019f01247b3e22327.png are i.i.d. rvs. then

    probability_and_statistics_d95f5d5092a91db1485315d9adf7de24921ec21a.png

    where probability_and_statistics_93101acd53a3ec01fc537e9ae95ce5c212ef7075.png and probability_and_statistics_6d20ebcfa2ba32a7d49b2d7d2dfba846c81cb2f7.png

  • Usual method of proof:
    1. LHS is computed using Fourier Transform
    2. Independence implies that FT decomposes as a product
    3. Analysis
  • Stein's motivation: what if probability_and_statistics_83fcec7fe2a3faa4f10b5ac9b967d3b554aa4781.png are not exactly independent?!

Method

Suppose we want to show that probability_and_statistics_3bd1f26e899259191c15988046ebcac426ab307c.png is "approximately Gaussian", i.e.

probability_and_statistics_97bf4878e445f70ed831153d6a334233265551c4.png

or,

probability_and_statistics_47af4c8323ca186f99ca9194e6e67f3e13436c8b.png

for any well-behaved probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png

It's a generalization because if

probability_and_statistics_3aa78d1458aa33c84a2f79a735326c3a6fe8b8a6.png

then

probability_and_statistics_dc6d183e0de31014fcbaa7481664674dc60c4124.png

Suppose

probability_and_statistics_95f9e07465d25e3b547ba77812d05567d7e508eb.png

and we want to show that the rv. probability_and_statistics_3bd1f26e899259191c15988046ebcac426ab307c.png is approximately std. Gaussian, i.e. probability_and_statistics_c35022c1a458da8f1d87bd2434fbc80d486f88a0.png for all well-behaved probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png.

  1. Given probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png, obtain a function probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png by solving the differential equation

    probability_and_statistics_cc778104f095ed18fb510de166bc7ffec2a23110.png

  2. Show that

    probability_and_statistics_6d59a08c063f138f4b2b64b7c7ec156d95a5ec62.png

    using the properties of probability_and_statistics_3bd1f26e899259191c15988046ebcac426ab307c.png

  3. Since

    probability_and_statistics_d5221c394226d9a4c7172c1901833206f8228f20.png

    conclude that probability_and_statistics_a1e8d5d49088236df38240215b19813434b21ec7.png.

Two smooth densities probability_and_statistics_048f6c98e916ed8890f70a26012e2f994b6f56f5.png and probability_and_statistics_9d88af76045e722a4957638fefe95149ba0f2ccc.png supported on probability_and_statistics_ad02fc814a4ba9ef833f6acacd935d6845e92c41.png are identical if and only if

probability_and_statistics_cf7f61d85987c1221904805376bdc1f5206dfcbf.png

for smooth functions probability_and_statistics_1d736e1951c74eba84ed2f52d092bfa972e95df8.png with proper zero-boundary conditions, where

probability_and_statistics_711f4fa65d303cbbb479fed90bffb5c4708519a7.png

is called the Stein score function of probability_and_statistics_9d88af76045e722a4957638fefe95149ba0f2ccc.png.

The Stein discrepancy measure between two continuous densities probability_and_statistics_048f6c98e916ed8890f70a26012e2f994b6f56f5.png and probability_and_statistics_9d88af76045e722a4957638fefe95149ba0f2ccc.png is defined as

probability_and_statistics_98584ffbc872f7831e47be408a34b1411c5068e3.png

where probability_and_statistics_d52e9d9efb831414835bb75d787728b7bc049127.png is the set of functions which satisfy the previous expectation identity and which is also "rich" enough to ensure that probability_and_statistics_2102d686ebc5440669a9aa62bf402ec74902cb95.png whenever probability_and_statistics_3a2249b73b83a8aa85f7a6421a8ef5806673870a.png.

Why does it work?

  • Turns out that if we replace the probability_and_statistics_6870d55eb94604a569089514b3d43c428bca9dca.png in Stein's method, then probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png is not well-behaved; it blows up at infinity.
  • A random variable probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png has the standard Gaussian distribution if and only if probability_and_statistics_2c012ff20c89cc292312a510c68c73b649ba1b46.png for all probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png! (You can easily check this by observing that the solution to this is probability_and_statistics_e006d7e67cd5d7c58573da0cf4417d076ec5f5a2.png.)
  • Differential operator probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png, defined as

    probability_and_statistics_5b2e9461b5fb19922591976e1b46ac804fb1000e.png

    is called a characterizing operator for the standard Gaussian distribution.
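The characterization above can be checked numerically. The sketch below assumes the identity takes its usual form, E[f'(X) - X f(X)] = 0, since the formulas in these notes are only available as images. The test function f = sin and the helper name `stein_residual` are illustrative choices, not part of the original notes.

```python
import numpy as np

def stein_residual(samples, f, f_prime):
    """Monte Carlo estimate of E[f'(X) - X f(X)] from samples of X."""
    x = np.asarray(samples, dtype=float)
    return float(np.mean(f_prime(x) - x * f(x)))

rng = np.random.default_rng(0)
n = 200_000
gaussian = rng.standard_normal(n)
# Uniform on [-sqrt(3), sqrt(3)]: mean 0 and variance 1, but not Gaussian.
uniform = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)

# Assumed test function f = sin, so f' = cos.
gauss_res = stein_residual(gaussian, np.sin, np.cos)
unif_res = stein_residual(uniform, np.sin, np.cos)
print(f"Gaussian residual: {gauss_res:+.4f}")  # close to zero
print(f"Uniform residual:  {unif_res:+.4f}")   # clearly non-zero
```

For the uniform sample the exact residual is cos(sqrt(3)), roughly -0.16, so matching the first two moments is not enough to make the residual vanish.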

Stochastic processes

Discrete

Simple random walk

Let probability_and_statistics_6ddd6b422e43663ee3de30b5b81e55b3440a85f7.png be a set of i.i.d. random variables such that

probability_and_statistics_efecddad0decce30a90147dd74204d0977dbd0d3.png

Then, let

probability_and_statistics_9daf1caf2634610f676dd68675bf716077fa2b71.png

We then call the sequence probability_and_statistics_0db4c09ac968b320b82e947c9bf6d77faecc3c70.png a simple random walk.

For large probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png, by the central limit theorem (CLT) we have

probability_and_statistics_973a2016bc2524f9442741c00f804122cd392f54.png

i.e. probability_and_statistics_f336ba34df1b2a703e0c2eb5eefc378adf7347de.png follows a normal distribution with probability_and_statistics_e249c9717135936f73db26e85662cc0ccc926195.png and probability_and_statistics_064c000af0a4aee04973ed5dcb00f42f56c1c98a.png.

We then have the following properties:

  1. probability_and_statistics_dd6c460da46d9c4aeb4ba9c437ce1e12b681ddae.png
  2. Independent increments: if

    probability_and_statistics_fdf9dcc8af1ec66c516b15d0d4d419a02df33f1d.png

    then probability_and_statistics_eb883a9b610766da49dd8525babb6547d3bfee3d.png are mutually independent, with probability_and_statistics_8c69df137e32229a0ea4f93e5721dd861621a6bd.png.

  3. Stationary: for all probability_and_statistics_e934c58037814bca1bccb135044fa927fe0535b9.png and probability_and_statistics_abb206289754d055ed8f6cd73e029b1430e9c49a.png, the distribution of probability_and_statistics_ea11933e53c6b757425d0ef0f028e4579436e1d3.png is the same as the distribution of probability_and_statistics_5eae771931044bee2c01ccad2085f316348ea776.png.
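These properties are easy to check by simulation. The sketch below assumes the usual definition of the simple random walk, with steps of +1 or -1 taken with probability 1/2 each; the function name `simulate_walks` and the sample sizes are arbitrary choices.

```python
import numpy as np

def simulate_walks(n_paths, n_steps, seed=0):
    """Return an (n_paths, n_steps) array of simple random walk positions."""
    rng = np.random.default_rng(seed)
    steps = rng.choice([-1, 1], size=(n_paths, n_steps))  # assumed +/-1 steps
    return np.cumsum(steps, axis=1)

walks = simulate_walks(20_000, 100)
final = walks[:, -1]                          # X_100
first_half = walks[:, 49]                     # X_50 - X_0
second_half = walks[:, 99] - walks[:, 49]     # X_100 - X_50

print("mean of X_100:", final.mean())         # close to 0
print("variance of X_100:", final.var())      # close to 100
# Disjoint increments should be uncorrelated and equally distributed.
print("corr of increments:", np.corrcoef(first_half, second_half)[0, 1])
print("variances:", first_half.var(), second_half.var())  # both close to 50
```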

Martingale

A martingale is a stochastic process which models a fair game.

We say a stochastic process probability_and_statistics_c20033ec657eb46bbe72c85909d733125f84d637.png is a martingale if

probability_and_statistics_d50026731e47883698fe6e2ea4b1a0a1d897caa9.png

where

probability_and_statistics_0a3a7b0a32e8effb3711c79a2db1e378c7efa4ad.png

Examples

  1. Simple random walk is a martingale
  2. Balance of a Roulette player is NOT a martingale

Optional stopping theorem

Given a stochastic process probability_and_statistics_338ef1517e6d418c97e9dffbec9dcbe404d89a23.png, a non-negative integer-valued r.v. probability_and_statistics_c6bef10eba2db5e87e10b36de4a645e2b0a4a5d7.png is called a stopping time if

probability_and_statistics_2faaf382bbf2313914e18da5e737d87af4370c53.png

depends only on probability_and_statistics_54a22ea8885cd6a2ee9942d55eb766623c9d120e.png (i.e. stopping at some time probability_and_statistics_c6bef10eba2db5e87e10b36de4a645e2b0a4a5d7.png only depends on the first probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png r.v.s).

Suppose probability_and_statistics_c9f38360db2281a72f40ec800ffa185efcfc94a2.png is a martingale and probability_and_statistics_c6bef10eba2db5e87e10b36de4a645e2b0a4a5d7.png is a stopping time.

If there exists a constant probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png s.t. probability_and_statistics_ecd978ee33cf248609e022af85f43b417203870b.png, then

probability_and_statistics_ecb4ecd5865243044ab78c3bc9a503cce74882fa.png

This implies that if a process can be described as a martingale, then it is a "zero-sum" game.
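A quick simulation illustrates the theorem. The sketch assumes the simple random walk started at 0 as the martingale, and the hypothetical stopping rule "stop on first hitting +barrier or -barrier, or at the horizon, whichever comes first". This rule is bounded by the horizon and depends only on the past of the walk.

```python
import numpy as np

def stopped_values(n_paths=20_000, horizon=200, barrier=5, seed=0):
    """Sample X_tau for tau = min(first time |X| >= barrier, horizon)."""
    rng = np.random.default_rng(seed)
    steps = rng.choice([-1, 1], size=(n_paths, horizon))
    paths = np.cumsum(steps, axis=1)
    hit = np.abs(paths) >= barrier
    # argmax gives the first True per row; fall back to the last index
    # for paths that never reach the barrier within the horizon.
    first = np.where(hit.any(axis=1), hit.argmax(axis=1), horizon - 1)
    return paths[np.arange(n_paths), first]

x_tau = stopped_values()
print("E[X_tau] estimate:", x_tau.mean())  # close to X_0 = 0
```

Changing the barrier or the horizon changes the distribution of the stopped value, but under these assumptions its mean should stay near the starting value.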

Continuous

Lévy process

A stochastic process probability_and_statistics_6fe695c4a4814164e23ad9341664dc2f87b5d080.png is said to be a Lévy process if it satisfies the following properties:

  1. probability_and_statistics_b61325819bb448b0b032f67afd9999d41870f416.png almost surely.
  2. Independence of increments: for any $0 \le t_1 \le t_2 \le \dots \le t_n < \infty$, the increments probability_and_statistics_2f5a64234536b10800478d34f7dd5540fbb30f25.png are independent.
  3. Stationary increments: for any probability_and_statistics_045d34b9936bfc64c0833b0045831eb09f189f3c.png, the r.v. probability_and_statistics_ffbf32ed9dcdfbb615cf3fc93b3cac6658898b0c.png is equal in distribution to probability_and_statistics_e0c02e1d736841e939738212cb3ce6c1cb227044.png.
  4. Continuity in probability: For any probability_and_statistics_f87e60cad32b94a8ab30bfe95c7b98ff47f0bdcc.png and probability_and_statistics_abb206289754d055ed8f6cd73e029b1430e9c49a.png it holds that

    probability_and_statistics_0577edf46339a88b983c31994b3d64cc9b599b6b.png

If probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png is a Lévy process then one may construct a version of probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png such that probability_and_statistics_0df49c4380a8a1a011b16897054385b01aa1667c.png is almost surely right-continuous with left limits.

Wiener process

The Wiener process (or Brownian motion) is a continuous-time stochastic process probability_and_statistics_d23f1e7bcaf5f0a454ec1c2f17a6a0b7a4accbbb.png characterized by the following properties:

  1. probability_and_statistics_20b13e023874eebf5f4e94cb5dcc97391797c6b3.png (almost surely)
  2. probability_and_statistics_3bd1f26e899259191c15988046ebcac426ab307c.png has independent increments: for probability_and_statistics_982d0368cdc0774aadd4157289b17796ac829613.png, future increments probability_and_statistics_63e0ae55d77d1a817c9d18f77ba025d5d782ef38.png with probability_and_statistics_152a6e04f66414823f87715e003b9baa988ca8e7.png, are independent of the past values probability_and_statistics_21f377196208ccd1789d758a02236d9429c0b81c.png, probability_and_statistics_2c886a8bb42c36f0f5aa6b114f9c73aef97a7400.png
  3. probability_and_statistics_3bd1f26e899259191c15988046ebcac426ab307c.png has Gaussian increments:

    probability_and_statistics_5dd3d720f62dc56ec2f3bdf96a58dcf699591989.png

  4. probability_and_statistics_3bd1f26e899259191c15988046ebcac426ab307c.png has continuous paths: with probability probability_and_statistics_6b314f00387982277c754a67e9002cc0f3dd7144.png, probability_and_statistics_d23f1e7bcaf5f0a454ec1c2f17a6a0b7a4accbbb.png is continuous in probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png

Further, let probability_and_statistics_f9570c68c91a75f8c63c7b7e67a5367e7a096aab.png be i.i.d. r.v.s with mean 0 and variance 1. For each probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png, define the continuous-time stochastic process

probability_and_statistics_eeeb188da110ab72442506a784fe856449fa8bdf.png

Then

probability_and_statistics_745e79fd6c9df4a91b1a6ff99bd544d7742a5f54.png

i.e. a random walk approaches a Wiener process.

Properties

Let probability_and_statistics_829b3d150b38bdba397ce7c2044cc81687195d05.png be a Wiener process and let

probability_and_statistics_b6b501312f78fefc185ffb17f53eff9520bcc860.png

Then

probability_and_statistics_610b84502639340b716bbe9b3df5365be7926a49.png

for any probability_and_statistics_72c333773d5518d578c0e288511ba915ba275d72.png.

Let probability_and_statistics_548bc022e2dcce3cb525d1d97eef9993b37c48bc.png be the first time such that

probability_and_statistics_387d3d655a589be895e0ac39b07da5c4ac8c1d21.png

Then

probability_and_statistics_4a27e72b8ce5f747055eea0d20af8ca6505438b3.png

Observe that

probability_and_statistics_8944136a2ece0e5ff8665c98917cce7aa77e7822.png

which is saying that at the point probability_and_statistics_548bc022e2dcce3cb525d1d97eef9993b37c48bc.png we have probability_and_statistics_5867c973c0bc03537bc371b2661e2dd41b75a73c.png, and therefore for all time which follows we're equally likely to be above or below probability_and_statistics_cb254ffe3e498884a0e8a2679f6c3a003f7219a5.png.

Therefore

probability_and_statistics_c1cab3eedb43daaa1f36a1b54c8506636ea9ae20.png

where in the last inequality we used the fact that the set of events probability_and_statistics_cf8b3217cd53e7c312d6c3b08ce10beeac3c69d6.png.
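The result can be sanity-checked by simulation. The sketch assumes the statement is the standard reflection principle, P(max over s <= t of B(s) >= a) = 2 P(B(t) >= a), and approximates Brownian paths on a grid. Discretisation makes the simulated maximum slightly too small, so only approximate agreement should be expected.

```python
import math
import numpy as np

def max_exceed_probability(a, t=1.0, n_paths=4000, n_steps=1000, seed=0):
    """Estimate P(max_{s<=t} B_s >= a) from discretised Brownian paths."""
    rng = np.random.default_rng(seed)
    dt = t / n_steps
    increments = rng.normal(0.0, math.sqrt(dt), size=(n_paths, n_steps))
    paths = np.cumsum(increments, axis=1)
    return float(np.mean(paths.max(axis=1) >= a))

def reflection_value(a, t=1.0):
    """2 * P(B_t >= a) for B_t ~ N(0, t), written via the erfc function."""
    return math.erfc(a / math.sqrt(2.0 * t))

simulated = max_exceed_probability(1.0)
exact = reflection_value(1.0)
print(f"simulated: {simulated:.3f}, reflection principle: {exact:.3f}")
```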

This theorem helps us solve the gambler's ruin problem!

Let probability_and_statistics_8667cfe9e29b2e47086b5f3b2383f1812eb2db13.png denote your bankroll at time probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png, and at each probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png it changes by

probability_and_statistics_f62f183bf4cbbbd9f05d10860a53c5e5a2b5fac0.png

The continuous approximation of this process is given by a Brownian motion, hence we can write the bankroll as

probability_and_statistics_b77b4fad336e0fd3be11e321ba69cf8235610575.png

where probability_and_statistics_d23f1e7bcaf5f0a454ec1c2f17a6a0b7a4accbbb.png denotes a Brownian motion (with probability_and_statistics_f45c07e15b41edd6a1fbbc63647b669873658ae0.png).

We're interested in how probable it is for us to be ruined before some time probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png. Letting probability_and_statistics_c6bef10eba2db5e87e10b36de4a645e2b0a4a5d7.png be the time where we go bankrupt, we have

probability_and_statistics_0a706eec79c7f47b15734d81e066ca1a083353c9.png

i.e. that we hit the time where we go bankrupt before time probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png.

Bankruptcy corresponds to

probability_and_statistics_6eeb149feb337c26a5f3df425ee258cf6cd9f093.png

where probability_and_statistics_d23f1e7bcaf5f0a454ec1c2f17a6a0b7a4accbbb.png denotes a Wiener process (hence centered at 0).

Then we can write

probability_and_statistics_8988d54a428abefc3b85c7de63c90fc0e430efcf.png

where the last equality is due to the symmetry of a Normal distribution and thus a Brownian motion.

Hence

probability_and_statistics_275f5d1e23b6ef8e9b0481479c8139421d9c4520.png

where probability_and_statistics_f53aa32229f9a3682ba1b04fb02defac3e7084fb.png denotes a standard normal distribution.
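As a worked example, the sketch below evaluates the ruin probability in closed form. It assumes the final expression is P(ruin before t) = 2 P(Z >= x_0 / sqrt(t)), with x_0 the initial bankroll and unit-variance steps. The function name `ruin_probability` is hypothetical.

```python
import math

def ruin_probability(bankroll, t):
    """2 * P(Z >= bankroll / sqrt(t)) for a standard normal Z."""
    if t <= 0:
        return 0.0
    # 2 * (1 - Phi(x)) equals erfc(x / sqrt(2)).
    return math.erfc(bankroll / math.sqrt(2.0 * t))

for t in (10, 100, 1000, 10_000):
    print(f"t = {t:>6}: P(ruin) = {ruin_probability(10, t):.4f}")
```

Under these assumptions the probability increases towards 1 as the time horizon grows: with a fair game and unlimited time, ruin is eventually almost certain.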

probability_and_statistics_8acf8f537cd46d21c834230f6d4827c5d3fb3a53.png

This is fairly straightforward to prove: just recall the law of large numbers and you should be good.

Non-differentiability

probability_and_statistics_829b3d150b38bdba397ce7c2044cc81687195d05.png is non-differentiable with probability 1.

Suppose, for the sake of contradiction, that probability_and_statistics_829b3d150b38bdba397ce7c2044cc81687195d05.png is differentiable with probability 1. Then, by the mean value theorem, we have

probability_and_statistics_3bc942e446432b21ddcf50ec5b2cd149113ea920.png

for some probability_and_statistics_f75be59687ef2da76b04e88af02585ab68efc56c.png and all probability_and_statistics_f87e60cad32b94a8ab30bfe95c7b98ff47f0bdcc.png. This implies that

probability_and_statistics_f2f4329ce630855a546d244bd24c309f37d07998.png

since probability_and_statistics_cdb5e28e79fe3eaaf368d0d6fc53f6063e6863d0.png.

Hence

probability_and_statistics_4bed1ff928e732e366b3890ca882d10a86576ddc.png

As probability_and_statistics_2fcf1aee34f0a7d916b96877f75cbc7e3ed4d640.png, since probability_and_statistics_f12115285201971da866ab0ecf93fbcbadc40e01.png, then

probability_and_statistics_2a133d8ad6678fc93bb72a93089471bd7743a39e.png

where

probability_and_statistics_fb3a091449418a8cbf803c8b2ad91e69282a30a7.png

Thus, when probability_and_statistics_2fcf1aee34f0a7d916b96877f75cbc7e3ed4d640.png, we have

probability_and_statistics_86b0401613fc24c6fbd13020048e98b83eabda22.png

since probability_and_statistics_a96c3682a717a4b70418287e378aead582440d64.png Hence,

probability_and_statistics_7169b7e893a1c902ea3e4bbac2400ec52f3e797d.png

contradicting our initial assumption.

Geometric Brownian motion

A stochastic process probability_and_statistics_3f0375356b1bce238abe274395c5106d94719a21.png is said to follow a Geometric Brownian Motion if it satisfies the following stochastic differential equation (SDE):

probability_and_statistics_3e383a181f852cce979617e40ee9def7f03a2fc6.png

where probability_and_statistics_d23f1e7bcaf5f0a454ec1c2f17a6a0b7a4accbbb.png is the Brownian motion, and probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png (percentage drift) and probability_and_statistics_8a303c22b5ff8640fb384bd5649f789459e04e8b.png (percentage volatility) are constants.
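A minimal simulation sketch, assuming the SDE is the standard one, dS = mu S dt + sigma S dW. Its well-known Itô solution is S_t = S_0 exp((mu - sigma^2 / 2) t + sigma W_t), which lets us sample S_t directly without discretising time. The parameter values below are arbitrary.

```python
import numpy as np

def gbm_terminal(s0, mu, sigma, t, n_paths, seed=0):
    """Sample S_t from the closed-form solution of the GBM SDE."""
    rng = np.random.default_rng(seed)
    w_t = rng.normal(0.0, np.sqrt(t), size=n_paths)  # W_t ~ N(0, t)
    return s0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * w_t)

samples = gbm_terminal(s0=1.0, mu=0.05, sigma=0.2, t=1.0, n_paths=100_000)
print("sample mean of S_1:", samples.mean())
print("S_0 * exp(mu * t): ", np.exp(0.05))
```

The sample mean should be close to S_0 exp(mu t), and every sample is strictly positive, which is one reason this process is commonly used for prices.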

Brownian bridge

A Brownian bridge is a continuous-time stochastic process probability_and_statistics_7e72e97ae3df73e8d356e302bdb93d89b7c28807.png whose probability distribution is the conditional probability distribution of a Brownian motion probability_and_statistics_829b3d150b38bdba397ce7c2044cc81687195d05.png subject to the condition that

probability_and_statistics_f26f5ec79bf335752ee274b910ccc667162650fe.png

so that the process is pinned at the origin at both probability_and_statistics_fd13652fa4f5dbee2e73cdfdb1f89580bcd038ae.png and probability_and_statistics_a77268150562e5ffdcee72b33eec9399e25c1f56.png. More precisely,

probability_and_statistics_04c00b827f93dab209164a88e56457cf1521ce7e.png

Then

probability_and_statistics_4a2932776e5e1e43e8d67c23f954a88d6e7e9d50.png

implying that the most uncertainty is in the middle of the bridge, with zero uncertainty at the nodes.
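The variance profile can be checked numerically. The sketch assumes the bridge lives on [0, 1] and uses the standard construction B(t) = W(t) - t W(1), which has the same distribution as a Brownian motion conditioned on W(1) = 0. Under that assumption the variance is t (1 - t).

```python
import numpy as np

def brownian_bridge(n_paths, n_steps, seed=0):
    """Return the time grid and bridge paths built as W(t) - t * W(1)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    w = np.concatenate(
        [np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)], axis=1
    )
    t = np.linspace(0.0, 1.0, n_steps + 1)
    return t, w - t * w[:, -1:]

t_grid, bridge = brownian_bridge(20_000, 100)
variance = bridge.var(axis=0)
print("variance at t = 0, 0.5, 1:", variance[0], variance[50], variance[-1])
```

The empirical variance is zero at both endpoints and peaks near 0.25 in the middle of the interval.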

Sometimes the notation probability_and_statistics_7e72e97ae3df73e8d356e302bdb93d89b7c28807.png is used for a Wiener process / Brownian motion rather than for a Brownian bridge.

Markov Chains

Notation

  • probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png denotes the transition matrix (assumed to be ergodic, unless stated otherwise)
  • probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png is the stationary distribution
  • probability_and_statistics_4b77f6de08addd6ec6d3aebc4a1e4f8835584e38.png denotes the initial state
  • probability_and_statistics_c78149fe826553ad47f6b160861906066538c7ea.png is the state space
  • probability_and_statistics_15e882c4dc9f1a3d88edb27e9d05f348c9acedc3.png denotes an uncountable state space
  • probability_and_statistics_3a00998aa171d7882b092b14117f04ef19d57f8c.png denotes probability_and_statistics_fd5efd1f6f004983c5d8a99edf6a16630dc4bff5.png which means all points except those with zero-measure, i.e. probability_and_statistics_c29e1fc5e116c246a118ce61186618a88e3c4532.png such that probability_and_statistics_636ff3b3d1af26910e464929925fbc6f7ac15706.png

Definitions

A Markov chain is said to be irreducible if it's possible to reach any state from any state.

A state probability_and_statistics_b7c30b8be6c76b11192e84f50168a1541d5fab2d.png has a period probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png if any return to state probability_and_statistics_b7c30b8be6c76b11192e84f50168a1541d5fab2d.png must occur in multiples of probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png time steps.

Formally, the period of a state is defined as

probability_and_statistics_a5b51994b705d4ae7d19d80ab1d47d6415460ec8.png

provided that the set is non-empty.

A state probability_and_statistics_b7c30b8be6c76b11192e84f50168a1541d5fab2d.png is said to be transient if, given that we start in state probability_and_statistics_b7c30b8be6c76b11192e84f50168a1541d5fab2d.png, there is a non-zero probability that we will never return to probability_and_statistics_b7c30b8be6c76b11192e84f50168a1541d5fab2d.png.

A state probability_and_statistics_b7c30b8be6c76b11192e84f50168a1541d5fab2d.png is said to be recurrent (or persistent) if it is not transient, i.e. guaranteed (with prob. 1) to have a finite hitting time.

A state probability_and_statistics_b7c30b8be6c76b11192e84f50168a1541d5fab2d.png is said to be ergodic if it is aperiodic and positive recurrent, i.e.

  • aperiodic: period of 1, i.e. returns to the current state are not restricted to multiples of some integer greater than 1
  • positive recurrent: has a finite mean recurrence time

If all states in an irreducible Markov chain are ergodic, then the chain is said to be ergodic.
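For a finite chain these definitions can be checked mechanically from the transition matrix. The sketch below is one possible approach, not taken from the notes: irreducibility is tested via reachability in the transition graph, and the period is estimated as the gcd of the return times observed up to a finite horizon. The default horizon `2 * n * n` is a heuristic assumption.

```python
from math import gcd
import numpy as np

def is_irreducible(P):
    """True if every state can reach every other state."""
    A = (np.asarray(P) > 0).astype(int)
    n = A.shape[0]
    # (I + A)^(n-1) has a positive (i, j) entry iff j is reachable from i.
    reach = np.linalg.matrix_power(np.eye(n, dtype=int) + A, n - 1)
    return bool((reach > 0).all())

def period(P, state, max_steps=None):
    """gcd of all k <= max_steps with a positive k-step return probability."""
    A = (np.asarray(P) > 0).astype(int)
    n = A.shape[0]
    if max_steps is None:
        max_steps = 2 * n * n  # heuristic horizon (assumption)
    reach = np.eye(n, dtype=int)
    d = 0  # 0 means no return was observed within the horizon
    for k in range(1, max_steps + 1):
        reach = ((reach @ A) > 0).astype(int)
        if reach[state, state]:
            d = gcd(d, k)
    return d

flip = np.array([[0.0, 1.0], [1.0, 0.0]])  # deterministic two-state cycle
lazy = np.array([[0.5, 0.5], [0.5, 0.5]])  # may stay put at every step
print(is_irreducible(flip), period(flip, 0))
print(is_irreducible(lazy), period(lazy, 0))
```

The two-state cycle is irreducible but has period 2, so it is not ergodic. The lazy chain is irreducible and aperiodic.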

Coupling

  • Useful for bounding the mixing rate of Markov chains, i.e. the number of steps it takes for the Markov chain to converge to the stationary distribution

Distance to Stationary Distribution

  • Use the total variation distance as a distance measure, therefore convergence is defined through

    probability_and_statistics_ca21fe9a08122c790547861908b946bfb91def74.png

  • probability_and_statistics_bb848906e9b96b63a418a86491fb2afeb04afd75.png denotes the variation distance between two Markov chain random variables probability_and_statistics_205a06f2a1f8767f163a826f406eeac7ffcea934.png and probability_and_statistics_986ad80dd5b7e02b5affa0b1a2aa9e7e106e0834.png, i.e.

    probability_and_statistics_e733f9ae2fc5d076c2745ef6ebd8f1cfca954877.png

  • One can prove that

    probability_and_statistics_5e764c786ee3fdb32f26238649a48c7c6301b9d2.png

    which allows us to bound the distance between a chain and the stationary distribution, by considering the difference between two chains, without knowing the stationary distribution!

Coupling

Let probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png and probability_and_statistics_a4242ffc7d6e6fc3298db18c548b41b6309e0a68.png be random variables with probability distributions probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png on probability_and_statistics_c78149fe826553ad47f6b160861906066538c7ea.png, respectively.

A distribution probability_and_statistics_0771ce7b70bacdbe91b857a46becb2dd6d7e0e92.png on probability_and_statistics_8a06bea97e1bf22ca393d64a607ed2a07547e941.png is a coupling if

probability_and_statistics_035e7fc1fa6275c7d3bc0d2e8a38a8a6e0ef8c0f.png

Consider a pair of distributions probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png on probability_and_statistics_c78149fe826553ad47f6b160861906066538c7ea.png.

  1. For any coupling probability_and_statistics_0771ce7b70bacdbe91b857a46becb2dd6d7e0e92.png of probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png, with probability_and_statistics_335e8e20ed8404a8a700087dd79cfa980e11bc15.png

    probability_and_statistics_e4bafec39e1b113bf02ca6695d41e52c80627330.png

  2. There always exists a coupling probability_and_statistics_0771ce7b70bacdbe91b857a46becb2dd6d7e0e92.png s.t.

    probability_and_statistics_574ed90cd963dcd29f0f14c39f9a73edc3d6ae13.png

  1. Observe that

    probability_and_statistics_081a65a3c01d0062a2122eb6fb0f1ba89c332adf.png

    Therefore,

    probability_and_statistics_1fd1d1e5cff63ca8b8a23be624d642d4869f8996.png

    Concluding the proof.

  2. "Inspired" by our proof in 1., we fix diagonal entries:

    probability_and_statistics_62639058dd9efaff586845f74cdfe3d69a7f6e23.png

    to make one of the inequalities an equality. Now we simply need to construct probability_and_statistics_0771ce7b70bacdbe91b857a46becb2dd6d7e0e92.png such that the above is satisfied, and it's a coupling. One can check that

    probability_and_statistics_79ba11fcadb94d600586462f1edafb587813fc47.png

    does indeed do the job.
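The construction in the proof can be written out for a finite state space. The sketch assumes the coupling takes the usual form: the diagonal is min(mu(x), nu(x)), and the leftover mass is spread off the diagonal in proportion to the product of the two excesses. The function names are illustrative.

```python
import numpy as np

def total_variation(mu, nu):
    """Total variation distance: half the L1 distance between mu and nu."""
    return 0.5 * float(np.abs(np.asarray(mu) - np.asarray(nu)).sum())

def optimal_coupling(mu, nu):
    """Joint distribution with marginals mu, nu and P(X != Y) = TV(mu, nu)."""
    mu = np.asarray(mu, dtype=float)
    nu = np.asarray(nu, dtype=float)
    overlap = np.minimum(mu, nu)       # mass placed on the diagonal
    tv = 1.0 - overlap.sum()           # leftover mass, equal to the TV distance
    omega = np.diag(overlap)
    if tv > 1e-15:
        # The outer product has a zero diagonal: at each x at most one
        # of the two excesses is non-zero.
        omega = omega + np.outer(mu - overlap, nu - overlap) / tv
    return omega

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
omega = optimal_coupling(mu, nu)
print("TV distance:", total_variation(mu, nu))
print("P(X != Y):  ", 1.0 - np.trace(omega))
```

For comparison, the independent coupling `np.outer(mu, nu)` also has the correct marginals, but its probability of disagreement is larger than the total variation distance, in line with part 1 of the lemma.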

Ergodicity Theorem

If probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png is irreducible and aperiodic, then there is a unique stationary distribution probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png such that

probability_and_statistics_6f07df6ecd427d5cf3cc773b3d6a1a58e1d556a4.png

Consider two copies of the Markov chain probability_and_statistics_f336ba34df1b2a703e0c2eb5eefc378adf7347de.png and probability_and_statistics_91425ab8e64471bd415802d37c3016da017cc73b.png, both following probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png. We create the coupling distribution as follows:

  1. If probability_and_statistics_d3faa8226a5a3ed5e083130a622d4deb6fb469cd.png, then choose probability_and_statistics_de1003b36c1c75b2de188504168ed13ddf1bc91e.png and probability_and_statistics_02f519893c29831ac6525ad402f3f39bc1fcdeb7.png independently according to probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png
  2. If probability_and_statistics_7274bd8904001c3ae290e72a81ba6ba003d5e17a.png, choose probability_and_statistics_3de1154a9a7c3e01c7cc591d1a5f6ac0a5e3a0d3.png and set probability_and_statistics_543d0ac6d1b36eb1de39995a274de69a9c4addcb.png

From the coupling lemma, we know that

probability_and_statistics_2898fde57124f2599f79761ab91b56514a999fdb.png

Due to ergodicity, there exists probability_and_statistics_ad426b9b6e56032bafbc7ea14909c19c2da6395f.png such that probability_and_statistics_25a1d8c2d120a0dcf12e048f767b71667744e81a.png. Therefore, there is some probability_and_statistics_f87e60cad32b94a8ab30bfe95c7b98ff47f0bdcc.png such that for all initial states probability_and_statistics_e47d8e2a1b6d76bb0c397e86bdfc26b5212c931d.png and probability_and_statistics_0b52766692d92389baf4fb95893460c6f4f3a482.png,

probability_and_statistics_d15e71cc5692a01c3bd429dcf92c712fb2e6043f.png

Similarly, due to the Markovian property, we know

probability_and_statistics_14df104011a7f2bce8807e639afbee570c4cb05f.png

Since

probability_and_statistics_6d3683cdf58d0ba82431aeb73cb0f07f90427246.png

we have

probability_and_statistics_e5ac3e5a626091ea55f5e337fe9cdecd9095ec3f.png

Hence, for any positive integer probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png:

probability_and_statistics_6e636ece9b37f2ac83298adf9887e59bc992a39e.png

Hence,

probability_and_statistics_85a039ab0f90379d0c9de08f2f3d7a82cffa1056.png

Coupling lemma then gives

probability_and_statistics_6c299bbead36aa6a4a7fbab3a0cc1ddcf98a9a0c.png

Finally, we observe that, letting probability_and_statistics_0afe740095f5edf16c368ec49b476e1201188101.png,

probability_and_statistics_95c747f4adc88b8255e6d0b5cde54c969df68560.png

which shows that probability_and_statistics_197a4736e3ed1cccee6186e92e6dfcea9bb13ded.png, i.e. converges to the stationary distribution.

Finally, observe that probability_and_statistics_8a303c22b5ff8640fb384bd5649f789459e04e8b.png is unique, since

probability_and_statistics_76937515e5c23e230d1587f79f19856232cbfd94.png

Mixing time

Let probability_and_statistics_e47d8e2a1b6d76bb0c397e86bdfc26b5212c931d.png be some probability_and_statistics_41b9461990e0de03cc8c6634e8f56c71725aea84.png and let probability_and_statistics_0b52766692d92389baf4fb95893460c6f4f3a482.png have the stationary distribution. Fix probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png. By the coupling lemma, there is a coupling and random variables probability_and_statistics_92c7e74c631c1fa785c20afaa8f3f977379cb5ae.png and probability_and_statistics_0d246e965b16f7fbf8389f5d60e0270115362822.png such that

probability_and_statistics_fa703ba414e2156ca1e8e88342e6a5b144607b3f.png

Using this coupling, we define a coupling of the distributions of probability_and_statistics_e1b433fdc6c574536b7a51c67c2c565ccd5bef0b.png as follows:

  1. If probability_and_statistics_cd21a6bdc29289e9662ee0c376c3873f4b83b62f.png, set probability_and_statistics_e0e7479d55344a2042f93951b002b084349c5b90.png
  2. Else, let probability_and_statistics_119140c2f474fb9fc54a5cf86605ba6730452310.png and probability_and_statistics_7c686c0536c67ee8bc8e2a1688dbe68ef71c6428.png independently

Then we have,

probability_and_statistics_4126001d02a0d3e73b9abd06f2b8136801d0836f.png

The first inequality holds due to the coupling lemma, and the second inequality holds by construction of the coupling.

Since probability_and_statistics_52f79c197a9cae5c89768b3267e450b1e90b3f06.png never increases, we can define the mixing time probability_and_statistics_d51f5859298183ed570ffe7625470cace8bc3a0b.png of a Markov chain as

probability_and_statistics_e95e5a7c5f146d18b22740cc48ce48b4e58f4c63.png

for some chosen probability_and_statistics_f87e60cad32b94a8ab30bfe95c7b98ff47f0bdcc.png.
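For a small finite chain the mixing time can be computed directly. The sketch assumes the distance in the definition is the worst case, over starting states, of the total variation distance between the t-step distribution and the stationary distribution. The two-state chain and the tolerance 0.01 are arbitrary illustrative choices.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalised."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def worst_case_tv(P, t_max):
    """Distances max_x TV(P^t(x, .), pi) for t = 0, ..., t_max."""
    pi = stationary_distribution(P)
    Pt = np.eye(P.shape[0])
    out = []
    for _ in range(t_max + 1):
        out.append(0.5 * np.abs(Pt - pi).sum(axis=1).max())
        Pt = Pt @ P
    return np.array(out)

def mixing_time(P, eps, t_max=1000):
    """Smallest t with worst-case TV distance <= eps (None if not reached)."""
    idx = np.nonzero(worst_case_tv(P, t_max) <= eps)[0]
    return int(idx[0]) if idx.size else None

P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
dist = worst_case_tv(P, 10)
print("stationary distribution:", stationary_distribution(P))
print("distances:", np.round(dist, 4))
print("mixing time for eps = 0.01:", mixing_time(P, 0.01))
```

For this chain the distance equals 0.6 * 0.5^t, so it halves at every step, and the sequence is non-increasing as the definition requires.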

General (uncountable) state space

Now suppose we have some uncountable state-space probability_and_statistics_15e882c4dc9f1a3d88edb27e9d05f348c9acedc3.png. Here we have to be a bit more careful when going about constructing Markov chains.

  • Instead of considering transitions probability_and_statistics_cd474510ae80e27d54fa894699a2b45cf18685bf.png for some probability_and_statistics_cd252b6d23439dbd89512ae5732202e97f363c1c.png, we have to consider transitions probability_and_statistics_1ac02892ade974aa2d50147eeb1a0acccc962527.png for probability_and_statistics_c29e1fc5e116c246a118ce61186618a88e3c4532.png and probability_and_statistics_b342b22b1787bf58bab4d294f13f3766aad86154.png.
    • The transition probability is therefore denoted probability_and_statistics_25148078ea67a6f786c3f229e6bc2b3b4fe24ba1.png
  • This is due to the fact that typically, the measure of a singleton set probability_and_statistics_bb208d471a33090ab779d631080ba78f0569a4a9.png is zero, i.e. probability_and_statistics_ced6fa0a288206241b7125c80a3f3ba22811bd7b.png, when we have uncountably many possible inputs.
  • That is, we have to consider transitions to subsets of the state-space, rather than singletons
  • This requires a new definition of irreducibility of a Markov chain

A chain is probability_and_statistics_54c122f6d3c4b78457df130723c56972c218a2e2.png if there exists a non-zero sigma-finite measure probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png on probability_and_statistics_15e882c4dc9f1a3d88edb27e9d05f348c9acedc3.png such that for all probability_and_statistics_94de4ef1e39a9f0743b87cbe84602d98ddc073f6.png with probability_and_statistics_ff85f0cd23b0cef819c7845db3f44845bfe26c29.png (i.e. all measurable subsets of non-zero measure), and for all probability_and_statistics_c29e1fc5e116c246a118ce61186618a88e3c4532.png, there exists a positive integer probability_and_statistics_5d7e01db45e6561400a6011c0ffa08d6b45916fd.png such that

probability_and_statistics_9419b212ee518e060fb55d8421c99232c512d6ef.png

We also need a slightly altered definition of periodicity of a Markov chain:

A Markov chain with stationary distribution probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png is periodic if there exist probability_and_statistics_46a83bc5086a9d9ea24f02429d5197707b1571f3.png and disjoint subsets probability_and_statistics_27eaad7ed22db3b58d766c4559fff83e4e561735.png with

probability_and_statistics_680b15f63530f0b73f005b7d9aa77312a7c6aa54.png

and

probability_and_statistics_30ca27543dfca85d043466c6091ab13f6d9ae727.png

such that probability_and_statistics_4a1e19a30174ceda692fdb9d317a0c6a9b572d37.png, and hence probability_and_statistics_99bc6880ff40ca4ebc0190c3815575af6c91e2cf.png for all probability_and_statistics_b7c30b8be6c76b11192e84f50168a1541d5fab2d.png. That is, we always transition between subsets, and never within these disjoint subsets.

If no such subsets exist, i.e. the chain is not periodic, then we say the chain is aperiodic.

Results such as the coupling lemma also hold for the case of uncountable state-space, simply by replacing the summation with the corresponding integral over probability_and_statistics_15e882c4dc9f1a3d88edb27e9d05f348c9acedc3.png. From roberts04_gener_state_space_markov_chain_mcmc_algor, similarly to the countable case, we have the following properties:

Let probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png be probability measures on the space probability_and_statistics_15e882c4dc9f1a3d88edb27e9d05f348c9acedc3.png. Then

  1. probability_and_statistics_bfc852aa2add63351fbffa62364f55982f980e68.png
  2. More generally,

    probability_and_statistics_e8a7cdf142844ae84cab45bfc3e1fff920c0db90.png

    In particular,

    probability_and_statistics_2f7fe9b3d23522379cf7a6c182b8c3aa9cbdc5f2.png

  3. If probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png is a stationary distribution for a Markov chain with kernel probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png, then probability_and_statistics_bc2f72597a86a43eda86d31912a5b56f19a06220.png is non-increasing in probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png, i.e.

    probability_and_statistics_ae3e8162e6415f8fcd01060b564a61575ece70f4.png

    for probability_and_statistics_2ea0202f9c42a324755fcb722452ed92e7a0322d.png.

  4. More generally, letting

    probability_and_statistics_3d278e5b3f887c2231fe5406cfdf0cee2e4ae7e8.png

    we always have

    probability_and_statistics_bec8fe746014164f69996307ab829f168c97dd88.png

  5. If probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png have densities probability_and_statistics_c4f480233088a134e88f2426541b2f00ca318b55.png and probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png, respectively, w.r.t. some sigma-finite measure probability_and_statistics_ccb973f0cf1a9e40c184ec65f392888f18a6471e.png, then

    probability_and_statistics_1a3e0ba1cba23f2c38520b08efdead5f3fb61e6e.png

    with

    probability_and_statistics_56f2f19d8353e5017d0055247a6d796b83c2d681.png

  6. Given probability measures probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png, there are jointly defined random variables probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png and probability_and_statistics_a4242ffc7d6e6fc3298db18c548b41b6309e0a68.png such that

    probability_and_statistics_5c8b91a14b8ae1cceb40fa840aba2d9acaad6ce2.png

    and

    probability_and_statistics_78a8deac4f7042fb7572b6d4d39d4b6b3d80378d.png

From roberts04_gener_state_space_markov_chain_mcmc_algor we also have the following, important theorem:

If a Markov chain on a state space with countably generated sigma-algebra is probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png-irreducible and aperiodic, and has a stationary distribution probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png, then for probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png-almost every probability_and_statistics_c29e1fc5e116c246a118ce61186618a88e3c4532.png,

probability_and_statistics_ae223da0662db44be2ab657ed42ac2d7c93011b2.png

In particular,

probability_and_statistics_8ab7e6ab86f30eb305d5a192fce131cac94825be.png

We also introduce the notion of Harris recurrence:

We say a Markov chain is Harris recurrent if for all probability_and_statistics_1ee50d47d6057d38253d6010fbee6c1efee6f6da.png with probability_and_statistics_8f8a7c2ba9d703d57c26d98a43475194d5c4e88e.png (i.e. all measurable subsets of non-zero measure), and all probability_and_statistics_c29e1fc5e116c246a118ce61186618a88e3c4532.png, the chain will eventually reach probability_and_statistics_86504a08fe587f7765c3c6e108c08ed5edee64f8.png from probability_and_statistics_2a61cee9c707231c73a1fcb6992bc4bb552044fe.png with probability 1, i.e.

probability_and_statistics_4d1d62bad701178ab666b72108f36f590630425c.png

This notion is stronger than the notion of irreducibility introduced earlier.

Uniform Ergodicity

The theorem above implies asymptotic convergence to stationarity, but it gives no information about the rate of convergence. One "qualitative" convergence rate property is uniform ergodicity:

A Markov chain having stationary distribution probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png is uniformly ergodic if

probability_and_statistics_04e7c36d67773dc760031d595a0e0c3fa01640f8.png

for some probability_and_statistics_b140137adfa536adf3a633bda1e4d8b3f8b2f24d.png and probability_and_statistics_94bab89d3365ef25a0679961449a591742711fb1.png.
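
A two-state chain makes the definition concrete. The sketch below assumes hypothetical transition probabilities `a` and `b` and computes the worst-case total variation distance to stationarity after `n` steps; for this chain the bound holds with equality when the rate is taken to be `|1 - a - b|` and the constant is `max(pi)`. All names are illustrative.

```python
def step(dist, P):
    """Push a row distribution one step through transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def tv(p, q):
    """Total variation distance between two pmfs on a finite space."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


a, b = 0.3, 0.1                      # hypothetical transition probabilities
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]      # stationary distribution of P


def worst_case_tv(n):
    """Maximum over starting states of TV(P^n(x, .), pi)."""
    worst = 0.0
    for x in range(2):
        dist = [1.0 if i == x else 0.0 for i in range(2)]
        for _ in range(n):
            dist = step(dist, P)
        worst = max(worst, tv(dist, pi))
    return worst


rho = abs(1 - a - b)                 # second eigenvalue of P
M = max(pi)                          # worst-case distance at n = 0

for n in range(5):
    print(n, worst_case_tv(n), M * rho ** n)
```

Every finite irreducible aperiodic chain is uniformly ergodic in this sense; the definition only becomes restrictive on general state spaces.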

To develop conditions that ensure uniform ergodicity, we need the following definition:

A subset probability_and_statistics_75f3d96da47c832e8f57d01a263548f4704b6208.png is small or probability_and_statistics_501e5f96d3b643827d3ce43e677b71edb9c101d9.png if there exists a positive integer probability_and_statistics_a3b4cc9b652eb2c995c5ffee46422b816042c848.png, probability_and_statistics_f87e60cad32b94a8ab30bfe95c7b98ff47f0bdcc.png, and a probability measure probability_and_statistics_5e752bf81e66e66878e38f877cbaa0f81324f1f9.png on probability_and_statistics_15e882c4dc9f1a3d88edb27e9d05f348c9acedc3.png s.t. the following minorisation condition holds:

probability_and_statistics_9d82c6a763e1ca64d2db16e74df16d4873bdf317.png

i.e. probability_and_statistics_735d76e486ec3004ae7b5b042e7a8ccea9362161.png for all probability_and_statistics_37f9f86fdde2f5d3b31a6fdc6a278ac204a54b56.png and all measurable probability_and_statistics_b342b22b1787bf58bab4d294f13f3766aad86154.png.

Specific distributions & processes

Hazard function

The hazard function / failure rate probability_and_statistics_181e7e5923f7d9a6b8d6b7ff8e1cbf7467292443.png is defined

probability_and_statistics_64aacc9177c4ebf7ef688f1edbcd314196f07c9a.png

where probability_and_statistics_a0e42a3969013d94a065630de2ab62bb3d124b8f.png denotes the PDF and probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png denotes the CDF.
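
As a quick illustration, assuming the usual form of the hazard function (the PDF divided by one minus the CDF), the exponential distribution has a constant hazard equal to its rate. The `rate` value and helper names below are illustrative.

```python
import math


def hazard(pdf, cdf, t):
    """Hazard rate f(t) / (1 - F(t))."""
    return pdf(t) / (1.0 - cdf(t))


rate = 2.0  # hypothetical rate parameter


def exp_pdf(t):
    return rate * math.exp(-rate * t)


def exp_cdf(t):
    return 1.0 - math.exp(-rate * t)


# The hazard of an exponential distribution does not depend on t.
hazards = [hazard(exp_pdf, exp_cdf, t) for t in (0.1, 1.0, 3.0)]
print(hazards)
```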

TODO Renewal processes

  • Let probability_and_statistics_3c88576af3a42dca4890ec7bf2d6cb40df5273a5.png with probability_and_statistics_b91be591dea4d0379a4a70dae4bc4d2da61ab497.png be a sequence of waiting times between probability_and_statistics_fef17eb8f18cbca39343efb289d554b8e3dbef5d.png and the probability_and_statistics_dff8b6128994236289bbd44f11c32a93b9f32e9e.png event.
  • Arrival time of the probability_and_statistics_2dccadf289f9d7226b843da5f637a61b118b34a2.png event is

    probability_and_statistics_c1147914901fb58cec0b378add7d994064da4dbb.png

  • Denote by probability_and_statistics_53f1e85ad97d85e1d44c21eeddd7b49e5c1cf97a.png the total number of events in probability_and_statistics_9f32dccbdd8194ed7f93848f8673f9126f5134d0.png
  • If probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png is fixed, then probability_and_statistics_3f32fd3ab7937cc6b82bd8b096867ba4004cc80c.png is the count variable we wish to model
    • It follows that

      probability_and_statistics_3a468a822f01bd6e736cd68a4d475a625e4d96c9.png

      i.e. total number of events is less than probability_and_statistics_8f44de754519a6b4b737f2e24c2083a1a7cb4a03.png if and only if the arrival time of the probability_and_statistics_d343851462849c88268f2785b69150c3db55018a.png event is greater than probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png, which makes sense

  • If probability_and_statistics_aa747f573fef4218debcc874d3a0e68a0dac12dc.png is the distribution function of probability_and_statistics_d467e8cb1461c8c1ffc74abddffabf4bed0387eb.png, we have

    probability_and_statistics_3b530db80456037ebd7a7540b32bdcaa45b30141.png

  • Furthermore

    probability_and_statistics_eb013b1e7dd7fe4aa6c0955f02479e8f7d32fb1e.png

    • This is the fundamental relationship between the count variable probability_and_statistics_53f1e85ad97d85e1d44c21eeddd7b49e5c1cf97a.png and the timing process probability_and_statistics_d467e8cb1461c8c1ffc74abddffabf4bed0387eb.png
  • If probability_and_statistics_273df32a05d5e127a80d4ea748d98e94aae485f7.png, the process is called a renewal process
  • In this case, we can extend the above equation to the following recursive relationship:

    probability_and_statistics_db499c058c02788660a831aa1cc31e3a62a60d3d.png

    where we have

    probability_and_statistics_8b57c27c3b2ab1533a5aeff88e6ba51a37480e9f.png

    • Intuitively: the probability of exactly probability_and_statistics_09624459afcdb2856876edd6b724cf492aedddcc.png events occurring by time probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png is the probability that the first event occurs at time probability_and_statistics_b2110ac708a325605a16e07f91c2bf9db0865aea.png, and that exactly probability_and_statistics_8f44de754519a6b4b737f2e24c2083a1a7cb4a03.png events occur in the remaining time interval, integrated over all times probability_and_statistics_05371a6500a000896344bc95e447d7b0a03899f4.png.
    • Evaluating this integral, we can obtain probability_and_statistics_b6fd235e7ddc0f5c0b6df9a84423a43e6107dd45.png from the above recursive relationship.
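
The equivalence between the count variable and the arrival times can be checked path by path in a simulation. This is a sketch under the assumption of i.i.d. exponential waiting times (so the renewal process is a Poisson process); `T`, `k`, `rate`, and the function name are illustrative choices.

```python
import random


def simulate_arrivals(T, k, rate, rng):
    """Arrival times of a renewal process with Exp(rate) waiting times.

    Draws at least k arrivals and keeps going until one falls after T.
    """
    arrivals, t = [], 0.0
    while len(arrivals) < k or t <= T:
        t += rng.expovariate(rate)
        arrivals.append(t)
    return arrivals


rng = random.Random(0)
T, k, rate = 5.0, 4, 1.0
checks = []
for _ in range(1000):
    arrivals = simulate_arrivals(T, k, rate, rng)
    n_events = sum(1 for s in arrivals if s <= T)    # number of events in [0, T]
    # Fewer than k events by time T  <=>  the k-th arrival happens after T.
    checks.append((n_events < k) == (arrivals[k - 1] > T))

print(all(checks))
```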

Statistics

Notation

Definitions

Efficiency

The efficiency of an unbiased estimator probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png is the ratio of the minimum possible variance to probability_and_statistics_047dbb04f5a3627ea7eb91269eef00b700561817.png.

An unbiased estimator with efficiency equal to 1 is called efficient or a minimum variance unbiased estimator (MVUE).

Statistic

Suppose a random vector probability_and_statistics_acf3b65b2f4f86069358e656f377dbb727298990.png has distribution function in a parametric family probability_and_statistics_a4efeaa67bde6584ef626d647b84bc4117fb155b.png and realized value probability_and_statistics_8ea286b57b111723d4787b43ffe4b3cdfc1614fc.png.

A statistic is a r.v. probability_and_statistics_863ec45bc4bf50022c1bcba604faa4c0bae5b1e7.png which is a function of probability_and_statistics_acf3b65b2f4f86069358e656f377dbb727298990.png that does not depend on probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png. Its realized value is probability_and_statistics_9c92476b1ff133ea1705bf8a86c473fdc7b9e2ff.png.

A statistic probability_and_statistics_391ba25320bcd9d67b1952507e2350803f994ba7.png is said to be sufficient for probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png if the distribution of probability_and_statistics_acf3b65b2f4f86069358e656f377dbb727298990.png given probability_and_statistics_391ba25320bcd9d67b1952507e2350803f994ba7.png does not depend on probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png, i.e.

probability_and_statistics_0b1ef3fcf6e50212136005507aab7ab371c1a7b0.png

Further, we say probability_and_statistics_391ba25320bcd9d67b1952507e2350803f994ba7.png is a minimal sufficient statistic if it's the smallest / least complex "proxy" for probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png.

Observe that,

  1. if probability_and_statistics_391ba25320bcd9d67b1952507e2350803f994ba7.png is sufficient for probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png, so is any one-to-one function of probability_and_statistics_391ba25320bcd9d67b1952507e2350803f994ba7.png
  2. probability_and_statistics_acf3b65b2f4f86069358e656f377dbb727298990.png is trivially sufficient

We say a statistic is pivotal if its distribution does not depend on any unknown parameters.

E.g. if we are considering a normal distribution probability_and_statistics_3a0be2eebbac3cb66797454572c137315a181189.png and probability_and_statistics_8a303c22b5ff8640fb384bd5649f789459e04e8b.png is known, then the mean could be a pivotal statistic, since we know all information about the distribution except the statistic itself. But if we didn't know probability_and_statistics_8a303c22b5ff8640fb384bd5649f789459e04e8b.png, the mean would not be pivotal.

Let probability_and_statistics_fbc1d6b6e27e6500339753c67a4a57b94abdca5b.png.

Then the statistic probability_and_statistics_1d73a46d8a26bc6f9d904535b4c0b0cccd5d6613.png is sufficient for probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png if and only if there exist functions probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png of probability_and_statistics_8ea286b57b111723d4787b43ffe4b3cdfc1614fc.png and probability_and_statistics_c4f480233088a134e88f2426541b2f00ca318b55.png of probability_and_statistics_773dc743a87adae256c17bf9f2fdf9446b3bd042.png such that

probability_and_statistics_50d9c33d7973c9b0f89f96eac79a32733553e24d.png

where probability_and_statistics_4fbb7b56fdaff2d067837fdf1ba5647d5c40c724.png denotes the likelihood.
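
For i.i.d. Bernoulli data the factorization can be written down directly: the likelihood depends on the sample only through the number of successes, so the sum is sufficient and the data-only factor can be taken to be 1. The sketch below is an illustration under that assumed model; the function names and the two samples are hypothetical.

```python
def bernoulli_likelihood(theta, xs):
    """Likelihood of i.i.d. Bernoulli(theta) observations xs."""
    lik = 1.0
    for x in xs:
        lik *= theta if x == 1 else 1.0 - theta
    return lik


def g(t, n, theta):
    """Factor that depends on the data only through t = sum(xs)."""
    return theta ** t * (1.0 - theta) ** (n - t)


xs_a = [1, 0, 1, 1, 0]
xs_b = [0, 1, 1, 0, 1]   # a different sample with the same sum

for theta in (0.2, 0.5, 0.9):
    print(theta, bernoulli_likelihood(theta, xs_a), g(sum(xs_a), len(xs_a), theta))
```

Two samples with the same sum, such as `xs_a` and `xs_b`, therefore carry the same information about the parameter.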

U-statistic

Let probability_and_statistics_8560f3130639e8ddd39f6a5019f01247b3e22327.png be independent observations on a distribution probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png.

Consider a "parametric function" probability_and_statistics_e8a2a0aac9a20ff3b8b5c07569a07370924787f4.png for which there is an unbiased estimator. That is, probability_and_statistics_258f7fa77d270937c483e0c7a6c7653505aaf304.png may be represented as

probability_and_statistics_57b596b6c3904bdc745abc3e420ae781c5c8b305.png

for some function probability_and_statistics_ac0141b810318ee7b122c3a1739a2f2727ea6b0e.png, called a kernel.

Without loss of generality, we may assume probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png is symmetric, otherwise we could simply replace it by its symmetrized version

probability_and_statistics_9f262fed749d39255c9d6130706f79c39dedf1f3.png

For any kernel probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png, the corresponding U-statistic for estimation of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png on the basis of a sample probability_and_statistics_cfb561fb9e1d9f58ab86ab812f934a0934973065.png of size probability_and_statistics_848807fd61483ff6b4a390cb15124e2f7e552664.png is obtained by averaging the kernel probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png symmetrically over the observations:

probability_and_statistics_058c535884425ed43ca148a175ab3c0a580fc53f.png

where probability_and_statistics_2a22d04d67a7480f0bec12756a5c64e31287b408.png denotes the summation over probability_and_statistics_4b117b23534ab5e9ba18d4ae25a66e850381cd42.png combinations of probability_and_statistics_8f44de754519a6b4b737f2e24c2083a1a7cb4a03.png distinct elements probability_and_statistics_4017795f928648b10b92788a12a9145b768c3798.png from probability_and_statistics_58d2598c595a8825113f260cbcc23d98f4ec4228.png.

Clearly, probability_and_statistics_8c3070614572e6ab3cd1d4236f8cddb7ac44824c.png is then an unbiased estimate of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png.

To conclude, a U-statistic is then a statistic which has the property of being an unbiased estimator for the corresponding probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png, where probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png is such that it can be written as stated above.
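
A familiar instance: with the degree-2 kernel `h(x1, x2) = (x1 - x2)^2 / 2`, whose expectation is the variance, the U-statistic is exactly the unbiased (Bessel-corrected) sample variance. The sketch below checks this on a small illustrative sample; the function names are not from the source.

```python
from itertools import combinations
from statistics import variance


def u_statistic(kernel, xs, m):
    """Average a symmetric kernel over all m-element subsets of the sample."""
    values = [kernel(*subset) for subset in combinations(xs, m)]
    return sum(values) / len(values)


def var_kernel(x1, x2):
    """Kernel with E[h(X1, X2)] = Var(X) for i.i.d. X1, X2."""
    return 0.5 * (x1 - x2) ** 2


xs = [2.0, 4.0, 4.0, 5.0, 7.0]
u = u_statistic(var_kernel, xs, 2)
print(u, variance(xs))
```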

V-statistic

Statistics that can be represented as functionals probability_and_statistics_4febfc51fdff1d72ab923a7d66f1c825ea85dca2.png of the empirical distribution function, probability_and_statistics_fc63368bda84bdc47b5c68e00be09f635871e720.png, are called statistical functionals.

A V-statistic is a statistical function (of a sample) defined by a particular statistical functional of a probability distribution.

Suppose probability_and_statistics_f342f32162df5a5c8bd2693150d533af207efcaf.png is a sample. In typical applications the statistical function has a representation as the V-statistic

probability_and_statistics_b0b97b6306d81803e263ba6b0c05cfb09a73039a.png

where probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png is a symmetric kernel function.

probability_and_statistics_daf252c46c7fac69f4a2559b7d98d391028e04bd.png is called a V-statistic of degree probability_and_statistics_8f44de754519a6b4b737f2e24c2083a1a7cb4a03.png.
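
For comparison with a U-statistic, the sketch below averages the variance kernel `h(x1, x2) = (x1 - x2)^2 / 2` over all index tuples, repeats included. Under that assumption the V-statistic equals the plug-in (1/n) variance, which is biased downward. Names and data are illustrative.

```python
from itertools import product
from statistics import pvariance, variance


def v_statistic(kernel, xs, m):
    """Average a kernel over all n**m tuples of observations (repeats allowed)."""
    values = [kernel(*tup) for tup in product(xs, repeat=m)]
    return sum(values) / len(values)


def var_kernel(x1, x2):
    """Symmetric kernel whose expectation is Var(X) for i.i.d. X1, X2."""
    return 0.5 * (x1 - x2) ** 2


xs = [2.0, 4.0, 4.0, 5.0, 7.0]
v = v_statistic(var_kernel, xs, 2)
print(v, pvariance(xs), variance(xs))
```

The diagonal terms with repeated indices contribute zero here, which is what pulls the V-statistic below the unbiased estimate.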

Seems very much like a form of bootstrap estimate, does it not?

Informally, the type of asymptotic distribution of a statistical function depends on the order of "degeneracy," which is determined by which term is the first non-vanishing term in the Taylor expansion of the functional probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png.

Quantiles

Let probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png be a random variable with cumulative distribution function (CDF) probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png, i.e.

probability_and_statistics_81fafbc544e899d8254c3b96a668825a732471c7.png

Let probability_and_statistics_b649d1e28ce78cc50d50e458231260f47ed33a87.png be such that

probability_and_statistics_1b23dc51ba71506cd071caf1eb87f66d27d821d5.png

for some probability_and_statistics_b7bcb91ef29fa3dd0cfa08349b9854675926cb26.png.

Then probability_and_statistics_b649d1e28ce78cc50d50e458231260f47ed33a87.png is called the probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png quantile of probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png, i.e. the value such that

probability_and_statistics_b3838d3b2413c7be5267903e6bec7815ad80cc76.png

We then say:

  • probability_and_statistics_7fee3c9ed3f3d73b7c14d6c3a0cef585fa5baf7a.png is the median; half probability mass on each side
  • probability_and_statistics_8fac218faad2c76d4d5bc54f0bb03a6a2365b208.png is the lower quartile
  • probability_and_statistics_39f77a3a35f951b49cbf6484b09bab5f36e33b1a.png is the upper quartile

If probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png is continuous and strictly monotonically increasing, then probability_and_statistics_f7f37ec01bb04e03ad217905c0dd538664c4d130.png exists, hence we can compute the quantiles!
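
As a small sketch, the standard normal CDF is continuous and strictly increasing, so its quantiles are obtained by inverting it. The example uses `statistics.NormalDist` from the Python standard library; the choice of distribution and of the three probability levels is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Lower quartile, median, and upper quartile via the inverse CDF.
quantiles = {p: std_normal.inv_cdf(p) for p in (0.25, 0.5, 0.75)}

for p, q in quantiles.items():
    # Applying the CDF to a quantile recovers the probability level.
    print(p, q, std_normal.cdf(q))
```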

Convergence in law

A sequence of cdfs probability_and_statistics_8a5e5eab6a83cab9aeb6e9aa2b5282aa98d7cdd5.png is said to converge to probability_and_statistics_510a3d993dd428bfe79de35a9511a7ed4f30995e.png iff probability_and_statistics_1ce9b21ade4b4c3c411a9f1311c4f89244475d4c.png on all continuity points of probability_and_statistics_510a3d993dd428bfe79de35a9511a7ed4f30995e.png.

If a random variable probability_and_statistics_97dfb349aa7b5a668f7aa4e66f6d0fa7e625ed79.png has cdf probability_and_statistics_8a5e5eab6a83cab9aeb6e9aa2b5282aa98d7cdd5.png, the rv probability_and_statistics_a4242ffc7d6e6fc3298db18c548b41b6309e0a68.png has cdf probability_and_statistics_510a3d993dd428bfe79de35a9511a7ed4f30995e.png, and the cdfs converge in the sense above, then we say that probability_and_statistics_97dfb349aa7b5a668f7aa4e66f6d0fa7e625ed79.png converges in law to probability_and_statistics_a4242ffc7d6e6fc3298db18c548b41b6309e0a68.png, and we write

probability_and_statistics_c55e4e43c04358297a67e5c47acd0b951ed15762.png

This does not mean that probability_and_statistics_97dfb349aa7b5a668f7aa4e66f6d0fa7e625ed79.png and probability_and_statistics_a4242ffc7d6e6fc3298db18c548b41b6309e0a68.png are arbitrarily close as random variables. Consider the random variables probability_and_statistics_551469f1904bbe549faa53a4b30b6ded38b7712c.png and probability_and_statistics_c9e5c715a9fa23d4c66e2058b13b68823712a5ba.png.

Let probability_and_statistics_97dfb349aa7b5a668f7aa4e66f6d0fa7e625ed79.png and probability_and_statistics_a4242ffc7d6e6fc3298db18c548b41b6309e0a68.png be random variables. Then

probability_and_statistics_e4b57d0a6a593fdd1a368e968b64fc596dd60af2.png

where probability_and_statistics_87de92a16285e058c73b022afb266ee0ce349e88.png means converges in probability.

Or, equivalently

probability_and_statistics_449579ef62e5e90c61e87f211feaa9ca88ebef51.png

This is the notion used by WLLN.

If probability_and_statistics_48f82f476792665c58c75b158d0683127150a86c.png where probability_and_statistics_510a3d993dd428bfe79de35a9511a7ed4f30995e.png is the limit distribution and probability_and_statistics_dde01d797492048655a85cf212300e7855362db1.png then

probability_and_statistics_e787d3742db1a97618cdf224b10c017a74bdbadd.png

Let probability_and_statistics_e3b206c46562450a093b78a722941b2cc0cf6daa.png be a sequence of random variables, and probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png be a random variable.

Then

probability_and_statistics_a059e24ed1f6a07b427b72ea1d084c970777e00d.png

This is the notion of convergence used by SLLN.

Kurtosis

The kurtosis of a random variable is the standardized 4th central moment, i.e.

probability_and_statistics_6404958b20e6bc6ad55d92a6ff4d0f1a3c8bfa3a.png

where probability_and_statistics_db9480d4c6177b159281b19648a52411d24dd2d4.png denotes the 4th central moment and probability_and_statistics_8a303c22b5ff8640fb384bd5649f789459e04e8b.png is the std. dev.
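
A minimal sample version of this ratio, assuming population (1/n) moments are used for both the 4th central moment and the variance; the function name and the test data are illustrative.

```python
from statistics import fmean


def kurtosis(xs):
    """Sample analogue of mu_4 / sigma^4 using population (1/n) moments."""
    m = fmean(xs)
    var = fmean([(x - m) ** 2 for x in xs])
    mu4 = fmean([(x - m) ** 4 for x in xs])
    return mu4 / var ** 2


# A symmetric two-point sample has the smallest possible kurtosis.
print(kurtosis([-1.0, 1.0, -1.0, 1.0]))
print(kurtosis([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

For reference, a normal distribution has kurtosis 3 under this definition, so values below 3 indicate lighter tails than the normal.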

Words

A collection of random variables is homoskedastic if all of these random variables have the same finite variance.

This is also known as homogeneity of variance.

A collection of random variables is heteroskedastic if there are sub-populations that have different variability from others.

Thus heteroskedasticity is the absence of homoskedasticity.

Consistency

Loosely speaking, an estimator probability_and_statistics_e19eeadf07d6126bfa6ba9b193d3d6867ae0330f.png of parameter probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png is said to be consistent, if it converges in probability to the true value of the parameter:

probability_and_statistics_92d35ed7d4e326c8245c9e3275e2c3561ae2d812.png

Or more rigorously, suppose probability_and_statistics_b9d7cf38c117deac43a79a51c92408b1bb755d1a.png is a family of distributions (the parametric model) and probability_and_statistics_d022864b93244ea3bfb75babb302e43ecbc802f3.png is an infinite sample from the distribution probability_and_statistics_752512a5c4f58baddbd363fd71442225add91d29.png.

Let probability_and_statistics_bdbffd935a925548bcfae85c63cc92f278f6450c.png be a sequence of estimators for some parameter probability_and_statistics_00c5083bbdd29f45c72300f134528911b417409a.png. Usually probability_and_statistics_e19eeadf07d6126bfa6ba9b193d3d6867ae0330f.png will be based on the first probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png observations of a sample. Then this sequence is said to be (weakly) consistent if

probability_and_statistics_54d539fd5c62801f896a551c831933d578d8e49f.png
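
Weak consistency can be observed empirically: the probability that the estimator misses the true value by more than a fixed tolerance shrinks as the sample grows. The sketch below assumes Uniform(0, 1) data and the sample mean as the estimator of the true mean 0.5; the tolerance `eps`, the number of runs, and the seed are arbitrary illustrative choices.

```python
import random


def exceed_fraction(n, eps, runs, rng):
    """Fraction of runs with |sample mean - 0.5| > eps for Uniform(0, 1) data."""
    count = 0
    for _ in range(runs):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            count += 1
    return count / runs


rng = random.Random(1)
fractions = {
    n: exceed_fraction(n, eps=0.05, runs=200, rng=rng) for n in (10, 100, 1000)
}
print(fractions)
```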

Jeffreys Prior

The Jeffreys prior is a non-informative prior distribution for a parameter space, defined as:

probability_and_statistics_fdf1bc469d3da86fe08b30ba5ef36c9780cbd347.png

It has the key feature that it is invariant under reparametrization.

Moment generating function

For r.v. probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png the moment generating function is defined as

probability_and_statistics_e00362794b5fc127d9717c1e70c9d832155731d0.png

Taking the derivative wrt. probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png we observe that

probability_and_statistics_d35fa89d2dbf1a276d3e4e09a3ba600f35dd7cd9.png

Letting probability_and_statistics_1521f78eb9a3ed17a5379fd7b8ebcfa5f1733e50.png, we get

probability_and_statistics_a64f4823f5a63da9007f285f84a0943cb09f7d8a.png

i.e. the mean. We can take this derivative probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png times to obtain the expectation of probability_and_statistics_70835c5f63160b639a9616da0f7f5fa77c69ba8a.png, which is why probability_and_statistics_8eefd4b9250c16ed90ae3eb267c406a2dea3ee43.png is useful.
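
A numerical check of this property, assuming an exponential random variable with rate `rate`, whose moment generating function is `rate / (rate - t)` for `t < rate`. The derivatives at zero are approximated with central finite differences; the step size `h` and all names are illustrative.

```python
def mgf_exponential(t, rate):
    """MGF of an Exp(rate) random variable, valid for t < rate."""
    return rate / (rate - t)


def derivative_at_zero(f, order, h=1e-3):
    """Central finite-difference estimate of f'(0) or f''(0)."""
    if order == 1:
        return (f(h) - f(-h)) / (2 * h)
    if order == 2:
        return (f(h) - 2 * f(0.0) + f(-h)) / h ** 2
    raise ValueError("only orders 1 and 2 are implemented")


rate = 2.0


def M(t):
    return mgf_exponential(t, rate)


first_moment = derivative_at_zero(M, 1)    # approximates E[X] = 1 / rate
second_moment = derivative_at_zero(M, 2)   # approximates E[X^2] = 2 / rate**2
print(first_moment, second_moment)
```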

Distributions

Binomial

probability_and_statistics_a98d6ecb908a2686eedd281fe28c198062d5a4e0.png

or equivalently,

probability_and_statistics_1d45564e324db9af0385788f589870612ebcbf73.png

Negative binomial

A negative binomial random variable represents the number of successes probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png before obtaining probability_and_statistics_6ee30c172ceb70f3426abc02237d9f23445ae2f9.png failures, where probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png is the probability of a failure for each of the underlying Bernoulli trials.

probability_and_statistics_e26997975159e553454f96239af87b8a264bd432.png

Derivation

Let probability_and_statistics_ab9d9977871eda0af750024aad11a3b9b2f6c240.png denote the rv. for # of successes and probability_and_statistics_76c425f1c1b4193275045ddb6cf6d9b72be71d87.png the number of failures.

Suppose we have a run of probability_and_statistics_adc9b7991a7e3d9d80a4995af0a51b968aa19fc0.png successes and probability_and_statistics_31362bcce0c8106eaa136a1e2f2cc3588e3bbc8f.png failure, then

probability_and_statistics_dd5b9b484493ef3b9c990a84544eb00259a66cd9.png

where the failure is of course the last thing that happens before we terminate the run.

Now, suppose that the run above is followed up by probability_and_statistics_d644663e01a65b4d14816766136cd76a0dabd149.png failures, i.e. probability_and_statistics_6dd8e3aa0c75538141459d0daa082b1018ca2462.png but we have a specific ordering. Then, letting probability_and_statistics_6c7d3bb97a9dc426bd5d203399acff7d53e2256a.png denote the result of the i-th Bernoulli trial,

probability_and_statistics_64354701d443ed74fd0cc54ca4bc2ba5a30a286d.png

But for probability_and_statistics_d644663e01a65b4d14816766136cd76a0dabd149.png of the failures, we don't actually care when they happen, so we don't want any ordering. That is, we have probability_and_statistics_6babec315c005248ffb44682aca23f5d251af2ae.png sequences of the form described above which are acceptable.

Hence we get the pmf

probability_and_statistics_bedd036b4e0c7b0f8c4a7d8bdcc3e3d034927f5a.png
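
The counting argument can be turned into code. The sketch below assumes the parameterization described in the text: `k` successes before the `r`-th failure, with failure probability `theta` per trial, and the last trial fixed to be a failure so that only the other `r - 1` failures are free to move among the first `k + r - 1` positions. The symbol names are assumptions, since the formulas are only available as images.

```python
from math import comb


def neg_binomial_pmf(k, r, theta):
    """P(K = k): k successes before the r-th failure, failure prob. theta."""
    return comb(k + r - 1, r - 1) * (1.0 - theta) ** k * theta ** r


r, theta = 3, 0.4
probs = [neg_binomial_pmf(k, r, theta) for k in range(500)]
total = sum(probs)                                   # should be close to 1
mean = sum(k * p for k, p in enumerate(probs))       # should be close to r(1 - theta)/theta
print(total, mean)
```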

Gamma distribution

Chi-squared distribution

The chi-squared distribution with probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png degrees of freedom is the distribution of a sum of the squares of probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png independent standard normal random variables.

It's a special case of the gamma distribution.

T-distribution

The T-distribution arises as a result of the Bessel-corrected sample variance.

Why?

Our population-model is as follows:

probability_and_statistics_6bee5e02fd8bb4569c231ecc3e86b78004f2c1ce.png

Let

probability_and_statistics_d7175b3966acb61e319829e0137a6eaae75ef603.png

then let

probability_and_statistics_db05a3372412d14617b6e233c7b2cd68d7f701e1.png

be the Bessel-corrected variance. Then the random variable

probability_and_statistics_1caee4c7a29b9d0f46134b7d47c89c7417f41f0c.png

has a standard normal distribution, and the random variable

probability_and_statistics_b39924286d7f9e165aad5cc3ac0627fbd3d4519f.png

(where probability_and_statistics_391ba25320bcd9d67b1952507e2350803f994ba7.png has been substituted for probability_and_statistics_8a303c22b5ff8640fb384bd5649f789459e04e8b.png ) has a Student's t-distribution with probability_and_statistics_aa6475989551e9093825e91895bb5894896ec153.png degrees of freedom.

Note that the numerator and the denominator in the preceding expression are independent random variables, which can be proved by a simple induction.
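
A short sketch of computing this statistic from data, assuming the usual form: the sample mean minus the hypothesized mean, divided by the Bessel-corrected standard deviation over the square root of the sample size. The data and the hypothesized mean `mu` are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(xs, mu):
    """(sample mean - mu) / (S / sqrt(n)) with Bessel-corrected S."""
    n = len(xs)
    return (mean(xs) - mu) / (stdev(xs) / sqrt(n))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
t = t_statistic(xs, mu=2.0)
print(t)
```

Under the normal population model, this value would be compared against a Student's t-distribution with `len(xs) - 1` degrees of freedom.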

F-distribution

A random variate of the F-distribution with parameters probability_and_statistics_c683b5fd4af60dc179214afb1963e5d3ed7df879.png and probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png arises as the ratio of two appropriately scaled chi-squared variates:

probability_and_statistics_ca17d5b2e0756b79e98b4ed01ef7a27917a63694.png

where

Power laws

Notation
  • probability_and_statistics_459fa07f4570c074d7193da154543cf35c79efd7.png such that we have a power law probability_and_statistics_bcedbd0ec31a68bc5ba44172322d7afa687219ee.png, in such cases we say that the tail of the distribution follows a power law
Definition

A continuous power-law is one described by probability density probability_and_statistics_048f6c98e916ed8890f70a26012e2f994b6f56f5.png such that

probability_and_statistics_6ce902cf5daa9d9c12b5ed18334c0a8868d5b345.png

where probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png is the observed value and probability_and_statistics_a4680369a3334fa4e273bbd161e5ca60a5d7c8a5.png is the normalization constant. Clearly this diverges as probability_and_statistics_326935dc2172ac9f7e2b834c57ea742ac30e0c51.png, hence it cannot hold for all probability_and_statistics_1233a80798c04bb731186c409c7c8f037c5c12b1.png; there must be a lower bound to power-law behavior.

Provided probability_and_statistics_e8d15c7fa4f333d7821472c1a2ac2dd283030456.png, we have

probability_and_statistics_794103912b6ccb00acbc8ade14ba40600f164cb2.png

A discrete power-law is defined by

probability_and_statistics_50cb01c78a307b3c1382a8e3c08b75b5eb91b5a3.png

Again, this diverges at zero, hence there must be some lower bound probability_and_statistics_fde2b7c0db6482a345cb9ada58dd63451cf73916.png on power-law behaviour:

probability_and_statistics_8adc54c92eb90ef9f3f8fa3941e39c0373b0e437.png

where

probability_and_statistics_b4562ad3e805c984f198b30876dad2b041892462.png

is the generalized or Hurwitz zeta function.

Important

Sources: clauset07_power_law_distr_empir_data and so_you_think_you_have_a_power_law_shalizi

  1. Log-log plot being roughly linear is necessary but not sufficient
    1. Errors are hard to estimate because they are not well-defined by the usual regression formulas, which are based on assumptions that do not apply in this case: noise of logarithm is not Gaussian as assumed by linear regression. In continuous case, this can be made worse by choice of binning scheme used to construct histogram, hence we have an additional free parameter.
    2. A fit to a power-law distribution can account for a large fraction of the variance even when fitted data do not follow a power law, hence high values of probability_and_statistics_4506da39e6f6ba1c0c3ad3711d023912991e44d1.png ("explained variance by regression fit") cannot be taken as evidence in favor of the power-law form.
    3. Fits extracted by regression methods usually do not satisfy basic requirements on probability distributions, such as normalization and hence cannot be correct.
  2. Abusing linear regression. Standard methods, e.g. least-squares fitting are known to produce systematically biased estimates of parameters for power-law distributions and should not be used in most circumstances
  3. Use MLE to estimate scaling exponent probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png
  4. Use goodness of fit to estimate scaling regions.
    • In some cases only part of the data actually follows a power law.
    • Method based on the Kolmogorov-Smirnov goodness-of-fit statistic
  5. Use a goodness-of-fit test to check the quality of the fit
  6. Use Vuong's test to check for alternative non-power laws (see likelihood_ratio_test_for_model_selection_and_non_nested_hypothesis_vyong89)
Fitting power laws
  • Continuous

    The MLE for the continuous case is

    probability_and_statistics_5e5ea305a2e288ad40935dde86616999b3cd4366.png
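
The continuous estimator can be tried on synthetic data. The sketch below assumes the estimator given in clauset07_power_law_distr_empir_data, one plus the sample size divided by the sum of log ratios of the observations to the lower bound, and draws data by inverse-transform sampling. The lower bound `x_min` is treated as known, and the true exponent, sample size, and seed are illustrative.

```python
import math
import random


def sample_power_law(alpha, x_min, n, rng):
    """Inverse-transform samples from a density proportional to x**(-alpha) on [x_min, inf)."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(xs, x_min):
    """MLE of the scaling exponent for observations at or above a known x_min."""
    tail = [x for x in xs if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


rng = random.Random(42)
xs = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
alpha_hat = fit_alpha(xs, x_min=1.0)
print(alpha_hat)
```

In practice the lower bound is rarely known and has to be estimated as well, for instance with the Kolmogorov-Smirnov based method mentioned in the list above.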

Maximum Likelihood Estimation

Notation

  • probability_and_statistics_4b3df9f3b9934fa8f5f627dbe0b6db7de9e86dc4.png denotes the log-likelihood, i.e. probability_and_statistics_0b15ef347b4007b93eca367a46c1ea978167e99c.png
  • probability_and_statistics_25308b720f4f19a61ced72e2e3c75c46f87966cb.png is the Fisher information (expected information)
  • probability_and_statistics_c639210ee48eeac40b7348bdfb68e1e5b98ea4b9.png is the observed information, i.e. information without taking the expectation
  • probability_and_statistics_62dc4f40de3242ca186375d8563c0535b9ff7f1e.png is called the score function
  • probability_and_statistics_f2f278a473884f29e46817cccf4129b065185aa4.png denotes the true (and thus unobserved) parameter
  • probability_and_statistics_b149b50e2a8d51c59438a47d599a4e72aa85167f.png means the probability density probability_and_statistics_66710f6fc601e6c255c7bcd98b5a5bb4be34898f.png evaluated at probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png

Approximate (asymptotic) variance of MLE

For large samples (and under certain conditions) the (asymptotic) variance of the MLE probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png is given by

probability_and_statistics_8e0485af4cb126743c74bc002386055a302d3ab4.png

where

probability_and_statistics_488cec7aef400efc279ad723fe903827898fd2e8.png

where probability_and_statistics_25308b720f4f19a61ced72e2e3c75c46f87966cb.png is called the Fisher information.

Estimated standard error

The estimated standard error of the MLE probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png is given by

probability_and_statistics_3b27270044b98a0e9895c9eb81dbe37ed54f9d8e.png
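
For a concrete case, consider i.i.d. Bernoulli data. The sketch below assumes the standard error is one over the square root of the Fisher information of the whole sample, evaluated at the MLE; for this model that information is `n / (theta * (1 - theta))`. The data and function name are illustrative.

```python
import math


def bernoulli_mle_and_se(xs):
    """MLE of theta and its estimated standard error for Bernoulli data."""
    n = len(xs)
    theta_hat = sum(xs) / n
    # Fisher information of the full sample, evaluated at the MLE.
    fisher_info = n / (theta_hat * (1.0 - theta_hat))
    return theta_hat, 1.0 / math.sqrt(fisher_info)


xs = [1] * 30 + [0] * 70
theta_hat, se = bernoulli_mle_and_se(xs)
print(theta_hat, se)
```

The result agrees with the familiar formula for the standard error of a sample proportion. Note that the helper divides by zero if all observations are equal, since the estimated information is then unbounded.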

Regularity conditions

The lower bound on the variance of the MLE holds under the following conditions on the probability density function, probability_and_statistics_10a804642fdd488409b653a5353b6aee03b1df63.png, and the estimator probability_and_statistics_fce7b5a8efa80779aee1826e2e1dce7f42af42be.png:

probability_and_statistics_9b2e26af7ecdf7086900607c600aa5fa2257a260.png

exists and is finite.

  • The operations of integration wrt. probability_and_statistics_2a61cee9c707231c73a1fcb6992bc4bb552044fe.png and differentiation wrt. probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png can be interchanged in the expectation of probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png; that is,

probability_and_statistics_991c56f9291651372ac660af857cd486d39e0e11.png

whenever the RHS is finite. This condition can often be confirmed by using the fact that integration and differentiation can be swapped when either of the following is true:

  1. The function probability_and_statistics_10a804642fdd488409b653a5353b6aee03b1df63.png has bounded support in probability_and_statistics_2a61cee9c707231c73a1fcb6992bc4bb552044fe.png, and the bounds do not depend on probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png
  2. The function probability_and_statistics_10a804642fdd488409b653a5353b6aee03b1df63.png has infinite support, is continuously differentiable, and the integral converges uniformly for all probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png
"Proof" (alternative, also single variable)
  • Notation
    • probability_and_statistics_7df9a470775af7f02077f5b0aa3738526a3c20a8.png denotes the expectation over the data probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png, assuming the data arises from the model specified by the parameter probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png
    • probability_and_statistics_8164dd71b5b45e0ca183fdf12705e2fd78b33b3e.png is the same as above, but assuming probability_and_statistics_b019cf4b45a1e4fa6679b5267038be1d4ccee782.png
    • probability_and_statistics_369a8fe8484433fb76235394a0177cb5c49e7bfb.png denotes the score, where we've made the dependence on the data probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png explicit by including it as an argument
  • Stuff
    • Consistency of probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png as an estimator

      Suppose the true parameter is probability_and_statistics_f2f278a473884f29e46817cccf4129b065185aa4.png, that is:

      probability_and_statistics_00534eca582f6ba509f76583867e85785dfdfefb.png

      Then, for any probability_and_statistics_6e974a4db8443b47c8a62f930f210d8cc06aec2c.png (not necessarily probability_and_statistics_f2f278a473884f29e46817cccf4129b065185aa4.png), the Law of Large Numbers implies the convergence in probability

      probability_and_statistics_780d82021ccda564644c56c37241e3cccdc6b7bd.png

      Under suitable regularity conditions, this implies that the value of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png maximizing the LHS, which is probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png, converges in probability to the value of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png maximizing RHS, which we claim is probability_and_statistics_f2f278a473884f29e46817cccf4129b065185aa4.png.

      Indeed, for any probability_and_statistics_6e974a4db8443b47c8a62f930f210d8cc06aec2c.png

      probability_and_statistics_8d6085371d88c5f381fd0d1720a510317a64a5e4.png

      Noting that probability_and_statistics_336ea4a4055a77ee377a87af54e784122b7ea1d3.png is concave, Jensen's Inequality implies

      probability_and_statistics_f63e403a665d4125fd7bb440efc599958aa3e036.png

      for any positive random variable probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png, so

      probability_and_statistics_540420034ef86ede8a89efcdf1c34934e3083bf7.png

      which establishes "consistency" of probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png since probability_and_statistics_669c9f9ddd8fb7be4a4a8eeb73a4e5409abed027.png is maximized at probability_and_statistics_4eb5b8f6eefa10f9d783d5d57974c5251d64d625.png.

    • Expectation and variance of score

      For probability_and_statistics_6e974a4db8443b47c8a62f930f210d8cc06aec2c.png,

      probability_and_statistics_bb9ec152e6b26e9baaf100adf187e52d3d3ff50a.png

      First, for the expectation we have

      probability_and_statistics_22ff215817ec69d766dcb19d77445b769bfe97c3.png

      Assuming we can interchange the order of the derivative and the integral (which is justified under regularity conditions such as those listed earlier), we have

      probability_and_statistics_3d4fba0a1890f8d83e6681bdf6d7134924aa229f.png

      For the variance, we can differentiate the above identity:

      probability_and_statistics_1a5355ba22ec11fe1b43c8435d86a9b50d33897d.png

      where we've used the fact that probability_and_statistics_7016a0344d0a2069848f66586d5c0e16f6df9987.png and probability_and_statistics_92c816b7ca63014ead89de2ec3decd64cd612211.png, which implies that probability_and_statistics_aac0a09403236ac3d45f84f3009f3958ca316998.png.

      This is equivalent to

      probability_and_statistics_5920f04a5494679535833ff1179a9eebfa5845ea.png

      as wanted.

    • Asymptotic behavior

      Now, since probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png maximizes probability_and_statistics_decc2065d58551b057bfcdf8688b4e55230b0c04.png, we must have probability_and_statistics_8adf3e71c8e470a67f0eef9950bed4c2bed6799b.png.

      Consistency of probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png ensures that probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png is close to probability_and_statistics_f2f278a473884f29e46817cccf4129b065185aa4.png (for large n, with high probability). Thus, we can apply a first-order Taylor expansion to the equation probability_and_statistics_8adf3e71c8e470a67f0eef9950bed4c2bed6799b.png about the point probability_and_statistics_b019cf4b45a1e4fa6679b5267038be1d4ccee782.png:

      probability_and_statistics_e469805cff9a6d140bfcdaa330bc90c302fb65d6.png

      Thus,

      probability_and_statistics_042beca3bd28aee3f1f5493bac0cec4e863d1d1d.png

      For the denominator, we rewrite as

      probability_and_statistics_23a3eaf9da2a09efb90044bb9c24ec2f3ca19de0.png

      and then, by the Law of Large Numbers again, we have

      probability_and_statistics_79825a034c049628a259f6e05649901a40dc3ecb.png

      in probability.

      For the numerator, due to what we showed earlier, we know that

      probability_and_statistics_463ce92f8a492572236d9e11dd21746b6edc3930.png

      We then have,

      probability_and_statistics_e4f289233447ddcf657bbc4f5c54ea105abfd365.png

      and by the Central Limit Theorem, we have

      probability_and_statistics_7d9ade95e4d8cff014ba4b2397e057f283fe49f6.png

      Finally, by Continuous Mapping Theorem and Slutsky's Lemma, we have

      probability_and_statistics_0c628a922cd08029cd04b3d1cae13b5e9f106d0e.png
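The asymptotic normality result can be checked numerically. The following is a minimal simulation sketch, assuming an exponential model with rate theta (a model chosen here purely for illustration, not taken from the notes). For that model the MLE is the reciprocal of the sample mean and the per-observation Fisher information is 1 / theta^2, so sqrt(n) times the estimation error should have standard deviation close to theta.

```python
import math
import random
import statistics

def simulate_scaled_errors(theta0, n, reps, seed=0):
    """Return sqrt(n) * (theta_hat - theta0) for `reps` simulated samples."""
    rng = random.Random(seed)
    scaled_errors = []
    for _ in range(reps):
        sample = [rng.expovariate(theta0) for _ in range(n)]
        theta_hat = 1.0 / statistics.fmean(sample)  # MLE of the exponential rate
        scaled_errors.append(math.sqrt(n) * (theta_hat - theta0))
    return scaled_errors

theta0 = 2.0  # hypothetical true rate
errors = simulate_scaled_errors(theta0, n=400, reps=2000)
empirical_sd = statistics.stdev(errors)
theoretical_sd = theta0  # sqrt(1 / I(theta0)) with I(theta) = 1 / theta**2
print(empirical_sd, theoretical_sd)
```

The two printed values should be close; the small remaining gap reflects finite n and Monte Carlo noise.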

Score test

For a random sample probability_and_statistics_2f8c22c3d5dfa62a0d40f92a13125f1231a66e91.png the total score

probability_and_statistics_a8fee608bbd8b9c140f12cbcf05aaf796bef1d11.png

is the sum of probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png i.i.d. random variables. Thus, by the central limit theorem, it follows that as probability_and_statistics_bf3fe99c079b6d787b66abdd8c02bfd9ac53b784.png

probability_and_statistics_6c02739e32d5050fbfe0acb4d9b5c2fe7b17acc1.png

and

probability_and_statistics_a9346564f60dc08b44e18ec8b947de803b8605e3.png

This can then be used as an asymptotic test of the null-hypothesis probability_and_statistics_53f581a1f3fc8ead4cea2804129ad69f247b66ef.png.

Reject null-hypothesis probability_and_statistics_53f581a1f3fc8ead4cea2804129ad69f247b66ef.png if

probability_and_statistics_3283a3e1aa152ad4c53db12e4f88d93205d52154.png

for a suitable critical value probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png. Or, equivalently,

probability_and_statistics_4da054ab3437b2bc080badf690396ab2f25df831.png
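As a rough illustration of the score test, here is a sketch for i.i.d. Poisson counts, assuming the usual standardised form U(lam0) / sqrt(n I(lam0)) compared against a standard normal quantile. The Poisson model, the function name `poisson_score_test` and the data are illustrative assumptions, not part of the notes.

```python
from statistics import NormalDist

def poisson_score_test(counts, lam0, alpha=0.05):
    """Two-sided score test of H0: lambda = lam0 for i.i.d. Poisson counts."""
    n = len(counts)
    total_score = sum(counts) / lam0 - n   # U(lam0) = sum_i (x_i / lam0 - 1)
    total_info = n / lam0                  # n * I(lam0), where I(lam) = 1 / lam
    z = total_score / total_info ** 0.5    # approximately N(0, 1) under H0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) >= z_crit

counts = [3, 5, 4, 6, 2, 7, 5, 4, 6, 8]    # made-up data, sum = 50
z, z_crit, reject = poisson_score_test(counts, lam0=3.0)
print(z, z_crit, reject)
```

Note that the test only needs quantities evaluated at the null value, so the MLE never has to be computed.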

Likelihood Ratio Tests

We expand the log-likelihood using Taylor expansion about the true parameter probability_and_statistics_f2f278a473884f29e46817cccf4129b065185aa4.png

probability_and_statistics_fdd619a85c98c33f6696ce8d14c003b58e84a0ad.png

Subtracting probability_and_statistics_71d07079439fd93fd0495f92a47a7d46f1854052.png from both sides, and arranging

probability_and_statistics_d3d40b770f4b095d1210eb5b03ef06bd84a4d9fb.png

And since probability_and_statistics_1859c6e3df01d9915dd779bc634f3072d325918d.png, we get

probability_and_statistics_1b253259d22eca19bca6cdef0a10c64035c71c4b.png

which means that the difference between the log-likelihoods can be considered to be a random variable drawn from a probability_and_statistics_a7d0e1a2db3c47f81bea40f74592a777f901e8e5.png distribution

probability_and_statistics_58a3520ac1f912150c78c73c908d73b3a1a65b3f.png

and we call the term on the left-hand side the likelihood-ratio statistic.

The likelihood-ratio test rejects probability_and_statistics_558c22104fa07357e662b2bb2855f779ecb6ebd9.png if

probability_and_statistics_a7ba2c3cb5fcc1b20fe447f16f9acdc7d42ff1d3.png

for a suitable significance / critical value probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png.

In other words, we reject the null hypothesis that probability_and_statistics_b019cf4b45a1e4fa6679b5267038be1d4ccee782.png when the left-hand side is too large to be a plausible draw from a chi-squared distribution.
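A small sketch of the likelihood-ratio test, assuming an exponential model with rate theta and the usual statistic 2 [l(theta_hat) - l(theta0)] compared with a chi-squared quantile on one degree of freedom. The model, the data and the function names are assumptions made for illustration. The chi-squared(1) quantile is obtained as the square of a standard normal quantile, which avoids any external dependency.

```python
import math
from statistics import NormalDist

def exp_loglik(theta, xs):
    """Log-likelihood of i.i.d. exponential data with rate theta."""
    return len(xs) * math.log(theta) - theta * sum(xs)

def exp_lrt(xs, theta0, alpha=0.05):
    theta_hat = len(xs) / sum(xs)                      # unrestricted MLE
    stat = 2 * (exp_loglik(theta_hat, xs) - exp_loglik(theta0, xs))
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2    # upper-alpha chi-squared(1) quantile
    return stat, crit, stat >= crit

waiting_times = [0.8, 1.9, 0.4, 2.7, 1.1, 0.6, 3.2, 1.5]  # made-up data
lrt_stat, lrt_crit, lrt_reject = exp_lrt(waiting_times, theta0=1.0)
print(lrt_stat, lrt_crit, lrt_reject)
```

With these numbers the statistic is well below the critical value, so the null value of the rate is not rejected.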

Wald test

We test whether or not

probability_and_statistics_7ebe6bf883b84bd6765634a497fdef6d6ac07c11.png

That is, we reject the null-hypothesis probability_and_statistics_558c22104fa07357e662b2bb2855f779ecb6ebd9.png if

probability_and_statistics_9b84381d5ebec3666f6bac574f4c2be7a7cbccc9.png

for some suitable significance / critical value probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png.
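For comparison, a Wald test can be sketched as follows, assuming Bernoulli data and the common form in which the Fisher information is evaluated at the MLE. The model, the numbers and the name `bernoulli_wald_test` are illustrative assumptions; the sketch also requires the estimated proportion to lie strictly between 0 and 1.

```python
from statistics import NormalDist

def bernoulli_wald_test(successes, n, p0, alpha=0.05):
    """Two-sided Wald test of H0: p = p0 from `successes` out of `n` trials."""
    p_hat = successes / n                      # MLE of p
    info_hat = 1 / (p_hat * (1 - p_hat))       # I(p) = 1 / (p (1 - p)) at p_hat
    z = (p_hat - p0) * (n * info_hat) ** 0.5   # approximately N(0, 1) under H0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) >= z_crit

wald_z, wald_crit, wald_reject = bernoulli_wald_test(successes=36, n=50, p0=0.5)
print(wald_z, wald_crit, wald_reject)
```

Unlike the score test, the Wald test standardises with information computed at the estimate rather than at the null value.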

Generalization to multi-parameter case

Hypothesis tests

probability_and_statistics_bf5fbe3986caa96891f286df440eed119426033f.png

Confidence regions

probability_and_statistics_b522a8aa40b4201416cfbfef151c955e20aba029.png

Likelihood test for composite hypothesis

Let probability_and_statistics_c78149fe826553ad47f6b160861906066538c7ea.png denote the whole parameter space (e.g. probability_and_statistics_840a0c93614ea49a179760dd123c1667817c2700.png). In general terms we have

probability_and_statistics_a6758b901456b6faabae9995baa2620ef9c7edc1.png

where probability_and_statistics_c9d23d974fdd818194d08a1638af60b85e8453db.png.

The general likelihood ratio test compares the maximum likelihood attainable if probability_and_statistics_c25b608a0d1c76ac3bb86c97fc0916520da3a85d.png is restricted to lie in the restricted subspace probability_and_statistics_0771ce7b70bacdbe91b857a46becb2dd6d7e0e92.png (i.e. under 'reduced' model) with maximum likelihood attainable if probability_and_statistics_c25b608a0d1c76ac3bb86c97fc0916520da3a85d.png is unrestricted (i.e. under 'full' model):

probability_and_statistics_229a9bcfb1d4434bef87bdbbc326d51834b8131d.png

where probability_and_statistics_578cbd7bdb0f3b6d8fd7e4ab19c1931204860fa7.png is the unrestricted MLE of probability_and_statistics_c25b608a0d1c76ac3bb86c97fc0916520da3a85d.png and probability_and_statistics_8e3c543d8ac85fabcf2bd93c1b4e68a2bb4d65c8.png is the restricted MLE of probability_and_statistics_c25b608a0d1c76ac3bb86c97fc0916520da3a85d.png.

Some authors define the general likelihood ratio with the numerator and denominator swapped; this only reverses the direction of the rejection inequality and leaves the test itself unchanged.

Iterative methods

probability_and_statistics_1a73ef90ecf0a24ef2555f18c25d46937b9a4ac0.png

probability_and_statistics_d67fcefb94368204aec31260eff5bcc1b670edbf.png

Simple Linear Regression (Statistical Methodology stuff)

Correlation coefficient and coefficient of determination

The (Pearson) correlation coefficient is given by

probability_and_statistics_4453299959b3b2b560d795d4e6f0a1652badfbb1.png

and coefficient of determination

probability_and_statistics_a125750aad0b9eb33f766d8379af9244b556c3c9.png

Least squares estimates

probability_and_statistics_276383f9f8d20fd755c6b819423f88132e2ef0f5.png

Residual sum of squares

probability_and_statistics_a33bfdeb5104c55fe9e30d0995935e6e612d4b11.png

with the estimated standard deviation being

probability_and_statistics_3ed5dc029850cf735b77c65188c0e11094653b97.png
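The quantities in this section can be computed in a few lines. The sketch below assumes the usual textbook definitions (slope Sxy / Sxx, intercept from the sample means, R^2 = 1 - RSS / Syy, and an n - 2 divisor for the estimated standard deviation); if the formulas above use a different convention, the code would need adjusting. The data are made up.

```python
import math

def simple_linear_regression(xs, ys):
    """Least squares fit of y = intercept + slope * x with summary statistics."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return {
        "slope": slope,
        "intercept": intercept,
        "r": sxy / math.sqrt(sxx * syy),     # Pearson correlation coefficient
        "r_squared": 1 - rss / syy,          # coefficient of determination
        "rss": rss,                          # residual sum of squares
        "sigma_hat": math.sqrt(rss / (n - 2)),
    }

fit = simple_linear_regression([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(fit)
```

In simple linear regression the coefficient of determination equals the squared correlation coefficient, which the output can be used to confirm.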

Laplace Approximation

Overview

Under certain regularity conditions, we can approximate a posterior pdf / pmf as a Normal distribution, that is

probability_and_statistics_e2a060c43c0313b08bef51a6574c736e4723f5d6.png

where

  • probability_and_statistics_fd74ad8dae48a948b3fdc0fe5a54cbf2baf036b3.png i.e. the log-pdf, with probability_and_statistics_d98dc0cc843413cead94a9a230c975387a7fc7b5.png being the second-order derivative wrt. probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png
  • probability_and_statistics_96a8a957c2ff9cee09ca87631ca187b05086b261.png is the posterior we want to approximate

"Derivation"

For any pdf that is smooth and well peaked around its maximum, we can use the Laplace approximation to approximate a posterior distribution by a Gaussian pdf.

It follows from taking the second-order Taylor expansion on the logarithm of the pdf.

Let probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png denote the maximizer of a pdf probability_and_statistics_96a8a957c2ff9cee09ca87631ca187b05086b261.png; then it is also the maximizer of the log-pdf probability_and_statistics_fd74ad8dae48a948b3fdc0fe5a54cbf2baf036b3.png, and we can write

probability_and_statistics_4fc96a8802ea4656217c578f31c806d6d4e803e9.png

where in the second step we've used the fact that probability_and_statistics_63919af51720d9a1f5faa342dce26eb3faa9af2c.png since probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png is a maximum, and finally let

probability_and_statistics_28622f99a202a4cbb12236270c75daef4be0bdc9.png

(notice that probability_and_statistics_ed96bda47111edf252c3a24f9eeb55a46876366e.png since probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png is a maximum)

But the above is simply the log-pdf of a probability_and_statistics_d21162297e19366e216800c57d4e883fb5d53528.png, hence the pdf probability_and_statistics_96a8a957c2ff9cee09ca87631ca187b05086b261.png is approx. normal.

Hence, if we can find the probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png and compute the second-order derivative probability_and_statistics_d98dc0cc843413cead94a9a230c975387a7fc7b5.png of the log-pdf, we can use Laplace approximation.
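The recipe above (find the mode, take the second derivative of the log-pdf there, use its negative reciprocal as the variance) can be sketched as follows. The example assumes a Beta(21, 11) posterior, chosen only because its mode is known in closed form, and approximates the second derivative by a central finite difference. The function name and the step size `h` are illustrative choices.

```python
import math

def laplace_approximation(log_pdf, mode, h=1e-4):
    """Return (mean, variance) of the Gaussian approximation around `mode`.

    `log_pdf` may be unnormalised: an additive constant changes neither the
    mode nor the second derivative.
    """
    second = (log_pdf(mode + h) - 2 * log_pdf(mode) + log_pdf(mode - h)) / h ** 2
    if second >= 0:
        raise ValueError("log-pdf is not concave at the supplied point")
    return mode, -1.0 / second

a, b = 21, 11  # hypothetical Beta posterior parameters

def log_beta_kernel(theta):
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

beta_mode = (a - 1) / (a + b - 2)
approx_mean, approx_var = laplace_approximation(log_beta_kernel, beta_mode)
exact_var = a * b / ((a + b) ** 2 * (a + b + 1))
print(approx_mean, approx_var, exact_var)
```

The approximate variance (about 0.0074) is close to, but not equal to, the exact Beta variance (about 0.0068); the approximation matches curvature at the mode rather than moments.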

Guarantees

Consider the model probability_and_statistics_9ef8c637e27bf942990ee2f96545064b5ac3651e.png, probability_and_statistics_41dccc9503ab50dd71edceab950b62a927639b4a.png.

Under some regularity conditions on the pdfs/pmfs probability_and_statistics_56bef0128215e5feaf6084ea19f8c7baef092c0a.png, including that all of them have the same "support" and that, for each probability_and_statistics_c1a095bc430b73abaf63eae78d315295aaa78937.png, probability_and_statistics_ad71a67cdb07ebcbd000d10dae0a43afc03e3a78.png is twice continuously differentiable, we have that for any prior probability_and_statistics_acaecb6465653931d8a121ca25489c3a97d28bce.png which is positive, bounded and twice differentiable over probability_and_statistics_f89b50eea8387290f17929e663d8b0ab909ba29c.png (the parameter space),

probability_and_statistics_885ba0af4cf603b59809965b6911926837b99822.png

for large probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png.

Under the same regularity conditions it turns out that probability_and_statistics_34631f50de93ec1507569492b38fbfeecdb112b6.png and that probability_and_statistics_f5717d3e5da597a7c7dd6b7b9c088cc9ee5534a3.png.

Point estimators

Notation

Cramér-Rao lower bound

Notation
  • probability_and_statistics_f5b18a018c8ff9c7eacd991aaf2bb6a740540722.png
  • probability_and_statistics_a093126fc5ad46e45c41628f1a95e9a8cf691dc8.png
Theorem

To state the Cramér-Rao inequality, we need to establish some regularity conditions:

  1. probability_and_statistics_fcd3bcd7e191b58d2de5b2a966152db28ecdd2c6.png such that probability_and_statistics_145204aeb1d5c1719020b93b21e10e9428de91a5.png we have probability_and_statistics_7d4a9845c2fd56dcf5a96db5ddd2b1554c3842a3.png, called identifiability
  2. probability_and_statistics_76b454f751e1e36c361d83a18f2df8928953fa3c.png we have probability_and_statistics_650a6a42d604acc99fe33b32840da1f0e32322f6.png have common support
  3. probability_and_statistics_f89b50eea8387290f17929e663d8b0ab909ba29c.png is an open set
  4. probability_and_statistics_83888f8042c19b1e8cae760ed0ee4dc1d2829b19.png exists
  5. probability_and_statistics_5bc92769e771a68f25d1ab59055fb32fe9fb1eff.png

Where we remember that probability_and_statistics_4021d4dc281d0a35fce4f22a0e6c6d8223b9140e.png is the Fisher's information.

Let probability_and_statistics_cfb561fb9e1d9f58ab86ab812f934a0934973065.png denote a random sample from probability_and_statistics_650a6a42d604acc99fe33b32840da1f0e32322f6.png and suppose that probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png is an estimator of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png. Then, subject to the above regularity conditions,

probability_and_statistics_5527dd8a7d3a286b58d1cac2109f85b4a86164d4.png

where

probability_and_statistics_281af8928fe7da96484d423fbba172c00afdb0ef.png

  1. For unbiased probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png, the lower bound simplifies to probability_and_statistics_0f8601bd74a780c27af8c61b0d5f07a9e2f918cd.png
  2. Regularity conditions are needed to change the order of differentiation and integration in the proof given below.
  3. Proof is for continuous r.v.s, but there is a similar proof for discrete r.v.s.

probability_and_statistics_e2b1548cfe8aa5fff319ef6479acb9a49866bb0c.png

From the definition of bias we have

probability_and_statistics_42060d4577a476879121ef22b435901ffb97af99.png

Differentiating both sides wrt. probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png gives (using the regularity conditions)

probability_and_statistics_d7b10fccfd21c50adacf1fd20e36f279884296b7.png

since probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png does not depend on probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png. Since probability_and_statistics_b71617eb0db25c748d29ba7f3f3a749a62e83318.png we have

probability_and_statistics_8aab3efff383125279689d3796700bf1de585dfb.png

Thus,

probability_and_statistics_c3923f1809ef9f291fa748db31885c075dc19f17.png

Now use the result that for any two r.v.s probability_and_statistics_f8a2b400bf9def7b5b67035dfb5f06e148ab7065.png and probability_and_statistics_b339a910bb6007d4d79e9c904b54a684fcf79e70.png,

probability_and_statistics_d3aa835c3d912ad26697d0285449443e824191ee.png

and let

probability_and_statistics_8359a9ab6d7e116018616e778c64e14e72e4618e.png

Then

probability_and_statistics_1168fdd266726b8d603294480451f01c1175f3b9.png

Hence

probability_and_statistics_d4cab30a9feac5e4aa15540c02930962d377b28a.png

Similarly

probability_and_statistics_9f3a9e28ee205ece44e0e788f793de3fb73058fe.png

and since

probability_and_statistics_95ae2724fff909c32fc695f2159093a2787101f4.png

we obtain the Cramér-Rao lower bound as

probability_and_statistics_4aa7d063650e460fbd972043b38545b08969089f.png

The Cramér-Rao lower bound will only be useful if it is attainable or at least nearly attainable.
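As a worked example of an attainable bound, consider estimating the mean of a normal distribution with known variance. The notation below is an assumption made for this example: I(θ) denotes the Fisher information of a single observation, so that the bound for an unbiased estimator based on n observations reads 1 / (n I(θ)).

```latex
\begin{align*}
X_1, \dots, X_n &\overset{\text{iid}}{\sim} \mathcal{N}(\theta, \sigma^2),
  \qquad \sigma^2 \text{ known}, \\
\frac{\partial}{\partial \theta} \log f(x; \theta) &= \frac{x - \theta}{\sigma^2}, \\
I(\theta) &= \mathbb{E}_\theta\!\left[ \left( \frac{X - \theta}{\sigma^2} \right)^{2} \right]
  = \frac{1}{\sigma^2}, \\
\operatorname{Var}_\theta(\bar{X}) &= \frac{\sigma^2}{n} = \frac{1}{n \, I(\theta)} .
\end{align*}
```

The sample mean is unbiased and its variance equals the bound, so in this model the Cramér-Rao lower bound is attained exactly.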

Rao-Blackwell Theorem

Theorem

Let probability_and_statistics_c217d2b0b8ed5b01d70977f8f72c6d5d90e6ed82.png be a random sample of observations from a distribution with p.d.f. probability_and_statistics_bd4e91a593d2e64468da4ad8649c6f9a812256dd.png. Suppose that probability_and_statistics_391ba25320bcd9d67b1952507e2350803f994ba7.png is a sufficient statistic for probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png and probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png is any unbiased estimator for probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png.

Define probability_and_statistics_9812f2431829cb5bbba4c6d9c16c7ccc63fa5924.png. Then,

  1. probability_and_statistics_4c9985ec61ed78089812437915df072a1519f983.png is a function of probability_and_statistics_391ba25320bcd9d67b1952507e2350803f994ba7.png only
  2. probability_and_statistics_c9376557e4f7e66a76e6aaca3a1bd04368f0d939.png
  3. probability_and_statistics_f8eb1b3e8bcabde7e64b1c54318ef4d398826c8e.png
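The three claims can be verified exactly on a small example. The sketch below assumes n Bernoulli(p) observations, the crude unbiased estimator X_1, and the sufficient statistic T equal to the sum of the observations; all of these are illustrative choices. Expectations are computed by enumerating every outcome, so no simulation noise is involved.

```python
from itertools import product

def rao_blackwell_bernoulli(n, p):
    """Compare X_1 with E[X_1 | T], where T = sum of the sample, by enumeration."""
    outcomes = list(product([0, 1], repeat=n))
    prob = {x: p ** sum(x) * (1 - p) ** (n - sum(x)) for x in outcomes}

    # Conditional expectation E[X_1 | T = t], one value per t.
    cond_mean = {}
    for t in range(n + 1):
        group = [x for x in outcomes if sum(x) == t]
        mass = sum(prob[x] for x in group)
        cond_mean[t] = sum(x[0] * prob[x] for x in group) / mass

    def moments(estimator):
        mean = sum(estimator(x) * prob[x] for x in outcomes)
        var = sum((estimator(x) - mean) ** 2 * prob[x] for x in outcomes)
        return mean, var

    crude = moments(lambda x: x[0])
    improved = moments(lambda x: cond_mean[sum(x)])
    return cond_mean, crude, improved

cond_mean, crude, improved = rao_blackwell_bernoulli(n=4, p=0.3)
print(cond_mean)   # depends on t only, and equals t / n
print(crude)       # (mean, variance) of X_1
print(improved)    # (mean, variance) of E[X_1 | T]
```

Both estimators have mean p, while the variance drops from p (1 - p) to p (1 - p) / n after conditioning.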

Non-parametric statistics

Maximum Mean Discrepancy (MMD)

Let probability_and_statistics_d52e9d9efb831414835bb75d787728b7bc049127.png be a class of functions probability_and_statistics_4337b33130fc1a4ff7b38bc98c99047eaae2f71f.png and let:

  • probability_and_statistics_ba0cf7096c85302ffa7297d51180f4f406f1e1bd.png be Borel probability measures
  • probability_and_statistics_0d3f80be4f614c402e368ed7ff697cca7a73b625.png and probability_and_statistics_dc798db4721de8b046922355b76c8844dfcd3b74.png are i.i.d. observations from probability_and_statistics_b346b8eaf5913ed81b033dacc1d6568c142bbb4a.png and probability_and_statistics_9974194a836f8c06efda1c78a55ab184c6c2d746.png, respectively

We define the maximum mean discrepancy (MMD) as

probability_and_statistics_3b8599145b287b2d2d08748e985fff6e48d2072d.png

In the statistics literature, this is an instance of an integral probability metric.

A biased empirical estimate of the probability_and_statistics_7f401fe773c604cb2817a9df9319e087f93d3f70.png is obtained by replacing the population expectations with empirical expectations computed on the samples probability_and_statistics_f40ad5f32532ae52dd17a4315b7711042277a778.png and probability_and_statistics_a4242ffc7d6e6fc3298db18c548b41b6309e0a68.png:

probability_and_statistics_aa51d5d01d036fa88e940ba1b2907b12627fa331.png

We must therefore identify a function class that is rich enough to uniquely identify whether probability_and_statistics_3e686986fa9cc2f56eb9cc431fd34a037a04fae1.png, yet restrictive enough to provide useful finite sample estimates.
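One common choice, which is an assumption here since the notes have not fixed a function class, is the unit ball of a reproducing kernel Hilbert space. In that case the supremum has a closed form and the squared biased estimate becomes a combination of average kernel values within and between the two samples. The sketch below uses a Gaussian kernel on one-dimensional data; the bandwidth, sample sizes and distributions are arbitrary illustrative choices.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))

def mmd_biased(xs, ys, kernel=gaussian_kernel):
    """Biased MMD estimate for the unit ball of the RKHS induced by `kernel`."""
    m, n = len(xs), len(ys)
    kxx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    kyy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(kxx + kyy - 2 * kxy, 0.0))

rng = random.Random(0)
sample_p = [rng.gauss(0.0, 1.0) for _ in range(200)]
sample_p_again = [rng.gauss(0.0, 1.0) for _ in range(200)]
sample_q = [rng.gauss(2.0, 1.0) for _ in range(200)]

mmd_same = mmd_biased(sample_p, sample_p_again)   # both samples from N(0, 1)
mmd_diff = mmd_biased(sample_p, sample_q)         # N(0, 1) versus N(2, 1)
print(mmd_same, mmd_diff)
```

The estimate is small but positive when both samples come from the same distribution, which is the bias referred to above, and clearly larger when the distributions differ.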

Wavelets

The problem with the Fourier transform is that, even though we can extract the frequencies, we do not know at which time these frequencies occur. This can be incredibly important in cases where the signal has "switches".

Idea is to

  • Start with low frequency / high wavelength
  • Slide over signal to extract low frequency
  • Subtract low frequency away from signal
  • Take higher frequency / smaller wavelength
  • Slide over signal to extract higher frequency

By subtracting the low frequency we're removing the overlap of the low frequency and high frequency without interfering with the high frequency signal! Thus it's possible to extract both low and high frequencies.

A mother wavelet is some function probability_and_statistics_69cc3a6dde1ae7395a4213dfdbd23aaf75cace74.png such that we can scale and translate to create new wavelets probability_and_statistics_811e347d4356dc02dd087effceaf9c7bc98092b9.png:

probability_and_statistics_585af800f36d86bc2b31157b116d792213893b70.png

where

  • probability_and_statistics_cb254ffe3e498884a0e8a2679f6c3a003f7219a5.png is scaling
  • probability_and_statistics_62d0596bf7b968d41aa765c30eee0fa137f87c97.png is translation

And the Fourier transform is then

probability_and_statistics_0fab39800f2fe1c71eb55e388cbc84f7a64c2e17.png

The condition required by a wavelet on probability_and_statistics_62c8769bb6a6c4840b4e6a9387478fdfea7ee55a.png is that:

probability_and_statistics_694b4961d9e8ee02d1487429cb46e5f80c35802e.png

Continuous Wavelet Transform

probability_and_statistics_7938d475a0de48f43047cc547ca896ea430af30b.png

and to invert we have

probability_and_statistics_d66e6a007c77399351f2d913d1b22ffb388b5c50.png

Suppose probability_and_statistics_06533eba579f5050f2511b94841d9efe169c005f.png is a wavelet, and probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png is an integrable and bounded function; then the convolution ψ * φ is a wavelet.

Wavelet basis

A mother wavelet probability_and_statistics_06533eba579f5050f2511b94841d9efe169c005f.png is such that probability_and_statistics_50e825f72c2237a5af3d9f981ca4f6657b210de9.png form an orthogonal basis for some subspace of probability_and_statistics_83b04133cc8e4358fd750b17aae3250ead3621be.png, hence

probability_and_statistics_e2447f3e2c1a1ab8126df7c45670f213c4f776d9.png

converges to probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png in the probability_and_statistics_83b04133cc8e4358fd750b17aae3250ead3621be.png norm!

Multi-Resolution Analysis (MRA)
  • Overview
    • Algorithm for constructing the different resolutions

    We consider wavelets probability_and_statistics_50e825f72c2237a5af3d9f981ca4f6657b210de9.png constructed from some mother wavelet probability_and_statistics_06533eba579f5050f2511b94841d9efe169c005f.png:

    probability_and_statistics_1fd1e8c84aef93dcacc71457df2e1eb6ee3f170f.png

    and we want to expand our signal in such a wavelet basis.

    Consider sequence of subspaces

    probability_and_statistics_236c9194e651aca18a0f4e5dbc2dd3379e8bab3f.png

    in probability_and_statistics_2829af117657e9afcc0bb0cf43ae8f151bb827bf.png, with the following properties:

    1. Nested

      probability_and_statistics_451a60e7af1ca1905939ac3f91bd1ac274f14f6b.png

    2. Union

      probability_and_statistics_4c39c864851ea01950e0a2813f41ebb60535fc71.png

      is dense in probability_and_statistics_2829af117657e9afcc0bb0cf43ae8f151bb827bf.png

    3. Orthogonality

      probability_and_statistics_3df2e81da386456ba3f23e80a8d8ae0668e241e6.png

    4. Increased "resolution"

      probability_and_statistics_1046d038251085db12bdac31cbfc62335d96448c.png

    5. probability_and_statistics_dcc8bd38a0b3833b4f4117e2ed83fd1d579a194e.png such that probability_and_statistics_a64385432296cd46fb14dd4c1ea412780f0c3dc7.png is an orthogonal basis in probability_and_statistics_570856150f1a547fb856ccc466530fb6ab845eb6.png
      • probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png is called the scaling function or father wavelet

    Let probability_and_statistics_7922c3c7a0999142416b5a40db16373fa910fd69.png be the "standard" scale where our mother wavelet probability_and_statistics_06533eba579f5050f2511b94841d9efe169c005f.png and father wavelet probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png live, i.e. probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png is probability_and_statistics_7d4e7cf6dfcda77d905cf34c3dbdb4a477a3998c.png for probability_and_statistics_f5ec04bb79c0a45ac6172c7dd6b4cae4718129ad.png.

    We can map probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png to probability_and_statistics_f377695d881c4f672d1bf48b016adc1c0890005c.png by

    probability_and_statistics_b0d514ab8d9edc53b7ed600d8cbdcdb229de07bf.png

    which is called the dilation or two-scale or refinement equation.

    We can repeat this argument for arbitrary probability_and_statistics_8f44de754519a6b4b737f2e24c2083a1a7cb4a03.png, so with probability_and_statistics_8fd69388cd14a5c9d61af64a353d92cd76f070a9.png, we write

    probability_and_statistics_ac3aaabdd2c0c22009a1632073d3d0c04da8969f.png

    Then,

    probability_and_statistics_2e6d7bc89a16940b690d5f04d31c66bc45ff439f.png

    which means

    probability_and_statistics_4c5971a52e543cb1b1b770126b825b2aa0c3f36b.png

    Finally, this tells us that probability_and_statistics_fee4ed35f18c566dd811a5ef05ba6781ffecc039.png (the mother wavelet) such that

    probability_and_statistics_c75e46f19470e61fa198a05590feb58bba6ce736.png

    constitutes an orthogonal basis for probability_and_statistics_92c35b439385c55095bf287bc01e6ae53b5ee7be.png.

    If probability_and_statistics_0620fb29ecae1671a835aacd8b909f530c125897.png is a multi-resolution analysis (MRA) with scaling function probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png, then there exists a mother wavelet probability_and_statistics_06533eba579f5050f2511b94841d9efe169c005f.png defined

    probability_and_statistics_7a36802cb7ed5bb3c552454a4910037164c60893.png

    where

    probability_and_statistics_272918c08cdcbb2e02b2fbefd979bf9aa93e3618.png

    which allows us to obtain an orthonormal basis probability_and_statistics_50e825f72c2237a5af3d9f981ca4f6657b210de9.png for probability_and_statistics_71a9cbcb40d988ead9d1d42c87a11d6c0d8c3d44.png which is dense in probability_and_statistics_2829af117657e9afcc0bb0cf43ae8f151bb827bf.png:

    probability_and_statistics_c75e46f19470e61fa198a05590feb58bba6ce736.png

Examples of wavelets
  • Haar wavelet

    probability_and_statistics_29e42c138c37383fbf68c51f9d5abe79610cb0bc.png
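To make the multi-resolution idea concrete, here is a sketch of one level of the discrete Haar transform, assuming the standard Haar mother wavelet (+1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere). Each step splits a signal of even length into a coarse approximation (scaled pairwise sums) and a detail part (scaled pairwise differences). The function names and the example signal are illustrative.

```python
import math

def haar_mother(x):
    """Standard Haar mother wavelet."""
    if 0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1:
        return -1.0
    return 0.0

def haar_step(signal):
    """One level of the orthonormal discrete Haar transform (even length)."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse_step(approx, detail):
    """Undo `haar_step`, recovering the original signal."""
    s = 1 / math.sqrt(2)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) * s, (a - d) * s])
    return signal

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_step(signal)
reconstructed = haar_inverse_step(approx, detail)
print(approx, detail, reconstructed)
```

Applying `haar_step` again to `approx` gives the next coarser resolution, mirroring the nested subspaces described above.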

Application

For more information about how and when to apply wavelet transforms, and which mother wavelet to use, check out e.g. MATLAB's documentation on this.

Testing

Notation

  • probability_and_statistics_ca49d060511c331df12455746852c1f9fdc3b98f.png and probability_and_statistics_a4680369a3334fa4e273bbd161e5ca60a5d7c8a5.png denote the acceptance and critical region, respectively
  • probability_and_statistics_f4cb67c3046f8cf86ce57c6e7e8d2baaba5fcfea.png denotes the critical value of some statistic probability_and_statistics_078b55cac678a6008bcb22d99d3132f63c6685fb.png
  • Critical region with a given significance level probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png is denoted

    probability_and_statistics_496cdb678a59714df0e7c14a9593c5afa5feee93.png

  • Type I error: Reject probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png when probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png is true with

    probability_and_statistics_b62499988b80dc956856b3fb17107c802bb3a0e7.png

  • Type II error: Fail to reject probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png when probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png is false (equiv., we accept probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png when we shouldn't):

    probability_and_statistics_8ffcd4440dfed89754543e100d7a58f7e00108ca.png

  • probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png denotes the test function:

    probability_and_statistics_12610be0132746ac7d314380f94178415e4bd653.png

  • probability_and_statistics_cff6c44c36dbf98a7c152fdf5d8aa8190a60573d.png refers to the expectation over the parametrized distribution probability_and_statistics_752512a5c4f58baddbd363fd71442225add91d29.png, i.e. the distribution obtained when the parameter is probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png.

Definitions

A hypothesis is simple if it defines probability_and_statistics_897406e9708070ff45ea262873eca616517c97df.png completely:

probability_and_statistics_ace205bc16c53411990e5416883c1db32a6815f0.png

otherwise, it's composite.

If probability_and_statistics_897406e9708070ff45ea262873eca616517c97df.png is parametric with more than one parameter, a composite hypothesis might specify the values of some, but not all, of them (e.g. one regression coefficient).

U-statistics are a class of unbiased statistics; this class arises naturally in producing minimum-variance unbiased estimators (MVUEs).

Important: in statistics there are TWO notions for a U-statistic

Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions.

Thus, nonparametric tests or distribution-free tests are procedures for hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed.

Let the data probability_and_statistics_9d2817748ac037d354e94db222c805fbebf658cb.png be given, and suppose we wish to test two simple hypotheses:

probability_and_statistics_e4f50719f61684b2a03c21553b3f74b6c145afd2.png

Suppose that we choose a test statistic probability_and_statistics_078b55cac678a6008bcb22d99d3132f63c6685fb.png and reject probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png if probability_and_statistics_a30de8c0a4fefd9efdbef07173582bea82a33da6.png for some critical value probability_and_statistics_f4cb67c3046f8cf86ce57c6e7e8d2baaba5fcfea.png.

This induces a partition of the sample space probability_and_statistics_f89b50eea8387290f17929e663d8b0ab909ba29c.png into two disjoint regions:

  • the rejection region probability_and_statistics_a4680369a3334fa4e273bbd161e5ca60a5d7c8a5.png:

    probability_and_statistics_5c269a00a61816585ca22821222302ce10ae5506.png

  • the acceptance region probability_and_statistics_ca49d060511c331df12455746852c1f9fdc3b98f.png:

    probability_and_statistics_ad603a2599e1704d44a729c324fddc1fe4e5d25b.png

We will also sometimes use the notation probability_and_statistics_cf414f313083d46b44bc120396fc41af6e07cf5c.png to denote the critical region which has the property

probability_and_statistics_50cb36736783e0e568418a3cb20f927acda0f57f.png

Consider probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png and probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png with corresponding p.d.f.'s probability_and_statistics_106d9a9a02be3580bd059a95323046ee40db2f85.png, probability_and_statistics_de42e4d2b7dae0bb37b5dc3fb5c6d54817a76799.png for probability_and_statistics_acf3b65b2f4f86069358e656f377dbb727298990.png.

For these hypotheses, comparison between the critical regions of different tests is in terms of

probability_and_statistics_123d4c950fe2fb687a6bcb9a5f2c5723a04d08e1.png

the power of a size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png critical region probability_and_statistics_cf414f313083d46b44bc120396fc41af6e07cf5c.png for alternative probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png.

A best critical region is then the critical region with maximum power.

There are two possible types of error:

  • Type I error: Reject probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png when probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png is true
  • Type II error: Fail to reject probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png when probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png is false
  • probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png denotes the probability of Type I error and is called the significance level (or size) of the test
  • probability_and_statistics_31aa0b82f4e4fe4be2c026d5eddd655a517a7f53.png denotes the probability of Type II error which is uniquely defined only if probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png is simple, in which case

    probability_and_statistics_9491e11d2d28cccbf862f193513c2c54c232cb49.png

    denotes the power of the test. For composite probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png, probability_and_statistics_7e78eec7a365d3a8952da7cc9dad5ee6ffb53050.png is the power function.

We can define the test function probability_and_statistics_fefb152feb7b73de9b3bd6b92dcf4a91f2afb1bb.png such that

probability_and_statistics_c3e1f24464479bdd2f84de5642d523298c8b5389.png

which has the property that

probability_and_statistics_a0ca06d0c3166f2de0f0febd125e78a38204aa7b.png

For discrete distributions, the probability that the test statistic lies on the boundary of the critical region, probability_and_statistics_1ffae5cd8759eb240f5449120293b114f361299e.png, may be non-zero.

In that case, it is sometimes necessary to use a randomized test, for which the test function is

probability_and_statistics_c8d4d162ce079f6d28ac8e65404e308b24199a4e.png

for some function probability_and_statistics_0cd36a252cfc501cf7ced34be4c3055ed7b3cacc.png and we reject probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png based on observed data probability_and_statistics_8ea286b57b111723d4787b43ffe4b3cdfc1614fc.png with probability probability_and_statistics_fefb152feb7b73de9b3bd6b92dcf4a91f2afb1bb.png.
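
As a minimal sketch of such a randomized test, assume X ~ Binomial(n, p) and that we test p = p0 against p > p0. This setting and the names `randomized_test` and `rejection_prob` are illustrative assumptions, not taken from the text above. The test rejects when x > c and rejects with probability gamma when x = c, with c and gamma chosen so that the size is exactly alpha.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def randomized_test(n, p0, alpha):
    """Find (c, gamma) so that the test has size exactly alpha under p0."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    tail = 0.0  # P(X > c) under H0, accumulated from the top
    for c in range(n, -1, -1):
        pc = binom_pmf(c, n, p0)
        if tail + pc > alpha:
            # Spend the remaining level on the boundary point x = c.
            return c, (alpha - tail) / pc
        tail += pc
    raise ValueError("no boundary point found (alpha too close to 1)")


def rejection_prob(n, p, c, gamma):
    """E_p[phi(X)]: the size at p = p0, the power at p > p0."""
    upper = sum(binom_pmf(k, n, p) for k in range(c + 1, n + 1))
    return upper + gamma * binom_pmf(c, n, p)


c, gamma = randomized_test(n=10, p0=0.5, alpha=0.05)
print(c, round(gamma, 4))
print(rejection_prob(10, 0.5, c, gamma))  # size under H0
print(rejection_prob(10, 0.8, c, gamma))  # power at p = 0.8
```

Without the randomization on the boundary, the attainable sizes for n = 10 and p0 = 0.5 jump between the tail probabilities of the binomial, so a size of exactly 0.05 would not be available.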

Suppose now there is a parametric family probability_and_statistics_6731125801b0ebf71912f80ad9c35391249e5eb2.png of alternative p.d.f.'s for probability_and_statistics_8ea286b57b111723d4787b43ffe4b3cdfc1614fc.png.

The power of a size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png critical region probability_and_statistics_cf414f313083d46b44bc120396fc41af6e07cf5c.png generalizes to the size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png power function

probability_and_statistics_6e876c2084f8ded4c67b70c1bb345f268bd9ab3e.png

A size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png critical region probability_and_statistics_cf414f313083d46b44bc120396fc41af6e07cf5c.png is then uniformly most powerful size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png (UMP size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png) if it has maximum power uniformly over probability_and_statistics_32df61edc77eb5a205e41ab76b7d34e41787caa5.png (probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png is NOT the power, probability_and_statistics_284e544994dfc6017bf26731df55f59aeb4c884b.png is).

A test is UMP if all its critical regions are UMP, or more formally:

A uniformly most powerful or UMP test, probability_and_statistics_ec10dc2cb16bca68b6b58c99c7225beccea4590e.png, of size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png is a test probability_and_statistics_078b55cac678a6008bcb22d99d3132f63c6685fb.png for which

  1. probability_and_statistics_f08564eadc3cedfeb89daa2dbfe90300b67a28a0.png
  2. Given any other test probability_and_statistics_079896695de87c2c29a6e49778b156a190f7390a.png for which probability_and_statistics_8bea7c3b2bee7f6ae334084f0e85aca470a6bbfa.png for all probability_and_statistics_038b69300d4433e53979cc18be4d37ef772166bf.png, we have

    probability_and_statistics_e085d5e54df0ffb3f0fec422c65fcddb45fc47fe.png

    i.e. the expectation given probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png is at least as large as for the less powerful statistic (whose test function is probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png).

A test probability_and_statistics_fefb152feb7b73de9b3bd6b92dcf4a91f2afb1bb.png of probability_and_statistics_ca86da18ab7c058acbfe59f85e39b8d5241d7438.png against probability_and_statistics_f471a1cba43fc0bd7659201398eb19628e28b38a.png is called unbiased of size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png if

probability_and_statistics_17300db83c47b237412878fd9c6b5b94b3d2f261.png

and

probability_and_statistics_e1403b87e31989653814a092e2a9217712ae8495.png

Informally, an unbiased test is one which has a higher probability of rejecting probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png when it is false than when it is true.

A test which is uniformly most powerful among the set of all unbiased tests is called uniformly most powerful unbiased (UMPU).

Two-sample test

The two-sample problem is the problem of comparing samples from two probability distributions by a statistical test of the null hypothesis that the distributions are equal against the alternative hypothesis that they are different. In short:

  • probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png corresponds to distributions being equal
  • probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png corresponds to distributions being different
Exact test

A test is exact if and only if

probability_and_statistics_f1c32c7a3a88b6c6e393d1495d055b62d300e04b.png

which is in contrast to non-exact tests which only have the property

probability_and_statistics_b87de7433ce7962a39e4e5f238ce17e8ad839e8d.png

for some "critical constant" probability_and_statistics_b05953cd58fcbee1edc86ab6cfa7b8de6898d459.png.

Hypothesis testing

For any size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png, the LR critical region is the best critical region for testing simple hypotheses probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png vs. probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png.

That is, suppose one is performing a hypothesis test between two simple hypotheses probability_and_statistics_53f581a1f3fc8ead4cea2804129ad69f247b66ef.png and probability_and_statistics_6cca317acdcb3056a530acf24c500e39506f1810.png, using the likelihood-ratio test with threshold probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png which rejects probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png in favour of probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png at a significance level of

probability_and_statistics_118558738b6641384279edfd371a5913ba3404f0.png

where

probability_and_statistics_013ede12f5e68dca45d29912e5767043cca9e529.png

Then, probability_and_statistics_d4f3f26f8833b2f041f6b06ecce0f65394ea946e.png is the most powerful test at significance level probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png.
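
The statement can be checked by brute force on a small finite sample space. The sketch below uses two made-up pmfs `f0` and `f1` on {0, 1, 2, 3, 4} (hypothetical numbers chosen only for illustration) and exact rational arithmetic; it compares the power of the likelihood-ratio region with the best power over all non-randomized critical regions of at most the same size.

```python
from fractions import Fraction as Fr
from itertools import combinations

# Hypothetical pmfs on the sample space {0, 1, 2, 3, 4}.
f0 = [Fr(8, 20), Fr(6, 20), Fr(3, 20), Fr(2, 20), Fr(1, 20)]  # under H0
f1 = [Fr(1, 20), Fr(2, 20), Fr(3, 20), Fr(6, 20), Fr(8, 20)]  # under H1
points = range(len(f0))


def prob(region, f):
    """Probability of a region (a collection of sample points) under pmf f."""
    return sum((f[x] for x in region), Fr(0))


def lr_region(k):
    """Likelihood-ratio critical region {x : f1(x) / f0(x) > k}."""
    return {x for x in points if f1[x] > k * f0[x]}


def best_power(alpha):
    """Largest power among all non-randomized regions of size at most alpha."""
    regions = (r for m in range(len(f0) + 1) for r in combinations(points, m))
    return max(prob(r, f1) for r in regions if prob(r, f0) <= alpha)


region = lr_region(Fr(2))
alpha = prob(region, f0)
# Region, its size, its power, and the best achievable power at that size.
print(sorted(region), alpha, prob(region, f1), best_power(alpha))
```

Exact fractions are used so that the size comparison `<= alpha` is not affected by floating-point rounding.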

Bahadur efficiency

Notation
  • probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png and probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png denote the null hypothesis and the alternative hypothesis
  • probability_and_statistics_5305ae43a2696267e347c39ab8c809b0a12d8841.png and probability_and_statistics_32df61edc77eb5a205e41ab76b7d34e41787caa5.png are the parameter sets corresponding to probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png and probability_and_statistics_a9a7732ebf75bc04e4ad301cfa1b72bdb5186c5a.png, respectively
  • A test statistic probability_and_statistics_c1de25df8b5f20ce257fb4f0689c7dd3e85b24f6.png based on a random sample probability_and_statistics_a89247658a6ab6f94101fa443058b5f748ace3fd.png
  • probability_and_statistics_038b69300d4433e53979cc18be4d37ef772166bf.png
  • probability_and_statistics_8140315109506163f404b1e37163fb730f068cc2.png
  • probability_and_statistics_2252bd8e62db6d74d3b95f27eb3e07183d618c89.png
  • probability_and_statistics_859f7649a0b76b3de799e83b4474849c8d41838a.png
  • probability_and_statistics_b18e92db526365b944b6fc426875d9bf45678769.png -value refers to the smallest significance level at which the observed data would lead us to reject probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png
  • probability_and_statistics_8b4a1f3258228c661e761669be2535cc7ba27956.png is the information number corresponding to probability_and_statistics_96abe2e9e06c6f5f3ff40400e5edfba9ce35e4e5.png and probability_and_statistics_b3e6c8d42a2d0d5c77dbe5110d67e2519d674ca9.png
Stuff
  • Assume large values of probability_and_statistics_e19eeadf07d6126bfa6ba9b193d3d6867ae0330f.png give evidence against probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png
  • For fixed probability_and_statistics_038b69300d4433e53979cc18be4d37ef772166bf.png and a real number probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png let

Random quantity

probability_and_statistics_75a843f56504199c2ec61d94db420a8811f119c5.png

is the probability_and_statistics_b18e92db526365b944b6fc426875d9bf45678769.png value corresponding to the statistic probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png when probability_and_statistics_f2f278a473884f29e46817cccf4129b065185aa4.png is the true parametric value.

For example, if

probability_and_statistics_393a50a9a2d80aa9e478da0b7360af8e229b152a.png

then the null hypothesis probability_and_statistics_83285fa3a3c1d79224dab65c37fcc71cef5a42d1.png is rejected at the significance level probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png.

If for probability_and_statistics_e4fc98479e13c95f5e6daa423ca32a9e5cd87047.png with probability_and_statistics_96abe2e9e06c6f5f3ff40400e5edfba9ce35e4e5.png probability one we will have

probability_and_statistics_0a4716e65552bf00ede994e74e846af238e9442d.png

then probability_and_statistics_e90654300663d6bab02d07cb3aa66a6b867ce2ef.png is called the Bahadur exact slope of probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png.

The larger the Bahadur exact slope, the faster the rate of decay of the probability_and_statistics_b18e92db526365b944b6fc426875d9bf45678769.png value under the alternative. It is known that for any probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png, probability_and_statistics_c2f2032c5e8cae7080120dae4b873dd9db9c4604.png where probability_and_statistics_8b4a1f3258228c661e761669be2535cc7ba27956.png is the information number corresponding to probability_and_statistics_96abe2e9e06c6f5f3ff40400e5edfba9ce35e4e5.png and probability_and_statistics_b3e6c8d42a2d0d5c77dbe5110d67e2519d674ca9.png.

A test statistic probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png is called Bahadur efficient at probability_and_statistics_284e544994dfc6017bf26731df55f59aeb4c884b.png if

probability_and_statistics_851862c5d2016edd73fda0163558e2763638382a.png

Bahadur efficiency allows one to compare two (sequences of) test statistics probability_and_statistics_23d7204738a504cb9e2d91ae676cd206b39333ec.png and probability_and_statistics_8efdc648c2adc92d597cc8c6fd5e1ab671bcfb81.png from the following perspective:

Let probability_and_statistics_bb4ed565cb0dd678a8e0c498670ac8f69af518ae.png be the smallest sample size required to reject probability_and_statistics_5305ae43a2696267e347c39ab8c809b0a12d8841.png at the significance level probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png on the basis of a random sample probability_and_statistics_ca2245850d70fd07ec71ee0539d81d105fcd2d6f.png when probability_and_statistics_284e544994dfc6017bf26731df55f59aeb4c884b.png is the true parametric value.

The ratio probability_and_statistics_3b6bdeacdaea8a4d4c7a0279be16f3be2a1b6b90.png gives a measure of relative efficiency of probability_and_statistics_23d7204738a504cb9e2d91ae676cd206b39333ec.png to probability_and_statistics_8efdc648c2adc92d597cc8c6fd5e1ab671bcfb81.png.

To reduce the number of arguments, probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png, probability_and_statistics_17da0758469180666a34cfa1eb0d65bd169e80ca.png and probability_and_statistics_284e544994dfc6017bf26731df55f59aeb4c884b.png, one usually considers the rv. which is the limit of this ratio, as probability_and_statistics_41ccb5a1c60a949cb7f04b1735ff65cd9bbfa59c.png. In many situations this limit does not depend on probability_and_statistics_17da0758469180666a34cfa1eb0d65bd169e80ca.png, so it represents the efficiency of probability_and_statistics_23d7204738a504cb9e2d91ae676cd206b39333ec.png against probability_and_statistics_8efdc648c2adc92d597cc8c6fd5e1ab671bcfb81.png at probability_and_statistics_284e544994dfc6017bf26731df55f59aeb4c884b.png with the convenient formula

probability_and_statistics_3bc6cf0995e6764d08fe3b6f32b25c7b81120e2c.png

where probability_and_statistics_c683b5fd4af60dc179214afb1963e5d3ed7df879.png and probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png are the corresponding Bahadur slopes.

That is, we can use the Bahadur efficiency to compare the number of samples two tests need in order to reject probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png.

Kolmogorov-Smirnov

As probability_and_statistics_bf3fe99c079b6d787b66abdd8c02bfd9ac53b784.png,

probability_and_statistics_045f0f134b4835d64522709fd795964613e670ac.png

where probability_and_statistics_68e148d7bf15c148ad9eae39f980d0ebac2296f9.png is the Kolmogorov-Smirnov statistic.

The empirical distribution function is given by

probability_and_statistics_fa04d57723eaa3079f9585e7cbc8cc6e370e20cd.png

Consistency and unbiasedness at a point

Fix probability_and_statistics_735c8abc62946070c7500473b48ada5ee5f5126e.png, then

probability_and_statistics_24f542413e260e9a3e2924f1bcc12d3d87345d9b.png

Consequently,

probability_and_statistics_657403573f5e56a8e4ced6ab67748c479d275e05.png

That is, probability_and_statistics_bb8cda3b46a82162a60fee7385ca5a57a1943eb7.png is an unbiased estimator of probability_and_statistics_76b571658194dba7df95d99f157fd065b37baca6.png for each fixed x. Also

probability_and_statistics_9435d76f6b60631cf8fb8154b9c559105c547352.png

Consequently, by the Chebyshev inequality,

probability_and_statistics_7bf2e211307cf5b5aecdf2ba5d503bc632cc7031.png

Therefore, probability_and_statistics_bb8cda3b46a82162a60fee7385ca5a57a1943eb7.png is consistent.
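
The pointwise behaviour above can be illustrated numerically. The sketch below assumes Uniform(0, 1) data (so that F(x) = x) and uses the standard facts that F_n(x) has mean F(x) and variance F(x)(1 - F(x))/n; the function names `ecdf` and `chebyshev_bound` are illustrative, not taken from the notes.

```python
import random


def ecdf(sample, x):
    """Empirical distribution function F_n(x) = (1/n) * #{i : X_i <= x}."""
    return sum(1 for xi in sample if xi <= x) / len(sample)


def chebyshev_bound(F_x, n, eps):
    """Upper bound on P(|F_n(x) - F(x)| >= eps) from Var F_n(x) = F(1 - F) / n."""
    return min(1.0, F_x * (1 - F_x) / (n * eps**2))


rng = random.Random(0)  # fixed seed so the run is reproducible
x, F_x = 0.3, 0.3       # for Uniform(0, 1), F(x) = x
for n in (10, 100, 10_000):
    sample = [rng.random() for _ in range(n)]
    # F_n(x) approaches F(x), and the Chebyshev bound shrinks like 1/n.
    print(n, ecdf(sample, x), chebyshev_bound(F_x, n, eps=0.05))
```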

Kolmogorov-Smirnov test

The Kolmogorov distribution is the distribution of the rv.

probability_and_statistics_350414625cf13642362b34e7685d200adc6a9402.png

where probability_and_statistics_7e72e97ae3df73e8d356e302bdb93d89b7c28807.png is the Brownian bridge. The cumulative distribution function of probability_and_statistics_ab9d9977871eda0af750024aad11a3b9b2f6c240.png is given by

probability_and_statistics_7979a2cfd764fb96c20f5765ac92298195417d6d.png

The empirical distribution function probability_and_statistics_fc63368bda84bdc47b5c68e00be09f635871e720.png for i.i.d. observations probability_and_statistics_83fcec7fe2a3faa4f10b5ac9b967d3b554aa4781.png is defined as

probability_and_statistics_8b465506c9e6d6e5645d9c88da14ce00a8aae8dc.png

where

  • probability_and_statistics_186a57ee41c3709cc9ff0cdf4b74fc2017f5d4d2.png is the indicator function defined by

    probability_and_statistics_ebe63a88c60f2951f94cb930c097be46cd536376.png

The Kolmogorov-Smirnov statistic for a given cumulative distribution function probability_and_statistics_76b571658194dba7df95d99f157fd065b37baca6.png is

probability_and_statistics_c8e236be2d14d75c78dd6e58d9ab604d24883e8a.png

Under probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png that the sample comes from the hypothesized distribution probability_and_statistics_76b571658194dba7df95d99f157fd065b37baca6.png

probability_and_statistics_79feec30035e584f88edf644c155204357660986.png

in distribution, where probability_and_statistics_7e72e97ae3df73e8d356e302bdb93d89b7c28807.png is the Brownian bridge.

If probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png is continuous, then under probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png probability_and_statistics_a89360d2a3e98979bf538854d89447918ff10ed5.png converges to the Kolmogorov distribution, which does not depend on probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png.

The Kolmogorov-Smirnov test (K-S test) is a non-parametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare one sample with a reference probability distribution (one-sample K-S test), or to compare two samples (two-sample K-S test).
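
A rough sketch of the one-sample computation, under the assumption that the hypothesized distribution function is continuous: the statistic D_n is evaluated at the jumps of the empirical distribution function, and an asymptotic p-value is read off the Kolmogorov distribution via the series 1 - 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 x^2). The function names are illustrative, and the truncated series is only an approximation (it is unreliable for arguments very close to 0).

```python
from math import exp, sqrt


def ks_statistic(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)|, evaluated at the jumps of F_n."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs)
    )


def kolmogorov_cdf(x, terms=100):
    """Truncated series for P(K <= x), clamped to [0, 1]."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * exp(-2 * k * k * x * x) for k in range(1, terms + 1))
    return min(1.0, max(0.0, 1 - 2 * s))


def ks_pvalue(sample, cdf):
    """Asymptotic p-value P(K > sqrt(n) * D_n); rough for small n."""
    d = ks_statistic(sample, cdf)
    return 1 - kolmogorov_cdf(sqrt(len(sample)) * d)


def uniform_cdf(x):
    """Distribution function of Uniform(0, 1), the hypothesized F here."""
    return min(1.0, max(0.0, x))


data = [0.1, 0.4, 0.7]  # made-up observations
print(ks_statistic(data, uniform_cdf), ks_pvalue(data, uniform_cdf))
```

Large values of sqrt(n) * D_n give small p-values and hence evidence against the hypothesized distribution.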

The one-sample KS test is:

F-test

An F-test is any statistical test in which the test statistic, called an F-statistic, has an F-distribution under the null hypothesis.

Residual-sum of squares between two models as an F-statistic

Notation
  • probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png denotes the number of samples
  • probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png denotes the number of regressors (including the constant term)
  • probability_and_statistics_9974194a836f8c06efda1c78a55ab184c6c2d746.png denotes the number of linear "restrictions" / "constraints", and therefore, also the degrees of freedom
  • probability_and_statistics_76c425f1c1b4193275045ddb6cf6d9b72be71d87.png is a probability_and_statistics_351b3b4339128131711a0ab132fea50efa7c0636.png matrix with full column rank probability_and_statistics_9974194a836f8c06efda1c78a55ab184c6c2d746.png, i.e. probability_and_statistics_cebdc67351308c136963628e22940d1ac10a4c5f.png, which translates into the constraints being "independent". This matrix is used to enforce the constraints on probability_and_statistics_31aa0b82f4e4fe4be2c026d5eddd655a517a7f53.png, e.g. if we don't want a constant term / intercept, probability_and_statistics_5542fb33072ef3092adb8829db90c8575e82b9fa.png is such that probability_and_statistics_10c59114d490acf3ff73d120c58be1240dc6bb78.png, if the constant term was in the first entry.
  • probability_and_statistics_88f46705fa3993e69d453d7b2179b55774fc2bde.png represents the hypothesized residual sum
Stuff

Consider two linear regression models with coefficients probability_and_statistics_91243d8b78ac5f45f957c8f2d5965763dfb78d32.png and probability_and_statistics_aca80bad88c522fb784093dd4562bca16786a245.png, with probability_and_statistics_16709cd5a38beb8d5318911d26c527b7ae7be11c.png. Then let probability_and_statistics_31794fd5bb4bc129c825a1b6ea13897823c3783b.png and probability_and_statistics_ecdbc8ddbb5c79db57fab0204b66de1d8292d3df.png denote the respective number of coefficients, with probability_and_statistics_163673f3554ad298dc623d1849b457f93c520ddd.png. Then letting

probability_and_statistics_1e16419632a7c6aad1c58f6ce60ab0e9d29075a6.png

we have

probability_and_statistics_6a60b3dc0cda65902a4bdd1aa122c3f50d9d6c79.png

i.e. the quantity above is F-distributed.

Note: observe that probability_and_statistics_e7f91cc03cce78e488cde92bd720bcb80f7e6480.png so the above is a positive quantity.
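
A small numerical sketch of this statistic, assuming the usual form ((RSS1 - RSS2) / (p2 - p1)) / (RSS2 / (n - p2)) with model 1 the restricted model. Here model 1 is a constant-only fit (p1 = 1) and model 2 a straight line (p2 = 2); the data and function names are made up for illustration.

```python
def rss_constant(y):
    """RSS of the restricted model y = a + error (p1 = 1 coefficient)."""
    mean = sum(y) / len(y)
    return sum((yi - mean) ** 2 for yi in y)


def rss_line(x, y):
    """RSS of the larger model y = a + b*x + error (p2 = 2 coefficients)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx          # OLS slope
    a = ybar - b * xbar    # OLS intercept
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


def f_statistic(rss1, rss2, p1, p2, n):
    """((RSS1 - RSS2) / (p2 - p1)) / (RSS2 / (n - p2))."""
    return ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))


x = [1, 2, 3, 4, 5]
y = [1.1, 1.9, 3.2, 3.8, 5.1]  # made-up observations
rss1, rss2 = rss_constant(y), rss_line(x, y)
print(rss1, rss2, f_statistic(rss1, rss2, p1=1, p2=2, n=len(y)))
```

Under the null hypothesis that the extra coefficient is zero (and with normal errors), this value would be compared against an F-distribution with (p2 - p1, n - p2) degrees of freedom.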

Consider the null-hypothesis probability_and_statistics_82776943e576139ea8fd5e3e31fc6c00aa210b6a.png. Recall the OLS solution is given by

probability_and_statistics_b1d16727a92c7913dc9b59e377f043e3be8aabbf.png

and we therefore have

probability_and_statistics_88f383b155cd46ae4b10b2f63bc7de57d99452aa.png

Let

probability_and_statistics_993cf92fcfe42dde5617cb8f7896262051c86d2d.png

and let, using Cholesky decomposition, probability_and_statistics_89c1c92593f5a0478673d0ba2e3c9bd22c53effb.png be such that

probability_and_statistics_3ab94513acc9fc79cfab7c240bb08568886715b6.png

we have a "matrix square root" of probability_and_statistics_6487ea584784e790ad686f358d49d6df0b500ee8.png in probability_and_statistics_c2f435302f52a988b0e1b9473848f752f2b4e08e.png. The purpose of this matrix is so we can define the following random variable:

probability_and_statistics_55e75381e9deef4958b040ff60b55293eb2e57bc.png

where probability_and_statistics_3d602866169e997d5246cc91cf036c3bdab9c12e.png denotes the probability_and_statistics_38c28c8d9bcf189e33217030c1fe25b31682ec23.png identity matrix. To see that this is indeed the case, observe that

probability_and_statistics_3da743cafe59e0edca266889321dcb7b253787d0.png

Using properties seen in the proof of probability_and_statistics_2f2452fcec9d82030246d7ed4a89c49d9f38e236.png, we know that this probability_and_statistics_819394bc33557d54ecc2b0d62dfae8541b0d6a97.png is independent of the rv.

probability_and_statistics_42ab429b3c4f8750c93cb6b87024807f5bd8cd4d.png

where

probability_and_statistics_d26ab65604dce873c536dd148c5901a4e844d3a8.png

where

probability_and_statistics_cf5e734c2cc8743bdac59c059047b507a18e5cfd.png

known as the residual maker matrix. Then we have

probability_and_statistics_d821d85922f307c76d1c0755ac73035282dad863.png

Recall that an F-statistic is a ratio between probability_and_statistics_ba9eee347467f62736fe9b49f05e9db47db8eba7.png, each divided by its corresponding degrees of freedom, hence the above. In particular, under probability_and_statistics_82776943e576139ea8fd5e3e31fc6c00aa210b6a.png, this reduces to the statistic

probability_and_statistics_8d9bfe0a328febb75304aeffc5b79f408f6ac3e2.png

Letting probability_and_statistics_fe6021897ca9f7005bc943b802d3d8c034a094e8.png (which is the number of components of probability_and_statistics_91243d8b78ac5f45f957c8f2d5965763dfb78d32.png we set to zero to get probability_and_statistics_aca80bad88c522fb784093dd4562bca16786a245.png) and probability_and_statistics_d238f665da2fea4458074f8ab5fca4cc7e299bc4.png (since this is how many parameters we now have), we conclude our proof.

Proof that probability_and_statistics_5bcd2134a59f9ee04ccf56513d918ac077582c58.png

Consider the regression model

probability_and_statistics_2fe41b3cfe896ffc2a92b1728c9ad6d4b3a6a204.png

with

  • probability_and_statistics_30680539b347f296825fefda028a6612e13c3cf4.png
  • probability_and_statistics_02053e7fa7b23033cb99b11c5d57f5719ca8f319.png
  • probability_and_statistics_71f2f0f25c8f797aab95a639b78d61af028ac049.png

Then the residual vector is estimated by

probability_and_statistics_5a8bd79fd8c15cb04087575ff0b06c4eaeb5fba7.png

And is distributed according to

probability_and_statistics_907a99c945bc735b7ce341f7066eb717bf84d9b6.png

and thus

probability_and_statistics_64537461a1fd6cc678e030ad4bbd34fb5f8211e3.png

Consider the following linear model

probability_and_statistics_2fe41b3cfe896ffc2a92b1728c9ad6d4b3a6a204.png

where probability_and_statistics_611c85a77eda0b66ca1c486baf1b5e4625e24d6e.png, probability_and_statistics_02053e7fa7b23033cb99b11c5d57f5719ca8f319.png and probability_and_statistics_7428a50eaa01b6404460feaa96bd4fb3b6d867f3.png. The vector of residuals is estimated by

probability_and_statistics_7b43c51792fcacd76550218c9915623e74d4a4b0.png

where

probability_and_statistics_cf5e734c2cc8743bdac59c059047b507a18e5cfd.png

Since trace is invariant under cyclic permutations, we have

probability_and_statistics_1fb072344da96c06f8ce088a8330ffb61fa0716a.png

since probability_and_statistics_178c5237221324e29958c3d431640dd5ea8cc802.png matrices, and so probability_and_statistics_28c57148f4d0f0e3df4d470dc9dda7473a96e7cb.png. Therefore

probability_and_statistics_2c42ae18813d89a23eceff7148d8b698bc0426aa.png

Furthermore,

probability_and_statistics_46d7d3001667bdbedbe0bf9262fc98ca35acb730.png

which is seen by the fact that

probability_and_statistics_92866074a9562ba275e3f08ad3449952dcbe77d9.png

We then have the following sequence of implications:

  • probability_and_statistics_07f5e3436369143c7d2fef720cc0e460b36f46fa.png
  • probability_and_statistics_303406ce21187f67c3c10232935de67bfc7e18a2.png probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png only has eigenvalues probability_and_statistics_07b269ab2ac24e4eeaaa1cb31593283d236cad7d.png and probability_and_statistics_6b314f00387982277c754a67e9002cc0f3dd7144.png
  • probability_and_statistics_303406ce21187f67c3c10232935de67bfc7e18a2.png probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png is normal and so there exists a unitary matrix probability_and_statistics_b339a910bb6007d4d79e9c904b54a684fcf79e70.png such that

    probability_and_statistics_b7dd4135cc48c530eba783a67952e3653b88d49a.png

Recall that

probability_and_statistics_0824c69484cd8987c9a5ad9bc997c662efa14c9e.png

which is seen by computing probability_and_statistics_02b243e297e96c69df7e5041a8d20a23975c562d.png and using probability_and_statistics_36b913b3d90a7aee1cc623cee329eec1bd0f71ac.png. From this it follows that

probability_and_statistics_4c1e5b6a242e76b7f1f77fe8bd06f5f188b469d7.png

where probability_and_statistics_8ac72cf006648c007f1c43d510e8fc1bc62d53d3.png is the corresponding probability_and_statistics_fb6030c5c8a6f7124650661fc53fd209abdb8a83.png vector (since the rest of the components have zero variance and thus add nothing to the norm).

Furthermore, since probability_and_statistics_b339a910bb6007d4d79e9c904b54a684fcf79e70.png is unitary, as mentioned earlier, we have

probability_and_statistics_560ae908d4c21faed041d3c53cd1d0efee60c196.png

Recall that the residual sum-of-squares is given by probability_and_statistics_b78c6806b00cdec66d4fc47e56342cd27150761e.png, and so we arrive at our result

probability_and_statistics_297d4bc2bd7206324663559c5996b784ebda6dc5.png

Finally, observe that this implies

probability_and_statistics_64537461a1fd6cc678e030ad4bbd34fb5f8211e3.png
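
As a sanity check by simulation, the sketch below assumes the result takes the standard form RSS / sigma^2 ~ chi-squared with n - p degrees of freedom, so that the average of RSS / sigma^2 over many simulated data sets should be close to n - p. It uses a simple linear regression (p = 2) with made-up coefficients; all names are illustrative.

```python
import random


def rss_line(x, y):
    """Residual sum of squares of the OLS fit y = a + b*x."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


def mean_scaled_rss(n, sigma, reps, seed=0):
    """Monte Carlo average of RSS / sigma^2 for y = 1 + 2x + N(0, sigma^2) noise."""
    rng = random.Random(seed)
    x = list(range(n))
    total = 0.0
    for _ in range(reps):
        y = [1.0 + 2.0 * xi + rng.gauss(0, sigma) for xi in x]
        total += rss_line(x, y) / sigma**2
    return total / reps


n, p = 10, 2
# The two printed numbers should be close: simulated mean vs. n - p.
print(mean_scaled_rss(n, sigma=1.5, reps=5000), n - p)
```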

Similarity testing

Notation

  • probability_and_statistics_78a8b7f254d3b1024daabc7b4dee0a307b42b191.png, which can take on values probability_and_statistics_cc0723e592ef24ce779e431a28137a1e355c6e6f.png
  • probability_and_statistics_92bc1bb3def91e3a5d1f599d4afcfb9c09190a2d.png denotes a test

Stuff

Test probability_and_statistics_7bfb9dd0df8501aac6f232db1dab026a56c9764f.png vs. probability_and_statistics_e0a7f685be2c7bdc2bbc621a7c156958c823898a.png.

probability_and_statistics_fefb152feb7b73de9b3bd6b92dcf4a91f2afb1bb.png is a test of size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png, i.e.

probability_and_statistics_9ea9818cb1ec86d0f754ea7b6f03451cd5025ba2.png

Then probability_and_statistics_1827136eeff990e82cb2b9950aa049a32cc2f910.png is a similar test of size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png.

probability_and_statistics_8cac4925817d276c8dfb1dc75c636a911b7ddac1.png with probability_and_statistics_9af4f3cb256a760c8d1952b6750de82fd78bffeb.png on the boundary of probability_and_statistics_5305ae43a2696267e347c39ab8c809b0a12d8841.png.

Confidence regions

Notation

  • probability_and_statistics_8ea286b57b111723d4787b43ffe4b3cdfc1614fc.png is drawn from some distribution

    probability_and_statistics_31eb5dd5e94af711c6bc7afa1559ad7f2acaa734.png

  • probability_and_statistics_ef245bc33d572e216ace850f6c5c0735eaa17f6b.png
  • probability_and_statistics_7b4ad8d81c420ed129eb6d347a4c18bdb41e16c2.png is a size probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png critical region for probability_and_statistics_f1457a88037430cfc61fa0285be86d1467d12769.png

    probability_and_statistics_1f55f0c992a64ea39ec6613afe2a3ba026284b08.png

    with

    probability_and_statistics_fd294ecbe897587d3952a8e3a561ef8302d9cffa.png

  • critical region refers to the set of values of the sample (or of the test statistic) that lead to rejection of the hypothesis tested; its probability when the hypothesis is true is the size of the test.

Stuff

  • Generalization of confidence intervals to multiple dimensions

probability_and_statistics_a5224b794f70e3ce52cadfe6b3595f63f35e4796.png is a probability_and_statistics_35a630858d2a37f7502a38e04d162ddb8d7c717e.png confidence region for probability_and_statistics_06533eba579f5050f2511b94841d9efe169c005f.png if

probability_and_statistics_934cc273648d0498092016e4950e66c53685bae2.png

If probability_and_statistics_cc47daf15ade6b05c514b3412752985a896bad29.png then

probability_and_statistics_9b7238ebe1a22e7cc50613d208f8ef27a46dbc86.png

Pivotal quantities

probability_and_statistics_0c47ea876769f4b9a16468d7daf4b1a9fbf86dd6.png which has a distribution independent of probability_and_statistics_06533eba579f5050f2511b94841d9efe169c005f.png, i.e. probability_and_statistics_f8a2b400bf9def7b5b67035dfb5f06e148ab7065.png is a pivotal statistic, that is

probability_and_statistics_7c55c20e48a2c05d6f5fbc2d3dfc2ce5a976890f.png

Then we can construct a value probability_and_statistics_ee36b85620beda2dbfa25871dbf78ab978936183.png such that we obtain a confidence region following:

probability_and_statistics_434ccab1aff0e40faffad85de06bdfbece7eae3c.png
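
For a concrete one-dimensional instance, assume i.i.d. N(mu, sigma^2) data with known sigma. Then sqrt(n) (Xbar - mu) / sigma is standard normal whatever mu is, so it is a pivot, and inverting it gives the familiar z-interval. The sketch below (names are illustrative) builds that interval and estimates its coverage by simulation.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(sample, sigma, alpha=0.05):
    """1 - alpha interval for mu from the pivot sqrt(n) (Xbar - mu) / sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half = z * sigma / sqrt(len(sample))
    centre = fmean(sample)
    return centre - half, centre + half


def coverage(mu, sigma, n, reps, seed=0):
    """Fraction of simulated samples whose interval contains the true mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = z_interval([rng.gauss(mu, sigma) for _ in range(n)], sigma)
        hits += lo <= mu <= hi
    return hits / reps


print(z_interval([1.0, 2.0, 3.0], sigma=1.0))
print(coverage(mu=0.0, sigma=1.0, n=20, reps=4000))  # should be near 0.95
```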

Decision Theory

Notation

  • probability_and_statistics_5eac6e953e096bd8fddad7a4d6456af7f2e43d43.png denotes the sample space i.e. probability_and_statistics_93b3119b26d7326c51265c0900e17e13d3a254c2.png
  • probability_and_statistics_cde2049a978c39a95d0415d2d16d75d9f9ea63fd.png denotes the parameter space
  • probability_and_statistics_e33f607c4019714ee878314ca6249a2d85a86454.png denotes a family of probability distributions
  • probability_and_statistics_4cc68ebefc13f65a9f085f468fdf9abda339a257.png is the action space, i.e. the set of actions an experimenter can take after observing data, e.g. reject or accept probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png, estimate the value of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png, etc.
  • probability_and_statistics_a838ca57255eeed208525bb770103534f1eba502.png denotes the loss function
  • probability_and_statistics_aa70807f5f2120d9b2f4140a9b0b60da8acc8006.png denotes the set of decision rules, where each probability_and_statistics_5df78f0d0c6d495682b83cab285aa3858d843ea2.png is a function probability_and_statistics_4abf12e170e6518eb3ab13ccf351235db7a7460b.png that associates a particular decision with each possible observed data set.

Stuff

For a parameter probability_and_statistics_038b69300d4433e53979cc18be4d37ef772166bf.png, the risk of a decision rule, probability_and_statistics_01aac6f51c6e4bb7092c0bd6b155a3467d0a25f3.png, is defined as

probability_and_statistics_bd2e02536ea61d36a6c0ff36789e951dcf40b37e.png

In other words, the risk is the expected loss of a particular decision rule probability_and_statistics_01aac6f51c6e4bb7092c0bd6b155a3467d0a25f3.png when the true value of the unknown parameter is probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png.

Note: this is fundamentally a frequentist concept, since the definition implicitly invokes the idea of repeated samples from the sample space probability_and_statistics_5eac6e953e096bd8fddad7a4d6456af7f2e43d43.png and computes the average loss over these hypothetical repetitions.
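
To make the risk function concrete, the sketch below assumes n Bernoulli(theta) observations summarized by their sum S, squared-error loss, and two candidate estimators of theta: the sample mean and a shrunk version. These choices and all names are illustrative assumptions. The risk is computed exactly by summing the loss over the binomial distribution of S.

```python
from math import comb, sqrt


def risk(rule, theta, n):
    """R(theta, d) = E_theta[(d(S) - theta)^2] with S ~ Binomial(n, theta)."""
    return sum(
        comb(n, s) * theta**s * (1 - theta) ** (n - s) * (rule(s, n) - theta) ** 2
        for s in range(n + 1)
    )


def sample_mean(s, n):
    """The usual estimator S / n."""
    return s / n


def shrunk(s, n):
    """(S + sqrt(n)/2) / (n + sqrt(n)): pulls the estimate towards 1/2."""
    return (s + sqrt(n) / 2) / (n + sqrt(n))


n = 16
for theta in (0.1, 0.3, 0.5):
    print(theta, risk(sample_mean, theta, n), risk(shrunk, theta, n))
```

In this example the two risk functions cross: the sample mean has smaller risk near theta = 0.1 and larger risk near theta = 0.5, so neither rule is better for every value of theta.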

Selection of decision rule

Given two possible decision rules probability_and_statistics_c683b5fd4af60dc179214afb1963e5d3ed7df879.png and probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png, the rule probability_and_statistics_c683b5fd4af60dc179214afb1963e5d3ed7df879.png strictly dominates the rule probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png if

probability_and_statistics_5053df19b9e7b49fbadc68194a667a5391bcab24.png

and there exists at least one value of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png, e.g. θ', for which

probability_and_statistics_cd990feed20b0e5814059b4259cedc8f588026d8.png

It is clear that we would always prefer probability_and_statistics_c683b5fd4af60dc179214afb1963e5d3ed7df879.png to probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png.

If, for a given decision rule probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png, there exists some decision rule probability_and_statistics_c683b5fd4af60dc179214afb1963e5d3ed7df879.png that strictly dominates probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png, then probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png is said to be inadmissible.

Conversely, if there is no decision rule that dominates probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png, then probability_and_statistics_c17521332c39ba07ddfb60aa41e25214f5621c6a.png is said to be admissible.

That is, domination requires the risk to be less than or equal for every value of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png, with strict inequality for at least ONE value of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png.

Unbiasedness

We say a decision rule is unbiased if

probability_and_statistics_9a8235505574c55a67910311fbef92ccbe16d30e.png

i.e. the expected loss of the decision rule should be minimized at the true value of the parameter probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png.

Minimax decision rules

A decision rule is minimax if it minimizes the maximum risk

probability_and_statistics_30e702229f538633e79f3d99ede44d66258f3c50.png

which can also be written

probability_and_statistics_b73c8c0026394d97c5acae1d28cc148eed9ef59a.png
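
For a finite set of rules and parameter values, the minimax rule can be read off a table of risks. A minimal sketch follows; the risk values and the names `d1`, `d2`, `d3`, `t1`, `t2`, `t3` are hypothetical.

```python
# Hypothetical risk table: risk[rule][theta].
risk = {
    "d1": {"t1": 1.0, "t2": 4.0, "t3": 2.0},
    "d2": {"t1": 3.0, "t2": 3.0, "t3": 3.0},
    "d3": {"t1": 0.5, "t2": 5.0, "t3": 1.0},
}

def max_risk(risk_table, rule):
    """Worst-case risk of a rule over all parameter values."""
    return max(risk_table[rule].values())

def minimax_rule(risk_table):
    """Rule whose worst-case risk is smallest."""
    return min(risk_table, key=lambda d: max_risk(risk_table, d))

best = minimax_rule(risk)
```

Here `d2` is selected: its risk is never the smallest at any single parameter value, but its worst case is the least bad.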

Bayes decision rules

The Bayes risk of a decision rule, probability_and_statistics_01aac6f51c6e4bb7092c0bd6b155a3467d0a25f3.png, is defined by

probability_and_statistics_4c6c4c2066079e77c2b9c3e9b504169818851b88.png

or by a sum in the case of a discrete-valued probability distribution.

A decision rule is a Bayes rule with respect to the prior probability_and_statistics_745ca11471fe2e19acda2f501392c2892144e2a1.png if it minimizes the Bayes risk, i.e.,

probability_and_statistics_6dbe3a197aecf265cd950cd0802d88dd5c169b18.png

This definition of a Bayes rule assumes that the infimum is achieved by some rule, which might not always be true.

In those cases, for any probability_and_statistics_f87e60cad32b94a8ab30bfe95c7b98ff47f0bdcc.png, we can find a decision rule probability_and_statistics_89f36090974d68283852c968ec618b8d335a3752.png such that

probability_and_statistics_9b3bd453172a29552a1f7daa4dab0a7dd0e33b46.png

Such a rule is said to be probability_and_statistics_bdb48f03004dadf8bd891a01a8758b913f48cd28.png-Bayes wrt. the prior probability_and_statistics_745ca11471fe2e19acda2f501392c2892144e2a1.png.

A useful choice of prior is the one that is most conservative in its estimate of risk. This gives rise to the concept of a least favourable prior.

We say probability_and_statistics_876b4aed6bc5cb00b2a55d123c67be9c745553e4.png is a least favourable prior if, for any other prior probability_and_statistics_d2866cfe52500c2c823a8edef513096863891f2e.png we have

probability_and_statistics_98f11bbb2982a655c83556a490497e03d938ad3e.png

where probability_and_statistics_33dd0132088d9c3da39e2baae58483dbd2052a49.png are the Bayes (decision) rules corresponding to probability_and_statistics_745ca11471fe2e19acda2f501392c2892144e2a1.png and probability_and_statistics_c3b6de38a04c48fe8d6b2bd835c6a015114c7646.png.
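
With a finite parameter space the Bayes risk is a prior-weighted sum of risks, so a Bayes rule can be found by direct comparison. A minimal sketch, using a hypothetical risk table and hypothetical priors:

```python
# Hypothetical risk table: risk[rule][theta].
risk = {
    "d1": {"t1": 1.0, "t2": 4.0, "t3": 2.0},
    "d2": {"t1": 3.0, "t2": 3.0, "t3": 3.0},
    "d3": {"t1": 0.5, "t2": 5.0, "t3": 1.0},
}

def bayes_risk(rule_risk, prior):
    """Prior-weighted average of the risk over the parameter values."""
    return sum(prior[t] * rule_risk[t] for t in prior)

def bayes_rule(risk_table, prior):
    """Rule minimizing the Bayes risk with respect to the given prior."""
    return min(risk_table, key=lambda d: bayes_risk(risk_table[d], prior))

prior = {"t1": 0.6, "t2": 0.1, "t3": 0.3}          # hypothetical prior
point_prior = {"t1": 0.0, "t2": 1.0, "t3": 0.0}    # all mass on t2

rule_under_prior = bayes_rule(risk, prior)
rule_under_point_prior = bayes_rule(risk, point_prior)
```

In this table the point mass on `t2` is a least favourable prior: its Bayes rule `d2` attains Bayes risk 3, and no prior can push the minimal Bayes risk above 3 because `d2` has constant risk 3.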

Randomized decision rules
  • probability_and_statistics_4041476b998bc5a6ca4d7b8f4880e883a5a433a8.png decision rules probability_and_statistics_f99063d7e492bbf0275f5befd4c30f1fa6cb6d26.png which we choose from with probabilities probability_and_statistics_ad7c64fa2bc032946142928ab34c2e0f92a3dc12.png, with probability_and_statistics_14448ab2792f689645067382e350d34ffe32c130.png and probability_and_statistics_e3c172b700fa815dd1e7a4a4c96ca56ea91ce6d6.png
  • Define probability_and_statistics_3d9e651c1975240bc6a523f40949fda4754b7c45.png to be the rule "select decision rule probability_and_statistics_a6e4c085e864046fb8e1d722f07c4b4ac83f0db3.png with prob. probability_and_statistics_4a77bca9657725862e67fa3d15043c1dd9654cf1.png", called the randomized decision rule, which has risk

    probability_and_statistics_fb12dfd2474ecc6c4701a173c325a7cd25909f69.png

  • Sort of a linear combination of decision rules
  • Minimax rules are often this way
    • Supremum over probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png of the risk for probability_and_statistics_eb5109be70871b7df323835016360ccce02b5466.png is smaller than the supremum of the risk for any of the decision rules individually
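
The points above can be illustrated numerically. A minimal sketch, assuming two hypothetical rules evaluated at two parameter values; the risk numbers are made up for illustration.

```python
def randomized_risk(risks, probs):
    """Risk of the randomized rule at each theta: sum_i p_i * R(theta, d_i)."""
    if abs(sum(probs) - 1.0) > 1e-12 or any(p < 0 for p in probs):
        raise ValueError("probs must be non-negative and sum to 1")
    n_theta = len(risks[0])
    return [sum(p * r[j] for p, r in zip(probs, risks)) for j in range(n_theta)]

# Hypothetical risks: risks[i][j] = R(theta_j, d_i).
risks = [
    [1.0, 4.0],  # d_1 is good at theta_1, bad at theta_2
    [4.0, 1.0],  # d_2 is the mirror image
]
mixed = randomized_risk(risks, [0.5, 0.5])
worst_individual = min(max(r) for r in risks)
```

Mixing the two rules with equal probability gives a constant risk whose maximum is below the maximum risk of either rule on its own.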

Finding minimax rules

Suppose that probability_and_statistics_5591d41a19eea113d103e543190b27334ce376a0.png is a Bayes (decision) rule corresponding to prior probability_and_statistics_19d857d854e59c886e2349963bea802b43f0e6df.png and probability_and_statistics_e08d9aab02198f00c398866741ad1f3c9eabf5b1.png, then

  1. probability_and_statistics_5591d41a19eea113d103e543190b27334ce376a0.png is a minimax decision rule
  2. probability_and_statistics_745ca11471fe2e19acda2f501392c2892144e2a1.png is the least favourable prior

Admissibility of Bayes (decision) rules

We have the following three theorems:

Assume that the parameter space, probability_and_statistics_0bd5b006a06185991991f02618b7f99b4adf9ff7.png, is finite and a given prior probability_and_statistics_745ca11471fe2e19acda2f501392c2892144e2a1.png gives positive weight to each probability_and_statistics_ca21b159c150deb1c1d9f0bb5f72aef9e4b607f6.png.

Then a Bayes (decision) rule wrt. probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png is admissible.

If a Bayes (decision) rule is unique, it is admissible.

Let probability_and_statistics_cde2049a978c39a95d0415d2d16d75d9f9ea63fd.png be a subset of the real line. Assume that the risk functions probability_and_statistics_e364fc19bea14bb8b9bf0e8cdf7c6b158f36a805.png are continuous in probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png for all decision rules probability_and_statistics_01aac6f51c6e4bb7092c0bd6b155a3467d0a25f3.png.

If the prior probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png has the property that for any probability_and_statistics_f87e60cad32b94a8ab30bfe95c7b98ff47f0bdcc.png and any probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png, the interval probability_and_statistics_721b012709331928e97b84e2a846b94050e36c08.png has non-zero probability under probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png, then the Bayes (decision) rule wrt. probability_and_statistics_108e48865692dff7955ba9ddbee0c1e718d23960.png is admissible.

James-Stein Estimators

Notation

  • probability_and_statistics_0f49318ea207a7547b899fc731a098d211c5b163.png
  • probability_and_statistics_a93dd2d34c545bf0ad39d63f74556a9aa9b76e6a.png is the square loss function
  • Risk of a decision rule is then given

    probability_and_statistics_b290a7c4be10c56b6df844be46752b975931602b.png

Stein's Paradox

The class of James-Stein estimators of the form

probability_and_statistics_b03422c06070b339d166c0558fd7fa13a8432609.png

has smaller risks that are also independent of probability_and_statistics_daf4c98ec256e58b010370c12204e88945e82b90.png and hence strictly dominate the natural estimator

probability_and_statistics_c048d43ecae2efdc955352baeafd724d5c25e18b.png

These are called shrinkage estimators as they shrink probability_and_statistics_acf3b65b2f4f86069358e656f377dbb727298990.png towards probability_and_statistics_07b269ab2ac24e4eeaaa1cb31593283d236cad7d.png when probability_and_statistics_6f197568551da514133d958a43c73e7d5d7b5c12.png, and they have the following consequence: folding in information from variables that are independent of the variable of interest can, on average, improve estimation of the variable of interest!
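
The effect can be checked by simulation. The sketch below assumes the common form of the estimator, (1 - a / ||X||^2) X with a = p - 2, for X distributed as N_p(μ, I) with p = 10 under squared-error loss; these specifics, and the function names, are assumptions made for illustration.

```python
import random

def james_stein(x, a=None):
    """Shrink x towards the origin by the factor 1 - a / ||x||^2."""
    p = len(x)
    if a is None:
        a = p - 2  # assumed choice of the shrinkage constant
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - a / norm_sq
    return [shrink * v for v in x]

def mc_risks(mu, reps=5000, seed=1):
    """Monte Carlo risks of the natural estimator X and of James-Stein."""
    rng = random.Random(seed)
    risk_natural = risk_js = 0.0
    for _ in range(reps):
        x = [rng.gauss(m, 1.0) for m in mu]
        js = james_stein(x)
        risk_natural += sum((xi - m) ** 2 for xi, m in zip(x, mu))
        risk_js += sum((ji - m) ** 2 for ji, m in zip(js, mu))
    return risk_natural / reps, risk_js / reps

natural_at_origin, js_at_origin = mc_risks([0.0] * 10)
natural_far, js_far = mc_risks([3.0] * 10)
```

In this simulation the shrinkage estimator has lower estimated risk at both values of μ, with the largest gain near the origin.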

Suppose

probability_and_statistics_5575928b10ecefe429051a3c5a000d68a4862b5d.png

and probability_and_statistics_9fb5b5fcbd74e063a4011f8628b09a2034aa98c8.png is an almost differentiable function then

probability_and_statistics_d19066d7a0d6cdeb24603f2d3f4d60b3c2fa9052.png

First we do the univariate case.

Suppose

probability_and_statistics_ab30650f56f569c56f769924445ca7c8dcb5eaf3.png

which is absolutely continuous, and

probability_and_statistics_a825ac9df80724dea43fac4960fadd064eceb0fa.png

Using change of variables to set probability_and_statistics_52e25c58f561a8944a7de29ee6890051cc4b7d3f.png and probability_and_statistics_bcc31f1493e7c347a3ebfc365765040aeafca4e7.png, then probability_and_statistics_819394bc33557d54ecc2b0d62dfae8541b0d6a97.png is std. normal. Then

probability_and_statistics_1054a9faac93518e864cc8ac165df219ed27b167.png

Then one simply substitutes into Stein's lemma and verifies that it is indeed satisfied.

Bayesian Statistics

Model comparison

Bayesian Information Criterion (BIC)

Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred.

The BIC is defined as:

probability_and_statistics_c21d7cd537e189b676e798a8d1dcb88a8b346d25.png

where

  • probability_and_statistics_f66bd69d7d990489c5f29fa84f79877bb094f11b.png is the maximized value of the likelihood function of the model
  • probability_and_statistics_2a61cee9c707231c73a1fcb6992bc4bb552044fe.png is the observed data
  • probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png is the number of data points in probability_and_statistics_2a61cee9c707231c73a1fcb6992bc4bb552044fe.png
  • probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png is the number of parameters in the model

BIC is an asymptotic result derived under the assumption that the data distribution belongs to an exponential family.

Limitations:

  1. Approximation only valid for sample sizes probability_and_statistics_877689a33a05aefb1213d9aa9f85f888bdaf9de4.png
  2. The BIC cannot handle complex collections of models as in the variable selection problem in high-dimensions
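
As a small illustration, the sketch below computes the BIC assuming the standard form k ln(n) - 2 ln(L̂); the log-likelihood values and model sizes are hypothetical.

```python
import math

def bic(max_log_likelihood, n, k):
    """BIC = k * ln(n) - 2 * ln(L_hat), with ln(L_hat) passed in directly."""
    return k * math.log(n) - 2.0 * max_log_likelihood

# Hypothetical fits to the same n = 100 data points.
bic_small = bic(max_log_likelihood=-120.0, n=100, k=2)
bic_large = bic(max_log_likelihood=-118.5, n=100, k=5)
preferred = "small" if bic_small < bic_large else "large"
```

The larger model fits slightly better, but the penalty on its three extra parameters outweighs the gain, so the smaller model is preferred here.
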
Akaike Information Criterion (AIC)
  • Given collection of models for some data, estimates quality of each model, relative to each of the other models
  • NOT a test of model in a sense of testing a null hypothesis, i.e. says nothing about the absolute quality, only relative quality

Suppose we have a statistical model of some data.

Let:

  • probability_and_statistics_09687bce7a8acae6a3b93666202b85ee05e5e796.png be the number of estimated parameters in the model
  • probability_and_statistics_f66bd69d7d990489c5f29fa84f79877bb094f11b.png be the maximum value of the likelihood function for the model

Then the Akaike Information Criterion (AIC) value of the model is

probability_and_statistics_5afc6011b83f42e73b011380fa4ad92d01f794a6.png

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value.

Thus, AIC rewards goodness-of-fit (as assessed by the likelihood function), but it also penalizes a large number of parameters (complexity).

AIC is based on information theory. Suppose the data was generated by some unknown process probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png, and we're considering two candidate models, probability_and_statistics_ec2b66fba8710941a8b5e7ae87bbcb4b43529db0.png and probability_and_statistics_2f0ab332f2fe56568aae0344a49e76c077dd22ba.png. We could then compute the KL-divergence of probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png and probability_and_statistics_ec2b66fba8710941a8b5e7ae87bbcb4b43529db0.png, probability_and_statistics_4ae5c319807271bc1483c0ee00fe1822934c3ed9.png, and of probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png and probability_and_statistics_2f0ab332f2fe56568aae0344a49e76c077dd22ba.png, probability_and_statistics_381000d615687f47bbc145c9a13594d85fc37982.png, i.e. the "loss of information" by representing probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png using probability_and_statistics_ec2b66fba8710941a8b5e7ae87bbcb4b43529db0.png or probability_and_statistics_2f0ab332f2fe56568aae0344a49e76c077dd22ba.png, respectively. One could then compare these values, and choose the candidate model which had the smallest KL-divergence with probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png.

Asymptotically, making this choice is equivalent to choosing the model probability_and_statistics_d949c842c804b9606f2b64f720e8a3fa64fa0223.png with the smallest AIC! Note, however, that AIC can be a bad comparison if the number of samples is small.
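
A minimal sketch of an AIC comparison, assuming the standard form 2k - 2 ln(L̂) and two hypothetical Gaussian candidates for a made-up dataset: one with the mean fixed at 0 (only the variance is estimated) and one with both mean and variance estimated.

```python
import math

def gaussian_max_log_likelihood(xs, mean):
    """Log-likelihood of N(mean, var) with var set to its MLE for this mean."""
    n = len(xs)
    var_hat = sum((x - mean) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * var_hat) + 1.0)

def aic(max_log_likelihood, k):
    """AIC = 2k - 2 ln(L_hat)."""
    return 2.0 * k - 2.0 * max_log_likelihood

data = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4]  # hypothetical observations
mean_hat = sum(data) / len(data)

aic_zero_mean = aic(gaussian_max_log_likelihood(data, 0.0), k=1)
aic_free_mean = aic(gaussian_max_log_likelihood(data, mean_hat), k=2)
```

The model with a free mean pays for one extra parameter but fits far better, so it has the smaller AIC. With only eight observations this also falls under the small-sample caveat above.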

TODO ANOVA

One-way

Two-way

Bootstrapping

  • Bootstrap methods get their name because "using data to generate more data" seems analogous to a trick used by the fictional Baron Munchausen, who, when he found himself at the bottom of a lake, got out by pulling himself up by his bootstraps :)

Notation

  • probability_and_statistics_00d09b41a06132a6bf0aff21be0bcf0bf29eca55.png is a single homogeneous sample of data with PDF probability_and_statistics_f017a9b0f9e8a176a3db97e0116f98ff496ac318.png and CDF probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png
  • Statistic probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png whose value in the sample is probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png, which estimates probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png, an underlying characteristic of the population. Generally probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png is a symmetric function of probability_and_statistics_00d09b41a06132a6bf0aff21be0bcf0bf29eca55.png
  • EDF stands for the empirical distribution function, denoted probability_and_statistics_dbdfcfd9cf44c034e1b89d7a385b5fd3d99323e5.png
  • probability_and_statistics_567365f39e14bcfbefe43da0878e9ab072fa9f7d.png denotes the parameter of a parametric model with CDF and PDF probability_and_statistics_91a1584c8f20133915869576a6600f6a1df86a0c.png and probability_and_statistics_d22d0344d2bce38c1f46c8f1e2d3215f36c78188.png, respectively
  • probability_and_statistics_3537a967f4e6f815ebcfeefc36bea68be1af9600.png is the fitted model to data drawn from the PDF probability_and_statistics_d22d0344d2bce38c1f46c8f1e2d3215f36c78188.png
  • probability_and_statistics_23f0a92afed981597caa9c1c9d44eaebb393b8a9.png denotes the rv. distributed according to the fitted model, and we do the same for other moments (e.g. probability_and_statistics_6604285cb1cf8607f60f8142dd3ce641dbaabecc.png denotes the mean of the fitted model)
  • probability_and_statistics_d230f926a902b2185c158fb5856b87bb81ae8f0a.png denotes the random variable of the statistic of interest, in comparison to probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png which is the observed estimate, or rather

    probability_and_statistics_742d5822a938e9f1c48184aca98bc1efc562bba9.png

Stuff

  • Interested in probability distribution of probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png
  • probability_and_statistics_f3a9394a0a9c19fc6e4ea5174cac921227e1c944.png describes the fact that the population parameter probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png is equal to the value the statistic probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png takes on for the underlying CDF probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png
  • probability_and_statistics_924425b50c8eda2b500f136c71c8d12a19a712c2.png expresses the relationship between the estimate probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png and probability_and_statistics_dbdfcfd9cf44c034e1b89d7a385b5fd3d99323e5.png
    • To be properly rigorous, we would write probability_and_statistics_e94fe0b2969c1dbce11c6bf8d8164f99892229cd.png and require that probability_and_statistics_5f02543d66e290d6c3b7a9c2deeab535e546e825.png as probability_and_statistics_bf3fe99c079b6d787b66abdd8c02bfd9ac53b784.png, possibly even that probability_and_statistics_95d6918d2969bf1022358d46e54bb8490bddf875.png
    • We will mostly assume probability_and_statistics_11f633a1ed909e254c159c88e3b64b1dc869c56e.png
Moment estimates
  • Suppose we simulate a dataset probability_and_statistics_4a59937306c02a35b5c9a07dcd93db6e77524adf.png, i.e. from fitted model
  • Properties of probability_and_statistics_4bb7f9b779285a0591b14c952631448f33f88b72.png are then estimated from probability_and_statistics_c8a8ffdda8b2d3c9439013455c5b47550b7ec72c.png, using probability_and_statistics_76c425f1c1b4193275045ddb6cf6d9b72be71d87.png repetitions of the data simulation
  • Important to recognize that we are not estimating absolute properties of probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png, but rather of probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png relative to probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png
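
A minimal sketch of parametric simulation for moment estimates, assuming an exponential model fitted by its sample mean and taking the sample mean as the statistic; the data, the model choice, and R = 999 are hypothetical.

```python
import random
import statistics

def parametric_bootstrap(data, R=999, seed=0):
    """Simulate R datasets from the fitted exponential model and recompute T."""
    rng = random.Random(seed)
    n = len(data)
    t = statistics.fmean(data)  # observed statistic; also the fitted mean
    t_star = []
    for _ in range(R):
        sim = [rng.expovariate(1.0 / t) for _ in range(n)]  # draw from fitted model
        t_star.append(statistics.fmean(sim))
    bias = statistics.fmean(t_star) - t     # estimates E[T] - theta
    variance = statistics.variance(t_star)  # estimates var(T)
    return t, t_star, bias, variance

data = [0.3, 1.2, 2.7, 0.8, 4.1, 1.9, 0.5, 3.3, 2.2, 1.1, 0.9, 5.0]
t, t_star, bias, variance = parametric_bootstrap(data)
```

Note that the bias is measured relative to the observed value `t`, which plays the role of the parameter under the fitted model.
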
Distribution and quantile estimates
  • Approximating the distribution of probability_and_statistics_4bb7f9b779285a0591b14c952631448f33f88b72.png by that of probability_and_statistics_d09966e5a9ae1f3f126a613ac57e6bb6e2fc0cac.png
  • Cumulative probabilities are estimated by EDF of the simulated values probability_and_statistics_93b032a3a4fc4ad152b3d6d24943d76ad319445a.png, i.e. if

    probability_and_statistics_c1fccd8c000b917a4682f9cc2f348d95c9d668b5.png

    then the simulation estimate of probability_and_statistics_e4625b2d299f01c50c1e7e8b42847d3e16b6fbd7.png is

    probability_and_statistics_eb794bc8f90174283f05df0bc49726e8875abfd8.png

    And probability_and_statistics_57f2baccff8dc53896a0b1adfcbca2d17bc94d0f.png, the exact CDF of probability_and_statistics_d09966e5a9ae1f3f126a613ac57e6bb6e2fc0cac.png under sampling from the fitted model

  • Approximation probability_and_statistics_4b509b158e7b238b37320d30767ae26d7819a7d4.png to probability_and_statistics_8ad47b33d9db4dafb715f077ba205eeab9cecff8.png contains two sources of error:
    1. probability_and_statistics_e5997612986ff1ebf6e57033554893482cea32be.png to probability_and_statistics_8ad47b33d9db4dafb715f077ba205eeab9cecff8.png due to data variability
    2. probability_and_statistics_4b509b158e7b238b37320d30767ae26d7819a7d4.png to probability_and_statistics_e5997612986ff1ebf6e57033554893482cea32be.png due to finite simulation
  • Quantiles of distribution of probability_and_statistics_4bb7f9b779285a0591b14c952631448f33f88b72.png
    • Approximated using ordered values of probability_and_statistics_93b032a3a4fc4ad152b3d6d24943d76ad319445a.png
    • Suppose probability_and_statistics_7b95eaa9f1e3c2dc645097957ed0f1124690344a.png are independently distributed with CDF probability_and_statistics_ab9d9977871eda0af750024aad11a3b9b2f6c240.png and if probability_and_statistics_710dcbc7b521e78ac9bdd5cfc5e3a3bc848cedf6.png denotes the j-th ordered value, then

      probability_and_statistics_9292df1dcea8a37b5c523e487b7bd7f9027f913f.png

    • This implies a sensible estimate of probability_and_statistics_433b18b83d8780ffd8a7da2222dfbec4b54209c2.png is the probability_and_statistics_016310eeef4ceb7e1863013be21a5e385bb08f48.png, assuming that probability_and_statistics_9238f64609791eb5a2ce57798145931d14f2091d.png is an integer
      • Therefore we can estimate probability_and_statistics_b346b8eaf5913ed81b033dacc1d6568c142bbb4a.png quantile of probability_and_statistics_4bb7f9b779285a0591b14c952631448f33f88b72.png by the probability_and_statistics_0bb9d87cabc3598d7142258a0028c4f7c92bf0e1.png ordered value of probability_and_statistics_93b032a3a4fc4ad152b3d6d24943d76ad319445a.png, i.e. probability_and_statistics_bac2b8d82530d27190c9ea5cdbb506afba9f6921.png
      • We're assuming probability_and_statistics_76c425f1c1b4193275045ddb6cf6d9b72be71d87.png is chosen so that probability_and_statistics_ba97d1a11330dd34b4681949814db48906d65214.png is an integer
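
The ordered-value rule can be sketched as follows. The function name `bootstrap_quantile` is hypothetical, and the code returns a quantile of the simulated values themselves; subtracting the observed estimate afterwards gives the corresponding quantile of the difference.

```python
def bootstrap_quantile(t_star, p):
    """Estimate the p quantile by the (R + 1) * p-th ordered value (1-indexed)."""
    R = len(t_star)
    position = (R + 1) * p
    j = round(position)
    if abs(position - j) > 1e-9 or not 1 <= j <= R:
        raise ValueError("choose R so that (R + 1) * p is an integer in 1..R")
    return sorted(t_star)[j - 1]

# With R = 99 simulated values, (R + 1) * 0.05 = 5 and (R + 1) * 0.95 = 95.
simulated = [float(v) for v in range(99, 0, -1)]  # stand-in for simulated statistics
q05 = bootstrap_quantile(simulated, 0.05)
q95 = bootstrap_quantile(simulated, 0.95)
```

Choosing R = 99 or R = 999 makes (R + 1)p an integer for the usual values of p.
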
Nonparametric simulation
  • Same as in the previous section, but we simulate from the EDF instead of from a parametric model fitted to the data (e.g. using the MLE of probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png); this is called the nonparametric bootstrap
Simple confidence intervals

If we use bootstrap estimates of quantiles for probability_and_statistics_4bb7f9b779285a0591b14c952631448f33f88b72.png, then an equi-tailed probability_and_statistics_cd449cf4ceaca64177629e528a63dde566c2406e.png confidence interval will have limits

probability_and_statistics_8b36dce744a958534327abadaa03ec03d111f39a.png

where we explicitly write the second term in the parenthesis in the limits to emphasize that we're looking at probability_and_statistics_93b032a3a4fc4ad152b3d6d24943d76ad319445a.png with an expectation probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png.

This is based on the probability implication

probability_and_statistics_5fce3ed1105008c750caec18b16c095e124692a9.png
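
A minimal sketch of such an interval, assuming nonparametric resampling, the sample mean as the statistic, and limits of the form t - (t*_((R+1)(1-α)) - t) and t - (t*_((R+1)α) - t); the data and the defaults α = 0.025, R = 999 are hypothetical.

```python
import random
import statistics

def basic_bootstrap_ci(data, stat=statistics.fmean, alpha=0.025, R=999, seed=0):
    """Equi-tailed (1 - 2*alpha) interval from ordered bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    t = stat(data)
    # Resample with replacement from the data, i.e. simulate from the EDF.
    t_star = sorted(stat(rng.choices(data, k=n)) for _ in range(R))
    lo_idx = round((R + 1) * alpha)        # 25 for the defaults
    hi_idx = round((R + 1) * (1 - alpha))  # 975 for the defaults
    lower = t - (t_star[hi_idx - 1] - t)
    upper = t - (t_star[lo_idx - 1] - t)
    return lower, upper

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5, 4.4, 3.1]
lower, upper = basic_bootstrap_ci(data)
```

The upper ordered value enters the lower limit and vice versa, because the quantiles are those of the difference between the replicate and the observed estimate.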

Reducing error

  • Problem: choose quantity probability_and_statistics_5ba635517a032f9d22fc9ebcb8e62bcf86591563.png such that probability_and_statistics_2fb79be669b0dd19314e865fa5fac86d032cd33d.png is as nearly pivotal as possible
    • That is, it has (at least approx.) the same distribution under sampling from both probability_and_statistics_dbdfcfd9cf44c034e1b89d7a385b5fd3d99323e5.png and probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png
  • Let probability_and_statistics_7f6ca4e15d5a89c198f5dc98aff3a418343ee714.png with probability_and_statistics_44b55e687eab851723280bdee5cdc4f5bf8db8cf.png increasing in probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png and if probability_and_statistics_23f467d5c702f970721463ba5c7c13d92d8fb79e.png is an approx. lower probability_and_statistics_c909ebbfbb1d4a8b0b83930f817f288a82aeb304.png quantile of probability_and_statistics_bb0c74182c8792efce2d410371bc6f0454246fa1.png, then

    probability_and_statistics_10a5f3b18983305021926747f5b84435744b2762.png

    where probability_and_statistics_fc714806048aa00f9de580dec804b914cb92e9be.png is the inverse transformation

  • Thus, probability_and_statistics_938c521eb1dec77dc7afe221dbb14e27ae79bdd3.png is an upper probability_and_statistics_35a630858d2a37f7502a38e04d162ddb8d7c717e.png confidence limit for probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png

So we're basically saying "Let's bound the difference between probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png of the true probability_and_statistics_eb13eedc64828a894bed17ed8b010e5fea060eff.png and probability_and_statistics_c4207654f80c68e96d536088ad629a9dfa3c6927.png of our estimator probability_and_statistics_64785f2a493a7e6a25684a0c4b19905e60fbc20e.png by probability_and_statistics_23f467d5c702f970721463ba5c7c13d92d8fb79e.png, with a certainty / probability of probability_and_statistics_cd0973f70410eb5512f93fd1adb1d4f9a448e3e9.png"

OR rather, "let's bound the probability of the difference between probability_and_statistics_96a8a957c2ff9cee09ca87631ca187b05086b261.png and probability_and_statistics_e156fdeb0f046f2d6deb2f92459186e63fe6abbd.png being probability_and_statistics_cd0973f70410eb5512f93fd1adb1d4f9a448e3e9.png"

OR "we want some constant probability_and_statistics_cb254ffe3e498884a0e8a2679f6c3a003f7219a5.png such that the probability of probability_and_statistics_e156fdeb0f046f2d6deb2f92459186e63fe6abbd.png and probability_and_statistics_96a8a957c2ff9cee09ca87631ca187b05086b261.png differ by more than probability_and_statistics_cb254ffe3e498884a0e8a2679f6c3a003f7219a5.png to be bounded by probability_and_statistics_cd0973f70410eb5512f93fd1adb1d4f9a448e3e9.png", and we

Theoretical basis for bootstrap

Suppose we have a random sample probability_and_statistics_00d09b41a06132a6bf0aff21be0bcf0bf29eca55.png, or equiv., its EDF probability_and_statistics_dbdfcfd9cf44c034e1b89d7a385b5fd3d99323e5.png.

We want to estimate some quantity probability_and_statistics_62c24f59319860124d7df34297efdd56f9d23a3f.png, e.g.

probability_and_statistics_c7a38a0c32c74f453e5e5b9c9ddd67ba0893e642.png

and want to estimate the distribution function

probability_and_statistics_23bdbc854297c24f49e3a55a60924f7ae505fd52.png

where the conditioning on probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png indicates that probability_and_statistics_2f8c22c3d5dfa62a0d40f92a13125f1231a66e91.png is a random sample from probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png.

The bootstrap estimate of probability_and_statistics_16e10dc2ac2b49cf4bc8088a1ecb8e0f39eba9d5.png is then

probability_and_statistics_67bb94a0e26596f161dba88fe941962d50aec96c.png

where in this case

probability_and_statistics_95d3bc2a23a510b7b12f71af6383fd56b6daa400.png

In order for

probability_and_statistics_c9e0a9d26515fc8983e1918d7c14556c3cf33e16.png

we need the following conditions to hold (letting probability_and_statistics_f8a2b400bf9def7b5b67035dfb5f06e148ab7065.png be a neighborhood of probability_and_statistics_d5c60a4ae2b31ef1ab0bdd9fc92339f09358331b.png in a space of suitable distributions):

  1. For any probability_and_statistics_00c91349d74e784c002a6da2a3347b82282a8d65.png, probability_and_statistics_681f64ac069961503eba6350cd27b98d4e7c60c5.png must converge weakly to a limit probability_and_statistics_63414d60eba56b92103aa25a887f46434080bac4.png
  2. This convergence must be uniform on probability_and_statistics_f8a2b400bf9def7b5b67035dfb5f06e148ab7065.png
  3. The function mapping probability_and_statistics_f75be59687ef2da76b04e88af02585ab68efc56c.png to probability_and_statistics_ecb07aac38f991ba851e8e52f0e71e4ab16c92e7.png must be continuous

where converge weakly means

probability_and_statistics_7c38729a144d9d73d26d684004e10312ab6e0ab5.png

for all integrable functions probability_and_statistics_22ed05460aa9d3dc9112324da49c6ca98f3e8c9e.png.

Under these conditions, the bootstrap estimate is consistent.

Resampling for testing

Nonparametric Permutation tests
  • Permutation test is a two-sample test
  • Have samples probability_and_statistics_36a19bdc55cb0298730cb1028e7608ba3cefeea9.png and probability_and_statistics_4ce7886c8b212c0282d441a75027669fa0167be6.png for two distributions.
  • Want to check if probability_and_statistics_a80db9da2b01d35139abc7f677fb51f10aee8675.png is true
  • Consider some statistic which measures discrepancy between the two sample-sets, e.g. mean difference

    probability_and_statistics_02720daf57070537d7cec345cbc02fdb54455ac3.png

  • Under probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png, for every sample probability_and_statistics_f6a80e579378a458ed59240345cd9fc828c28786.png, whether or not the sample came from distribution probability_and_statistics_b346b8eaf5913ed81b033dacc1d6568c142bbb4a.png or probability_and_statistics_9974194a836f8c06efda1c78a55ab184c6c2d746.png should be equally likely, since under probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png we have probability_and_statistics_3e686986fa9cc2f56eb9cc431fd34a037a04fae1.png
  • Therefore we consider permutations between the samples from the distributions!
    • Consider probability_and_statistics_a69ebb03d9bcb7ba94c132de752d8496774a3d48.png tuple

      probability_and_statistics_9cf7350d37b5cecf332da0102839d32d47894958.png

    • Permute the tuple

      probability_and_statistics_ab141c48c4393cbdd6f8d3a1fc0a2d48338184cd.png

      where probability_and_statistics_8a303c22b5ff8640fb384bd5649f789459e04e8b.png is a permutation on probability_and_statistics_a69ebb03d9bcb7ba94c132de752d8496774a3d48.png symbols.

    • Let probability_and_statistics_e7170aa8452f4fe4d4440f35eb0475c3ff042011.png and probability_and_statistics_35b7ee2510e0c4f981c3b864ea9512e0b6cf8ab3.png
    • Compute probability_and_statistics_d230f926a902b2185c158fb5856b87bb81ae8f0a.png assuming probability_and_statistics_5036811fd4364a0798b563f989d8e6ee3c22d83b.png to come from probability_and_statistics_b346b8eaf5913ed81b033dacc1d6568c142bbb4a.png and probability_and_statistics_338d9f203b7188d6398690c9280adf4652d108c6.png to come from probability_and_statistics_9974194a836f8c06efda1c78a55ab184c6c2d746.png.
  • Gives us achieved significance level (ASL), also known as the p-value, by considering

    probability_and_statistics_d7295286c3c3c78a84d951ef24d524317907c0d3.png

    i.e. the probability of getting a value at least as large as probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png when probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png is true

    • Observe that probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png is a "discrepancy" measurement between the samples, e.g. the difference, and so "large values = BAD"

Practically, there can be a lot of permutations to compute; probability_and_statistics_0c560cab312a2b4a279c936b4a0b373c1bbaa9a1.png possibilities, in fact. Therefore we estimate the probability_and_statistics_2e1a1699b694378a2af0f1081d7dbc2394c0731b.png by "sampling" permutations uniformly:

probability_and_statistics_91cce131327a1f2d46159bdf0772cf1de01867b8.png

where probability_and_statistics_69fa4106b8606c4db436c6bfa4a630596898ca5f.png corresponds to the estimate of probability_and_statistics_43fda354132c99645b4d5dfafbf52bdd3f824196.png using the m-th sampled permuted dataset, and probability_and_statistics_4041476b998bc5a6ca4d7b8f4880e883a5a433a8.png is the number of permutations considered.

Note: permutation test is an exact test when all permutations are computed.
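
The sampled version of the test can be sketched as below. It assumes the absolute difference in sample means as the discrepancy statistic and M = 2000 sampled permutations; both choices, and the example data, are illustrative.

```python
import random
import statistics

def permutation_test(xs, ys, M=2000, seed=0):
    """Estimate the ASL by sampling M random permutations of the pooled data."""
    rng = random.Random(seed)
    n = len(xs)
    t_obs = abs(statistics.fmean(xs) - statistics.fmean(ys))
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(M):
        rng.shuffle(pooled)  # uniformly random permutation of the pooled tuple
        # Treat the first n values as the first sample and the rest as the second.
        t_perm = abs(statistics.fmean(pooled[:n]) - statistics.fmean(pooled[n:]))
        if t_perm >= t_obs:
            exceed += 1
    return exceed / M

asl_separated = permutation_test([1, 2, 3, 4, 5, 6], [11, 12, 13, 14, 15, 16])
asl_identical = permutation_test([1, 2, 3], [1, 2, 3])
```

Clearly separated samples give a small estimated ASL, while identical samples give an ASL of 1 because every permuted statistic is at least as large as the observed value of zero.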

Frequentist bootstrapping

  • Data approximates unknown distribution
  • Sampling distribution of a statistic can be approximated by repeatedly resampling the observations with replacement and computing the statistic for each sample

Let

  • probability_and_statistics_a1d4609bb6dac6629a8cb57f000a53016b439f90.png denote the original samples
  • probability_and_statistics_1a6ee079e30c78e564d70d442f30de78b12443e2.png denote the bootstrap sample
    • Likely have some observations repeated one or more times, while some are absent

Equivalently, we can instead make the following observation:

  • Each original observation probability_and_statistics_87cd40eef3fe85e8c05a075dc177daf4eedc6722.png occurs anywhere from probability_and_statistics_07b269ab2ac24e4eeaaa1cb31593283d236cad7d.png to probability_and_statistics_5403f6bbface4889f05450c96efe5dfafd041d71.png times
  • Let probability_and_statistics_ecd189fc17218cf997b3467ac63c20732d48f46f.png denote # of times probability_and_statistics_87cd40eef3fe85e8c05a075dc177daf4eedc6722.png occurs in probability_and_statistics_489ed6d45a1bc42e6367649d0368be4a4c7c8995.png and

    probability_and_statistics_7af6c40241e605ffa4f5d695c3d5751c50ca73ea.png

    thus

    probability_and_statistics_120581a3d5d65c202ab1b865c540a43bb1936129.png

  • Let probability_and_statistics_b0352089e838b86b619203562a01ca2adf839e5a.png such that

    probability_and_statistics_f7f5ba3283e3fa662d331001b1155b7fab7f421c.png

    then

    probability_and_statistics_a90bb96e56e560fa2d9868dd82196b202bc4621b.png

  • Hence we can compute the bootstrap sample probability_and_statistics_489ed6d45a1bc42e6367649d0368be4a4c7c8995.png by instead drawing probability_and_statistics_10514e65fd51ead410a0ed41b7ca8012b501fb60.png this way, and weighting the original samples by these drawn weights!

Recall that

probability_and_statistics_6fd53c3e4cdeaff7a138f3921d82cd5134be3b32.png

has the property that

probability_and_statistics_cc7be5285bbc10a81baa63b71663ec525d52033b.png

Hence,

probability_and_statistics_99fd15daee78a2476c135adb4a4ff019ecedd18e.png
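
A minimal sketch of the weighting view, assuming the counts are drawn from a multinomial distribution with n trials and equal cell probabilities 1/n, and using the mean as the statistic; the function names and data are hypothetical.

```python
import random

def multinomial_counts(n, rng):
    """Counts of how often each of n observations is picked in n equal-probability draws."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return counts

def weighted_bootstrap_mean(data, rng):
    """Bootstrap replicate of the mean, computed from weights instead of a resample."""
    n = len(data)
    counts = multinomial_counts(n, rng)
    weights = [c / n for c in counts]
    return sum(w * x for w, x in zip(weights, data)), counts

rng = random.Random(0)
data = [2.0, 4.0, 6.0, 8.0, 10.0]
replicate, counts = weighted_bootstrap_mean(data, rng)

# The same replicate obtained by materializing the resample explicitly.
resample = [x for x, c in zip(data, counts) for _ in range(c)]
explicit_mean = sum(resample) / len(resample)
```

The weighted mean and the mean of the explicit resample agree, so drawing the counts is enough and the resampled dataset never needs to be built.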

TODO Bayesian bootstrapping

Problems / Examples

Multi-parameter MLE

Normal samples

Suppose we have the samples probability_and_statistics_4a6648704a4fca6d2ecb60b88bdf1e9a5bc11774.png following the model

probability_and_statistics_e16aa263aa6e54d3fbf30c5550f5e85ee4c7de4e.png

Then we want to test probability_and_statistics_80c72ca5b37d0e04d425853da83db7deffcf760e.png against probability_and_statistics_c45876766998a098e334455ccbd025d53fefeb7e.png.

The likelihood is given by

probability_and_statistics_a338c5c02555f1e1e6b8c6fad1994ed57d8e0e2d.png

Under our restricted parameter space probability_and_statistics_0771ce7b70bacdbe91b857a46becb2dd6d7e0e92.png (probability_and_statistics_80c72ca5b37d0e04d425853da83db7deffcf760e.png), the MLE is

probability_and_statistics_78d82616c1f912b5ad4add1551867d122704ee35.png

Under probability_and_statistics_fe30c01743657dbedfd14d304b72a97b701e4752.png (probability_and_statistics_e71e16a8d024f38f88b719a205f37ce5b85c758e.png and probability_and_statistics_356e0d9e7374e9995c8e63ce904f486552077872.png can take on any value):

probability_and_statistics_b56db7861f8bb6a040807447ef50e9f1d7e75ae8.png

Thus, the likelihood ratio is

probability_and_statistics_7b6ccdfd941ffcafbc467ef4fecb56778cf18dac.png

But

probability_and_statistics_159df65bcc6bc052b62d779e622f2f5334852d10.png

and thus

probability_and_statistics_f06055624334748f6e1f3026becbe506b35ceb07.png

where

probability_and_statistics_718278fb9ca57678f20ef0f0e47c4f0718190ad0.png

Rejecting for small values of LR is equivalent to rejecting for large values of probability_and_statistics_d8a7e27db3f2696c02a4063584b656fbec5d3e35.png, i.e. this test is equivalent to a two-tailed t-test.

Here we can determine an exact distribution for LR, or rather for the equivalent test statistic probability_and_statistics_7858a550c696df9ba92761068d9d9803e3bb69e4.png, i.e.

probability_and_statistics_8b0ee3497d2a6e2df78c928d791278bab47b7b25.png

However, in general, the distribution of LR (under probability_and_statistics_16192e248f1557f7044d75899be5ca42e34f2859.png) is difficult to determine and we need to use the approximate probability_and_statistics_a7d0e1a2db3c47f81bea40f74592a777f901e8e5.png method.
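As a numerical check of the equivalence with the t-test, the sketch below computes both quantities for simulated data. It assumes the hypothesis being tested is that the mean equals a fixed value `mu0` with the variance unknown, and it uses the standard identity LR = (1 + t²/(n−1))^(−n/2) for that setting. The function name `lr_and_t` and the simulated data are illustrative only.

```python
import math
import random


def lr_and_t(xs, mu0):
    """Likelihood ratio and t statistic for H0: mu = mu0, variance unknown."""
    n = len(xs)
    xbar = sum(xs) / n
    # MLE of the variance with the mean fixed at mu0 (restricted space).
    s2_restricted = sum((x - mu0) ** 2 for x in xs) / n
    # MLE of the variance with the mean free (full space).
    s2_full = sum((x - xbar) ** 2 for x in xs) / n
    # The exponential terms cancel at the MLEs, leaving a variance ratio.
    lr = (s2_full / s2_restricted) ** (n / 2)
    # Usual t statistic with the unbiased variance estimate.
    s2_unbiased = n * s2_full / (n - 1)
    t = (xbar - mu0) / math.sqrt(s2_unbiased / n)
    return lr, t


random.seed(0)
xs = [random.gauss(0.3, 1.0) for _ in range(25)]
lr, t = lr_and_t(xs, mu0=0.0)

# LR is a decreasing function of t**2, so small LR <=> large |t|.
lr_from_t = (1 + t ** 2 / (len(xs) - 1)) ** (-len(xs) / 2)
```

Because LR depends on the data only through t², any rejection region of the form "LR is small" is a two-sided region in t.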

Fisher's method of scoring

Multiparameter case

Let probability_and_statistics_00d09b41a06132a6bf0aff21be0bcf0bf29eca55.png be i.i.d. with

probability_and_statistics_dbb76bd71b1dd14fdc725aa493b19a9b2cde33f3.png

with p.d.f.

probability_and_statistics_5203e5484a3d00727386b315e2e258831c4a646a.png

that is probability_and_statistics_726bdfde89caa9c467d172436387306a65eaa767.png.

  • Log-likelihood:

    probability_and_statistics_4e7682d96c0862513f455518e86ba811dc64d8a6.png

  • Score vector:

    probability_and_statistics_bee0633d0e6bb4966484023a2d88be872633c979.png

  • Hessian matrix:

    probability_and_statistics_61bb7efe0040f156a668e35d80e161cc92f21ed4.png

  • Observed information matrix:

    probability_and_statistics_656cd223fbdc1835be16c4cdbf8ab22993acbab1.png

  • Expected information matrix:

    probability_and_statistics_f29e012757118db6d9320a024ac83df7081f935e.png

  • Asymptotic variance-covariance matrix of probability_and_statistics_578cbd7bdb0f3b6d8fd7e4ab19c1931204860fa7.png is:

    probability_and_statistics_71b9508bdb102ded333f5372bc976ab4081ec540.png

  • Estimated standard errors:

    probability_and_statistics_726c186c01eac3dad7e6699304d1b7d5bc3f21b5.png

  • Covariance between our estimates:

    probability_and_statistics_07e806d8f5e5db94109368d482de04a84e095ecf.png

  • Asymptotic distribution:

    probability_and_statistics_9a0658011a17ce9d249c8e2ac426c7759d742a63.png

    This is the distribution on which we would base a Wald test of probability_and_statistics_897eb830d6649a86843925979572d8fd5f5ee17a.png, i.e.

    probability_and_statistics_3a585b1cf32da610c29fc24bdc82e540422126aa.png
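The steps listed above (score vector, expected information, standard errors, Wald statistic) can be sketched end to end. The model's p.d.f. is not reproduced here, so the sketch assumes a normal model with parameters (mu, sigma²) purely for illustration; the formulas for the score and the expected information below are those of the normal model, not necessarily of the model in this example. The function names are hypothetical.

```python
import math
import random


def score(xs, mu, s2):
    """Score vector (dl/dmu, dl/ds2) for an assumed N(mu, s2) model."""
    n = len(xs)
    resid = [x - mu for x in xs]
    u_mu = sum(resid) / s2
    u_s2 = -n / (2 * s2) + sum(r * r for r in resid) / (2 * s2 ** 2)
    return u_mu, u_s2


def expected_info(n, s2):
    """Diagonal of the expected information; off-diagonals are zero here."""
    return n / s2, n / (2 * s2 ** 2)


def fisher_scoring(xs, mu, s2, tol=1e-10, max_iter=50):
    """Iterate theta <- theta + I(theta)^{-1} U(theta) until steps are tiny."""
    n = len(xs)
    for _ in range(max_iter):
        u_mu, u_s2 = score(xs, mu, s2)
        i_mu, i_s2 = expected_info(n, s2)
        step_mu, step_s2 = u_mu / i_mu, u_s2 / i_s2
        mu, s2 = mu + step_mu, s2 + step_s2
        if abs(step_mu) + abs(step_s2) < tol:
            break
    return mu, s2


random.seed(1)
xs = [random.gauss(2.0, 1.5) for _ in range(40)]
mu_hat, s2_hat = fisher_scoring(xs, mu=0.0, s2=1.0)

# Estimated standard errors from the inverse expected information.
i_mu, i_s2 = expected_info(len(xs), s2_hat)
se_mu, se_s2 = math.sqrt(1 / i_mu), math.sqrt(1 / i_s2)

# Wald statistic for a hypothetical null value mu0 = 0.
wald_z = (mu_hat - 0.0) / se_mu
```

For this assumed model the information matrix is diagonal, so the two estimates are asymptotically uncorrelated and the matrix inverse reduces to elementwise division. A model with a non-diagonal information matrix would need a full 2×2 solve in the update step.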

Bibliography