# Probability & Statistics

## Notation

• denotes the distribution of some random variable
• means convergence in probability for random elements , refers to

• means convergence in distribution for random elements , i.e. in distribution if for all continuity points of .
• means almost sure convergence for random elements

## Theorems

### Chebyshev Inequality

Let $X$ (integrable) be a random variable with finite expected value $\mu$ and finite non-zero variance $\sigma^2$.

Then for any real number $k > 0$,

$$P\big(|X - \mu| \ge k\sigma\big) \le \frac{1}{k^2}$$

Let $(X, \Sigma, \mu)$ be a measure space, and let $f$ be an extended real-valued measurable function defined on $X$.

Then for any real number $t > 0$ and $0 < p < \infty$,

$$\mu\big(\{x \in X : |f(x)| \ge t\}\big) \le \frac{1}{t^p} \int_X |f|^p \, d\mu$$

Or more generally, if $g$ is an extended real-valued measurable function, nonnegative and nondecreasing on the range of $f$, then

$$\mu\big(\{x \in X : f(x) \ge t\}\big) \le \frac{1}{g(t)} \int_X g \circ f \, d\mu$$

The probabilistic statement is recovered as the special case where the measure space is the probability space, $f = X - \mu$, $p = 2$ and $t = k\sigma$.
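A quick numerical sanity check (a minimal sketch; the Exponential(1) distribution and the sample size are arbitrary choices for illustration):

```python
import numpy as np

# Monte Carlo sanity check of Chebyshev's inequality for an Exponential(1) rv,
# which has mean 1 and variance 1 (any distribution with finite variance works).
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)
mu, sigma = 1.0, 1.0

for k in [1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(x - mu) >= k * sigma)   # P(|X - mu| >= k*sigma)
    bound = 1.0 / k**2                                  # Chebyshev bound
    print(f"k={k}: empirical={empirical:.4f} <= bound={bound:.4f}")
```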

### Markov Inequality

If $X$ is a nonnegative random variable and $a > 0$, then

$$P(X \ge a) \le \frac{\operatorname{E}[X]}{a}$$

First note that

$$X \ge a \, \mathbf{1}_{\{X \ge a\}}$$

and

$$\operatorname{E}\big[a \, \mathbf{1}_{\{X \ge a\}}\big] = a \, P(X \ge a)$$

Hence,

$$\operatorname{E}[X] \ge \operatorname{E}\big[a \, \mathbf{1}_{\{X \ge a\}}\big] = a \, P(X \ge a)$$

### Law of large numbers

There are mainly two laws of large numbers:

The weak law states that sample averages converge in probability to the mean, i.e.

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \overset{P}{\to} \mu \quad \text{as } n \to \infty$$

or equivalently, for every $\varepsilon > 0$,

$$\lim_{n \to \infty} P\big( |\bar{X}_n - \mu| > \varepsilon \big) = 0$$

The strong law strengthens this to almost sure convergence, $\bar{X}_n \overset{a.s.}{\to} \mu$.

Often the assumption of finite variance is made, but this is in fact not necessary (it just makes the proof easier). Large or infinite variance will make the convergence slower, but the Law of Large Numbers holds anyway.

### Central Limit Theorem

Suppose $X_1, X_2, \ldots$ is a sequence of i.i.d. random variables with $\operatorname{E}[X_i] = \mu$ and $\operatorname{Var}(X_i) = \sigma^2 < \infty$. Then,

$$\sqrt{n} \big( \bar{X}_n - \mu \big) \overset{d}{\to} \mathcal{N}(0, \sigma^2)$$

where $\overset{d}{\to}$ means convergence in distribution, i.e. that the cumulative distribution functions of $\sqrt{n}(\bar{X}_n - \mu)$ converge pointwise to the cdf of the $\mathcal{N}(0, \sigma^2)$ distribution: for any real number $z$,

$$\lim_{n \to \infty} P\big( \sqrt{n}(\bar{X}_n - \mu) \le z \big) = \Phi\!\left( \frac{z}{\sigma} \right)$$
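A small simulation sketch of this convergence, assuming Uniform(0, 1) summands purely for illustration:

```python
import numpy as np
from scipy import stats

# Illustration of the CLT: standardized means of i.i.d. Uniform(0, 1) samples
# (mu = 1/2, sigma^2 = 1/12) are compared against a standard normal via a KS test.
rng = np.random.default_rng(0)
n, reps = 200, 5000
mu, sigma = 0.5, np.sqrt(1 / 12)

samples = rng.uniform(0.0, 1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma   # should be approx. N(0, 1)

print(stats.kstest(z, "norm"))   # small KS statistic => close to standard normal
```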

### Jensen's Inequality

If

• $\varphi$ is a convex function,
• $X$ is a rv.,

then $\varphi\big(\operatorname{E}[X]\big) \le \operatorname{E}\big[\varphi(X)\big]$.

Further, if $\operatorname{E}[\varphi(X)] = \varphi(\operatorname{E}[X])$ ($\varphi$ strictly convex), then $X = \operatorname{E}[X]$ "with probability 1".

If instead

• $\varphi$ is a concave function

then $\varphi\big(\operatorname{E}[X]\big) \ge \operatorname{E}\big[\varphi(X)\big]$. This is the one we need for the derivation of the EM-algorithm (where $\varphi = \log$).

To get a visual intuition, see the illustration on the Wikipedia page for Jensen's inequality.
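A tiny numerical illustration, using the convex function $\varphi(x) = e^x$ and a standard normal $X$ (both arbitrary choices):

```python
import numpy as np

# Numerical illustration of Jensen's inequality with the convex function
# phi(x) = exp(x): phi(E[X]) <= E[phi(X)].
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

print(np.exp(x.mean()))      # phi(E[X])  ~ exp(0) = 1
print(np.exp(x).mean())      # E[phi(X)]  ~ exp(1/2) ~ 1.65 (lognormal mean)
```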

### Continuous Mapping Theorem

Let $X_n$, $X$ be random elements defined on a metric space $S$.

Suppose a function $g : S \to S'$ (where $S'$ is another metric space) has the set of discontinuity points $D_g$ such that $P(X \in D_g) = 0$.

Then,

$$X_n \overset{d}{\to} X \;\Rightarrow\; g(X_n) \overset{d}{\to} g(X), \qquad X_n \overset{P}{\to} X \;\Rightarrow\; g(X_n) \overset{P}{\to} g(X), \qquad X_n \overset{a.s.}{\to} X \;\Rightarrow\; g(X_n) \overset{a.s.}{\to} g(X)$$

We start with convergence in distribution, for which we will need a particular statement from the portmanteau lemma.

### Slutsky's Lemma

Let $X_n$, $Y_n$ be sequences of scalar / vector / matrix random elements.

If $X_n \overset{d}{\to} X$ and $Y_n \overset{P}{\to} c$, where $c$ is a constant, then

$$X_n + Y_n \overset{d}{\to} X + c, \qquad Y_n X_n \overset{d}{\to} c X, \qquad Y_n^{-1} X_n \overset{d}{\to} c^{-1} X$$

This follows from the fact that if

$$X_n \overset{d}{\to} X \quad \text{and} \quad Y_n \overset{P}{\to} c$$

then the joint vector $(X_n, Y_n)$ converges in distribution to $(X, c)$, i.e.

$$(X_n, Y_n) \overset{d}{\to} (X, c)$$

Next we simply apply the Continuous Mapping Theorem, recognizing the functions

$$(x, y) \mapsto x + y, \qquad (x, y) \mapsto y x, \qquad (x, y) \mapsto y^{-1} x$$

as continuous (for the last mapping to be continuous, $c$ has to be invertible).

## Definitions

### Probability space

A probability space is a measure space $(\Omega, \mathcal{F}, P)$ such that the measure of the whole space is equal to one, i.e. $P(\Omega) = 1$, where

• $\Omega$ is the sample space (some arbitrary non-empty set)
• $\mathcal{F}$ is the σ-algebra over $\Omega$, which is the set of possible events
• $P$ is the probability measure, i.e. a measure such that $P(\Omega) = 1$

### Random measure

Let $E$ be a separable metric space (e.g. $\mathbb{R}^d$) and let $\mathcal{E}$ be its Borel σ-algebra.

$\zeta$ is a transition kernel from a probability space $(\Omega, \mathcal{A}, P)$ to $(E, \mathcal{E})$ if $\omega \mapsto \zeta(\omega, B)$ is measurable for every $B \in \mathcal{E}$ and $B \mapsto \zeta(\omega, B)$ is a measure on $(E, \mathcal{E})$ for every $\omega \in \Omega$.

We say a transition kernel is locally finite, if the measures $B \mapsto \zeta(\omega, B)$

satisfy $\zeta(\omega, \tilde{B}) < \infty$ for all bounded measurable sets $\tilde{B} \in \mathcal{E}$ and for all $\omega \in \Omega$ except for some zero-measure set (under $P$).

Let $E$ be a separable metric space (e.g. $\mathbb{R}^d$) and let $\mathcal{E}$ be its Borel σ-algebra.

A random measure $\zeta$ is an (a.s.) locally finite transition kernel from an (abstract) probability space $(\Omega, \mathcal{A}, P)$ to $(E, \mathcal{E})$.

Let $\zeta$ be a random measure on the measurable space $(E, \mathcal{E})$ and denote the expected value of a random element $Y$ by $\operatorname{E}[Y]$.

The intensity measure $\operatorname{E}\zeta$ is defined by

$$\operatorname{E}\zeta(B) = \operatorname{E}[\zeta(B)] \quad \text{for all } B \in \mathcal{E}$$

So it's a non-random measure which sends an element $B$ of the σ-algebra to the expected value of $\zeta(B)$, which makes sense since $\omega \mapsto \zeta(\omega, B)$ is a measurable function, i.e. a random variable.

#### Poisson process

A Poisson process is a generalized notion of a Poisson random variable.

A point process $N$ is a (general) Poisson process with intensity $\Lambda$ if it has the two following properties:

1. The number of points in a bounded Borel set $B$ is a Poisson random variable with mean $\Lambda(B)$.
• In other words, denote the total number of points located in $B$ by $N(B)$; then the probability of the random variable $N(B)$ being equal to $n$ is given by

$$P\big(N(B) = n\big) = \frac{\Lambda(B)^n}{n!} e^{-\Lambda(B)}$$

2. The numbers of points in disjoint Borel sets form independent random variables.

The Radon measure $\Lambda$ maintains its previous interpretation of being the expected number of points located in the bounded region $B$, namely

$$\Lambda(B) = \operatorname{E}\big[N(B)\big]$$
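A minimal simulation sketch, assuming a homogeneous Poisson process with constant intensity on the unit square (so the intensity measure is just the constant times Lebesgue measure):

```python
import numpy as np

# Sketch: simulate a homogeneous Poisson process with constant intensity lam
# on the unit square B = [0, 1]^2, so Lambda(B) = lam * area(B) = lam.
rng = np.random.default_rng(0)
lam = 50.0

n_points = rng.poisson(lam)                          # property 1: N(B) ~ Poisson(Lambda(B))
points = rng.uniform(0.0, 1.0, size=(n_points, 2))   # given N(B), the points are uniform in B

print(n_points, points[:3])
```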

#### Cox process

Let be a random measure.

A random measure $\xi$ is called a Cox process directed by $\zeta$, if $\mathcal{L}(\xi \mid \zeta)$ is a Poisson process with intensity measure $\zeta$.

Here $\mathcal{L}(\xi \mid \zeta)$ is the conditional distribution of $\xi$, given $\zeta$.

### Degeneracy of probability distributions

A degenerate probability distribution is a probability distribution with support only on a lower-dimensional space.

E.g. if the degenerate distribution is univariate (involving a single random variable), then it's a deterministic distribution, only taking on a single value.

### Characteristic function

Let $X$ be a random variable with density $f$ and cumulative distribution function $F$.

Then the characteristic function is defined as the Fourier transform of $f$ (with the sign convention used in probability):

$$\varphi_X(t) = \operatorname{E}\big[e^{itX}\big] = \int_{\mathbb{R}} e^{itx} f(x) \, dx$$

## Examples

### 10 dice: you want at least one 4, one 3, one 2, one 1

Consider the number of misses before a first "hit". When we say "hit", we refer to throwing a value we still need (initially any of $\{1, 2, 3, 4\}$).

Then we can do as follows:

• $r_1 - 1$ is the # of misses before our first "hit"
• $r_2 - 1$ is the # of misses between our first "hit" and second "hit", and so on

We then observe the following:

• $r_1 \le 7$, since if we have more than 6 misses before our first hit, we'll never be able to get all the events in our target set with the remaining dice

We also observe that there are 2, 3, 4 and 5 ways of missing for each of the respective target events (after each hit, one more face counts as a miss).

```python
from sympy import Sum, symbols, factorial

# r_i = number of throws in "segment" i, i.e. (r_i - 1) misses followed by the i-th hit.
r_1, r_2, r_3, r_4 = symbols("r_1 r_2 r_3 r_4", nonnegative=True, integer=True)

# Numerators 2, 3, 4, 5 are the number of ways to miss in each segment; the factor
# factorial(4) = 4 * 3 * 2 * 1 below collects the numerators of the four "hit"
# probabilities 4/6, 3/6, 2/6 and 1/6.
s = Sum(
    Sum(
        Sum(
            Sum(
                (2**(r_1 - 1) * 3**(r_2 - 1) * 4**(r_3 - 1) * 5**(r_4 - 1))
                / 6**(r_1 + r_2 + r_3 + r_4),
                (r_4, 1, 10 - r_1 - r_2 - r_3),
            ),
            (r_3, 1, 9 - r_1 - r_2),
        ),
        (r_2, 1, 8 - r_1),
    ),
    (r_1, 1, 7),
)

res = factorial(4) * s.doit()
print(res.evalf())
```


## Probability "metrics" / divergences

### Overview of the "metrics" / divergences

These definitions are taken from arjovsky17_wasser_gan

The Total Variation (TV) distance:

$$\delta(P, Q) = \sup_{A \in \Sigma} |P(A) - Q(A)|$$

Or, using slightly different notation, the total variation between two probability measures $P$ and $Q$ with densities $p$ and $q$ is given by

$$\delta(P, Q) = \frac{1}{2} \int |p(x) - q(x)| \, dx$$

These two notions are completely equivalent. One can see this by observing that any discrepancy between $p$ and $q$ is "accounted for twice", since $\int (p - q) \, dx = 0$, which is why we get the factor of $\frac{1}{2}$ in front. We can then gather all $x$ where $p(x) > q(x)$ into a subset $A$, making sure to choose $A$ or its complement such that the difference is positive in the first term. Then we end up with the supremum form seen above.

The Jensen-Shannon (JS) divergence:

$$JS(P, Q) = \frac{1}{2} KL(P \,\|\, M) + \frac{1}{2} KL(Q \,\|\, M)$$

where $M$ is the mixture $M = \frac{P + Q}{2}$.

This divergence is symmetric and always defined because we can choose $M = \frac{P + Q}{2}$, wrt. which both $P$ and $Q$ are absolutely continuous.

The Earth-Mover (EM) distance or Wasserstein-1:

$$W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \operatorname{E}_{(x, y) \sim \gamma}\big[ \|x - y\| \big]$$

where $\Pi(P, Q)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $P$ and $Q$.

Intuitively, $\gamma(x, y)$ indicates how much "mass" must be transported from $x$ to $y$ in order to transform the distribution $P$ into the distribution $Q$. The EM distance then is the "cost" of the optimal transport plan.
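A one-dimensional sketch using scipy's `wasserstein_distance`, which computes exactly this $W_1$ between two empirical distributions; the two shifted Gaussians are an arbitrary choice:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D illustration: the EM / Wasserstein-1 distance between two Gaussians that
# differ only by a shift is (approximately) the size of the shift.
rng = np.random.default_rng(0)
p_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
q_samples = rng.normal(loc=2.0, scale=1.0, size=100_000)

print(wasserstein_distance(p_samples, q_samples))   # ~ 2.0
```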

### Kullback-Leibler divergence

#### Definition

The Kullback-Leibler (KL) divergence:

$$KL(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, d\mu(x)$$

where both $P$ and $Q$ are assumed to be absolutely continuous, and therefore admit densities $p$ and $q$, wrt. the same measure $\mu$ defined on $\mathcal{X}$.

The KL divergence is famously asymmetric and possibly infinite / undefined when there are points $x$ s.t. $q(x) = 0$ and $p(x) > 0$.

It is also non-negative:

$$-KL(P \,\|\, Q) = \int p \log \frac{q}{p} \, d\mu \le \log \int p \, \frac{q}{p} \, d\mu = \log 1 = 0$$

where in the inequality we have used Jensen's inequality together with the fact that $-\log$ is a convex function.

#### Why?

The Kullback-Leibler divergence from a probability distribution $Q$ to a distribution $P$, denoted $KL(P \,\|\, Q)$, is a measure of the information gained when one revises one's beliefs from the prior distribution $Q$ to the posterior distribution $P$. In other words, it is the amount of information lost when $Q$ is used to approximate $P$.

Most importantly, the KL-divergence can be written

$$KL\big(p_{\theta^*} \,\|\, p_\theta\big) = \operatorname{E}_{\theta^*}\big[\log p_{\theta^*}(x)\big] - \operatorname{E}_{\theta^*}\big[\log p_\theta(x)\big]$$

where $\theta^*$ is the optimal parameter and $\theta$ is the one we vary to approximate $p_{\theta^*}$. The second term in the equation above is the only one which depends on the "unknown" parameter $\theta$ ($\theta^*$ is fixed, since this is the parameter we assume the data to take on). Now, suppose we have samples $x_1, \ldots, x_n$ from $p_{\theta^*}$, then observe that the negative log-likelihood for some parametrizable distribution $p_\theta$ is given by

$$-\ell(\theta) = -\sum_{i=1}^{n} \log p_\theta(x_i)$$

By the Law of Large Numbers, we have

$$-\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i) \overset{P}{\to} -\operatorname{E}_{\theta^*}\big[\log p_\theta(x)\big]$$

where $\operatorname{E}_{\theta^*}$ denotes the expectation over the probability density $p_{\theta^*}$. But this is exactly the second term in the KL-divergence! Hence, minimizing the KL-divergence between $p_{\theta^*}$ and $p_\theta$ is equivalent to minimizing the negative log-likelihood, or equivalently, maximizing the likelihood!
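A small sketch of this equivalence, assuming a Gaussian location family, for which the KL has the closed form $(\theta - \theta^*)^2 / 2$; `theta_star` and the grid are arbitrary illustration choices:

```python
import numpy as np
from scipy import stats

# Sketch of the KL <-> maximum-likelihood connection: for data drawn from
# N(theta_star, 1), the average negative log-likelihood over a grid of candidate
# means is minimized (up to sampling noise) at theta_star, exactly where the
# KL divergence KL(N(theta_star, 1) || N(theta, 1)) = (theta - theta_star)^2 / 2 is minimized.
rng = np.random.default_rng(0)
theta_star = 1.3
x = rng.normal(loc=theta_star, scale=1.0, size=10_000)

thetas = np.linspace(0.0, 3.0, 301)
nll = np.array([-stats.norm(loc=t, scale=1.0).logpdf(x).mean() for t in thetas])
kl = (thetas - theta_star) ** 2 / 2

print("argmin NLL:", thetas[nll.argmin()])  # ~ theta_star
print("argmin KL :", thetas[kl.argmin()])   # = theta_star (up to grid resolution)
```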

### Wasserstein metric

Let $(M, d)$ be a metric space for which every probability measure on $M$ is a Radon measure.

For $p \ge 1$, let $\mathcal{P}_p(M)$ denote the collection of all probability measures $\mu$ on $M$ with finite p-th moment (expectation of the distance to some reference point $x_0$, raised to the p-th power),

$$\int_M d(x, x_0)^p \, d\mu(x) < \infty$$

Then the p-th Wasserstein distance between two probability measures $\mu$ and $\nu$ in $\mathcal{P}_p(M)$ is defined as

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\gamma(x, y) \right)^{1/p}$$

where $\Gamma(\mu, \nu)$ denotes the collection of all measures on $M \times M$ with marginals $\mu$ and $\nu$ on the first and second factors respectively (i.e. all possible "joint distributions").

Or equivalently,

$$W_p(\mu, \nu)^p = \inf \operatorname{E}\big[ d(X, Y)^p \big]$$

where the infimum is over all joint distributions of random variables $X$ and $Y$ with marginals $\mu$ and $\nu$ respectively.

Intuitively, if each distribution is viewed as a unit amount of "dirt" piled on the metric space $M$, the metric is the minimum "cost" of turning one pile into the other, which is assumed to be the amount of dirt that needs to be moved times the distance it has to be moved.

Because of this analogy, the metric is sometimes known as the "earth mover's distance".

Using the dual representation of $W_1$, when $\mu$ and $\nu$ have bounded support:

$$W_1(\mu, \nu) = \sup \left\{ \int_M f(x) \, d(\mu - \nu)(x) \;:\; \|f\|_L \le 1 \right\}$$

where $\|f\|_L$ denotes the minimal Lipschitz constant for $f$.

Compare this with the definition of the Radon metric (the metric induced by the total-variation distance between two measures):

$$\rho(\mu, \nu) := \sup \left\{ \int_M f(x) \, d(\mu - \nu)(x) \;:\; f \in C(M), \; \|f\|_\infty \le 1 \right\}$$

If the metric $d$ is bounded by some constant $C$, then

$$2 \, W_1(\mu, \nu) \le C \, \rho(\mu, \nu)$$

and so convergence in the Radon metric implies convergence in the Wasserstein metric, but not vice versa.

Observe that we can write the duality as

$$W_1(\mu, \nu) = \sup_{\|f\|_L \le 1} \int_M f(x) \, d(\mu - \nu)(x)$$

where $\|f\|_L \le 1$ denotes that we're taking the supremum over all functions $f$ which are 1-Lipschitz, i.e. Lipschitz continuous with Lipschitz constant 1.

### Integral Probability Metric

Let $P$ and $Q$ be probability distributions, and $\mathcal{F}$ be some space of real-valued functions. Further, let

$$\gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \operatorname{E}_{x \sim P}\big[f(x)\big] - \operatorname{E}_{x \sim Q}\big[f(x)\big] \right|$$

When $\mathcal{F}$ is sufficiently "large", then

$$\gamma_{\mathcal{F}}(P, Q) = 0 \iff P = Q$$

We then say that $\gamma_{\mathcal{F}}$ together with $\mathcal{F}$ defines an integral probability metric (IPM).

## Stein's method

Absolutely fantastic description of it: https://statweb.stanford.edu/~souravc/beam-icm-trans.pdf

### Notation

• any random variable
• std. Gaussian random variable

### Overview

• Technique for proving generalized central limit theorems
• Ordinary central limit theorem: if are i.i.d. rvs. then

where and

• Usual method of proof:
1. LHS is computed using Fourier Transform
2. Independence implies that FT decomposes as a product
3. Analysis
• Stein's motivation: what if are not exactly independent?!

### Method

Suppose we want to show that is "approximately Gaussian", i.e.

or,

for any well-behaved

It's a generalization because if

then

Suppose

and we want to show that the rv. is approximately std. Gaussian, i.e. for all well-behaved .

1. Given , obtain a function by solving the differential equation

2. Show that

using the properties of

3. Since

conclude that .

More generally, two smooth densities $p$ and $q$ supported on are identical if and only if

for smooth functions with proper zero-boundary conditions, where

is called the Stein score function of .

Stein discrepancy measure between two continuous densities and is defined

where the function class is required to satisfy the expectation above and to also be "rich" enough to ensure that the discrepancy is strictly positive whenever the two densities differ.

### Why does it work?

• Turns out that if we replace the in Stein's method, then is not well-behaved; it blows up at infinity.
• A random variable $X$ has the standard Gaussian distribution if and only if $\operatorname{E}\big[f'(X) - X f(X)\big] = 0$ for all well-behaved (absolutely continuous) $f$! (You can easily check this by observing that the density solving the corresponding differential equation $p'(x) = -x \, p(x)$ is the standard Gaussian density.)
• The differential operator $\mathcal{A}$, defined as

$$\mathcal{A}f(x) = f'(x) - x f(x)$$

## Stochastic processes

### Discrete

#### Simple random walk

Let $\{X_i\}_{i \ge 1}$ be a set of i.i.d. random variables such that

$$P(X_i = 1) = P(X_i = -1) = \frac{1}{2}$$

Then, let

$$S_n = \sum_{i=1}^{n} X_i, \qquad S_0 = 0$$

We then call the sequence $(S_n)_{n \ge 0}$ a simple random walk.

For large $n$, due to the CLT we have

$$\frac{S_n}{\sqrt{n}} \overset{d}{\to} \mathcal{N}(0, 1)$$

i.e. $S_n$ approximately follows a normal distribution with mean $0$ and variance $n$.
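A quick simulation sketch of this normal approximation (the walk length and number of repetitions are arbitrary):

```python
import numpy as np

# Simulate many simple random walks and check that S_n / sqrt(n) looks standard normal.
rng = np.random.default_rng(0)
n, reps = 1_000, 20_000

steps = rng.choice([-1, 1], size=(reps, n))   # P(X_i = +1) = P(X_i = -1) = 1/2
s_n = steps.sum(axis=1)

print(s_n.mean() / np.sqrt(n), (s_n / np.sqrt(n)).std())   # ~ 0 and ~ 1
```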

We then have the following properties:

1. Independent increments: if

$$0 = t_0 \le t_1 \le \cdots \le t_k$$

then $S_{t_1} - S_{t_0}, \, S_{t_2} - S_{t_1}, \ldots, S_{t_k} - S_{t_{k-1}}$ are mutually independent.

2. Stationary: for all $h \ge 1$ and $t \ge 0$, the distribution of $S_{t+h} - S_t$ is the same as the distribution of $S_h$.

#### Martingale

A martingale is a stochastic process which is fair game.

We say a stochastic process $(X_n)_{n \ge 0}$ is a martingale if

$$\operatorname{E}\big[ X_{n+1} \mid X_1, \ldots, X_n \big] = X_n$$

where we also require $\operatorname{E}|X_n| < \infty$ for all $n$.

##### Examples
1. Simple random walk is a martingale
2. Balance of a Roulette player is NOT martingale
##### Optional stopping theorem

Given a stochastic process $(X_n)_{n \ge 0}$, a non-negative integer-valued r.v. $\tau$ is called a stopping time if the event $\{\tau = n\}$

depends only on $X_1, \ldots, X_n$ (i.e. stopping at some time $n$ only depends on the first $n$ r.v.s).

Suppose $(X_n)$ is a martingale and $\tau$ is a stopping time.

If there exists a constant $c$ s.t. $\tau \le c$ a.s., then

$$\operatorname{E}[X_\tau] = \operatorname{E}[X_0]$$

This implies that if a process can be described as a martingale, then it's a "zero-sum" game: no such stopping strategy improves the expected outcome.

### Continuous

#### Lévy process

A stochastic process is said to be a Lévy process if it satisfies the following properties:

1. $X_0 = 0$ almost surely.

##### Wavelet basis

A mother wavelet is such that form an orthogonal basis for some subspace of , hence

converges to in the norm!

##### Multi-Resolution Analysis (MRA)
• Overview
• Algorithm for constructing the different resolutions

We consider wavelets constructed from some mother wavelet :

and we want to expand our signal in such a wavelet basis.

Consider sequence of subspaces

in , with the following properties:

1. Nested

2. Union

is dense in

3. Orthogonality

4. Increased "resolution"

5. such that is an orthogonal basis in
• is called the scaling function or father wavelet

Let be the "standard" scale where our mother wavelet and father wavelet live, i.e. is for .

We can map to by

which is called the dilation or two-scale or refinement equation.

We can repeat this argument for arbitrary , so with , we write

Then,

which means

Finally, this tells us that (the mother wavelet) such that

constitutes an orthogonal basis for .

If is a multi-resolution analysis (MRA) with scaling function , then there exists a mother wavelet defined

where

which allows us to obtain an orthonormal basis for which is dense in :

• Haar wavelet
##### Application

For more information about how and when to apply wavelet transforms, and which mother wavelet to use, check out e.g. MATLAB's documentation on this.

### Testing

#### Notation

• and denotes the acceptance and critical region, respectively
• denotes the critical value of some statistic
• The critical region with a given significance level is denoted

• Type I error: Reject when is true with

• Type II error : Fail to reject when is false (equiv., we accept when we shouldn't):

• denotes the test function:

• refers to the expectation under the distribution obtained by parametrizing the family with .

#### Definitions

A hypothesis is simple if it defines completely:

otherwise, it's composite.

If is parametric with more than one parameter, a composite hypothesis might specify values of some or all of them (e.g. on regression coefficient).

A U-statistic belongs to the class of unbiased statistics; this class arises naturally in producing minimum-variance unbiased estimators (MVUEs).

Important: in statistics there are TWO notions for a U-statistic

Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions.

Thus, nonparametric tests or distribution-free tests are procedures for hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed.

Let the data be , and suppose we wish to test two simple hypotheses:

Suppose that we choose a test statistic and reject if for some critical value .

This induces a partition of the sample space into two disjoint regions:

• the rejection region :

• the acceptance region :

We will also sometimes use the notation to denote the critical region which has the property

Consider and with corresponding p.d.f.'s , for .

For these hypotheses, comparison between the critical regions of different tests is in terms of

the power of a size critical region for alternative .

A best critical region is then the critical region with maximum power.

There are two possible types of error:

• Type I error: Reject when is true
• Type II error: Fail to reject when is false
• denotes the probability of Type I error and is called the significance level (or size ) of the test
• denotes the probability of Type II error which is uniquely defined only if is simple, in which case

denotes the power of the test. For composite , is the power function.

We can define the test function such that

which has the property that

For discrete distributions, the probability that the test statistic lies on the boundary of the critical region, , may be non-zero.

In that case, it is sometimes necessary to use a randomized test, for which the test function is

for some function and we reject based on observed data with probability .

Suppose now there is a parametric family of alternative p.d.f.'s for .

The power of a size critical region generalizes to the size power function

A size critical region is then uniformly most powerful size (UMP size ) if it has maximum power uniformly over ( is NOT the power, is).

A test is UMP if all its critical regions are UMP, or more formally:

A uniformly most powerful or UMP test, , of size is a test for which

1. Given any other test for which for all , we have

i.e. the expectation given is at least as large as for the less powerful test (whose test function is ).

A test of against is called unbiased of size if

and

Informally, an unbiased test is one which has a higher probability of rejecting when it is false than when it is true.

A test which is uniformly most powerful among the set of all unbiased tests is called the uniformly most powerful unbiased (UMPU).

##### Two-sample test

The problem of comparing samples from two probability distributions, by a statistical test of the null hypothesis that these distributions are equal against the alternative hypothesis that these distributions are different, is called the two-sample problem; in short:

• corresponds to distributions being equal
• corresponds to distributions being different
##### Exact test

A test is exact if and only if

which is in contrast to non-exact tests which only have the property

for some "critical constant" .

#### Hypothesis testing

For any size , the LR critical region is the best critical region for testing simple hypothesis vs. .

That is, suppose one is performing a hypothesis test between two simple hypothesis and , using the likelihood-ratio test with threshold which rejects in favour of at a significance level of

where

Then, is the most powerful test at significance level .

##### Notation
• and denotes the null hypothesis and the alternative hypothesis
• and are the parametric set corresponding to the and , respectively
• A test statistic based on a random sample
• -value refers to the minimum value we require from the test for us to accept
• is the information number corresponding to and
##### Stuff
• Assume large values of give evidence against
• For fixed and a real number let

Random quantity

is the value corresponding to the statistic when is the true parametric value.

For example, if

and the null hypothesis is rejected at the significance level .

If for with probability one we will have

then is called the Bahadur exact slope of .

The larger the Bahadur exact slope , the faster the rate of decay of the value under the alternative. It is known that for any , where is the information number corresponding to and .

A test statistic is called Bahadur efficient at if

Bahadur efficiency allows one to compare two (sequences of) test statistics and from the following perspective:

Let be the smallest sample size required to reject at the significance level on the basis of a random sample when is the true parametric value.

The ratio gives a measure of relative efficiency of to .

To reduce the number of arguments, , and , one usually considers the rv. which is the limit of this ratio, as . In many situations this limit does not depend on , so it represents the efficiency of against at with the convenient formula

where and are the corresponding Bahadur slopes.

I.e. can use the Bahadur efficiency to compare the number of samples needed to reject .

#### Kolmogorov-Smirnov

As ,

where is the Kolmogorov-Smirnov statistic.

The empirical distribution function is given by

##### Consistency and unbiasedness at a point

Fix , then

Consequently,

That is, is an unbiased estimator of for each fixed $x$. Also

Consequently, by the Chebyshev inequality,

Therefore, is consistent

#### Kolmogorov-Smirnov test

The Kolmogorov distribution is the distribution of the rv.

where is the Brownian bridge. The cumulative distribution function of is given by

The empirical distribution function for i.i.d. observations is defined as

where

• is the indicator function defined

The Kolmogorov-Smirnov statistic for a given cumulative distribution function is

Under the null hypothesis that the sample comes from the hypothesized distribution,

in distribution, where is the Brownian bridge.

If is continuous, then under converges to the Kolmogorov distribution, which does not depend on .

Kolmogorov-Smirnov test (K-S test) is a non-parametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare one sample with a reference probability distribution (one-sample K-S test), or to compare two samples (two-sample K-S test).

The one-sample KS test rejects the null hypothesis at level $\alpha$ when $\sqrt{n} D_n$ exceeds the corresponding quantile $K_\alpha$ of the Kolmogorov distribution.
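A minimal sketch of the one-sample test using `scipy.stats.kstest`; the normal sample and the two hypothesized distributions are arbitrary illustration choices:

```python
import numpy as np
from scipy import stats

# One-sample KS test: H0 is that the sample comes from the hypothesized CDF F.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)

print(stats.kstest(x, "norm"))            # H0 true: large p-value expected
print(stats.kstest(x, stats.expon.cdf))   # H0 false: tiny p-value expected
```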

#### F-test

An F-test is any statistical test in which the test statistic, called an F-statistic, has an F-distribution under the null hypothesis.

#### Residual-sum of squares between two models as a F-statistic

##### Notation
• denotes the number of samples
• denotes the number of regressors (including the constant term)
• denotes the number of linear "restrictions" / "constraints", and therefore, also the degrees of freedom
• is matrix with full column rank , i.e. , which translates into constraints being "independent". This matrix is used to enforce the constraints on , i.e. if we don't want a constant term / intercept is such that , if the constant term was in the first entry.
• represents the hypothesized residual sum
##### Stuff

Consider two linear regression models with coefficients and , with . Then let and denote the respective number of coefficients, with . Then letting

we have

i.e. the quantity above is F-distributed.

Note: observe that so the above is a positive quantity.

Consider the null-hypothesis . Recall the OLS solution is given by

and we therefore have

Let

and let, using Cholesky decomposition, be such that

we have a "matrix square root" of in . The purpose of this matrix is so we can define the following random variable:

where denotes the identity matrix. To see that this is indeed the case, observe that

Using properties seen in the proof of , we know that this is independent of the rv.

where

where

known as the residual maker matrix. Then we have

Recall that an F-statistic is a ratio between two independent chi-squared random variables, each divided by its corresponding degrees of freedom, hence the above. In particular, under , this reduces to the statistic

Letting (which is the number of components of we set to zero to get ) and (since this is how many parameters we now have), we conclude our proof.
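A small numerical sketch of this construction, assuming a toy regression with one regressor and the restriction "no slope" (so there is $q = 1$ constraint); the data-generating values are arbitrary:

```python
import numpy as np
from scipy import stats

# Sketch: F-test comparing a restricted model (intercept only) against the full
# model (intercept + one regressor), using F = ((RSS_0 - RSS_1)/q) / (RSS_1/(n - k)).
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

X_full = np.column_stack([np.ones(n), x])          # k = 2 regressors
X_restr = np.ones((n, 1))                          # restricted model: intercept only

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS fit
    return np.sum((y - X @ beta) ** 2)

rss_0, rss_1 = rss(X_restr, y), rss(X_full, y)
q, k = 1, X_full.shape[1]                          # q = number of restrictions
f_stat = ((rss_0 - rss_1) / q) / (rss_1 / (n - k))
p_value = stats.f.sf(f_stat, q, n - k)
print(f_stat, p_value)
```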

##### Proof that

Consider the regression model

with

Then the residual vector is estimated by

And is distributed according to

and thus

Consider the following linear model

where , and . The vector of residuals is estimated by

where

Since trace is invariant under cyclic permutations, we have

since matrices, and so . Therefore

Furthermore,

which is seen by the fact that

We then have the following sequence of implications:

• only has eigenvalues and
• is normal and so there exists a unitary matrix such that

Recall that

which is seen by computing and using . From this it follows that

where is the corresponding vector (since the rest of the components have zero variance and thus add nothing to the norm).

Furthermore, since is unitary, as mentioned earlier, we have

Recall that the residual sum-of-squares is given by , and so we arrive at our result

Finally, observe that this implies

### Similarity testing

#### Notation

• with can take on values
• denotes a test

#### Stuff

Test vs. .

is a test of size , i.e.

Then is a similar test of size .

with on the boundary of .

### Confidence regions

#### Notation

• is drawn from some distribution

• is a size critical region for

with

• critical region refers to the sampling distribution which will lead to the rejection of the hypothesis tested when the hypothesis is true.

#### Stuff

• Generalization of confidence intervals to multiple dimensions

is a confidence region for if

If then

#### Pivotal quantities

which has a distribution independent of , i.e. is a pivotal statistic, that is

Then we can construct a value such that we obtain a confidence region as follows:

### Decision Theory

#### Notation

• denotes the sample space i.e.
• denotes the parameter space
• denotes a family of probability distributions
• is the action space, i.e. set of actions an experiment can take after observing data, e.g. reject or accept , estimating the value of , etc.
• denotes the loss function
• denotes the decision rules; each is a function that associates a particular decision with each possible observed data set.

#### Stuff

For a parameter , the risk of a decision rule, , is defined as

In other words, the risk is the expected loss of a particular decision rule when the true value of the unknown parameter is .

Note: this is fundamentally a frequentist concept, since the definition implicitly invokes the idea of repeated samples from the parameter space and computes the average loss over these hypothetical repetitions.

#### Selection of decision rule

Given two possible decision rules and , the rule strictly dominates the rule if

and there exists at least one value of , e.g. θ', for which

It is clear that we would always prefer to .

If, for a given decision rule , there exists some decision rule that strictly dominates , then is said to be inadmissible.

Conversely, if there is no decision rule that dominates , then is said to be admissible.

That is, less-than-or-equal for all values of AND strict inequality for at least ONE value of .

##### Unbiasedness

We say a loss-function is unbiased if

i.e. loss of the decision rule should be minimized for the true value of the parameter .

##### Minimax decision rules

A decision rule is minimax if it minimizes the maximum risk

which can also be written

##### Bayes decision rules

The Bayes risk of a decision rule, , is defined by

or by a sum in case of discrete-valued probability distribution.

A decision rule is a Bayes rule with respect to the prior if it minimizes the Bayes risk, i.e.,

The definition of a Bayes rule assumes that the infimum is achieved by some rule, which might not always be true.

In those cases, for any $\varepsilon > 0$, we can find a decision rule whose Bayes risk is within $\varepsilon$ of the infimum.

Such a rule is said to be $\varepsilon$-Bayes wrt. the prior .

A useful choice of prior is the one that is most conservative in its estimate of risk. This gives rise to the concept of a least favourable prior.

We say is a least favourable prior if, for any other prior we have

where are the Bayes (decision) rules corresponding to and .

##### Randomized decision rules
• decision rules which we choose from with probabilities , with and
• Define to be the rule "selection decision rule with prob. ", called the randomized decision rule, which has risk

• Sort of a linear combination of decision rules
• Minimax rules are often this way
• Supremum over of the risk for is smaller than the supremum of the risk for any of the decision rules individually

#### Finding minimax rules

Suppose that is a Bayes (decision) rule corresponding to prior and , then

1. is a minimax decision rule
2. is the least favourable prior

#### Admissibility of Bayes (decision) rules

We have the following three theorems:

Assume that the parameter space, , is finite and a given prior gives positive weight to each .

Then a Bayes (decision) rule wrt. is admissible.

If a Bayes (decision) rule is unique, it is admissible.

Let be a subset of the real line. Assume that the risk functions are continuous in for all decision rules .

If the prior has the property that for any and any , the interval has non-zero probability under , then the Bayes (decision) rule wrt. is admissible.

### James-Stein Estimators

#### Notation

• is the square loss function
• Risk of a decision rule is then given

The class of James-Stein estimators of the form

$$\hat{\theta}_{JS} = \left( 1 - \frac{(p - 2)\,\sigma^2}{\|X\|^2} \right) X, \qquad p \ge 3$$

has smaller risks that are also independent of , and hence strictly dominates the natural estimator $\hat{\theta} = X$.

These are called shrinkage estimators as they shrink $X$ towards $0$, and this has the following consequence: folding in information from variables that are independent of the variable of interest can, on average, improve estimation of the variable of interest!
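A small simulation sketch of this domination, assuming $X \sim \mathcal{N}(\theta, I_p)$ with $p = 10$ and an arbitrary true mean:

```python
import numpy as np

# Sketch: compare the risk (expected squared error) of the natural estimator X
# with the James-Stein shrinkage estimator for X ~ N(theta, I_p), p >= 3.
rng = np.random.default_rng(0)
p, reps = 10, 20_000
theta = np.ones(p)                                  # arbitrary true mean

x = rng.normal(loc=theta, scale=1.0, size=(reps, p))
shrink = 1.0 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
js = shrink * x                                     # James-Stein estimate

risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))    # ~ p = 10
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))    # strictly smaller
print(risk_mle, risk_js)
```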

Suppose

and is an almost differentiable function then

First we do the univariate case.

Suppose

which is absolutely continuous, and

Using change of variables to set and , then is std. normal. Then

Then one simply substitutes into Stein's lemma and proves that it is indeed satisfied.

### Bayesian Statistics

#### Model comparison

##### Bayesian Information Criterion (BIC)

Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred.

The BIC is defined as:

where

• is the MLE of the model
• is the observed data
• is the number of data points in
• is the number of parameters in the model

BIC is an asymptotic result derived under the assumption that the data distribution belongs to an exponential family.

Limitations:

1. The approximation is only valid for sample sizes much larger than the number of parameters in the model
2. The BIC cannot handle complex collections of models, as in the variable selection problem in high dimensions
##### Akaike Information Criterion (AIC)
• Given collection of models for some data, estimates quality of each model, relative to each of the other models
• NOT a test of model in a sense of testing a null hypothesis, i.e. says nothing about the absolute quality, only relative quality

Suppose we have a statistical model of some data.

Let:

• be the number of estimated parameters in the model
• be the maximum value of the likelihood function for the model

Then the Akaike Information Criterion (AIC) value of the model is

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value.

Thus, AIC rewards goodness-of-fit (as assessed by the likelihood function), but it also penalizes a large number of parameters (complexity).

AIC is based on information theory. Suppose the data was generated by some unknown process $f$, and we're considering two candidate models $g_1$ and $g_2$. We could then compute the KL-divergence of $g_1$ from $f$, $KL(f \,\|\, g_1)$, and of $g_2$ from $f$, $KL(f \,\|\, g_2)$, i.e. the "loss of information" from representing $f$ using $g_1$ or $g_2$, respectively. One could then compare these values, and choose the candidate model which has the smallest KL-divergence from $f$.

Asymptotically, making this choice is equivalent to choosing the model with the smallest

AIC! Note, however, that AIC can be a bad comparison if the number of samples is small.
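A minimal sketch computing both criteria for a Gaussian model fitted by maximum likelihood; the data and the model are arbitrary illustration choices:

```python
import numpy as np
from scipy import stats

# Sketch: compute AIC = 2k - 2 log L_hat and BIC = k log n - 2 log L_hat for a
# Gaussian model fitted by maximum likelihood to some data.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)
n, k = len(x), 2                                   # parameters: mean and std

mu_hat, sigma_hat = x.mean(), x.std()              # MLEs for the Gaussian model
log_lik = stats.norm(loc=mu_hat, scale=sigma_hat).logpdf(x).sum()

aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik
print(aic, bic)
```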

### Bootstrapping

• Bootstrap methods get their name because "using data to generate more data" seems analogous to a trick used by the fictional Baron Munchausen, who when he found himself at the bottom of a lake got out by pulling himself up by his bootstraps :)

#### Notation

• is a single homogeneous sample of data with PDF and CDF
• Statistic whose value in the sample is , which estimates , an underlying characteristic of the population. Generally is a symmetric function of
• EDF stands for the empirical distribution function, denoted
• denotes the parameter of a parametric model with CDF and PDF and , respectively
• is the fitted model to data drawn from the PDF
• denotes the rv. distributed according to the fitted model, and we do the same for other moments (e.g. denotes the mean of the fitted model)
• denotes the random variable of the statistic of interest, in contrast to which is the observed estimate, or rather

#### Stuff

• Interested in probability distribution of
• describes the fact that the population parameter is equal to the value the statistic takes on for the underlying CDF
• expresses the relationship between the estimate and
• To be properly rigorous, we would write and require that as , possibly even that
• We will mostly assume
##### Moment estimates
• Suppose we simulate a dataset , i.e. from fitted model
• Properties of are then estimated from , using repetitions of the data simulation
• Important to recognize that we are not estimating absolute properties of , but rather of relative to
##### Distribution and quantile estimates
• Approximating the distribution of by that of
• Cumulative probabilities are estimated by EDF of the simulated values , i.e. if

then the simulation estimate of is

And , the exact CDF of under sampling from the fitted model

• Approximation to contains two sources of error:
1. to due to data variability
2. to due to finite simulation
• Quantiles of distribution of
• Approximated using ordered values of
• Suppose are independently distributed with CDF , and if denotes the j-th ordered value, then

• This implies a sensible estimate of is the , assuming that is an integer
• Therefore we can estimate the quantile of by the ordered value of , i.e.
• We're assuming is chosen so that is an integer
##### Nonparametric simulation
• Same as in the previous section, but use the EDF to perform the simulation instead of an estimate of the parameter of a parametric distribution (e.g. the MLE estimate of ); call this the nonparametric bootstrap
##### Simple confidence intervals

If we use bootstrap estimates of quantiles for , then an equitailed confidence interval will have limits

where we explicitly write the second term in the parenthesis in the limits to emphasize that we're looking at with an expectation .

This is based on the probability implication

#### Reducing error

• Problem: choose quantity such that is as nearly pivotal as possible
• That is, it has (at least approx.) the same distribution under sampling from both and
• Let with increasing in and if is an approx. lower quantile of , then

where is the inverse transformation

• Thus, is an upper confidence limit for

So we're basically saying "Let's bound the difference between of the true and of our estimator by , with a certainty / probability of "

OR rather, "let's bound the probability of the difference between and being "

OR "we want some constant such that the probability of and differ by more than to be bounded by ", and we

#### Theoretical basis for bootstrap

Suppose we have a random sample , or equiv., its EDF .

We want to estimate some quantity , e.g.

and want to estimate the distribution function

where the conditioning on indicates that is a random sample from .

The bootstrap estimate of is then

where in this case

In order for

we need the following conditions to hold (letting be a neighborhood of in a space of suitable distributions):

1. For any , must converge weakly to a limit
2. This convergence must be uniform on
3. The function mapping to must be continuous

where converge weakly means

for all bounded, continuous functions .

Under these conditions, the bootstrap estimate is consistent.

#### Resampling for testing

##### Nonparametric Permutation tests
• Permutation test is a two-sample test
• Have samples and for two distributions.
• Want to check if is true
• Consider some statistic which measure discrepancy between the two sample-sets, e.g. mean difference

• Under , for every sample , whether or not the sample came from distribution or should be equally likely, since under we have
• Therefore we consider permutations between the samples from the two distributions!
• Consider tuple

• Permute the tuple

where is a permutation on symbols.

• Let and
• Compute assuming to come from and to come from .
• Gives us achieved significance level (ASL), also known as the p-value, by considering

i.e. the probability of getting a large value when is true

• Observe that is a "discrepancy" measurement between the samples, e.g. the difference, and so "large values = BAD"

Practically, there can be a lot of permutations to compute; possibilities, in fact. Therefore we estimate the by "sampling" permutations uniformly:

where corresponds to the estimate of using the m-th sampled permuted dataset, and is the number of permutations considered.

Note: permutation test is an exact test when all permutations are computed.
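A minimal sketch of such a sampled ("Monte Carlo") permutation test, using the absolute difference of means as the discrepancy statistic (an arbitrary choice):

```python
import numpy as np

# Sketch of a permutation test for H0: F = G, using the difference of means as
# the discrepancy statistic and M uniformly sampled permutations to estimate the ASL.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=50)
y = rng.normal(loc=0.5, scale=1.0, size=60)

def statistic(a, b):
    return abs(a.mean() - b.mean())

t_obs = statistic(x, y)
pooled = np.concatenate([x, y])

M = 10_000
count = 0
for _ in range(M):
    perm = rng.permutation(pooled)                 # relabel the pooled sample
    count += statistic(perm[:len(x)], perm[len(x):]) >= t_obs

print("estimated ASL / p-value:", count / M)
```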

#### Frequentist bootstrapping

• Data approximates unknown distribution
• Sampling distribution of a statistic can be approximated by repeatedly resampling the observations with replacement and computing the statistic for each sample

Let

• denote the original samples
• denote the bootstrap sample
• Likely have some observations repeated one or more times, while some are absent

Equivalently, we can instead make the following observation:

• Each original observation occurs anywhere from to times
• Let denote # of times occurs in and

thus

• Let such that

then

• Hence we can compute the bootstrap sample by instead drawing this way, and weighting the original samples by these drawn weights!

Recall that

has the property that

Hence,

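A minimal sketch of the resampling scheme itself, estimating the sampling variability of the median of an arbitrary Exponential sample:

```python
import numpy as np

# Sketch of the nonparametric (frequentist) bootstrap: approximate the sampling
# distribution of the median by resampling the data with replacement.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200)           # original sample
theta_hat = np.median(x)

B = 5_000
boot = np.array([np.median(rng.choice(x, size=len(x), replace=True)) for _ in range(B)])

print("estimate:", theta_hat)
print("bootstrap std. error:", boot.std())
print("bootstrap 95% interval:", np.percentile(boot, [2.5, 97.5]))
```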
### Problems / Examples

#### Multi-parameter MLE

##### Normal samples

Suppose we have the samples following the model

Then we want to test against .

The likelihood is given by

Under our restricted parameter space (), the MLE is

Under ( and can take on any value):

Thus, the likelihood ratio is

But

and thus

where

Rejecting for small values of LR is equivalent to rejecting for large values of , i.e. this test is equivalent to a two-tailed t-test.

Here we can determine an exact distribution for LR, or rather for the equivalent test statistic , i.e.

However, in general, the distribution of LR (under ) is difficult to determine and we need to use the approximate method.

#### Fisher's method of scoring

##### Multiparameter case

Let be i.i.d. with

with p.d.f.

that is .

• Log-likelihood:

• Score vector:

• Hessian matrix:

• Observed information matrix:

• Expected information matrix:

• Asymptotic variance-covariance matrix of is:

• Estimated standard errors:

• Covariance between our estimates:

• Asymptotic distribution:

This is the distribution on which we would base a Wald test of , i.e.

## Asymptotics

### Notation

• For a measure on a measurable space and a measurable function ,

• denotes the empirical measure, such that

• denotes the empirical process, i.e. the "cenetered & scaled" version of ,

• denotes a P-Brownian bridge
• denotes the upper α-quantie of a std. normal
• denotes the upper α-quantie of a χ²-distribution
• denotes the upper α-quantie of a t-distribution
• denotes absolutely continuous
• denotes contiguous
• denotes mutually contiguous
• denotes weak convergence, and if has distribution , we also write
• For a given sequence of random variables

where denotes the fact that a sequence is bounded in probability.

### Stochastic convergence

For any random vectors and the following statements are equivalent

1. for all continuity points of
2. for all bounded, continuous functions , i.e.
3. for all bounded, Lipschitz functions
4. for all nonnegative, continuous functions
5. for every open set
6. for every closed set
7. for all Borel sets with , where is the boundary of

lemma:portmanteau-weak-convergence-equivalences is a very useful lemma for talking about convergences in

Any random vector is tight : for every

A set of random vectors is called uniformly tight if can be chosen the same for every : for every

Thus, there exists a compact set to which all give probability "almost" one.

Another name for uniformly tight is bounded in probability.

Let be random vectors in .

1. If for some , then is uniformly tight.
2. If is uniformly tight, then there exists a subsequence with as for some .

Prohorov's theorem states that every uniformly tight sequence contains a weakly converging subsequence. This basically generalizes the Heine-Borel theorem to random vectors.

#### Characteristic functions

Let and be random vectors in . Then

Moreover, if converges pointwise to a function that is continuous at zero, then is the characteristic function of a random vector and .

## Statistical Learning Theory

### Binary classification

#### Notation

• denotes the training data for the algorithm
• and for convenience
• denotes the class of binary classifiers considered, and thus

#### Symmetrization

• Idea: replace the true risk by an estimate computed on an independent set of data, i.e. (called a virtual or ghost sample)

For any $t$ s.t. $n t^2 \ge 2$,

where

• denotes the empirical measure of the "ghost" sample

#### Stuff

For a class of functions, the Rademacher average is defined as

where

• is defined

• are Rademacher random variables, i.e. i.i.d. taking values $+1$ and $-1$ each with probability $\frac{1}{2}$.
• denotes the expectation over all the rvs, i.e.

Similarly, the conditional Rademacher average is defined as

where

• denotes the expectation only over the rvs , conditioned on

For all , with probability at least

and also, with prob. at least ,

where

• True risk

• Empirical risk

## Topics in Statistical Theory

### 1.2 Estimating an arbitrary distribution function

Let

• denote the class of all distribution functions on
• Empirical distribution is given by

We have

as .

Given , let .

Set , and

and .

Writing , which we can do since is left-continuous, we note that

Let

where

(note that this uses a strict inequality). Hence, by a union bound, we have

where, by the SLLN, the terms inside the absolute values go to zero a.s. as , therefore implying that

Now fix and find s.t. . Then for any and ,

where in the second inequality we use the fact that .

Similarly, for the other direction, we have

From the two equations above, it follows that

since either we have

1. : in which case the event on the LHS can occur if

2. : in which case the event on the LHS can occur if


Hence the probability of such an event is upper-bounded by the RHS .

We therefore conclude that

i.e. for all , such that , the difference is less than .

In particular, note that the probability on the RHS is:

1. Decreasing in (since the event becomes smaller as increases)
2. Increasing in (since the "number of" we're taking the intersection for decreases)

This means that we have

as we wanted.

### 1.3

Let

• be a sequence of random vectors in

Suppose and a random vector such that

as .

Furthermore, suppose that is differentiable at .

Then

This can be seen by expanding using the Taylor expansion at .

### 1.4. Concentration Inequalities

We say the random variable $X$ is sub-Gaussian with parameter $\sigma$ if $\operatorname{E}[X] = 0$ and it has MGF

$$\operatorname{E}\big[e^{tX}\big] \le e^{\sigma^2 t^2 / 2} \quad \text{for all } t \in \mathbb{R}$$

Let be a sub-Gaussian rv. with parameter .

Then

For any , we have

By Markov's inequality we have

where in the second inequality we used the fact that is sub-Gaussian with parameter . Since this holds for any ,

The RHS attains its minimum at (this can be seen by expanding the square and solving the resulting quadratic), thus

as claimed!

Let be independent r.v.s. with

• finite variances
• for some

Moreover, let

• defined by

Then, for every ,

Consequently, defining

we have that for ,

Define

where the definition at , is simply to ensure continuity for all .

Then is increasing, so for ,

Hence, for ,

where in the

• second to last inequality we have multiplied by and used Jensen's inequality,
• final inequality we used the fact that for .

Consequently, by a Chernoff bound,

since the infimum is attained at .