Probability & Statistics
Table of Contents
Notation
- denotes the distribution of some random variable
- means convergence in probability for random elements
- means convergence in distribution for random elements , i.e. in distribution if for all continuity points of .
- means almost sure convergence for random elements
Theorems
Chebyshev Inequality
Let (integrable) be a random variable with finite expected value and finite non-zero variance .
Then for any real number ,
Let be a measure space, and let be an extended real-valued measurable function defined on .
Then for any real number and
Or more generally, if is an extended real-valued measurable function, nonnegative and nondecreasing on the range of , then
Hmm, not quite sure how to relate the probabilistic and measure-theoretic definitions…
Markov Inequality
If is a nonnegative random variable and , then
First note that
and
Hence,
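As a quick sanity check (my own addition, not part of the original derivation), here is a Monte Carlo sketch verifying both the Markov and Chebyshev bounds, using an Exponential(1) variable as an arbitrary example and assuming numpy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)  # X >= 0, E[X] = 1, Var[X] = 1

# Markov: P(X >= a) <= E[X] / a
a = 3.0
print((x >= a).mean(), "<=", x.mean() / a)

# Chebyshev: P(|X - E[X]| >= k * sigma) <= 1 / k^2
k = 3.0
mu, sigma = x.mean(), x.std()
print((np.abs(x - mu) >= k * sigma).mean(), "<=", 1 / k**2)
```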
Law of large numbers
There are mainly two laws of large numbers:
Often the assumption of finite variance is made, but this is in fact not necessary (but makes the proof easier). Large or infinite variance will make the convergence slower, but the Law of Large Numbers holds anyway.
Central Limit Theorem
Suppose is a sequence of i.i.d. random variables with and . Then,
where means convergence in distribution, i.e. that the cumulative distribution functions of converge pointwise to the cdf of the distribution: for any real number ,
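To see this pointwise convergence of cdfs concretely, here is a minimal simulation (my own sketch, assuming numpy/scipy) comparing the empirical cdf of the standardized sample mean of Uniform(0, 1) variables against the standard normal cdf:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 1000, 20_000

# X_i ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12
samples = rng.uniform(size=(reps, n))
z = (samples.mean(axis=1) - 0.5) / np.sqrt(1 / 12 / n)

# Empirical cdf of the standardized mean vs. the standard normal cdf
for q in (-1.0, 0.0, 1.0):
    print(q, (z <= q).mean(), stats.norm.cdf(q))
```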
Jensen's Inequality
If
- is a convex function.
- is a rv.
Then .
Further, if ( is strictly convex), then "with probability 1".
If we instead have:
- is a concave function
Then . This is the one we need for the derivation of EM-algorithm.
To get a visual intuition, here is an image from Wikipedia:
Continuous Mapping Theorem
Let , be random elements defined on a metric space .
Suppose a function (where is another metric space) has the set of discontinuity points such that .
Then,
We start with convergence in distribution, for which we will need a particular statement from the
Slutsky's Lemma
Let , be sequences of scalar/vector/matrix random elements.
If and , where is a constant, then
This follows from the fact that if:
then the joint vector converges in distribution to , i.e.
Next we simply apply the Continuous Mapping Theorem, recognizing the functions
as continuous (for the last mapping to be continuous, has to be invertible).
Definitions
Probability space
A probability space is a measure space such that the measure of the whole space is equal to one, i.e.
- is the sample space (some arbitrary non-empty set)
- is the σ-algebra over , which is the set of possible events
- is the probability measure such that
Random measure
Let be a separable metric space (e.g. ) and let be its Borel σ-algebra.
is a transition kernel from a probability space to if
For any fixed , the mapping
is a measurable function from to
For every fixed , the mapping
is a (often, probability) measure on
We say a transition kernel is locally finite, if
satisfies for all bounded measurable sets and for all except for some zero-measure set (under ).
Let be a separable metric space (e.g. ) and let be its Borel σ-algebra.
A random measure is an (a.s.) locally finite transition kernel from an (abstract) probability space to .
Let be a random measure on the measurable space and denote the expected value of a random element with .
The intensity measure is defined
So it's a non-random measure which sends an element of the sigma-algebra to the expected value of the , since is a measurable function, i.e. a random variable.
Poisson process
A Poisson process is a generalized notion of a Poisson random variable.
A point process is a (general) Poisson process with intensity if it has the two following properties:
- Number of points in a bounded Borel set is a Poisson random variable with mean .
In other words, denote the total number of points located in by , then the probability of random variable being equal to is given by
- Number of points in disjoint Borel sets form independent random variables.
The Radon measure maintains its previous interpretation of being the expected number of points located in the bounded region , namely
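As an illustration (my own sketch, not from the notes), a homogeneous Poisson process on $[0, T]$ with rate λ can be simulated by first drawing the total count and then scattering the points uniformly; the count in any Borel subset B should then be Poisson with mean λ|B|:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, T, reps = 2.0, 10.0, 20_000

# Number of points in [0, T] is Poisson(lam * T); given the count,
# the points are i.i.d. Uniform(0, T)
counts_in_B = np.empty(reps)
for i in range(reps):
    n = rng.poisson(lam * T)
    points = rng.uniform(0, T, size=n)
    counts_in_B[i] = np.sum((points >= 1.0) & (points < 3.0))  # B = [1, 3)

# Count in B is Poisson with mean lam * |B| = 2 * 2 = 4,
# so mean and variance should both be close to 4
print(counts_in_B.mean(), counts_in_B.var())
```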
Cox process
Let be a random measure.
A random measure is called a Cox process directed by , if is a Poisson process with intensity measure .
Here is the conditional distribution of , given .
Degeneracy of probability distributions
A degenerate probability distribution is a probability distribution with support only on a lower-dimensional space.
E.g. if the degenerate distribution is univariate (involving a single random variable), then it's a deterministic distribution, only taking on a single value.
Characteristic function
Let be a random variable with density and cumulative distribution function .
Then the characteristic function is defined as the Fourier transform of :
Examples
10 dice: you want at least one 4, one 3, one 2, one 1
Consider the number of misses before a first "hit". When we say "hit", we refer to throwing something in the set .
Then we can do as follows:
- is the # of misses before our first "hit"
- is the # of misses between our first "hit" and second "hit"
- …
We then observe the following:
- since if we have more than misses before our first hit, we'll never be able to get all the events in our target set
We also observe that there are , , and ways of missing for each of the respective target events.
```python
from sympy import Sum, symbols, factorial

r_1, r_2, r_3, r_4 = symbols("r_1 r_2 r_3 r_4", nonnegative=True, integer=True)

# Nested sum over the waiting times between consecutive new target faces;
# the factor 4! accounts for the order in which the faces can first appear.
s = Sum(
    Sum(
        Sum(
            Sum(
                (2**(r_1 - 1) * 3**(r_2 - 1) * 4**(r_3 - 1) * 5**(r_4 - 1))
                / 6**(r_1 + r_2 + r_3 + r_4),
                (r_4, 1, 10 - r_1 - r_2 - r_3),
            ),
            (r_3, 1, 9 - r_1 - r_2),
        ),
        (r_2, 1, 8 - r_1),
    ),
    (r_1, 1, 7),
)
res = factorial(4) * s.doit()
print(res.evalf())
```
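A quick Monte Carlo cross-check of the closed-form sum (my own addition); the estimate should agree with the sympy result up to simulation error:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=(1_000_000, 10))  # 10 dice per experiment

# "Hit" = each of the faces 1, 2, 3 and 4 appears at least once among the 10 dice
hit = np.ones(len(rolls), dtype=bool)
for face in (1, 2, 3, 4):
    hit &= (rolls == face).any(axis=1)

print(hit.mean())
```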
Probability "metrics" / divergences
Notation
Overview of the "metrics" / divergences
These definitions are taken from arjovsky17_wasser_gan
The Total Variation (TV) distance
Or, using slightly different notation, the total variation between two probability measures and is given by
These two notions are completely equivalent. One can see this by observing that any discrepancy between and is "accounted for twice", since , which is why we get the factor of in front. We can then gather all where into a subset , making sure to choose or its complement such that the difference is positive in the first term. Then we end up with the seen above.
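A tiny numeric illustration (my own sketch) of the two equivalent forms for discrete distributions; p and q are arbitrary example distributions:

```python
import numpy as np

# Two discrete distributions on the same finite set
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# TV(p, q) = (1/2) * sum |p - q| = sup_A |p(A) - q(A)|
tv_half_l1 = 0.5 * np.abs(p - q).sum()
A = p > q                      # the maximizing event gathers all points where p > q
tv_sup = p[A].sum() - q[A].sum()
print(tv_half_l1, tv_sup)      # both 0.3
```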
The Jensen-Shannon (JS) divergence
where is the mixture .
This divergence is symmetrical and always defined because we can choose .
The Earth-Mover (EM) distance or Wasserstein-1
where denotes the set of all joint distributions whose marginals are respectively and .
Intuitively, indicates how much "mass" must be transported from to in order to transform the distributions into the distribution . The EM distance then is the "cost" of the optimal transport plan.
Kullback-Leibler divergence
Definition
The Kullback-Leibler (KL) divergence
where both and are assumed to be absolutely continuous, and therefore admit densities, wrt. the same measure defined on .
The KL divergence is famously asymmetric and possibly infinite / undefined when there are points s.t. and .
where in the inequality we have used Jensen's inequality together with the fact that is a convex function.
Why?
The Kullback-Leibler divergence between probability distributions and , denoted , is a measure of the information gained when one revises one's beliefs from the prior distribution to the posterior distribution . In other words, it is the amount of information lost when is used to approximate .
Most importantly, the KL-divergence can be written
where is the optimal parameter and is the one we vary to approximate . The second term in the equation above is the only one which depends on the "unknown" parameter ( is fixed, since this is the parameter we assume to take on). Now, suppose we have samples from , then observe that the negative log-likelihood for some parametrizable distribution is given by
By the Law of Large numbers, we have
where denotes the expectation over the probability density . But this is exactly the second term in the KL-divergence! Hence, minimizing the KL-divergence between and is equivalent to minimizing the negative log-likelihood, or equivalently, maximizing the likelihood!
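Here is a small numeric sketch of this equivalence (my own addition, assuming numpy), using a Gaussian location family where the KL divergence is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 1.5
x = rng.normal(theta_star, 1.0, size=100_000)

thetas = np.linspace(0.0, 3.0, 301)
# Average negative log-likelihood of N(theta, 1), up to an additive constant
nll = np.array([0.5 * np.mean((x - t) ** 2) for t in thetas])
# KL( N(theta_star, 1) || N(theta, 1) ) = (theta - theta_star)^2 / 2
kl = 0.5 * (thetas - theta_star) ** 2

# Both are minimized at (approximately) theta_star
print(thetas[nll.argmin()], thetas[kl.argmin()])
```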
Wasserstein metric
Let be a metric space for which every probability measure on is a Radon measure.
For , let denote the collection of all probability measures on with finite p-th moment (expectation of rv. to the p-th power) for some ,
Then the p-th Wasserstein distance between two probability measures and in is defined as
where denotes the collection of all measures on with marginals and on the first and second factors respectively (i.e. all possible "joint distributions").
Or equivalently,
Intuitively, if each distribution is viewed as a unit amount of "dirt" piled on the metric space , the metric is the minimum "cost" of turning one pile into the other, which is assumed to be the amount of dirt that needs to be moved times the distance it has to be moved.
Because of this analogy, the metric is sometimes known as the "earth mover's distance".
Using the dual representation of , when and have bounded support:
where denotes the minimal Lipschitz constant for .
Compare this with the definition of the Radon metric (metric induced by distance between two measures):
If the metric is bounded by some constant , then
and so convergence in the Radon metric implies convergence in the Wasserstein metric, but not vice versa.
Observe that we can write the duality as
where we let denote that we're finding the supremum over all functions which are 1-Lipschitz, i.e. Lipschitz continuous with Lipschitz constant 1.
Integral Probability Metric
Let and be probability distributions, and be some space of real-valued functions. Further, let
When is sufficiently "large", then
We then say that together with defines an integral probability metric (IPM).
ϕ-divergences
Stein's method
Absolutely fantastic description of it: https://statweb.stanford.edu/~souravc/beam-icm-trans.pdf
Notation
- any random variable
- std. Gaussian random variable
Overview
- Technique for proving generalized central limit theorems
Ordinary central limit theorem: if are i.i.d. rvs. then
where and
- Usual method of proof:
- LHS is computed using Fourier Transform
- Independence implies that FT decomposes as a product
- Analysis
- Stein's motivation: what if are not exactly independent?!
Method
Suppose we want to show that is "approximately Gaussian", i.e.
or,
for any well-behaved
It's a generalization because if
then
Suppose
and we want to show that the rv. is approximately std. Gaussian, i.e. for all well-behaved .
1. Given , obtain a function by solving the differential equation
2. Show that
using the properties of
Since
conclude that .
More generally, two smooth densities and supported on are identical if and only if
for smooth functions with proper zero-boundary conditions, where
is called the Stein score function of .
The Stein discrepancy measure between two continuous densities and is defined
where is the set of functions which satisfies the previous expectation and is also "rich" enough to ensure that whenever .
Why does it work?
- Turns out that if we replace the in Stein's method, then is not well-behaved; it blows up at infinity.
- A random variable has the standard Gaussian distribution if and only if for all ! (you can easily check this by observing that the solution to this is ).
Differential operator , defined as
is called a characterizing operator for the standard Gaussian distribution.
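A quick Monte Carlo check of the characterizing identity E[f'(Z)] = E[Z f(Z)] (my own sketch; the test function f(z) = sin(z) is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(2_000_000)

# Stein identity for the standard Gaussian: E[f'(Z)] = E[Z f(Z)]
# Check with f(z) = sin(z), f'(z) = cos(z); both sides approx exp(-1/2) ~ 0.6065
print(np.cos(z).mean(), (z * np.sin(z)).mean())
```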
Stochastic processes
Discrete
- Markov chains are an example of a discrete stochastic process
Simple random walk
Let be a set of i.i.d. random variables such that
Then, let
We then call the sequence a simple random walk.
For large , due to CLT we have
i.e. follows a normal distribution with and .
We then have the following properties:
Independent measurements : if
then are mutually independent, with .
- Stationary : for all and , the distribution of is the same as the distribution of .
Martingale
A martingale is a stochastic process which is a fair game.
We say a stochastic process is a martingale if
where
Examples
- Simple random walk is a martingale
- Balance of a Roulette player is NOT a martingale
Optional stopping theorem
Given a stochastic process , a non-negative integer r.v. is called a stopping time if
depends only on (i.e. stopping at some time only depends on the first r.v.s).
This implies that if a process can be described as a martingale, then it's a "zero-sum".
Continuous
Lévy process
A stochastic process is said to be a Lévy process if it satisfies the following properties:
- almost surely.
- Independence of increments: for any $0 \le t_1 \le t_2 \le \dots \le t_n < \infty$, the increments are independent.
- Stationary increments: For any , the rvs. is equal in distribution to .
Continuity in probability: For any and it holds that
If is a Lévy process then one may construct a version of such that is almost surely right-continuous with left limits.
Wiener process
The Wiener process (or Brownian motion) is a continuous-time stochastic process characterized by the following properties:
- (almost surely)
- has independent increments: for , future increments with , are independent of the past values ,
has Gaussian increments:
- has continuous paths: with probability , is continuous in
Further, let be i.i.d. rv. with mean 0 and variance 1. For each , define the continuous-time stochastic process
Then
i.e. a random walk approaches a Wiener process.
Properties
Let be the first time such that
Then
Observe that
which is saying that at the point we have , and therefore for all time which follows we're equally likely to be above or below .
Therefore
where in the last inequality we used the fact that the set of events .
This theorem helps us solve the gambler's ruin problem!
Let denote your bankroll at time , and at each it changes by
The continuous approximation of this process is given by a Brownian motion, hence we can write the bankroll as
where denotes a Brownian motion (with ).
We're interested in how probable it is for us to be ruined before some time . Letting be the time when we go bankrupt, we have
i.e. that we hit the time when we go bankrupt before time .
Bankruptcy corresponds to
where denotes a Wiener process (hence centered at 0).
Then we can write
where the last equality is due to the symmetry of a Normal distribution and thus a Brownian motion.
Hence
where denotes a standard normal distribution.
This is fairly straightforward to prove; just recall the Law of Large Numbers and you should be good.
Non-differentiability
is non-differentiable with probability 1.
Suppose, for the sake of contradiction, that is differentiable with probability 1. Then, by the MVT (mean value theorem) we have
for some and all . This implies that
since .
Hence
As , since , then
where
Thus, when , we have
since Hence,
contradicting our initial assumption.
Geometric Brownian motion
A stochastic process is said to follow a Geometric Brownian Motion if it satisfies the following stochastic differential equation (SDE):
where is the Brownian motion, and (percentage drift) and (percentage volatility) are constants.
Brownian bridge
A Brownian bridge is a continuous-time stochastic process whose probability distribution is the conditional probability distribution of a Brownian motion subject to the condition that
so that the process is pinned at the origin at both and . More precisely,
Then
implying that the most uncertainty is in the middle of the bridge, with zero uncertainty at the nodes.
Sometimes the notation is used for a Wiener process / Brownian motion rather than for a Brownian bridge.
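A simulation sketch (my addition): build Wiener paths from scaled Gaussian increments, pin them via B(t) = W(t) − t·W(1), and check that the variance profile is t(1 − t):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 5000
dt = 1.0 / n
t = np.linspace(0.0, 1.0, n + 1)

# Wiener paths from scaled i.i.d. Gaussian increments, then pinned down:
# B(t) = W(t) - t * W(1) is a Brownian bridge on [0, 1]
dW = np.sqrt(dt) * rng.standard_normal((reps, n))
W = np.concatenate([np.zeros((reps, 1)), dW.cumsum(axis=1)], axis=1)
B = W - t * W[:, -1:]

# Var B(t) = t (1 - t): zero at the endpoints, maximal (= 1/4) at t = 1/2
print(B[:, n // 2].var(), B[:, n // 4].var())  # approx 0.25 and 0.1875
```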
Markov Chains
Notation
- denotes the transition matrix (assumed to be ergodic, unless stated otherwise)
- is the stationary distribution
- denotes the initial state
- is the state space
- denotes an uncountable state space
- denotes , which means all points except those with zero measure, i.e. such that
Definitions
A Markov chain is said to be irreducible if it's possible to reach any state from any state.
A state has a period if any return to state must occur in multiples of time steps.
Formally, the period of a state is defined as
provided that the set is non-empty.
A state is said to be transient if, given that we start in state , there is a non-zero probability that we will never return to .
A state is said to be recurrent (or persistent) if it is not transient, i.e. guaranteed (with prob. 1) to have a finite hitting time.
A state is said to be ergodic if it is aperiodic and positive recurrent, i.e.
- aperiodic: period of 1, i.e. can return to current state in a single step
- positive recurrent: has a finite mean recurrence time
If all states in an irreducible Markov chain are ergodic , then the chain is said to be ergodic.
The above definitions can be generalized to continuous state-spaces by considering measurable sets rather than states . E.g. recurrence now means that for any measurable set , we have
where denotes the number of steps we need to take until we "hit" , i.e. the hitting time.
[This is a personal thought though, so not 100% guaranteed to be correct.]
Coupling
- Useful for bounding the mixing rate of Markov chains, i.e. the number of steps it takes for the Markov chain to converge to the stationary distribution
Distance to Stationary Distribution
Use Total Variation as a distance measure; therefore convergence is defined through
denotes the variation distance between two Markov chain random variables and , i.e.
One can prove that
which allows us to bound the distance between a chain and the stationary distribution, by considering the difference between two chains, without knowing the stationary distribution!
Coupling
Let and be random variables with probability distributions and on , respectively.
A distribution on is a coupling if
Consider a pair of distributions and on .
For any coupling of and , with
There always exists a coupling s.t.
Observe that
Therefore,
Concluding the proof.
"Inspired" by our proof in 1., we fix diagonal entries:
to make one of the inequalities an equality. Now we simply need to construct such that the above is satisfied, and it's a coupling. One can check that
does indeed do the job.
Ergodicity Theorem
If is irreducible and aperiodic, then there is a unique stationary distribution such that
Consider two copies of the Markov chain and , both following . We create the coupling distribution as follows:
- If , then choose and independently according to
- If , choose and set
From the coupling lemma, we know that
Due to ergodicity, there exists such that . Therefore, there is some such that for all initial states and ,
Similarly, due to the Markovian property, we know
Since
we have
Hence, for any positive integer :
Hence,
Coupling lemma then gives
Finally, we observe that, letting ,
which shows that , i.e. converges to the stationary distribution.
Finally, observe that is unique, since
Mixing time
Let be some and let have the stationary distribution. Fix . By the coupling lemma, there is a coupling and random variables and such that
Using this coupling, we define a coupling of the distributions of as follows:
- If , set
- Else, let and independently
Then we have,
The first inequality holds due to the coupling lemma, and the second inequality holds by construction of the coupling.
Since never increases, we can define the mixing time of a Markov chain as
for some chosen .
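As a concrete illustration of total-variation convergence and mixing time (my own sketch; the two-state transition matrix is a hand-picked example):

```python
import numpy as np

# A small ergodic chain: TV distance to stationarity is non-increasing in n
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])          # solves pi P = pi

mu = np.array([1.0, 0.0])              # start deterministically in state 0
eps = 0.01
n = 0
while 0.5 * np.abs(mu - pi).sum() > eps:   # TV distance as (1/2) L1 norm
    mu = mu @ P
    n += 1
print("mixing time for eps = 0.01:", n)
```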
General (uncountable) state space
Now suppose we have some uncountable state-space . Here we have to be a bit more careful when going about constructing Markov chains.
- Instead of considering transitions for some , we have to consider transitions for and .
- Transition probability is therefore denoted
- This is due to the fact that typically the measure of a singleton set is zero, i.e. , when we have an uncountable number of possible inputs.
- That is; we have to consider transitions to subsets of the state-space, rather than singletons
- This requires a new definition of irreducibility of a Markov chain
A chain is said to be irreducible if there exists a non-zero sigma-finite measure on such that for all with (i.e. all non-zero measurable subsets), and for all , there exists a positive integer such that
We also need a slightly altered definition of periodicity of a Markov chain:
A Markov chain with stationary distribution is periodic if there exist and disjoint subsets with
and
such that , and hence for all . That is, we always transition between subsets, and never within these disjoint subsets.
If such does not exist, i.e. the chain is not periodic, then we say the chain is aperiodic.
Results such as the coupling lemma also hold for the case of uncountable state-space, simply by replacing the summation with the corresponding integral over . From roberts04_gener_state_space_markov_chain_mcmc_algor, similarly to the countable case, we have the following properties:
Let and be probability measures on the space . Then
More generally,
In particular,
If is a stationary distribution for a Markov chain with kernel , then is non-increasing in , i.e.
for .
More generally, letting
we always have
If and have densities and , respectively, wrt. some sigma-finite measure , then
with
Given probability measures and , there are jointly defined random variables and such that
and
From roberts04_gener_state_space_markov_chain_mcmc_algor we also have the following, important theorem:
If a Markov chain on a state space with countably generated sigma-algebra is irreducible and aperiodic, and has a stationary distribution , then for almost-every ,
In particular,
We also introduce the notion of Harris recurrent:
We say a Markov chain is Harris recurrent if for all with (i.e. all non-zero measurable subsets), and all , the chain will eventually reach from with probability 1, i.e.
This notion is stronger than the notion of irreducibility introduced earlier.
Convergence rates
Uniform Ergodicity
From this theorem, we have implied asymptotic convergence to stationarity, but it does not provide us with any information about the rate of convergence. One "qualitative" convergence rate property is
A Markov chain having stationary distribution is uniformly ergodic if
for some and .
For further developments of ensuring uniform ergodicity, we need to define
A subset is small if there exists a positive integer , , and a probability measure on s.t. the following minorisation condition holds:
i.e. for all and all measurable .
Geometric ergodicity
A Markov chain with stationary distribution is geometrically ergodic if
for some where for .
Difference between uniform ergodicity and geometric ergodicity is the fact that in geometric ergodicity the can also depend on the initial state .
In practice
Diagnostics
Effective sample size (ESS)
(split) statistic
- Consider running chains, each for steps, from different initializations , where is even
- Then consider splitting each chain in half, resulting in two half-chains for each chain, leaving us with a total of chains
- Reason: If a chain indeed represents independent samples from the stationary distribution, then at the very least the first half and second half of the chain should display similar behavior (for sufficiently large )
You might wonder, why not split in 3? Or 4? or, heck, ?!
Likely answer: "You could, but be sensible!" Very unsatisfying…
Let be the scalar estimand, i.e. the variable we're trying to estimate, and let , for and , denote the samples of the variable of interest obtained from the MCMC chains, each with samples, with different initializations.
We denote the (B)etween- and (W)ithin-sequence variances
where
empirical mean within a chain
empirical mean across all chains
empirical variance within each chain
Notice that contains a factor of because it is based on the variance of the within-sequence means , each of which is an average of values .
We can then estimate , the marginal posterior variance of the estimand, by a weighted average of and , namely
Finally, we can then monitor convergence of the iterative simulation by estimating the factor by which the scale of the current distribution for might be reduced if the simulations were continued in the limit . This potential scale reduction is estimated by
which has the property that
(remember that depend on ).
If
- is large → reason to believe that further simulations may improve our inference about the target distribution of the associated scalar estimand
- is small → ???
Specific distributions & processes
Hazard function
The hazard function / failure rate is defined
where denotes the PDF and denotes the CDF.
TODO Renewal processes
- Let with be a sequence of waiting times between and the event.
Arrival time of the event is
- Denote by the total number of events in
- If is fixed, then is the count variable we wish to model
It follows that
i.e. total number of events is less than if and only if the arrival time of the event is greater than , which makes sense
If is the distribution function of , we have
Furthermore
- This is the fundamental relationship between the count variable and the timing process
- If , the process is called a renewal process
In this case, we can extend the above equation to the following recursive relationship:
where we have
- Intuitively: probability of exactly events occurring by time is the probability that the first event occurs at time , and that exactly events occur in the remaining time interval, integrated over all times .
- Evaluating this integral, we can obtain from the above recursive relationship.
Weibull renewal process
Statistics
Notation
- denotes converges in probability to
- denotes convergence in law to
Definitions
Efficiency
The efficiency of an unbiased estimator is the ratio of the minimum possible variance to .
An unbiased estimator with efficiency equal to 1 is called efficient or a minimum variance unbiased estimator (MVUE).
Statistic
Suppose a random vector has distribution function in a parametric family and realized value .
A statistic is r.v. which is a function of independent of . Its realized value is .
A statistic is said to be sufficient for if the distribution of given does not depend on , i.e.
Further, we say is a minimal sufficient statistic if it's the smallest / least complex "proxy" for .
Observe that,
- if is sufficient for , so is any one-to-one function of
- is trivially sufficient
We say a statistic is pivotal if it does not depend on any unknown parameters.
E.g. if we are considering a normal distribution and is known, then the mean could be a pivotal statistic, since we know all information about the distribution except the statistic itself. But if we didn't know , the mean would not be pivotal.
Let .
Then statistic is sufficient for if and only if there exist functions of and of such that
where denotes the likelihood.
U-statistic
Let be independent observations on a distribution .
Consider a "parametric function" for which there is an unbiased estimator. That is, may be represented as
for some function , called a kernel.
Without loss of generality, we may assume is symmetric; otherwise we could simply replace it by its symmetrized version, obtained by averaging over all permutations of its arguments.
For any kernel , the corresponding U-statistic for estimation of on the basis of a sample of size is obtained by averaging the kernel symmetrically over the observations:
where denotes the summation over combinations of distinct elements from .
Clearly, is then an unbiased estimate of .
To conclude, a U-statistic is then a statistic which has the property of being an unbiased estimator for the corresponding , where is such that it can be written as stated above.
V-statistic
Statistics that can be represented as functionals of the empirical distribution function, , are called statistical functionals.
A V-statistic is a statistical function (of a sample) defined by a particular statistical functional of a probability distribution.
Suppose is a sample. In typical applications the statistical function has a representation as the V-statistic
where is a symmetric kernel function.
is called a V-statistic of degree .
Seems very much like a form of bootstrap estimate, does it not?
Informally, the type of asymptotic distribution of a statistical function depends on the order of "degeneracy," which is determined by which term is the first non-vanishing term in the Taylor expansion of the functional .
Quantiles
Let be a random variable with cumulative distribution function (CDF) , i.e.
Let be such that
for some .
Then we say that is called the quantile of , i.e. the value such that
We then say:
- is the median; half probability mass on each side
- is the lower quantile
- is the upper quantile
is strictly monotonically increasing so exists, hence we can compute the quantiles!
Convergence in law
A sequence of cdfs is said to converge to iff on all continuity points of .
We say that if a random variable has cdf and the rv has cdf , then converges in law to , and we write
This does not mean that and are arbitrarily close as random variables. Consider the random variables and .
Let and be random variables. Then
where means converges in probability.
Or, equivalently
This is the notion used by WLLN.
If where is the limit distribution and then
Let be a sequence of random variables, and be a random variable.
Then
This is the notion of convergence used by SLLN.
Kurtosis
The kurtosis of a random variable is the 4th moment, i.e.
where denotes the 4th central moment and is the std. dev.
Words
A collection of random variables is homoskedastic if all of these random variables have the same finite variance.
This is also known as homogeneity of variance.
A collection of random variables is heteroskedastic if there are sub-populations that have different variability from others.
Thus heteroskedasticity is the absence of homoskedasticity.
Consistency
Loosely speaking, an estimator of parameter is said to be consistent, if it converges in probability to the true value of the parameter:
Or more rigorously, suppose is a family of distributions (the parametric model) and is an infinite sample from the distribution .
Let be a sequence of estimators for some parameter . Usually will be based on the first observations of a sample. Then this sequence is said to be (weakly) consistent if
Jeffreys Prior
The Jeffreys prior is a non-informative prior distribution for a parameter space, defined as:
It has the key feature that it is invariant under reparametrization.
Moment generating function
For r.v. the moment generating function is defined as
Taking the derivative wrt. we observe that
Letting , we get
i.e. the mean. We can take this derivative times to obtain the expectation of , which is why is useful.
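A small sympy sketch (my addition) illustrating this for the Poisson distribution, whose MGF has the known closed form exp(λ(e^t − 1)):

```python
from sympy import symbols, exp, diff, simplify

t, lam = symbols("t lambda", positive=True)

# MGF of Poisson(lambda): M(t) = E[e^{tX}] = exp(lambda * (e^t - 1))
M = exp(lam * (exp(t) - 1))

# M^{(k)}(0) = E[X^k]: the first two moments of the Poisson distribution
print(diff(M, t).subs(t, 0))                # lambda             = E[X]
print(simplify(diff(M, t, 2).subs(t, 0)))   # lambda**2 + lambda = E[X^2]
```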
Distributions
Binomial
or equivalently,
Negative binomial
A negative binomial random variable represents the number of successes before obtaining failures, where is the probability of a failure for each of these binomial rvs.
Derivation
Let denote the rv. for # of successes and the number of failures.
Suppose we have a run of successes and failure, then
where the failure is of course the last thing that happens before we terminate the run.
Now, suppose that the run above is followed up by failures, i.e. but we have a specific ordering. Then, letting denote the result of the i-th Binomial trial,
But for of the failures, we don't actually care when they happen, so we don't want any ordering. That is, we have sequences of the form described above which are acceptable.
Hence we get the pmf
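A simulation check of this pmf (my own sketch); p and r are arbitrary example values, with p the per-trial failure probability as in the derivation above:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
p, r = 0.4, 3          # p = probability of a failure, stop after r failures

# Simulate: count successes before the r-th failure
def draw():
    successes = failures = 0
    while failures < r:
        if rng.random() < p:
            failures += 1
        else:
            successes += 1
    return successes

sims = np.array([draw() for _ in range(100_000)])

# pmf: P(K = k) = C(k + r - 1, k) (1 - p)^k p^r
for k in range(4):
    pmf = comb(k + r - 1, k) * (1 - p) ** k * p**r
    print(k, (sims == k).mean(), pmf)
```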
Gamma distribution
Chi-squared distribution
The chi-squared distribution with degrees of freedom is the distribution of a sum of the squares of independent standard normal random variables.
It's a special case of the gamma distribution.
T-distribution
The T-distribution arises as a result of the Bessel-corrected sample variance.
Why?
Our population-model is as follows:
Let
then let
be the Bessel-corrected variance. Then the random variable
and the random variable
(where has been substituted for ) has a Student's t-distribution with degrees of freedom.
Note that the numerator and the denominator in the preceding expression are independent random variables, which can be proved by a simple induction.
F-distribution
A random variate of the F-distribution with parameters and arises as the ratio of two appropriately scaled chi-squared variates:
where
- and have a chi-squared distribution with and degrees of freedom respectively
- and are independent
Power laws
Notation
- such that we have a power law , in such cases we say that the tail of the distribution follows a power law
Definition
A continuous power-law is one described by probability density such that
where is the observed value and is the normalization constant. Clearly this diverges as , hence cannot hold for all ; must be a lower bound to power-law behavior.
Provided , we have
A discrete power-law is defined by
Again, diverges at zero, hence must be some lower bound on power-law behaviour:
where
is the generalized or Hurwitz zeta function.
Important
Sources: clauset07_power_law_distr_empir_data and so_you_think_you_have_a_power_law_shalizi
- Log-log plot being roughly linear is necessary but not sufficient
- Errors are hard to estimate because they are not well-defined by the usual regression formulas, which are based on assumptions that do not apply in this case: noise of logarithm is not Gaussian as assumed by linear regression. In continuous case, this can be made worse by choice of binning scheme used to construct histogram, hence we have an additional free parameter.
- A fit to a power-law distribution can account for a large fraction of the variance even when fitted data do not follow a power law, hence high values of ("explained variance by regression fit") cannot be taken as evidence in favor of the power-law form.
- Fits extracted by regression methods usually do not satisfy basic requirements on probability distributions, such as normalization and hence cannot be correct.
- Abusing linear regression. Standard methods, e.g. least-squares fitting are known to produce systematically biased estimates of parameters for power-law distributions and should not be used in most circumstances
- Use MLE to estimate scaling exponent
- Use goodness of fit to estimate scaling regions.
- In some cases there are only parts of the data which actually follows a power law.
- Method based on Kolmogorov-Smirnov goodness-of-fit statistic
- Use goodness-of-fit test to check goodness of fit
- E.g. Kolmogorov-Smirnov test to check data drawn from estimated power law vs. real data
- Use Vuong's test to check for alternative non-power laws (see likelihood_ratio_test_for_model_selection_and_non_nested_hypothesis_vyong89)
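As a sketch of the recommended MLE approach (my addition): in the continuous case with known lower bound x_min, the estimator from clauset07_power_law_distr_empir_data is α̂ = 1 + n / Σ log(x_i / x_min):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, x_min, n = 2.5, 1.0, 100_000

# Inverse-cdf sampling from the continuous power law p(x) ∝ x^{-alpha}, x >= x_min
x = x_min * (1 - rng.uniform(size=n)) ** (-1 / (alpha - 1))

# Continuous-case MLE of the scaling exponent
alpha_hat = 1 + n / np.log(x / x_min).sum()
print(alpha_hat)  # approximately 2.5
```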
Maximum Likelihood Estimation
Notation
- denotes the log-likelihood , i.e.
- is the Fisher's information (expected information)
- is the observed information , i.e. information without taking the expectation
- is called the score function
- denotes the true (and thus unobserved) parameter
- means the probability density evaluated at
Approximate (asymptotic) variance of MLE
For large samples (and under certain conditions ) the (asymptotic) variance of the MLE is given by
where
where is called the Fisher information .
Estimated standard error
The estimated standard error of the MLE of is given by
Regularity conditions
The lower bound of the variance of the MLE is true under the following conditions on the probability density function, , and the estimator :
- The Fisher information is always defined; equivalently for all such that :
exists and is finite.
- The operations of integration wrt. to and differentiation wrt. can be interchanged in the expectation of ; that is,
whenever the RHS is finite. This condition can often be confirmed by using the fact that integration and differentiation can be swapped when either of the following is true:
- The function has bounded support in , and the bounds do not depend on
- The function has infinite support, is continuously differentiable, and the integral converges uniformly for all
Proof (single variable)
"Proof" (alternative, also single variable)
- Notation
- denotes the expectation over the data , assuming the data arises from the model specified by the parameter
- is the same as above, but assuming
- denotes the score, where we've made the dependence on the data explicit by including it as an argument
- Stuff
- Consistency of as an estimator
Suppose the true parameter is , that is:
Then, for any (not necessarily ), the Law of Large Numbers implies the convergence in probability
Under suitable regularity conditions, this implies that the value of maximizing the LHS, which is , converges in probability to the value of maximizing RHS, which we claim is .
Indeed, for any
Noting that is concave, Jensen's Inequality implies
for any positive random variable , so
This establishes "consistency" of since is maximized at .
- Expectation and variance of score
For ,
First, for the expectation we have
Assuming we can interchange the order of the derivative and the integral (which we can for analytic functions), we have
For the variance, we can differentiate the above identity:
where we've used the fact that and , which implies that .
This is equivalent to
as wanted.
- Asymptotic behavior
Now, since maximizes , we must have .
Consistency of ensures that is close to (for large n, with high probability). Thus, we can apply a first-order Taylor expansion to the equation about the point :
Thus,
For the denominator, we rewrite as
and then, by the Law of Large Numbers again, we have
in probability.
For the numerator, due to what we showed earlier, we know that
We then have,
and by the Central Limit Theorem, we have
Finally, by Continuous Mapping Theorem and Slutsky's Lemma, we have
- Consistency of as an estimator
Score test
For a random sample the total score
is the sum of i.i.d. random variables. Thus, by the central limit theorem, it follows that as
and
This can then be used as an asymptotic test of the null-hypothesis .
Reject the null-hypothesis if
for a suitable critical value , or equivalently,
Likelihood Ratio Tests
We expand the log-likelihood using Taylor expansion about the true parameter
Subtracting from both sides and rearranging,
And since , we get
which means that the difference between the log-likelihoods can be considered to be a random variable drawn from a distribution
and we define the term on the left side as the likelihood-ratio .
The test which rejects the if
for a suitable significance / critical value .
The above is just saying that we reject the null hypothesis that if the left term is not drawn from a chi-squared distribution .
Wald test
We test whether or not
That is, we reject the null-hypothesis if
for some suitable significance / critical value .
Generalization to multi-parameter case
Hypothesis tests
Confidence regions
Likelihood test for composite hypothesis
Let denote the whole parameter space (e.g. ). In general terms we have
where .
The general likelihood ratio test compares the maximum likelihood attainable if is restricted to lie in the restricted subspace (i.e. under 'reduced' model) with maximum likelihood attainable if is unrestricted (i.e. under 'full' model):
where is the unrestricted MLE of and is the restricted MLE of .
Some authors define the general likelihood ratio test with the numerator and denominator swapped; but it doesn't matter.
Iterative methods
Simple Linear Regression (Statistical Methodology stuff)
Correlation coefficient and coefficient of determination
The (Pearson) correlation coefficient is given by
and coefficient of determination
Least squares estimates
Residual sum of squares
with the estimated standard deviation being
Laplace Approximation
Overview
Under certain regularity conditions, we can approximate a posterior pdf / pmf as a Normal distribution, that is
where
- i.e. the log-pdf, with being the second-order derivative wrt.
- is the posterior we want to approximate
"Derivation"
For any pdf that is smooth and well peaked around its point of maxima, we can use the Laplace approximation to approximate a posterior distribution by a Gaussian pdf.
It follows from taking the second-order Taylor expansion on the logarithm of the pdf.
Let denote the point of maxima of a pdf ; then it is also the point of maxima of the log-pdf, and we can write
where in the second step we've used the fact that since is a maximum, and finally let
(notice that since is a maximum )
But the above is simply the log-pdf of a , hence the pdf is approx. normal.
Hence, if we can find the and compute the second-order derivative of the log-pdf, we can use Laplace approximation.
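A minimal sketch of this recipe (my addition), applied to a Beta(a, b) posterior, where the mode and the second derivative of the log-pdf are available in closed form; a and b are arbitrary example values:

```python
import numpy as np
from scipy import stats

# Laplace approximation to a Beta(a, b) posterior (e.g. Bernoulli likelihood + Beta prior)
a, b = 15, 8
mode = (a - 1) / (a + b - 2)                       # maximizer of the log-pdf
# second derivative of the log-pdf at the mode
d2 = -(a - 1) / mode**2 - (b - 1) / (1 - mode) ** 2
approx = stats.norm(loc=mode, scale=np.sqrt(-1 / d2))

# Compare the exact posterior density with its Gaussian approximation
for theta in (0.5, 0.65, 0.8):
    print(theta, stats.beta(a, b).pdf(theta), approx.pdf(theta))
```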
Guarantees
Consider the model , .
Under some regularity conditions on the pdfs/pmfs , including that all of the have the same "support", and that for each , is twice continuously differentiable, we have that for any prior which is positive, bounded and twice differentiable over (the parameter space),
for large .
Under the same regularity conditions it turns out that and that .
Point estimators
Notation
- denotes a sufficient statistic
Cramér-Rao lower bound
Notation
Theorem
To state the Cramér-Rao inequality, we need to establish some regularity conditions:
- such that we have , called identifiability
- we have common support
- is an open set
- exists
Where we remember that is the Fisher's information.
Let denote a random sample from and suppose that is an estimator of . Then, subject to the above regularity conditions,
where
- For unbiased , the lower bound simplifies to
- Regularity conditions are needed to change the order of differentiation and integration in the proof given below.
- Proof is for continuous r.v.s, but there is a similar proof for discrete r.v.s.
From the definition of bias we have
Differentiating both sides wrt. gives (using the regularity conditions)
since does not depend on . Since we have
Thus,
Now use the result that for any two r.v.s and ,
and let
Then
Hence
Similarly
and since
we obtain the Cramér-Rao lower bound as
The Cramér-Rao lower bound will only be useful if it is attainable or at least nearly attainable.
Rao-Blackwell Theorem
Theorem
Let be a random sample of observations from a distribution with p.d.f. . Suppose that is a sufficient statistic for and is any unbiased estimator for .
Define . Then,
- is a function of only
Non-parametric statistics
Maximum Mean Discrepancy (MMD)
Let be a class of functions and let:
- be Borel probability measures
- and are i.i.d. observations from and , respectively
We define the maximum mean discrepancy (MMD) as
In the statistics literature, this is an instance of an integral probability metric.
A biased empirical estimate of the is obtained by replacing the population expectations with empirical expectations computed on the samples and :
We must therefore identify a function class that is rich enough to uniquely identify whether , yet restrictive enough to provide useful finite sample estimates.
A RKHS is characteristic if
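Following the biased empirical estimate above, here is a minimal sketch (my addition) using an RBF kernel, which is characteristic and hence a valid choice of function class; the bandwidth σ and sample sizes are arbitrary:

```python
import numpy as np

def mmd_biased(x, y, sigma=1.0):
    """Biased empirical MMD^2 with an RBF (Gaussian) kernel."""
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)
y_same = rng.normal(0.0, 1.0, size=2000)
y_diff = rng.normal(1.0, 1.0, size=2000)

print(mmd_biased(x, y_same))  # close to 0
print(mmd_biased(x, y_diff))  # clearly positive
```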
Wavelets
The problem with the Fourier transform is that, even though we can extract the frequencies, we do not know at which time these frequencies occur. This can be incredibly important in cases where the signal has "switches".
Idea is to
- Start with low frequency / high wavelength
- Slide over signal to extract low frequency
- Subtract high frequency away from signal
- Take higher frequency / smaller wavelength
- Slide over signal to extract higher frequency
By subtracting the low frequency we're removing the overlap of the low frequency and high frequency without interfering with the high frequency signal! Thus it's possible to extract both low and high frequencies.
A mother wavelet is some function such that we can scale and translate to create new wavelets :
where
- is scaling
- is translation
And the Fourier transform is then
The condition required by a wavelet on is that:
Continuous Wavelet Transform
Wavelet basis
A mother wavelet is such that form an orthogonal basis for some subspace of , hence
converges to in the norm!
Multi-Resolution Analysis (MRA)
- Overview
- Algorithm for constructing the different resolutions
We consider wavelets constructed from some mother wavelet :
and we want to expand our signal in such a wavelet basis.
Consider sequence of subspaces
in , with the following properties:
Nested
Union
is dense in
Orthogonality
Increased "resolution"
- such that is an orthogonal basis in
- is called the scaling function or father wavelet
Let be the "standard" scale where our mother wavelet and father wavelet live, i.e. is for .
We can map to by
which is called the dilation or two-scale or refinement equation.
We can repeat this argument for arbitrary , so with , we write
Then,
which means
Finally, this tells us that (the mother wavelet) such that
constitutes an orthogonal basis for .
If is a multi-resolution analysis (MRA) with scaling function , then there exists a mother wavelet defined
where
which allows us to obtain an orthonormal basis for which is dense in :
Application
For more information about how and when to apply wavelet transforms, and which mother wavelet to use, e.g., check out MATLAB's documentation on this.
Testing
Notation
- and denotes the acceptance and critical region, respectively
- denotes the critical value of some statistic
Critical region with a given significance level is denoted
Type I error: Reject when is true with
Type II error : Fail to reject when is false (equiv., we accept when we shouldn't):
denotes the test function:
- refers to the expectation over whatever distribution you get when you parametrize the distribution with .
Definitions
A hypothesis is simple if it defines completely:
otherwise, it's composite.
If is parametric with more than one parameter, a composite hypothesis might specify values of some or all of them (e.g. on regression coefficient).
A U-statistic is the class of unbiased statistics; this class arises naturally in producing minimum-variance unbiased estimators (MVUEs)
Important: in statistics there are TWO notions for a U-statistic
Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions.
Thus, nonparametric tests or distribution-free tests are procedures for hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed.
Let the data and we wish to test two simply hypotheses:
Suppose that we choose a test statistic and reject if for some critical value .
This induces a partition of the sample space into two disjoint regions:
the rejection region :
the acceptance region :
We will also sometimes use the notation to denote the critical region which has the property
Consider and with corresponding p.d.f.'s , for .
For these hypotheses, comparison between the critical regions of different tests is in terms of
the power of a size critical region for alternative .
A best critical region is then the critical region with maximum power.
There are two possible types of error:
- Type I error: Reject when is true
- Type II error: Fail to reject when is false
- denotes the probability of Type I error and is called the significance level (or size ) of the test
denotes the probability of Type II error which is uniquely defined only if is simple, in which case
denotes the power of the test. For composite , is the power function.
We can define the test function such that
which has the property that
For discrete distributions, the probability that the test statistic lies on the boundary of the critical region, , may be non-zero.
In that case, it is sometimes necessary to use a randomized test, for which the test function is
for some function and we reject based on observed data with probability .
Suppose now there is a parametric family of alternative p.d.f.'s for .
The power of a size critical region generalizes to the size power function
A size critical region , is then uniformly most powerful size (UMP size ) if it has maximum power uniformly over ( is NOT power, is).
A test is UMP if all its critical regions are UMP, or more formally:
A uniformly most powerful or UMP test, , of size is a test for which
Given any other test for which for all , we have
i.e. expectation given is at least as large as for the less powerful statistic (whose test function is ).
A test of against is called unbiased of size if
and
Informally, unbiased test is one which has higher probability of rejecting when it is false, than when it is true.
A test which is uniformly most powerful among the set of all unbiased tests is called the uniformly most powerful unbiased (UMPU).
Two-sample test
The problem of comparing samples from two probability distributions, by a statistical test of the null hypothesis that these distributions are equal against the alternative hypothesis that these distributions are different (this is called the two-sample problem), i.e. in short:
- corresponds to distributions being equal
- corresponds to distributions being different
Exact test
A test is exact if and only if
which is in contrast to non-exact tests which only have the property
for some "critical constant" .
Hypothesis testing
For any size , the LR critical region is the best critical region for testing simple hypothesis vs. .
That is, suppose one is performing a hypothesis test between two simple hypotheses and , using the likelihood-ratio test with threshold which rejects in favour of at a significance level of
where
Then, is the most powerful test at significance level .
Bahadur efficiency
Most of this stuff is from https://www.encyclopediaofmath.org/index.php/Bahadur_efficiency.
Notation
- and denotes the null hypothesis and the alternative hypothesis
- and are the parametric set corresponding to the and , respectively
- A test statistic based on a random sample
- -value refers to the minimum value we require from the test for us to accept
- is the information number corresponding to and
Stuff
- Assume large values of give evidence against
- For fixed and a real number let
Random quantity
is the value corresponding to the statistic when is the true parametric value.
For example, if
and the null hypothesis is rejected at the significance level .
If for with probability one we will have
then is called the Bahadur exact slope of .
The larger the Bahadur exact slope , the faster the rate of decay of the value under the alternative. It is known that for any , where is the information number corresponding to and .
A test statistic is called Bahadur efficient at if
Bahadur efficiency allows one to compare two (sequences of) test statistics and from the following perspective:
Let be the smallest sample size required to reject at the significance level on the basis of a random sample when is the true parametric value.
The ratio gives a measure of relative efficiency of to .
To reduce the number of arguments, , and , one usually considers the rv. which is the limit of this ratio, as . In many situations this limit does not depend on , so it represents the efficiency of against at with the convenient formula
where and are the corresponding Bahadur slopes.
I.e. can use the Bahadur efficiency to compare the number of samples needed to reject .
Kolmogorov-Smirnov
The empirical distribution function is given by
Consistency and unbiasedness at a point
Fix , then
Consequently,
That is, is an unbiased estimator of for each fixed . Also
Consequently, by the Chebyshev inequality,
Therefore, is consistent
Kolmogorov-Smirnov test
The Kolmogorov distribution is the distribution of the rv.
where is the Brownian bridge. The cumulative distribution function of is given by
The empirical distribution function for i.i.d. observations is defined as
where
is the indicator function defined
The Kolmogorov-Smirnov statistic for a given cumulative distribution function is
Under the null hypothesis that the sample comes from the hypothesized distribution ,
in distribution, where is the Brownian bridge.
If is continuous, then under converges to the Kolmogorov distribution, which does not depend on .
Kolmogorov-Smirnov test (K-S test) is a non-parametric test of the equality of continuous , one-dimensional probability distributions that can be used to compare one sample with a reference probability distribution (one-sample K-S test ), or to compare two samples (two-sample K-S test).
The one-sample KS test is:
Using critical values of the Kolmogorov distribution, where we reject at level if
where
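In practice the one-sample test is available in scipy; a minimal example (my addition):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)

# One-sample K-S test against the hypothesized N(0, 1) cdf
print(stats.kstest(x, "norm"))            # large p-value: fail to reject H0
print(stats.kstest(x, "norm", args=(1,))) # wrong mean: tiny p-value, reject
```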
F-test
An F-test is any statistical test in which the test statistic, called an F-statistic, has an F-distribution under the null hypothesis.
Residual-sum of squares between two models as a F-statistic
Notation
- denotes the number of samples
- denotes the number of regressors (including the constant term)
- denotes the number of linear "restrictions" / "constraints", and therefore, also the degrees of freedom
- is a matrix with full column rank , i.e. , which translates into the constraints being "independent". This matrix is used to enforce the constraints on ; e.g. if we don't want a constant term / intercept, is such that , if the constant term is the first entry.
- represents the hypothesized residual sum
Stuff
Consider two linear regression models with coefficients and , with . Then let and denote the respective number of coefficients, with . Then letting
we have
i.e. the quantity above is F-distributed.
Note: observe that so the above is a positive quantity.
Consider the null-hypothesis . Recall the OLS solution is given by
and we therefore have
Let
and let, using Cholesky decomposition, be such that
we have a "matrix square root" of in . The purpose of this matrix is so we can define the following random variable:
where denotes the identity matrix. To see that this is indeed the case, observe that
Using properties seen in the proof of , we know that this is independent of the rv.
where
where
known as the residual maker matrix. Then we have
Recall that an F-statistic is a ratio of two independent chi-squared variables, each divided by its corresponding degrees of freedom; hence the above. In particular, under , this reduces to the statistic
Letting (which is the number of components of we set to zero to get ) and (since this is how many parameters we now have), we conclude our proof.
Proof that
Consider the regression model
with
Then the residual vector is estimated by
And is distributed according to
and thus
Consider the following linear model
where , and . The vector of residuals is estimated by
where
Since trace is invariant under cyclic permutations, we have
since matrices, and so . Therefore
Furthermore,
which is seen by the fact that
We then have the following sequence of implications:
- only has eigenvalues and
is normal and so there exists a unitary matrix such that
Recall that
which is seen by computing and using . From this it follows that
where is the corresponding vector (since the rest of the components have zero variance and thus add nothing to the norm).
Furthermore, since is unitary, as mentioned earlier, we have
Recall that the residual sum-of-squares is given by , and so we arrive at our result
Finally, observe that this implies
Similarity testing
Notation
- with can take on values
- denotes a test
Stuff
Test vs. .
is a test of size , i.e.
Then is a similar test of size .
with on the boundary of .
Confidence regions
Notation
is drawn from some distribution
is a size critical region for
with
- critical region refers to the sampling distribution which will lead to the rejection of the hypothesis tested when the hypothesis is true.
Stuff
- Generalization of confidence intervals to multiple dimensions
is a confidence region for if
If then
Pivotal quantities
which has a distribution independent of , i.e. is a pivotal statistic, that is
Then we can construct a value such that we obtain a confidence region as follows:
Decision Theory
Notation
- denotes the sample space i.e.
- denotes the parameter space
- denotes a family of probability distributions
- is the action space, i.e. the set of actions an experimenter can take after observing data, e.g. reject or accept , estimate the value of , etc.
- denotes the loss function
- denotes decision rules , where is a function that associates a particular decision with each possible observed data set.
Stuff
For a parameter , the risk of a decision rule, , is defined as
In other words, the risk is the expected loss of a particular decision rule when the true value of the unknown parameter is .
Note: this is fundamentally a frequentist concept, since the definition implicitly invokes the idea of repeated samples from the parameter space and computes the average loss over these hypothetical repetitions.
Selection of decision rule
Given two possible decision rules and , the rule strictly dominates the rule if
and there exists at least one value of , e.g. θ', for which
It is clear that we would always prefer to .
If, for a given decision rule , there exists some decision rule that strictly dominates , then is said to be inadmissible.
Conversely, if there is no decision rule that dominates , then is said to be admissible.
That is, weak inequality everywhere AND strict inequality for at least ONE value of .
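To see why dominance is a strong requirement, here is a hedged sketch (numpy assumed; the estimators and constants are illustrative) comparing the risk curves of the sample mean and a shrinkage rule for a normal mean under squared-error loss; neither dominates the other:

```python
import numpy as np

# Risk (expected squared-error loss) of two estimators of a normal mean theta,
# based on X1..Xn ~ N(theta, 1): the sample mean, and a shrinkage rule c * xbar.
# R(theta, xbar)    = 1/n
# R(theta, c*xbar)  = c^2/n + (1 - c)^2 * theta^2
n, c = 10, 0.8
thetas = np.linspace(-2, 2, 101)

risk_mean   = np.full_like(thetas, 1.0 / n)
risk_shrink = c**2 / n + (1 - c)**2 * thetas**2

# Neither rule dominates: shrinkage wins near theta = 0, loses for large |theta|.
print(thetas[risk_shrink < risk_mean])
```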
Unbiasedness
We say a decision rule is unbiased if
i.e. the loss of the decision rule should be minimized at the true value of the parameter .
Minimax decision rules
A decision rule is minimax if it minimizes the maximum risk
which can also be written
Bayes decision rules
The Bayes risk of a decision rule, , is defined by
or by a sum in case of discrete-valued probability distribution.
A decision rule is a Bayes rule with respect to the prior if it minimizes the Bayes risk, i.e.,
The definition of a Bayes rule assumes that the infimum is achieved by some rule, which might not always be true.
In those cases, for any , we can find a decision rule such that
Such a rule is said to be ε-Bayes wrt. the prior .
A useful choice of prior is the one that is most conservative in its estimate of risk. This gives rise to the concept of a least favourable prior.
We say is a least favourable prior if, for any other prior , we have
where are the Bayes (decision) rules corresponding to and .
Randomized decision rules
- decision rules which we choose among with probabilities , with and
Define to be the rule "select decision rule with probability ", called a randomized decision rule, which has risk
- Sort of a linear combination of decision rules
- Minimax rules are often this way
- Supremum over of the risk for is smaller than the supremum of the risk for any of the decision rules individually
Finding minimax rules
Suppose that is a Bayes (decision) rule corresponding to prior and , then
- is a minimax decision rule
- is the least favourable prior
Admissibility of Bayes (decision) rules
We have the following three theorems:
Assume that the parameter space, , is finite and a given prior gives positive weight to each .
Then a Bayes (decision) rule wrt. is admissible.
If a Bayes (decision) rule is unique, it is admissible.
Let be a subset of the real line. Assume that the risk functions are continuous in for all decision rules .
If the prior has the property that for any and any , the interval has non-zero probability under , then the Bayes (decision) rule wrt. is admissible.
James-Stein Estimators
Notation
- is the square loss function
The risk of a decision rule is then given by
Stein's Paradox
The class of James-Stein estimators of the form
has smaller risk, independent of , and hence strictly dominates the natural estimator
These are called shrinkage estimators as they shrink towards when , and this has the following consequence: folding in information from variables that are independent of the variable of interest can, on average, improve estimation of the variable of interest!
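A quick Monte Carlo illustration of the paradox (numpy assumed; the dimension and the truth vector are arbitrary choices), comparing the total squared-error risk of the James-Stein rule shrinking towards the origin with that of the natural estimator:

```python
import numpy as np

# Monte Carlo check of Stein's paradox for X ~ N(theta, I_p), p >= 3:
# the James-Stein rule (1 - (p - 2) / ||x||^2) x has smaller total
# squared-error risk than the natural estimator x itself.
rng = np.random.default_rng(2)
p, reps = 10, 20_000
theta = rng.normal(size=p)                     # arbitrary fixed truth

X = theta + rng.normal(size=(reps, p))
norms2 = np.sum(X**2, axis=1)
js = (1 - (p - 2) / norms2)[:, None] * X

risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))   # ~ p
risk_js  = np.mean(np.sum((js - theta) ** 2, axis=1))  # strictly smaller
print(risk_mle, risk_js)
```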
Suppose
and is an almost differentiable function then
First we do the univariate case.
Suppose
which is absolutely continuous, and
Using change of variables to set and , then is std. normal. Then
Then one simply substitutes into Stein's lemma and verifies that it is indeed satisfied.
Bayesian Statistics
Model comparison
Bayesian Information Criterion (BIC)
Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred.
The BIC is defined as:
where
- is the MLE of the model
- the observed data
- is the number of data points in
- is the number of parameters in the model
BIC is an asymptotic result derived under the assumptions that the data distribution follows an exponential family.
Limitations:
- Approximation only valid for sample sizes much larger than the number of parameters
- The BIC cannot handle complex collections of models, as in the variable-selection problem in high dimensions
Akaike Information Criterion (AIC)
- Given collection of models for some data, estimates quality of each model, relative to each of the other models
- NOT a test of model in a sense of testing a null hypothesis, i.e. says nothing about the absolute quality, only relative quality
Suppose we have a statistical model of some data.
Let:
- be the number of estimated parameters in the model
- be the maximum value of the likelihood function for the model
Then the Akaike Information Criterion (AIC) value of the model is
Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value.
Thus, AIC rewards goodness-of-fit (as assessed by the likelihood function), but it also penalizes a large number of parameters (complexity).
AIC is based on information theory. Suppose the data was generated by some unknown process , and we're considering two candidate models and . We could then compute the KL-divergence of and , , and of and , , i.e. the "loss of information" from representing using or , respectively. One could then compare these values and choose the candidate model with the smallest KL-divergence from .
Asymptotically, making this choice is equivalent to choosing the model with the smallest AIC! Note, however, that AIC can be a bad comparison if the number of samples is small.
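As a toy illustration (assuming numpy/scipy; the candidate models are arbitrary), here is a sketch fitting two simple parametric models by maximum likelihood and comparing their AIC/BIC values:

```python
import numpy as np
from scipy import stats

# Toy model comparison via AIC and BIC: fit a normal and a Laplace model
# to the same data by maximum likelihood and compare criteria.
rng = np.random.default_rng(3)
x = rng.normal(size=200)
n = len(x)

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k):
    return k * np.log(n) - 2 * loglik

# Normal: MLEs are the sample mean and (biased) standard deviation.
ll_norm = stats.norm.logpdf(x, x.mean(), x.std()).sum()
# Laplace: MLEs are the sample median and mean absolute deviation from it.
med = np.median(x)
ll_lap = stats.laplace.logpdf(x, med, np.abs(x - med).mean()).sum()

for name, ll in [("normal", ll_norm), ("laplace", ll_lap)]:
    print(name, aic(ll, 2), bic(ll, 2))   # lower is preferred
```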
TODO ANOVA
One-way
Two-way
Bootstrapping
- Bootstrap methods get their name because "using data to generate more data" seems analogous to a trick used by the fictional Baron Munchausen, who, when he found himself at the bottom of a lake, got out by pulling himself up by his bootstraps :)
Notation
- is a single homogeneous sample of data with PDF and CDF
- Statistic whose value in the sample is , which estimates , an underlying characteristic of the population. Generally is a symmetric function of
- EDF stands for the empirical distribution function, denoted
- denotes the parameter of a parametric model with CDF and PDF and , respectively
- is the fitted model to data drawn from the PDF
- denotes the rv. distributed according to the fitted model, and we do the same for other moments (e.g. denotes the mean of the fitted model)
denotes the random variable of the statistic of interest, in comparison to which is the observed estimate, or rather
Stuff
- Interested in probability distribution of
- describes the fact that the population parameter is equal to the value the statistic takes on for the underlying CDF
- expresses the relationship between the estimate and
- To be properly rigorous, we would write and require that as , possibly even that
- We will mostly assume
Moment estimates
- Suppose we simulate a dataset , i.e. from fitted model
- Properties of are then estimated from , using repetitions of the data simulation
- Important to recognize that we are not estimating absolute properties of , but rather of relative to
Distribution and quantile estimates
- Approximating the distribution of by that of
Cumulative probabilities are estimated by the EDF of the simulated values , i.e. if
then the simulation estimate of is
And , the exact CDF of under sampling from the fitted model
- Approximation to contains two sources of error:
- to , due to data variability
- to , due to finite simulation
- Quantiles of distribution of
- Approximated using ordered values of
Suppose are independently distributed with CDF , and denotes the j-th ordered value; then
- This implies a sensible estimate of is the , assuming that is an integer
- Therefore we can estimate the quantile of by the ordered value of , i.e.
- We're assuming is chosen so that is an integer
Nonparametric simulation
- Same as in the previous section, but using the EDF to perform the simulation instead of an estimate of the parameter for a distribution (e.g. using the MLE estimate of ); we call this the nonparametric bootstrap (see the sketch below)
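A minimal sketch of the nonparametric bootstrap (numpy assumed; the data and the choice of the median as statistic are illustrative):

```python
import numpy as np

# Minimal nonparametric bootstrap: resample the data with replacement,
# recompute the statistic, and use the simulated quantiles.
rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=50)      # illustrative data
t_obs = np.median(x)                         # statistic of interest

R = 5_000
t_star = np.array([
    np.median(rng.choice(x, size=len(x), replace=True)) for _ in range(R)
])

# EDF-based estimates of the distribution of T*: e.g. bias and quantiles.
print("bias estimate:", t_star.mean() - t_obs)
print("2.5% / 97.5% quantiles:", np.quantile(t_star, [0.025, 0.975]))
```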
Simple confidence intervals
If we use bootstrap estimates of quantiles for , then an equitailed confidence interval will have limits
where we explicitly write the second term in the parenthesis in the limits to emphasize that we're looking at with an expectation .
This is based on the probability implication
Reducing error
- Problem: choose quantity such that is as nearly pivotal as possible
- That is, it has (at least approx.) the same distribution under sampling from both and
Let with increasing in and if is an approx. lower quantile of , then
where is the inverse transformation
- Thus, is an upper confidence limit for
So we're basically saying "let's bound the difference between of the true and of our estimator by , with a certainty / probability of ".
Or rather, "let's bound the probability of the difference between and being ".
Or, "we want some constant such that the probability that and differ by more than is bounded by ".
Theoretical basis for bootstrap
Suppose we have a random sample , or equiv., its EDF .
We want to estimate some quantity , e.g.
and want to estimate the distribution function
where the conditioning on indicates that is a random sample from .
The bootstrap estimate of is then
where in this case
In order for
we need the following conditions to hold (letting be a neighborhood of in a space of suitable distributions):
- For any , must converge weakly to a limit
- This convergence must be uniform on
- The function mapping to must be continuous
where converges weakly means
for all bounded, continuous functions .
Under these conditions, the bootstrap estimate is consistent.
Resampling for testing
Nonparametric Permutation tests
- Permutation test is a two-sample test
- Have samples and for two distributions.
- Want to check if is true
Consider some statistic which measure discrepancy between the two sample-sets, e.g. mean difference
- Under , for every sample , whether or not the sample came from distribution or should be equally likely, since under we have
- Therefore we consider permutations between the samples from the two distributions!
Consider tuple
Permute the tuple
where is a permutation on symbols.
- Let and
- Compute assuming to come from and to come from .
Gives us achieved significance level (ASL), also known as the p-value, by considering
i.e. the probability of getting a large value when is true
- Observe that is a "discrepancy" measurement between the samples, e.g. the difference, and so "large values = BAD"
Practically, there can be a lot of permutations to compute; possibilities, in fact. Therefore we estimate the by "sampling" permutations uniformly:
where corresponds to the estimate of using the m-th sampled permuted dataset, and is the number of permutations considered.
Note: permutation test is an exact test when all permutations are computed.
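A hedged sketch of the sampled (Monte Carlo) permutation test (numpy assumed; the difference in means is one possible choice of discrepancy statistic):

```python
import numpy as np

# Monte Carlo permutation test for H0: F = G, using the difference in
# means as the discrepancy statistic.
rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=30)   # sample from F
z = rng.normal(0.5, 1.0, size=25)   # sample from G

pooled = np.concatenate([y, z])
t_obs = y.mean() - z.mean()

M = 10_000
count = 0
for _ in range(M):
    perm = rng.permutation(pooled)
    t_perm = perm[:len(y)].mean() - perm[len(y):].mean()
    count += abs(t_perm) >= abs(t_obs)

asl = (count + 1) / (M + 1)          # standard +1 correction for sampled permutations
print("estimated ASL (p-value):", asl)
```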
Frequentist bootstrapping
- Data approximates unknown distribution
- Sampling distribution of a statistic can be approximated by repeatedly resampling the observations with replacement and computing the statistic for each sample
Let
- denote the original samples
- denote the bootstrap sample
- Likely have some observations repeated one or more times, while some are absent
Equivalently, we can instead make the following observation:
- Each original observation occurs anywhere from to times
Let denote # of times occurs in and
thus
Let such that
then
- Hence we can compute the bootstrap sample by instead drawing this way, and weighting the original samples by these drawn weights!
Recall that
has the property that
Hence,
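A small sketch of this counts-and-weights view (numpy assumed): drawing the vector of repetition counts from a multinomial with uniform cell probabilities reproduces the bootstrap sample in distribution:

```python
import numpy as np

# The repetition counts (N_1, ..., N_n) of the original observations in a
# bootstrap sample are Multinomial(n, (1/n, ..., 1/n)), so resampling can be
# replaced by drawing counts and weighting the original observations.
rng = np.random.default_rng(6)
x = rng.normal(size=40)
n = len(x)

counts = rng.multinomial(n, np.full(n, 1.0 / n))   # N_i: times x_i appears
w = counts / n                                      # weights summing to 1

# Weighted mean equals the mean of the implied bootstrap sample.
print(np.dot(w, x))
```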
TODO Bayesian bootstrapping
Problems / Examples
Multi-parameter MLE
Normal samples
Suppose we have the samples following the model
Then we want to test against .
The likelihood is given by
Under the restricted parameter space (), the MLE is
Under ( and can take on any value):
Thus, the likelihood ratio is
But
and thus
where
Rejecting for small values of LR is equivalent to rejecting for large values of , i.e. this test is equivalent to a two-tailed t-test.
Here we can determine an exact distribution for LR, or rather for the equivalent test statistic , i.e.
However, in general, the distribution of LR (under ) is difficult to determine and we need to use the approximate method.
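As a quick numerical check of the equivalence noted above (numpy/scipy assumed; the data are illustrative): profiling out the variance under each hypothesis, the likelihood ratio should equal the monotone function LR = (1 + t²/(n−1))^(−n/2) of the t-statistic:

```python
import numpy as np

# Numerical check that the likelihood ratio for H0: mu = mu0 (sigma unknown)
# is a monotone function of the two-sided t-statistic:
# LR = (1 + t^2 / (n - 1))^(-n/2).
rng = np.random.default_rng(7)
x, mu0 = rng.normal(0.3, 1.0, size=20), 0.0
n = len(x)

# Profile out sigma under each hypothesis (MLEs of the variance).
s2_hat  = np.mean((x - x.mean()) ** 2)          # unrestricted
s2_null = np.mean((x - mu0) ** 2)               # restricted to mu = mu0
LR = (s2_hat / s2_null) ** (n / 2)

t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
print(LR, (1 + t**2 / (n - 1)) ** (-n / 2))     # identical up to rounding
```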
Fisher's method of scoring
Multiparameter case
Let be i.i.d. with
with p.d.f.
that is .
Log-likelihood:
Score vector:
Hessian matrix:
Observed information matrix:
Expected information matrix:
Asymptotic variance-covariance matrix of is:
Estimated standard errors:
Covariance between our estimates:
Asymptotic distribution:
Which is the distribution we would base a Wald test of on, i.e.
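As an illustration of the scoring iteration and the resulting standard errors (numpy assumed; logistic regression is a convenient example since its expected and observed information coincide), here is a minimal sketch:

```python
import numpy as np

# Sketch of Fisher scoring for logistic regression: iterate
# beta <- beta + I(beta)^{-1} U(beta), where U is the score vector and
# I the expected information (equal to the observed information here).
rng = np.random.default_rng(8)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(p)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))    # fitted probabilities
    U = X.T @ (y - mu)                  # score vector
    W = mu * (1 - mu)
    I = X.T @ (X * W[:, None])          # expected information matrix
    step = np.linalg.solve(I, U)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(I)))  # asymptotic standard errors
print(beta, se)
```

The standard errors printed at the end are exactly the quantities a Wald test of the kind mentioned above would be based on.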
Asymptotics
Overview
- A lot of it based on van2000asymptotic
Notation
For a measure on a measurable space and a measurable function ,
denotes the empirical measure, such that
denotes the empirical process, i.e. the "centered & scaled" version of ,
- denotes a P-Brownian bridge
- denotes the upper α-quantile of a std. normal
- denotes the upper α-quantile of a χ²-distribution
- denotes the upper α-quantile of a t-distribution
- denotes absolutely continuous
- denotes contiguous
- denotes mutually contiguous
- denotes weak convergence, and if has distribution , we also write
For a given sequence of random variables
where denotes the fact that a sequence is bounded in probability.
Stochastic convergence
For any random vectors and the following statements are equivalent
- for all continuity points of
- for all bounded, continuous functions , i.e.
- for all bounded, Lipschitz functions
- for all nonnegative, continuous functions
- for every open set
- for every closed set
- for all Borel sets with , where is the boundary of
The portmanteau lemma (the equivalences above) is very useful for talking about convergence in distribution.
Any random vector is tight : for every
A set of random vectors is called uniformly tight if can be chosen the same for every : for every
Thus, there exists a compact set to which all give probability "almost" one.
Another name for uniformly tight is bounded in probability.
Let be random vectors in .
- If for some , then is uniformly tight.
- If is uniformly tight, then there exists a subsequence with as for some .
Prohorov's theorem states that every uniformly tight sequence contains a weakly converging subsequence. This basically generalizes the Heine-Borel theorem to random vectors.
Characteristic functions
Let and be random vectors in . Then
Moreover, if converges pointwise to a function that is continuous at zero, then is the characteristic function of a random vector and .
Statistical Learning Theory
Notation
Binary classification
Notation
- denotes the training data for the algorithm
- and for convenience
denotes the class of binary classifiers considered, and thus
Symmetrization
- Idea: replace the true risk by an estimate computed on an independent set of data, i.e. (called a virtual or ghost sample)
For any such that ,
where
- denotes the empirical measure of the "ghost" sample
Stuff
For a class of functions, the Rademacher average is defined as
where
is defined
- are Rademacher random variables, i.e. i.i.d. such that .
- denotes the expectation over all the rvs, i.e.
Similarly, the conditional Rademacher average is defined as
where
- denotes the expectation only over the rvs , conditioned on
For all , with probability at least
and also, with prob. at least ,
where
True risk
Empirical risk
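To make the (conditional) Rademacher average from above concrete, here is a Monte Carlo sketch (numpy assumed; the finite class of threshold classifiers is an illustrative choice):

```python
import numpy as np

# Monte Carlo estimate of the conditional Rademacher average
# R_n = E_sigma [ sup_{f in F} (1/n) sum_i sigma_i f(X_i) ]
# for the finite class of threshold classifiers f_t(x) = 1{x <= t}.
rng = np.random.default_rng(9)
x = np.sort(rng.normal(size=50))
n = len(x)

thresholds = np.concatenate([[-np.inf], x])             # distinct behaviors on the sample
F = (x[None, :] <= thresholds[:, None]).astype(float)   # |F| x n matrix of f(X_i)

reps, total = 2000, 0.0
for _ in range(reps):
    sigma = rng.choice([-1.0, 1.0], size=n)       # Rademacher variables
    total += np.max(F @ sigma) / n                # sup over the class
print("estimated conditional Rademacher average:", total / reps)
```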
Topics in Statistical Theory
1.2 Estimating an arbitrary distribution function
Let
- denote the class of all distribution functions on
Empirical distribution is given by
We have
as .
Given , let .
Set , and
and .
Writing , which we can do since is left-continuous, we note that
Let
where
(note that this uses strict inequality). Hence, by a union bound, we have
where, by the SLLN, the terms inside the absolute values go to zero a.s. as , therefore implying that
Now fix and find s.t. . Then for any and ,
where in the second inequality we use the fact that .
Similarly, for the other direction, we have
From the two equations above, it follows that
since either we have
: in which case the event on the LHS can occur if
: in which case the event on the LHS can occur if
Hence the probability of such an event is upper-bounded by the RHS .
We therefore conclude that
i.e. for all , such that , the difference is less than .
In particular, note that the probability on the RHS is:
- Decreasing in (since the event becomes smaller as increases)
- Increasing in (since the "number of" we're taking the intersection for decreases)
This means that we have
as we wanted.
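A quick simulation of this uniform convergence (numpy/scipy assumed; a standard normal is an illustrative choice for the underlying CDF):

```python
import numpy as np
from scipy import stats

# Simulation of the Glivenko-Cantelli phenomenon: the Kolmogorov-Smirnov
# distance sup_x |F_n(x) - F(x)| shrinks as n grows.
rng = np.random.default_rng(10)
for n in [50, 500, 5000, 50000]:
    x = np.sort(rng.normal(size=n))
    Fn = np.arange(1, n + 1) / n              # EDF evaluated at the order statistics
    F = stats.norm.cdf(x)
    # The sup is attained at a data point, just before or at the jump.
    ks = np.max(np.maximum(np.abs(Fn - F), np.abs(Fn - 1 / n - F)))
    print(n, ks)
```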
1.3
Let
- be a sequence of random vectors in
Suppose and a random vector such that
as .
Furthermore, suppose that is differentiable at .
Then
This can be seen by expanding using a Taylor expansion at .
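A simulation sketch of the (univariate) delta method (numpy assumed; g(x) = eˣ is an illustrative choice of transformation):

```python
import numpy as np

# Simulation check of the delta method for g(x) = exp(x) applied to the
# sample mean of N(mu, 1) data: sqrt(n) (g(xbar) - g(mu)) ~ N(0, g'(mu)^2).
rng = np.random.default_rng(11)
mu, n, reps = 0.5, 1000, 5000

xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
lhs = np.sqrt(n) * (np.exp(xbar) - np.exp(mu))

print("empirical sd:", lhs.std())      # should be close to ...
print("delta method sd:", np.exp(mu))  # |g'(mu)| * sigma = e^mu
```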
1.4. Concentration Inequalities
We say the random variable is sub-Gaussian with parameter if and it has MGF
For any , we have
By Markov's inequality we have
where in the second inequality we used the fact that is sub-Gaussian with parameter . Since this holds for any ,
The RHS attains its minimum at (this can be seen by expanding the square and solving the resulting quadratic), thus
as claimed!
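A numerical sanity check of this tail bound (numpy assumed; a centered normal is sub-Gaussian with its own variance as parameter):

```python
import numpy as np

# Sanity check of the sub-Gaussian tail bound
# P(X - E X >= t) <= exp(-t^2 / (2 sigma^2)) for X ~ N(0, sigma^2).
rng = np.random.default_rng(12)
sigma, reps = 1.5, 1_000_000
x = rng.normal(0.0, sigma, size=reps)

for t in [1.0, 2.0, 3.0]:
    emp = np.mean(x >= t)
    bound = np.exp(-t**2 / (2 * sigma**2))
    print(t, emp, bound, emp <= bound)
```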
Let be independent r.v.s. with
- finite variances
- for some
Moreover, let
defined by
Then, for every ,
Consequently, defining
we have that for ,
Define
where the definition at , is simply to ensure continuity for all .
Then is increasing, so for ,
Hence, for ,
where in the
- second to last inequality we have multiplied by and used Jensen's inequality,
- final inequality we used the fact that for .
Consequently, by a Chernoff bound,
since the infimum is attained at .
Bibliography
- [arjovsky17_wasser_gan] Arjovsky, Chintala & Bottou, Wasserstein GAN, CoRR, (2017). link.
- [roberts04_gener_state_space_markov_chain_mcmc_algor] Roberts & Rosenthal, General State Space Markov Chains and MCMC Algorithms, CoRR, (2004). link.
- [gelman2013bayesian] Gelman, Carlin, Stern, Dunson, Vehtari & Rubin, Bayesian data analysis, Chapman and Hall/CRC (2013).
- [adrian13_count_model_based_weibul_inter_times] Adrian, Bradlow, Fader & McShane, Count Models Based on Weibull Interarrival Times, CoRR, (2013). link.
- [yannaros1994weibull] Yannaros, Weibull renewal processes, Annals of the Institute of Statistical Mathematics, 46(4), 641-648 (1994).
- [clauset07_power_law_distr_empir_data] Clauset, Shalizi, & Newman, Power-Law Distributions in Empirical Data, CoRR, (2007). link.
- [so_you_think_you_have_a_power_law_shalizi] Cosma Shalizi, So You Think You Have a Power Law - Well Isn't That Special?, link. (Accessed on 05/22/2018)
- [likelihood_ratio_test_for_model_selection_and_non_nested_hypothesis_vyong89] Quang Vuong, Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses, Econometrica, 57(2), 307-333 (1989). link.
- [van2000asymptotic] Van der Vaart, Asymptotic statistics, Cambridge university press (2000).