Probability & Statistics

Notation

  • $F(x)$ denotes the distribution of some random variable $X$
  • $A \overset{p}{\to} B$ means convergence in probability for random elements $A, B$; i.e. $X_n \to X$ in probability if, for every $\varepsilon > 0$,

    \begin{equation*}
  P \big( |X_n - X| > \varepsilon \big) \to 0 \text{ as } n \to \infty
\end{equation*}
  • $A \overset{d}{\to} B$ means convergence in distribution for random elements $A, B$, i.e. $X_n \to X$ in distribution if $F_n(x) \to F(x)$ for all continuity points $x$ of $F$.
  • $A \overset{as}{\to} B$ means almost sure convergence for random elements $A, B$

Theorems

Chebyshev Inequality

Let $X$ (integrable) be a random variable with finite expected value $\mu$ and finite non-zero variance $\sigma^2$.

Then for any real number $k > 0$,

\begin{equation*}
\Pr( |X - \mu | \ge k \sigma) \le \frac{1}{k^2}
\end{equation*}

Let $(X, \Sigma, \mu)$ be a measure space, and let $f$ be an extended real-valued measurable function defined on $X$.

Then for any real number $t > 0$ and $0 < p < \infty$

\begin{equation*}
\mu ( \{ x \in X : |f(x)| \ge t \} ) \le \frac{1}{t^p} \int_{|f| \ge t} |f|^p \ d \mu
\end{equation*}

Or more generally, if $g$ is an extended real-valued measurable function, nonnegative and nondecreasing on the range of $f$, then

\begin{equation*}
\mu( \{ x \in X : f(x) \ge t \}) \le \frac{1}{g(t)} \int_X g \circ f \ d \mu
\end{equation*}

To relate the probabilistic and measure-theoretic statements: take the measure space to be the probability space $(\Omega, \mathcal{A}, P)$, set $f = X - \mu$, $p = 2$ and $t = k \sigma$; then $\int_{|f| \ge t} |f|^2 \ dP \le \sigma^2$, and the bound becomes $\Pr(|X - \mu| \ge k \sigma) \le \sigma^2 / (k \sigma)^2 = 1 / k^2$.

Markov Inequality

If $X$ is a nonnegative random variable and $a > 0$, then

\begin{equation*}
P \big( X \ge a \big) \le \frac{\mathbb{E}[X]}{a}
\end{equation*}

First note that

\begin{equation*}
\mathbb{E}[X] = \int_{0}^{\infty} \mathbb{P}(X \ge t) \dd{t}
\end{equation*}

and

\begin{equation*}
\begin{split}
  \int_{0}^{\infty} \mathbb{P}(X \ge s) \dd{s} &= \int_{0}^{t} \mathbb{P}(X \ge s) \dd{s} + \int_{t}^{\infty} \mathbb{P}(X \ge s) \dd{s} \\
  &\ge \int_{0}^{t} \mathbb{P}(X \ge s) \dd{s} \\
  &\ge \Big( \inf_{0 \le s \le t} \mathbb{P}(X \ge s) \Big) (t - 0) \\
  &= t \mathbb{P}(X \ge t) 
\end{split}
\end{equation*}

Hence,

\begin{equation*}
\mathbb{P}(X \ge t) \le \frac{\mathbb{E}[X]}{t}
\end{equation*}
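As a quick numerical sanity check (my own, not from any reference), we can compare both sides of the inequality by Monte Carlo for an exponential random variable:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1, X >= 0

for a in [0.5, 1.0, 2.0, 5.0]:
    lhs = (x >= a).mean()      # empirical P(X >= a)
    rhs = x.mean() / a         # Markov bound E[X] / a
    print(f"a={a}: P(X >= a) ~ {lhs:.4f} <= {rhs:.4f}")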

Law of large numbers

There are two main laws of large numbers: the weak law (convergence in probability) and the strong law (almost sure convergence).

The weak law states that sample averages converge in probability to the mean, i.e.

\begin{equation*}
\overline{X}_n \overset{p}{\longrightarrow} \mu
\end{equation*}

or equivalently,

\begin{equation*}
\lim_{n \to \infty} P \Big( \left| \overline{X}_n - \mu \right| > \varepsilon \Big) = 0
\end{equation*}

Often the assumption of finite variance is made, but this is in fact not necessary (but makes the proof easier). Large or infinite variance will make the convergence slower, but the Law of Large Numbers holds anyway.
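A minimal simulation of the weak law (a sketch; any integrable distribution would do in place of the exponential):

import numpy as np

rng = np.random.default_rng(0)
mu = 1.0
x = rng.exponential(scale=mu, size=100_000)
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)

# the running averages settle towards mu as n grows
for n in [10, 100, 10_000, 100_000]:
    print(n, running_mean[n - 1])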

Central Limit Theorem

Suppose $X_1, X_2, \dots$ is a sequence of i.i.d. random variables with $\mathbb{E}[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2 < \infty$. Then,

\begin{equation*}
\sqrt{n} \Bigg( \Big( \frac{1}{n} \sum_{i=1}^{n} X_i \Big) - \mu \Bigg) \overset{d}{\to} \mathcal{N}(0, \sigma^2)
\end{equation*}

where $\overset{d}{\to}$ means convergence in distribution, i.e. that the cumulative distribution functions of $\sqrt{n} (S_n - \mu)$, with $S_n := \frac{1}{n} \sum_{i=1}^{n} X_i$, converge pointwise to the cdf of the $\mathcal{N}(0, \sigma^2)$ distribution: for any real number $z$,

\begin{equation*}
\underset{n \to \infty}{\lim} \Pr [ \sqrt{n} (S_n - \mu) \le z ] = \Phi \Big( \frac{z}{\sigma} \Big)
\end{equation*}
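To see the statement in action, here is a small simulation (my own sketch, assuming Uniform(0, 1) summands, so $\mu = 1/2$ and $\sigma^2 = 1/12$):

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
samples = rng.uniform(size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 0.5)    # ~ N(0, 1/12) for large n

Phi = lambda u: 0.5 * (1 + erf(u / sqrt(2)))     # standard normal cdf
z0 = 0.2
print("empirical:", (z <= z0).mean(), "Phi(z0 / sigma):", Phi(z0 / sqrt(1 / 12)))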

Jensen's Inequality

If

  • $f$ is a convex function.
  • $X$ is a rv.

Then $f(E[X]) \leq E[f(X)]$.

Further, if $f''(x) > 0$ ($f$ is strictly convex), then $E[f(X)] = f(E[X]) \iff X = E[X]$ "with probability 1".

If we instead have:

  • $f$ is a concave function

Then $f(E[X]) \geq E[f(X)]$. This is the one we need for the derivation of the EM algorithm.

To get a visual intuition, here is an image from Wikipedia: jensens_inequality.png
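A quick numerical check of both directions (my own sketch, using $x^2$ as the convex function and $\log$ as the concave one):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)

# convex f(x) = x^2: f(E[X]) <= E[f(X)]
print(x.mean() ** 2, "<=", (x ** 2).mean())        # ~1 vs ~5

# concave log (on a positive rv): log(E[Y]) >= E[log Y]
y = rng.exponential(scale=1.0, size=1_000_000)
print(np.log(y.mean()), ">=", np.log(y).mean())    # ~0 vs ~ -0.577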

Continuous Mapping Theorem

Let $\{ X_n \}$, $X$ be random elements defined on a metric space $S$.

Suppose a function $g: S \to S'$ (where $S'$ is another metric space) has the set of discontinuity points $D_g$ such that $\Pr \{ X \in D_g \} = 0$.

Then,

  1. $X_n \overset{d}{\to} X \quad \implies \quad g(X_n) \overset{d}{\to} g(X)$
  2. $X_n \overset{p}{\to} X \quad \implies \quad g(X_n) \overset{p}{\to} g(X)$
  3. $X_n \overset{as}{\to} X \quad \implies \quad g(X_n) \overset{as}{\to} g(X)$

We start with convergence in distribution, for which we will need a particular statement of the portmanteau lemma.

Slutsky's Lemma

Let $\{ X_n \}$, $\{ Y_n \}$ be sequences of scalar / vector / matrix random elements.

If $X_n \overset{d}{\to} X$ and $Y_n \overset{p}{\to} c$, where $c$ is a constant, then

  • $X_n + Y_n \overset{d}{\to} X + c$
  • $X_n Y_n \overset{d}{\to} c X$
  • $X_n / Y_n \overset{d}{\to} X / c$

This follows from the fact that if:

\begin{equation*}
X_n \overset{d}{\to} X, \qquad Y_n \overset{p}{\to} c
\end{equation*}

then the joint vector $(X_n, Y_n)$ converges in distribution to $(X, c)$, i.e.

\begin{equation*}
(X_n, Y_n) \overset{d}{\to} (X, c)
\end{equation*}

Next we simply apply the Continuous Mapping Theorem, recognizing the functions

  • $g(x, y) = x + y$
  • $g(x, y) = x y$
  • $g(x, y) = x y^{-1}$

as continuous (for the last mapping to be continuous, $y$ has to be invertible, i.e. $c \ne 0$ in the scalar case).

Definitions

Probability space

A probability space is a measure space $(\Omega, \mathcal{A}, P)$ such that the measure of the whole space is equal to one, i.e.

  • $\Omega$ is the sample space (some arbitrary non-empty set)
  • $\mathcal{A} \subseteq 2^{\Omega}$ is the σ-algebra over $\Omega$, which is the set of possible events
  • $P: \mathcal{A} \to [0, 1]$ which is the probability measure such that

    \begin{equation*}
P(\Omega) = 1
\end{equation*}

Random measure

Let $E$ be a separable metric space (e.g. $\mathbb{R}^n$) and let $\mathcal{E}$ be its Borel σ-algebra.

$\zeta$ is a transition kernel from a probability space $\big( \Omega, \mathcal{A}, P \big)$ to $\big( E, \mathcal{E} \big)$ if $\omega \mapsto \zeta(\omega, B)$ is measurable for every $B \in \mathcal{E}$, and $B \mapsto \zeta(\omega, B)$ is a measure on $\big( E, \mathcal{E} \big)$ for every $\omega \in \Omega$.

We say a transition kernel is locally finite if the measures

\begin{equation*}
B \mapsto \zeta(\omega, B)
\end{equation*}

satisfy $\zeta(\omega, \overline{B}) < \infty$ for all bounded measurable sets $\overline{B} \in \mathcal{E}$ and for all $\omega \in \Omega$ except for some zero-measure set (under $P$).

Let $E$ be a separable metric space (e.g. $\mathbb{R}^n$) and let $\mathcal{E}$ be its Borel σ-algebra.

A random measure $\zeta$ is an (a.s.) locally finite transition kernel from an (abstract) probability space $\big( \Omega, \mathcal{A}, P \big)$ to $\big( E, \mathcal{E} \big)$.

Let $\zeta$ be a random measure on the measurable space $\big( \Omega, \mathcal{A} \big)$ and denote the expected value of a random element $Y$ with $\mathbb{E}[Y]$.

The intensity measure of $\zeta$ is defined as

\begin{equation*}
\begin{split}
  \mathbb{E}\zeta: \quad & \mathcal{A} \to [0, \infty] \\
  & A \mapsto \mathbb{E}\zeta(A) := \mathbb{E} \big[ \zeta(A) \big], \quad \forall A \in \mathcal{A}
\end{split}
\end{equation*}

So it's a non-random measure which sends an element $A$ of the sigma-algebra $\mathcal{A}$ to the expected value of $\zeta(A)$, which makes sense since $\zeta(A)$ is a measurable function, i.e. a random variable.

Poisson process

A Poisson process is a generalized notion of a Poisson random variable.

A point process $N$ is a (general) Poisson process with intensity $\Lambda$ if it has the two following properties:

  1. Number of points in a bounded Borel set $B$ is a Poisson random variable with mean $\Lambda(B)$.
    • In other words, denote the total number of points located in $B$ by $N(B)$, then the probability of random variable $N(B)$ being equal to $n$ is given by

      \begin{equation*}
P \left\{ N(B) = n \right\} = \frac{\big( \Lambda(B) \big)^n}{n!} e^{- \Lambda(B)}
\end{equation*}
  2. Number of points in $n$ disjoint Borel sets form $n$ independent random variables.

The Radon measure $\Lambda$ maintains its previous interpretation of being the expected number of points of $N$ located in the bounded region $B$, namely

\begin{equation*}
\Lambda(B) = \mathbb{E} \big[ N(B) \big]
\end{equation*}
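A minimal sketch of simulating a homogeneous Poisson process on $[0, 1]$ with $\Lambda = \lambda \cdot \text{Lebesgue}$ (so $\Lambda(B) = \lambda |B|$): draw the total count, then place the points uniformly.

import numpy as np

rng = np.random.default_rng(0)
lam = 50.0                              # points per unit length

def sample_pp(rng, lam):
    n = rng.poisson(lam)                # N([0, 1]) ~ Poisson(lam)
    return np.sort(rng.uniform(0.0, 1.0, size=n))

pts = sample_pp(rng, lam)
# counts in the disjoint sets [0, 1/2) and [1/2, 1) are independent Poisson rvs
print(len(pts), (pts < 0.5).sum(), (pts >= 0.5).sum())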

Cox process

Let $\zeta$ be a random measure.

A random measure $\eta$ is called a Cox process directed by $\zeta$, if $\mathcal{L}(\eta \mid \zeta = \mu)$ is a Poisson process with intensity measure $\mu$.

Here $\mathcal{L}(\eta \mid \zeta = \mu)$ is the conditional distribution of $\eta$, given $\left\{ \zeta = \mu \right\}$.

Degeneracy of probability distributions

A degenerate probability distribution is a probability distribution with support only on a lower-dimensional space.

E.g. if the degenerate distribution is univariate (involving a single random variable), then it's a deterministic distribution, taking on only a single value.

Characteristic function

Let $X$ be a random variable with density $f$ and cumulative distribution function $F(x) = \text{Pr}\left\{ X \le x \right\}$.

Then the characteristic function is defined as the Fourier transform of $f$:

\begin{equation*}
\begin{split}
  \varphi_X(t) &= \mathbb{E} \big[ e^{i t X} \big] \\
  &= \int_{\mathbb{R}}^{} e^{i t x} \dd{F_X(x)} \\
  &= \int_{\mathbb{R}}^{} e^{i t x} f(x) \ \dd{x} \\
  &= \mathcal{F} \left\{ f(x) \right\}
\end{split}
\end{equation*}

Examples

10 dice: you want at least one 4, one 3, one 2, one 1

Consider the number of misses before a first "hit". When we say "hit", we refer to throwing something in the set $\{ 1, 2, 3, 4 \}$.

Then we can do as follows:

  • $r_1$ is the # of misses before our first "hit"
  • $r_2$ is the # of misses between our first "hit" and second "hit"

We then observe the following:

  • $r_1 \le 6$ since if we have more than $6$ misses before our first hit, we'll never be able to get all the events in our target set
  • $r_2 \le 7 - r_1$
  • $r_3 \le 8 - r_1 - r_2$
  • $r_4 \le 9 - r_1 - r_2 - r_3$

We also observe that there are $2^{r_1}$, $3^{r_2}$, $4^{r_3}$ and $5^{r_4}$ ways of missing for each of the respective target events, while the hits themselves can take the four target values in $4!$ ordered ways, hence the factor $4!$ in the code below.

from sympy import *

# here r_i counts the throws up to and including the i-th hit, i.e. (# misses) + 1,
# so the exponents r_i - 1 match the miss counts described above
r_1, r_2, r_3, r_4 = symbols("r_1 r_2 r_3 r_4", nonnegative=True, integer=True)

# probability that the 4th hit occurs within 10 throws, summed over all
# feasible miss counts
s = Sum(
    Sum(
        Sum(
            Sum(
                (2**(r_1 - 1) * 3**(r_2 - 1) * 4**(r_3 - 1) * 5**(r_4 - 1))
                / 6**(r_1 + r_2 + r_3 + r_4),
                (r_4, 1, 10 - r_1 - r_2 - r_3)
            ), (r_3, 1, 9 - r_1 - r_2)
        ), (r_2, 1, 8 - r_1)
    ), (r_1, 1, 7)
)

res = factorial(4) * s.doit()
print(res.evalf())
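As an independent cross-check of the sympy computation (my own sketch), a brute-force Monte Carlo estimate of the same probability:

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=(1_000_000, 10))
# success: each of 1, 2, 3, 4 appears at least once among the 10 dice
hits = np.stack([(rolls == v).any(axis=1) for v in (1, 2, 3, 4)])
print(hits.all(axis=0).mean())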

Probability "metrics" / divergences

Notation

Overview of the "metrics" / divergences

These definitions are taken from arjovsky17_wasser_gan

The Total Variation (TV) distance

\begin{equation*}
\delta(\mathbb{P}_r, \mathbb{P}_g) = \sup_{A \in \Sigma} | \mathbb{P}_r(A) - \mathbb{P}_g(A)|
\end{equation*}

Or, using slightly different notation, the total variation between two probability measures $\mu$ and $\nu$ is given by

\begin{equation*}
||\mu - \nu||_{\rm{TV}} := \frac{1}{2} \sum_{x \in X}^{} |\mu(x) - \nu(x)|
\end{equation*}

These two notions are equivalent. Observe that any discrepancy between $\mu$ and $\nu$ is "accounted for twice" in the sum, since $\mu(E) - \nu(E) = - (\mu(\bar{E}) - \nu(\bar{E}))$, which is why we get the factor of $1 / 2$ in front. Gathering all $x \in X$ where $\mu(x) > \nu(x)$ into a subset $A \subseteq X$, the difference $\mu(A) - \nu(A)$ is exactly half the total $\ell_1$ discrepancy, and no other event can do better; this is the $\sup$ seen above.
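For a small discrete support we can brute-force the supremum over events and confirm it agrees with half the $\ell_1$ distance (a sketch with made-up distributions):

import numpy as np
from itertools import chain, combinations

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

half_l1 = 0.5 * np.abs(mu - nu).sum()

# brute-force sup over all events A (subsets of {0, 1, 2})
idx = range(len(mu))
subsets = chain.from_iterable(combinations(idx, r) for r in range(len(mu) + 1))
sup = max(abs(mu[list(A)].sum() - nu[list(A)].sum()) for A in subsets)

print(half_l1, sup)   # both 0.3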

The Jensen-Shannon (JS) divergence

\begin{equation*}
\text{JS}(\mathbb{P}_r, \mathbb{P}_g) = \text{KL}(\mathbb{P}_r || \mathbb{P}_m) + \text{KL}(\mathbb{P}_g || \mathbb{P}_m)
\end{equation*}

where $\mathbb{P}_m$ is the mixture $(\mathbb{P}_r + \mathbb{P}_g) / 2$.

This divergence is symmetric and always defined, because both KL terms are taken against the mixture $\mathbb{P}_m$, which dominates both $\mathbb{P}_r$ and $\mathbb{P}_g$.

The Earth-Mover (EM) distance or Wasserstein-1

\begin{equation*}
W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma} \big[ \norm{x - y} \big]
\end{equation*}

where $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $\mathbb{P}_r$ and $\mathbb{P}_g$.

Intuitively, $\gamma(x, y)$ indicates how much "mass" must be transported from $x$ to $y$ in order to transform the distributions $\mathbb{P}_r$ into the distribution $\mathbb{P}_g$. The EM distance then is the "cost" of the optimal transport plan.

Kullback-Leibler divergence

Definition

The Kullback-Leibler (KL) divergence

\begin{equation*}
\KL( q, p ) = \int \log \Bigg( \frac{q(x)}{p(x)} \Bigg) q(x) \ d \mu(x)
\end{equation*}

where both $q(x)$ and $p(x)$ are assumed to be absolutely continuous wrt. the same measure $\mu$ defined on $\mathcal{X}$, and therefore admit densities.

The KL divergence is famously asymmetric and possibly infinite / undefined when there are points s.t. $p(x) = 0$ and $q(x) > 0$.

$\KL(q, p) \ge 0$

\begin{equation*}
\begin{split}
  \KL(q, p) &= \int_{}^{} \log \bigg( \frac{q(x)}{p(x)} \bigg) q(x) \dd{x} \\
  &= - \int \log \bigg( \frac{p(x)}{q(x)} \bigg) q(x) \dd{x} \\
  &\ge - \log \bigg( \int \frac{p(x)}{q(x)} q(x) \dd{x} \bigg) \\
  &= - \log \bigg( \int p(x) \dd{x} \bigg) \\
  &= - \log(1) \\
  &= 0
\end{split}
\end{equation*}

where in the inequality we have used Jensen's inequality together with the fact that $-\log$ is a convex function (equivalently, $\log$ is concave).

Why?

The Kullback-Leibler divergence between probability distributions $Q$ and $P$, denoted $D_{KL} (P || Q)$, is a measure of the information gained when one revises one's beliefs from the prior distribution $Q$ to the posterior distribution $P$. In other words, it is the amount of information lost when $Q$ is used to approximate $P$.

Most importantly, the KL-divergence can be written

\begin{equation*}
D_{\rm{KL}}(p \ || \ q) = \mathbb{E}_p\Bigg[\log \frac{p(x \mid \theta^*)}{q(x \mid \theta)} \Bigg] = \mathbb{E}_p \big[ \log p(x \mid \theta^*) \big] - \mathbb{E}_p \big[ \log q(x \mid \theta) \big]
\end{equation*}

where $\theta^*$ is the optimal parameter and $\theta$ is the one we vary to approximate $p(x \mid \theta^*)$. The second term in the equation above is the only one which depends on the "unknown" parameter $\theta$ ($\theta^*$ is fixed, since this is the parameter we assume $p$ to take on). Now, suppose we have $N$ samples $\{ x_i \}$ from $p$; then observe that the negative log-likelihood for some parametrizable distribution $q$ is given by

\begin{equation*}
- \frac{1}{N} \sum_{i=1}^{N} \log q(x_i \mid \theta)
\end{equation*}

By the Law of Large numbers, we have

\begin{equation*}
\lim_{N \to \infty} - \frac{1}{N} \sum_{i=1}^{N} \log q(x_i \mid \theta) = - \mathbb{E}_p[\log q(x \mid \theta)]
\end{equation*}

where $\mathbb{E}_p$ denotes the expectation over the probability density $p$. But this is exactly the second term in the KL-divergence! Hence, minimizing the KL-divergence between $p(x \mid \theta^{\ast})$ and $q(x \mid \theta)$ is equivalent to minimizing the negative log-likelihood, or equivalently, maximizing the likelihood!
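To make this concrete, a small sketch (assuming $p = \mathcal{N}(\theta^*, 1)$ and $q_\theta = \mathcal{N}(\theta, 1)$, for which $\KL(p, q_\theta) = (\theta - \theta^*)^2 / 2$ in closed form): the minimizer of the empirical NLL tracks the minimizer of the KL.

import numpy as np

rng = np.random.default_rng(0)
theta_star = 1.5
x = rng.normal(theta_star, 1.0, size=10_000)    # samples from p

thetas = np.linspace(0.0, 3.0, 301)
# average negative log-likelihood of N(theta, 1), up to an additive constant
nll = np.array([0.5 * np.mean((x - t) ** 2) for t in thetas])
kl = 0.5 * (thetas - theta_star) ** 2           # KL(p || q_theta), closed form

print("argmin NLL:", thetas[nll.argmin()], "argmin KL:", thetas[kl.argmin()])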

Wasserstein metric

Let $(M, d)$ be a metric space for which every probability measure on $M$ is a Radon measure.

For $p \ge 1$, let $P_p(M)$ denote the collection of all probability measures $\mu$ on $M$ with finite $p\text{-th}$ moment, i.e. for some $x_0 \in M$,

\begin{equation*}
\int_{M} d(x, x_0)^p \ d \mu(x) < + \infty
\end{equation*}

Then the p-th Wasserstein distance between two probability measures $\mu$ and $\nu$ in $P_p(M)$ is defined as

\begin{equation*}
W_p(\mu, \nu) := \Bigg( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p \ \dd \gamma(x, y) \Bigg)^{1 / p}
\end{equation*}

where $\Gamma(\mu, \nu)$ denotes the collection of all measures on $M \times M$ with marginals $\mu$ and $\nu$ on the first and second factors respectively (i.e. all possible "joint distributions").

Or equivalently, taking the infimum over all joint laws of $(X, Y)$ with $X \sim \mu$ and $Y \sim \nu$,

\begin{equation*}
W_p(\mu, \nu)^p = \inf \mathbb{E} \big[ d(X, Y)^p \big]
\end{equation*}

Intuitively, if each distribution is viewed as a unit amount of "dirt" piled on the metric space $M$, the metric is the minimum "cost" of turning one pile into the other, which is assumed to be the amount of dirt that needs to be moved times the distance it has to be moved.

Because of this analogy, the metric is sometimes known as the "earth mover's distance".

Using the dual representation of $W_1$, when $\mu$ and $\nu$ have bounded support:

\begin{equation*}
W_1(\mu, \nu) = \sup \Bigg\{ \int_M f(x) \dd(\mu - \nu)(x) \mid \text{continuous } f: M \to \mathbb{R}, \quad \text{Lip}(f) \le 1 \Bigg\}
\end{equation*}

where $\text{Lip}(f)$ denotes the minimal Lipschitz constant for $f$.

Compare this with the definition of the Radon metric (metric induced by distance between two measures):

\begin{equation*}
\rho(\mu, \nu) := \sup \Bigg\{ \int_M f(x) \dd(\mu - \nu)(x) \mid \text{continuous } f: M \to [-1, 1] \Bigg\}
\end{equation*}

If the metric $d$ is bounded by some constant $C$, then

\begin{equation*}
2 W_1 (\mu, \nu) \le C \rho(\mu, \nu)
\end{equation*}

and so convergence in the Radon metric implies convergence in the Wasserstein metric, but not vice versa.

Observe that we can write the duality as

\begin{equation*}
\begin{split}
  W_1(\mu, \nu) &= \sup_{\norm{f}_L \le 1} \int_M f(x) \dd (\mu - \nu)(x) \\
  &= \sup_{\norm{f}_L \le 1} \int_M f(x) \dd \mu - \int_M f(x) \dd \nu \\
  &= \sup_{\norm{f}_L \le 1} \Big( \mathbb{E}_{x \sim \mu(x)} \big[ f(x) \big] - \mathbb{E}_{x \sim \nu(x)} \big[ f(x) \big] \Big)
\end{split}
\end{equation*}

where $\norm{f}_L \le 1$ denotes that we're taking the supremum over all functions which are 1-Lipschitz, i.e. Lipschitz continuous with Lipschitz constant 1.
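In one dimension the optimal transport plan is monotone, and scipy ships an implementation of $W_1$ for empirical samples (scipy.stats.wasserstein_distance). For a pure location shift the distance is just the shift (a quick sketch, not from the source):

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=100_000)
b = rng.normal(1.0, 1.0, size=100_000)

# for a shift by 1, every unit of mass moves distance 1, so W_1 ~ 1
print(wasserstein_distance(a, b))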

Integral Probability Metric

Let $q$ and $p$ be probability distributions, with $X \sim q$ and $Z \sim p$, and let $\mathcal{H}$ be some space of real-valued functions. Further, let

\begin{equation*}
d_{\mathcal{H}}(q, p) = \sup_{h \in \mathcal{H}} \Big[ \mathbb{E}_q [h(X)] - \mathbb{E}_p[h(Z)] \Big]
\end{equation*}

When $\mathcal{H}$ is sufficiently "large", then

\begin{equation*}
d_{\mathcal{H}}(q_m, p) \longrightarrow 0 \implies \big( q_m \big)_{m \ge 1} \longrightarrow p \text{ (weakly)}
\end{equation*}

We then say that $\mathcal{H}$ together with $d_{\mathcal{H}}(q, p)$ defines an integral probability metric (IPM).

ϕ-divergences

\begin{equation*}
D_{\phi}(P, Q) = \int_{\mathcal{X}} q(x) \phi \bigg( \frac{p(x)}{q(x)} \bigg) \dd{x}
\end{equation*}

Stein's method

Absolutely fantastic description of it: https://statweb.stanford.edu/~souravc/beam-icm-trans.pdf

Notation

  • $W$ any random variable
  • $Z$ std. Gaussian random variable

Overview

  • Technique for proving generalized central limit theorems
  • Ordinary central limit theorem: if $X_1, X_2, \dots$ are i.i.d. rvs. then

    \begin{equation*}
\lim_{n \to \infty} \mathbb{P} \Bigg( \frac{\sum_{i = 1}^{n} X_i - n \mu}{\sigma \sqrt{n}} \le x \Bigg) = \int_{-\infty}^{x} \frac{1}{\sqrt{2 \pi}} e^{- u^2 / 2} \ du
\end{equation*}

    where $\mu = \mathbb{E}(X_i)$ and $\sigma^2 = \text{Var}(X_i)$

  • Usual method of proof:
    1. LHS is computed using Fourier Transform
    2. Independence implies that FT decomposes as a product
    3. Analysis
  • Stein's motivation: what if $X_i$ are not exactly independent?!

Method

Suppose we want to show that $W$ is "approximately Gaussian", i.e.

\begin{equation*}
\mathbb{P}(W \le x) \approx \mathbb{P}(Z \le x) \quad \forall x
\end{equation*}

or,

\begin{equation*}
\mathbb{E}[h(W)] \approx \mathbb{E}[h(Z)]
\end{equation*}

for any well-behaved $h$

It's a generalization because if

\begin{equation*}
h(u) = 
\begin{cases}
  1 & \text{if } u \le x \\
  0 & \text{otherwise}
\end{cases}
\end{equation*}

then

\begin{equation*}
\mathbb{P}(W \le x) = \mathbb{E}[h(W)]
\end{equation*}

Suppose

\begin{equation*}
W = \frac{1}{\sigma \sqrt{n}} \bigg( \sum_{i=1}^{n} X_i - n \mu \bigg)
\end{equation*}

and we want to show that the rv. $W$ is approximately std. Gaussian, i.e. $\mathbb{E}[h(W)] \approx \mathbb{E}[h(Z)]$ for all well-behaved $h$.

  1. Given $h$, obtain a function $f$ by solving the differential equation

    \begin{equation*}
f'(x) - x f(x) = h(x) - \mathbb{E}[h(z)]
\end{equation*}

  2. Show that

\begin{equation*}
\mathbb{E} \big( f'(W) - W f(W) \big) = \mathbb{E}[h(W)] - \mathbb{E}[h(Z)]
\end{equation*}

using the properties of $W$

  3. Since

    \begin{equation*}
\mathbb{E} \big( f'(W) - W f(W) \big) = \mathbb{E}[h(W)] - \mathbb{E}[h(Z)]
\end{equation*}

    conclude that $\mathbb{E}[h(W)] \approx \mathbb{E}[h(Z)]$.

More generally, two smooth densities $p(x)$ and $q(x)$ supported on $\mathbb{R}$ are identical if and only if

\begin{equation*}
\mathbb{E}_p [\mathbf{s}_q(x) f(x) + \nabla_x f(x) ] = 0
\end{equation*}

for all smooth functions $f(x)$ with proper zero-boundary conditions, where

\begin{equation*}
\mathbf{s}_q(x) = \nabla_x \log q(x) = \frac{\nabla_x q(x)}{q(x)}
\end{equation*}

is called the Stein score function of $q(x)$.

The Stein discrepancy measure between two continuous densities $p(x)$ and $q(x)$ is defined as

\begin{equation*}
\mathbb{S}(p, q) = \max_{f \in \mathcal{F}} \big( \mathbb{E}_p [ \mathbf{s}_q(x) f(x) + \nabla_x f(x)] \big)^2
\end{equation*}

where $\mathcal{F}$ is a set of functions which satisfies the boundary conditions of the previous expectation, and is also "rich" enough to ensure that $\mathbb{S}(p, q) > 0$ whenever $p \ne q$.

Why does it work?

  • Turns out that if we drop the $\mathbb{E}[h(Z)]$ term when solving the differential equation in Stein's method, then $f$ is not well-behaved; it blows up at infinity.
  • A random variable $X$ has the standard Gaussian distribution if and only if $\mathbb{E} \big( f'(X) - X f(X) \big) = 0$ for all (suitable) $f$! (One can check this by observing that the density solving the corresponding equation is proportional to $e^{-x^2 / 2}$.)
  • Differential operator $T$, defined as

    \begin{equation*}
T f(x) = f'(x) - x f(x)
\end{equation*}

    is called a characterizing operator for the standard Gaussian distribution.

Stochastic processes

Discrete

Simple random walk

Let $\left\{ Y_i \right\}$ be a set of i.i.d. random variables such that

\begin{equation*}
Y_i := 
\begin{cases}
  1 & \text{with prob. } \frac{1}{2} \\
  -1 & \text{with prob. } \frac{1}{2}
\end{cases}
\end{equation*}

Then, let

\begin{equation*}
X_t := \sum_{i=1}^{t} Y_i, \quad X_0 := 0
\end{equation*}

We then call the sequence $(X_0, X_1, X_2, \dots)$ a simple random walk.

For large $t$, due to the CLT we have, approximately,

\begin{equation*}
\frac{X_t}{\sqrt{t}} \sim \mathcal{N}(0, 1)
\end{equation*}

i.e. $X_t$ approximately follows a normal distribution with $\mathbb{E}[X_t] = 0$ and $\text{Var}(X_t) = t$.

We then have the following properties:

  1. $\mathbb{E}[X_t] = 0$
  2. Independent increments : if

    \begin{equation*}
0 = t_0 \le t_1 \le \dots \le t_k
\end{equation*}

    then the increments $X_{t_{i + 1}} - X_{t_i}$, for $i = 0, \dots, k - 1$, are mutually independent.

  3. Stationary : for all $h \ge 1$ and $t \ge 0$, the distribution of $X_{t + h} - X_t$ is the same as the distribution of $X_h$.
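A small simulation checking the zero mean, the variance $t$, and the approximate normality (my own sanity check):

import numpy as np

rng = np.random.default_rng(0)
t, reps = 1_000, 5_000
x_t = rng.choice([-1, 1], size=(reps, t)).sum(axis=1)   # X_t, `reps` times

print("mean:", x_t.mean())                        # ~ 0
print("var :", x_t.var(), "vs t =", t)            # ~ t
print("P(X_t / sqrt(t) <= 1):", (x_t / np.sqrt(t) <= 1).mean())  # ~ 0.841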

Martingale

A martingale is a stochastic process which is a "fair game".

We say a stochastic process $\{ X_0, X_1, X_2, \dots \}$ is a martingale if

\begin{equation*}
X_t = \mathbb{E}[X_{t + i} \mid \mathcal{F}_t], \quad i \ge 0
\end{equation*}

where

\begin{equation*}
\mathcal{F}_t = \left\{ X_0, X_1, \dots, X_t \right\}
\end{equation*}
Examples
  1. Simple random walk is a martingale
  2. The balance of a roulette player is NOT a martingale (its conditional expectation decreases over time)
Optional stopping theorem

Given a stochastic process $\{ X_0, X_1, \dots \}$, a non-negative integer r.v. $\tau$ is called a stopping time if, for every $k \in \mathbb{N}$, the event

\begin{equation*}
\left\{ \tau \le k \right\}
\end{equation*}

depends only on $X_0, X_1, \dots, X_k$ (i.e. whether we have stopped by time $k$ depends only on the first $k + 1$ r.v.s).

Suppose $X_0, X_1, X_2, \dots$ is a martingale and $\tau$ is a stopping time. If there exists a constant $T$ s.t. $\tau \le T$ almost surely, then

\begin{equation*}
\mathbb{E}[X_{\tau}] = X_0
\end{equation*}

This implies that if a process can be described as a martingale, then it's a "zero-sum" game.

Continuous

Lévy process

A stochastic process $X = \left\{ X_t : t \ge 0 \right\}$ is said to be a Lévy process if it satisfies the following properties:

  1. $X_0 = 0$ almost surely.
  2. Independence of increments: for any $0 \le t_1 < t_2 < \dots < t_n < \infty$, the increments $X_{t_2} - X_{t_1}, X_{t_3} - X_{t_2}, \dots, X_{t_n} - X_{t_{n - 1}}$ are independent.
  3. Stationary increments: For any $s < t$, the rv. $X_t - X_s$ is equal in distribution to $X_{t - s}$.
  4. Continuity in probability: For any $\varepsilon > 0$ and $t \ge 0$ it holds that

    \begin{equation*}
\lim_{h \to 0} P \big( \left| X_{t + h} - X_t \right| > \varepsilon \big) = 0
\end{equation*}

If $X$ is a Lévy process then one may construct a version of $X$ such that $t \mapsto X_t$ is almost surely right-continuous with left limits.

Wiener process

The Wiener process (or Brownian motion) is a continuous-time stochastic process $W_t$ characterized by the following properties:

  1. $W_0 = 0$ (almost surely)
  2. $W$ has independent increments: for $t > 0$, future increments $W_{t + u} - W_t$ with $u \ge 0$, are independent of the past values $W_s$, $s \le t$
  3. $W$ has Gaussian increments:

    \begin{equation*}
W_{t + u} - W_t \sim \mathcal{N}(0, u)   
\end{equation*}
  4. $W$ has continuous paths: with probability $1$, $W_t$ is continuous in $t$

Further, let $\xi_1, \xi_2, \dots$ be i.i.d. rv. with mean 0 and variance 1. For each $n$, define the continuous-time stochastic process

\begin{equation*}
W_n(t)  = \frac{1}{\sqrt{n}} \sum_{1 \le k \le \lfloor n t \rfloor} \xi_k
\end{equation*}

Then

\begin{equation*}
W_n(t)  \to W_t \quad \text{as} \quad n \to \infty
\end{equation*}

i.e. a random walk approaches a Wiener process.
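A sketch of this construction (the i.i.d. steps here are $\pm 1$, but any mean-0, variance-1 choice works):

import numpy as np

rng = np.random.default_rng(0)

def w_n(n, rng):
    # one path of W_n on [0, 1], built from n i.i.d. mean-0, variance-1 steps
    xi = rng.choice([-1.0, 1.0], size=n)
    return np.concatenate([[0.0], xi.cumsum()]) / np.sqrt(n)

path = w_n(100_000, rng)
print(path[-1])   # W_n(1), approximately N(0, 1) for large n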

Properties

Let $W(t)$ be a Wiener process and let

\begin{equation*}
M(t) := \max_{s \le t} W(s)
\end{equation*}

Then

\begin{equation*}
P \Big( M(t) > a \Big) = 2 P \Big( W(t) > a \Big)
\end{equation*}

for any $a > 0$.

Let $\tau_a$ be the first time such that

\begin{equation*}
W(\tau_a) = a
\end{equation*}

Then

\begin{equation*}
P \Big( M(t) > a \Big) = P (\tau_a < t)
\end{equation*}

Observe that

\begin{equation*}
P \Big( W(t) - W(\tau_a) \textcolor{blue}{>} 0 \mid \tau_a < t \Big) = P \Big( W(t) - W(\tau_a) \textcolor{red}{<} 0 \mid \tau_a < t \Big)
\end{equation*}

which is saying that at the point $\tau_a$ we have $W(\tau_a) = a$, and therefore for all time which follows we're equally likely to be above or below $a$.

Therefore

\begin{equation*}
\begin{split}
  P \Big( M(t) > a \Big) &= P (\tau_a < t) \\
  &= P \Big( \left\{ W(t) - W(\tau_a) > 0 \right\} \cap \left\{ \tau_a < t \right\} \Big) \\
  & \ \ + P \Big( \left\{ W(t) - W(\tau_a) < 0 \right\} \cap \left\{ \tau_a < t \right\} \Big) \\
  &= 2 P \Big( \left\{ W(t) - W(\tau_a) > 0 \right\} \cap \left\{ \tau_a < t \right\} \Big) \\
  &= 2 P \Big( W(t) > a \Big)
\end{split}
\end{equation*}

where in the last equality we used the fact that $\left\{ W(t) - W(\tau_a) > 0 \right\} \cap \left\{ \tau_a < t \right\} = \left\{ W(t) > a \right\}$ (if $W(t) > a$, then by continuity the path must already have hit $a$ before time $t$).

This theorem helps us solve the gambler's ruin problem!

Let $B_t$ denote your bankroll at time $t$, and at each $t$ it changes by

\begin{equation*}
\Delta B_t = 
\begin{cases}
  + 1 & \text{with prob. } \frac{1}{2} \\
  -1 & \text{with prob. } \frac{1}{2}
\end{cases}
\end{equation*}

The continuous approximation of this process is given by a Brownian motion, hence we can write the bankroll as

\begin{equation*}
B_t = \max \Big\{ W_t + B_0, 0 \Big\}
\end{equation*}

where $W_t$ denotes a Brownian motion (with $\mathbb{E}[W_t] = 0$).

We're interested in how probable it is for us to be ruined before some time $t$. Letting $\tau$ be the time where we go bankrupt, we have

\begin{equation*}
P \Big( B_t = 0 \Big) = P \Big( \tau < t \Big)
\end{equation*}

i.e. that bankruptcy occurs before time $t$.

Bankruptcy corresponds to

\begin{equation*}
B_{\tau} = 0 \iff W_{\tau} = - B_0
\end{equation*}

where $W_t$ denotes a Wiener process (hence centered at 0).

Then we can write

\begin{equation*}
P \Big( B_t = 0 \Big) = P \Big( \tau < t \Big) = P \Big( \min_{s \le t} W_{s} < - B_0 \Big) = P \Big( \max_{s \le t} W_{s} > B_0 \Big)
\end{equation*}

where the last equality is due to the symmetry of a Normal distribution and thus a Brownian motion.

Hence

\begin{equation*}
P( \tau < t ) = P( \max_{s \le t} W_{s} > B_0 ) = 2 P( W_t > B_0) = 2 \Big( 1 - \Phi( B_0 / \sqrt{t} ) \Big)
\end{equation*}

where $\Phi(\cdot)$ denotes the standard normal CDF.
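A Monte Carlo check of this formula with the discrete walk (my own sketch; the continuous approximation should be decent for these values):

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
b0, t, reps = 10, 400, 20_000
w = rng.choice([-1, 1], size=(reps, t)).cumsum(axis=1)
ruined = (w.min(axis=1) <= -b0).mean()

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))     # standard normal cdf
print(ruined, 2 * (1 - Phi(b0 / sqrt(t))))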

The quadratic variation of a Brownian motion over $[0, T]$ is $T$:

\begin{equation*}
\lim_{n \to \infty} \sum_{i = 1}^{n} \Big[ B \big( T i / n \big) - B \big( T (i - 1) / n \big) \Big]^2 = T
\end{equation*}

This is fairly straightforward to prove; just recall the Law of Large Numbers (the squared increments are i.i.d. with mean $T / n$) and you should be good.

Non-differentiability

$W(t)$ is non-differentiable with probability 1.

Suppose, for the sake of contradiction, that $W(t)$ is differentiable with probability 1. Then, by the mean value theorem we have

\begin{equation*}
\left| W(t + \varepsilon) - W(t) \right| \le A \left| t + \varepsilon - t \right| = \varepsilon A
\end{equation*}

for some $A$ and all $\varepsilon > 0$. This implies that

\begin{equation*}
M(\varepsilon) := \max_{s \le \varepsilon} W(s) \le \varepsilon A
\end{equation*}

since $W(t + \varepsilon) - W(t) \overset{d}{=} W(\varepsilon)$ (so we may take $t = 0$).

Hence

\begin{equation*}
P \Big( M(\varepsilon) > \varepsilon A \Big) = 2 P \Big( W(\varepsilon) > \varepsilon A \Big)
\end{equation*}

Since $W(\varepsilon) \sim \mathcal{N}(0, \varepsilon)$, we have

\begin{equation*}
P \Big( W(\varepsilon) > \varepsilon A \Big) = P \Big( W(\varepsilon) / \sqrt{\varepsilon} > \sqrt{\varepsilon} A \Big)
\end{equation*}

where

\begin{equation*}
\Big( W(\varepsilon) / \sqrt{\varepsilon} \Big) \sim \mathcal{N}(0, 1)
\end{equation*}

Thus, as $\varepsilon \to 0$, the right-hand side tends to $P(Z > 0) = \frac{1}{2}$ for a standard Gaussian $Z$. Hence,

\begin{equation*}
P \Big( M(\varepsilon) > \varepsilon A \Big) \to 2 \cdot \frac{1}{2} = 1
\end{equation*}

contradicting our initial assumption that $M(\varepsilon) \le \varepsilon A$ with probability 1.

Geometric Brownian motion

A stochastic process $S_t$ is said to follow a Geometric Brownian Motion if it satisfies the following stochastic differential equation (SDE):

\begin{equation*}
d S_t = \mu S_t \ dt + \sigma S_t \ d W_t
\end{equation*}

where $W_t$ is the Brownian motion, and $\mu$ (percentage drift) and $\sigma$ (percentage volatility) are constants.
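The SDE has the closed-form solution $S_t = S_0 \exp \big( (\mu - \sigma^2 / 2) t + \sigma W_t \big)$, which gives a direct way to simulate paths (a sketch with made-up parameter values):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, s0, T, n = 0.05, 0.2, 1.0, 1.0, 1_000

dt = T / n
w = np.concatenate([[0.0], rng.normal(0.0, np.sqrt(dt), size=n).cumsum()])
t = np.linspace(0.0, T, n + 1)

# exact solution of the SDE along the sampled Brownian path
s = s0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * w)
print(s[-1])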

Brownian bridge

A Brownian bridge is a continuous-time stochastic process $B(t)$ whose probability distribution is the conditional probability distribution of a Brownian motion $W(t)$ subject to the condition that

\begin{equation*}
W(T) = 0
\end{equation*}

so that the process is pinned at the origin at both $t=0$ and $t = T$. More precisely,

\begin{equation*}
B_t := \big( W_t \mid W_T = 0 \big), \quad t \in [0, T]
\end{equation*}

Then

\begin{equation*}
\mathbb{E}[B_t] = 0, \quad \text{Var}(B_t) = \frac{t (T - t)}{T}
\end{equation*}

implying that the most uncertainty is in the middle of the bridge, with zero uncertainty at the nodes.

Sometimes the notation $B(t)$ is used for a Wiener process / Brownian motion rather than for a Brownian bridge.

Markov Chains

Notation

  • $Q$ denotes the transition matrix (assumed to be ergodic, unless stated otherwise)
  • $\pi$ is the stationary distribution
  • $x_0$ denotes the initial state
  • $\Omega$ is the state space
  • $\mathcal{X}$ denotes an uncountable state space
  • $\pi\text{-a.e.} \ x \in \mathcal{X}$ denotes $\pi\text{-almost-every} \ x \in \mathcal{X}$, which means all points except for a set of $\pi$ measure zero

Definitions

A Markov chain is said to be irreducible if it's possible to reach any state from any state.

A state $i$ has a period $k$ if any return to state $i$ must occur in multiples of $k$ time steps.

Formally, the period of a state is defined as

\begin{equation*}
k  = \gcd \left\{  n > 0 : \rm{Pr}(X_n = i \mid X_0 = i) > 0 \right\}
\end{equation*}

provided that the set is non-empty.

A state $i$ is said to be transient if, given that we start in state $i$, there is a non-zero probability that we will never return to $i$.

A state $i$ is said to be recurrent (or persistent) if it is not transient, i.e. guaranteed (with prob. 1) to return in finite time.

A state $i$ is said to be ergodic if it is aperiodic and positive recurrent, i.e.

  • aperiodic: period of 1, i.e. can return to current state in a single step
  • positive recurrent: has a finite mean recurrence time

If all states in an irreducible Markov chain are ergodic , then the chain is said to be ergodic.

The above definitions can be generalized to continuous state-spaces by considering measurable sets $A \subseteq \Omega$ rather than states $i$. E.g. recurrence now means that for any measurable set $A \subseteq \Omega$, we have

\begin{equation*}
\mathbb{P} \big( \left\{ N_A < \infty \right\} \big) = 1
\end{equation*}

where $N_A$ denotes the number of steps we need to take until we "hit" $A$, i.e. the hitting time.

[This is a personal thought though, so not 100% guaranteed to be correct.]

Coupling

  • Useful for bounding the mixing rate of Markov chains, i.e. the number of steps it takes for the Markov chain to converge to the stationary distribution

Distance to Stationary Distribution

  • Use Total Variation as a distance measure; convergence is therefore defined through

    \begin{equation*}
d(t) := \max_{x \in \Omega} || Q^t(x, \cdot) - \pi||_{\rm{TV}}
\end{equation*}
  • $\bar{d}(t)$ denotes the variation distance between two Markov chain random variables $X_t \sim Q^t(x, \cdot)$ and $Y_t \sim Q^t(y, \cdot)$, i.e.

    \begin{equation*}
\bar{d}(t) := \max_{x, y \in \Omega} ||Q^t(x, \cdot) - Q^t(y, \cdot)||_{\rm{TV}}
\end{equation*}
  • One can prove that

    \begin{equation*}
d(t) \le \bar{d}(t) \le 2 d(t)
\end{equation*}

    which allows us to bound the distance between a chain and the stationary distribution, by considering the difference between two chains, without knowing the stationary distribution!

Coupling

Let $X$ and $Y$ be random variables with probability distributions $\mu$ and $\nu$ on $\Omega$, respectively.

A distribution $\omega$ on $\Omega \times \Omega$ is a coupling if

\begin{equation*}
\begin{split}
  \sum_{y \in \Omega}^{} \omega(x, y) &= \mu(x), \quad \forall x \in \Omega \\
  \sum_{x \in \Omega}^{} \omega(x, y) &= \nu(y), \quad \forall y \in \Omega
\end{split}
\end{equation*}

Consider a pair of distributions $\mu$ and $\nu$ on $\Omega$.

  1. For any coupling $\omega$ of $\mu$ and $\nu$, with $(X, Y) \sim \omega$

    \begin{equation*}
|| \mu - \nu ||_{\rm{TV}} \le P(X \ne Y)
\end{equation*}
  2. There always exists a coupling $\omega$ s.t.

    \begin{equation*}
||\mu - \nu||_{\rm{TV}} = P(X \ne Y)
\end{equation*}
  1. Observe that

    \begin{equation*}
\omega(z, z) \le \min \left\{ \mu(z), \nu(z) \right\}, \quad \forall z \in \Omega
\end{equation*}

    Therefore,

    \begin{equation*}
\begin{split}
  P(X \ne Y) &= 1 - P(X = Y) \\
  &= 1 - \sum_{z}^{} \omega(z, z) \\
  &\ge \sum_{z}^{} \mu(z) - \sum_{z}^{} \min \left\{ \mu(z), \nu(z) \right\} \\
  &= \sum_{z : \ \mu(z) > \nu(z)}^{} \big( \mu(z) - \nu(z) \big) \\
  &= ||\mu - \nu||_{\rm{TV}}
\end{split}
\end{equation*}

    Concluding the proof.

  2. "Inspired" by our proof in 1., we fix diagonal entries:

    \begin{equation*}
\omega(z, z) = \min \left\{ \mu(z), \nu(z) \right\}, \quad \forall z \in \Omega
\end{equation*}

    to make one of the inequalities an equality. Now we simply need to construct $\omega$ such that the above is satisfied, and it's a coupling. One can check that

    \begin{equation*}
\omega(y, z) = \frac{\Big( \mu(y) - \omega(y, y) \Big) \Big( \nu(z) - \omega(z, z) \Big)}{1 - \sum_{x}^{} \omega(x, x)}
\end{equation*}

    does indeed do the job.
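The construction in the proof can be checked numerically (a sketch with made-up distributions): build $\omega$ as above and verify the marginals and $P(X \ne Y) = ||\mu - \nu||_{\rm{TV}}$.

import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

diag = np.minimum(mu, nu)                    # omega(z, z) = min{mu(z), nu(z)}
omega = np.outer(mu - diag, nu - diag) / (1 - diag.sum())
np.fill_diagonal(omega, diag)

# correct marginals, and P(X != Y) equals the TV distance
print(np.allclose(omega.sum(axis=1), mu), np.allclose(omega.sum(axis=0), nu))
print(1 - np.trace(omega), 0.5 * np.abs(mu - nu).sum())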

Ergodicity Theorem

If $Q$ is irreducible and aperiodic, then there is a unique stationary distribution $\pi$ such that

\begin{equation*}
\lim_{t \to \infty} Q^t(x, \cdot) = \pi, \quad \forall x \in \Omega
\end{equation*}

Consider two copies of the Markov chain $X_t$ and $Y_t$, both following $Q$. We create the coupling distribution as follows:

  1. If $X_t \ne Y_t$, then choose $X_{t + 1}$ and $Y_{t + 1}$ independently according to $Q$
  2. If $X_t = Y_t$, choose $X_{t + 1} \sim Q$ and set $Y_{t + 1} = X_{t + 1}$

From the coupling lemma, we know that

\begin{equation*}
||Q^t(x, \cdot) - Q^t(y, \cdot)||_{\rm{TV}} \le P(X^t \ne Y^t)
\end{equation*}

Due to ergodicity, there exists $t^*$ such that $Q^{t^*}(x, y) > 0$ for all $x, y \in \Omega$. Therefore, there is some $\varepsilon > 0$ such that for all initial states $X_0$ and $Y_0$,

\begin{equation*}
P(X^{t^*} \ne Y^{t^*} \mid X_0, Y_0) \le 1 - \varepsilon
\end{equation*}

Similarly, due to the Markov property, we know

\begin{equation*}
P(X^{2 t^*} \ne Y^{2t^*} \mid X^{t^*} \ne Y^{t^*}) \le 1 - \varepsilon
\end{equation*}

Since

\begin{equation*}
X^{2 t^*} \ne Y^{2 t^*} \implies X^{t^*} \ne Y^{t^*}
\end{equation*}

we have

\begin{equation*}
\begin{split}
  P (X^{2 t^*} \ne Y^{2 t^*} \mid X_0, Y_0) &= P(X^{t^*} \ne Y^{t^*} \text{ and } X^{2 t^*} \ne Y^{2 t^*} \mid X_0, Y_0) \\
  &= P(X^{2t^*} \ne Y^{2t^*} \mid X^{t^*} \ne Y^{t^*}) P(X^{t^*} \ne Y^{t^*} \mid X_0, Y_0) \\
  &\le \big( 1 - \varepsilon \big)^2
\end{split}
\end{equation*}

Hence, for any positive integer $k$:

\begin{equation*}
  P (X^{k t^*} \ne Y^{k t^*} \mid X_0, Y_0) \le \big( 1 - \varepsilon \big)^k
\end{equation*}

Hence,

\begin{equation*}
 \lim_{k \to \infty} P (X^{k t^*} \ne Y^{k t^*} \mid X_0, Y_0) = 0 \implies \lim_{t \to \infty} P (X^{t} \ne Y^{t} \mid X_0, Y_0) = 0
\end{equation*}

The coupling lemma then gives

\begin{equation*}
\lim_{t \to \infty} ||Q^t(x, \cdot) - Q^t(y, \cdot)||_{\rm{TV}} = 0
\end{equation*}

Finally, we observe that, letting $\sigma = \lim_{t \to \infty} Q^t(x, \cdot)$,

\begin{equation*}
\begin{split}
  \sum_{x}^{} \sigma(x) Q(x, y) &= \sum_{x}^{} \lim_{t \to \infty} Q^t(z, x) Q(x, y), \quad \forall z \in \Omega \\
  &amp;= \lim_{t \to \infty} Q^{t + 1}(z, y) \\
  &amp;= \sigma(y)
\end{split}
\end{equation*}

which shows that $\sigma Q = \sigma$, i.e. the chain converges to the stationary distribution.

Finally, observe that $\sigma$ is unique, since

\begin{equation*}
 || \lim_{t \to \infty} Q^t(x, \cdot) - \lim_{t \to \infty} Q^t(y, \cdot)||_{\rm{TV}} = 0
\end{equation*}

Mixing time

Let $X_0$ be some $x \in \Omega$ and let $Y_0$ have the stationary distribution. Fix $t$. By the coupling lemma, there is a coupling and random variables $X^t \sim Q^t(x, \cdot)$ and $Y^t \sim \pi$ such that

\begin{equation*}
d_x(t) = ||Q^t(x, \cdot) - \pi||_{\rm{TV}} = P(X^t \ne Y^t)
\end{equation*}

Using this coupling, we define a coupling of the distributions of $X^{t + 1}, Y^{t + 1}$ as follows:

  1. If $X^t = Y^t$, set $X^{t + 1} = Y^{t + 1}$
  2. Else, let $X^t \to X^{t + 1}$ and $Y^t \to Y^{t + 1}$ independently

Then we have,

\begin{equation*}
d_x(t + 1) = ||Q^{t + 1}(x, \cdot) - \pi||_{\rm{TV}} \le P(X^{t + 1} \ne Y^{t + 1}) \le P(X^t \ne Y^t) = d_x(t)
\end{equation*}

The first inequality holds due to the coupling lemma, and the second inequality holds by construction of the coupling.

Since $d(t)$ never increases, we can define the mixing time $\tau(\varepsilon)$ of a Markov chain as

\begin{equation*}
\tau(\varepsilon) = \min \left\{ t : d(t) \le \varepsilon \right\}
\end{equation*}

for some chosen $\varepsilon > 0$.
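For a concrete (made-up) chain this is easy to compute directly from powers of the transition matrix:

import numpy as np

# a small ergodic chain on 3 states (rows sum to 1)
Q = np.array([[0.5, 0.25, 0.25],
              [0.2, 0.6,  0.2 ],
              [0.3, 0.3,  0.4 ]])

# stationary distribution: normalized left eigenvector for eigenvalue 1
vals, vecs = np.linalg.eig(Q.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

def d(t):
    Qt = np.linalg.matrix_power(Q, t)
    return max(0.5 * np.abs(Qt[x] - pi).sum() for x in range(len(pi)))

tau = next(t for t in range(1, 100) if d(t) <= 0.01)
print("tau(0.01) =", tau)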

General (uncountable) state space

Now suppose we have some uncountable state-space $\mathcal{X}$. Here we have to be a bit more careful when going about constructing Markov chains.

  • Instead of considering transitions $x \mapsto y$ for some $x, y \in \mathcal{X}$, we have to consider transitions $x \mapsto y \in A$ for $x \in \mathcal{X}$ and $A \subseteq \mathcal{X}$.
    • The transition probability is therefore denoted $Q(x, A)$
  • This is due to the fact that typically the measure of a singleton set $\{ x \}$ is zero, i.e. $\mu(\left\{ x \right\}) = 0$, when we have an uncountable number of possible inputs.
  • That is; we have to consider transitions to subsets of the state-space, rather than singletons
  • This requires a new definition of irreducibility of a Markov chain

A chain is $\phi\text{-irreducible}$ if there exists a non-zero sigma-finite measure $\phi$ on $\mathcal{X}$ such that for all $A \subseteq \mathcal{X}$ with $\phi(A) > 0$ (i.e. all subsets of positive measure), and for all $x \in \mathcal{X}$, there exists a positive integer $t = t(x, A)$ such that

\begin{equation*}
Q^t(x, A) > 0
\end{equation*}

We also need a slightly altered definition of periodicity of a Markov chain:

A Markov chain with stationary distribution $\pi$ is periodic if there exist $d \ge 2$ and disjoint subsets $\mathcal{X}_1, \mathcal{X}_2, \dots, \mathcal{X}_d \subseteq \mathcal{X}$ with

\begin{equation*}
P(x, \mathcal{X}_{i + 1}) = 1, \quad x \in \mathcal{X}_i, i = 1, \dots, d - 1
\end{equation*}

and

\begin{equation*}
P(x, \mathcal{X}_1) = 1, \quad x \in \mathcal{X}_{d}
\end{equation*}

such that $\pi(\mathcal{X}_1) > 0$ (and hence $\pi(\mathcal{X}_i) > 0$ for all $i$). That is, we always transition between these disjoint subsets, and never within them.

If such does not exist, i.e. the chain is not periodic, then we say the chain is aperiodic.

Results such as the coupling lemma also hold in the case of an uncountable state-space, simply by replacing the summation with the corresponding integral over $\mathcal{X}$. From roberts04_gener_state_space_markov_chain_mcmc_algor, similarly to the countable case, we have the following properties:

Let $\mu$ and $\nu$ be probability measures on the space $\mathcal{X}$. Then

  1. $||\mu - \nu||_{\rm{TV}} = \sup_{f : \mathcal{X} \to [0, 1]} \left| \int f d \mu - \int f d \nu \right|$
  2. More generally,

    \begin{equation*}
||\mu - \nu||_{\rm{TV}} = \frac{1}{b - a} \sup_{f: \mathcal{X} \to [a, b]} \left| \int f \ d \mu - \int f  \ d \nu \right|
\end{equation*}

    In particular,

    \begin{equation*}
||\mu - \nu||_{\rm{TV}} = \frac{1}{2} \sup_{f: \mathcal{X} \to [-1, 1]} \left| \int f \ d \mu - \int f  \ d \nu \right|
\end{equation*}
  3. If $\pi$ is a stationary distribution for a Markov chain with kernel $Q$, then $||Q^t(x, \cdot) - \pi||_{\rm{TV}}$ is non-increasing in $t$, i.e.

    \begin{equation*}
||Q^t(x, \cdot) - \pi||_{\rm{TV}} \le ||Q^{t - 1}(x, \cdot) - \pi||_{\rm{TV}}
\end{equation*}

    for $t \in \mathbb{N}$.

  4. More generally, letting

    \begin{equation*}
\big( \mu Q \big)(A) = \int \mu (dx) Q(x, A)
\end{equation*}

    we always have

    \begin{equation*}
 || \mu Q - \nu Q ||_{\rm{TV}} \le || \mu - \nu||_{\rm{TV}}
\end{equation*}
  5. If $\mu$ and $\nu$ have densities $g$ and $h$, respectively, wrt. to some sigma-finite measure $\rho$, then

    \begin{equation*}
|| \mu - \nu ||_{\rm{TV}} = \frac{1}{2} \int_{\mathcal{X}} \big( M - m \big) d \rho = 1 - \int_{\mathcal{X}} m \ d \rho
\end{equation*}

    with

    \begin{equation*}
M = \max \left\{ g, h \right\}, \quad m = \min \left\{ g, h \right\}
\end{equation*}
  6. Given probability measures $\mu$ and $\nu$, there are jointly defined random variables $X$ and $Y$ such that

    \begin{equation*}
X \sim \mu \quad \text{and} \quad Y \sim \nu
\end{equation*}

    and

    \begin{equation*}
P \big( X = Y \big) = 1 - ||\mu - \nu||_{\rm{TV}}
\end{equation*}

From roberts04_gener_state_space_markov_chain_mcmc_algor we also have the following, important theorem:

If a Markov chain on a state space with countably generated sigma-algebra is $\phi\text{-irreducible}$ and aperiodic, and has a stationary distribution $\pi$, then for $\pi\text{-a.e.}$ $x \in \mathcal{X}$,

\begin{equation*}
\lim_{n \to \infty} ||Q^n(x, \cdot) - \pi||_{\rm{TV}} = 0
\end{equation*}

In particular,

\begin{equation*}
\lim_{n \to \infty} Q^n(x, A) = \pi(A), \quad \forall A \subseteq \mathcal{X}, \quad A \text{ measurable}
\end{equation*}

We also introduce the notion of Harris recurrent:

We say a Markov chain is Harris recurrent if for all $B \subseteq \mathcal{X}$ with $\pi(B) > 0$ (i.e. all subsets of positive measure), and all $x \in \mathcal{X}$, the chain will eventually reach $B$ from $x$ with probability 1, i.e.

\begin{equation*}
P \big( \exists n: X_n \in B \mid X_0 = x \big) = 1
\end{equation*}

This notion is stronger than the notion of irreducibility introduced earlier.

Convergence rates

Uniform Ergodicity

From this theorem we get asymptotic convergence to stationarity, but no information about the rate of convergence. One "qualitative" convergence rate property is the following:

A Markov chain having stationary distribution $\pi$ is uniformly ergodic if

\begin{equation*}
||Q^n(x, \cdot) - \pi||_{\mathrm{TV}} \le M \alpha^n, \quad n = 1, 2, 3, \dots
\end{equation*}

for some $\alpha < 1$ and $M < \infty$.

For further developments of ensuring uniform ergodicity, we need to define

A subset $C \subseteq \mathcal{X}$ is small (or $(t_0, \varepsilon, \nu)\text{-small}$) if there exist a positive integer $t_0$, $\varepsilon > 0$, and a probability measure $\nu$ on $\mathcal{X}$ s.t. the following minorisation condition holds:

\begin{equation*}
Q^{t_0}(x, \cdot) \ge \varepsilon \nu, \quad x \in C
\end{equation*}

i.e. $Q^{t_0}(x, A) \ge \varepsilon \nu(A)$ for all $x \in C$ and all measurable $A \subseteq \mathcal{X}$.

Geometric ergodicity

A Markov chain with stationary distribution $\pi$ is geometrically ergodic if

\begin{equation*}
\norm{Q^n(x, \cdot) - \pi(\cdot)}_{\mathrm{TV}} \le M(x) \alpha^n, \quad n = 1, 2, 3, \dots
\end{equation*}

for some $\alpha < 1$ where $M(x) < \infty$ for $\pi \text{-a.e.}$ $x \in \mathcal{X}$.

The difference between uniform ergodicity and geometric ergodicity is that in geometric ergodicity the constant $M(x)$ may depend on the initial state $x \in \mathcal{X}$.

  • In practice
    • Useful to use the split $\hat{R}$

In practice

Diagnostics

Effective sample size (ESS)

(split) $\hat{R}$ statistic
  • Consider running $m / 2$ chains, each for $2n$ steps, from different initializations $x \in \mathcal{X}$, where $m \ge 2$ is even
  • Then consider splitting each chain in half, resulting in two half-chains for each chain, leaving us with a total of $2 \cdot \frac{m}{2} = m$ chains
    • Reason: If a chain indeed produces independent samples from the stationary distribution, then at the very least the first half and the second half of the chain should display similar behavior (for sufficiently large $n$)

You might wonder, why not split in 3? Or 4? or, heck, $m$?!

Likely answer: "You could, but be sensible!" Very unsatisfying…

gelman2013bayesian

Let $\psi$ be the scalar estimand, i.e. the variable we're trying to estimate. Let $\psi_{ij}$, for $i = 1, \dots, n$ and $j = 1, \dots, m$, denote the samples of the variable of interest $\psi$ obtained from $m$ MCMC chains with different initializations, each of length $n$.

We denote the (B)etween- and (W)ithin-sequence variances

\begin{equation*}
\begin{split}
  B &= \frac{n}{m - 1} \sum_{j = 1}^{m} \big( \overline{\psi}_{\cdot j} - \overline{\psi}_{\cdot \cdot} \big)^2 \\
  W &= \frac{1}{m} \sum_{j=1}^{m} s_j^2
\end{split}
\end{equation*}

where

  • empirical mean within a chain

    \begin{equation*}
\overline{\psi}_{\cdot j} = \frac{1}{n} \sum_{i=1}^{n} \psi_{ij}
\end{equation*}
  • empirical mean across all chains

    \begin{equation*}
\overline{\psi}_{\cdot \cdot} = \frac{1}{m} \sum_{j=1}^{m} \overline{\psi}_{\cdot j} = \frac{1}{nm} \sum_{j=1}^{m} \sum_{i=1}^{n} \psi_{ij}
\end{equation*}
  • empirical variance within each chain

    \begin{equation*}
s_j^2 := \frac{1}{n - 1} \sum_{i = 1}^{n} \big( \psi_{ij} - \overline{\psi}_{\cdot j} \big)^2
\end{equation*}

Notice that $B$ contains a factor of $n$ because it is based on the variance of the within-sequence means $\overline{\psi}_{\cdot j}$, each of which is an average of $n$ values $\psi_{ij}$.

We can then estimate $\Var(\psi \mid y)$, the marginal posterior variance of the estimand, by a weighted average of $W$ and $B$, namely

\begin{equation*}
\widehat{\Var}^{ + }(\psi \mid y) = \frac{n - 1}{n} W + \frac{1}{n} B
\end{equation*}

Finally, we can then monitor convergence of the iterative simulation by estimating the factor by which the scale of the current distribution for $\psi$ might be reduced if the simulations were continued in the limit $n \to \infty$. This potential scale reduction is estimated by

\begin{equation*}
\hat{R} = \sqrt{\frac{\widehat{\Var}^{ + }(\psi \mid y)}{W}}
\end{equation*}

which has the property that

\begin{equation*}
\lim_{n \to \infty} \hat{R} = 1
\end{equation*}

(remember that $\widehat{\Var}^{ + }(\psi \mid y)$ depend on $n$).

If

  • $\hat{R}$ is large → reason to believe that further simulations may improve our inference about the target distribution of the associated scalar estimand
  • $\hat{R}$ is small → ???
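A minimal implementation of the split-$\hat{R}$ computation above (my own sketch; the "bad" example has two chains stuck in a different region, which should push $\hat{R}$ well above 1):

import numpy as np

def split_rhat(chains):
    # chains: shape (m/2, 2n); split each chain in half -> m chains of length n
    m_half, two_n = chains.shape
    n = two_n // 2
    psi = chains.reshape(2 * m_half, n)

    mean_j = psi.mean(axis=1)                 # within-chain means
    B = n * mean_j.var(ddof=1)                # between-chain variance
    W = psi.var(axis=1, ddof=1).mean()        # within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 2_000))                    # well-mixed "chains"
bad = good + np.array([[0.0], [0.0], [3.0], [3.0]])   # two chains offset by 3
print(split_rhat(good), split_rhat(bad))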

Specific distributions & processes

Hazard function

The hazard function / failure rate $h(x)$ is defined

\begin{equation*}
h(x) = \frac{P(x)}{S(x)} = \frac{P(x)}{1 - F(x)}
\end{equation*}

where $P$ denotes the PDF, $F$ the CDF, and $S(x) := 1 - F(x)$ the survival function.
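For example, for an exponential distribution the hazard is constant:

\begin{equation*}
P(x) = \lambda e^{-\lambda x}, \quad F(x) = 1 - e^{-\lambda x} \quad \implies \quad h(x) = \frac{\lambda e^{-\lambda x}}{e^{-\lambda x}} = \lambda
\end{equation*}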

TODO Renewal processes

  • Let $\tau_k$ with $k \in \mathbb{N}$ be a sequence of waiting times between $(k - 1) \text{-th}$ and the $k \text{-th}$ event.
  • Arrival time of the $m\text{-th}$ event is

    \begin{equation*}
a_m = \sum_{k=1}^{m} \tau_k, \quad m = 1, 2, \dots
\end{equation*}
  • Denote $N$ by the total number of events in $[0, t)$
  • If $t$ is fixed, then $N_t = N(t)$ is the count variable we wish to model
    • It follows that

        \begin{equation*}
N_t < m \quad \iff \quad a_m \ge t
\end{equation*}

      i.e. total number of events is less than $m$ if and only if the arrival time of the $m \text{-th}$ event is greater than $t$, which makes sense

  • If $F_m$ is the distribution function of $a_m$, we have

    \begin{equation*}
P(N_t < m) = P(a_m \ge t) = 1 - F_m(t)
\end{equation*}
  • Furthermore

    \begin{equation*}
\begin{split}
  P \big( N_t = m \big) &= P(N_t < m + 1) - P(N_t < m) \\
  &= F_m(t) - F_{m + 1}(t) \\
  &= P_m(t)
\end{split}
\end{equation*}
    • This is the fundamental relationship between the count variable $N$ and the timing process $a_m$
  • If $\tau_k \overset{\text{i.i.d.}}{\sim} f(\tau)$, the process is called a renewal process
  • In this case, we can extend the above equation to the following recursive relationship:

    \begin{equation*}
\begin{split}
  P_{m + 1}(t) &= \int_{0}^{t} F_m(t - u) \dd{F(u)} - \int_{0}^{t} F_{m + 1}(t - u) \dd{F(u)} \\
  &= \int_{0}^{t} P_m(t - u) \dd{F(u)}
\end{split}
\end{equation*}

    where we have

    \begin{equation*}
P_0(u) = S(u) = 1 - F(u)
\end{equation*}
    • Intuitively: the probability of exactly $m + 1$ events occurring by time $t$ is the probability that the first event occurs at time $0 \le u < t$, and that exactly $m$ events occur in the remaining time interval, integrated over all times $u$.
    • Evaluating this integral, we can obtain $P_1(t), \dots, P_m(t)$ from the above recursive relationship.

Statistics

Notation

Definitions

Efficiency

The efficiency of an unbiased estimator $\hat{\theta}$ is the ratio of the minimum possible variance to $\text{Var}(\hat{\theta})$.

An unbiased estimator with efficiency equal to 1 is called efficient or a minimum variance unbiased estimator (MVUE).

Statistic

Suppose a random vector $\mathbf{Y}$ has distribution function in a parametric family $\{ F(\mathbf{y} \mid \theta); \theta \in \Theta \}$ and realized value $\mathbf{y}$.

A statistic is a r.v. $T = t(\mathbf{Y})$ which is a function of $\mathbf{Y}$ that does not depend on $\theta$. Its realized value is $t = t(\mathbf{y})$.

A statistic $S$ is said to be sufficient for $\theta$ if the distribution of $\mathbf{Y}$ given $S$ does not depend on $\theta$, i.e.

\begin{equation*}
f_{\mathbf{Y} \mid S} (\mathbf{y} \mid s, \theta) = f_{\mathbf{Y} \mid S} (\mathbf{y} \mid s)
\end{equation*}

Further, we say $S$ is a minimal sufficient statistic if it's the smallest / least complex "proxy" for $\theta$.

Observe that,

  1. if $S$ is sufficient for $\theta$, so is any one-to-one function of $S$
  2. $\mathbf{Y}$ is trivially sufficient

We say a quantity is pivotal if its distribution does not depend on any unknown parameters.

E.g. if we are considering a normal distribution $\mathcal{N}(\mu, \sigma^2)$ with $\sigma$ known, then $\overline{X} - \mu$ is pivotal, since its distribution $\mathcal{N}(0, \sigma^2 / n)$ is fully known. If we didn't know $\sigma$, it would not be pivotal.

Let $\mathbf{Y} = (Y_1, \dots, Y_n) \sim f(\mathbf{y} \mid \theta)$.

Then a statistic $s = s(Y_1, \dots, Y_n)$ is sufficient for $\theta$ if and only if there exist functions $h$ of $\mathbf{y}$ and $g$ of $(s, \theta)$ such that

\begin{equation*}
f( \mathbf{y} \mid \theta) = L(\theta; \mathbf{y}) = g \Big( s(\mathbf{y}), \theta \Big) \cdot h(\mathbf{y}), \quad \forall \theta \in \Theta, \mathbf{y} \in \mathcal{Y}
\end{equation*}

where $L(\theta; \mathbf{y})$ denotes the likelihood.

U-statistic

Let $X_1, X_2, \dots$ be independent observations on a distribution $F$.

Consider a "parametric function" $\theta = \theta(F)$ for which there is an unbiased estimator. That is, $\theta(F)$ may be represented as

\begin{equation*}
\theta(F) = \mathbb{E}_F \big[ h(X_1, \dots, X_m) \big] = \int \dots \int h(x_1, \dots, x_m) \dd{F(x_1)} \cdots \dd{F(x_m)}
\end{equation*}

for some function $h = h(x_1, \dots, x_m)$, called a kernel.

Without loss of generality, we may assume $h$ is symmetric; otherwise we could simply replace it by its symmetrization

\begin{equation*}
h(x_1, \dots, x_m) \mapsto \frac{1}{m!} \sum_{\pi \in \text{Perm(m)}}^{} h \Big(x_{\pi(1)}, \dots, x_{\pi(m)} \Big)
\end{equation*}

For any kernel $h$, the corresponding U-statistic for estimation of $\theta$ on the basis of a sample $X_1, \dots, X_n$ of size $n \ge m$ is obtained by averaging the kernel $h$ symmetrically over the observations:

\begin{equation*}
U_n = U(X_1, \dots, X_n) = \frac{1}{{n \choose m}} \sum_{C_m^n}^{} h \Big( X_{i_1}, \dots, X_{i_m} \Big)
\end{equation*}

where $\sum_{C_m^n}^{}$ denotes the summation over ${n \choose m}$ combinations of $m$ distinct elements $\{ i_1, \dots, i_m \}$ from $\{ 1, \dots, n \}$.

Clearly, $U_n$ is then an unbiased estimate of $\theta$.

To conclude, a U-statistic is a statistic which is an unbiased estimator of the corresponding $\theta$, where $\theta$ can be written as stated above.
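For instance (a standard example, sketched numerically): with the kernel $h(x_1, x_2) = (x_1 - x_2)^2 / 2$ and $m = 2$, the U-statistic is exactly the unbiased sample variance.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
x = rng.normal(size=200)

h = lambda a, b: 0.5 * (a - b) ** 2     # symmetric kernel, m = 2
u = np.mean([h(x[i], x[j]) for i, j in combinations(range(len(x)), 2)])

print(u, x.var(ddof=1))                 # the two agree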

V-statistic

Statistics that can be represented as functionals $T(F_n)$ of the empirical distribution function, $F_n$, are called statistical functionals.

A V-statistic is a statistical function (of a sample) defined by a particular statistical functional of a probability distribution.

Suppose $x_1, \dots, x_n$ is a sample. In typical applications the statistical function has a representation as the V-statistic

\begin{equation*}
V_{mn} = \frac{1}{n^m} \sum_{i_1 = 1}^{n} \cdots \sum_{i_m = 1}^{n} h(x_{i_1}, x_{i_2}, \dots, x_{i_m})
\end{equation*}

where $h$ is a symmetric kernel function.

$V_{mn}$ is called a V-statistic of degree $m$.

Seems very much like a form of bootstrap estimate, does it not?

Informally, the type of asymptotic distribution of a statistical function depends on the order of "degeneracy," which is determined by which term is the first non-vanishing term in the Taylor expansion of the functional $T$.

Quantiles

Let $X$ be a random variable with cumulative distribution function (CDF) $F$, i.e.

\begin{equation*}
F(x) = P \left\{ X \le x \right\}
\end{equation*}

Let $x_{\alpha}$ be such that

\begin{equation*}
F^{-1}(\alpha) = x_{\alpha}
\end{equation*}

for some $\alpha \in [0, 1]$.

Then we say that $x_{\alpha}$ is called the $\alpha$ quantile of $X$, i.e. the value such that

\begin{equation*}
P \left\{ X \le x_{\alpha} \right\} = \alpha
\end{equation*}

We then say:

  • $F^{-1}(0.5)$ is the median; half the probability mass on each side
  • $F^{-1}(0.25)$ is the lower quartile
  • $F^{-1}(0.75)$ is the upper quartile

If $F$ is strictly monotonically increasing, then $F^{-1}$ exists, hence we can compute the quantiles!
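
For distributions with a known CDF, $F^{-1}$ is typically available directly; scipy calls it ppf, the percent-point function:

  from scipy import stats

  # Quantiles of the standard normal via the inverse CDF (ppf)
  print(stats.norm.ppf(0.5))    # median: 0.0
  print(stats.norm.ppf(0.25))   # lower quartile: about -0.6745
  print(stats.norm.ppf(0.975))  # 0.975 quantile: about 1.96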

Convergence in law

A sequence of cdfs $H_n$ is said to converge to $H$ iff $H_n(x) \longrightarrow H(x)$ on all continuity points of $H$.

We say that if a random variable $Y_n$ has cdf $H_n$ and the rv $Y$ has cdf $H$, then $Y_n$ converges in law to $Y$, and we write

\begin{equation*}
Y_n \overset{\mathcal{L}}{\longrightarrow} Y
\end{equation*}

This does not mean that $Y_n$ and $Y$ are arbitrarily close as random variables. Consider the random variables $U \sim U(0, 1)$ and $1 - U$: they have the same distribution, so $1 - U$ trivially converges in law to $U$, yet $|U - (1 - U)| = |2U - 1|$ is not small.
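
A quick simulation of this caveat (a sketch, assuming numpy):

  import numpy as np

  rng = np.random.default_rng(0)
  u = rng.uniform(size=100_000)

  # Same distribution: the quantiles match...
  print(np.quantile(u, [0.25, 0.5, 0.75]), np.quantile(1 - u, [0.25, 0.5, 0.75]))
  # ...but |U - (1 - U)| = |2U - 1| is typically not small:
  print(np.mean(np.abs(2 * u - 1)))  # about 0.5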

Let $Y_n$ and $Y$ be random variables. Then

\begin{equation*}
Y_n \overset{p}{\longrightarrow} Y \quad \iff \quad \forall \varepsilon > 0, \quad  P( \left| Y_n - Y \right| \le \varepsilon) \longrightarrow 1
\end{equation*}

where $Y_n \overset{p}{\longrightarrow} Y$ means converges in probability.

Or, equivalently

\begin{equation*}
Y_n \overset{p}{\longrightarrow} Y \quad \iff \quad \forall \varepsilon > 0, P( \left| Y_n - Y \right| > \varepsilon) \longrightarrow 0
\end{equation*}

This is the notion used by WLLN.

If $k_n Y_n \overset{\mathcal{L}}{\longrightarrow} H$ where $H$ is the limit distribution and $k_n \longrightarrow \infty$ then

\begin{equation*}
Y_n \overset{p}{\longrightarrow} 0
\end{equation*}

Let $\left\{ X_n \right\}_{n = 1}^{\infty}$ be a sequence of random variables, and $X$ be a random variable.

Then

\begin{equation*}
X_n \overset{\text{a.s.}}{\longrightarrow} X \quad \iff \quad \lim_{n \to \infty} X_n \overset{\text{a.s.}}{=} X
\end{equation*}

This is the notion of convergence used by SLLN.

Kurtosis

The kurtosis of a random variable is the standardized 4th central moment, i.e.

\begin{equation*}
\text{Kurt}[X] = \kappa(X) = \mathbb{E} \Bigg[ \bigg( \frac{X - \mu}{\sigma} \bigg)^4 \Bigg] = \frac{\mu_4}{\sigma^4} = \frac{\mathbb{E}[(X - \mu)^4]}{\big(\mathbb{E}[(X - \mu)^2]\big)^2}
\end{equation*}

where $\mu_4$ denotes the 4th central moment and $\sigma$ is the std. dev.

Words

A collection of random variables is homoskedastic if all of these random variables have the same finite variance.

This is also known as homogeneity of variance.

A collection of random variables is heteroskedastic if there are sub-populations that have different variability from others.

Thus heteroskedasticity is the absence of homoskedasticity.

Consistency

Loosely speaking, an estimator $T_n$ of parameter $\theta$ is said to be consistent, if it converges in probability to the true value of the parameter:

\begin{equation*}
T_n \overset{P}{\to} \theta
\end{equation*}

Or more rigorously, suppose $\{ p_{\theta} : \theta \in \Theta \}$ is a family of distributions (the parametric model) and $X^{\theta} = \{ X_1, X_2, \dots : X_i \sim p_{\theta} \}$ is an infinite sample from the distribution $p_{\theta}$.

Let $\{ T_n(X^{\theta}) \}$ be a sequence of estimators for some parameter $g(\theta)$. Usually $T_n$ will be based on the first $n$ observations of a sample. Then this sequence is said to be (weakly) consistent if

\begin{equation*}
T_n \big( X^{\theta} \big) \overset{P}{\to} g(\theta), \quad \forall \theta \in \Theta
\end{equation*}

Jeffreys Prior

The Jeffreys prior is a non-informative prior distribution for a parameter space, defined as:

\begin{equation*}
p \Big(\boldsymbol{\theta} \Big) \propto \sqrt{\det \mathcal{I} \Big(\boldsymbol{\theta}\Big)}
\end{equation*}

It has the key feature that it is invariant under reparametrization.
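
As a standard worked example, for the Bernoulli likelihood the Fisher information is $\mathcal{I}(\theta) = \frac{1}{\theta(1 - \theta)}$, so

\begin{equation*}
p(\theta) \propto \theta^{-1/2} (1 - \theta)^{-1/2}
\end{equation*}

i.e. the Jeffreys prior for a Bernoulli parameter is $\text{Beta}(1/2, 1/2)$. Computing the Jeffreys prior in any reparametrization, e.g. the log-odds, gives the same distribution pushed through the change of variables, which is exactly the invariance property above.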

Moment generating function

For r.v. $X$ the moment generating function is defined as

\begin{equation*}
M_x(t) = \mathbb{E} [e^{tX}], \quad t \in \mathbb{R}
\end{equation*}

Taking the derivative wrt. $t$ we observe that

\begin{equation*}
\frac{\partial M_x}{\partial t} = \mathbb{E}[X e^{tX}]
\end{equation*}

Letting $t = 0$, we get

\begin{equation*}
\frac{\partial M_x}{\partial t} \bigg|_{t = 0} = \mathbb{E}[X]
\end{equation*}

i.e. the mean. Taking the derivative $n$ times and evaluating at $t = 0$ gives $\mathbb{E}[X^n]$, which is why $M_x(t)$ is useful.
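
A quick symbolic check of this (a sketch, assuming sympy), using the known MGF $M(t) = e^{\mu t + \sigma^2 t^2 / 2}$ of $\mathcal{N}(\mu, \sigma^2)$:

  import sympy as sp

  t, mu, sigma = sp.symbols("t mu sigma", positive=True)
  M = sp.exp(mu * t + sigma**2 * t**2 / 2)  # MGF of N(mu, sigma^2)

  # n-th derivative at t = 0 gives E[X^n]
  print(sp.diff(M, t, 1).subs(t, 0))             # mu
  print(sp.expand(sp.diff(M, t, 2).subs(t, 0)))  # mu**2 + sigma**2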

Distributions

Binomial

For a single trial, i.e. a Bernoulli random variable, parametrized here so that $\theta$ is the probability of failure (consistent with the negative binomial below):

\begin{equation*}
P\{X = x\} = \theta^{1 - x} (1 - \theta)^x, \quad x \in \{0, 1\}
\end{equation*}

or equivalently,

\begin{equation*}
P\{X = x\} = \begin{cases}
  (1 - \theta) & x = 1 \\
  \theta & x = 0
\end{cases}
\end{equation*}

Negative binomial

A negative binomial random variable represents the number of successes $k$ before obtaining $r$ failures, where $\theta$ is the probability of failure on each of these Bernoulli trials.

\begin{equation*}
p(k; r) = {k + r - 1 \choose k} (1 - \theta)^k \theta^r
\end{equation*}
Derivation

Let $K$ denote the rv. for # of successes and $R$ the number of failures.

Suppose we have a run of $K = k$ successes and $R = 1$ failure, then

\begin{equation*}
P\{K = k | R = 1\} = (1 - \theta)^k \theta
\end{equation*}

where the failure is of course the last thing that happens before we terminate the run.

Now, suppose that the run above is followed up by $r - 1$ failures, i.e. $R = r$ but we have a specific ordering. Then, letting $X_i \in \{0, 1\}$ denote the result of the $i$-th Bernoulli trial,

\begin{equation*}
P\{X_1 = 1, X_2 = 1, \dots, X_k = 1, X_{k + 1} = 0, \dots, X_{k + r} = 0 \} = (1 - \theta)^k \theta^r
\end{equation*}

But for $r - 1$ of the failures, we don't actually care when they happen (only the last trial must be a failure), so we don't want a full ordering. That is, there are ${k + r - 1 \choose r - 1}$ sequences of the form described above which are acceptable.

Hence we get the pmf

\begin{equation*}
\begin{split}
  P\{K = k | R = r\} = {k + r - 1 \choose r - 1} \theta^r (1 - \theta)^k
\end{split}
\end{equation*}
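
A numerical sanity check of the pmf (a sketch, assuming scipy; note scipy's nbinom counts "failures before the $r$-th success" with success probability $p$, which matches the pmf above if we identify scipy's successes with our failures, i.e. $p = \theta$):

  from scipy import stats
  from scipy.special import comb

  k, r, theta = 4, 3, 0.6
  pmf = comb(k + r - 1, k) * (1 - theta)**k * theta**r
  print(pmf, stats.nbinom.pmf(k, r, theta))  # the two agree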

Gamma distribution
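
For reference, with shape $\alpha > 0$ and rate $\beta > 0$ the density is

\begin{equation*}
f(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{- \beta x}, \quad x > 0
\end{equation*}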

Chi-squared distribution

The chi-squared distribution with $k$ degrees of freedom is the distribution of a sum of the squares of $k$ independent standard normal random variables.

It's a special case of the gamma distribution: $\chi_k^2 = \text{Gamma}(k / 2, 1 / 2)$ in the shape–rate parametrization.

T-distribution

The T-distribution arises when the unknown $\sigma$ in a normal model is replaced by the Bessel-corrected sample standard deviation.

Why?

Our population-model is as follows:

\begin{equation*}
X_1, X_2, \dots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)
\end{equation*}

Let

\begin{equation*}
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\end{equation*}

then let

\begin{equation*}
S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2
\end{equation*}

be the Bessel-corrected variance. Then the random variable

\begin{equation*}
\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)
\end{equation*}

and the random variable

\begin{equation*}
\frac{\bar{X} - \mu}{S / \sqrt{n}}
\end{equation*}

(where $S$ has been substituted for $\sigma$ ) has a Student's t-distribution with $n - 1$ degrees of freedom.

Note that the numerator and the denominator in the preceding expression are independent random variables; this follows from the independence of $\bar{X}$ and $S^2$ for normal samples (e.g. by Cochran's theorem).
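
A small simulation of this (a sketch, assuming numpy/scipy): for small $n$ the studentized mean has visibly heavier tails than $\mathcal{N}(0, 1)$.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  n, reps, mu, sigma = 5, 100_000, 0.0, 1.0

  x = rng.normal(mu, sigma, size=(reps, n))
  t = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))

  # Tail probability beyond 2: empirical vs t_{n-1} vs N(0, 1)
  print(np.mean(np.abs(t) > 2))        # close to ...
  print(2 * stats.t.sf(2, df=n - 1))   # ... the t_4 tail, about 0.116
  print(2 * stats.norm.sf(2))          # the N(0, 1) tail, about 0.046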

F-distribution

A random variate of the F-distribution with parameters $d_1$ and $d_2$ arises as the ratio of two appropriately scaled chi-squared variates:

\begin{equation*}
X = \frac{U_1 / d_1}{U_2 / d_2}
\end{equation*}

where $U_1$ and $U_2$ are independent chi-squared random variables with $d_1$ and $d_2$ degrees of freedom, respectively.

Power laws

Notation
  • $x_{\text{min}}$ such that we have a power law $\forall x \ge x_{\text{min}}$, in such cases we say that the tail of the distribution follows a power law
Definition

A continuous power-law is one described by probability density $p(x)$ such that

\begin{equation*}
p(x) \ dx = \text{Pr} \left\{ x \le X \le x + dx \right\} = C x^{-\alpha} \ dx
\end{equation*}

where $X$ is the observed value and $C$ is the normalization constant. Clearly this diverges as $x \to 0$, hence it cannot hold for all $x \ge 0$; there must be a lower bound on the power-law behavior.

Provided $\alpha > 1$, we have

\begin{equation*}
p(x) = \frac{\alpha - 1}{x_{\text{min}}} \Bigg( \frac{x}{x_{\text{min}}} \Bigg)^{- \alpha}
\end{equation*}

A discrete power-law is defined by

\begin{equation*}
p(x) = \text{Pr} \left\{ X = x \right\} = C x^{- \alpha}
\end{equation*}

Again, this diverges at zero, hence there must be some lower bound $x_{\text{min}} > 0$ on the power-law behaviour:

\begin{equation*}
p(x) = \frac{x^{- \alpha}}{\zeta(\alpha, x_{\text{min}})}
\end{equation*}

where

\begin{equation*}
\zeta(\alpha, x_{\text{min}}) = \sum_{n = 0}^{\infty} \big( n + x_{\text{min}} \big)^{-\alpha}
\end{equation*}

is the generalized or Hurwitz zeta function.

Important

Sources: clauset07_power_law_distr_empir_data and so_you_think_you_have_a_power_law_shalizi

  1. Log-log plot being roughly linear is necessary but not sufficient
    1. Errors are hard to estimate because they are not well-defined by the usual regression formulas, which are based on assumptions that do not apply in this case: noise of logarithm is not Gaussian as assumed by linear regression. In continuous case, this can be made worse by choice of binning scheme used to construct histogram, hence we have an additional free parameter.
    2. A fit to a power-law distribution can account for a large fraction of the variance even when fitted data do not follow a power law, hence high values of $r^2$ ("explained variance by regression fit") cannot be taken as evidence in favor of the power-law form.
    3. Fits extracted by regression methods usually do not satisfy basic requirements on probability distributions, such as normalization and hence cannot be correct.
  2. Abusing linear regression. Standard methods, e.g. least-squares fitting are known to produce systematically biased estimates of parameters for power-law distributions and should not be used in most circumstances
  3. Use MLE to estimate scaling exponent $\alpha$
  4. Use goodness of fit to estimate scaling regions.
    • In some cases only parts of the data actually follow a power law.
    • Method based on Kolmogorov-Smirnov goodness-of-fit statistic
  5. Use goodness-of-fit test to check goodness of fit
  6. Use Vuong's test to check for alternative non-power laws (see likelihood_ratio_test_for_model_selection_and_non_nested_hypothesis_vyong89)
Fitting power laws
  • Continuous

    The MLE for the continuous case is

    \begin{equation*}
\hat{\alpha} = 1 + n \Bigg[ \sum_{i=1}^{n} \ln \bigg( \frac{x_i}{x_{\text{min}}} \bigg) \Bigg]^{-1}
\end{equation*}
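
    A minimal sketch of this estimator (assuming numpy; samples drawn by inverse-transform sampling, $x = x_{\text{min}} (1 - u)^{-1 / (\alpha - 1)}$):

      import numpy as np

      rng = np.random.default_rng(0)
      alpha, xmin, n = 2.5, 1.0, 100_000

      # Inverse-transform sampling from the continuous power law
      u = rng.uniform(size=n)
      x = xmin * (1 - u) ** (-1 / (alpha - 1))

      # Continuous MLE for the scaling exponent
      alpha_hat = 1 + n / np.sum(np.log(x / xmin))
      print(alpha_hat)  # close to 2.5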

Maximum Likelihood Estimation

Notation

  • $l(\theta)$ denotes the log-likelihood, i.e. $\log p(X \ | \ \theta)$
  • $I_\theta$ is the Fisher's information (expected information)
  • $J(\theta, \mathbf{y})$ is the observed information, i.e. the information without taking the expectation
  • $U(\theta) = \frac{dl}{d\theta}$ is called the score function
  • $\theta_0$ denotes the true (and thus unobserved) parameter
  • $\chi_p^2(\alpha)$ denotes the upper $\alpha$ critical value of the $\chi_p^2$ distribution, and similarly for other distributions

Approximate (asymptotic) variance of MLE

For large samples (and under certain regularity conditions) the (asymptotic) variance of the MLE $\hat{\theta}$ is given by

\begin{equation*}
\text{Var}(\hat{\theta}) \ge I_\theta^{-1} \text{ or } \text{Var}(\hat{\theta}) \approx I_\theta^{-1}
\end{equation*}

where

\begin{equation*}
I_\theta = \mathbb{E} \Bigg( \Bigg[ \frac{dl}{d\theta} \Bigg]^2 \Bigg) = - \mathbb{E} \Bigg( \frac{d^2 l}{d\theta^2} \Bigg)
\end{equation*}

where $I_\theta$ is called the Fisher information.

Estimated standard error

The estimated standard error of the MLE $\hat{\theta}$ of $\theta$ is given by

\begin{equation*}
ESE(\hat{\theta}) = \sqrt{I_{\hat{\theta}}^{-1}}
\end{equation*}
Regularity conditions

The lower bound for the variance of the MLE holds under the following conditions on the probability density function, $f(x; \theta)$, and the estimator $T(X)$:

  • The partial derivative

\begin{equation*}
\frac{\partial}{\partial \theta} \log f(x; \theta)
\end{equation*}

exists and is finite.

  • The operations of integration wrt. to $x$ and differentiation wrt. $\theta$ can be interchanged in the expectation of $T$; that is,
\begin{equation*}
\frac{\partial}{\partial \theta} \Bigg[ \int T(x) f(x;\theta) \ dx \Bigg] = \int T(x) \Big[ \frac{\partial}{\partial \theta} f(x; \theta) \Big] \ dx
\end{equation*}

whenever the RHS is finite. This condition can often be confirmed by using the fact that integration and differentiation can be swapped when either of the following is true:

  1. The function $f(x; \theta)$ has bounded support in $x$, and the bounds do not depend on $\theta$
  2. The function $f(x; \theta)$ has infinite support, is continuously differentiable, and the integral converges uniformly for all $\theta$
"Proof" (alternative, also single variable)
  • Notation
    • $\mathbb{E}_\theta$ denotes the expectation over the data $X$, assuming the data arises from the model specified by the parameter $\theta$
    • $\mathbb{E}_{\theta_0}$ is the same as above, but assuming $\theta = \theta_0$
    • $U(X, \theta) = \frac{d \ell}{d \theta}$ denotes the score, where we've made the dependence on the data $X$ explicit by including it as an argument
  • Stuff
    • Consistency of $\hat{\theta}$ as an estimator

      Suppose the true parameter is $\theta_0$, that is:

      \begin{equation*}
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} f(x \mid \theta_0)
\end{equation*}

      Then, for any $\theta \in \Omega$ (not necessarily $\theta_0$), the Law of Large Numbers implies the convergence in probability

      \begin{equation*}
\frac{1}{n} \sum_{i=1}^{n} \log f(X_i \mid \theta) \to \mathbb{E}_{\theta_0}[\log f(X \mid \theta)]
\end{equation*}

      Under suitable regularity conditions, this implies that the value of $\theta$ maximizing the LHS, which is $\hat{\theta}$, converges in probability to the value of $\theta$ maximizing RHS, which we claim is $\theta_0$.

      Indeed, for any $\theta \in \Omega$

      \begin{equation*}
\mathbb{E}_{\theta_0}[ \log f(X \mid \theta)] - \mathbb{E}_{\theta_0} [ \log f(X \mid \theta_0)] = \mathbb{E}_{\theta_0} \Bigg[ \log \frac{f(X \mid \theta)}{f(X \mid \theta_0)} \Bigg]
\end{equation*}

      Noting that $x \mapsto \log x$ is concave, Jensen's Inequality implies

      \begin{equation*}
\mathbb{E}[\log X] \le \log \mathbb{E}[X]
\end{equation*}

      for any positive random variable $X$, so

      \begin{equation*}
\begin{split}
  \mathbb{E}_{\theta_0}\Bigg[ \log \frac{f(X \mid \theta)}{f(X \mid \theta_0)} \Bigg] & \le \log \mathbb{E}_{\theta_0} \Bigg[ \frac{f(X \mid \theta)}{f(X \mid \theta_0)} \Bigg] \\
  & = \log \int \frac{f(x \mid \theta)}{f(x \mid \theta_0)} f(x \mid \theta_0) \ dx \\
  &= \log \int f(x \mid \theta_0) \ dx \\
  &= 0
\end{split}
\end{equation*}

      This establishes "consistency" of $\hat{\theta}$: the population objective $\theta \mapsto \mathbb{E}_{\theta_0}[\log f(X \mid \theta)]$ is maximized at $\theta = \theta_0$, while $\hat{\theta}$ maximizes its sample analogue.

    • Expectation and variance of score

      For $\theta \in \Omega$,

      \begin{equation*}
\mathbb{E}_{\theta}[U(X, \theta)] = 0, \qquad \text{Var}_{\theta} \big( U(X, \theta) \big) = - \mathbb{E}[U'(X, \theta)]
\end{equation*}

      First, for the expectation we have

      \begin{equation*}
\begin{split}
  \mathbb{E}_{\theta}[U] &= \int U(x, \theta) f(x; \theta) \ dx \\
  &= \int \Bigg( \frac{\partial \log f(x; \theta)}{\partial \theta} \Bigg) f(x; \theta) \ dx \\
  &= \int \Bigg( \frac{1}{f(x; \theta)} \frac{\partial f}{\partial \theta}  \Bigg) f(x; \theta) \ dx \\
  &= \int \frac{\partial f}{ \partial \theta} \ dx
\end{split}
\end{equation*}

      Assuming we can interchange the order of the derivative and the integral (justified by the regularity conditions above), we have

      \begin{equation*}
\begin{split}
  \mathbb{E}_{\theta}[U(X, \theta)] &= \frac{\partial}{\partial \theta} \int f(x; \theta) \ dx \\
  &= \frac{\partial}{\partial \theta} \big( 1 \big) \\
  &= 0
\end{split}
\end{equation*}

      For the variance, we can differentiate the above identity:

      \begin{equation*}
\begin{split}
  0 &= \frac{\partial}{\partial \theta} \mathbb{E}_{\theta}[U] \\
  &= \frac{\partial}{\partial \theta} \int U(x, \theta) f(x \mid \theta) \ dx \\
  &= \int U'(x, \theta) f(x \mid \theta) + U(x, \theta) \Big( \frac{\partial}{\partial \theta} f(x \mid \theta) \Big) \ dx \\
  &= \int U'(x, \theta) f(x \mid \theta) \ dx + \int U(x, \theta) \Big( U(x, \theta) f(x \mid \theta) \Big) \ dx \\
  &= \int U'(x, \theta) f(x \mid \theta) \ dx + \int U^2 f(x \mid \theta) \ dx \\
  &= \mathbb{E}_{\theta}[U'] + \mathbb{E}_{\theta}[U^2] \\
  &= \mathbb{E}_{\theta}[U'] + \text{Var}_{\theta}(U)
\end{split}
\end{equation*}

      where we've used the fact that $U = \frac{\partial \log f}{\partial \theta} = \frac{1}{f} \frac{\partial f}{\partial \theta}$ and $\mathbb{E}_{\theta}[U] = 0$, which implies that $\text{Var}_{\theta}(U) = \mathbb{E}_{\theta}[U^2]$.

      This is equivalent to

      \begin{equation*}
\text{Var}_{\theta}(U) = - \mathbb{E}_{\theta}[U']
\end{equation*}

      as wanted.

    • Asymptotic behavior

      Now, since $\hat{\theta}$ maximizes $\ell(\theta)$, we must have $0 = \ell'(\hat{\theta})$.

      Consistency of $\hat{\theta}$ ensures that $\hat{\theta}$ is close to $\theta_0$ (for large $n$, with high probability). Thus, we can apply a first-order Taylor expansion to the equation $0 = \ell'(\hat{\theta})$ about the point $\theta = \theta_0$:

      \begin{equation*}
0 \approx \ell'(\theta_0) + (\hat{\theta} - \theta_0) \ell''(\theta_0)
\end{equation*}

      Thus,

      \begin{equation*}
\sqrt{n} (\hat{\theta} - \theta_0) \approx - \sqrt{n} \frac{\ell'(\theta_0)}{\ell''(\theta_0)} = - \frac{\ell'(\theta_0) / \sqrt{n}}{\ell''(\theta_0) / n}
\end{equation*}

      For the denominator, we rewrite as

      \begin{equation*}
\begin{split}
  \frac{1}{n} \ell''(\theta_0) &= \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial \theta^2} \Big[ \log f(X_i \mid \theta) \Big]_{\theta = \theta_0} \\
  &= \frac{1}{n} \sum_{i=1}^{n} U'(X_i, \theta_0)
\end{split}
\end{equation*}

      and then, by the Law of Large Numbers again, we have

      \begin{equation*}
 \frac{1}{n} \sum_{i=1}^{n} U'(X_i, \theta_0) \overset{p}{\to} \mathbb{E}_{\theta_0}[U'(X, \theta_0)] = - I(\theta_0)
\end{equation*}

      in probability.

      For the numerator, due to what we showed earlier, we know that

      \begin{equation*}
X \sim f(x \mid \theta_0) \quad \implies \quad \mathbb{E}[U] = 0, \quad \text{Var}(U) = I(\theta_0) 
\end{equation*}

      We then have,

      \begin{equation*}
\frac{1}{\sqrt{n}} \ell'(\theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \Big[ \log f(X_i \mid \theta) \Big]_{\theta = \theta_0} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} U(X_i, \theta_0)
\end{equation*}

      and by the Central Limit Theorem, we have

      \begin{equation*}
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} U(X_i, \theta_0) \overset{d}{\to} \mathcal{N} \Big( 0, I(\theta_0) \Big)
\end{equation*}

      Finally, by Continuous Mapping Theorem and Slutsky's Lemma, we have

      \begin{equation*}
\sqrt{n} \big( \hat{\theta} - \theta_0 \big) \overset{d}{\to} \frac{1}{I(\theta_0)} \mathcal{N} \Big( 0, I(\theta_0) \Big) = \mathcal{N} \Big( 0, I^{-1}(\theta_0) \Big)
\end{equation*}
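
      A simulation sketch of this result (assuming numpy), for the exponential model where $\hat{\lambda} = 1 / \bar{X}$ and $I(\lambda) = 1 / \lambda^2$:

        import numpy as np

        rng = np.random.default_rng(0)
        lam, n, reps = 2.0, 500, 20_000

        x = rng.exponential(1 / lam, size=(reps, n))
        lam_hat = 1 / x.mean(axis=1)

        # Var(lam_hat) should be close to I(lam)^{-1} / n = lam^2 / n
        print(lam_hat.var(), lam**2 / n)  # both about 0.008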

Score test

For a random sample $Y_1, \dots, Y_n$ the total score

\begin{equation*}
U = \frac{dl}{d \theta}
\end{equation*}

is the sum of $n$ i.i.d. random variables. Thus, by the central limit theorem, it follows that as $n \to \infty$

\begin{equation*}
\frac{U(\theta_0) - 0}{\sqrt{I(\theta_0)}} \sim \mathcal{N}(0, 1)
\end{equation*}

and

\begin{equation*}
\frac{U^2(\theta_0)}{I(\theta_0)} \sim \chi_1^2
\end{equation*}

This can then be used as an asymptotic test of the null-hypothesis $H_0: \theta = \theta_0$.

Reject null-hypothesis $H_0: \theta = \theta_0$ if

\begin{equation*}
\frac{U(\theta_0) - 0}{\sqrt{I(\theta_0)}} > \mathcal{N}(0, 1)(\alpha)
\end{equation*}

for a suitable critical value corresponding to significance level $\alpha$, or equivalently,

\begin{equation*}
\frac{U^2(\theta_0)}{I(\theta_0)} > \chi_1^2(\alpha)
\end{equation*}
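
A sketch of the scalar score test for a binomial proportion (an assumed example; for $Y \sim \text{Bin}(n, \theta)$ the score is $U(\theta) = \frac{y}{\theta} - \frac{n - y}{1 - \theta}$ and $I(\theta) = \frac{n}{\theta (1 - \theta)}$):

  from scipy import stats

  n, y, theta0 = 100, 62, 0.5

  U = y / theta0 - (n - y) / (1 - theta0)  # score at theta_0
  I = n / (theta0 * (1 - theta0))          # Fisher information at theta_0

  stat = U**2 / I
  p_value = stats.chi2.sf(stat, df=1)
  print(stat, p_value)  # reject H0: theta = 0.5 at the 5% level if stat > 3.84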

Likelihood Ratio Tests

We expand the log-likelihood at the true parameter $\theta_0$ in a Taylor expansion about the MLE $\hat{\theta}$

\begin{equation*}
\begin{split}
\ell (\theta_0) & \approx \ell(\hat{\theta}) + (\theta_0 - \hat{\theta}) U(\hat{\theta}) - \frac{1}{2} (\theta_0 - \hat{\theta})^2 J_{\hat{\theta}} \\
& = \ell(\hat{\theta}) - \frac{1}{2} (\theta_0 - \hat{\theta})^2 J_{\hat{\theta}} \quad \text{since } U(\hat{\theta}) = 0 \\
\end{split}
\end{equation*}

Subtracting $\ell(\hat{\theta})$ from both sides and rearranging,

\begin{equation*}
- 2[ \ell(\theta_0) - \ell(\hat{\theta}) ] \approx (\hat{\theta} - \theta_0)^2 J_{\hat{\theta}}
\end{equation*}

And since $J_{\hat{\theta}} \to I_{\theta_0}$, we get

\begin{equation*}
- 2[ \ell(\theta_0) - \ell(\hat{\theta}) ] \approx (\hat{\theta} - \theta_0)^2 I_{\theta_0}
\end{equation*}

which means that, under $H_0$ and asymptotically, the scaled difference between the log-likelihoods can be considered to be a random variable drawn from a $\chi^2$ distribution

\begin{equation*}
- 2[ \ell(\theta_0) - \ell(\hat{\theta}) ] \sim \chi^2_1
\end{equation*}

and we define the term on the left side as the likelihood-ratio statistic.

The test rejects $H_0 : \theta = \theta_0$ if

\begin{equation*}
- 2[ \ell(\theta_0) - \ell(\hat{\theta}) ] > \chi^2_1(\alpha)
\end{equation*}

for a suitable significance / critical value $\alpha$.

The above is just saying that we reject the null hypothesis that $\theta = \theta_0$ if the statistic on the left is too large to plausibly be a draw from a chi-squared distribution.

Wald test

We use the asymptotic normality of the MLE,

\begin{equation*}
\hat{\theta} \overset{\text{approx.}}{\sim} \mathcal{N}(\theta_0, I_{\theta_0}^{-1})
\end{equation*}

That is, we reject the null-hypothesis $H_0 : \theta = \theta_0$ if

\begin{equation*}
(\hat{\theta} - \theta_0)^2 \, I_{\theta_0} > \chi_1^2(\alpha)
\end{equation*}

for some suitable significance / critical value $\alpha$.
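
For comparison with the score test sketch earlier, a sketch of the LR and Wald statistics for the same (assumed) binomial example, with $\hat{\theta} = y / n$:

  import numpy as np
  from scipy import stats

  n, y, theta0 = 100, 62, 0.5
  theta_hat = y / n

  def loglik(theta):
      return y * np.log(theta) + (n - y) * np.log(1 - theta)

  lr = -2 * (loglik(theta0) - loglik(theta_hat))
  # Wald statistic, with the information evaluated at theta_hat
  # (evaluating at theta_0 is also common)
  wald = (theta_hat - theta0) ** 2 * n / (theta_hat * (1 - theta_hat))

  print(lr, stats.chi2.sf(lr, df=1))      # LR statistic ~ chi^2_1 under H0
  print(wald, stats.chi2.sf(wald, df=1))  # Wald statistic ~ chi^2_1 under H0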

Generalization to multi-parameter case

Hypothesis tests
\begin{equation*}
\begin{split}
  \text{Likelihood ratio test} \quad & -2 [\ell (\boldsymbol{\theta}_0) - \ell( \hat{\boldsymbol{\theta}} )] > \chi_p^2(\alpha) \\
  \text{Score test} \quad & U'(\boldsymbol{\theta}_0) I^{-1} (\boldsymbol{\theta}_0) U(\boldsymbol{\theta}_0) > \chi_p^2(\alpha)     \\
  \text{Wald test (ML test)} \quad & (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)' I(\boldsymbol{\theta}_0) (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) > \chi_p^2(\alpha)
\end{split}
\end{equation*}
Confidence regions
\begin{equation*}
\begin{split}
 \text{Likelihood ratio} \quad & \{ \boldsymbol{\theta}: -2 [\ell(\boldsymbol{\theta}) - \ell (\hat{\boldsymbol{\theta}} ) ] < \chi_p^2(\alpha) \} \\
 \text{Score} \quad & \{ \boldsymbol{\theta}: U'(\boldsymbol{\theta})I^{-1}(\boldsymbol{\theta}) U(\boldsymbol{\theta}) < \chi_p^2(\alpha) \} \\
 \text{Wald (ML)} \quad & \{ \boldsymbol{\theta}: (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})' I(\boldsymbol{\theta}) (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) < \chi_p^2(\alpha) \}
\end{split}
\end{equation*}
Likelihood test for composite hypothesis

Let $\Omega$ denote the whole parameter space (e.g. $\mathbb{R}^p$). In general terms we have

\begin{equation*}
\begin{split}
  H_0 &: \boldsymbol{\theta} \in \omega \\
  H_1 &: \boldsymbol{\theta} \notin \omega
\end{split}
\end{equation*}

where $\omega \subset \Omega$.

The general likelihood ratio test compares the maximum likelihood attainable if $\boldsymbol{\theta}$ is restricted to lie in the restricted subspace $\omega$ (i.e. under 'reduced' model) with maximum likelihood attainable if $\boldsymbol{\theta}$ is unrestricted (i.e. under 'full' model):

\begin{equation*}
\begin{split}
  LR &= \frac{\max_{\boldsymbol{\theta} \in \omega} L(\boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Omega} L(\boldsymbol{\theta})} \\
   &= \frac{L( \hat{\boldsymbol{\theta}}_\omega)}{L( \hat{\boldsymbol{\theta}})}
\end{split}
\end{equation*}

where $\hat{\boldsymbol{\theta}}$ is the unrestricted MLE of $\boldsymbol{\theta}$ and $\hat{\boldsymbol{\theta}}_\omega$ is the restricted MLE of $\boldsymbol{\theta}$.

Some authors define the general likelihood ratio test with the numerator and denominator swapped; the two conventions are equivalent once the rejection region is adjusted accordingly.

Iterative methods

The MLE is often found numerically. The Newton-Raphson update uses the Hessian $H$ of the log-likelihood (equivalently the observed information $J = -H$):

\begin{equation*}
\boldsymbol{\theta}_{r + 1} = \boldsymbol{\theta}_r - \Big( H(\boldsymbol{\theta}_r) \Big)^{-1} U(\boldsymbol{\theta}_r) = \boldsymbol{\theta}_r + J^{-1}(\boldsymbol{\theta}_r) U(\boldsymbol{\theta}_r)
\end{equation*}

The method of scoring (Fisher scoring) replaces the Hessian by its expectation, i.e. uses the expected (Fisher) information:

\begin{equation*}
\boldsymbol{\theta}_{r + 1} = \boldsymbol{\theta}_r - \Bigg( \mathbb{E} \Big[ H(\boldsymbol{\theta}_r) \Big] \Bigg)^{-1} U(\boldsymbol{\theta}_r) = \boldsymbol{\theta}_r + I^{-1}(\boldsymbol{\theta}_r) U(\boldsymbol{\theta}_r)
\end{equation*}
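
A sketch of Fisher scoring for a Cauchy location model (assumptions: unit scale, so that the score is $U(\theta) = \sum_i \frac{2 (x_i - \theta)}{1 + (x_i - \theta)^2}$ and $I(\theta) = n / 2$):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.standard_cauchy(1000) + 3.0  # true location 3.0
  n = len(x)

  theta = np.median(x)  # sensible starting point
  for _ in range(20):
      r = x - theta
      U = np.sum(2 * r / (1 + r**2))  # score
      theta = theta + (2 / n) * U     # Fisher scoring step: I(theta)^{-1} = 2/n
  print(theta)  # close to 3.0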

Simple Linear Regression (Statistical Methodology stuff)

Correlation coefficient and coefficient of determination

The (Pearson) correlation coefficient is given by

\begin{equation*}
r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}
\end{equation*}

and the coefficient of determination

\begin{equation*}
R^2 = r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}}
\end{equation*}
Least squares estimates
\begin{equation*}
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\end{equation*}
Residual sum of squares
\begin{equation*}
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = S_{yy} - \frac{S_{xy}^2}{S_{xx}}
\end{equation*}

with the estimated error variance being

\begin{equation*}
\hat{\sigma}^2 = \frac{RSS}{n - 2}
\end{equation*}
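
A sketch tying these formulas together (assuming numpy; $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$, and similarly for $S_{xx}$, $S_{yy}$):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.uniform(0, 10, size=50)
  y = 2.0 + 0.7 * x + rng.normal(0, 1, size=50)

  Sxx = np.sum((x - x.mean()) ** 2)
  Syy = np.sum((y - y.mean()) ** 2)
  Sxy = np.sum((x - x.mean()) * (y - y.mean()))

  b1 = Sxy / Sxx                   # slope
  b0 = y.mean() - b1 * x.mean()    # intercept
  rss = Syy - Sxy**2 / Sxx         # residual sum of squares
  sigma2_hat = rss / (len(x) - 2)  # estimated error variance
  r2 = Sxy**2 / (Sxx * Syy)        # coefficient of determination

  print(b0, b1, sigma2_hat, r2)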

Laplace Approximation

Overview

Under certain regularity conditions, we can approximate a posterior pdf / pmf as a Normal distribution, that is

\begin{equation*}
h(\theta) \approx \mathcal{N} \Big(\hat{\theta}, \big( - q''(\hat{\theta}) \big)^{-1} \Big)
\end{equation*}

where

  • $q(\theta) = \log h(\theta)$ i.e. the log-pdf, with $q''(\theta)$ being the second-order derivative wrt. $\theta$
  • $h(\theta)$ is the posterior we want to approximate

"Derivation"

For any pdf that is smooth and well peaked around its point of maxima, we can use the Laplace approximation to approximate a posterior distribution by a Gaussian pdf.

It follows from taking the second-order Taylor expansion on the logarithm of the pdf.

Let $\hat{\theta}$ denote the point of maxima of a pdf $h(\theta)$; then it is also the point of maxima of the log-pdf $q(\theta) = \log h(\theta)$ and we can write

\begin{equation*}
\begin{split}
  q(\theta) &\approx q(\hat{\theta}) + (\theta - \hat{\theta}) q'(\hat{\theta}) + \frac{1}{2} (\theta - \hat{\theta})^2 q''(\hat{\theta}) \\
  &= q(\hat{\theta}) + 0 + \frac{1}{2} (\theta - \hat{\theta})^2 q''(\hat{\theta}) \\
  &= \text{const} - \frac{(\theta - \tilde{a})^2}{2 \tilde{b}^2}
\end{split}
\end{equation*}

where in the second step we've used the fact that $q'(\hat{\theta}) = 0$ since $\hat{\theta}$ is a maximum, and finally let

\begin{equation*}
\tilde{a} = \hat{\theta}, \qquad \tilde{b}^2 = \big( -q''(\hat{\theta}) \big)^{-1}
\end{equation*}

(notice that $q''(\hat{\theta}) < 0$ since $\hat{\theta}$ is a maximum)

But the above is simply the log-pdf of a $\mathcal{N}(\tilde{a}, \tilde{b}^2)$, hence the pdf $h(\theta)$ is approx. normal.

Hence, if we can find $\hat{\theta}$ and compute the second-order derivative $q''(\hat{\theta})$ of the log-pdf, we can use the Laplace approximation.

Guarantees

Consider the model $X_1, \dots, X_n \overset{\text{IID}}{\sim} g(x_i | \theta)$, $\theta \in \Theta$.

Under some regularity conditions on the pdfs/pmfs $g(\cdot \mid \theta)$, including that they all have the same support and that for each $x_i$, $\theta \mapsto \log g(x_i \mid \theta)$ is twice continuously differentiable, we have that for any prior $\xi(\theta)$ which is positive, bounded and twice differentiable over $\Theta$ (the parameter space),

\begin{equation*}
\sup_z \Big| \Pr(\theta \le z \mid X = x) - \Phi \Big( \big( -q''(\hat{\theta}) \big)^{1 / 2} (z - \hat{\theta}) \Big) \Big| \approx 0
\end{equation*}

for large $n$.

Under the same regularity conditions it turns out that $\hat{\theta} \approx \hat{\theta}_{\text{MLE}}(x)$ and that $- q''(\hat{\theta}) \approx I_{\text{obs}}(x)$.
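
A sketch for a Beta posterior (assumptions: $h(\theta) = \text{Beta}(a, b)$, so $q(\theta) = \text{const} + (a - 1)\log\theta + (b - 1)\log(1 - \theta)$, with mode $\hat{\theta} = \frac{a - 1}{a + b - 2}$):

  import numpy as np
  from scipy import stats

  a, b = 7, 15  # e.g. posterior after 6 successes, 14 failures and a uniform prior

  theta_hat = (a - 1) / (a + b - 2)                            # mode of q
  q2 = -(a - 1) / theta_hat**2 - (b - 1) / (1 - theta_hat)**2  # q''(theta_hat)
  approx = stats.norm(theta_hat, np.sqrt(-1 / q2))

  # Compare the exact posterior density with its Laplace approximation
  print(stats.beta.pdf(0.35, a, b), approx.pdf(0.35))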

Point estimators

Notation

Cramér-Rao lower bound

Notation
  • $U = \hat{\theta}$
  • $V = \frac{\partial l}{\partial \theta}$
Theorem

To state the Cramér-Rao inequality, we need to establish some regularity conditions:

  1. $\forall \theta_1, \theta_2 \in \Theta$ such that $\theta_1 \ne \theta_2$ we have $f(x \mid \theta_1) \ne f(x \mid \theta_2)$, called identifiability
  2. $\forall \theta \in \Theta$ we have $f(x \mid \theta)$ have common support
  3. $\Theta$ is an open set
  4. $\frac{\partial f(x \mid \theta)}{\partial \theta}$ exists
  5. $\mathbb{E} \bigg( \frac{\partial \log f(\mathbf{X} \mid \theta)}{\partial \theta} \bigg)^2 < \infty$

Where we remember that $I(\theta) = \mathbb{E} \bigg( \frac{\partial \log f(\mathbf{X} \mid \theta)}{\partial \theta} \bigg)^2$ is the Fisher's information.

Let $X_1, \dots, X_n$ denote a random sample from $f(x \mid \theta)$ and suppose that $\hat{\theta}$ is an estimator of $\theta$. Then, subject to the above regularity conditions,

\begin{equation*}
\text{Var}(\hat{\theta}) \ge \frac{\big( 1 + \frac{\partial b}{\partial \theta} \big)^2}{I_{\theta}}
\end{equation*}

where

\begin{equation*}
b(\theta) = \text{bias}(\hat{\theta}) \quad \text{and} \quad I_\theta = \mathbb{E} \Bigg[ \bigg( \frac{\partial \ell}{\partial \theta} \bigg)^2 \Bigg]
\end{equation*}
  1. For unbiased $\hat{\theta}$, the lower bound simplifies to $\text{Var}(\hat{\theta}) \ge I_\theta^{-1}$
  2. Regularity conditions are needed to change the order of differentiation and integration in the proof given below.
  3. Proof is for continuous r.v.s, but there is a similar proof for discrete r.v.s.
\begin{equation*}
\begin{split}
  \mathbb{E}[\hat{\theta}] &= \int \dots \int \hat{\theta}(x_1, \dots, x_n) \bigg( \prod_{i = 1}^n f(x_i \mid \theta) \bigg) \ d \mathbf{x} \\
  &= \int \dots \int \hat{\theta}(x_1, \dots, x_n) L(\theta; \mathbf{x}) \ d \mathbf{x}
\end{split}
\end{equation*}

From the definition of bias we have

\begin{equation*}
\theta + b = \mathbb{E} [ \hat{\theta}] = \int \dots \int \hat{\theta} L(\theta; \mathbf{x}) \ d \mathbf{x}
\end{equation*}

Differentiating both sides wrt. $\theta$ gives (using the regularity conditions)

\begin{equation*}
1 + \frac{\partial b}{\partial \theta} = \int \dots \int \hat{\theta} \frac{\partial L}{\partial \theta} \ d \mathbf{x}
\end{equation*}

since $\hat{\theta}$ does not depend on $\theta$. Since $l = \ln (L)$ we have

\begin{equation*}
\frac{\partial l}{\partial \theta} = \frac{\partial \ln (L)}{\partial \theta} = \frac{1}{L} \frac{\partial L}{\partial \theta} \quad \implies \quad \frac{\partial L}{\partial \theta} = L \frac{\partial l}{\partial \theta}
\end{equation*}

Thus,

\begin{equation*}
1 + \frac{\partial b}{\partial \theta} = \int \dots \int \hat{\theta} \frac{\partial l}{\partial \theta} \ L \ d \mathbf{x} = \mathbb{E} \Bigg[ \hat{\theta} \frac{\partial l}{\partial \theta} \Bigg]
\end{equation*}

Now use the result that for any two r.v.s $U$ and $V$,

\begin{equation*}
\text{Cov}(U, V)^2 \le \text{Var}(U) \text{Var}(V)
\end{equation*}

and let

\begin{equation*}
U = \hat{\theta}, \quad V = \frac{\partial l}{\partial \theta}
\end{equation*}

Then

\begin{equation*}
\begin{split}
  \mathbb{E}[V] &= \int \dots \int \frac{\partial l}{\partial \theta} \ L \ d \mathbf{x} = \int \dots \int \frac{\partial L}{\partial \theta} \ d \mathbf{x} \\
  &= \frac{\partial}{\partial \theta} \Bigg( \int \dots \int L \ d \mathbf{x} \Bigg) \quad \text{(using regularity conditions)} \\
  &= \frac{\partial}{\partial \theta} 1 = 0
\end{split}
\end{equation*}

Hence

\begin{equation*}
\text{Cov}(U, V) = \mathbb{E}[UV] = 1 + \frac{\partial b}{\partial \theta}
\end{equation*}

Similarly,

\begin{equation*}
\text{Var}(V) = \mathbb{E}[V^2] = \mathbb{E} \Bigg[ \bigg( \frac{\partial l}{\partial \theta} \bigg)^2 \Bigg] = I_\theta
\end{equation*}

and since

\begin{equation*}
\text{Var}(U) = \text{Var}(\hat{\theta})
\end{equation*}

we obtain the Cramér-Rao lower bound as

\begin{equation*}
\text{Var}(\hat{\theta}) \ge \frac{\big( \text{Cov}(U, V) \big)^2}{\text{Var}(V)} = \frac{\big( 1 + \frac{\partial b}{\partial \theta} \big)^2}{I_\theta}
\end{equation*}

The Cramér-Rao lower bound will only be useful if it is attainable or at least nearly attainable.

Rao-Blackwell Theorem

Theorem

Let $Y_1, Y_2, \dots, Y_n$ be a random sample of observations from a distribution with p.d.f. $f(y \mid \theta)$. Suppose that $S$ is a sufficient statistic for $\theta$ and $\hat{\theta}$ is any unbiased estimator for $\theta$.

Define $\hat{\theta}_S = \mathbb{E}[ \hat{\theta} \mid S]$. Then,

  1. $\hat{\theta}_S$ is a function of $S$ only
  2. $\mathbb{E}[\hat{\theta}_S] = \theta$
  3. $\text{Var}(\hat{\theta}_S) \le \text{Var}(\hat{\theta})$
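
A classic worked example: for $Y_i \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(\theta)$, take the crude unbiased estimator $\hat{\theta} = Y_1$ and the sufficient statistic $S = \sum_{i} Y_i$. Then, by symmetry,

\begin{equation*}
\hat{\theta}_S = \mathbb{E}[Y_1 \mid S] = P(Y_1 = 1 \mid S) = \frac{S}{n} = \bar{Y}
\end{equation*}

i.e. Rao-Blackwellization turns $Y_1$ into the sample mean, whose variance $\frac{\theta(1 - \theta)}{n}$ is $n$ times smaller than $\text{Var}(Y_1) = \theta(1 - \theta)$.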

Non-parametric statistics

Maximum Mean Discrepancy (MMD)

Let $\mathcal{F}$ be a class of functions $f: \mathcal{X} \to \mathbb{R}$ and let:

  • $p, q$ be Borel probability measures
  • $X := \{ x_1, \dots, x_m \}$ and $Y := \{ y_1, \dots, y_n \}$ are i.i.d. observations from $p$ and $q$, respectively

We define the maximum mean discrepancy (MMD) as

\begin{equation*}
\text{MMD}[ \mathcal{F}, p, q] := \sup_{f \in \mathcal{F}} \big( \mathbb{E}_x \big[ f(x) \big] - \mathbb{E}_{y} \big[ f(y) \big] \big)
\end{equation*}

In the statistics literature, this is an instance of an integral probability metric.

A biased empirical estimate of the $\text{MMD}$ is obtained by replacing the population expectations with empirical expectations computed on the samples $X$ and $Y$:

\begin{equation*}
\text{MMD}_b[\mathcal{F}, X, Y] := \sup_{f \in \mathcal{F}} \bigg( \frac{1}{m} \sum_{i=1}^{m} f(x_i) - \frac{1}{n} \sum_{i=1}^{n} f(y_i) \bigg)
\end{equation*}

We must therefore identify a function class that is rich enough to uniquely identify whether $p = q$, yet restrictive enough to provide useful finite sample estimates.

An RKHS $\mathcal{F}$ is characteristic if

\begin{equation*}
\mathrm{MMD}(P, Q; \mathcal{F}) = 0 \quad \iff \quad P = Q
\end{equation*}
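
A sketch of the biased empirical MMD for the RKHS of a Gaussian kernel (assumption: bandwidth $\sigma = 1$), using the closed form $\text{MMD}_b^2 = \frac{1}{m^2}\sum_{i,j} K(x_i, x_j) - \frac{2}{mn}\sum_{i,j} K(x_i, y_j) + \frac{1}{n^2}\sum_{i,j} K(y_i, y_j)$:

  import numpy as np

  def gaussian_kernel(a, b, sigma=1.0):
      d2 = (a[:, None] - b[None, :]) ** 2  # pairwise squared distances (1-d data)
      return np.exp(-d2 / (2 * sigma**2))

  def mmd2_biased(x, y):
      Kxx, Kyy = gaussian_kernel(x, x), gaussian_kernel(y, y)
      Kxy = gaussian_kernel(x, y)
      return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()

  rng = np.random.default_rng(0)
  x = rng.normal(0, 1, 500)
  print(mmd2_biased(x, rng.normal(0, 1, 500)))  # same distribution: near 0
  print(mmd2_biased(x, rng.normal(1, 1, 500)))  # different means: clearly > 0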

Wavelets

The problem with the Fourier transform is that, even though we can extract the frequencies, we do not know at which time these frequencies occur. This can be incredibly important in cases where the signal has "switches".

Idea is to

  • Start with low frequency / high wavelength
  • Slide over signal to extract low frequency
  • Subtract high frequency away from signal
  • Take higher frequency / smaller wavelength
  • Slide over signal to extract higher frequency

By subtracting the low frequency we're removing the overlap of the low frequency and high frequency without interfering with the high frequency signal! Thus it's possible to extract both low and high frequencies.

A mother wavelet is some function $\psi(t)$ such that we can scale and translate to create new wavelets $\psi_{a, b}(t)$:

\begin{equation*}
\psi_{a, b}(t) = \frac{1}{\sqrt{a}} \psi \bigg( \frac{t - b}{a} \bigg), \quad a,b \in \mathbb{R}, a \ne 0
\end{equation*}

where

  • $a$ is scaling
  • $b$ is translation

And the Fourier transform is then

\begin{equation*}
\hat{\psi}_{a, b}(\omega) = \frac{1}{\sqrt{|a|}} e^{- i b \omega} \hat{\psi}(a \omega)
\end{equation*}

The condition required by a wavelet on $\hat{\psi}$ is that:

\begin{equation*}
C_{\psi} = \int_{-\infty}^{\infty} \frac{\left| \hat{\psi}(\omega) \right|^2}{\left| \omega \right|} \ d \omega < \infty
\end{equation*}
Continuous Wavelet Transform
\begin{equation*}
W_{\psi} [f] (a, b) = \left\langle f, \psi_{a, b} \right\rangle = \int_{-\infty}^{\infty} f(t) \bar{\psi}_{a, b} (t) \ dt
\end{equation*}

and to invert we have

\begin{equation*}
f(t) = \frac{1}{C_{\psi}} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \big( W_{\psi} f \big) (a, b)  \ \frac{\psi_{a, b}(t)}{a^2} \ db \ da
\end{equation*}

Suppose $\psi$ is a wavelet, and $\phi$ is an integrable and bounded function; then $\psi * \phi$ is a wavelet.

Wavelet basis

A mother wavelet $\psi$ is such that $\{ \psi_{m, n} \}$ form an orthogonal basis for some subspace of $L^2$, hence

\begin{equation*}
\begin{split}
  f(x) &= \sum_{m, n = - \infty}^{\infty} c_{m, n} \psi_{m, n} (x) \\
  \text{where } \quad c_{m, n} &= \left\langle f, \psi_{m, n} \right\rangle
\end{split}
\end{equation*}

converges to $f$ in the $L^2$ norm!

Multi-Resolution Analysis (MRA)
  • Overview
    • Algorithm for constructing the different resolutions

    We consider wavelets $\{ \psi_{m, n} \}$ constructed from some mother wavelet $\psi$:

    \begin{equation*}
\psi_{m, n}(x) = 2^{m / 2} \psi( 2^m x - n)
\end{equation*}

    and we want to expand our signal in such a wavelet basis.

    Consider sequence of subspaces

    \begin{equation*}
\left\{ V_m : m \in \mathbb{Z} \right\}
\end{equation*}

    in $L^2(\mathbb{R})$, with the following properties:

    1. Nested

      \begin{equation*}
\dots \subset V_{- 2} \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \subset \dots
\end{equation*}
    2. Union

      \begin{equation*}
\bigcup_{m = - \infty}^{\infty} V_m 
\end{equation*}

      is dense in $L^2(\mathbb{R})$

    3. Trivial intersection

      \begin{equation*}
\bigcap_{m = - \infty}^{\infty} V_m = \left\{ 0 \right\}
\end{equation*}
    4. Increased "resolution"

      \begin{equation*}
f(x) \in V_m \iff f(2 x ) \in V_{m + 1}
\end{equation*}
    5. $\exists \phi \in V_0$ such that $\{ \phi_{0, n} = \phi(x - n) \}$ is an orthogonal basis in $V_0$
      • $\phi$ is called the scaling function or father wavelet

    Let $m = 0$ be the "standard" scale where our mother wavelet $\psi$ and father wavelet $\phi$ live, i.e. $\phi = \phi_{0, 0} \in V_0$.

    We can map $\phi$ to $V_1$ by

    \begin{equation*}
\begin{split}
  \phi(x) &= \sum_{n}^{} c_n \phi_{1, n} (x) \\
  &= \sqrt{2} \sum_{n}^{} c_n \phi(2x - n)
\end{split}
\end{equation*}

    which is called the dilation or two-scale or refinement equation.

    We can repeat this argument for arbitrary $m$, so with $V_m \subset V_{m + 1}$, we write

    \begin{equation*}
\begin{split}
  V_{m + 1} &= V_m \oplus W_m, \quad V_m \perp W_m \\
  V_{m + 1} &= \big( V_{m - 1} \oplus W_{m - 1} \big) \oplus W_m \\
  & \vdots \\
  V_{m + 1} &= V_0 \oplus \Big( \bigoplus_{j = 0}^{m} W_j \Big)
\end{split}
\end{equation*}

    Then,

    \begin{equation*}
\lim_{m \to \infty} V_0 \oplus \Big( \bigoplus_{j = 0}^{m} W_j \Big) = L^2 (\mathbb{R})
\end{equation*}

    which means

    \begin{equation*}
\bigoplus_{m = -\infty}^{\infty} W_m = L^2 (\mathbb{R})
\end{equation*}

    Finally, this tells us that $\exists \psi \in W_0$ (the mother wavelet) such that

    \begin{equation*}
\psi_{m, n}(x) = 2^{m / 2} \psi(2^m x - n)
\end{equation*}

    constitutes an orthogonal basis for $W_{m}$.

    If $\{ V_m \}$ is a multi-resolution analysis (MRA) with scaling function $\phi$, then there exists a mother wavelet $\psi$ defined

    \begin{equation*}
\psi(x) = \sqrt{2} \sum_{n = -\infty}^{\infty} \big( -1 \big)^{n -1} \bar{c}_{n - 1} \phi(2 x - n)
\end{equation*}

    where

    \begin{equation*}
c_n = \left\langle \phi, \phi_{1, n} \right\rangle = \sqrt{2} \int_{-\infty}^{\infty} \phi(x) \bar{\phi}(2x - n) \ dx
\end{equation*}

    which allows us to obtain an orthonormal basis $\{ \psi_{m, n} \}$ for a subspace $W \subset L^2(\mathbb{R})$ which is dense in $L^2(\mathbb{R})$:

    \begin{equation*}
\psi_{m, n}(x) = 2^{m / 2} \psi(2^m x - n)
\end{equation*}
Examples of wavelets
  • Haar wavelet
    \begin{equation*}
\Psi(t) =
\begin{cases}
  1 & \text{if } 0 < t \le \frac{1}{2} \\
  -1 & \text{if } \frac{1}{2} < t < 1 \\
  0 & \text{otherwise}
\end{cases}
\end{equation*}
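
    A one-level discrete Haar transform sketch (assuming numpy): pairwise averages capture the coarse approximation (the $V$ part) and pairwise differences the detail (the $W$ part).

      import numpy as np

      def haar_step(signal):
          """One level of the (orthonormal) discrete Haar transform."""
          s = np.asarray(signal, dtype=float).reshape(-1, 2)
          approx = (s[:, 0] + s[:, 1]) / np.sqrt(2)  # low-frequency part
          detail = (s[:, 0] - s[:, 1]) / np.sqrt(2)  # high-frequency part
          return approx, detail

      x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0])
      a, d = haar_step(x)
      print(a)  # coarse trend
      print(d)  # local fluctuations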
Application

For more information about how and when to apply wavelet transforms, and which mother wavelet to use, check out e.g. MATLAB's documentation on this.

Testing

Notation

  • $\tilde{C}$ and $C$ denotes the acceptance and critical region, respectively
  • $c$ denotes the critical value of some statistic $t(\mathbf{Y})$
  • Critical region with a given significance level $\alpha$ is denoted

    \begin{equation*}
R_{\alpha}: P \big( t(\mathbf{Y}) \ge c \mid \theta_0 \big) = \alpha
\end{equation*}
  • Type I error: Reject $H_0$ when $H_0$ is true with

    \begin{equation*}
\alpha = P \big( t(\mathbf{Y}) \ge c \mid H_0 \big)
\end{equation*}
  • Type II error : Fail to reject $H_0$ when $H_0$ is false (equiv., we accept $H_0$ when we shouldn't):

    \begin{equation*}
\beta = P \big( t(\mathbf{Y}) < c \mid H_1 \big)  
\end{equation*}
  • $\phi$ denotes the test function:

    \begin{equation*}
\phi(y) = 
\begin{cases}
  1 & \text{if } t(\mathbf{y}) \in C \\
  0 & \text{if } t(\mathbf{y}) \in \tilde{C}
\end{cases}
\end{equation*}
  • $\mathbb{E}_{\theta} \left[ \phi(\mathbf{Y}) \right] = \mathbb{E} \left[ \phi(\mathbf{Y}) \mid \theta \right] = \mathbb{E}_{p_{\theta}} \left[ \phi(\mathbf{Y}) \right]$ all denote the expectation of $\phi(\mathbf{Y})$ when $\mathbf{Y} \sim p_{\theta}$, i.e. under the distribution parametrized by $\theta$.

Definitions

A hypothesis is simple if it defines $F_{\mathbf{y}}$ completely:

\begin{equation*}
H_0 : F_{\mathbf{y}} = F_0
\end{equation*}

otherwise, it's composite.

If $F_{\mathbf{y}}$ is parametric with more than one parameter, a composite hypothesis might specify values of some or all of them (e.g. on regression coefficient).

U-statistics form a class of unbiased statistics; this class arises naturally in producing minimum-variance unbiased estimators (MVUEs).

Important: in statistics there are TWO notions for a U-statistic

Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions.

Thus, nonparametric tests or distribution-free tests are procedures for hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed.

Let the data $\mathbf{Y} \sim f(y \mid \theta)$ and suppose we wish to test two simple hypotheses:

\begin{equation*}
H_0 : \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta = \theta_1
\end{equation*}

Suppose that we choose a test statistic $t(\mathbf{Y})$ and reject $H_0$ if $t(\mathbf{Y}) \ge c$ for some critical value $c$.

This induces a partition of the sample space $\mathcal{Y}$ into two disjoint regions:

  • the rejection region $C$:

    \begin{equation*}
C := \left\{ \mathbf{y} : t(\mathbf{y}) \ge c \right\}  
\end{equation*}
  • the acceptance region $\tilde{C}$:

    \begin{equation*}
\tilde{C} := \left\{ \mathbf{y} : t(\mathbf{y}) &lt; c \right\}  
\end{equation*}

We will also sometimes use the notation $R_{\alpha}$ to denote the critical region which has the property

\begin{equation*}
R_{\alpha}: P \big( t(\mathbf{Y}) \ge c \mid \theta_0 \big) = \alpha
\end{equation*}

Consider $H_0$ and $H_1$ with corresponding p.d.f.'s $f_0$, $f_1$ for $\mathbf{Y}$.

For these hypotheses, comparison between the critical regions of different tests is in terms of

\begin{equation*}
P( \mathbf{Y} \in R_{\alpha} \mid H_1)
\end{equation*}

the power of a size $\alpha$ critical region $R_{\alpha}$ for alternative $H_1$.

A best critical region is then the critical region with maximum power.

There are two possible types of error:

  • Type I error: Reject $H_0$ when $H_0$ is true
  • Type II error: Fail to reject $H_0$ when $H_0$ is false
  • $\alpha$ denotes the probability of Type I error and is called the significance level (or size ) of the test
  • $\beta$ denotes the probability of Type II error which is uniquely defined only if $H_1$ is simple, in which case

    \begin{equation*}
\eta = 1 - \beta
\end{equation*}

    denotes the power of the test. For composite $H_1$, $\eta(\theta)$ is the power function.

We can define the test function $\phi(\mathbf{y})$ such that

\begin{equation*}
\phi(\mathbf{y}) = 
\begin{cases}
  1 & \text{if } t(\mathbf{y}) \in C \text{ (rejected)}\\
  0 & \text{if } t(\mathbf{y}) \in \tilde{C} \text{ (accepted)}
\end{cases}
\end{equation*}

which has the property that

\begin{equation*}
\begin{split}
  \mathbb{E}_{H_0} \Big( \phi(\mathbf{Y}) \Big) & = \alpha \\
  \mathbb{E}_{H_1} \Big( \phi(\mathbf{Y}) \Big) & = \eta
\end{split}
\end{equation*}

For discrete distributions, the probability that the test statistic lies on the boundary of the critical region, $\partial C$, may be non-zero.

In that case, it is sometimes necessary to use a randomized test, for which the test function is

\begin{equation*}
\phi(y) = 
\begin{cases}
  1 & \text{if } t(\mathbf{y}) \in C \\
  \gamma(\mathbf{y}) & \text{if } t(\mathbf{y}) \in \partial C \\
  0 & \text{if } t(\mathbf{y}) \in \tilde{C}
\end{cases}
\end{equation*}

for some function $\gamma(\mathbf{y})$ and we reject $H_0$ based on observed data $\mathbf{y}$ with probability $\phi(\mathbf{y})$.

Suppose now there is a parametric family $\{ f(\mathbf{y} \mid \theta) \ : \theta \in \Theta_1 \}$ of alternative p.d.f.'s for $\mathbf{y}$.

The power of a size $\alpha$ critical region $R_{\alpha}$ generalizes to the size $\alpha$ power function

\begin{equation*}
\begin{split}
  \text{pow}(\theta \ ; \ \alpha) & = \mathbb{P} \big( \mathbf{Y} \in R_{\alpha} \mid \theta \big) \\
  &= \int_{R_{\alpha}} f(\mathbf{y} \mid \theta) \ d \mathbf{y} \quad \Bigg( \text{or} \sum_{R_{\alpha}} f(\mathbf{y} \mid \theta) \Bigg) \quad ( \theta \in \Theta_1)
\end{split}
\end{equation*}

A size $\alpha$ critical region $R_{\alpha}$, is then uniformly most powerful size $\alpha$ (UMP size $\alpha$) if it has maximum power uniformly over $\Theta_1$ ($\alpha$ is NOT power, $\eta$ is).

A test is UMP if all its critical regions are UMP, or more formally:

A uniformly most powerful or UMP test, $\phi_0(\mathbf{Y})$, of size $\alpha$ is a test $t(\mathbf{Y})$ for which

  1. $\mathbb{E}_{\theta} \big[ \phi_0 (\mathbf{Y}) \big] \le \alpha, \quad \forall \theta \in \Theta_0$
  2. Given any other test $\phi(\cdot)$ for which $\mathbb{E}_{\theta} \big[ \phi(\mathbf{Y}) \big] \le \alpha$ for all $\theta \in \Theta_0$, we have

    \begin{equation*}
\mathbb{E}_{\theta} \big[ \phi_0(\mathbf{Y}) \big] \ge \mathbb{E}_{\theta} \big[  \phi(\mathbf{Y}) \big], \quad \forall \theta \in \Theta_1
\end{equation*}

    i.e. the expectation under $H_1$ is at least as large as for the less powerful test (whose test function is $\phi$).

A test $\phi(\mathbf{y})$ of $H_0 : \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ is called unbiased of size $\alpha$ if

\begin{equation*}
\sup_{\theta \in \Theta_0} \mathbb{E} \left[ \phi(\mathbf{Y}) \mid \theta \right] \le \alpha
\end{equation*}

and

\begin{equation*}
\mathbb{E} \left[ \phi(\mathbf{Y}) \mid \theta \right] \ge \alpha, \quad \forall \theta \in \Theta_1
\end{equation*}

Informally, an unbiased test is one which has a higher probability of rejecting $H_0$ when it is false than when it is true.

A test which is uniformly most powerful among the set of all unbiased tests is called uniformly most powerful unbiased (UMPU).

Two-sample test

The problem of comparing samples from two probability distributions, by a statistical tests of the null hypothesis that these distributions are equal against the alternative hypothesis that these distributions are different (this is called the two-sample problem), i.e. in short:

  • $H_0$ corresponds to distributions being equal
  • $H_1$ corresponds to distributions being different
Exact test

A test is exact if and only if

\begin{equation*}
P(\text{p-value} \le \alpha \mid H_0) = \alpha
\end{equation*}

which is in contrast to non-exact tests, which only have the property

\begin{equation*}
P( \text{p-value} \le C_{\alpha} \mid H_0) = \alpha
\end{equation*}

for some "critical constant" $C_{\alpha}$.

Hypothesis testing

For any size $\alpha$, the LR critical region is the best critical region for testing simple hypothesis $H_0$ vs. $H_1$.

That is, suppose one is performing a hypothesis test between two simple hypotheses $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$, using the likelihood-ratio test with threshold $T$, which rejects $H_0$ in favour of $H_1$ at a significance level of

\begin{equation*}
\alpha = P \Big( \Lambda(x) \le T \Big)
\end{equation*}

where

\begin{equation*}
\Lambda(x) = \frac{\mathcal{L}(\theta_0 \mid x)}{\mathcal{L}(\theta_1 \mid x)}
\end{equation*}

Then, the likelihood-ratio test based on $\Lambda(x)$ is the most powerful test at significance level $\alpha$ (this is the Neyman-Pearson lemma).

Bahadur efficiency

Notation
  • $H_0$ and $H_1$ denotes the null hypothesis and the alternative hypothesis
  • $\Theta_0$ and $\Theta_1$ are the parametric set corresponding to the $H_0$ and $H_1$, respectively
  • A test statistic $T_n = T_n(\mathbf{x})$ based on a random sample $\mathbf{x} = (x_1, \dots, x_n)$
  • $\theta \in \Theta_0$
  • $t \in \mathbb{R}$
  • $F_n(t \mid \theta) = \mathbb{P}_\theta \{ T_n < t \}$
  • $L_n(t \mid \theta) = 1 - F_n(t \mid \theta)$
  • p-value refers to the probability, under $H_0$, of observing a test statistic at least as extreme as the one observed; small values are evidence against $H_0$
  • $K(\eta, \theta)$ is the information number corresponding to $\mathbb{P}_{\eta}$ and $\mathbb{P}_{\theta}$
Stuff
  • Assume large values of $T_n$ give evidence against $H_0$
  • For fixed $\theta \in \Theta_0$ and a real number $t$, $F_n(t \mid \theta)$ and $L_n(t \mid \theta)$ are as defined in the notation above

Random quantity

\begin{equation*}
L_n \big( T_n(\mathbf{x}) \mid \theta_0 \big)
\end{equation*}

is the p-value corresponding to the statistic $T$ when $\theta_0$ is the true parametric value.

For example, if

\begin{equation*}
L_n \big( T_n(\mathbf{x}) \mid \theta_0 \big) < \alpha
\end{equation*}

then the null hypothesis $\Theta_0 = \{ \theta_0 \}$ is rejected at the significance level $\alpha$.

If for $\eta \in \Theta_1$, with $\mathbb{P}_{\eta}$ probability one, the limit

\begin{equation*}
d(\eta \mid \theta) = \lim_{n \to \infty} - 2 n^{-1} \log L_n \big( T_n(\mathbf{x}) \mid \theta \big)
\end{equation*}

exists, then $d( \eta \mid \theta)$ is called the Bahadur exact slope of $T$.

The larger the Bahadur exact slope, the faster the rate of decay of the p-value under the alternative. It is known that for any $T$, $d(\eta \mid \theta) \le 2 K(\eta, \theta)$ where $K(\eta, \theta)$ is the information number corresponding to $\mathbb{P}_{\eta}$ and $\mathbb{P}_{\theta}$.

A test statistic $T$ is called Bahadur efficient at $\eta$ if

\begin{equation*}
e_T(\eta) = \inf_\theta \frac{1}{2} d( \eta \mid \theta) = \inf_\theta K(\eta, \theta)
\end{equation*}

Bahadur efficiency allows one to compare two (sequences of) test statistics $T^{(1)}$ and $T^{(2)}$ from the following perspective:

Let $N_i, i = 1, 2$ be the smallest sample size required to reject $\Theta_0$ at the significance level $\alpha$ on the basis of a random sample $\mathbf{x} = (x_1, \dots)$ when $\eta$ is the true parametric value.

The ratio $\frac{N_2}{N_1}$ gives a measure of relative efficiency of $T^{(1)}$ to $T^{(2)}$.

To reduce the number of arguments, $\alpha$, $\mathbf{x}$ and $\eta$, one usually considers the rv. which is the limit of this ratio, as $\alpha \to 0$. In many situations this limit does not depend on $\mathbf{x}$, so it represents the efficiency of $T^{(1)}$ against $T^{(2)}$ at $\eta$ with the convenient formula

\begin{equation*}
\lim_{\alpha \to 0} \frac{N_2}{N_1} = \frac{d_1(\eta \mid \theta_0)}{d_2( \eta \mid \theta_0)}
\end{equation*}

where $d_1$ and $d_2$ are the corresponding Bahadur slopes.

I.e. can use the Bahadur efficiency to compare the number of samples needed to reject $H_0$.

Kolmogorov-Smirnov

As $n \to \infty$,

\begin{equation*}
D_n \overset{P}{\to} 0
\end{equation*}

where $D_n$ is the Kolmogorov-Smirnov statistic.

The empirical distribution function is given by

\begin{equation*}
\hat{F}_n(x) := \frac{1}{n} \sum_{i = 1}^{n} \mathbb{I}\{ X_i \le x \}
\end{equation*}
Consistency and unbiasedness at a point

Fix $x \in \mathbb{R}$, then

\begin{equation*}
n \hat{F}_n(x) \sim \text{Binomial} \big(n, F(x) \big)
\end{equation*}

Consequently,

\begin{equation*}
\mathbb{E} \big[ \hat{F}_n(x) \big] = \frac{1}{n} \mathbb{E} \big[ n \hat{F}_n(x) \big] = F(x)
\end{equation*}

That is, $\hat{F}_n(x)$ is an unbiased estimator of $F(x)$ for each fixed $x$. Also

\begin{equation*}
\text{Var}\big( \hat{F}_n(x) \big) = \frac{1}{n^2} \text{Var} \big( n \hat{F}_n(x) \big) = \frac{F(x) [1 - F(x)]}{n}
\end{equation*}

Consequently, by the Chebyshev inequality,

\begin{equation*}
\hat{F}_n(x) \overset{P}{\to} F(x)
\end{equation*}

Therefore, $\hat{F}_n(x)$ is a consistent estimator of $F(x)$.

Kolmogorov-Smirnov test

The Kolmogorov distribution is the distribution of the rv.

\begin{equation*}
K = \sup_{t \in [0, 1]} |B(t)|
\end{equation*}

where $B(t)$ is the Brownian bridge. The cumulative distribution function of $K$ is given by

\begin{equation*}
\text{Pr}(K \le x) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k - 1} e^{-2k^2 x^2} = \frac{\sqrt{2 \pi}}{x} \sum_{k=1}^{\infty} e^{- (2k - 1)^2 \pi^2 / (8 x^2)}
\end{equation*}

The empirical distribution function $F_n$ for i.i.d. observations $X_i$ is defined as

\begin{equation*}
F_n(x) = \frac{1}{n} \sum_{i = 1}^{n} I_{[- \infty, x]} (X_i)
\end{equation*}

where

  • $I_{[-\infty, x]}(X_i)$ is the indicator function, defined as

    \begin{equation*}
I_{[-\infty, x]}(X_i) = 
\begin{cases}
  1 & \text{if } X_i \le x \\
  0 & \text{otherwise}
\end{cases}
\end{equation*}

The Kolmogorov-Smirnov statistic for a given cumulative distribution function $F(x)$ is

\begin{equation*}
D_n = \sup_x | F_n(x) - F(x)|
\end{equation*}

Under $H_0$ that the sample comes from the hypothesized distribution $F(x)$

\begin{equation*}
\sqrt{n} D_n \overset{n \to \infty}{\longrightarrow} \sup_{t \in [0, 1]} | B (t) |
\end{equation*}

in distribution, where $B(t)$ is the Brownian bridge.

If $F$ is continuous, then under $H_0$ $\sqrt{n}D_n$ converges to the Kolmogorov distribution, which does not depend on $F$.

Kolmogorov-Smirnov test (K-S test) is a non-parametric test of the equality of continuous , one-dimensional probability distributions that can be used to compare one sample with a reference probability distribution (one-sample K-S test ), or to compare two samples (two-sample K-S test).

The one-sample KS test rejects $H_0$ at significance level $\alpha$ if $\sqrt{n} D_n$ exceeds the corresponding quantile $K_{\alpha}$ of the Kolmogorov distribution.
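
In practice (a sketch using scipy's built-in):

  from scipy import stats
  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=200)

  # One-sample KS test against the standard normal CDF
  print(stats.kstest(x, "norm"))                      # large p-value: consistent with H0
  print(stats.kstest(rng.uniform(size=200), "norm"))  # tiny p-value: reject H0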

F-test

An F-test is any statistical test in which the test statistic, called an F-statistic, has an F-distribution under the null hypothesis.

Residual-sum of squares between two models as a F-statistic

Notation
  • $n$ denotes the number of samples
  • $k$ denotes the number of regressors (including the constant term)
  • $q$ denotes the number of linear "restrictions" / "constraints", and therefore, also the degrees of freedom
  • $R$ is a $k \times q$ matrix with full column rank $q$, i.e. $\dim \big( \text{col}(R) \big) = q$, which translates into the constraints being "independent". This matrix is used to enforce the constraints on $\beta$: e.g. if we don't want a constant term / intercept, $R$ is chosen such that $\tilde{\beta} = R^T \beta$ satisfies $\tilde{\beta}_1 = 0$, if the constant term is in the first entry.
  • $r \in \mathbb{R}^+$ represents the hypothesized residual sum
Stuff

Consider two nested linear regression models with coefficients $\beta_0$ and $\beta_1$, with $\beta_1 \subset \beta_0$. Let $p_0$ and $p_1$ denote the respective numbers of coefficients, with $p_1 < p_0$, and let $\RSS_{p_0}, \RSS_{p_1}$ denote the corresponding residual sums of squares. Then letting

\begin{equation*}
F = \frac{(\RSS_{p_1} - \RSS_{p_0}) / (p_0 - p_1)}{\RSS_{p_0} / (n - p_0)}
\end{equation*}

we have

\begin{equation*}
F \sim F_{p_0 - p_1, n - p_0}
\end{equation*}

i.e. the quantity above is F-distributed.

Note: observe that $\RSS_{p_0} \le \RSS_{p_1}$, so the above is a non-negative quantity.

Consider the null-hypothesis $H_0: R^T \beta = r$. Recall that the variance of the OLS estimator is given by

\begin{equation*}
\text{Var}( \hat{\beta}_{\text{ols}} ) = \sigma^2 \big( X^T X \big)^{-1}
\end{equation*}

and we therefore have

\begin{equation*}
R^T \big( \hat{\beta}_{\text{ols}} - \beta \big) \sim \mathcal{N} \Big( 0, \sigma^2 R^T \big( X^T X \big)^{-1} R \Big)
\end{equation*}

Let

\begin{equation*}
B = R^T \big( X^T X \big)^{-1} R
\end{equation*}

and let, using Cholesky decomposition, $L$ be such that

\begin{equation*}
B^{-1} = L L^T
\end{equation*}

we have a "matrix square root" of $B^{-1}$ in $L^T$. The purpose of this matrix is so we can define the following random variable:

\begin{equation*}
Z := \frac{1}{\sigma} B^{-1 / 2} R^T \big( \hat{\beta}_{\text{ols}} - \beta \big) \sim \mathcal{N}(0, I_q)
\end{equation*}

where $I_q$ denotes the $q \times q$ identity matrix. To see that this is indeed the case, observe that

\begin{equation*}
\begin{split}
  \text{Var}(Z) &= \frac{1}{\sigma} B^{- 1 / 2} R^T \text{Var}\big( \hat{\beta}_{\text{ols}} \big) R \frac{1}{\sigma} B^{- 1 / 2} \\ 
  &= \frac{1}{\sigma} B^{- 1 / 2} \sigma^2 B \frac{1}{\sigma} B^{- 1 / 2} \\
  &amp;= I_q
\end{split}
\end{equation*}

Using properties seen in the proof of $\RSS / \sigma^2 \sim \chi^2_{n - k}$ below, we know that this $Z$ is independent of the rv.

\begin{equation*}
(n - k) \frac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n - k}
\end{equation*}

where

\begin{equation*}
\hat{\sigma}^2 = \frac{\RSS}{n - k} = \frac{1}{n - k} y^T Q y
\end{equation*}

where

\begin{equation*}
Q = I - X \big( X^T X \big)^{-1} X^T
\end{equation*}

known as the residual maker matrix. Then we have

\begin{equation*}
\frac{\overbrace{(Z^T Z)}^{\sim \chi_q^2} / q}{\underbrace{W}_{\sim \chi_{n - k}^2} / (n - k)} = \frac{\big( \hat{\beta}_{\text{ols}} - \beta \big)^T R \big( R^T (X^T X)^{-1} R \big)^{-1} R^T \big( \hat{\beta}_{\text{ols}} - \beta \big) / q}{\hat{\sigma}^2} \sim F_{q, n - k}
\end{equation*}

Recall that an F-statistic is a ratio of two independent $\chi^2$ random variables, each divided by its corresponding degrees of freedom, hence the above. In particular, under $H_0: R^T \beta = r$, this reduces to the statistic

\begin{equation*}
F = \frac{\big( R^T \hat{\beta}_{\text{ols}} - r \big)^T \big( R^T (X^T X)^{-1} R \big)^{-1} \big( R^T \hat{\beta}_{\text{ols}} - r \big) / q}{\hat{\sigma}^2} \sim F_{q, n - k}
\end{equation*}

Letting $q = p_0 - p_1$ (which is the number of components of $\beta_0$ we set to zero to get $\beta_1$) and $k = p_0$ (the number of parameters in the unrestricted model), we conclude our proof.

Proof that $\RSS / \sigma^2 \sim \chi^2_{n - k}$

Consider the regression model

\begin{equation*}
y = X \beta + \varepsilon
\end{equation*}

with

  • $X \in \Mat(n \times k, \mathbb{R})$
  • $\beta \in \mathbb{R}^k$
  • $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$

Then the residual vector is estimated by

\begin{equation*}
\hat{\varepsilon} = y - X \hat{\beta}
\end{equation*}

And is distributed according to

\begin{equation*}
\frac{\norm{\hat{\varepsilon}}^2}{\sigma^2} \sim \chi^2_{n - k}
\end{equation*}

and thus

\begin{equation*}
\mathbb{E} \big[ \norm{\hat{\varepsilon}}^2 \big] = (n - k) \sigma^2
\end{equation*}

Consider the following linear model

\begin{equation*}
y = X \beta + \varepsilon
\end{equation*}

where $X \in \text{Mat}(n \times k, \mathbb{R})$, $\beta \in \mathbb{R}^k$ and $\varepsilon \in \mathbb{R}^n$. The vector of residuals is estimated by

\begin{equation*}
\hat{\varepsilon} = y - X \hat{\beta} = \big( I - X (X^T X)^{-1} X^T \big) y = Q y = Q (X \beta + \varepsilon) = Q \varepsilon
\end{equation*}

where

\begin{equation*}
Q = I - X \big( X^T X \big)^{-1} X^T
\end{equation*}

Since trace is invariant under cyclic permutations, we have

\begin{equation*}
\tr \Big( X (X^T X)^{-1} X^T \Big) = \tr \Big( X^T X (X^T X)^{-1} \Big) = \tr(I_k)
\end{equation*}

since $X$ is an $n \times k$ matrix, and so $X^T X$ is a $k \times k$ matrix with $\tr(I_k) = k$. Therefore

\begin{equation*}
\text{tr}(Q) = \tr(I_n) - \tr \Big( X (X^T X)^{-1} X^T \Big) = n - k
\end{equation*}

Furthermore,

\begin{equation*}
Q^T = Q = Q^2
\end{equation*}

which is seen by the fact that

\begin{equation*}
\begin{split}
  Q^2 &=  \Big( I - X \big( X^T X \big)^{-1} X^T \Big) \Big( I - X \big( X^T X \big)^{-1} X^T \Big) \\
  &= I - X \big( X^T X \big)^{-1} X^T - X \big( X^T X \big)^{-1} X^T + \big[ X \big( X^T X \big)^{-1} \underbrace{X^T X \big( X^T X \big)^{-1}}_{= I} X^T \big] \\
  &= I - 2 X \big( X^T X \big)^{-1} X^T + X \big( X^T X \big)^{-1} X^T \\
  &= I - X \big( X^T X \big)^{-1} X^T \\
  &amp;= Q
\end{split}
\end{equation*}

We then have the following sequence of implications:

  • $Q^T = Q = Q^2$
  • $\implies$ $Q$ only has eigenvalues $0$ and $1$
  • $\implies$ $Q$ is normal and so there exists a unitary matrix $V$ such that

    \begin{equation*}
V^T Q V = \diag \big( \underbrace{1, \dots, 1}_{n - k}, \underbrace{0, \dots, 0}_{k} \big) = D
\end{equation*}

Recall that

\begin{equation*}
\hat{\varepsilon} \sim \mathcal{N} \big( 0, \sigma^2 Q \big) \quad \implies \quad V^T \hat{\varepsilon} \sim \mathcal{N} \big( 0, \sigma^2 D \big)
\end{equation*}

which is seen by computing $\mathbb{E} \big[ \big( V^T \hat{\varepsilon} \big) \big( V^T \hat{\varepsilon} \big)^T \big]$ and using $\mathbb{E} \big[ \hat{\varepsilon} \hat{\varepsilon}^T \big] = \sigma^2 Q$. From this it follows that

\begin{equation*}
\frac{V^T \hat{\varepsilon}}{\sigma} \sim \mathcal{N}(0, D) \quad \implies \quad \frac{\norm{V^T \hat{\varepsilon}}^2}{\sigma^2} = \frac{\norm{(V^T \hat{\varepsilon})^*}^2}{\sigma^2} \sim \chi^2_{n - k}
\end{equation*}

where $\big( V^T \hat{\varepsilon} \big)^*$ is the corresponding $(n - k)$-vector (since the rest of the components have zero variance and thus add nothing to the norm).

Furthermore, since $V$ is unitary, as mentioned earlier, we have

\begin{equation*}
\norm{\hat{\varepsilon}}^2 = \norm{V^T \hat{\varepsilon}}^2 
\end{equation*}

Recall that the residual sum-of-squares is given by $\RSS = \norm{\hat{\varepsilon}}^2$, and so we arrive at our result

\begin{equation*}
\frac{\norm{\hat{\varepsilon}}^2}{\sigma^2} = \frac{\RSS}{\sigma^2} \sim \chi^2_{n - k}
\end{equation*}

Finally, observe that this implies

\begin{equation*}
\mathbb{E} \big[ \norm{\hat{\varepsilon}}^2 \big] = (n - k) \sigma^2
\end{equation*}
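A quick Monte Carlo check of this result (my own sketch; the model and dimensions are arbitrary assumptions):

#+begin_src python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k, sigma = 50, 4, 1.5
X = rng.normal(size=(n, k))
beta = rng.normal(size=k)

rss_over_sigma2 = []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta_hat
    rss_over_sigma2.append(r @ r / sigma**2)

# Empirical mean should be close to n - k, and a K-S test against
# a chi-squared with n - k degrees of freedom should not reject.
print(np.mean(rss_over_sigma2), n - k)
print(stats.kstest(rss_over_sigma2, "chi2", args=(n - k,)).pvalue)
#+end_src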

Similarity testing

Notation

  • $\theta = (\psi, \lambda)$, which can take on values in $\Omega_{\theta} = \Omega_{\psi} \times \Omega_{\lambda}$
  • $\beta(\theta)$ denotes a test

Stuff

Test $H_0: \psi = \psi_0$ vs. $H_1 : \psi \ne \psi_0$.

Suppose $\phi(\mathbf{y})$ is a test of size $\alpha$, i.e.

\begin{equation*}
\mathbb{E}_{\psi_0, \lambda} \big( \phi(\mathbf{y}) \big) = \alpha, \quad \forall \lambda
\end{equation*}

Then $\phi$ is a similar test of size $\alpha$.

More generally, for $H_0: \theta \in \Theta_0$, a test is similar if $\mathbb{E}_{\theta} \big( \phi(\mathbf{y}) \big) = \alpha$ on the boundary of $\Theta_0$.

Confidence regions

Notation

  • $\mathbf{y}$ is drawn from some distribution

    \begin{equation*}
\mathbf{Y} \sim F(\mathbf{y} \mid \theta), \quad \theta \in \Theta
\end{equation*}
  • $\theta = (\psi, \lambda)$
  • $R_\alpha(\psi_0)$ is a size $\alpha$ critical region for $H_0 : \psi = \psi_0$

    \begin{equation*}
S_{\alpha}(\mathbf{y}) = \{ \psi_0 \mid \mathbf{y} \notin R_{\alpha}(\psi_0) \}
\end{equation*}

    with

    \begin{equation*}
P \big( S_{\alpha} (\mathbf{y}) \ni \psi_0 \mid \psi_0, \lambda \big) = \mathbb{P} \big( \mathbf{y} \notin R_{\alpha}(\psi_0) \mid \psi_0, \lambda \big) = 1 - \alpha
\end{equation*}
  • critical region refers to the region of the sample space which leads to the rejection of the hypothesis tested when the hypothesis is true.

Stuff

  • Generalization of confidence intervals to multiple dimensions

$S_{\alpha} (\mathbf{y})$ is a $(1 - \alpha)$ confidence region for $\psi$ if

\begin{equation*}
P \big( \psi \in S_{\alpha}(\mathbf{y}) \big) = 1 - \alpha, \quad \forall \psi, \lambda
\end{equation*}

If $\alpha_1 < \alpha_2$ then

\begin{equation*}
S_{\alpha_1}(\mathbf{y}) \supset S_{\alpha_2}(\mathbf{y})
\end{equation*}

Pivotal quantities

Let $U = u(\mathbf{y}, \psi)$ have a distribution independent of $\psi$, i.e. $U$ is a pivotal statistic, that is

\begin{equation*}
F(U \mid \psi) = F(U)
\end{equation*}

Then we can construct a set $U_{\alpha}$ such that we obtain a confidence region as follows:

\begin{equation*}
P \Big( u(\mathbf{y}, \psi) \in U_{\alpha} \Big) = 1 - \alpha
\end{equation*}

Decision Theory

Notation

  • $\mathcal{Y}$ denotes the sample space i.e. $y \in \mathcal{Y}$
  • $\Omega_{\theta}$ denotes the parameter space
  • $\{ \mathbb{P}_{\theta}(y) \ : \ y \in \mathcal{Y}, \theta \in \Omega_{\theta} \}$ denotes a family of probability distributions
  • $\mathcal{A}$ is the action space, i.e. set of actions an experiment can take after observing data, e.g. reject or accept $H_0$, estimating the value of $\theta$, etc.
  • $L: \Omega \times \mathcal{A} \to \mathbb{R}$ denotes the loss function
  • $\mathcal{D}$ denotes the set of decision rules; each $d \in \mathcal{D}$ is a function $d: \mathcal{Y} \to \mathcal{A}$ that associates a particular decision with each possible observed data set.

Stuff

For a parameter $\theta \in \Theta_0$, the risk of a decision rule, $d$, is defined as

\begin{equation*}
R(\theta, d) = \mathbb{E}_{\theta} \Big[ L \big( \theta, d(Y) \big) \Big] = 
\begin{cases}
  \sum_{y \in \mathcal{Y}} L \Big( \theta, d(y) \Big) f(y ; \ \theta) & \text{for discrete } \mathcal{Y} \\
  \int_{\mathcal{Y}} L \Big( \theta, d(y) \Big) f(y ; \ \theta) \ dy & \text{for continuous } \mathcal{Y}
\end{cases}
\end{equation*}

In other words, the risk is the expected loss of a particular decision rule $d$ when the true value of the unknown parameter is $\theta$.

Note: this is fundamentally a frequentist concept, since the definition implicitly invokes the idea of repeated samples from the sample space $\mathcal{Y}$ and computes the average loss over these hypothetical repetitions.

Selection of decision rule

Given two possible decision rules $d_1$ and $d_2$, the rule $d_1$ strictly dominates the rule $d_2$ if

\begin{equation*}
R \big( \theta, d_1 \big) \le R(\theta, d_2), \quad \forall \theta \in \Omega_{\theta}
\end{equation*}

and there exists at least one value of $\theta$, say $\theta'$, for which

\begin{equation*}
R \big( \theta', d_1 \big) < R(\theta', d_2)
\end{equation*}

It is clear that we would always prefer $d_1$ to $d_2$.

If, for a given decision rule $d_2$, there exists some decision rule $d_1$ that strictly dominates $d_2$, then $d_2$ is said to be inadmissible.

Conversely, if there is no decision rule that strictly dominates $d_2$, then $d_2$ is said to be admissible.

That is, dominance requires less-than-or-equal risk for all $\theta$ AND at least ONE value of $\theta$ for which we have strict inequality.

Unbiasedness

We say a decision rule is unbiased if

\begin{equation*}
\mathbb{E}_{\theta} \Big[ L \big( \theta', d(Y) \big) \Big] \ge \mathbb{E}_{\theta} \Big[ L \big( \theta, d(Y) \big) \Big], \quad \forall \theta, \theta' \in \Omega_{\theta}
\end{equation*}

i.e. loss of the decision rule should be minimized for the true value of the parameter $\theta$.

Minimax decision rules

A decision rule $d$ is minimax if it minimizes the maximum risk $\text{MR}(d) := \sup_{\theta \in \Omega_{\theta}} R(\theta, d)$, i.e.

\begin{equation*}
\text{MR}(d) \le \text{MR}(d'), \quad \forall d' \in \mathcal{D}
\end{equation*}

which can also be written

\begin{equation*}
\sup_{\theta} R(\theta, d) = \inf_{d' \in \mathcal{D}} \sup_{\theta \in \Omega_{\theta}} R(\theta, d')
\end{equation*}
Bayes decision rules

The Bayes risk of a decision rule, $d$, is defined by

\begin{equation*}
r(\pi, d) = \int_{\theta \in \Omega_{\theta}} R(\theta, d) \pi(\theta) \ d \theta
\end{equation*}

or by a sum in case of discrete-valued probability distribution.

A decision rule is a Bayes rule with respect to the prior $\pi(\cdot)$ if it minimizes the Bayes risk, i.e.,

\begin{equation*}
r(\pi, d) = \inf_{d' \in \mathcal{D}} r(\pi, d') = m_{\pi}
\end{equation*}

The definition of a Bayes rule assumes that the infimum is achieved by some rule, which might not always be true.

In those cases, for any $\varepsilon > 0$, we can find a decision rule $d_{\varepsilon}$ such that

\begin{equation*}
r(\pi, d_{\varepsilon}) < m_{\pi} + \varepsilon
\end{equation*}

Such a rule is said to be $\varepsilon$-Bayes wrt. the prior $\pi(\cdot)$.

A useful choice of prior is the one that is most conservative in its estimate of risk. This gives rise to the concept of a least favourable prior.

We say $\pi(\theta)$ is a least favourable prior if, for any other prior $\pi'(\theta)$ we have

\begin{equation*}
r(\pi, d_{\pi}) \ge r(\pi', d_{\pi'})
\end{equation*}

where $d_{\pi}, d_{\pi'}$ are the Bayes (decision) rules corresponding to $\pi(\cdot)$ and $\pi'(\cdot)$.

Randomized decision rules
  • $M$ decision rules $d_1, \dots, d_M$ which we choose from with probabilities $p_1, \dots, p_M$, with $p_i \ge 0, \forall i$ and $\sum p_i = 1$
  • Define $d_* = \sum p_i d_i$ to be the rule "select decision rule $d_i$ with prob. $p_i$", called the randomized decision rule, which has risk

    \begin{equation*}
R(\theta, d_*) = \sum_{i=1}^{M} p_i R(\theta, d_i)  
\end{equation*}
  • Sort of a linear combination of decision rules
  • Minimax rules are often this way
    • Supremum over $\theta$ of the risk for $d_*$ is smaller than the supremum of the risk for any of the decision rules individually

Finding minimax rules

Suppose that $d_{\pi^*}$ is a Bayes (decision) rule corresponding to prior $\pi^*(\theta)$ and $r(\pi^*, d_{\pi^*}) = \sup_{\theta \in \Omega_{\theta}} R(\theta, d_{\pi^*})$, then

  1. $d_{\pi^*}$ is a minimax decision rule
  2. $\pi^*(\cdot)$ is the least favourable prior

Admissibility of Bayes (decision) rules

We have the following three theorems:

Assume that the parameter space, $\Omega_{\theta} = \{ \theta_1, \dots, \theta_M \}$, is finite and a given prior $\pi(\cdot)$ gives positive weight to each $\theta_i$.

Then a Bayes (decision) rule wrt. $\pi$ is admissible.

If a Bayes (decision) rule is unique, it is admissible.

Let $\Omega_{\theta}$ be a subset of the real line. Assume that the risk functions $R(\theta, d)$ are continuous in $\theta$ for all decision rules $d$.

If the prior $\pi$ has the property that for any $\varepsilon > 0$ and any $\theta$, the interval $(\theta - \varepsilon, \theta + \varepsilon)$ has non-zero probability under $\pi$, then the Bayes (decision) rule wrt. $\pi$ is admissible.

James-Stein Estimators

Notation

  • $\mathbf{Y} \sim \mathcal{N}(\mu, I)$
  • $L(\mu, d) = \norm{\mu - d}^2$ is the squared-error loss function
  • Risk of a decision rule is then given by

    \begin{equation*}
R\Big( \mu, \hat{\mathbf{d}}^0(\mathbf{Y}) \Big) = \mathbb{E} \Big[ \norm{\mu - \mathbf{Y}}^2 \Big] = \sum_{i=1}^{p} \mathbb{E} \big[ (\mu_i - Y_i)^2 \big] = \sum_{i=1}^{p} \text{Var}(Y_i) = p
\end{equation*}

Stein's Paradox

The class of James-Stein estimators of the form

\begin{equation*}
\mathbf{\hat{d}}^a (\mathbf{Y}) = \Bigg( 1 - \frac{a}{\norm{\mathbf{Y}}_2^2} \Bigg) \mathbf{Y}
\end{equation*}

has smaller risk than $p$ for every $\mu$ (when $p \ge 3$ and $0 < a < 2(p - 2)$) and hence strictly dominates the natural estimator

\begin{equation*}
\mathbf{\hat{d}}^0 (\mathbf{Y}) = \mathbf{Y}
\end{equation*}

These are called shrinkage estimators as they shrink $\mathbf{Y}$ towards $0$ when $\norm{\mathbf{Y}}^2 > a$, and this has the following consequence: folding in information from variables that are independent of the variable of interest can, on average, improve estimation of the variable of interest! A simulation illustrating this is sketched below.
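A minimal sketch (mine; $p$, $\mu$ and the choice $a = p - 2$ are illustrative assumptions) comparing the estimated risks by Monte Carlo:

#+begin_src python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 10, 20000
mu = np.ones(p)                       # arbitrary true mean
a = p - 2                             # classical James-Stein choice

Y = mu + rng.normal(size=(reps, p))   # reps draws of Y ~ N(mu, I_p)
norm2 = np.sum(Y**2, axis=1, keepdims=True)
d_js = (1 - a / norm2) * Y            # James-Stein estimator

risk_natural = np.mean(np.sum((Y - mu) ** 2, axis=1))   # should be ~ p
risk_js = np.mean(np.sum((d_js - mu) ** 2, axis=1))     # strictly smaller
print(risk_natural, risk_js)
#+end_src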

Suppose

\begin{equation*}
\mathbf{X} \sim N(\mu, \sigma^2 I)
\end{equation*}

and $f: \mathbb{R}^n \to \mathbb{R}$ is an almost differentiable function, then

\begin{equation*}
\frac{1}{\sigma^2} \mathbb{E} \Big[ (\mathbf{X} - \mu) f(\mathbf{X}) \Big] = \mathbb{E} \Big[ \nabla f(\mathbf{X}) \Big]
\end{equation*}

First we do the univariate case.

Suppose

\begin{equation*}
Z \sim N(\mu_0, \sigma^2), \quad g: \mathbb{R} \to \mathbb{R}
\end{equation*}

which is absolutely continuous, and

\begin{equation*}
\mathbb{E} \big[ g'(Z) \big] < \infty
\end{equation*}

Using the change of variables $z \mapsto (z - \mu_0) / \sigma$, we may set $\mu_0 = 0$ and $\sigma = 1$, so that $Z$ is std. normal with density

\begin{equation*}
\phi(z) = \frac{1}{\sqrt{2 \pi}} \exp \Bigg(- \frac{1}{2} z^2 \Bigg)
\end{equation*}

Then one simply substitutes into Stein's lemma and verifies that it is indeed satisfied.

Bayesian Statistics

Model comparison

Bayesian Information Criterion (BIC)

Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred.

The BIC is defined as:

\begin{equation*}
\text{BIC} = \ln(n) k - 2 \ln \big( \hat{L} \big)
\end{equation*}

where

  • $\hat{L}$ is the maximized value of the likelihood function of the model
  • $x$ the observed data
  • $n$ is the number of data points in $x$
  • $k$ is the number of parameters in the model

BIC is an asymptotic result derived under the assumptions that the data distribution follows an exponential family.

Limitations:

  1. Approximation only valid for sample sizes $n \gg k$
  2. The BIC cannot handle complex collections of models as in the variable selection problem in high-dimensions
Akaike Information Criterion (AIC)
  • Given collection of models for some data, estimates quality of each model, relative to each of the other models
  • NOT a test of model in a sense of testing a null hypothesis, i.e. says nothing about the absolute quality, only relative quality

Suppose we have a statistical model of some data.

Let:

  • $k$ be the number of estimated parameters in the model
  • $\hat{L}$ be the maximum value of the likelihood function for the model

Then the Akaike Information Criterion (AIC) value of the model is

\begin{equation*}
\text{AIC} = 2k - 2 \ln \big( \hat{L} \big)
\end{equation*}

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value.

Thus, AIC rewards goodness-of-fit (as assessed by the likelihood function), but it also penalizes a large number of parameters (complexity).

AIC is based on information theory. Suppose the data was generated by some unknown process $f$, and we're considering two candidate models $g_1$ and $g_2$. We could then compute the KL-divergence of $f$ and $g_1$, $\text{KL}(f \| g_1)$, and of $f$ and $g_2$, $\text{KL}(f \| g_2)$, i.e. the "loss of information" incurred by representing $f$ using $g_1$ or $g_2$, respectively. One could then compare these values, and choose the candidate model with the smallest KL-divergence from $f$.

Asymptotically, making this choice is equivalent to choosing the model $g_i$ with the smallest AIC! Note, however, that AIC can be a bad comparison if the number of samples is small.
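A minimal sketch (mine; the two Gaussian candidate models are illustrative assumptions) computing AIC and BIC from the maximized log-likelihood:

#+begin_src python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=0.3, scale=1.0, size=200)

def aic_bic(loglik, k, n):
    return 2 * k - 2 * loglik, np.log(n) * k - 2 * loglik

n = len(x)
# Model 1: N(0, sigma^2), one free parameter; MLE of sigma^2 is mean(x^2).
ll1 = stats.norm.logpdf(x, 0.0, np.sqrt(np.mean(x**2))).sum()
# Model 2: N(mu, sigma^2), two free parameters (MLEs: sample mean and std).
ll2 = stats.norm.logpdf(x, x.mean(), x.std()).sum()

print(aic_bic(ll1, 1, n), aic_bic(ll2, 2, n))  # lower is preferred
#+end_src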

TODO ANOVA

One-way

Two-way

Bootstrapping

  • Bootstrap methods get their name because "using the data to generate more data" seems analogous to a trick used by the fictional Baron Munchausen, who when he found himself at the bottom of a lake got out by pulling himself up by his bootstraps :)

Notation

  • $y_1, \dots, y_n$ is a single homogeneous sample of data with PDF $f$ and CDF $F$
  • Statistic $T$ whose value in the sample is $t$, which estimates $\theta$, an underlying characteristic of the population. Generally $t$ is a symmetric function of $y_1, \dots, y_n$
  • EDF stands for the empirical distribution function, denoted $\hat{F}$
  • $\psi$ denotes the parameter of a parametric model with CDF and PDF $F_{\psi}$ and $f_{\psi}$, respectively
  • $\hat{F}(y) = F_{\hat{\psi}}(y)$ is the model fitted to data drawn from the PDF $f_{\psi}$
  • $Y^*$ denotes the rv. distributed according to the fitted model, and we do the same for other moments (e.g. $\mathbb{E}^*$ denotes the mean of the fitted model)
  • $\hat{\theta}^*$ denotes the random variable of the statistic of interest, in comparison to $\hat{\theta}$ which is the observed estimate, or rather

    \begin{equation*}
\hat{\theta} = t(\hat{F})
\end{equation*}

Stuff

  • Interested in probability distribution of $T$
  • $\theta = t(F)$ describes the fact that the population parameter $\theta$ is equal to the value the statistic $T$ takes on for the underlying CDF $F$
  • $t = t(\hat{F})$ expresses the relationship between the estimate $t$ and $\hat{F}$
    • To be properly rigorous, we would write $T = t_n(\hat{F})$ and require that $t_n \to t$ as $n \to \infty$, possibly even that $t_n - t = \mathcal{O}(n^{-1})$
    • We will mostly assume $T = t(\hat{F})$
Moment estimates
  • Suppose we simulate a dataset $\{ Y_1^*, \dots, Y_n^* \} \sim \hat{f}$, i.e. from the fitted model
  • Properties of $T - \theta$ are then estimated from $T_1^*, \dots, T_R^*$, using $R$ repetitions of the data simulation
  • Important to recognize that we are not estimating absolute properties of $T$, but rather of $T$ relative to $\theta$
Distribution and quantile estimates
  • Approximating the distribution of $T - \theta$ by that of $T^* - t$
  • Cumulative probabilities are estimated by EDF of the simulated values $t^* - t$, i.e. if

    \begin{equation*}
G(u) = \text{Pr} ( T - \theta \le u )
\end{equation*}

    then the simulation estimate of $G(u)$ is

    \begin{equation*}
\hat{G}_R(u) = \frac{\# \left\{ t_r^* - t \le u \right\}}{R} = \frac{1}{R} \sum_{r=1}^{R} I \left\{ t_r^* - t \le u \right\}
\end{equation*}

    And $\hat{G}_R(u) \to \hat{G}(u)$ as $R \to \infty$, where $\hat{G}$ is the exact CDF of $T^* - t$ under sampling from the fitted model

  • Approximation $\hat{G}_R$ to $G$ contains two sources of error:
    1. $\hat{G}$ to $G$ due to data variability
    2. $\hat{G}_R$ to $\hat{G}$ due to finite simulation
  • Quantiles of distribution of $T - \theta$
    • Approximated using ordered values of $t^* - t$
    • Suppose $X_1, \dots, X_N$ are i.i.d. with CDF $K$ and if $X_{(j)}$ denotes the j-th ordered value, then

      \begin{equation*}
\mathbb{E} \big[ X_{(j)} \big] \approx K^{-1} \bigg( \frac{j}{N + 1} \bigg)
\end{equation*}
    • This implies a sensible estimate of $K^{-1}(p)$ is the $X_{\big( (N + 1)p \big)}$, assuming that $(N + 1)p$ is an integer
      • Therefore we can estimate the $p$ quantile of $T - \theta$ by the $(R + 1)p \text{-th}$ ordered value of $t^* - t$, i.e. $t_{\big( (R + 1)p \big)}^* - t$
      • We're assuming $R$ is chosen so that $(R + 1)p$ is an integer
Nonparametric simulation
  • Same as in the previous section, but using the EDF to perform the simulation instead of a parametric estimate (e.g. using the MLE of $\theta$); we call this the nonparametric bootstrap
Simple confidence intervals

If we use bootstrap estimates of quantiles for $T - \theta$, then an equitailed $(1 - 2 \alpha)$ confidence interval will have limits

\begin{equation*}
\Big( t - \big( t^*_{\big( (R + 1)(1 - \alpha) \big)} - t \big), t - \big( t^*_{\big( (R + 1)\alpha \big)} - t \big) \Big)
\end{equation*}

where we explicitly write the second term in the parentheses in the limits to emphasize that we're using quantiles of $t^* - t$ as estimates of quantiles of $T - \theta$. A sketch of this interval is given after the next equation.

This is based on the probability implication

\begin{equation*}
\text{Pr} \big( a \le T - \theta \le b \big) = 1 - 2 \alpha \implies \text{Pr} \big( T - b \le \theta \le T - a \big) = 1 - 2 \alpha
\end{equation*}
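A minimal sketch of this interval (mine; the exponential sample and the choices $R = 999$, $\alpha = 0.025$ are illustrative assumptions, with $R$ chosen so $(R + 1)\alpha$ is an integer):

#+begin_src python
import numpy as np

rng = np.random.default_rng(5)
y = rng.exponential(scale=2.0, size=60)   # original sample
t = y.mean()                              # observed statistic

R, alpha = 999, 0.025
t_star = np.sort([rng.choice(y, size=len(y), replace=True).mean()
                  for _ in range(R)])

k_lo, k_hi = int((R + 1) * alpha), int((R + 1) * (1 - alpha))
lower = t - (t_star[k_hi - 1] - t)   # t - (t*_((R+1)(1-alpha)) - t)
upper = t - (t_star[k_lo - 1] - t)   # t - (t*_((R+1)alpha) - t)
print(lower, upper)
#+end_src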

Reducing error

  • Problem: choose quantity $Q = q(\hat{F}, F)$ such that $Q$ is as nearly pivotal as possible
    • That is, it has (at least approx.) the same distribution under sampling from both $\hat{F}$ and $F$
  • Let $Q = h(T) - h(\theta)$ with $h(t)$ increasing in $t$, and if $a_{\alpha}$ is an approx. lower $\alpha$ quantile of $h(T) - h(\theta)$, then

    \begin{equation*}
1 - \alpha = \text{Pr} \left\{ h(T) - h(\theta) \ge a_{\alpha} \right\} = \text{Pr} \left\{ \theta \le h^{-1} \big( h(T) - a_{\alpha} \big) \right\}
\end{equation*}

    where $h^{-1}$ is the inverse transformation

  • Thus, $h^{-1} \big( h(T) - a_{\alpha} \big)$ is an upper $(1 - \alpha)$ confidence limit for $\theta$

So we're basically saying: "bound from below the difference between $h$ of our estimator $T$ and $h$ of the true $\theta$ by $a_{\alpha}$, with probability $1 - \alpha$"; inverting that bound turns the quantile $a_{\alpha}$ into the upper confidence limit $h^{-1} \big( h(T) - a_{\alpha} \big)$ for $\theta$.

Theoretical basis for bootstrap

Suppose we have a random sample $y_1, \dots, y_n$, or equiv., its EDF $\hat{F}$.

We want to estimate some quantity $Q = q(Y_1, \dots, Y_n ; F)$, e.g.

\begin{equation*}
Q(Y_1, \dots, Y_n; F) = \sqrt{n} \bigg( \bar{Y} - \int y \ d F(y) \bigg) = \sqrt{n} \big( \bar{Y} - \theta \big)
\end{equation*}

and want to estimate the distribution function

\begin{equation*}
G_{F, n}(q) = \text{Pr} \left\{ Q(Y_1, \dots, Y_n ; F) \le q \mid F \right\}
\end{equation*}

where the conditioning on $F$ indicates that $Y_1, \dots, Y_n$ is a random sample from $F$.

The bootstrap estimate of $G_{F, n}$ is then

\begin{equation*}
G_{\hat{F}, n} (q) = \text{Pr} \left\{ Q(Y_1^*, \dots, Y_n^*; \hat{F}) \le q \mid \hat{F} \right\}
\end{equation*}

where in this case

\begin{equation*}
Q(Y_1^*, \dots, Y_n^*; \hat{F}) = \sqrt{n} \big( \bar{Y}^* - \bar{y} \big)
\end{equation*}

In order for

\begin{equation*}
G_{\hat{F}, n} \overset{\text{a.s.}}{\to} G_{F, n} \quad \text{as} \quad n \to \infty
\end{equation*}

we need the following conditions to hold (letting $U$ be a neighborhood of $F$ in a space of suitable distributions):

  1. For any $A \in U$, $G_{A, n}$ must converge weakly to a limit $G_{A, \infty}$
  2. This convergence must be uniform on $U$
  3. The function mapping $A$ to $G_{A, \infty}$ must be continuous

where converging weakly means

\begin{equation*}
\int h(u) \ d G_{A, n}(u) \to \int h(u) \ d G_{A, \infty}(u)
\end{equation*}

for all bounded, continuous functions $h(\cdot)$.

Under these conditions, the bootstrap estimate is consistent.

Resampling for testing

Nonparametric Permutation tests
  • Permutation test is a two-sample test
  • Have samples $\{ x_i \sim p, 1 \le i \le n_p \}$ and $\{ y_j \sim q, 1 \le j \le n_q \}$ for two distributions.
  • Want to check if $H_0: p = q$ is true
  • Consider some statistic which measures the discrepancy between the two sample-sets, e.g. the mean difference

    \begin{equation*}
\theta = \bar{x} - \bar{y}
\end{equation*}
  • Under $H_0$, for every sample $z \in \left\{ x_i \right\} \cup \left\{ y_j \right\}$, whether or not the sample came from distribution $p$ or $q$ should be equally likely, since under $H_0$ we have $p = q$
  • Therefore we consider permutations between the samples from the distributions!
    • Consider $n_p + n_q$ tuple

      \begin{equation*}
\big( z_1 = x_1, \dots, z_{n_p} = x_{n_p}, z_{n_p + 1} = y_1, \dots, z_{n_p + n_q} = y_{n_q} \big)
\end{equation*}
    • Permute the tuple

      \begin{equation*}
  \big( z_1, \dots, z_{n_p + n_q} \big) \mapsto \big( z_{\sigma(1)}, \dots, z_{\sigma(n_p + n_q)} \big)
\end{equation*}

      where $\sigma$ is a permutation on $n_p + n_q$ symbols.

    • Let $\{ \tilde{x}_i = z_{\sigma(i)}, 1 \le i \le n_p \}$ and $\{ \tilde{y}_j = z_{\sigma(n_p + j)}, 1 \le j \le n_q \}$
    • Compute $\hat{\theta}^*$ assuming $\tilde{x}_i$ to come from $p$ and $\tilde{y}_j$ to come from $q$.
  • Gives us achieved significance level (ASL), also known as the p-value, by considering

    \begin{equation*}
  \text{ASL} = \text{Pr} \left\{ \hat{\theta}^* \ge \hat{\theta} \mid H_0 \right\}
\end{equation*}

    i.e. the probability of getting a value as large as $\hat{\theta}$ when $H_0$ is true

    • Observe that $\hat{\theta}$ is a "discrepancy" measurement between the samples, e.g. the difference, and so "large values = BAD"

Practically, there can be a lot of permutations to compute; ${n_p + n_q \choose n_p}$ possibilities, in fact. Therefore we estimate the $\text{ASL}_{\text{perm}}$ by "sampling" permutations uniformly:

\begin{equation*}
\widehat{\text{ASL}}_{\text{perm}} = \frac{\# \Big( \left\{ \hat{\theta}^{(m)} \ge \hat{\theta}, \quad 1 \le m \le M \right\} \Big)}{M}
\end{equation*}

where $\hat{\theta}^{(m)}$ corresponds to the estimate of $\hat{\theta}$ using the m-th sampled permuted dataset, and $M$ is the number of permutations considered.

Note: permutation test is an exact test when all permutations are computed.
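A minimal sketch of the sampled-permutation estimate (mine; the normal samples are illustrative, and I use the absolute mean difference as the discrepancy so that "large = bad" holds for both tails):

#+begin_src python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, size=30)   # sample from p
y = rng.normal(0.5, 1.0, size=40)   # sample from q

theta_hat = abs(x.mean() - y.mean())   # observed discrepancy
z = np.concatenate([x, y])
n_p = len(x)

M = 10_000
theta_perm = np.empty(M)
for m in range(M):
    perm = rng.permutation(z)          # a uniformly sampled permutation
    theta_perm[m] = abs(perm[:n_p].mean() - perm[n_p:].mean())

asl_hat = np.mean(theta_perm >= theta_hat)   # estimated ASL / p-value
print(asl_hat)
#+end_src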

Frequentist bootstrapping

  • Data approximates unknown distribution
  • Sampling distribution of a statistic can be approximated by repeatedly resampling the observations with replacement and computing the statistic for each sample

Let

  • $y = (y_1, \dots, y_n)$ denote the original samples
  • $y^{(b)} = \big( y_1^{(b)}, \dots, y_n^{(b)} \big)$ denote the bootstrap sample
    • Likely have some observations repeated one or more times, while some are absent

Equivalently, we can instead make the following observation:

  • Each original observation $y_i$ occurs anywhere from $0$ to $n$ times
  • Let $h_i^{(b)}$ denote # of times $y_i$ occurs in $y^{(b)}$ and

    \begin{equation*}
h^{(b)} = \big( h_1^{(b)}, \dots, h_n^{(b)} \big)
\end{equation*}

    thus

    \begin{equation*}
\begin{split}
  h_i^{(b)} &\in \left\{ 0, 1, \dots, n - 1, n \right\} \\
  \sum_{i=1}^{n} h_i^{(b)} & = n
\end{split}
\end{equation*}
  • Let $w_i^{(b)}$ be such that

    \begin{equation*}
n \ w_i^{(b)} = h_i^{(b)}
\end{equation*}

    then

    \begin{equation*}
n \ w^{(b)} = h^{(b)} \sim \text{Multinomial} \bigg( n; \ \frac{1}{n}, \dots, \frac{1}{n} \bigg)
\end{equation*}
  • Hence we can compute the bootstrap sample $y^{(b)}$ by instead drawing $w^{(b)}$ this way, and weighting the original samples by these drawn weights!

Recall that

\begin{equation*}
x \sim \text{Multinomial} \bigg( n; \ \frac{1}{n}, \dots, \frac{1}{n} \bigg)
\end{equation*}

has the property that

\begin{equation*}
\sum_{i=1}^{n} x_i = n
\end{equation*}

Hence,

\begin{equation*}
\frac{1}{n} \sum_{i=1}^{n} x_i = 1
\end{equation*}
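A minimal sketch of this weighting view (mine; the sample and $B$ are illustrative assumptions):

#+begin_src python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(size=50)
B = 2000

n = len(y)
# h[b, i] counts how many times y_i appears in bootstrap sample b.
h = rng.multinomial(n, np.full(n, 1 / n), size=B)
w = h / n                                  # weights; each row sums to 1

boot_means = w @ y                         # equals the resampled means
# Spread of the bootstrap means approximates the standard error of the mean.
print(boot_means.std(), y.std(ddof=1) / np.sqrt(n))
#+end_src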

TODO Bayesian bootstrapping

Problems / Examples

Multi-parameter MLE

Normal samples

Suppose we have the samples $\{Y_1, Y_2, \dots, Y_n\}$ following the model

\begin{equation*}
Y_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)
\end{equation*}

Then we want to test $H_0 : \mu = \mu_0$ against $H_1 : \mu \ne \mu_0$.

The likelihood is given by

\begin{equation*}
L(\mu, \sigma^2) = \big( 2 \pi \sigma^2 \big)^{- \frac{n}{2} } \exp \Bigg( - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \Bigg)
\end{equation*}

Under our restricted parameter space $\omega$ ($H_0 : \mu = \mu_0$), the MLE is

\begin{equation*}
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu_0)^2
\end{equation*}

Under $\Omega$ ($H_0 \cup H_1$: $\mu$ and $\sigma^2$ can take on any value):

\begin{equation*}
\begin{split}
  \hat{\hat{\mu}} &= \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \\
  \hat{\hat{\sigma}}^2 &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2
\end{split}
\end{equation*}

Thus, the likelihood ratio is

\begin{equation*}
\begin{split}
  LR &= \frac{\max_{\boldsymbol{\theta} \in \omega} L(\boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Omega} L(\boldsymbol{\theta})} \\
  &= \frac{\big( 2 \pi \hat{\sigma}^2 \big)^{- \frac{n}{2}} \exp \Big( - \frac{1}{2 \hat{\sigma}^2} \sum_{i=1}^{n} (y_i - \mu_0)^2 \Big)}{\big( 2 \pi \hat{\hat{\sigma}}^2 \big)^{- \frac{n}{2}} \exp \Big( - \frac{1}{2 \hat{\hat{\sigma}}^2} \sum_{i=1}^{n} (y_i - \bar{y})^2 \Big)} \\
  &= \Bigg( \frac{\sum_{i=1}^{n} (y_i - \mu_0)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \Bigg)^{- \frac{n}{2}}
\end{split}
\end{equation*}

But

\begin{equation*}
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu_0)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 + (\bar{y} - \mu_0)^2
\end{equation*}

and thus

\begin{equation*}
LR = \Bigg( \frac{\hat{\hat{\sigma}}^2 + (\bar{y} - \mu_0)^2}{\hat{\hat{\sigma}}^2} \Bigg)^{- \frac{n}{2}} = \Bigg( 1 + \frac{t^2}{n - 1} \Bigg)^{- \frac{n}{2}}
\end{equation*}

where

\begin{equation*}
t = \frac{\sqrt{n} (\bar{y} - \mu_0)}{s} \quad \text{and} \quad s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2
\end{equation*}

Rejecting for small values of LR is equivalent to rejecting for large values of $t^2$, i.e. this test is equivalent to a two-tailed t-test.

Here we can determine an exact distribution for LR, or rather for the equivalent test statistic $t$, i.e.

\begin{equation*}
t \sim t_{n - 1} \quad (\text{under } H_0 : \mu = \mu_0)
\end{equation*}

However, in general, the distribution of LR (under $H_0$) is difficult to determine and we need to use the approximate $\chi^2$ method.
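A quick numerical check (mine, with an illustrative sample) that the LR is the claimed monotone function of $t^2$, so the LR test agrees with the two-tailed t-test:

#+begin_src python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
y = rng.normal(loc=0.2, size=25)
mu0, n = 0.0, len(y)

s2 = y.var(ddof=1)                               # s^2 with n - 1 denominator
t = np.sqrt(n) * (y.mean() - mu0) / np.sqrt(s2)  # the t statistic above
LR = (1 + t**2 / (n - 1)) ** (-n / 2)            # likelihood ratio

print(t, LR, stats.ttest_1samp(y, mu0).statistic)  # t values agree
#+end_src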

Fisher's method of scoring

Multiparameter case

Let $y_1, \dots, y_n$ be i.i.d. with

\begin{equation*}
Y_i \sim \text{Cauchy}(\alpha, \beta)
\end{equation*}

with p.d.f.

\begin{equation*}
f(y; \alpha, \beta) = \frac{\beta}{\pi \Big( \beta^2 + (y - \alpha)^2 \Big)}, \qquad - \infty < y < \infty, \quad \beta > 0
\end{equation*}

that is $\boldsymbol{\theta} = (\alpha, \beta)^T$.

  • Log-likelihood:

    \begin{equation*}
\ell(\alpha, \beta) = n \log \beta - \sum_{i=1}^{n} \log \Big( \beta^2 + (y_i - \alpha)^2 \Big)
\end{equation*}
  • Score vector:

    \begin{equation*}
U(\alpha, \beta) =
\begin{pmatrix}
  \frac{\partial \ell}{\partial \alpha} \\
  \frac{\partial \ell}{\partial \beta}
\end{pmatrix} =
\begin{pmatrix}
  2 \sum_{i=1}^{n} \frac{y_i - \alpha}{\beta^2 + (y_i - \alpha)^2} \\
  \frac{n}{\beta} - \sum_{i=1}^{n} \frac{2 \beta}{\beta^2 + (y_i - \alpha)^2}
\end{pmatrix}
\end{equation*}
  • Hessian matrix:

    \begin{equation*}
\begin{split}
  H(\alpha, \beta) &= \begin{pmatrix}
  \frac{\partial^2 \ell}{\partial \alpha^2} & \frac{\partial^2 \ell}{\partial \alpha \partial \beta} \\
  \frac{\partial^2 \ell}{\partial \beta \partial \alpha} & \frac{\partial^2 \ell}{\partial \beta^2}
\end{pmatrix} \\
  &= \begin{pmatrix}
  2 \sum_{i=1}^{n} \frac{(y_i - \alpha)^2 - \beta^2}{\Big( \beta^2 + (y_i - \alpha)^2 \Big)^2} & -4 \beta \sum_{i=1}^{n} \frac{y_i - \alpha}{\Big( \beta^2 + (y_i - \alpha)^2 \Big)^2} \\
  - 4 \beta \sum_{i=1}^{n} \frac{y_i - \alpha}{\Big( \beta^2 + (y_i - \alpha)^2 \Big)^2} & - \frac{n}{\beta^2} + 2 \sum_{i=1}^{n} \frac{\beta^2 - (y_i - \alpha)^2}{\Big( \beta^2 + (y_i - \alpha)^2 \Big)^2}
\end{pmatrix}
\end{split}
\end{equation*}
  • Observed information matrix:

    \begin{equation*}
J(\alpha, \beta) = - H(\alpha, \beta)
\end{equation*}
  • Expected information matrix:

    \begin{equation*}
I(\alpha, \beta) = - \mathbb{E}[H(\alpha, \beta)] = \frac{n}{2 \beta^2} \begin{pmatrix}
                                                         1 & 0 \\
                                                         0 & 1
                                                       \end{pmatrix}
\end{equation*}
  • Asymptotic variance-covariance matrix of $\hat{\boldsymbol{\theta}}$ is:

    \begin{equation*}
\text{Var}(\hat{\boldsymbol{\theta}}) = I^{-1}(\boldsymbol{\theta}) = \frac{2 \beta^2}{n} \begin{pmatrix}
                                                                                 1 & 0 \\
                                                                                 0 & 1
                                                                               \end{pmatrix}
\end{equation*}
  • Estimated standard errors:

    \begin{equation*}
\text{ESE}(\hat{\alpha}) = \text{ESE}(\hat{\beta}) = \frac{\sqrt{2} \hat{\beta}}{\sqrt{n}}
\end{equation*}
  • Covariance between our estimates:

    \begin{equation*}
\text{Cov}(\hat{\alpha}, \hat{\beta}) = 0
\end{equation*}
  • Asymptotic distribution:

    \begin{equation*}
  \hat{\boldsymbol{\theta}} \sim \text{BVN} \Bigg( \begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \begin{pmatrix} \frac{2 \beta^2}{n} & 0 \\ 0 & \frac{2 \beta^2}{n} \end{pmatrix} \Bigg)
\end{equation*}

    This is the distribution we would base a Wald test of $H_0 : \boldsymbol{\theta} = \boldsymbol{\theta}_0$ on, i.e.

    \begin{equation*}
  H_0 : \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} \alpha_0 \\ \beta_0 \end{pmatrix}
\end{equation*}
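Putting the pieces above together, a minimal sketch of Fisher's method of scoring for this model (mine; the true parameters, sample size and starting values are illustrative assumptions):

#+begin_src python
import numpy as np

rng = np.random.default_rng(9)
alpha_true, beta_true = 3.0, 2.0
# Cauchy(alpha, beta) draws via the inverse CDF: alpha + beta * tan(pi*(U - 1/2)).
y = alpha_true + beta_true * np.tan(np.pi * (rng.random(500) - 0.5))
n = len(y)

def score(a, b):
    d = b**2 + (y - a) ** 2
    return np.array([2 * np.sum((y - a) / d),        # dl/dalpha
                     n / b - np.sum(2 * b / d)])     # dl/dbeta

a, b = np.median(y), 1.0              # crude starting values
for _ in range(50):
    # Scoring update: theta <- theta + I^{-1} U, with I = n/(2 b^2) * I_2.
    step = (2 * b**2 / n) * score(a, b)
    a, b = a + step[0], b + step[1]

print(a, b)   # should be close to (alpha_true, beta_true)
#+end_src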

Asymptotics

Overview

Notation

  • For a measure $P$ on a measurable space $(\mathcal{X}, \mathcal{B})$ and a measurable function $f: \mathcal{X} \to \mathbb{R}^k$,

    \begin{equation*}
Pf := \int f \dd{P}
\end{equation*}
  • $\mathbb{P}_n$ denotes the empirical measure, such that

    \begin{equation*}
\mathbb{P}_n f = \frac{1}{n} \sum_{i=1}^{n} f(X_i)
\end{equation*}
  • $\mathbb{G}_n$ denotes the empirical process, i.e. the "centered & scaled" version of $\mathbb{P}_n$,

    \begin{equation*}
\mathbb{G}_n f = \sqrt{n} \big( \mathbb{P}_n f - Pf \big) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big( f(X_i) - \mathbb{E}_P [ f(X_i) ] \big)
\end{equation*}
  • $\mathbb{G}_P$ denotes a P-Brownian bridge
  • $z_{\alpha}$ denotes the upper α-quantile of a std. normal
  • $\chi_{n, \alpha}^2$ denotes the upper α-quantile of a χ²-distribution
  • $t_{n, \alpha}$ denotes the upper α-quantile of a t-distribution
  • $\ll$ denotes absolutely continuous
  • $\lhd$ denotes contiguous
  • $\lhd \ \rhd$ denotes mutually contiguous
  • $X_n \leadsto X$ denotes weak convergence, and if $X$ has distribution $L$, we also write $X_n \leadsto L$
  • For a given sequence of random variables $R_n$

    \begin{equation*}
\begin{split}
  X_n &= o_P(R_n)  \quad \text{means} \quad  X_n = Y_n R_n \text{ with } Y_n \overset{P}{\to} 0 \\
  X_n &= O_P(R_n)  \ \ \ \text{means} \quad  X_n = Y_n R_n \text{ with } Y_n = O_P(1)
\end{split}
\end{equation*}

    where $O_P(1)$ denotes the fact that a sequence is bounded in probability.

Stochastic convergence

van2000asymptotic

For any random vectors $X_n$ and $X$ the following statements are equivalent

  1. $P(X_n \le x) \to P(X \le x)$ for all continuity points of $x \mapsto P(X \le x)$
  2. $\mathbb{E} [f(X_n)] \to \mathbb{E}[f(X)]$ for all bounded, continuous functions $f$, i.e. $f \in C_b(\mathbb{R}^k)$
  3. $\mathbb{E} [f(X_n)] \to \mathbb{E}[f(X)]$ for all bounded, Lipschitz functions $f$
  4. $\liminf \mathbb{E}[f(X_n)] \ge \mathbb{E}[f(X)]$ for all nonnegative, continuous functions $f$
  5. $\liminf P(X_n \in G) \ge P(X \in G)$ for every open set $G$
  6. $\limsup P(X_n \in F) \le P(X \in F)$ for every closed set $F$
  7. $P(X_n \in B) \to P(X \in B)$ for all Borel sets $B$ with $P(X \in \partial B) = 0$, where $\partial B = \overline{B} - \mathring{B}$ is the boundary of $B$

The portmanteau lemma above is a very useful lemma for talking about convergence in distribution.

van2000asymptotic

Any random vector $X$ is tight: for every $\varepsilon > 0$

\begin{equation*}
\exists M \in \mathbb{R}: \quad P \big( \norm{X} > M \big) < \varepsilon
\end{equation*}

A set of random vectors $\left\{ X_{\alpha}: \alpha \in A \right\}$ is called uniformly tight if $M$ can be chosen the same for every $X_{\alpha}$: for every $\varepsilon > 0$

\begin{equation*}
\exists M: \quad \sup_{\alpha} P \big( \norm{X_{\alpha}} > M \big) < \varepsilon
\end{equation*}

Thus, there exists a compact set to which all $X_{\alpha}$ give probability "almost" one.

Another name for uniformly tight is bounded in probability.

van2000asymptotic

Let $X_n$ be random vectors in $\mathbb{R}^k$.

  1. If $X_n \leadsto X$ for some $X$, then $\left\{ X_n : n \in \mathbb{N} \right\}$ is uniformly tight.
  2. If $X_n$ is uniformly tight, then there exists a subsequence with $X_{n_j} \leadsto X$ as $j \to \infty$ for some $X$.

Prohorov's theorem states that every uniformly tight sequence contains a weakly converging subsequence. This basically generalizes the Heine-Borel theorem to random vectors.

Characteristic functions

Let $X_n$ and $X$ be random vectors in $\mathbb{R}^k$. Then

\begin{equation*}
X_n \leadsto X \quad \iff \quad \mathbb{E} \big[ e^{i t^T X_n} \big] \to \mathbb{E} \big[ e^{i t^T X} \big], \quad \forall t \in \mathbb{R}^k
\end{equation*}

Moreover, if $\mathbb{E} \big[ e^{it^T X_n} \big]$ converges pointwise to a function $\phi(t)$ that is continuous at zero, then $\phi$ is the characteristic function of a random vector $X$ and $X_n \leadsto X$.

Statistical Learning Theory

Notation

Binary classification

Notation

  • $(X_1, Y_1), \dots, (X_n, Y_n)$ denotes the training data for the algorithm
  • $Z_i = (X_i, Y_i)$ and $Z = (X, Y)$ for convenience
  • $\mathcal{G}$ denotes the class of binary classifiers considered, and thus

    \begin{equation*}
\mathcal{G} \subseteq \big\{ g: \mathcal{X} \to \left\{ -1, 1 \right\} \big\}
\end{equation*}

Symmetrization

  • Idea: replace the true risk $P f$ by an estimate $P_n' f$ computed on an independent set of data $Z_1', \dots, Z_n'$ (called a virtual or ghost sample)

For any $t > 0$ s.t. $nt^2 \ge 2$,

\begin{equation*}
\Pr \bigg[ \sup_{f \in \mathcal{F}} \big( P - P_n \big)f \ge t \bigg] \le 2 \Pr \bigg[ \sup_{f \in \mathcal{F}} \big( P_n' - P_n \big)f \ge t / 2 \bigg]
\end{equation*}

where

  • $P_n'$ denotes the empirical measure of the "ghost" sample $Z_1', \dots, Z_n'$

Stuff

For a class $\mathcal{F}$ of functions, the Rademacher average is defined as

\begin{equation*}
\mathcal{R}(\mathcal{F}) = \mathbb{E} \Big[ \sup_{f \in \mathcal{F}} R_n f \Big]
\end{equation*}

where

  • $R_n f$ is defined

    \begin{equation*}
R_n f := \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(Z_i)
\end{equation*}
  • $\big( \sigma_i \big)_{i = 1}^n$ are Rademacher random variables, i.e. i.i.d. such that $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = \frac{1}{2}$.
  • $\mathbb{E}$ denotes the expectation over all the rvs, i.e. $\left\{ \sigma_1, \dots, \sigma_n, Z_1, \dots, Z_n \right\}$

Similarly, the conditional Rademacher average (a Monte Carlo sketch of which is given at the end of this section) is defined as

\begin{equation*}
\mathcal{R}_n(\mathcal{F}) := \mathbb{E}_{\sigma} \Big[ \sup_{f \in \mathcal{F}} R_n f \Big]
\end{equation*}

where

  • $\mathbb{E}_{\sigma}$ denotes the expectation only over the rvs $\left\{ \sigma_1, \dots, \sigma_n \right\}$, conditioned on $\left\{ Z_1, \dots, Z_n \right\}$

For all $\delta > 0$, with probability at least $1 - \delta$

\begin{equation*}
\forall f \in \mathcal{F}, \quad P f \le P_n f + 2 \mathcal{R}(\mathcal{F}) + \sqrt{\frac{\log \big( \frac{1}{\delta} \big)}{2n}}
\end{equation*}

and also, with prob. at least $1 - \delta$,

\begin{equation*}
\forall f \in \mathcal{F}, \quad Pf \le P_n f + 2 \mathcal{R}_n (\mathcal{F}) + \sqrt{\frac{2 \log \big( \frac{2}{\delta} \big)}{n}}
\end{equation*}

where

  • True risk

    \begin{equation*}
Pf := \mathbb{E} \big[ f(X, Y) \big]
\end{equation*}
  • Empirical risk

    \begin{equation*}
P_n f := \frac{1}{n} \sum_{i=1}^{n} f(X_i, Y_i)
\end{equation*}
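A minimal sketch (mine; the data and the finite class of threshold classifiers are illustrative assumptions) estimating the conditional Rademacher average $\mathcal{R}_n(\mathcal{F})$ by Monte Carlo over $\sigma$:

#+begin_src python
import numpy as np

rng = np.random.default_rng(10)
Z = rng.normal(size=100)                       # fixed data Z_1, ..., Z_n
thresholds = np.linspace(-2, 2, 41)            # finite class F of thresholds t
# f_t(z) = 2 * 1{z <= t} - 1, so each row holds f(Z_1), ..., f(Z_n) in {-1, +1}.
F = (Z[None, :] <= thresholds[:, None]) * 2.0 - 1.0

n = len(Z)
M = 5000
sup_vals = np.empty(M)
for m in range(M):
    sigma = rng.choice([-1.0, 1.0], size=n)    # Rademacher rvs
    sup_vals[m] = np.max(F @ sigma / n)        # sup_{f in F} R_n f
print(sup_vals.mean())                         # ~ R_n(F), conditional on Z
#+end_src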

Topics in Statistical Theory

1.2 Estimating an arbitrary distribution function

Let

  • $\mathcal{F}$ denote the class of all distribution functions on $\mathbb{R}$
  • $X_1, \dots, X_n \sim F \in \mathcal{F}$
  • Empirical distribution $\hat{F}_n$ is given by

    \begin{equation*}
\hat{F}_n(x) = \hat{F}_n(x; X_1, \dots, X_n) := \frac{1}{n} \sum_{i=1}^{n} \1 \left\{ X_i \le x \right\}
\end{equation*}

We have

\begin{equation*}
\sup_{x \in \mathbb{R}} \left| \hat{F}_n(x) - F(x) \right| \overset{\text{a.s.}}{\to} 0
\end{equation*}

as $n \to \infty$.

Given $\varepsilon > 0$, let $k := \left\lceil \frac{1}{\varepsilon} \right\rceil$.

Set $x_0 := - \infty$, and

\begin{equation*}
x_i := \inf \left\{ x \in \mathbb{R} : F(x) \ge \frac{i}{k} \right\}, \quad i = 1, \dots, k - 1
\end{equation*}

and $x_k := \infty$.

Writing $F(x-) := \lim_{y \uparrow x} F(y)$, which exists since $F$ is nondecreasing, we note that

\begin{equation*}
F(x_i -) - F(x_{i - 1}) \le \frac{i}{k} - \frac{i - 1}{k} = \frac{1}{k} \le \varepsilon
\end{equation*}

Let

\begin{equation*}
\begin{split}
  \Omega_{n, \varepsilon} &:= \left\{ \max_{i = 1, \dots, k} \sup_{m \ge n} \left| \hat{F}_m(x_i) - F(x_i) \right| \le \varepsilon \right\} \\
  & \quad \cap \left\{ \max_{i = 1, \dots, k} \sup_{m \ge n} \left| \hat{F}_m(x_i -) - F(x_i - ) \right| \le \varepsilon \right\}
\end{split}
\end{equation*}

where

\begin{equation*}
\hat{F}_n(x-) = \frac{1}{n} \sum_{i = 1}^{n} \1 \left\{ X_i < x \right\}
\end{equation*}

(note that this uses a strict inequality). Hence, by a union bound, we have

\begin{equation*}
\begin{split}
  \mathbb{P}_F(\Omega_{n, \varepsilon}^{\complement}) &\le \sum_{i=1}^{k} \mathbb{P}_F \bigg( \sup_{m \ge n} \left| \hat{F}_m(x_i) - F(x_i) \right| > \varepsilon \bigg) \\
  & \quad + \sum_{i = 1}^{k} \mathbb{P}_F \bigg( \sup_{m \ge n} \left| \hat{F}_m(x_i-) - F(x_i -) \right| > \varepsilon \bigg)
\end{split}
\end{equation*}

By the SLLN, the terms inside the absolute values go to zero a.s. as $m \to \infty$, therefore implying that

\begin{equation*}
\mathbb{P}_F \Big( \Omega_{n, \varepsilon}^{\complement} \Big) \to 0 \quad \text{as} \quad n \to \infty
\end{equation*}

Now fix $x \in \mathbb{R}$ and find $i \in \left\{ 1, \dots, k \right\}$ s.t. $x \in [x_{i - 1}, x_i)$. Then for any $n_0 \in \mathbb{N}$ and $n \ge n_0$,

\begin{equation*}
\begin{split}
  \hat{F}_n(x) - F(x) &\le \hat{F}_n(x_i -) - F(x_{i - 1}) \\
  & \le \hat{F}_n(x_i-) - F(x_i-) + \varepsilon \\
  & \le \max_{i = 1, \dots, k} \sup_{m \ge n_0} \left| \hat{F}_m(x_i-) - F(x_i - ) \right| + \varepsilon
\end{split}
\end{equation*}

where in the second inequality we use the fact that $\left| F(x_i -) - F(x_{i - 1}) \right| \le \varepsilon$.

Similarly, for the other direction, we have

\begin{equation*}
\begin{split}
  F(x) - \hat{F}_n(x) &\le F(x_i -) - \hat{F}_n(x_{i - 1}) \\
  & \le \varepsilon + F(x_{i - 1}) - \hat{F}_n(x_{i - 1}) \\
  & \le \varepsilon + \max_{i = 1, \dots, k} \sup_{m \ge n_0} \left| \hat{F}_m(x_{i-1}) - F(x_{i - 1}) \right|
\end{split}
\end{equation*}

From the two equations above, it follows that

\begin{equation*}
\mathbb{P}_F \bigg( \sup_{m \ge n} \sup_{x \in \mathbb{R}} \left| \hat{F}_m(x) - F(x) \right| > 2 \varepsilon \bigg) \le \mathbb{P}_F \big( \Omega_{n, \varepsilon}^{\complement} \big)
\end{equation*}

since either we have

  1. $\hat{F}_n(x) > F(x)$: in which case the event on the LHS can occur only if

    \begin{equation*}
\max_{i = 1, \dots, k} \sup_{m \ge n_0} \left| \hat{F}_m(x_i-) - F(x_i-) \right| > \varepsilon
\end{equation*}
  2. $\hat{F}_n(x) < F(x)$: in which case the event on the LHS can occur only if

    \begin{equation*}
\max_{i = 1, \dots, k} \sup_{m \ge n_0} \left| \hat{F}_m(x_{i-1}) - F(x_{i-1}) \right| > \varepsilon
\end{equation*}

Hence the probability of such an event is upper-bounded by the RHS $\mathbb{P}_F \big( \Omega_{n, \varepsilon}^{\complement} \big)$.

We therefore conclude that

\begin{equation*}
\begin{split}
  &\mathbb{P}_F \bigg( \sup_{x \in \mathbb{R}} \left| \hat{F}_n(x) - F(x) \right| \to 0 \bigg) \\
  &= \mathbb{P}_F \bigg( \bigcap_{L = 1}^{\infty} \bigcup_{n = 1}^{\infty} \bigcap_{m \ge n}^{} \left\{ \sup_{x \in \mathbb{R}} \left| \hat{F}_m(x) - F(x)  \right| \le \frac{1}{L} \right\} \bigg)
\end{split}
\end{equation*}

i.e. for all $L \in \mathbb{N}$, $\exists n \in \mathbb{N}$ such that $\forall m \ge n$, the difference is less than $\frac{1}{L}$.

In particular, note that the probability on the RHS is:

  1. Decreasing in $L$ (since the event becomes smaller as $L$ increases)
  2. Increasing in $n$ (since the "number of" $m \ge n$ we're taking the intersection for decreases)

This means that we have

\begin{equation*}
\begin{split}
  &amp;\mathbb{P}_F \bigg( \sup_{x \in \mathbb{R}} \left| \hat{F}_n(x) - F(x) \right| \to \0 \bigg) \\
  &amp;= \lim_{L \to \infty} \lim_{n \to \infty} \mathbb{P}_F \bigg( \sup_{m \ge n} \sup_{x \in \mathbb{R}} \left| \hat{F}_m(x) - F(x) \right| \le \frac{1}{L} \bigg) \\
  &amp;= 1
\end{split}
\end{equation*}

as we wanted.

1.3 The delta method

Let

  • $\big( Y_n \big)$ be a sequence of random vectors in $\mathbb{R}^d$

Suppose $\exists \mu \in \mathbb{R}^d$ and a random vector $Z$ such that

\begin{equation*}
\sqrt{n} \big( Y_n - \mu \big) \overset{d}{\to} Z
\end{equation*}

as $n \to \infty$.

Furthermore, suppose that $g: \mathbb{R}^d \to \mathbb{R}$ is differentiable at $\mu$.

Then

\begin{equation*}
\sqrt{n} \Big( g(Y_n) - g(\mu) \Big) \overset{d}{\to} \nabla g(\mu)^T Z
\end{equation*}

This can be seen by expanding $g$ in a Taylor expansion at $\mu$.
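As a quick worked example (mine, not from the notes): take $d = 1$, suppose $\sqrt{n} \big( \bar{X}_n - \mu \big) \overset{d}{\to} Z \sim \mathcal{N}(0, \sigma^2)$, and let $g(x) = x^2$, so that $\nabla g(\mu) = 2 \mu$. Then

\begin{equation*}
\sqrt{n} \big( \bar{X}_n^2 - \mu^2 \big) \overset{d}{\to} \mathcal{N}(0, 4 \mu^2 \sigma^2)
\end{equation*}

provided $\mu \ne 0$; at $\mu = 0$ the first-order term vanishes and a second-order expansion is needed.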

1.4. Concentration Inequalities

We say the random variable $X$ is sub-Gaussian with parameter $\sigma^2$ if $\mathbb{E}[X] = 0$ and its MGF satisfies

\begin{equation*}
\mathbb{E} \big[ e^{tX} \big] \le e^{t^2 \sigma^2 / 2}, \quad \forall t \in \mathbb{R}
\end{equation*}

Let $X$ be a sub-Gaussian rv. with parameter $\sigma^2$.

Then

\begin{equation*}
\mathbb{P} \big( X \ge x \big) \le e^{- \frac{x^2}{2 \sigma^2}}
\end{equation*}

For any $t \in \mathbb{R}_{ + }$, we have

\begin{equation*}
\mathbb{P} \big( X \ge x \big) = \mathbb{P} \big( e^{tX} \ge e^{tx} \big)
\end{equation*}

By Markov's inequality we have

\begin{equation*}
\begin{split}
  \mathbb{P} \big( X \ge x \big) &\le \frac{\mathbb{E} \big[ e^{tX} \big]}{e^{t x}} \\
  &\le e^{- tx} e^{t^2 \sigma^2 / 2}
\end{split}
\end{equation*}

where in the second inequality we used the fact that $X$ is sub-Gaussian with parameter $\sigma^2$. Since this holds for any $t \in \mathbb{R}_{ + }$,

\begin{equation*}
\mathbb{P} \big( X \ge x \big) \le \inf_{t \in \mathbb{R}_{ + }} e^{- tx} e^{t^2 \sigma^2 / 2}
\end{equation*}

The RHS attains its minimum at $t = \frac{x}{\sigma^2}$ (minimize the exponent $-tx + t^2 \sigma^2 / 2$ over $t$), thus

\begin{equation*}
\mathbb{P} \big( X \ge x \big) \le e^{- \frac{x^2}{2 \sigma^2}}
\end{equation*}

as claimed!

Let $X_1, \dots, X_n$ be independent r.v.s. with

  • $\mathbb{E} [X_i] = 0$
  • finite variances
  • $X_i \le b$ for some $b > 0$

Moreover, let

  • $S := \sum_{i=1}^{n} X_i$
  • $\nu := \frac{1}{n} \sum_{i = 1}^{n} \Var(X_i)$
  • $\phi: \mathbb{R} \to \mathbb{R}$ defined by

    \begin{equation*}
\phi(u) := e^u - 1 - u
\end{equation*}

Then, for every $t > 0$,

\begin{equation*}
\log \mathbb{E} \big[ e^{t S} \big] \le \frac{n \nu}{b^2} \phi(b t)
\end{equation*}

Consequently, defining

\begin{equation*}
\begin{split}
  h: \quad & (0, \infty) \to (0, \infty) \\
  & u \mapsto (1 + u) \log (1 + u) - u
\end{split}
\end{equation*}

we have that for $x \ge 0$,

\begin{equation*}
\mathbb{P} \bigg( \frac{S}{n} \ge x \bigg) \le \exp \Bigg( - \frac{n \nu}{ b^2} h \bigg( \frac{b x}{\nu} \bigg) \Bigg)
\end{equation*}

Define

\begin{equation*}
\begin{split}
  g: \quad & \mathbb{R} \to \mathbb{R} \\
  & u \mapsto \begin{cases}
                u^{-2} \phi(u) \quad & \text{if } u \ne 0 \\
                \frac{1}{2} \quad & \text{if } u = 0
              \end{cases}
\end{split}
\end{equation*}

where the value at $u = 0$ is simply chosen to ensure continuity on all of $\mathbb{R}$.

Then $g$ is increasing, so for $t > 0$ and $x \le b$,

\begin{equation*}
e^{t x} - 1 - tx \le x^2 \frac{\big( e^{bt} - 1 - bt \big)}{b^2} = x^2 \frac{\phi(bt)}{b^2}
\end{equation*}

Hence, for $t &gt; 0$,

\begin{equation*}
\begin{split}
  \log \mathbb{E} \big[ e^{t S} \big] &amp;= \sum_{i = 1}^{n} \log \mathbb{E} \big[ e^{t X_i} \big] \\
  &amp;\le \sum_{i = 1}^{n} \log \bigg( 1 + \mathbb{E}\big[ X_i^2 \big] \frac{\phi(bt)}{b^2} \bigg) \\
  &amp;\le n \log \bigg( 1 + \frac{\nu \phi(bt)}{b^2} \bigg) \\
  &amp;\le n \frac{\nu \phi(bt)}{b^2}
\end{split}
\end{equation*}

where in the

  • second to last inequality we have multiplied by $\frac{n}{n}$ and used Jensen's inequality,
  • final inequality we used the fact that $1 + u \le e^{u}$ for $u > 0$.

Consequently, by a Chernoff bound,

\begin{equation*}
\begin{split}
  \mathbb{P} \bigg( \frac{S}{n} \ge x \bigg) &amp;\le \inf_{t &gt; 0} \exp \bigg( -nt x + \frac{n \nu}{b^2} \phi(bt) \bigg) \\
  &amp;= \exp \bigg( - \frac{n \nu}{b^2} h \bigg( \frac{b x}{\nu} \bigg) \bigg)
\end{split}
\end{equation*}

since the infimum is attained at $t = \frac{1}{b} \log \big(1  + \frac{bx}{\nu} \big)$.

Bibliography