Nonparametric Bayes
Concepts
- Infinitely exchangeable
- the order of the data does not matter for the joint distribution.
Beta distribution
![\begin{equation*}
Beta(\rho_1 | \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \rho_1^{\alpha_1 - 1} (1- \rho_1)^{\alpha_2 - 1}
\end{equation*}](../../assets/latex/nonparameteric_bayes_c1f5a9e3a5f78eafa956a8f96520e4e6774d9b49.png)
Overview
- Distribution over the parameter of a binomial distribution!
- So in a sense you're "drawing distributions"
- Like to think of it as putting random-variable parameters on the model itself, instead of going straight for estimating the success probability in a binomial distribution.
- Remember the Γ function is a factorial, Γ(n) = (n - 1)!, when n is a positive integer.
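As a minimal sketch (using only Python's standard library; the function name is illustrative, not from the lecture), drawing a Beta-distributed parameter and then flipping coins with it looks like:

```python
import random

def draw_binomial_parameter(alpha_1, alpha_2):
    """Draw rho_1 ~ Beta(alpha_1, alpha_2): a random success probability."""
    return random.betavariate(alpha_1, alpha_2)

random.seed(0)
rho_1 = draw_binomial_parameter(2.0, 3.0)      # "drawing a distribution"
flips = [1 if random.random() < rho_1 else 0   # Bernoulli draws given rho_1
         for _ in range(10)]
```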
Dirichlet distribution
![\begin{equation*}
Dirichlet(\rho_{1:K} | \alpha_{1:K}) = \frac{\Gamma (\overset{K}{\underset{k=1}{\sum}} \alpha_k)}{\overset{K}{\underset{k=1}{\prod}} \Gamma(\alpha_k)} \overset{K}{\underset{k=1}{\prod}} \rho_k^{\alpha_k - 1}
\end{equation*}](../../assets/latex/nonparameteric_bayes_47b5d8e43d14b017180cf3c4dc12991fdccc3d16.png)
Overview
- Generalization of the Beta distribution, i.e. over multiple categorical variables, i.e. a distribution over the parameters of a multinomial distribution.
- So if we were to plot the Dirichlet distribution of some parameters,
we obtain the simplex/surface of allowed values for these parameters
- "Allowed" meaning that they satisfy being a probability within the multinomial
model, i.e. each ρ_k ≥ 0 and the ρ_k sum to 1
- Has nice conjugacy properties: it is the conjugate prior of multinomial distributions, so the posterior is again a Dirichlet
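A common way to draw from a Dirichlet, sketched here with Python's standard library (no claim that this is how any particular package implements it), is to normalize independent Gamma draws:

```python
import random

def draw_dirichlet(alphas):
    """Draw rho_{1:K} ~ Dirichlet(alphas) by normalizing Gamma(alpha_k, 1) draws."""
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

rho = draw_dirichlet([1.0, 2.0, 3.0])  # a point on the 2-simplex
```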
Generating Dirichlet from Beta
We can recover a Beta by marginalizing a Dirichlet over all but one component:
![\begin{equation*}
\rho_1 \overset{d}{=} \text{Beta}\Big(\alpha_1, \overset{K}{\underset{k=2}{\sum}} \alpha_k \Big) \implies
\frac{(\rho_2, ..., \rho_K)}{1 - \rho_1} \overset{d}{=} \text{Dirichlet}(\alpha_2, ..., \alpha_K)
\end{equation*}](../../assets/latex/nonparameteric_bayes_76f8bd033f52502d465d8716ce696bdcfe3e40b4.png)
This is what we call stick breaking.
![\begin{equation*}
\begin{alignat*}{2}
V_1 &\sim \text{Beta}(\alpha_1, \alpha_2 + \alpha_3 + \alpha_4), &\quad \rho_1 = V_1 \\
V_2 &\sim \text{Beta}(\alpha_2, \alpha_3 + \alpha_4), &\quad \rho_2 = (1 - V_1) V_2 \\
V_3 &\sim \text{Beta}(\alpha_3, \alpha_4), &\quad \rho_3 = (1 - V_1) (1 - V_2) V_3 \\
& \implies & \rho_4 = 1 - \overset{3}{\underset{k=1}{\sum}} \rho_k
\end{alignat*}
\end{equation*}](../../assets/latex/nonparameteric_bayes_4e5430768774db3ec03c1e2f670b3c8691f64472.png)
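The stick-breaking steps above can be sketched for general K (a rough Python illustration; `stick_break_dirichlet` is a made-up name):

```python
import random

def stick_break_dirichlet(alphas):
    """Draw rho ~ Dirichlet(alphas) via stick breaking:
    V_k ~ Beta(alpha_k, alpha_{k+1} + ... + alpha_K)."""
    rho, remaining = [], 1.0
    for k in range(len(alphas) - 1):
        v = random.betavariate(alphas[k], sum(alphas[k + 1:]))
        rho.append(remaining * v)       # rho_k = (1-V_1)...(1-V_{k-1}) V_k
        remaining *= (1.0 - v)
    rho.append(remaining)               # rho_K = 1 - sum of the others
    return rho

rho = stick_break_dirichlet([2.0, 3.0, 4.0, 5.0])
```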
Dirichlet process
Overview
- Taking the number of parameters K to infinity.
- Allows an arbitrary number of clusters ⇒ the number of clusters can grow with the data
Taking ![$k = \infty$](../../assets/latex/nonparameteric_bayes_06790f088863f63335c5abf8f7539f44823b6547.png)
We do what we did in Generating Dirichlet from Beta, the "stick breaking". But in the Dirichlet process stick breaking we set
![\begin{equation*}
a_k = 1, b_k = \alpha > 0
\end{equation*}](../../assets/latex/nonparameteric_bayes_e63b85cb1fab961e7cb74a7fc98c7e85a4a16c94.png)
And then we just continue doing this, drawing as follows:
![\begin{equation*}
V_k \sim \text{Beta}(a_k, b_k), \qquad
\rho_k = \Bigg[ \overset{k - 1}{\underset{j=1}{\prod}} (1 - V_j) \Bigg] V_k
\end{equation*}](../../assets/latex/nonparameteric_bayes_5c39a18f6158e36f12245f0b87c518bfe7c87e4a.png)
The resulting distribution of ρ is then
![\begin{equation*}
\rho = \big( \rho_1, \rho_2, \dots \big) \sim \text{GEM}(\alpha)
\end{equation*}](../../assets/latex/nonparameteric_bayes_e53a492bd44e3603413c2fe115dcbdec3a08aa3b.png)
which is called the Griffiths-Engen-McCloskey (GEM) distribution.
To obtain a Dirichlet process we then do:
![\begin{equation*}
\begin{split}
\rho &= (\rho_1, \rho_2, ...) \sim GEM(\alpha) \\
\phi_k &\overset{iid}{\sim} G_0 \\
G &= \overset{\infty}{\underset{k=1}{\sum}} \rho_k \delta_{\phi_k}
\end{split}
\end{equation*}](../../assets/latex/nonparameteric_bayes_ccecaed7e8e9575b4a0481956e9168d143ae0829.png)
where G_0 can be any probability measure.
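Putting the definition above together in code, a truncated draw of G can be sketched as follows (an approximation: a real DP draw has infinitely many atoms; names here are illustrative):

```python
import random

def draw_dp(alpha, g0, truncation=100):
    """Truncated draw of G ~ DP(alpha, G_0).

    Returns (weights, atoms): G is approximately sum_k weights[k] * delta_{atoms[k]}.
    """
    weights, remaining = [], 1.0
    for _ in range(truncation):
        v = random.betavariate(1.0, alpha)  # V_k ~ Beta(1, alpha)
        weights.append(remaining * v)
        remaining *= (1.0 - v)
    atoms = [g0() for _ in weights]         # phi_k ~iid~ G_0
    return weights, atoms

weights, atoms = draw_dp(1.0, lambda: random.gauss(0.0, 1.0))
```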
Dirichlet process mixture model
Start out with Gaussian Mixture Model
![\begin{equation*}
\begin{split}
\rho &= (\rho_1, \rho_2, ...) \sim GEM(\alpha) \\
\mu_k &\overset{iid}{\sim} \mathcal{N}(\mu_0, \Sigma_0), \quad k = 1, 2, ...
\end{split}
\end{equation*}](../../assets/latex/nonparameteric_bayes_886f872a53119b8a4b0930a018a82b8f298291a1.png)
where μ_0 and Σ_0 are our priors of the Gaussian clusters. This is the same as saying μ_k ~ G_0 with G_0 = N(μ_0, Σ_0).
So, G is a sum over Dirac deltas and will therefore only take non-zero
values where its argument corresponds to some μ_k. That is, it
indexes the probabilities: it describes the probability ρ_k of each
cluster k being assigned.
![\begin{equation*}
\begin{split}
z_n &\overset{iid}{\sim} Categorical( \rho ) \\
\mu_n^* &= \mu_{z_n}
\end{split}
\end{equation*}](../../assets/latex/nonparameteric_bayes_38d202a7b58f863b7ce23fb1cfb93399e9aed9b5.png)
i.e. μ_n^* = μ_{z_n}, which means that drawing a cluster assignment z_n for our
nth data point, where the drawn cluster has mean μ_{z_n}, is equivalent to drawing
the mean itself from G.
![\begin{equation*}
x_n \overset{indep}{\sim} \mathcal{N}(\mu_n^*, \Sigma)
\end{equation*}](../../assets/latex/nonparameteric_bayes_dd28525f4d8b36b5fdaacfecef1d20fcd531668a.png)
i.e. the nth data point is then drawn from a normal distribution with
the sampled mean μ_n^* and some covariance Σ.
The shape / covariance could also be made dependent on the cluster if we wanted
to make the model a bit more complex; we would just have to add a draw for
Σ_k in our model.
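The full generative story above can be sketched in Python. Assumptions in this sketch: 1-D Gaussians, a fixed truncation level for the GEM weights, and illustrative names:

```python
import random

def draw_dp_mixture_data(n, alpha, mu0=0.0, sigma0=1.0, sigma=0.3, truncation=100):
    """Generate n points from a truncated DP mixture of 1-D Gaussians."""
    # rho ~ GEM(alpha), truncated; leftover mass lumped into a final component
    rho, remaining = [], 1.0
    for _ in range(truncation):
        v = random.betavariate(1.0, alpha)
        rho.append(remaining * v)
        remaining *= (1.0 - v)
    rho.append(remaining)
    mus = [random.gauss(mu0, sigma0) for _ in rho]  # cluster means, mu_k ~ G_0
    data, assignments = [], []
    for _ in range(n):
        z = random.choices(range(len(rho)), weights=rho)[0]  # z_n ~ Categorical(rho)
        assignments.append(z)
        data.append(random.gauss(mus[z], sigma))             # x_n ~ N(mu_{z_n}, sigma^2)
    return data, assignments

data, assignments = draw_dp_mixture_data(50, 1.0)
```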
Lecture 2
Notation
ρ = (ρ_1, ρ_2, …), which sums to 1 with probability one.
δ_φ is the Dirac delta for the element φ
Stuff
GEM(α) can be described as follows:
- Take a stick of length 1
- "Break" the stick at the point corresponding to V_1: ρ_1 = V_1
- "Break" the rest of the stick by V_2: ρ_2 = (1 - V_1) V_2
- "Break" the rest of the stick: ρ_3 = (1 - V_1)(1 - V_2) V_3
- …
We let φ_k ~ G_0, where G_0
is some underlying distribution.
Then we define the random variable G = ∑_{k=1}^∞ ρ_k δ_{φ_k}, where δ_φ
is the Dirac delta for the element φ
- The φ_k can even be functions, if G_0
is a distribution on a separable Banach space!
Then G ~ DP(α, G_0), where DP
denotes a Dirichlet process.
Observe that G defines a measure!
Hence a DP is basically a distribution over measures!
So we have a random measure, where the σ-algebra is induced by the original σ-algebra on the underlying space.
There's a very interesting property of this distribution.
Suppose we have Brownian motion, and consider the record points (i.e. each new "highest" or "lowest" peak); then the times between these new peaks follow a distribution of this kind!
We say that a sequence of random variables is infinitely exchangeable if and only if there exists a unique random measure G
such that
![\begin{equation*}
\begin{split}
P \Big( X_1 \in A_1, \dots, X_N \in A_N \Big) &= P\Big(X_{\sigma(1)} \in A_1, X_{\sigma(2)} \in A_2, \dots, X_{\sigma(N)} \in A_N \Big) \\
&= \int \prod_{i=1}^{N} G(A_i) \ P(d G)
\end{split}
\end{equation*}](../../assets/latex/nonparameteric_bayes_29dd17688a078a8f4a1a59fc18f0169f161c04fd.png)
Then observe that what's known as the Chinese restaurant process is just our previous model where we've marginalized over G!
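A sketch of the Chinese restaurant process itself (the standard construction; variable names are mine): customer i joins an existing table with probability proportional to its occupancy, or opens a new table with probability proportional to α:

```python
import random

def chinese_restaurant_process(n, alpha):
    """Sample table assignments for n customers under CRP(alpha)."""
    assignments = []       # assignments[i] = table index of customer i
    table_counts = []      # table_counts[t] = number of customers at table t
    for _ in range(n):
        # weight c_t for existing table t, weight alpha for a brand-new table
        weights = table_counts + [alpha]
        t = random.choices(range(len(weights)), weights=weights)[0]
        if t == len(table_counts):
            table_counts.append(1)   # open a new table
        else:
            table_counts[t] += 1
        assignments.append(t)
    return assignments

tables = chinese_restaurant_process(20, 1.0)
```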
Dirichlet as a GEM
Suppose G is constructed from a GEM distribution as above, and let A_1, …, A_k be a finite partition of the space.
Then,
![\begin{equation*}
\Big( G(A_1), \dots, G(A_k) \Big) \sim \text{Dir} \Big( \alpha G_0(A_1), \dots, \alpha G_0(A_k) \Big)
\end{equation*}](../../assets/latex/nonparameteric_bayes_43a10ec4b8c1336aff880f6d526a45c94ff604a0.png)
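One way to see this property in code (a rough sketch, assuming G_0 = Uniform[0, 1) and a truncated draw of G; `dp_mass_on_partition` is a made-up helper) is to evaluate G on a finite partition:

```python
import random

def dp_mass_on_partition(alpha, boundaries, truncation=200):
    """Evaluate a truncated draw of G ~ DP(alpha, Uniform[0,1)) on the
    partition cut at `boundaries`, e.g. [0.5] -> A_1=[0,0.5), A_2=[0.5,1)."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        v = random.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= (1.0 - v)
    atoms = [random.random() for _ in weights]  # phi_k ~ Uniform[0, 1)
    edges = [0.0] + list(boundaries) + [1.0]
    return [sum(w for w, a in zip(weights, atoms) if lo <= a < hi)
            for lo, hi in zip(edges, edges[1:])]

# (G(A_1), G(A_2)); by the property above, approximately Dir(alpha/2, alpha/2)
masses = dp_mass_on_partition(1.0, [0.5])
```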
Stochastic process on a σ-algebra.
A completely random measure is a random measure such that its values on disjoint sets are independent:
![\begin{equation*}
G(A_i) \perp \!\!\! \perp G(A_j), \qquad A_i \cap A_j = \emptyset
\end{equation*}](../../assets/latex/nonparameteric_bayes_0f4d91d5652ce25c58dc3ce800092400ae372e14.png)
Appendix A: Vocabulary
- categorical distribution
- distribution with some probability ρ_k
for the class/label indexed by k.
So a multinomial distribution with a single trial.
- random measure
- a measure-valued random variable, such as the G above.