Nonparametric Bayes

Concepts

Infinitely exchangeable
The order of the data does not matter for the joint distribution.

Beta distribution

\begin{equation*}
      Beta(\rho_1 | \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \rho_1^{\alpha_1 - 1} (1- \rho_1)^{\alpha_2 - 1}
\end{equation*}

Overview

  • Distribution over the parameter of a binomial distribution!
    • So in a sense you're "drawing distributions"
    • I like to think of it as putting random-variable parameters on the model itself, instead of going straight for a point estimate of $\rho_1$ in a binomial distribution.
  • Remember that the $\Gamma$ function satisfies $\Gamma(n) = (n - 1)!$ when $n$ is a positive integer.
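
A minimal numpy sketch of "drawing distributions" (the α values and sample sizes here are arbitrary, chosen just for illustration): each Beta draw is itself the parameter of a binomial.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha1, alpha2 = 2.0, 5.0            # arbitrary prior pseudo-counts

    # Each Beta draw is itself the success probability of a binomial:
    # we are "drawing distributions".
    rho1 = rng.beta(alpha1, alpha2, size=3)
    for p in rho1:
        x = rng.binomial(n=10, p=p)      # data generated under that parameter
        print(f"rho_1 = {p:.3f} -> {x} successes out of 10")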

Dirichlet distribution

\begin{equation*}
      Dirichlet(\rho_{1:K} | \alpha_{1:K}) = \frac{\Gamma (\overset{K}{\underset{k=1}{\sum}} \alpha_k)}{\overset{K}{\underset{k=1}{\prod}} \Gamma(\alpha_k)} \overset{K}{\underset{k=1}{\prod}} \rho_k^{\alpha_k - 1}
\end{equation*}

Overview

  • Generalization of the Beta distribution to multiple categories, i.e. a distribution over the parameters of a multinomial distribution.
  • So if we were to plot the Dirichlet distribution of some parameters $\rho_1, \rho_2, \rho_3$, we would obtain the simplex/surface of allowed values for these parameters
    • "Allowed" meaning that they form a valid probability vector for the multinomial model, i.e. $\overset{3}{\underset{k=1}{\sum}} \rho_k = 1, \quad \rho_k \geq 0$
  • Has nice conjugacy properties: it is the conjugate prior for the multinomial (and categorical) distribution.
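
A quick numpy illustration (concentration parameters arbitrary): every Dirichlet draw lands on the probability simplex.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([1.0, 2.0, 3.0])     # arbitrary concentration parameters

    rho = rng.dirichlet(alpha, size=5)    # each row is (rho_1, rho_2, rho_3)
    print(rho)
    print(rho.sum(axis=1))                # every row sums to 1: a point on the simplex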

Generating Dirichlet from Beta

We can draw $\rho_1$ from a Beta by marginalizing over $\rho_2, ..., \rho_K$

\begin{equation*}
 \rho_1 \overset{d}{=} \text{Beta}\Big( \alpha_1, \overset{K}{\underset{k=2}{\sum}} \alpha_k \Big), \qquad
     \frac{(\rho_2, \dots, \rho_K)}{1 - \rho_1} \overset{d}{=} \text{Dirichlet}(\alpha_2, \dots, \alpha_K)
\end{equation*}

This is what we call stick breaking. For example, with $K = 4$:

\begin{equation*}
\begin{alignedat}{2}
V_1 &\sim \text{Beta}(\alpha_1, \alpha_2 + \alpha_3 + \alpha_4), &\quad \rho_1 &= V_1 \\
V_2 &\sim \text{Beta}(\alpha_2, \alpha_3 + \alpha_4), &\quad \rho_2 &= (1 - V_1) V_2 \\
V_3 &\sim \text{Beta}(\alpha_3, \alpha_4), &\quad \rho_3 &= (1 - V_1) (1 - V_2) V_3 \\
& &\quad \rho_4 &= 1 - \overset{3}{\underset{k=1}{\sum}} \rho_k
\end{alignedat}
\end{equation*}
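
The construction above translates directly into code. A sketch (the function name is mine) that draws a Dirichlet via sequential Beta draws, for any $K$:

    import numpy as np

    def dirichlet_via_stick_breaking(alpha, rng):
        """Draw rho ~ Dirichlet(alpha) by breaking off one piece at a time."""
        K = len(alpha)
        rho = np.empty(K)
        remaining = 1.0                          # length of stick still unbroken
        for k in range(K - 1):
            v = rng.beta(alpha[k], alpha[k + 1:].sum())
            rho[k] = remaining * v               # rho_k = prod_{j<k}(1 - V_j) * V_k
            remaining *= 1.0 - v
        rho[K - 1] = remaining                   # rho_K = 1 - sum of the rest
        return rho

    rng = np.random.default_rng(0)
    print(dirichlet_via_stick_breaking(np.array([1.0, 2.0, 3.0, 4.0]), rng))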

Dirichlet process

Overview

  • Taking the number of components $K$ to $\infty$.
  • Allows an arbitrary number of clusters => the number of clusters used can grow with the data

Taking $K = \infty$

We do what we did in Generating Dirichlet from Beta, the "stick breaking". But in the Dirichlet process stick breaking we set

\begin{equation*}
a_k = 1, b_k = \alpha > 0
\end{equation*}

And then we just continue doing this, drawing $\rho_k$ as follows:

\begin{equation*}
V_k \sim \text{Beta}(a_k, b_k), \qquad
\rho_k = \Bigg[ \overset{k - 1}{\underset{j=1}{\prod}} (1 - V_j) \Bigg] V_k
\end{equation*}

The resulting distribution of the $\rho_k$ is then

\begin{equation*}
\rho = \big( \rho_1, \rho_2, \dots \big) \sim \text{GEM}(\alpha)
\end{equation*}

where $\text{GEM}(\alpha)$ denotes the Griffiths-Engen-McCloskey (GEM) distribution.
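
In code we can only break the stick finitely many times, so here is a sketch of a truncated $\text{GEM}(\alpha)$ draw (the truncation level is an assumption of the sketch, not part of the definition):

    import numpy as np

    def gem_weights(alpha, K_trunc, rng):
        """Truncated draw of rho ~ GEM(alpha): stick breaking with Beta(1, alpha)."""
        v = rng.beta(1.0, alpha, size=K_trunc)
        # rho_k = V_k * prod_{j<k} (1 - V_j)
        return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

    rng = np.random.default_rng(0)
    rho = gem_weights(alpha=2.0, K_trunc=1000, rng=rng)
    print(rho[:5], rho.sum())   # the sum approaches 1 as K_trunc grows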

To obtain a Dirichlet process we then do:

\begin{equation*}
\begin{split}
\rho &= (\rho_1, \rho_2, ...) \sim GEM(\alpha) \\
\phi_k &\overset{iid}{\sim} G_0 \\
G &= \overset{\infty}{\underset{k=1}{\sum}} \rho_k \delta_{\phi_k}
\end{split}
\end{equation*}

where the base measure $G_0$ can be any probability measure; the resulting $G$ is a random discrete probability measure, and we write $G \sim \text{DP}(\alpha, G_0)$.
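
A sketch of a truncated draw $G \sim \text{DP}(\alpha, G_0)$, with $G_0$ chosen as a standard normal purely for illustration; note that sampling from $G$ means picking one of its atoms:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, K_trunc = 2.0, 1000

    # rho ~ GEM(alpha), truncated stick breaking as above
    v = rng.beta(1.0, alpha, size=K_trunc)
    rho = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

    # atoms phi_k drawn i.i.d. from the base measure G_0 (here: standard normal)
    phi = rng.normal(0.0, 1.0, size=K_trunc)

    # G = sum_k rho_k * delta_{phi_k}; a draw from G picks an atom by weight
    samples = rng.choice(phi, size=10, p=rho / rho.sum())
    print(samples)              # repeated values appear because G is discrete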

Dirichlet process mixture model

Start out with Gaussian Mixture Model

\begin{equation*}
\begin{split}
     \rho &= (\rho_1, \rho_2, ...) \sim GEM(\alpha) \\
     \mu_k &\overset{iid}{\sim} \mathcal{N}(\mu_0, \Sigma_0), \quad k = 1, 2, ...
\end{split}
\end{equation*}

Here $\mu_0$ and $\Sigma_0$ are the parameters of our prior on the Gaussian cluster means. This is the same as saying $G = \overset{\infty}{\underset{k = 1}{\sum}} \rho_k \delta_{\mu_k} \overset{d}{=} DP(\alpha, \mathcal{N}(\mu_0, \Sigma_0))$.

So $G$ is a sum of Dirac deltas, and hence only puts mass on the atoms $\mu_k$, assigning probability $\rho_k$ to the atom $\mu_k$. In other words, it describes the probability of each cluster $k$ being assigned.

\begin{equation*}
\begin{split}
z_n &\overset{iid}{\sim} Categorical( \rho ) \\
\mu_n^* &= \mu_{z_n}
\end{split}
\end{equation*}

i.e. $\mu_n^* \overset{iid}{\sim} G$, which means that drawing a cluster assignment for the nth data point, where the drawn cluster has mean $\mu_{z_n}$, is equivalent to drawing the mean itself from $G$.

\begin{equation*}
 x_n \overset{indep}{\sim} \mathcal{N}(\mu_n^*, \Sigma)
\end{equation*}

i.e. the nth data point is then drawn from a normal distribution with the sampled mean $\mu_n^*$ and some variance $\Sigma$.

The shape/covariance $\Sigma$ could also depend on the cluster if we wanted to make the model a bit more complex; we would just add a draw for $\Sigma_k$ to the model. A generative sketch of the full model is given below.
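
Putting the pieces together, a generative sketch of the (truncated) DP mixture model; all numeric settings here are arbitrary choices for the demo:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, K_trunc, N = 2.0, 500, 200
    mu0, Sigma0 = np.zeros(2), np.eye(2)     # prior over cluster means
    Sigma = 0.1 * np.eye(2)                  # shared within-cluster covariance

    # rho ~ GEM(alpha), truncated
    v = rng.beta(1.0, alpha, size=K_trunc)
    rho = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    rho /= rho.sum()

    mu = rng.multivariate_normal(mu0, Sigma0, size=K_trunc)  # mu_k ~ N(mu0, Sigma0)
    z = rng.choice(K_trunc, size=N, p=rho)                   # z_n ~ Categorical(rho)
    x = np.array([rng.multivariate_normal(mu[k], Sigma) for k in z])

    print("clusters actually used:", np.unique(z).size)      # finite, grows with N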

Lecture 2

Notation

  • $\pi = (\pi_1, \pi_2, \dots) \sim \text{GEM}(\alpha)$ which sums to 1 with probability one.
  • $G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}$
  • $\delta_{\phi_k}$ is the Dirac delta for the element $\phi_k$

Stuff

  • $\text{GEM}(\alpha)$ can be described as follows:
    • Take a stick of length $L = 1$
    • $l_1 \sim \text{Beta}(1, \alpha)$
    • "Break" stick at the point corresponding to $l_1$:
      • $\pi_1 := l_1$
    • $l_2 \sim \text{Beta}(1, \alpha)$
    • "Break" the rest of the stick by $l_2$:
      • $\pi_2 := (L - \pi_1) * l_2$
    • $l_3 \sim \text{Beta}(1, \alpha)$
    • "Break" the rest of the stick:
      • $\pi_3 := (L - \sum_{k=1}^{2} \pi_k) * l_3$
    • Then

      \begin{equation*}
\sum_{k=1}^{\infty} \pi_k \overset{p}{=} 1
\end{equation*}
  • We let

    \begin{equation*}
\phi_k \overset{\text{i.i.d.}}{\sim} G_0
\end{equation*}

    where $G_0$ is some underlying distribution

  • Then we define the random variable

    \begin{equation*}
G = \sum_{k=1}^{\infty} \pi_k \delta_{\phi_k}
\end{equation*}

    where $\delta_{\phi_k}$ is the Dirac delta for the element $\phi_k$

    • The $\phi_k$ can even be functions, if $G_0$ is a distribution on a separable Banach space!
  • Then

    \begin{equation*}
G \sim \text{DP}(\alpha, G_0)
\end{equation*}

    where $\text{DP}$ denotes a Dirichlet process

  • Observe that $G$ defines a measure!

    \begin{equation*}
G(A) = \sum_{k \ge 1}^{} \pi_k \delta_{\phi_k}(A) = \sum_{k : \phi_k \in A}^{} \pi_k
\end{equation*}

    hence a $\text{DP}$ is basically a distribution over measures! (see the numpy sketch after this list)

  • So we have a random measure, i.e. a stochastic process indexed by the σ-algebra:

    \begin{equation*}
\{ G(A): A \in \mathcal{A} \}
\end{equation*}

    where $\mathcal{A}$ is the original σ-algebra
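
A small numpy sketch of $G$ as a measure (truncated stick breaking, with $G_0 = \mathcal{N}(0, 1)$ as an illustrative choice): evaluating $G(A)$ just sums the weights of the atoms that land in $A$.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, K_trunc = 2.0, 1000

    v = rng.beta(1.0, alpha, size=K_trunc)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    phi = rng.normal(0.0, 1.0, size=K_trunc)       # atoms from G_0 = N(0, 1)

    def G(indicator):
        """G(A) = sum of pi_k over atoms phi_k with phi_k in A."""
        return pi[indicator(phi)].sum()

    print(G(lambda t: t > 0))                            # mass of A = (0, inf)
    print(G(lambda t: t > 0) + G(lambda t: t <= 0))      # ~1: G is a probability measure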

There's a very interesting property of the $\text{GEM}(\alpha)$ distribution.

Suppose $B_t$ is a Brownian motion, and consider its maximal points (i.e. each new "highest" or "lowest" peak): the times between these new peaks follow a $\text{GEM}(\alpha)$!

We say that a sequence of random variables $\big( X_1, X_2, \dots \big)$ is infinitely exchangeable if and only if there exists a unique random measure $G$ such that

\begin{equation*}
\begin{split}
  P \Big( X_1 \in A_1, \dots, X_N \in A_N \Big) &= P\Big(X_{\sigma(1)} \in A_1, X_{\sigma(2)} \in A_2, \dots, X_{\sigma(N)} \in A_N \Big) \\
  &= \int \prod_{i=1}^{N} G(A_i) \ P(d G)
\end{split}
\end{equation*}

(Here $\sigma$ denotes an arbitrary permutation of $\{1, \dots, N\}$; this equivalence is de Finetti's theorem.)

Then observe that what's known as the Chinese restaurant process is just our previous $G = \sum_{k \ge 1}^{} \pi_k \delta_{\phi_k}$ where we've marginalized out the random measure, i.e. all the $\pi_k$ and $\phi_k$! A sampler is sketched below.
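
A minimal sketch of the Chinese restaurant process as a sequential sampler (the seating probabilities $n_k / (n + \alpha)$ and $\alpha / (n + \alpha)$ are the standard predictive rule obtained from this marginalization):

    import numpy as np

    def crp(N, alpha, rng):
        """Chinese restaurant process: table assignments for N customers."""
        counts = []                        # counts[k] = customers at table k
        assignments = []
        for n in range(N):
            probs = np.array(counts + [alpha]) / (n + alpha)
            k = rng.choice(len(probs), p=probs)
            if k == len(counts):
                counts.append(1)           # open a new table
            else:
                counts[k] += 1
            assignments.append(k)
        return assignments

    rng = np.random.default_rng(0)
    print(crp(20, alpha=2.0, rng=rng))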

Dirichlet as a GEM

Suppose we have a finite measurable partition $\{ A_1, \dots, A_k \}$ of the space on which $G \sim \text{DP}(\alpha, G_0)$ is defined.

Then,

\begin{equation*}
\Big( G(A_1), \dots, G(A_k) \Big) \sim \text{Dir} \Big( \alpha G_0(A_1), \dots, \alpha G_0(A_k) \Big)
\end{equation*}
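
A simulation sketch of this property (partition of the real line chosen arbitrarily, $G_0 = \mathcal{N}(0, 1)$): the empirical mean of $G(A_j)$ should match $G_0(A_j)$, the mean of the $\text{Dir}\big( \alpha G_0(A_1), \dots, \alpha G_0(A_k) \big)$ marginals.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, K_trunc, n_draws = 2.0, 2000, 500
    edges = np.array([-np.inf, -1.0, 0.0, 1.0, np.inf])  # partition of the real line

    masses = np.empty((n_draws, len(edges) - 1))
    for i in range(n_draws):
        v = rng.beta(1.0, alpha, size=K_trunc)
        pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
        phi = rng.normal(size=K_trunc)                   # atoms from G_0 = N(0, 1)
        for j in range(len(edges) - 1):
            masses[i, j] = pi[(phi > edges[j]) & (phi <= edges[j + 1])].sum()

    # E[G(A_j)] = G_0(A_j); for N(0,1) cells: ~[0.159, 0.341, 0.341, 0.159]
    print(masses.mean(axis=0))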

So $G(A)$, $A \in \mathcal{A}$, is a stochastic process on a σ-algebra.

A completely random measure is a random measure whose values on disjoint sets are independent:

\begin{equation*}
G(A_i) \perp G(A_j) \quad \text{for } A_i \cap A_j = \emptyset
\end{equation*}

Appendix A: Vocabulary

categorical distribution
distribution with some probability $\rho_k$ for the class/label indexed by $k$; i.e. a multinomial distribution with a single trial.
random measure
a random variable whose realizations are measures, e.g. $G \sim \text{DP}(\alpha, G_0)$.