Nonparametric Bayes
Concepts
- Infinitely exchangeable
- the order of the data does not matter for the joint distribution.
Beta distribution
\begin{equation*}
\text{Beta}(\rho_1 | \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \rho_1^{\alpha_1 - 1} (1 - \rho_1)^{\alpha_2 - 1}
\end{equation*}
Overview
- Distribution over the parameter of a binomial distribution!
- So in a sense you're "drawing distributions"
- I like to think of it as simply putting a random variable on the parameters of the model
itself, instead of going straight for estimating $\rho_1$
in a binomial distribution.
- Remember that the $\Gamma$ function is a factorial, $\Gamma(n) = (n - 1)!$, when $n$ is a positive integer.
Dirichlet distribution
\begin{equation*}
\text{Dirichlet}(\rho_{1:K} | \alpha_{1:K}) = \frac{\Gamma \Big(\overset{K}{\underset{k=1}{\sum}} \alpha_k \Big)}{\overset{K}{\underset{k=1}{\prod}} \Gamma(\alpha_k)} \overset{K}{\underset{k=1}{\prod}} \rho_k^{\alpha_k - 1}
\end{equation*}
Overview
- Generalization of the Beta distribution to more than two categories, i.e. a distribution over the parameters of a multinomial distribution.
- So if you were to plot the Dirichlet distribution of some parameters $\rho_{1:K}$,
you obtain the simplex/surface of allowed values for these parameters
- "Allowed" meaning that they form valid probabilities within the multinomial
model, i.e. $\rho_k \geq 0$ and $\sum_{k=1}^{K} \rho_k = 1$
- Has nice conjugacy properties: it is the conjugate prior of the multinomial distribution, so the posterior is again a Dirichlet
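The simplex constraint and the conjugate update can be checked numerically. This is a minimal sketch using NumPy; the concentration values `alpha` and the number of trials (50) are arbitrary choices for illustration, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha = np.array([2.0, 3.0, 5.0])  # illustrative concentration parameters

# Every Dirichlet draw is a probability vector, i.e. a point on the simplex.
rho = rng.dirichlet(alpha, size=1000)
on_simplex = bool(np.allclose(rho.sum(axis=1), 1.0) and (rho >= 0).all())

# Conjugacy: after observing multinomial counts, the posterior over rho
# is Dirichlet(alpha + counts).
counts = rng.multinomial(50, rho[0])
posterior_alpha = alpha + counts
print(on_simplex, posterior_alpha)
```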
Generating Dirichlet from Beta
If $\rho_{1:K} \sim \text{Dirichlet}(\alpha_{1:K})$, then marginalizing over $\rho_2, \dots, \rho_K$ gives a Beta distribution for $\rho_1$, and the renormalized remainder is again Dirichlet, independently of $\rho_1$:
\begin{equation*}
\rho_1 \overset{d}{=} \text{Beta}\Big(\alpha_1, \overset{K}{\underset{k=1}{\sum}} \alpha_k - \alpha_1 \Big), \qquad
\frac{(\rho_2, \dots, \rho_K)}{1 - \rho_1} \overset{d}{=} \text{Dirichlet}(\alpha_2, \dots, \alpha_K)
\end{equation*}
This is what we call stick breaking. For example, with $K = 4$:
\begin{equation*}
\begin{alignedat}{2}
V_1 &\sim \text{Beta}(\alpha_1, \alpha_2 + \alpha_3 + \alpha_4), &\quad \rho_1 &= V_1 \\
V_2 &\sim \text{Beta}(\alpha_2, \alpha_3 + \alpha_4), &\quad \rho_2 &= (1 - V_1) V_2 \\
V_3 &\sim \text{Beta}(\alpha_3, \alpha_4), &\quad \rho_3 &= (1 - V_1) (1 - V_2) V_3 \\
& \implies &\quad \rho_4 &= 1 - \overset{3}{\underset{k=1}{\sum}} \rho_k
\end{alignedat}
\end{equation*}
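The stick-breaking recipe can be sketched in a few lines of NumPy. The function name `dirichlet_via_stick_breaking` and the values in `alpha` are hypothetical choices for illustration; the check at the end relies on the known Dirichlet mean $\alpha_k / \sum_j \alpha_j$.

```python
import numpy as np

def dirichlet_via_stick_breaking(alpha, rng):
    """Draw one Dirichlet(alpha) sample using sequential Beta draws."""
    alpha = np.asarray(alpha, dtype=float)
    K = alpha.size
    rho = np.empty(K)
    remaining = 1.0  # length of the stick that has not been assigned yet
    for k in range(K - 1):
        # V_k ~ Beta(alpha_k, alpha_{k+1} + ... + alpha_K)
        v = rng.beta(alpha[k], alpha[k + 1:].sum())
        rho[k] = remaining * v
        remaining *= 1.0 - v
    rho[K - 1] = remaining  # the last piece is whatever is left
    return rho

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative values
samples = np.array(
    [dirichlet_via_stick_breaking(alpha, rng) for _ in range(20000)]
)

# The empirical mean should be close to alpha / alpha.sum().
print(samples.mean(axis=0), alpha / alpha.sum())
```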
Dirichlet process
Overview
- Taking the number of parameters $K$ to go to $\infty$.
- Allows an arbitrary number of clusters => $K$ can grow with the data
Taking $K = \infty$
We proceed as in Generating Dirichlet from Beta, i.e. by stick breaking. But in the Dirichlet process stick breaking, every break uses the same Beta parameters:
\begin{equation*}
a_k = 1, \quad b_k = \alpha > 0
\end{equation*}
And then we just continue doing this, drawing as follows:
\begin{equation*}
V_k \sim \text{Beta}(a_k, b_k), \qquad
\rho_k = \Bigg[ \overset{k - 1}{\underset{j=1}{\prod}} (1 - V_j) \Bigg] V_k
\end{equation*}
The resulting distribution of $\rho$ is then
\begin{equation*}
\rho = \big( \rho_1, \rho_2, \dots \big) \sim \text{GEM}(\alpha)
\end{equation*}
where $\text{GEM}(\alpha)$ is called the Griffiths-Engen-McCloskey (GEM) distribution.
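Since $\rho$ is an infinite sequence, a simulation has to truncate it somewhere. The sketch below draws the first `num_sticks` weights of a $\text{GEM}(\alpha)$ sample; the function name `sample_gem`, the truncation level of 200, and $\alpha = 2$ are assumptions made only for illustration.

```python
import numpy as np

def sample_gem(alpha, num_sticks, rng):
    """Return the first `num_sticks` weights of one GEM(alpha) draw."""
    v = rng.beta(1.0, alpha, size=num_sticks)  # V_k ~ Beta(1, alpha)
    # Stick left before break k: prod_{j<k} (1 - V_j), with an empty product = 1.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return remaining * v

rng = np.random.default_rng(1)
rho = sample_gem(alpha=2.0, num_sticks=200, rng=rng)

# Mass not yet assigned after 200 breaks; it shrinks geometrically on average.
leftover = 1.0 - rho.sum()
print(rho[:5], leftover)
```

A smaller $\alpha$ puts most of the mass on the first few sticks, while a larger $\alpha$ spreads it over many small pieces.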
To obtain a Dirichlet process we then do:
\begin{equation*}
\begin{split}
\rho &= (\rho_1, \rho_2, \dots) \sim \text{GEM}(\alpha) \\
\phi_k &\overset{iid}{\sim} G_0 \\
G &= \overset{\infty}{\underset{k=1}{\sum}} \rho_k \delta_{\phi_k}
\end{split}
\end{equation*}
where $G_0$ can be any probability measure.
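A draw $G$ can be approximated by truncating the sum to finitely many atoms. This is a sketch under assumptions that are not part of the notes: $G_0$ is taken to be a standard normal, the truncation level is 500, and the truncated weights are renormalised so they sum to one. The helper `sample_truncated_dp` is a hypothetical name.

```python
import numpy as np

def sample_truncated_dp(alpha, base_sampler, num_atoms, rng):
    """Approximate G = sum_k rho_k * delta_{phi_k} with `num_atoms` atoms."""
    v = rng.beta(1.0, alpha, size=num_atoms)
    rho = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    rho /= rho.sum()  # absorb the truncated tail (an approximation)
    phi = base_sampler(num_atoms, rng)  # phi_k ~ G_0, iid
    return rho, phi

rng = np.random.default_rng(2)
rho, phi = sample_truncated_dp(
    alpha=1.0,
    base_sampler=lambda n, r: r.normal(0.0, 1.0, size=n),  # assumed G_0
    num_atoms=500,
    rng=rng,
)

# Sampling from G: pick atom phi_k with probability rho_k.
draws = rng.choice(phi, size=1000, p=rho)
num_distinct = np.unique(draws).size
print(num_distinct)
```

Because $G$ is discrete, the 1000 draws repeat a small number of distinct values even though $G_0$ itself is continuous.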
Dirichlet process mixture model
Start out with Gaussian Mixture Model
\begin{equation*}
\begin{split}
\rho &= (\rho_1, \rho_2, \dots) \sim \text{GEM}(\alpha) \\
\mu_k &\overset{iid}{\sim} \mathcal{N}(\mu_0, \Sigma_0), \quad k = 1, 2, \dots
\end{split}
\end{equation*}
where $\mu_0$ and $\Sigma_0$ are the hyperparameters of the prior over the means of the Gaussian clusters.
This is the same as saying that $G = \sum_{k=1}^{\infty} \rho_k \delta_{\mu_k}$ is a draw from a Dirichlet process with base measure $G_0 = \mathcal{N}(\mu_0, \Sigma_0)$.
So $G$ is a sum over Dirac deltas and will only put non-zero
mass where $\mu$ corresponds to some $\mu_k$. That is, it just
indexes the probabilities. Or rather, it describes the
probability $\rho_k$ of each cluster $k$ being assigned to.
\begin{equation*}
\begin{split}
z_n &\overset{iid}{\sim} \text{Categorical}( \rho ) \\
\mu_n^* &= \mu_{z_n}
\end{split}
\end{equation*}
i.e. $\mu_n^* \overset{iid}{\sim} G$, which means that drawing a cluster assignment for our
$n$th data point, where the drawn cluster has mean $\mu_{z_n}$, is equivalent to drawing
the mean itself from $G$.
\begin{equation*}
x_n \overset{indep}{\sim} \mathcal{N}(\mu_n^*, \Sigma)
\end{equation*}
i.e. the $n$th data point is then drawn from a normal distribution with
the sampled mean and some covariance $\Sigma$.
The shape / variance could also depend on the cluster if we wanted
to make the model a bit more complex. We would just have to add some draw for
$\Sigma_k$ in our model.
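The whole generative model can be simulated end to end. The sketch below is a truncated approximation: the truncation level, the dimension, and the values of $\alpha$, $\mu_0$, $\Sigma_0$ and $\Sigma$ are all illustrative assumptions rather than values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative settings (assumptions, not from the notes).
alpha, truncation, N, dim = 1.0, 100, 500, 2
mu_0, Sigma_0 = np.zeros(dim), 9.0 * np.eye(dim)  # prior over cluster means
Sigma = 0.25 * np.eye(dim)                        # shared within-cluster covariance

# rho ~ GEM(alpha), truncated and renormalised.
v = rng.beta(1.0, alpha, size=truncation)
rho = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
rho /= rho.sum()

# mu_k ~ N(mu_0, Sigma_0), iid.
mu = rng.multivariate_normal(mu_0, Sigma_0, size=truncation)

# z_n ~ Categorical(rho), mu*_n = mu_{z_n}.
z = rng.choice(truncation, size=N, p=rho)
mu_star = mu[z]

# x_n ~ N(mu*_n, Sigma), independently.
x = mu_star + rng.multivariate_normal(np.zeros(dim), Sigma, size=N)

num_clusters_used = np.unique(z).size
print(num_clusters_used)
```

Only a handful of the 100 available clusters typically end up being used, and that number tends to grow slowly as `N` increases.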
Lecture 2
Notation
$\rho = (\rho_1, \rho_2, \dots) \sim \text{GEM}(\alpha)$, which sums to 1 with probability one.
$\delta_{\phi}$ is the Dirac delta for the element $\phi$
Stuff
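$\text{GEM}(\alpha)$ can be described as follows:
- Take a stick of length $1$
- "Break" the stick at the point corresponding to $V_1 \sim \text{Beta}(1, \alpha)$: $\rho_1 = V_1$
- "Break" the rest of the stick by $V_2 \sim \text{Beta}(1, \alpha)$: $\rho_2 = (1 - V_1) V_2$
- "Break" the rest of the stick: $\rho_3 = (1 - V_1)(1 - V_2) V_3$
- …
Then $\rho = (\rho_1, \rho_2, \dots) \sim \text{GEM}(\alpha)$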
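We let $\phi_k \overset{iid}{\sim} G_0$, where $G_0$ is some underlying distribution.
Then we define the random variable $G = \sum_{k=1}^{\infty} \rho_k \delta_{\phi_k}$, where $\delta_{\phi_k}$ is the Dirac delta for the element $\phi_k$.
- The $\phi_k$ can even be functions, if $G_0$ is a distribution on a separable Banach space!
Then $G \sim \text{DP}(\alpha, G_0)$, where $\text{DP}$ denotes a Dirichlet process.
Observe that $G$ defines a measure!
Hence a $\text{DP}$ is basically a distribution over measures!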
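So we have a random measure, where the σ-algebra on the space of measures is the one generated by the maps $G \mapsto G(A)$ for $A \in \mathcal{A}$, where $\mathcal{A}$ is the original σ-algebra.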
There's a very interesting property of the distribution.
Suppose we have a Brownian motion. Consider its maximal points (i.e. each new "highest" or "lowest" peak); the times between these new peaks then follow a
!
We say that a sequence of random variables $X_1, X_2, \dots$ is infinitely exchangeable if, for every $N$ and every permutation $\sigma$ of $\{1, \dots, N\}$, the joint distribution is unchanged by reordering (first line below). By de Finetti's theorem, this holds if and only if there exists a unique random measure $G$ such that the second line holds:
\begin{equation*}
\begin{split}
P \Big( X_1 \in A_1, \dots, X_N \in A_N \Big) &= P\Big(X_{\sigma(1)} \in A_1, X_{\sigma(2)} \in A_2, \dots, X_{\sigma(N)} \in A_N \Big) \\
&= \int \prod_{i=1}^{N} G(A_i) \ P(d G)
\end{split}
\end{equation*}
Then observe that what's known as the Chinese restaurant process is just our previous construction where we've marginalized over all the $\rho_k$!
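The Chinese restaurant process can be simulated directly from its predictive rule: with $n$ customers already seated, the next customer joins existing table $k$ with probability $n_k / (n + \alpha)$ and opens a new table with probability $\alpha / (n + \alpha)$. The sketch below assumes $\alpha = 1$ and 1000 customers purely for illustration; the function name is hypothetical.

```python
import numpy as np

def chinese_restaurant_process(num_customers, alpha, rng):
    """Sequentially seat customers; returns table assignments and table sizes."""
    counts = []       # counts[k] = number of customers at table k
    assignments = []
    for n in range(num_customers):
        # Existing tables weighted by size, plus one slot for a new table.
        probs = np.array(counts + [alpha], dtype=float) / (n + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)       # open a new table
        else:
            counts[table] += 1     # join an existing table
        assignments.append(table)
    return np.array(assignments), np.array(counts)

rng = np.random.default_rng(4)
assignments, counts = chinese_restaurant_process(1000, alpha=1.0, rng=rng)
print(len(counts), counts[:5])
```

No stick weights appear anywhere in the simulation, which is exactly what marginalizing them out means in practice.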
Dirichlet as a GEM
Suppose $G \sim \text{DP}(\alpha, G_0)$ is built from $\text{GEM}(\alpha)$ weights, and let $A_1, \dots, A_k$ be a finite measurable partition of the underlying space.
Then,
\begin{equation*}
\Big( G(A_1), \dots, G(A_k) \Big) \sim \text{Dir} \Big( \alpha G_0(A_1), \dots, \alpha G_0(A_k) \Big)
\end{equation*}
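This finite-dimensional property can be checked by simulation. The sketch below assumes $G_0 = \mathcal{N}(0, 1)$, the two-set partition $A_1 = (-\infty, 0]$, $A_2 = (0, \infty)$, $\alpha = 2$, and a stick-breaking construction truncated at 300 atoms; none of these choices come from the notes. Under them, $G(A_1)$ should be $\text{Beta}(\alpha / 2, \alpha / 2)$, which has mean $1/2$ and variance $1 / (4 (\alpha + 1))$.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, truncation, num_draws = 2.0, 300, 5000  # illustrative settings

# One row per independent draw of G (truncated stick breaking).
v = rng.beta(1.0, alpha, size=(num_draws, truncation))
remaining = np.concatenate(
    [np.ones((num_draws, 1)), np.cumprod(1.0 - v, axis=1)[:, :-1]], axis=1
)
rho = v * remaining
phi = rng.normal(size=(num_draws, truncation))  # atoms phi_k ~ N(0, 1)

# G(A_1) = total weight of the atoms that fall in A_1 = (-inf, 0].
g_a1 = (rho * (phi <= 0.0)).sum(axis=1)

expected_var = 1.0 / (4.0 * (alpha + 1.0))
print(g_a1.mean(), g_a1.var(), expected_var)
```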
Stochastic process on a σ-algebra.
A completely random measure is a random measure such that the masses it assigns to disjoint sets are independent:
\begin{equation*}
G(A_i) \perp\!\!\!\perp G(A_j) \quad \text{for disjoint } A_i, A_j
\end{equation*}
Appendix A: Vocabulary
- categorical distribution
- distribution with some probability $\rho_k$
for the class/label indexed by $k$.
So a multinomial distribution with a single trial.
- random measure