Nonparametric Bayes
Concepts
- Infinitely exchangeable
- the order of the data does not matter for the joint distribution.
Beta distribution
![\begin{equation*}
Beta(\rho_1 | \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1) \Gamma(\alpha_2)} \rho_1^{\alpha_1 - 1} (1- \rho_1)^{\alpha_2 - 1}
\end{equation*}](../../assets/latex/nonparameteric_bayes_c1f5a9e3a5f78eafa956a8f96520e4e6774d9b49.png)
Overview
- Distribution over the parameter of a binomial distribution!
- So in a sense you're "drawing distributions"
- Like to think of it as putting random-variable parameters on the model itself, instead of going straight for estimating the success probability in a binomial distribution.
- Remember the Γ function is a factorial, Γ(n) = (n - 1)!, when n is a positive integer.
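As a minimal sketch (using only Python's standard library; the function name is illustrative, not from the lecture), drawing a Beta-distributed parameter and then flipping coins with it looks like:

```python
import random

def draw_binomial_parameter(alpha_1, alpha_2):
    """Draw rho_1 ~ Beta(alpha_1, alpha_2): a random success probability."""
    return random.betavariate(alpha_1, alpha_2)

random.seed(0)
rho_1 = draw_binomial_parameter(2.0, 3.0)      # "drawing a distribution"
flips = [1 if random.random() < rho_1 else 0   # Bernoulli draws given rho_1
         for _ in range(10)]
```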
Dirichlet distribution
![\begin{equation*}
Dirichlet(\rho_{1:K} | \alpha_{1:K}) = \frac{\Gamma (\overset{K}{\underset{k=1}{\sum}} \alpha_k)}{\overset{K}{\underset{k=1}{\prod}} \Gamma(\alpha_k)} \overset{K}{\underset{k=1}{\prod}} \rho_k^{\alpha_k - 1}
\end{equation*}](../../assets/latex/nonparameteric_bayes_47b5d8e43d14b017180cf3c4dc12991fdccc3d16.png)
Overview
- Generalization of the Beta distribution, i.e. over multiple categorical variables, i.e. a distribution over the parameters of a multinomial distribution.
- So if we were to plot the Dirichlet distribution of some parameters,
we obtain the simplex/surface of allowed values for these parameters
- "Allowed" meaning that they satisfy being a probability within the multinomial
model, i.e. each ρ_k ≥ 0 and the ρ_k sum to 1
- Has nice conjugacy properties: it is the conjugate prior of multinomial distributions, so the posterior is again a Dirichlet
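A common way to draw from a Dirichlet, sketched here with Python's standard library (no claim that this is how any particular package implements it), is to normalize independent Gamma draws:

```python
import random

def draw_dirichlet(alphas):
    """Draw rho_{1:K} ~ Dirichlet(alphas) by normalizing Gamma(alpha_k, 1) draws."""
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

rho = draw_dirichlet([1.0, 2.0, 3.0])  # a point on the 2-simplex
```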
Generating Dirichlet from Beta
We can recover a Beta by marginalizing a Dirichlet over all but one component:
![\begin{equation*}
\rho_1 \overset{d}{=} \text{Beta}\Big(\alpha_1, \overset{K}{\underset{k=2}{\sum}} \alpha_k \Big) \implies
\frac{(\rho_2, ..., \rho_K)}{1 - \rho_1} \overset{d}{=} \text{Dirichlet}(\alpha_2, ..., \alpha_K)
\end{equation*}](../../assets/latex/nonparameteric_bayes_76f8bd033f52502d465d8716ce696bdcfe3e40b4.png)
This is what we call stick breaking.
![\begin{equation*}
\begin{alignat*}{2}
V_1 &\sim \text{Beta}(\alpha_1, \alpha_2 + \alpha_3 + \alpha_4), &\quad \rho_1 = V_1 \\
V_2 &\sim \text{Beta}(\alpha_2, \alpha_3 + \alpha_4), &\quad \rho_2 = (1 - V_1) V_2 \\
V_3 &\sim \text{Beta}(\alpha_3, \alpha_4), &\quad \rho_3 = (1 - V_1) (1 - V_2) V_3 \\
& \implies & \rho_4 = 1 - \overset{3}{\underset{k=1}{\sum}} \rho_k
\end{alignat*}
\end{equation*}](../../assets/latex/nonparameteric_bayes_4e5430768774db3ec03c1e2f670b3c8691f64472.png)
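The stick-breaking steps above can be sketched for general K (a rough Python illustration; `stick_break_dirichlet` is a made-up name):

```python
import random

def stick_break_dirichlet(alphas):
    """Draw rho ~ Dirichlet(alphas) via stick breaking:
    V_k ~ Beta(alpha_k, alpha_{k+1} + ... + alpha_K)."""
    rho, remaining = [], 1.0
    for k in range(len(alphas) - 1):
        v = random.betavariate(alphas[k], sum(alphas[k + 1:]))
        rho.append(remaining * v)       # rho_k = (1-V_1)...(1-V_{k-1}) V_k
        remaining *= (1.0 - v)
    rho.append(remaining)               # rho_K = 1 - sum of the others
    return rho

rho = stick_break_dirichlet([2.0, 3.0, 4.0, 5.0])
```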
Dirichlet process
Overview
- Taking the number of parameters K to infinity.
- Allows an arbitrary number of clusters ⇒ the number of clusters can grow with the data
Taking ![$k = \infty$](../../assets/latex/nonparameteric_bayes_06790f088863f63335c5abf8f7539f44823b6547.png)
We do what we did in Generating Dirichlet from Beta, the "stick breaking". But in the Dirichlet process stick breaking we set
![\begin{equation*}
a_k = 1, b_k = \alpha > 0
\end{equation*}](../../assets/latex/nonparameteric_bayes_e63b85cb1fab961e7cb74a7fc98c7e85a4a16c94.png)
And then we just continue doing this, drawing as follows:
![\begin{equation*}
V_k \sim \text{Beta}(a_k, b_k), \qquad
\rho_k = \Bigg[ \overset{k - 1}{\underset{j=1}{\prod}} (1 - V_j) \Bigg] V_k
\end{equation*}](../../assets/latex/nonparameteric_bayes_5c39a18f6158e36f12245f0b87c518bfe7c87e4a.png)
The resulting distribution of ρ is then
![\begin{equation*}
\rho = \big( \rho_1, \rho_2, \dots \big) \sim \text{GEM}(\alpha)
\end{equation*}](../../assets/latex/nonparameteric_bayes_e53a492bd44e3603413c2fe115dcbdec3a08aa3b.png)
which is called the Griffiths-Engen-McCloskey (GEM) distribution.
To obtain a Dirichlet process we then do:
![\begin{equation*}
\begin{split}
\rho &= (\rho_1, \rho_2, ...) \sim GEM(\alpha) \\
\phi_k &\overset{iid}{\sim} G_0 \\
G &= \overset{\infty}{\underset{k=1}{\sum}} \rho_k \delta_{\phi_k}
\end{split}
\end{equation*}](../../assets/latex/nonparameteric_bayes_ccecaed7e8e9575b4a0481956e9168d143ae0829.png)
where G_0 can be any probability measure.
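Putting the definition above together in code, a truncated draw of G can be sketched as follows (an approximation: a real DP draw has infinitely many atoms; names here are illustrative):

```python
import random

def draw_dp(alpha, g0, truncation=100):
    """Truncated draw of G ~ DP(alpha, G_0).

    Returns (weights, atoms): G is approximately sum_k weights[k] * delta_{atoms[k]}.
    """
    weights, remaining = [], 1.0
    for _ in range(truncation):
        v = random.betavariate(1.0, alpha)  # V_k ~ Beta(1, alpha)
        weights.append(remaining * v)
        remaining *= (1.0 - v)
    atoms = [g0() for _ in weights]         # phi_k ~iid~ G_0
    return weights, atoms

weights, atoms = draw_dp(1.0, lambda: random.gauss(0.0, 1.0))
```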
Dirichlet process mixture model
Start out with Gaussian Mixture Model
![\begin{equation*}
\begin{split}
\rho &= (\rho_1, \rho_2, ...) \sim GEM(\alpha) \\
\mu_k &\overset{iid}{\sim} \mathcal{N}(\mu_0, \Sigma_0), \quad k = 1, 2, ...
\end{split}
\end{equation*}](../../assets/latex/nonparameteric_bayes_886f872a53119b8a4b0930a018a82b8f298291a1.png)
where μ_0 and Σ_0 are our priors of the Gaussian clusters. This is the same as saying μ_k ~ G_0 with G_0 = N(μ_0, Σ_0).
So, G is a sum over Dirac deltas and will therefore only take non-zero
values where its argument corresponds to some μ_k. That is, it
indexes the probabilities: it describes the probability ρ_k of each
cluster k being assigned.
![\begin{equation*}
\begin{split}
z_n &\overset{iid}{\sim} Categorical( \rho ) \\
\mu_n^* &= \mu_{z_n}
\end{split}
\end{equation*}](../../assets/latex/nonparameteric_bayes_38d202a7b58f863b7ce23fb1cfb93399e9aed9b5.png)
i.e. μ_n^* = μ_{z_n}, which means that drawing a cluster assignment z_n for our
nth data point, where the drawn cluster has mean μ_{z_n}, is equivalent to drawing
the mean itself from G.
![\begin{equation*}
x_n \overset{indep}{\sim} \mathcal{N}(\mu_n^*, \Sigma)
\end{equation*}](../../assets/latex/nonparameteric_bayes_dd28525f4d8b36b5fdaacfecef1d20fcd531668a.png)
i.e. the nth data point is then drawn from a normal distribution with
the sampled mean μ_n^* and some covariance Σ.
The shape / covariance could also be made dependent on the cluster if we wanted
to make the model a bit more complex; we would just have to add a draw for
Σ_k in our model.
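The full generative story above can be sketched in Python. Assumptions in this sketch: 1-D Gaussians, a fixed truncation level for the GEM weights, and illustrative names:

```python
import random

def draw_dp_mixture_data(n, alpha, mu0=0.0, sigma0=1.0, sigma=0.3, truncation=100):
    """Generate n points from a truncated DP mixture of 1-D Gaussians."""
    # rho ~ GEM(alpha), truncated; leftover mass lumped into a final component
    rho, remaining = [], 1.0
    for _ in range(truncation):
        v = random.betavariate(1.0, alpha)
        rho.append(remaining * v)
        remaining *= (1.0 - v)
    rho.append(remaining)
    mus = [random.gauss(mu0, sigma0) for _ in rho]  # cluster means, mu_k ~ G_0
    data, assignments = [], []
    for _ in range(n):
        z = random.choices(range(len(rho)), weights=rho)[0]  # z_n ~ Categorical(rho)
        assignments.append(z)
        data.append(random.gauss(mus[z], sigma))             # x_n ~ N(mu_{z_n}, sigma^2)
    return data, assignments

data, assignments = draw_dp_mixture_data(50, 1.0)
```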
Lecture 2
Notation
ρ = (ρ_1, ρ_2, …), which sums to 1 with probability one.
δ_φ is the Dirac delta for the element φ
Stuff
GEM(α) can be described as follows:
- Take a stick of length 1
- "Break" the stick at the point corresponding to V_1: ρ_1 = V_1
- "Break" the rest of the stick by V_2: ρ_2 = (1 - V_1) V_2
- "Break" the rest of the stick: ρ_3 = (1 - V_1)(1 - V_2) V_3
- …
We let φ_k ~ G_0, where G_0
is some underlying distribution.
Then we define the random variable G = ∑_{k=1}^∞ ρ_k δ_{φ_k}, where δ_φ
is the Dirac delta for the element φ
- The φ_k can even be functions, if G_0
is a distribution on a separable Banach space!
Then G ~ DP(α, G_0), where DP
denotes a Dirichlet process.
Observe that G defines a measure!
Hence a DP is basically a distribution over measures!
So we have a random measure, where the σ-algebra is induced by the original σ-algebra on the underlying space.
There's a very interesting property of this distribution.
Suppose we have Brownian motion, and consider the record points (i.e. each new "highest" or "lowest" peak); then the times between these new peaks follow a distribution of this kind!
We say that a sequence of random variables is infinitely exchangeable if and only if there exists a unique random measure G
such that
![\begin{equation*}
\begin{split}
P \Big( X_1 \in A_1, \dots, X_N \in A_N \Big) &= P\Big(X_{\sigma(1)} \in A_1, X_{\sigma(2)} \in A_2, \dots, X_{\sigma(N)} \in A_N \Big) \\
&= \int \prod_{i=1}^{N} G(A_i) \ P(d G)
\end{split}
\end{equation*}](../../assets/latex/nonparameteric_bayes_29dd17688a078a8f4a1a59fc18f0169f161c04fd.png)
Then observe that what's known as the Chinese restaurant process is just our previous model where we've marginalized over G!
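A sketch of the Chinese restaurant process itself (the standard construction; variable names are mine): customer i joins an existing table with probability proportional to its occupancy, or opens a new table with probability proportional to α:

```python
import random

def chinese_restaurant_process(n, alpha):
    """Sample table assignments for n customers under CRP(alpha)."""
    assignments = []       # assignments[i] = table index of customer i
    table_counts = []      # table_counts[t] = number of customers at table t
    for _ in range(n):
        # weight c_t for existing table t, weight alpha for a brand-new table
        weights = table_counts + [alpha]
        t = random.choices(range(len(weights)), weights=weights)[0]
        if t == len(table_counts):
            table_counts.append(1)   # open a new table
        else:
            table_counts[t] += 1
        assignments.append(t)
    return assignments

tables = chinese_restaurant_process(20, 1.0)
```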
Dirichlet as a GEM
Suppose G is constructed from a GEM distribution as above, and let A_1, …, A_k be a finite partition of the space.
Then,
![\begin{equation*}
\Big( G(A_1), \dots, G(A_k) \Big) \sim \text{Dir} \Big( \alpha G_0(A_1), \dots, \alpha G_0(A_k) \Big)
\end{equation*}](../../assets/latex/nonparameteric_bayes_43a10ec4b8c1336aff880f6d526a45c94ff604a0.png)
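One way to see this property in code (a rough sketch, assuming G_0 = Uniform[0, 1) and a truncated draw of G; `dp_mass_on_partition` is a made-up helper) is to evaluate G on a finite partition:

```python
import random

def dp_mass_on_partition(alpha, boundaries, truncation=200):
    """Evaluate a truncated draw of G ~ DP(alpha, Uniform[0,1)) on the
    partition cut at `boundaries`, e.g. [0.5] -> A_1=[0,0.5), A_2=[0.5,1)."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        v = random.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= (1.0 - v)
    atoms = [random.random() for _ in weights]  # phi_k ~ Uniform[0, 1)
    edges = [0.0] + list(boundaries) + [1.0]
    return [sum(w for w, a in zip(weights, atoms) if lo <= a < hi)
            for lo, hi in zip(edges, edges[1:])]

# (G(A_1), G(A_2)); by the property above, approximately Dir(alpha/2, alpha/2)
masses = dp_mass_on_partition(1.0, [0.5])
```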
Stochastic process on a σ-algebra.
A completely random measure is a random measure such that its values on disjoint sets are independent:
![\begin{equation*}
G(A_i) \perp \!\!\! \perp G(A_j), \qquad A_i \cap A_j = \emptyset
\end{equation*}](../../assets/latex/nonparameteric_bayes_0f4d91d5652ce25c58dc3ce800092400ae372e14.png)
Appendix A: Vocabulary
- categorical distribution
- distribution with some probability ρ_k
for the class/label indexed by k.
So a multinomial distribution with a single trial.
- random measure
- a measure-valued random variable, such as the G above.