Nonparametric Bayes
Concepts
- Infinitely exchangeable
- order of data does not matter for the joint distribution.
Beta distribution
Overview
- Distribution over the parameter of a binomial distribution!
- So in a sense you're "drawing distributions"
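- I like to think of it as simply putting a random variable on the parameter of the model itself, instead of going straight for estimating the success probability in a binomial distribution.
- Remember the $\Gamma$ function is a factorial when its argument is an integer: $\Gamma(n) = (n - 1)!$ for a positive integer $n$.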
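A minimal sketch of the "drawing distributions" idea, assuming NumPy. The hyperparameters `a` and `b`, the number of draws, and the number of trials are hypothetical values chosen only for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Beta hyperparameters (not taken from the notes).
a, b = 2.0, 5.0

# Each draw from the Beta is itself the parameter of a binomial distribution.
success_probs = rng.beta(a, b, size=5)

# For each drawn parameter, simulate 10 Bernoulli trials and count successes.
counts = rng.binomial(n=10, p=success_probs)

# The Gamma function reduces to a factorial at integer arguments.
gamma_of_5 = math.gamma(5)
factorial_of_4 = math.factorial(4)
```

Every entry of `success_probs` defines a different binomial model, which is the sense in which a Beta draw is a draw of a distribution.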
Dirichlet distribution
Overview
- Generalization of the Beta distribution to more than two categories, i.e. a distribution over the parameters of a multinomial distribution.
- So if you were to plot the Dirichlet distribution of some parameters $\rho_1, \dots, \rho_K$, we obtain the simplex/surface of allowed values for these parameters
- "Allowed" meaning that they are valid probabilities within the multinomial model, i.e. $\rho_k \ge 0$ for all $k$ and $\sum_{k=1}^K \rho_k = 1$.
- Has nice conjugacy properties: it is the conjugate prior for the categorical and multinomial distributions, so the posterior is again a Dirichlet.
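The simplex constraint and the conjugacy property can be checked numerically. This is a sketch assuming NumPy; the prior parameters and the "true" category probabilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

prior = np.array([1.0, 1.0, 1.0])       # hypothetical Dirichlet parameters
true_probs = np.array([0.7, 0.2, 0.1])  # hypothetical category probabilities

# A draw from the Dirichlet is a point on the simplex:
# non-negative entries that sum to one.
draw = rng.dirichlet(prior)

# Observe multinomial counts. By conjugacy the posterior is again a
# Dirichlet, with parameters prior + counts.
counts = rng.multinomial(100, true_probs)
posterior = prior + counts
posterior_mean = posterior / posterior.sum()
```

The posterior update is just an addition of counts, which is what makes the Dirichlet convenient as a prior.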
Generating Dirichlet from Beta
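Let $(\rho_1, \dots, \rho_K) \sim \text{Dirichlet}(a_1, \dots, a_K)$. We can draw $\rho_1$ from a Beta by marginalizing over $\rho_2, \dots, \rho_K$:
$$V_1 \sim \text{Beta}\left(a_1, \sum_{k=2}^K a_k\right), \quad \rho_1 = V_1$$
The remaining components, rescaled by $1 - \rho_1$, are again Dirichlet distributed, so we repeat on what is left:
$$V_2 \sim \text{Beta}\left(a_2, \sum_{k=3}^K a_k\right), \quad \rho_2 = (1 - V_1) V_2$$
and so on, until the last component takes the remainder, $\rho_K = 1 - \sum_{k=1}^{K-1} \rho_k$.
This is what we call stick breaking.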
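A sketch of finite stick breaking, assuming NumPy. The function name `dirichlet_by_stick_breaking` and the parameter vector `a` are hypothetical; the construction uses the standard fact that the first component of a Dirichlet is marginally Beta distributed and the rescaled remainder is again Dirichlet.

```python
import numpy as np

def dirichlet_by_stick_breaking(a, rng):
    """Draw one Dirichlet(a) sample using only Beta draws."""
    a = np.asarray(a, dtype=float)
    K = len(a)
    rho = np.empty(K)
    remaining = 1.0  # length of the stick not yet assigned
    for k in range(K - 1):
        # Fraction of the remaining stick given to component k.
        v = rng.beta(a[k], a[k + 1:].sum())
        rho[k] = remaining * v
        remaining *= 1.0 - v
    rho[K - 1] = remaining  # the last component takes what is left
    return rho

rng = np.random.default_rng(2)
a = [2.0, 3.0, 5.0]  # hypothetical Dirichlet parameters
samples = np.array([dirichlet_by_stick_breaking(a, rng) for _ in range(20000)])
empirical_mean = samples.mean(axis=0)
```

The empirical mean should be close to `a / sum(a)`, the mean of the corresponding Dirichlet, which gives a quick sanity check on the construction.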
Dirichlet process
Overview
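- Taking the number of parameters $K$ to go to $\infty$.
- Allows an arbitrary number of clusters, so the number of clusters can grow with the data
Taking $K \to \infty$
We do what we do in Generating Dirichlet from Beta, the "stick breaking". But in the Dirichlet process stick breaking, with concentration parameter $\alpha > 0$, we do
$$V_1 \sim \text{Beta}(1, \alpha), \quad \rho_1 = V_1$$
And then we just continue doing this, drawing $\rho_k$ as follows:
$$V_k \sim \text{Beta}(1, \alpha), \quad \rho_k = V_k \prod_{j=1}^{k-1} (1 - V_j)$$
Resulting distribution of $\rho = (\rho_1, \rho_2, \dots)$ is then
$$\rho \sim \text{GEM}(\alpha)$$
where $\text{GEM}$ is called the Griffiths-Engen-McCloskey (GEM) distribution.
To obtain a Dirichlet process we then do:
$$\phi_k \overset{\text{iid}}{\sim} G_0, \quad G = \sum_{k=1}^{\infty} \rho_k \delta_{\phi_k} \sim \text{DP}(\alpha, G_0)$$
where $G_0$ can be any probability measure.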
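The infinite stick breaking cannot be simulated exactly, but a truncated version is easy to sketch. The following assumes NumPy, truncates after a hypothetical `num_sticks` breaks, and uses a standard normal as an illustrative stand-in for the base measure; none of these choices come from the notes.

```python
import numpy as np

def truncated_gem(alpha, num_sticks, rng):
    """First num_sticks weights of a GEM(alpha) stick-breaking draw."""
    v = rng.beta(1.0, alpha, size=num_sticks)
    # Stick left before each break: 1, (1 - v1), (1 - v1)(1 - v2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

rng = np.random.default_rng(3)
alpha = 2.0        # hypothetical concentration parameter
num_sticks = 200   # truncation level, an approximation

weights = truncated_gem(alpha, num_sticks, rng)

# Atoms drawn i.i.d. from the base measure (standard normal, an assumption).
atoms = rng.normal(0.0, 1.0, size=num_sticks)

# Sampling from the truncated random measure: pick atom k with
# probability proportional to weights[k].
probs = weights / weights.sum()
draws = rng.choice(atoms, size=10, p=probs)
```

Because the draws come from a discrete measure, repeated values among `draws` are expected; this is what produces clustering.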
Dirichlet process mixture model
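Start out with a Gaussian Mixture Model
$$\rho \sim \text{GEM}(\alpha), \quad \mu_k \overset{\text{iid}}{\sim} \mathcal{N}(\mu_0, \Sigma_0), \quad z_n \sim \text{Categorical}(\rho), \quad x_n \sim \mathcal{N}(\mu_{z_n}, \Sigma)$$
Where our $\mu_0$ and $\Sigma_0$ are the parameters of the prior over the means of the Gaussian clusters. Which is the same as saying $G = \sum_{k} \rho_k \delta_{\mu_k} \sim \text{DP}(\alpha, \mathcal{N}(\mu_0, \Sigma_0))$.
So, $G$ is a sum over Dirac deltas and so only puts non-zero mass on points that coincide with some $\mu_k$. That is, it describes the probability $\rho_k$ of each cluster being assigned to.
i.e. $\mu_n^* = \mu_{z_n}$, which means that drawing a cluster assignment $z_n$ for our nth data point, where the drawn cluster has mean $\mu_{z_n}$, is equivalent to drawing the mean itself, $\mu_n^* \sim G$.
i.e. the nth data point is then drawn from a normal distribution with the sampled mean and some covariance $\Sigma$: $x_n \sim \mathcal{N}(\mu_n^*, \Sigma)$.
The shape / variance could also depend on the cluster if we wanted to make the model a bit more complex. We would just have to add a draw for a per-cluster $\Sigma_k$ in our model.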
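The generative story above can be sketched as a simulation. This assumes NumPy, one-dimensional data, a truncated stick-breaking approximation, and hypothetical values for every hyperparameter (`alpha`, `mu0`, `sigma0`, `sigma`); it only generates data and does no inference.

```python
import numpy as np

rng = np.random.default_rng(4)

alpha = 1.0              # hypothetical concentration parameter
num_sticks = 100         # truncation level, an approximation
N = 500                  # number of data points
mu0, sigma0 = 0.0, 5.0   # prior mean / std of the cluster means (hypothetical)
sigma = 0.5              # shared within-cluster std (hypothetical)

# Cluster weights from truncated stick breaking, renormalised to sum to one.
v = rng.beta(1.0, alpha, size=num_sticks)
rho = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
rho /= rho.sum()

# Cluster means drawn from the Gaussian prior.
mu = rng.normal(mu0, sigma0, size=num_sticks)

# Assign each data point to a cluster, then draw it around that cluster's mean.
z = rng.choice(num_sticks, size=N, p=rho)
x = rng.normal(mu[z], sigma)

# Only a few of the available clusters are actually used by the data.
num_used = len(np.unique(z))
```

Increasing `N` tends to increase `num_used`, which illustrates how the number of occupied clusters grows with the data.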
Lecture 2
Notation
- $\rho = (\rho_1, \rho_2, \dots) \sim \text{GEM}(\alpha)$, which sums to 1 with probability one.
- $\delta_{\phi}$ is the Dirac delta for the element $\phi$
Stuff
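- $\text{GEM}(\alpha)$ can be described as follows:
- Take a stick of length 1
- "Break" the stick at the point corresponding to $V_1$: $V_1 \sim \text{Beta}(1, \alpha)$, $\rho_1 = V_1$
- "Break" the rest of the stick by $V_2$: $V_2 \sim \text{Beta}(1, \alpha)$, $\rho_2 = (1 - V_1) V_2$
- "Break" the rest of the stick: $V_3 \sim \text{Beta}(1, \alpha)$, $\rho_3 = (1 - V_1)(1 - V_2) V_3$
- …
Then
$$\rho = (\rho_1, \rho_2, \dots) \sim \text{GEM}(\alpha)$$
We let
$$\phi_k \overset{\text{iid}}{\sim} G_0$$
where $G_0$ is some underlying distribution
Then we define the random variable
$$G = \sum_{k=1}^{\infty} \rho_k \delta_{\phi_k}$$
where $\delta_{\phi_k}$ is the Dirac delta for the element $\phi_k$
- The $\phi_k$ can even be functions, if $G_0$ is a distribution on a separable Banach space!
Then
$$G \sim \text{DP}(\alpha, G_0)$$
where $\text{DP}$ denotes a Dirichlet process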
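Observe that $G = \sum_{k} \rho_k \delta_{\phi_k}$ defines a measure!
Hence a $\text{DP}$ is basically a distribution over measures!
So we have a random measure, where the σ-algebra on the space of measures is the smallest one that makes the maps $G \mapsto G(A)$ measurable for every $A \in \mathcal{A}$,
where $\mathcal{A}$ is the original σ-algebra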
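To make "a distribution over measures" concrete, one can draw many random measures and look at the mass each assigns to a fixed set. The sketch below assumes NumPy, a truncated stick-breaking approximation, and a Uniform(0, 1) base measure chosen only for illustration; the helper names are hypothetical.

```python
import numpy as np

def draw_dp_measure(alpha, num_sticks, rng):
    """One truncated draw of a discrete random measure: weights and atom locations."""
    v = rng.beta(1.0, alpha, size=num_sticks)
    weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = rng.uniform(0.0, 1.0, size=num_sticks)  # base measure: Uniform(0, 1)
    return weights, atoms

def measure_of_interval(weights, atoms, low, high):
    """Mass the discrete measure assigns to [low, high)."""
    inside = (atoms >= low) & (atoms < high)
    return weights[inside].sum()

rng = np.random.default_rng(6)

# The mass of the set [0, 0.3) under each of 2000 independent random measures.
masses = np.array([
    measure_of_interval(*draw_dp_measure(2.0, 200, rng), 0.0, 0.3)
    for _ in range(2000)
])
```

Each entry of `masses` is a different number because each draw is a different measure; on average the mass is close to 0.3, the mass the base measure gives to the same interval.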
There's a very interesting property of the GEM distribution.
Suppose we have a Brownian motion. Consider its record points (i.e. each new "highest" or "lowest" peak); then the times between these new peaks follow a GEM distribution!
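De Finetti's theorem: a sequence of random variables $X_1, X_2, \dots$ is infinitely exchangeable if and only if there exists a unique random measure $G$ such that, for every $N$,
$$p(X_1, \dots, X_N) = \int \prod_{n=1}^{N} G(X_n) \, dP(G)$$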
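Then observe that what's known as the Chinese restaurant process is just our previous cluster assignments $z_n \sim \text{Categorical}(\rho)$ with $\rho \sim \text{GEM}(\alpha)$, where we've marginalized over all the $\rho_k$!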
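A sketch of the Chinese restaurant process as a sequential sampler, assuming NumPy. It uses the standard seating rule (join an existing table with probability proportional to its size, open a new one with probability proportional to `alpha`); the function name and the parameter values are hypothetical.

```python
import numpy as np

def chinese_restaurant_process(num_customers, alpha, rng):
    """Seat customers one at a time; return table assignments and table sizes."""
    assignments = []
    table_sizes = []
    for n in range(num_customers):
        # n customers are already seated. Existing table k is chosen with
        # probability size_k / (n + alpha), a new table with alpha / (n + alpha).
        probs = np.array(table_sizes + [alpha], dtype=float) / (n + alpha)
        table = int(rng.choice(len(probs), p=probs))
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, table_sizes

rng = np.random.default_rng(5)
assignments, table_sizes = chinese_restaurant_process(200, 1.0, rng)
```

No stick weights appear anywhere in the sampler, which is the practical meaning of having marginalized them out.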
Dirichlet as a GEM
Suppose we have a finite number of samples from a GEM distribution.
Then,
Stochastic process on a σ-algebra.
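A completely random measure is a random measure $G$ whose values on disjoint sets are independent: for any disjoint measurable sets $A_1, \dots, A_n$, the random variables $G(A_1), \dots, G(A_n)$ are independent.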
Appendix A: Vocabulary
- categorical distribution
- distribution with some probability $\rho_k$ for the class/label indexed by $k$. So a multinomial distribution with a single trial.
- random measure