# Nonparametric Bayes

## Concepts

Infinitively exchangeable
order of data does not matter for the joint distribution.

## Beta distribution 1

### Overview

• Distribution over parameters for a binomial-distribution!
• So in a sense you're "drawing distributions"
• Like to think of it as simply putting some rv. parameters on the model itself, instead of simply going straight for estimating in a binomial distribution.
• Remember the function is a when is an integer.

## Dirichlet distribution 2

### Overview

• Generialization of Beta distribution, i.e. over multiple categorical variables, i.e. distribution over parameters for a multionomial distribution.
• So if you say were to plot the Dirichlet distribution of some parameters we obtain the simplex/surface of allowed values for these parameters
• "Allowed" meaning that they satisfy being a probability within the multinomial model, i.e. • Got nice conjugacy properties, where it's conjugate to itself, and also multinomial distributions

### Generating Dirichlet from Beta

We can draw from a Beta by marginalizing over  3

This is what we call stick braking. 4

## Dirichlet process

### Overview

• Taking the number of parameters to go to .
• Allows arbitrary number of clusters => can grow with the data

### Taking We do what we do in Generating Dirichlet from Beta, the "stick braking". But in the Dirichlet process stick braking we do 5

And then we just continue doing this, drawing as follows: 6

Resulting distribution of is then where is called the Griffiths-Engen-McCloskey (GEM) distribution.

To obtain a Dirichlet process we then do: 7

where can be any probability measure.

### Dirichlet process mixture model

Start out with Gaussian Mixture Model 8

Where our and are our priors of the Gaussian clusters. Which is the same as saying .

So, is a sum over dirac deltas and so will only take non-zero values where corresponds to some . That is, it just indexes the probabilities somehow. Or rather, it describes the probability of each cluster being assigned to. 9

i.e. , which means that drawing an assignment cluster for our nth data point, where the drawn cluster has mean , is equivalent of drawing the mean itself from . i.e. the nth data point is then drawn from a normal distribution with the sampled mean and some variance .

The shape / variance could also be dependent on the cluster if we wanted to make the model a bit more complex. Would just have to add some draw for in our model.

## Lecture 2

### Notation

• which sums to 1 with probability one.
• • is the dirac delta for the element ### Stuff

• can be described as follows:
• Take a stick of length • • "Break" stick at the point corresponding to :
• • • "Break" the rest of the stick by :
• • • "Break" the rest of the stick:
• • Then • We let where is some underlying distribution

• The we define the random variable where is the dirac delta for the element • The can even be functions, if is a distribution on a separable Banach space!
• Then where denotes a Dirichlet process

• Observe that defines a measure! hence a is basically a distribution over measures!

• So we have a random measure where the σ-algebra is defined by where is the original σ-algebra

There's a very interesting property of the distribution.

Suppose is Brownian motion. Then consider the maximal points (i.e. new "highest" or "lowest" peak), then the time between these new peaks follow a !

We say a that a sequence of random variables is infinitely exchangable if and only if there exists an unique random measure such that Then observe that what's known as the Chinese restaurant process is just our previous where we've marginalized over all the !

### Dirichlet as a GEM

Suppose we have finite number of samples from a GEM distribution .

Then, Stochastic process on a σ-algebra.

A complete random measure is a random measure such that the draws are independent: ## Appendix A: Vocabulary

categorical distribution
distribution with some probability for the the class/label indexed by . So a multinomial distribution?
random measure