Clustering algorithms
Table of Contents
Notation
- is the observation
- is the dimension of the data, i.e.
K-means clustering
Summary
- Guarantees local minima converges
- No guarantee for global minima
Notation
- is the mean of the cluster
Idea
K-means clustering is performing using two steps:
Randomly initialize cluster centroids with means
1Repeat until convergence:
- Set , i.e. assign the data to the cluster for which it is closest to the mean.
- Update mean of cluster as follows: μj = { {} 1 \{c(i) = j \} x(i)} {{} 1 \{c(i) = j\} }
Convergence here means until the assignments to the different clusters do not change.
Gaussian mixture models
Summary
Notation
- is the latent / unobserverd / hidden random variable which we want to estimate
- is the assigned cluster for the observation
- represents the probability of the "cluster" assignment
Idea
We have a latent random variable and have a joint distribution:
The model is as follows:
- where
Note that this model is identical to Gaussian Discriminant Analysis, except the fact that we have replaced our labeled outputs by latent variable .
Optimization
Special case of EM
See this for more on Expectation Maximization (EM).
Repeat until convergence:
(E-step) Guess values of 's and compute:
3Where the means that we sum over the same expression as in the numerator for every , and the numerator is simply the Gaussian PDF multiplied by the probability of .
- (M-step) Compute for each cluster :
- , i.e. take average of weights computed above
- , i.e. weighted average for the ith observation
- , i.e. the variance using the previously computed weights
Parallels to Gaussian Discriminant Analysis
In GDA we have:
But in our case (GMM) we do not know which cluster the observation belongs to, so instead we use the "soft-weights" and so get .
Mean-shift
- Non-parametric feature-space analysis technique for locating a maxima of a density function, a so-called mode-seeking algorithm
Mean shift is an iterative method for locating the maxima - the modes - of density function given discrete data sampled from that function.
- Choose a kernel function , which determines the weights of the nearby points for re-estimation of the mean.
The weighted mean of the density in the window determined by is
where is the neighborhood of , a set of points for which .
- Set
- Repeate 2-3 until converges
The procedure gets it's name from the difference , which is called the mean shift.
A rigid proof for convergence of the algorithm using a general kernel in a high dimensional space is still not known.
Convergence has been shown for one-dimension with differentiable, convex, and strictly increasing profile function.
Convergence in higher dimensions with a finite number of the stationary points (or isolated) has been proven, however, sufficient conditions for a general kernel to have a finite (or isolated) stationary points have not been provided.