Clustering algorithms

Notation
K-means clustering
- Summary
- Notation
- Idea
Gaussian mixture models
- Summary
- Notation
- Idea
- Optimization
Mean-shift

Notation

$x^{(i)}$ is the $i^{\text{th}}$ observation
$d$ is the dimension of the data, i.e. $x \in \mathcal{R}^d$

K-means clustering

Summary

Guarantees local minima converges
No guarantee for global minima

Notation

$\mu_i$ is the mean of the $i^{\text{th}}$ cluster

Idea

K-means clustering is performing using two steps:

Randomly initialize $k$ cluster centroids with means

$\begin{equation*} \mu_1, ..., \mu_k \in \mathcal{R}^d \end{equation*}$ 1
Repeat until convergence:
1. Set $c^{(i)} = \underset{j}{\text{argmin }} ||x^{(i)} - \mu_j||^2$ , i.e. assign the $i^{\text{th}}$ data to the cluster for which it is closest to the mean.
2. Update mean of cluster as follows: $\begin{equation}$ μ_j = { {} 1_{\{c⁽ⁱ⁾ = j \}} x⁽ⁱ⁾} {{} 1_{\{c⁽ⁱ⁾ = j\}}}

Convergence here means until the assignments to the different clusters do not change.

Gaussian mixture models

Summary

Notation

$z$ is the latent / unobserverd / hidden random variable which we want to estimate
$z^{(i)}$ is the assigned cluster for the $i^{\text{th}}$ observation $x^{(i)}$
$\phi_j$ represents the probability of the $j^{\text{th}}$ "cluster" assignment

Idea

We have a latent random variable $z$ and $x^{(i)}, z^{(i)}$ have a joint distribution:

$\begin{equation*} P(x^{(i)}, z^{(i)}) = P(x^{(i)} | z^{(i)}) P(z^{(i)}) \end{equation*}$ 2

The model is as follows:

$z_i \sim \text{Multinomial}(\phi)$ where $\phi_j\ge 0, \quad \underset{j}{\sum} \phi_j = 1$
$x^{(i)} | z^{(i)} \sim \mathcal{N}(\mu_j, \Sigma_j)$

Note that this model is identical to Gaussian Discriminant Analysis, except the fact that we have replaced our labeled outputs $y$ by latent variable $z$ .

Optimization

Special case of EM

See this for more on Expectation Maximization (EM).

Repeat until convergence:

(E-step) Guess values of $z^{(i)}$ 's and compute:

$\begin{equation*} \begin{split} w^{(i)} & := P(z^{(i)} = j | x^{(i)}, \phi, \mu, \Sigma) \\ & = \frac{ P(x^{(i)} | z^{(i)}) P(z^{(i)}) } {\sum_k P(x^{(k)} | z^{(k)}) P(z^{(k)})} \\ & = \frac{\frac{1}{(2 \pi)^{d} |\Sigma|^{\frac{1}{2}}} exp \Big( (x^{(i)} - \mu_j)^T \Sigma^{-1}_j (x^{(i)} - \mu_j) \Big) \phi_j} {\sum_k \dots} \end{split} \end{equation*}$ 3

Where the $\sum_k \dots$ means that we sum over the same expression as in the numerator for every $j$ , and the numerator is simply the Gaussian PDF multiplied by the probability of $z^{ (i) } = j$ .
(M-step) Compute for each cluster :
- $\phi_j := \frac{1}{m} \overset{m}{\underset{i=1}{\sum}} w_j^{(i)}$ , i.e. take average of weights computed above
- $\mu_j := \frac{\overset{m}{\underset{i=1}{\sum}} w_{j}^{(i)} x^{(i)}} {\overset{m}{\underset{i=1}{\sum}} w_{j}}$ , i.e. weighted average for the ith observation
- $\varepsilon_j = \frac { \overset{m}{\underset{i=1}{\sum}} w_{j}^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T } {\overset{m}{\underset{i=1}{\sum}} w_{j}^{(i)}}$ , i.e. the variance using the previously computed weights

Parallels to Gaussian Discriminant Analysis

In GDA we have: $\phi_j = P(z^{ (i) } = j) = \frac{1}{m} \overset{m}{\underset{i=1}{\sum}} 1_{ z^{(i)} = j }$

But in our case (GMM) we do not know which cluster the observation belongs to, so instead we use the "soft-weights" $w^{(i)} := P(z^{(i)} = j | x^{(i)}, \phi, \mu, \Sigma)$ and so get $\phi_j := \frac{1}{m} \overset{m}{\underset{i=1}{\sum}} w_j^{(i)}$ .

Mean-shift

Non-parametric feature-space analysis technique for locating a maxima of a density function, a so-called mode-seeking algorithm

Mean shift is an iterative method for locating the maxima - the modes - of density function given discrete data sampled from that function.

Choose a kernel function $K(x_i - x)$ , which determines the weights of the nearby points for re-estimation of the mean.
The weighted mean of the density in the window determined by $K$ is

$\begin{equation*} m(x) = \frac{\sum_{x_i \in N(x)} K(x_i - x) x_i}{\sum_{x_i \in N(x)} K(x_i - x)} \end{equation*}$

where $N(x)$ is the neighborhood of $x$ , a set of points for which $K(x_i) \ne 0$ .
Set $x := m(x)$
Repeate 2-3 until $m(x)$ converges

The procedure gets it's name from the difference $m(x) - x$ , which is called the mean shift.

A rigid proof for convergence of the algorithm using a general kernel in a high dimensional space is still not known.

Convergence has been shown for one-dimension with differentiable, convex, and strictly increasing profile function.

Convergence in higher dimensions with a finite number of the stationary points (or isolated) has been proven, however, sufficient conditions for a general kernel to have a finite (or isolated) stationary points have not been provided.

Clustering algorithms

Table of Contents

Notation

K-means clustering

Summary

Notation

Idea

Gaussian mixture models

Summary

Notation

Idea

Optimization

Special case of EM

Parallels to Gaussian Discriminant Analysis

Mean-shift