Clustering algorithms


  • $x^{(i)}$ is the $i^{\text{th}}$ observation
  • $d$ is the dimension of the data, i.e. $x \in \mathcal{R}^d$

K-means clustering


  • Guaranteed to converge to a local minimum
  • No guarantee of finding the global minimum


  • $\mu_i$ is the mean of the $i^{\text{th}}$ cluster


K-means clustering is performed using two steps:

  1. Randomly initialize $k$ cluster centroids with means

	  \mu_1, ..., \mu_k \in \mathcal{R}^d
  2. Repeat until convergence:

    1. Set $c^{(i)} = \underset{j}{\text{argmin }} ||x^{(i)} - \mu_j||^2$, i.e. assign the $i^{\text{th}}$ data to the cluster for which it is closest to the mean.
    2. Update the mean of each cluster as follows:

	  \mu_j = \frac{ \overset{m}{\underset{i=1}{\sum}} 1 \{c^{(i)} = j\} \, x^{(i)} } { \overset{m}{\underset{i=1}{\sum}} 1 \{c^{(i)} = j\} }

Convergence here means until the assignments to the different clusters do not change.
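The two steps above can be sketched as a short NumPy implementation (a minimal illustration, not an optimized one; the function name, the seed handling, and initializing centroids at random data points are my own choices):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm on data X of shape (m, d)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the k centroids (here: k distinct random data points).
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    c = None
    for _ in range(max_iter):
        # Step 2.1: assign each point to the cluster with the closest mean.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c_new = d2.argmin(axis=1)
        # Convergence: the assignments no longer change.
        if c is not None and np.array_equal(c_new, c):
            break
        c = c_new
        # Step 2.2: update each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(c == j):
                mu[j] = X[c == j].mean(axis=0)
    return c, mu
```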

Gaussian mixture models



  • $z$ is the latent / unobserved / hidden random variable which we want to estimate
  • $z^{(i)}$ is the assigned cluster for the $i^{\text{th}}$ observation $x^{(i)}$
  • $\phi_j$ represents the probability of the $j^{\text{th}}$ "cluster" assignment


We have a latent random variable $z$ and $x^{(i)}, z^{(i)}$ have a joint distribution:

	   P(x^{(i)}, z^{(i)}) = P(x^{(i)} | z^{(i)}) P(z^{(i)})

The model is as follows:

  • $z^{(i)} \sim \text{Multinomial}(\phi)$ where $\phi_j\ge 0, \quad \underset{j}{\sum} \phi_j = 1$
  • $x^{(i)} | z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$

    Note that this model is identical to Gaussian Discriminant Analysis, except that the labeled outputs $y$ have been replaced by the latent variable $z$.
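The generative model can be sketched directly (a minimal illustration; the function name and signature are my own assumptions):

```python
import numpy as np

def sample_gmm(m, phi, mus, sigmas, seed=0):
    """Draw m observations from the mixture model:
    z^(i) ~ Multinomial(phi), then x^(i) | z^(i) = j ~ N(mu_j, Sigma_j)."""
    rng = np.random.default_rng(seed)
    # Latent cluster assignments, drawn with probabilities phi.
    z = rng.choice(len(phi), size=m, p=phi)
    # Each observation is drawn from its cluster's Gaussian.
    x = np.stack([rng.multivariate_normal(mus[j], sigmas[j]) for j in z])
    return x, z
```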


Special case of EM

See this for more on Expectation Maximization (EM).

Repeat until convergence:

  1. (E-step) Guess values of the $z^{(i)}$'s and compute, for each cluster $j$:

	w_j^{(i)} & := P(z^{(i)} = j | x^{(i)}; \phi, \mu, \Sigma) \\
	& = \frac{ P(x^{(i)} | z^{(i)} = j) P(z^{(i)} = j) } {\sum_l P(x^{(i)} | z^{(i)} = l) P(z^{(i)} = l)} \\
	& = \frac{\frac{1}{(2 \pi)^{d/2} |\Sigma_j|^{\frac{1}{2}}} \exp \Big( -\frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma^{-1}_j (x^{(i)} - \mu_j) \Big) \phi_j} {\sum_l \dots}

    Where $\sum_l \dots$ means that we sum the same expression as in the numerator over every cluster $l$; the numerator is simply the Gaussian PDF multiplied by the probability of $z^{ (i) } = j$.

  2. (M-step) Compute for each cluster $j$:
    • $\phi_j := \frac{1}{m} \overset{m}{\underset{i=1}{\sum}} w_j^{(i)}$, i.e. take average of weights computed above
    • $\mu_j := \frac{\overset{m}{\underset{i=1}{\sum}} w_{j}^{(i)} x^{(i)}} {\overset{m}{\underset{i=1}{\sum}} w_{j}^{(i)}}$, i.e. the weighted average of the observations
    • $\Sigma_j := \frac { \overset{m}{\underset{i=1}{\sum}} w_{j}^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T } {\overset{m}{\underset{i=1}{\sum}} w_{j}^{(i)}}$, i.e. the covariance using the previously computed weights
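The E- and M-steps can be combined into a compact NumPy sketch (a minimal illustration, not production code; the random-data-point initialization of the means and the small ridge added to keep each covariance invertible are my own assumptions):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = len(mu)
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * np.einsum('id,dk,ik->i', diff, inv, diff))

def em_gmm(X, k, n_iter=50, seed=0):
    m, d = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, k, replace=False)].astype(float)
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: soft weights w[i, j] = P(z^(i) = j | x^(i); phi, mu, Sigma).
        w = np.column_stack([phi[j] * gaussian_pdf(X, mu[j], Sigma[j])
                             for j in range(k)])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate phi, mu, Sigma from the soft weights.
        Nj = w.sum(axis=0)
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return phi, mu, Sigma, w
```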

Parallels to Gaussian Discriminant Analysis

In GDA we have: $\phi_j = P(y^{ (i) } = j) = \frac{1}{m} \overset{m}{\underset{i=1}{\sum}} 1_{ y^{(i)} = j }$

But in our case (GMM) we do not know which cluster the observation belongs to, so instead we use the "soft" weights $w_j^{(i)} := P(z^{(i)} = j | x^{(i)}; \phi, \mu, \Sigma)$ and so get $\phi_j := \frac{1}{m} \overset{m}{\underset{i=1}{\sum}} w_j^{(i)}$.


Mean shift


  • Non-parametric feature-space analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm

Mean shift is an iterative method for locating the maxima - the modes - of a density function given discrete data sampled from that function.

  1. Choose a kernel function $K(x_i - x)$, which determines the weights of the nearby points for re-estimation of the mean.
  2. The weighted mean of the density in the window determined by $K$ is

 m(x) = \frac{\sum_{x_i \in N(x)} K(x_i - x) x_i}{\sum_{x_i \in N(x)} K(x_i - x)}

    where $N(x)$ is the neighborhood of $x$, the set of points for which $K(x_i - x) \ne 0$.

  3. Set $x := m(x)$
  4. Repeat steps 2-3 until $m(x)$ converges

The procedure gets its name from the difference $m(x) - x$, which is called the mean shift.
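The iteration for a single query point can be sketched as follows (a minimal illustration assuming a flat, uniform kernel, so the weighted mean reduces to the plain mean of the points inside the window; the function name and defaults are my own choices):

```python
import numpy as np

def mean_shift_point(X, x, bandwidth=1.0, tol=1e-5, max_iter=300):
    """Shift a single query point x toward a mode of the density
    estimated from the data X, using a flat kernel of radius `bandwidth`."""
    for _ in range(max_iter):
        # Neighborhood N(x): points where the (flat) kernel is nonzero.
        in_window = np.linalg.norm(X - x, axis=1) < bandwidth
        # Weighted mean m(x); with uniform weights this is the plain mean.
        m_x = X[in_window].mean(axis=0)
        # Stop when the mean shift m(x) - x has (numerically) vanished.
        if np.linalg.norm(m_x - x) < tol:
            return m_x
        x = m_x
    return x
```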

A rigorous proof of convergence of the algorithm with a general kernel in a high-dimensional space is still not known.

Convergence has been shown in one dimension for a differentiable, convex, and strictly decreasing profile function.

Convergence in higher dimensions has been proven for densities with a finite number of (or isolated) stationary points; however, sufficient conditions for a general kernel to yield a finite number of (or isolated) stationary points have not been provided.