Clustering algorithms

Notation

  • $x^{(i)}$ is the $i^{th}$ observation
  • $m$ is the number of observations
  • $n$ is the dimension of the data, i.e. $x^{(i)} \in \mathbb{R}^n$

K-means clustering

Summary

  • Guaranteed to converge to a local minimum
  • No guarantee of reaching the global minimum

Notation

  • $\mu_j$ is the mean of the $j^{th}$ cluster

Idea

K-means clustering is performed using two steps:

  1. Randomly initialize $k$ cluster centroids with means

    $$\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$$

  2. Repeat until convergence:
    1. Set $c^{(i)} := \arg\min_j \left\lVert x^{(i)} - \mu_j \right\rVert^2$, i.e. assign the $i^{th}$ observation to the cluster whose mean it is closest to.
    2. Update the mean of each cluster as follows:

      $$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$$

Convergence here means that the assignments to the different clusters no longer change, as in the sketch below.
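
A minimal NumPy sketch of the procedure above (the initialization scheme, picking $k$ distinct observations as the initial centroids, is one common choice and an assumption here, not prescribed by the notes):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means sketch. X: (m, n) data matrix. Returns (c, mu)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Step 1: initialize the k centroids as k distinct random observations.
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)
    c = np.full(m, -1)
    for _ in range(max_iters):
        # Step 2.1: assign each x^(i) to the cluster whose mean is closest.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, k)
        c_new = dists.argmin(axis=1)
        if np.array_equal(c_new, c):   # assignments unchanged => converged
            break
        c = c_new
        # Step 2.2: update each mean to the average of its assigned points.
        for j in range(k):
            if np.any(c == j):         # keep the old centroid if a cluster empties
                mu[j] = X[c == j].mean(axis=0)
    return c, mu
```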

Gaussian mixture models

Summary

  • Soft clustering: each observation is assigned a probability of belonging to each cluster, rather than a hard label
  • Fit with the Expectation-Maximization (EM) algorithm; as with K-means, only convergence to a local optimum is guaranteed

Notation

  • $z$ is the latent / unobserved / hidden random variable which we want to estimate
  • $z^{(i)}$ is the assigned cluster for the $i^{th}$ observation $x^{(i)}$
  • $\phi_j$ represents the probability of the $j^{th}$ "cluster" assignment, i.e. $\phi_j = p(z^{(i)} = j)$

Idea

We have a latent random variable $z^{(i)}$, and $x^{(i)}, z^{(i)}$ have a joint distribution:

$$p\left(x^{(i)}, z^{(i)}\right) = p\left(x^{(i)} \mid z^{(i)}\right)\, p\left(z^{(i)}\right)$$

The model is as follows:

  • $z^{(i)} \sim \mathrm{Multinomial}(\phi)$ where $\phi_j \geq 0$ and $\sum_{j=1}^{k} \phi_j = 1$
  • $x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$

    Note that this model is identical to Gaussian Discriminant Analysis, except that we have replaced the labeled outputs $y^{(i)}$ with the latent variable $z^{(i)}$ (a sampling sketch follows below).
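
To make the generative story concrete, here is a small NumPy sketch that samples from this model (the function name is illustrative; in the clustering setting $z$ is latent and only $x$ is observed):

```python
import numpy as np

def sample_gmm(m, phi, mus, Sigmas, seed=0):
    """Draw m observations from the model above:
    z^(i) ~ Multinomial(phi), then x^(i) | z^(i) = j ~ N(mu_j, Sigma_j)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(phi), size=m, p=phi)   # latent cluster assignments
    x = np.stack([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])
    return x, z
```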

Optimization

Special case of EM

See this for more on Expectation Maximization (EM).

Repeat until convergence (a code sketch of both steps follows after this list):

  1. (E-step) Guess values of the $z^{(i)}$'s and compute:

    $$w_j^{(i)} = p\left(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma\right) = \frac{p\left(x^{(i)} \mid z^{(i)} = j\right)\, p\left(z^{(i)} = j\right)}{\sum_{l=1}^{k} p\left(x^{(i)} \mid z^{(i)} = l\right)\, p\left(z^{(i)} = l\right)}$$

    Where the denominator $\sum_{l=1}^{k}$ means that we sum the same expression as in the numerator over every $l$, and the numerator is simply the Gaussian PDF multiplied by the probability of $z^{(i)} = j$, i.e. $\phi_j$.

  2. (M-step) Compute for each cluster $j$:
    • $\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}$, i.e. take the average of the weights computed above
    • $\mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)}\, x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}$, i.e. the weighted average of the observations
    • $\Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} \left(x^{(i)} - \mu_j\right)\left(x^{(i)} - \mu_j\right)^T}{\sum_{i=1}^{m} w_j^{(i)}}$, i.e. the covariance using the previously computed weights
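
A sketch of these updates, assuming SciPy for the Gaussian PDF; the initialization (uniform priors, random means, identity covariances) is illustrative, not from the notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100, seed=0):
    """One possible implementation of the E/M updates above. X: (m, n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    phi = np.full(k, 1.0 / k)                        # uniform cluster priors
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)
    Sigma = np.stack([np.eye(n)] * k)                # identity covariances
    for _ in range(iters):
        # E-step: w[i, j] = p(z^(i) = j | x^(i); phi, mu, Sigma).
        w = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(k)], axis=1)    # numerators, shape (m, k)
        w /= w.sum(axis=1, keepdims=True)            # normalize over clusters
        # M-step: re-estimate the parameters with the soft weights.
        Nj = w.sum(axis=0)                           # effective cluster sizes
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        for j in range(k):
            d = X - mu[j]
            Sigma[j] = (w[:, j, None] * d).T @ d / Nj[j]
    return phi, mu, Sigma, w
```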

Parallels to Gaussian Discriminant Analysis

In GDA we have: $\phi_j = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = j\}$

But in our case (GMM) we do not know which cluster an observation belongs to, so instead we use the "soft" weights $w_j^{(i)}$ and so get $\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}$.
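
Written side by side, the only change is the hard indicator becoming a soft posterior weight:

$$\underbrace{\phi_j = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = j\}}_{\text{GDA (observed labels)}} \qquad\longrightarrow\qquad \underbrace{\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}}_{\text{GMM (soft weights)}}$$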

Mean-shift

  • Non-parametric feature-space analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm

Mean shift is an iterative method for locating the maxima - the modes - of a density function given discrete data sampled from that function.

  1. Choose a kernel function $K(x_i - x)$, which determines the weights of nearby points for re-estimating the mean.
  2. The weighted mean of the density in the window determined by $K$ is

    $$m(x) = \frac{\sum_{x_i \in N(x)} K(x_i - x)\, x_i}{\sum_{x_i \in N(x)} K(x_i - x)}$$

    where $N(x)$ is the neighborhood of $x$, the set of points for which $K(x_i - x) \neq 0$.

  3. Set $x \leftarrow m(x)$
  4. Repeat steps 2-3 until $m(x)$ converges

The procedure gets its name from the difference $m(x) - x$, which is called the mean shift; a sketch of one mode-seeking run follows below.
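
A minimal sketch, assuming a flat (uniform) kernel so that the weighted mean reduces to a plain average over the window; the `bandwidth` radius and the names are illustrative:

```python
import numpy as np

def mean_shift_mode(X, x0, bandwidth=1.0, tol=1e-6, max_iters=500):
    """Seek one mode of the density underlying X (shape (m, n)),
    starting from x0, using a flat kernel of radius `bandwidth`."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        # N(x): the points for which K(x_i - x) != 0, here a ball
        # of radius `bandwidth` around x (flat kernel).
        in_window = np.linalg.norm(X - x, axis=1) <= bandwidth
        if not in_window.any():            # empty window: nothing to shift to
            return x
        m_x = X[in_window].mean(axis=0)    # the weighted mean m(x)
        if np.linalg.norm(m_x - x) < tol:  # mean shift m(x) - x vanished
            return m_x
        x = m_x                            # step 3: x <- m(x)
    return x
```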

A rigorous proof of convergence of the algorithm using a general kernel in a high-dimensional space is still not known.

Convergence has been shown in one dimension for a differentiable, convex, and strictly increasing profile function.

Convergence in higher dimensions with a finite number of (or isolated) stationary points has been proven; however, sufficient conditions for a general kernel to have a finite number of (or isolated) stationary points have not been provided.