Saturday, December 29, 2012

K-Means Clustering - 2 : Working with Scipy


In the previous article, 'K-Means Clustering - 1 : Basic Understanding', we understood what is K-Means clustering, how it works etc. In this article, we will use k-means functionality in Scipy for data clustering. OpenCV will be covered in another article.

Scipy's cluster module provides routines for clustering. The vq module in it provides k-means functionality. You will need Scipy version 0.11 to get this feature.

We also use Matplotlib to visualize the data.

Note : All the data arrays used in this article are stored in github repo for you to check. It would be nice to check it for a better understanding. It is optional. Or you can create your own data and check it.

So we start by importing all the necessary libraries.

>>> import numpy as np
>>> from scipy.cluster import vq
>>> from matplotlib import pyplot as plt

Here I would like to show three examples.

1 - Data with only one feature :

Consider, you have a set of data with only one feature, ie one-dimensional. For eg, we can take our t-shirt problem where you use only height of people to decide the size of t-shirt.

Or, from an image processing point of view, you have a grayscale image with pixel values ranges from 0 to 255. You need to group it into just two colors, may be black and white only. ( That is another version of thresholding. I don't think someone will use k-means for thresholding. So just take this as a demo of k-means.)

So we start by creating data.

>>> x = np.random.randint(25,100,25)
>>> y = np.random.randint(175,255,25)
>>> z = np.hstack((x,y))
>>> z = z.reshape((50,1))

So we have 'z' which is an array of size 50, and values ranging from 0 to 255. I have reshaped 'z' to a column vector. It is not necessary here, but it is a good practice. Reason, I will explain in coming sections. Now we can plot this using Matplotlib's histogram plot.

>>> plt.hist(z,256,[0,256]),

We get following image :

Test Data

Now we use our k-means functions.

First function, vq.kmeans(), is used to cluster the data as per our requirements and it returns the centroids of the clusters. (Docs)

It takes our test data and number of clusters we need as inputs. Other two inputs are optional and is not of big concern now.

>>> centers,dist = vq.kmeans(z,2)
>>> centers
       [ 60]])

First output is 'centers', which are the centroids of clustered data. For our data, it is 60 and 207. Second output is the distortion between centroids and test data. We mark the centroids along with the inputs.

>>> plt.hist(z,256,[0,256]),plt.hist(centers,32,[0,256]),

Below is the output we got. Those green bars are the centroids.

Green bars shows centroids after clustering

Now we have found the centroids. From first article, you might have seen our next job is to label the data '0' and '1' according to distance to the centroids. We use vq.vq() function for this purpose.

vq.vq() takes our test data and centroids as inputs and provides us the labelled data,called 'code' and distance between each data and corresponding centroids.

>>> code, distance = vq.vq(z,centers)

If you compare the arrays 'code' and 'z' in git repo, you can see all values near to first centroid will be labelled '0' and next as '1'.

Also check the distance array. 'z[0]' is 47, which is near to 60, so labelled as '1' in 'code'. And distance between them is 13, which is 'distance[0]'. Similarly you can check other data also.

Now we have the labels of all data, we can separate the data according to labels.

>>> a = z[code==0]
>>> b = z[code==1]

'a' corresponds to data with centroid = 207 and 'b' corresponds to remaining data. (Check git repo to see a&b).

Now we plot 'a' in red color, 'b' in blue color and 'centers' in yellow color as below:

>>> plt.hist(a,256,[0,256],color = 'r') # draw 'a' in red color
>>> plt.hist(b,256,[0,256],color = 'b') # draw 'b' in blue color
>>> plt.hist(centers,32,[0,256],color = 'y') # draw 'centers' in yellow color

We get the output as follows, which is our clustered data :

Output of K-Means clustering

So, we have done a very simple and basic example on k-means clustering. Next one, we will try with more than one features.

2 - Data with more than one feature :

In previous example, we took only height for t-shirt problem. Here, we will take both height and weight, ie two features.

Remember, in previous case, we made our data to a single column vector. This is because, it is a good convention, and normally followed by people from all fields. ie each feature is arranged in a column, while each row corresponds to an input sample.

For example, in this case, we set a test data of size 50x2, which are heights and weights of 50 people. First column corresponds to height of all the 50 people and second column corresponds to their weights. First row contains two elements where first one is the height of first person and second one his weight. Similarly remaining rows corresponds to heights and weights of other people. Check image below:

So now we can prepare the data.

>>> x = np.random.randint(25,50,(25,2))
>>> y = np.random.randint(60,85,(25,2))
>>> z = np.vstack((x,y))

Now we got a 50x2 array. We plot it with 'Height' in X-axis and 'Weight' in Y-axis.

>>> plt.scatter(z[:,0],z[:,1]),plt.xlabel('Height'),plt.ylabel('Weight')

(Some data may seem ridiculous. Never mind it, it is just a demo)

Test Data

Now we apply k-means algorithm and label the data.

>>> center,dist = vq.kmeans(z,2)
>>> code,distance = vq.vq(z,center)

This time, 'center' is a 2x2 array, first column corresponds to centroids of height, and second column corresponds to centroids of weight.(Check git repo data)

As usual, we extract data with label '0', mark it with blue, then data with label '1', mark it with red, mark centroids in yellow and check how it looks like.

>>> a = z[code==0]
>>> b = z[code==1]
>>> plt.scatter(a[:,0],a[:,1]),plt.xlabel('Height'),plt.ylabel('Weight')
>>> plt.scatter(b[:,0],b[:,1],c = 'r')
>>> plt.scatter(center[:,0],center[:,1],s = 80,c = 'y', marker = 's')

This is the output we got :

Result of K-Means clustering

So this is how we apply k-means clustering with more than one feature.

Now we go for a simple application of k-means clustering, ie color quantization.

3 - Color Quantization :

Color Quantization is the process of reducing number of colors in an image. One reason to do so is to reduce the memory. Sometimes, some devices may have limitation such that it can produce only limited number of colors. In those cases also, color quantization is performed.

There are lot of algorithms for color quantization. Wikipedia page for color quantization gives a lot of details and references to it. Here we use k-means clustering for color quantization.

There is nothing new to be explained here. There are 3 features, say, R,G,B. So we need to reshape the image to an array of Mx3 size (M is just a number). And after the clustering, we apply centroid values (it is also R,G,B) to all pixels, such that resulting image will have specified number of colors. And again we need to reshape it back to the shape of original image. Below is the code:

import cv2
import numpy as np
from scipy.cluster import vq

img = cv2.imread('home.jpg')
z = img.reshape((-1,3))

k = 2           # Number of clusters
center,dist = vq.kmeans(z,k)
code,distance = vq.vq(z,center)
res = center[code]
res2 = res.reshape((img.shape))

Change the value of 'k' to get different number of colors. Below is the original image and results I got for values k=2,4,8 :

Color Quantization with K-Means clustering

So, that's it !!!

In this article, we have seen how to use k-means algorithm with the help of Scipy functions. We also did 3 examples with sufficient number of images and plots. There are two more functions related to it, but I will deal it later.

In next article, we will deal with OpenCV k-means implementation.

I hope you enjoyed it...

And if you found this article useful, don't forget to share it on Google+ or facebook etc.


Abid Rahman K.

Ref :

1 - Scipy cluster module documentation

2 - Color Quantization

K-Means Clustering - 1 : Basic Understanding


In this section, we will be dealing with K-Means clustering, a machine learning algorithm, which is used for grouping a large set of data to desired number of groups depending on their similarity. It will be a series of articles, may be two or three articles.

In this article, we mainly focus on the understanding what is this algorithm, how it works, what are its applications etc. I think an intuitive understanding is more better than explaining the mathematical background behind the algorithm.

In the coming articles, I will explain how to use in-built functions of Scipy and OpenCV for K-Means clustering.

(Note : All images are sufficiently larger, but resized to fit in page. Just click on the image to see it in its original size)

By the way, What is this algorithm for ?

According to Wikipedia :

k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

( Reading above passage, I hope you might have understood at-least this : It groups large data to some groups based on some features.)

If you didn't understand above statements, I will tell a couple of examples :

1 - Google News :

I am sure you will have visited Google News at-least once. For every news, Google provides a main article and just below it, there are links to corresponding news from all other newspapers. Check below image :

This is one of my favorite features in Google News, because I try to read news on same topic from different papers. (It helps me to understand how media play with our minds. Different newspapers try to present same news in different ways to audience. And it is seen frequently with respect to political news and people reading them are, most often, got biased by these news, while real truth is hidden from them. Presently the power of media is such that, if you control the media, you control the world. Anyway, it is out of our topic. )

I have always wondered how does Google achieve this. I am not claiming Google uses k-means clustering, but this is what K-Means clustering also does, You have a large set of news and group them according to its content.

2 - T-shirt size problem

This example is commonly used to explain clustering problems.

Consider a company, which is going to release a new model of T-shirt to market. Obviously they will have to manufacture models in different sizes to satisfy people of all sizes. So the company make a data of people's height and weight, and plot them on to a graph, as below:

People Height v/s Weight plot

Company can't create t-shirts with all the sizes. Instead, they divide people to Small, Medium and Large, and manufacture only these 3 models which will fit into all the people. This grouping of people into three groups can be done by k-means clustering, and algorithm provides us best 3 sizes, which will satisfy all the people. And if it doesn't, company can divide people to more groups, may be five, and so on. Check image below :

People grouped to different clusters

So, now I hope, you might have understood use of this algorithm. Now we can go to next step.

How does the algorithm work ?

This algorithm is an iterative process. We will explain it step-by-step with the help of images.

Consider a set of data as below ( Don't be bothered about what this data is, or you can consider it as t-shirt problem). We need to cluster this data into two groups.

Test Data

Step :1 - Algorithm randomly chooses two centroids, c1 and c2 (sometimes, any two data are taken as the centroids).

Step :2 - Then it calculates the distance from each point to both centroids. If a test data is more closer to c1, then that data is labelled with '0'. If it is closer to c2, then labelled as '1' (If more centroids are there, labelled as '2','3' etc).

In our case, we will color all '0' labelled with red, and '1' labelled with blue. So we get following image after above operations.

Initial centroid selection and data labelling

Step :3 - Next we calculate the average of all blue points and red points separately and that will be our new centroids. That is c1 and c2 shift to newly calculated centroids. (Remember, the images shown are not true values and not to true scale, it is just for demonstration only. So calculated averages may not be what you expect).

And again, perform step 2 with new centroids and label data to '0' and '1'.

So we get result as below :

New centroid calculated and data re-labelled

Now step - 2 and step - 3 are iterated until both centroids are converged to fixed points. These points are such that sum of distances between test data and their corresponding centroids are minimum. Or simply, sum of distances between red data & c1 and blue data & c2 is minimum.

Final result almost looks like below :

Final Result

So, I hope by this time, you would have obtained a basic understanding how K-Means clustering works. With this intuitive knowledge, read some good text books to understand the mathematics behind this algorithm. I am sure you won't find any difficulty to understand it.

It is just a top layer of K-Means clustering. There are a lot of modifications to this algorithm like, how to choose the initial centroids, how to speed up the iteration process etc. Me too, haven't gone much deeper and I think this would be sufficient to start working with K-Means clustering. Those who need more knowledge can read any good text books.

So that was a complete theoretical, easy-to-read article.

In next article, we will see how to use Scipy's kmeans implementation for data clustering, and we will deal OpenCV implementation in later article.

So that's all for now.

If you find this article useful, don't forget to share this article on Google+, facebook etc. It is just a click away !!!


Abid Rahman K.

Ref :

1 - Stanford Machine Learning classes by Prof. Andrew Ng