KMeans with Canopy Clustering for Integer and Text data using MapReduce.
Summary
The purpose of the project was to implement KMeans Algorithm with Canopy Clustering to cluster Random Integers and Text Data Set.
Introduction to the Algorithms
KMeans is one of the simplest clustering algorithms. The basic KMeans algorithm is described in the below
S= Set of Random Points Choose k random points from S as the initial K-Means Centers, call it set K while (true) for all points p in S a. for all points k in K i. calculate distance between p and k. b. Assign p to the k-center which it is closest to. Re calculate the K-Mean Centers, the new K-Mean centers are an average of all the points assigned to a particular K-Mean Center. If the calculated error is < allowed error break;
Although the KMeans algorithm is relatively straight-forward and converges in practice quickly it is computationally expensive. The total run time of the algorithm is O(kni) where
k= number of k-centers
n = number of total points
i = iterations taken to converge
Typically for most datasets the ideal value of k itself is not known and we may have to repeat the above algorithm for different values of k.
Canopy Clustering is a hierarchal clustering algorithm. The basic steps of the canopy clustering are described below.
Given two threshold distance T1 and T2 ; T1>T2 and a set of points. Determine Canopy Centers: Iterate through the set of points, if the point is at distance Determine the canopy membership - for each point in the input set if the point is at a distance < T1 from any of points in the list of canopy centers (generated in step1) then point is member of the corresponding cluster.
KMeans algorithm when used with canopy clustering can reduce the computations in the distance calculation step. The modified kmeans algorithm is below
S= Set of Random Points Determine the canopy centers C, and the canopy membership of all points using canopy clustering described above Choose k random points from S as the initial K-Means Centers, call it set K while (true) for all points p in S a. for all points k in K if p and k share a canopy center then calculate distance between p and k. else distance is infinite. b. Assign p to the k-center which it is closest to. Re calculate the K-Mean Centers, the new K-Mean centers are an average of all the points assigned to a particular K-Mean Center. If the calculated error is < allowed error break;
Mapreduce Implementations of KMeans algorithm with Canopy Clustering.
The Map Reduce implementation of KMeans Algorithm with Canopy Clustering has the following steps.
1. Data Preparation: The input data needs to be converted into a format suitable for distance and similarity measures. This is trivial for integer data but requires some preprocessing of text data. The details of the pre-processing are discussed later.
2. Picking Canopy Centers: The input points are split into data for the individual mappers. The mappers would then apply the sequential canopy generation step to the input data they receive. The reducer would receive the canopy centers and repeat the canopy generation algorithm on the input.
3. Assign Points to canopy centers: The canopy assignment step would simply assign points to generated canopy centers.
4. Pick K-Mean Cluster Centers & Iterate until convergence: In this mapreduce step, we simply find the kmean center for each point in the mapper stage. The computation to calculate the closest kmean center is greatly reduced as we only calculate the distance between a k-center and point if they share a canopy. The reducer would then emit the average of x-coordinate and y-coordinate of the points belonging to a particular k-center. The average emitted is the new k-center. We repeat the iteration until the centers don’t move much between iterations.
5. Assign Points to the K-Mean Centers: The final step of the algorithm assigns points to the final k-mean centers. The points are now in clustered sets.
Data Preparation
Integer
For Integer Data I used randomly generated 2D points. The points were generated in the [-200,-200] to [200,200] space. Some the cluster plots is below.
The average time for KMeans with Canopy Clustering vs KMeans alone is plotted in Figure 8
Text
For text data I used Reuters 21578 dataset. The dataset is available here. In order to determine the similarity and difference between text documents the text needs to vectorized. I vectorized the documents as pair of>. The weight measure I choose for weight of a word in a document is called Term Frequency – Inverse Document Frequency.
The formula for calculation of TF-IDF = (WFi,j) * log (N/DFi)
where WFi,j = is the Term Frequency of Word i in Document j. It is calculated as the ratio of # of times Word i occurs in Document j by the number of words in Document j
N= is the total number of documents in the sample set.
DFi = is the number of documents that word i occurs in.
Converting text into TF-IDF format was done using the dooplogs library from here. The mapreduce step details are outlined in the blog here.
The similarity measure used for text document comparison is Tanimoto Coefficient. The distance between two documents would be 1-Tanimoto Coefficient
Results
I created an output html page of the reuter news articles here
Reference
1. http://horicky.blogspot.com/
2. http://code.google.com/p/hadoop-clusternet/wiki/RunningMapReduceExampleTFIDF
3. http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
Code is available for Download from here