Project Report : CS 292: KMeans Algorithm with Canopy Clustering

KMeans with Canopy Clustering for Integer and Text data using MapReduce.

The purpose of the project was to implement the KMeans algorithm with canopy clustering to cluster random integers and a text data set.

Introduction to the Algorithms

KMeans is one of the simplest clustering algorithms. The basic KMeans algorithm is described below.

S = set of input points
Choose k random points from S as the initial K-Means centers, call this set K
while (true)
	for all points p in S
		a. for all centers k in K
			i. calculate the distance between p and k
		b. assign p to the k-center it is closest to
	Recalculate the K-Means centers: each new center is the average of all the points assigned to that center
	If the calculated error is < the allowed error, break
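The pseudocode above can be sketched in plain Python. This is a minimal illustration for 2D points, not the MapReduce version used in the project; the function name `kmeans`, the squared-Euclidean distance, and the convergence test are my choices:

```python
import random

def kmeans(points, k, allowed_error=1e-4, max_iter=100):
    """Plain KMeans over 2D points, following the pseudocode above."""
    centers = random.sample(points, k)  # initial K-Means centers
    for _ in range(max_iter):
        # assign every point to its closest K-Means center
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # each new center is the average of the points assigned to it
        new_centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # stop once the centers barely move (the "calculated error")
        error = sum((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                    for a, b in zip(centers, new_centers))
        centers = new_centers
        if error < allowed_error:
            break
    return centers
```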

Although the KMeans algorithm is relatively straightforward and converges quickly in practice, it is computationally expensive. The total run time of the algorithm is O(kni), where

k= number of k-centers
n = number of total points
i = iterations taken to converge

Typically the ideal value of k is not known in advance for most datasets, and we may have to repeat the above algorithm for different values of k.

Canopy clustering is a fast, approximate clustering algorithm, typically used as a pre-clustering step. The basic steps of canopy clustering are described below.

Given two threshold distances T1 and T2, with T1 > T2, and a set of points:

1. Determine the canopy centers: iterate through the set of points; a point becomes a new canopy center if it is at a distance > T2 from every existing canopy center.
2. Determine the canopy membership: for each point in the input set, if the point is at a distance < T1 from any of the points in the list of canopy centers (generated in step 1), then the point is a member of the corresponding canopy.
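The two steps can be sketched in plain Python. The function name `canopy`, the `dist` callback, and the list-based membership representation are illustrative choices; note canopies may overlap:

```python
def canopy(points, t1, t2, dist):
    """Canopy clustering sketch. Requires T1 > T2.
    Returns the canopy centers and, per point, the canopies it belongs to."""
    assert t1 > t2
    centers = []
    # Step 1: a point becomes a new canopy center only if it is at
    # distance > T2 from every existing canopy center.
    for p in points:
        if all(dist(p, c) > t2 for c in centers):
            centers.append(p)
    # Step 2: a point is a member of every canopy whose center is
    # within T1 of it (so canopies may overlap).
    membership = [[i for i, c in enumerate(centers) if dist(p, c) < t1]
                  for p in points]
    return centers, membership
```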

KMeans with Canopy Clustering

When the KMeans algorithm is used with canopy clustering, the computation in the distance-calculation step is reduced. The modified KMeans algorithm is below.

S = set of input points
Determine the canopy centers C and the canopy membership of all points, using the canopy clustering algorithm described above
Choose k random points from S as the initial K-Means centers, call this set K
while (true)
	for all points p in S
		a. for all centers k in K
			i. if p and k share a canopy, calculate the distance between p and k; otherwise treat the distance as infinite
		b. assign p to the k-center it is closest to
	Recalculate the K-Means centers: each new center is the average of all the points assigned to that center
	If the calculated error is < the allowed error, break
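The only change from plain KMeans is in the assignment step. A sketch of the canopy-restricted assignment in Python (the function name and the set-based canopy-id representation are my own):

```python
def assign_with_canopies(points, centers, point_canopies, center_canopies, dist):
    """A real distance is computed only when the point and the K-Means
    center share at least one canopy; otherwise it is treated as infinite."""
    INF = float("inf")
    assignment = []
    for pi, p in enumerate(points):
        best, best_d = 0, INF
        for ci, c in enumerate(centers):
            # shared canopy -> real distance, otherwise infinite
            d = dist(p, c) if point_canopies[pi] & center_canopies[ci] else INF
            if d < best_d:
                best, best_d = ci, d
        assignment.append(best)
    return assignment
```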

MapReduce Implementation of KMeans with Canopy Clustering

The MapReduce implementation of KMeans with canopy clustering has the following steps.

Figure 2: Map Reduce Steps


1. Data Preparation: The input data needs to be converted into a format suitable for distance and similarity measures. This is trivial for integer data but requires some preprocessing of text data. The details of the pre-processing are discussed later.

2. Picking Canopy Centers: The input points are split across the individual mappers. Each mapper applies the sequential canopy-generation step to the input it receives and emits its local canopy centers. The reducer then repeats the canopy-generation algorithm on the combined centers from all mappers.

Figure 3: Canopy Generation

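This two-phase canopy-generation step can be sketched in plain Python, with lists standing in for the mappers' input splits (function names are illustrative; the real implementation streams records through Hadoop):

```python
def canopy_centers(points, t2, dist):
    # sequential canopy-center generation, as run inside each mapper
    centers = []
    for p in points:
        if all(dist(p, c) > t2 for c in centers):
            centers.append(p)
    return centers

def mapreduce_canopy_centers(splits, t2, dist):
    # map phase: every mapper generates canopy centers from its own split
    local = [c for split in splits for c in canopy_centers(split, t2, dist)]
    # reduce phase: a single reducer repeats the same algorithm
    # on the combined centers from all mappers
    return canopy_centers(local, t2, dist)
```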

3. Assign Points to canopy centers: The canopy assignment step would simply assign points to generated canopy centers.

Figure 4: Canopy Assignment


4. Pick K-Means Cluster Centers & Iterate Until Convergence: In this MapReduce step, the mapper finds the closest K-Means center for each point. The computation is greatly reduced because we only calculate the distance between a K-center and a point if they share a canopy. The reducer then emits the average of the x- and y-coordinates of the points belonging to a particular K-center; this average is the new K-center. We repeat the iteration until the centers barely move between iterations.

Figure 5: KMeans Iteration

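One such iteration can be sketched as a map and a reduce function (plain Python with illustrative names; in the canopy version, `dist` would return infinity for a point and center that share no canopy):

```python
from collections import defaultdict

def kmeans_map(point, centers, dist):
    # mapper: emit (id of the closest K-Means center, the point itself)
    best = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return best, point

def kmeans_reduce(pairs):
    # reducer: average the x and y coordinates of the points per center id;
    # these averages are the new K-Means centers
    groups = defaultdict(list)
    for cid, p in pairs:
        groups[cid].append(p)
    return {cid: (sum(p[0] for p in pts) / len(pts),
                  sum(p[1] for p in pts) / len(pts))
            for cid, pts in groups.items()}
```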

5. Assign Points to the K-Mean Centers: The final step of the algorithm assigns points to the final k-mean centers. The points are now in clustered sets.

Data Preparation
For the integer data I used randomly generated 2D points, generated in the [-200, -200] to [200, 200] space. Some of the cluster plots are shown below.

Figure 6: KMeans N=1000, K=10


Figure 7: KMeans N=10000, K=5


The average run time for KMeans with canopy clustering vs KMeans alone is plotted in Figure 8.

Figure 8: Graph KMeans vs KMeans + Canopy

Figure 10: Tanimoto Distance

For the text data I used the Reuters-21578 dataset. The dataset is available here. In order to determine the similarity and difference between text documents, the text needs to be vectorized. I vectorized each document as a set of <word, weight> pairs. The weight measure I chose for the weight of a word in a document is Term Frequency – Inverse Document Frequency (TF-IDF).

The formula for the calculation: TF-IDF(i, j) = WF(i, j) * log(N / DF(i))

where
WF(i, j) = the term frequency of word i in document j, calculated as the number of times word i occurs in document j divided by the total number of words in document j
N = the total number of documents in the sample set
DF(i) = the number of documents that word i occurs in
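The formula can be sketched in a few lines of Python (illustrative only; the project used a library for this rather than the code below):

```python
import math

def tf_idf(docs):
    """docs is a list of token lists; returns one {word: tf-idf} dict per doc."""
    N = len(docs)                                # total number of documents
    df = {}                                      # DF(i): documents containing word i
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    vectors = []
    for doc in docs:
        vec = {}
        for w in set(doc):
            wf = doc.count(w) / len(doc)         # WF(i, j): frequency of word i in doc j
            vec[w] = wf * math.log(N / df[w])    # TF-IDF = WF(i, j) * log(N / DF(i))
        vectors.append(vec)
    return vectors
```

Note that a word appearing in every document gets weight zero, since log(N / N) = 0.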

Converting the text into TF-IDF format was done using the dooplogs library from here. The MapReduce step details are outlined in the blog here.

The similarity measure used for text-document comparison is the Tanimoto coefficient. The distance between two documents is 1 - Tanimoto coefficient.
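A sketch of the measure in Python, with dicts standing in for the sparse TF-IDF vectors (function names are my own):

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sparse vectors given as dicts:
    dot(a, b) / (|a|^2 + |b|^2 - dot(a, b))."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sum(v * v for v in a.values())
    nb = sum(v * v for v in b.values())
    return dot / (na + nb - dot)

def tanimoto_distance(a, b):
    # the distance used for clustering is 1 - Tanimoto coefficient
    return 1.0 - tanimoto(a, b)
```

Identical documents get distance 0; documents with no words in common get distance 1.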

I created an output HTML page of the Reuters news articles here.


The code is available for download from here.


Sorting data with mapreduce

Sorting is simple in MapReduce, but at the same time not very intuitive. Sorting large files with MapReduce is an essential step in many graph algorithms, including the famous PageRank.

The mapper phase of the algorithm takes a key-value pair, say (k1, v1), and we want to sort the data based on the value. In this case the mapper simply outputs (v1, k1). Because the MapReduce framework sorts by key during the shuffle, the keys received by the reducer arrive in sorted order, so the reducer just needs to output its input, i.e. it is an identity reducer.

So to summarize
Mapper Phase : Input (k1,v1) ->Output (v1,k1)
Reducer Phase : Input(v1,k1)->Output(v1,k1)

To generate a large amount of data, I generated a set of random numbers using the Python code below.

#!/usr/bin/env python

import sys
import numpy

# generate random integers to sort (the count and range are arbitrary)
a = numpy.random.randint(0, 1000000, 100000)

count = 1
for item in a:
    print "{0}\t{1}".format(count, item)
    count = count + 1

This generates the input for the map-phase of the map-reduce algorithm
The code for the mapper is

#!/usr/bin/env python
# mapper: swap key and value so the framework sorts the data by value

import sys

for line in sys.stdin:
    line = line.strip()
    arg = line.split('\t')
    print "{0}\t{1}".format(arg[1], arg[0])

The code for the reducer is

#!/usr/bin/env python
# identity reducer: the keys arrive sorted, so just echo the input

import sys

for line in sys.stdin:
    line = line.strip()
    arg = line.split('\t')
    print "{0}\t{1}".format(arg[0], arg[1])

You can download the code and a script to run it on the Cloudera VM from here.

Getting Started with MPI

MPI stands for Message Passing Interface. It is essentially a library of routines, with bindings/implementations in Python, Fortran, C, and C++. I am using Open MPI for my examples. A blog post on installing Open MPI is here.

MPI allows processes to communicate by exchanging messages. Typically the processes are members of communicators. In C++ the default communicator shared by all processes is MPI::COMM_WORLD.

Each process has a unique id (rank) within the communicator. Typically there is one master process and a number of slave processes.

Example 1: Sending a random number from the master process to the slave processes
Code Snippet 1.1

int main(int argc, char **argv) {
	// Initialize MPI and determine our process number
	MPI::Init(argc, argv);
	const int iMyProc = MPI::COMM_WORLD.Get_rank();
	if (iMyProc == 0) {
		master();
	} else {
		worker(iMyProc);
	}
	MPI::Finalize();
	return 0;
}

MPI::COMM_WORLD.Get_rank() gives the rank of the current process. The rank of the master process is generally 0, and the workers have ranks > 0.

1.2 Code Snippet for the master process

void master() {
   const int nProcs = MPI::COMM_WORLD.Get_size();
   /* initialize the random seed */
   srand(time(NULL));

   // send a random number to each worker
   // (msgID is a message tag defined elsewhere in the program)
   for (int i = 1; i < nProcs; i++) {
      unsigned r = rand();
      printf("Hello from master to worker number %d : sending random r %u\n", i, r);
      MPI::COMM_WORLD.Send(&r, sizeof(unsigned), MPI::BYTE, i, msgID);
   }
}

The master sends a random number to each of its slave processes. Every MPI::COMM_WORLD.Send() on the master side is matched with an MPI::COMM_WORLD.Recv() on the slave end. The message tag (msgID) identifies the message being transmitted.

1.3 Code Snippet for worker process

void worker(int iMyProc) {
   unsigned r;
   // iMasterProc (the master's rank, 0) and msgID are defined elsewhere
   MPI::COMM_WORLD.Recv(&r, sizeof(unsigned), MPI::BYTE, iMasterProc, msgID);
   printf("Hello from worker %d : received random r %u\n", iMyProc, r);
}

The worker code is the reciprocal of the master code: MPI::COMM_WORLD.Recv on the worker side receives the message from the master.

Hope this helps folks get started with MPI.

Come on, guys: obfuscate your source code

Most people put a lot of effort into their Android apps. There is a lot of business logic and intellectual property hidden within your application. Yet most application developers spend no time making sure that their application code is properly obfuscated.

The process of Android code de-compilation starts with obtaining the .apk file. The .apk file is hosted either on the Android Market or on an individual web server.

Obtaining .apk file from Android Market.

1. Root your device using z4root or any other software.
2. Download the application from the android market.
3. Attach your device to your computer via USB and run the below instruction on the command line

adb pull 

Obtaining from hosted web server
1. Download directly from the URL. This works if the web server is not doing any redirects based on the browser user agent.
2. If the web server redirects based on the browser user agent:
a) You can choose to modify your browser user agent.
b) Or simply choose to download the file on your Android device and transfer the file over.

Decompiling the .apk file
1. Download dex2jar from here
2. Convert your .apk file to a .jar file by running dex2jar on it: dex2jar.sh <apk-filename>

This will create an apk-filename.dex2jar.jar file.
3. Download jd-gui from here
4. Open the apk-filename.dex2jar.jar file in jd-gui.

To learn how to obfuscate your Android source code (i.e. prevent others from seeing it) using ProGuard, read this blog. (Unless you are a hacker, or you don't really care about your source code; in that case, please open source it.)

Syntax highlighting CUDA files (.cu) in xemacs

*** Note: I am not the original author of the cuda.el file. I am compiling notes from my searches on the topic ***
*** Note 2: I was not able to get syntax highlighting working on Emacs 23. This only works on XEmacs ***

1. To get xemacs

sudo apt-get install xemacs21

2. Add the following lines to your ~/.xemacs/init.el

(autoload 'cuda-mode "~/.emacs.d/cuda-mode.el" "Cuda mode." t)
(setq auto-mode-alist (append '(("\\.cu$" . cuda-mode)) auto-mode-alist))

3. Create the file ~/.emacs.d/cuda-mode.el, or download it from here
4. open your .cu file using xemacs.

Edit: to use this on Emacs, remove the defconst lines in the cuda-mode.el file.

Setting PATH on Ubuntu

To set PATH for all users.

1. Open the terminal application and type the following:
sudo gedit /etc/environment
2. Modify the file in gedit by adding path directories.
3. Save the file and quit.
4. In the terminal application, type
source /etc/environment
to load the new path from the environment file.
5. Echo out the new PATH: echo $PATH