dc.description.abstract | K-means is the most commonly known partitioning algorithm used for data
clustering. It was originally designed to run on a single processor. Therefore, this created a limitation on dealing with large amounts of data because
of the requirement to have data resident in memory. However, the advent of
distributed systems has led to the design of parallel versions of the algorithm.
This is done with the intention to allow the algorithm to work with large sets
of data that would otherwise be impossible to handle on a single machine,
due to limited processing power and memory.
Since there are currently many applications that are generating large sets
of data, for example in oil exploration, social media and image processing,
research in parallel clustering algorithms has received a great deal of attention so as to find efficient ways of data analysis. Clustering algorithms that
can be formulated according to the MapReduce framework such as K-means
provide a great opportunity for data analysis. This is because the actual data
is used to get statistics. Otherwise, statistics would only be obtained by applying sampling methods on the original data.
Running a parallel algorithm on a distributed platform to handle large data
sets, requires alot of data transfer on the network. This exchange of data
could choke the network if it is beyond the capacity of the network. Therefore, there is a way of reducing the amount of traffic on the network and that
is by partial aggregation. Reducing traffic on the network has been known to
increase efficiency.
In this thesis, we look at two things. First, how partial aggregation impacts
the K-means algorithm running on a distributed system. Second, we compare
two implementations of the K-means algorithm, Java and R. | no_NO |