Investigating how partial aggregation impacts K-means

Chiliboyi, Erick

dc.contributor.author	Chiliboyi, Erick
dc.date.accessioned	2012-09-25T13:50:55Z
dc.date.available	2012-09-25T13:50:55Z
dc.date.issued	2012
dc.identifier.uri	http://hdl.handle.net/11250/181801
dc.description	Master's thesis in Computer Science	no_NO
dc.description.abstract	K-means is the most commonly known partitioning algorithm used for data clustering. It was originally designed to run on a single processor. Therefore, this created a limitation on dealing with large amounts of data because of the requirement to have data resident in memory. However, the advent of distributed systems has led to the design of parallel versions of the algorithm. This is done with the intention to allow the algorithm to work with large sets of data that would otherwise be impossible to handle on a single machine, due to limited processing power and memory. Since there are currently many applications that are generating large sets of data, for example in oil exploration, social media and image processing, research in parallel clustering algorithms has received a great deal of attention so as to find efficient ways of data analysis. Clustering algorithms that can be formulated according to the MapReduce framework such as K-means provide a great opportunity for data analysis. This is because the actual data is used to get statistics. Otherwise, statistics would only be obtained by applying sampling methods on the original data. Running a parallel algorithm on a distributed platform to handle large data sets, requires alot of data transfer on the network. This exchange of data could choke the network if it is beyond the capacity of the network. Therefore, there is a way of reducing the amount of traffic on the network and that is by partial aggregation. Reducing traffic on the network has been known to increase efficiency. In this thesis, we look at two things. First, how partial aggregation impacts the K-means algorithm running on a distributed system. Second, we compare two implementations of the K-means algorithm, Java and R.	no_NO
dc.language.iso	eng	no_NO
dc.publisher	University of Stavanger, Norway	no_NO
dc.relation.ispartofseries	Masteroppgave/UIS-TN-IDE/2012;
dc.subject	data mining	no_NO
dc.subject	parallel algorithms	no_NO
dc.subject	cluster analysis	no_NO
dc.subject	informasjonsteknologi	no_NO
dc.subject	datateknikk	no_NO
dc.title	Investigating how partial aggregation impacts K-means	no_NO
dc.type	Master thesis	no_NO
dc.subject.nsi	VDP::Technology: 500::Information and communication technology: 550	no_NO

Tilhørende fil(er)

Filnavn:: Chiliboyi,Erick.pdf
Størrelse:: 509.5Kb
Format:: PDF

Åpne

Denne innførselen finnes i følgende samling(er)

Studentoppgaver (TN-IDE) [866]
Studentoppgaver i informasjonsteknologi, datateknikk / kybernetikk, signalbehandling

Vis enkel innførsel