Performance analysis and optimization of left outer join on map side
Master thesis

Date
2012
Collections
- Studentoppgaver (TN-IDE) [821]
Abstract
Ontologies are representations of the entities and relationships that structure an application area. They are important for tasks such as data integration, natural-language processing, information retrieval, and decision support. The NCBO Resource Index is a system for ontology-based annotation and indexing of biomedical data. As its data grows, a distributed processing method is needed that can store, process, and query this large-scale data efficiently. This paper builds on the master thesis of B. Byambajav, Methods for Large-scale Semantic Expansion on Hadoop Architecture, and seeks a better solution for processing NCBO Resource Index data, focusing on performance optimization of the left outer join on the map side. We studied and compared different kinds of join algorithms. To design more effective experiments, we studied the characteristics of HDFS and DistributedCache and then implemented a map-side left outer join algorithm on the Hadoop platform. To optimize performance, we examined several methods for controlling the number of map tasks. Based on the experimental results, we tuned critical parameters and drew a number of conclusions. These conclusions indicate that the map-side join works well and achieves better results than previous work.
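As an illustration of the general idea behind a map-side left outer join, the following is a minimal Python sketch and not the thesis's Hadoop implementation. It assumes the smaller table fits in memory on every mapper (the role DistributedCache plays on Hadoop) and that the larger table is streamed record by record. The function names `load_small_table` and `map_left_outer_join` and the sample records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def load_small_table(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build an in-memory hash table from the small (right-hand) table.

    On Hadoop, each mapper would read this table from a locally cached
    file; here it is simply passed in as an iterable (an assumption).
    """
    table: Dict[str, List[str]] = {}
    for key, value in rows:
        table.setdefault(key, []).append(value)
    return table


def map_left_outer_join(
    left_records: Iterable[Tuple[str, str]],
    small_table: Dict[str, List[str]],
) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Stream the large (left-hand) table and join it inside the mapper.

    Every left record is emitted at least once: with each matching right
    value, or with None when the key has no match. No shuffle or reduce
    phase is needed because the lookup happens locally.
    """
    for key, left_value in left_records:
        matches = small_table.get(key)
        if matches:
            for right_value in matches:
                yield key, left_value, right_value
        else:
            yield key, left_value, None
```

Because every left-hand record is emitted even when the lookup fails, the output has left outer join semantics. The approach only applies while the in-memory table stays small enough for each map task.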
Description
Master's thesis in Computer science