Analysis of large time-series data in OpenTSDB
Master thesis
Date
2013
Abstract
In recent years, the quantity of time-series data generated in a wide variety of domains
has grown consistently. Analyzing time-series datasets at a massive scale is one of the
biggest challenges that data scientists are facing.
This thesis focuses on the implementation of a tool for analyzing large time-series data.
It describes a way to analyze the data stored by OpenTSDB. OpenTSDB is an open
source, distributed, and scalable time-series database. It has become a challenge for
statisticians and data scientists to analyze such massive data sets with the same level
of comprehensive details as is possible for smaller analyses.
Currently available tools for time-series analysis are time- and memory-consuming.
Moreover, no single tool specializes in providing an efficient implementation
of time-series analysis through the MapReduce programming model at massive
scale. For these reasons, we have designed an efficient, distributed computing
framework, R2Time. R2Time integrates the R open source project for statistical computing
and visualization with OpenTSDB [1] and RHIPE [2], building on the MapReduce
framework for the distributed processing of large data sets across a cluster. It creates
a programming environment for data scientists by integrating R and HBase.
This thesis describes the architecture of the R2Time framework. The usefulness of this
framework is verified by a performance analysis based on carefully chosen types
of statistical analysis for time-series data. As the time-series data size and the
complexity of the statistical functions increase, we observe supralinear behavior in the
performance of the R2Time framework. The performance of the framework is also
evaluated under different configuration settings. Configuration settings such as
scan cache and batch size play a vital role in performance on time-series
data.
Description
Master's thesis in Computer Science