Show simple item record

dc.contributor.author: Xiao, Chengwei
dc.date.accessioned: 2015-09-11T11:16:47Z
dc.date.available: 2015-09-11T11:16:47Z
dc.date.issued: 2015-06-15
dc.identifier.uri: http://hdl.handle.net/11250/299600
dc.description: Master's thesis in Computer science (nb_NO)
dc.description.abstract: With the advent of the era of big data, machine learning, which enables computers to learn without being explicitly programmed, has been widely adopted across many technologies and industries. Within supervised learning, some classical types of regression models, including linear regression, nonlinear regression and regression trees, are discussed first, and representative algorithms in each category are illustrated together with their advantages and disadvantages. After that, data pre-processing and resampling techniques that can improve the performance of the trained model, including data transformation, dimensionality reduction and k-fold cross-validation, are explained. In the practical part, three typical models (ordinary linear regression, artificial neural networks and random forest) are implemented with different R packages on the given large datasets. Apart from model training, regression diagnostics are conducted to explain the poor predictive ability of the simplest model, ordinary linear regression. Because of the non-deterministic nature of the artificial neural network and random forest models, several small models are built on a small number of samples from the dataset to find reasonable tuning parameters, and the optimal models are chosen among the trained candidates by their RMSE and R² values. The performance of the resulting models is evaluated quantitatively and visually in detail. The quantitative and visual results of the implementation show that the artificial neural network and random forest algorithms are feasible for large datasets. Compared with the ordinary linear regression model (RMSE = 65556.95, R² = 0.7327), the performance of the artificial neural network (RMSE = 36945.95, R² = 0.9151) and random forest (RMSE = 30705.78, R² = 0.9417) models is greatly improved, but the model training process is more complex and more time-consuming. The right choice between models depends on the characteristics of the dataset and the modelling goal, as well as on the cross-validation technique and the quantitative evaluation of the models. (An illustrative sketch of this workflow is given after the metadata fields below.) (nb_NO)
dc.language.iso: eng (nb_NO)
dc.publisher: University of Stavanger, Norway (nb_NO)
dc.relation.ispartofseries: Masteroppgave/UIS-TN-IDE/2015;
dc.subject: information technology (nb_NO)
dc.subject: Random Forest (nb_NO)
dc.subject: computer technology (nb_NO)
dc.subject: machine learning (nb_NO)
dc.subject: exploratory data analysis (nb_NO)
dc.subject: regression model (nb_NO)
dc.subject: ordinary linear regression (nb_NO)
dc.subject: artificial neural networks (nb_NO)
dc.title: Using machine learning for exploratory data analysis and predictive models on large datasets (nb_NO)
dc.type: Master thesis (nb_NO)
dc.subject.nsi: VDP::Technology: 500::Information and communication technology: 550::Computer technology: 551 (nb_NO)
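
As an illustration of the workflow summarised in the abstract, the following is a minimal R sketch, not the thesis's actual code: it assumes a data frame named dataset with a numeric response column named target, and the package choices (randomForest, nnet), the 80/20 hold-out split and the tuning values are assumptions made for the example.

library(randomForest)   # random forest regression
library(nnet)           # single-hidden-layer neural network

set.seed(42)

# Hold out 20% of the (assumed) data frame `dataset` for evaluation
idx   <- sample(seq_len(nrow(dataset)), size = floor(0.8 * nrow(dataset)))
train <- dataset[idx, ]
test  <- dataset[-idx, ]

# Evaluation metrics used in the thesis: RMSE and R^2
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# 1) Ordinary linear regression (baseline)
olr <- lm(target ~ ., data = train)

# 2) Artificial neural network (illustrative size/decay; inputs should normally be scaled)
ann <- nnet(target ~ ., data = train, size = 10, linout = TRUE,
            decay = 0.01, maxit = 500, MaxNWts = 10000)

# 3) Random forest (illustrative number of trees)
rf <- randomForest(target ~ ., data = train, ntree = 500)

# Compare the three models on the held-out data
models <- list(OLR = olr, ANN = ann, RF = rf)
for (name in names(models)) {
  pred <- as.numeric(predict(models[[name]], newdata = test))
  cat(sprintf("%s: RMSE = %.2f, R2 = %.4f\n",
              name, rmse(test$target, pred), r2(test$target, pred)))
}

In the thesis itself, a resampling scheme such as k-fold cross-validation, as described in the abstract, would take the place of the single hold-out split used in this sketch when choosing tuning parameters.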


Associated file(s)


This item appears in the following collection(s)
