Show simple item record

dc.contributor.author: Xiao, Chengwei
dc.date.accessioned: 2015-09-11T11:16:47Z
dc.date.available: 2015-09-11T11:16:47Z
dc.date.issued: 2015-06-15
dc.identifier.uri: http://hdl.handle.net/11250/299600
dc.description: Master's thesis in Computer science (nb_NO)
dc.description.abstract: With the advent of the era of big data, machine learning, which enables computers to learn without being explicitly programmed, has been widely adopted across many technologies and industries. Within supervised learning, some classical types of regression models, including linear regression, nonlinear regression and regression trees, are discussed first, and representative algorithms in each category are illustrated together with their advantages and disadvantages. After that, data pre-processing and resampling techniques that can improve the performance of the trained model, including data transformation, dimensionality reduction and k-fold cross-validation, are explained. In the practical part, three typical models (ordinary linear regression, artificial neural networks and random forest) are implemented with different R packages on the given large datasets. Apart from model training, regression diagnostics are conducted to explain the poor predictive ability of the simplest model, ordinary linear regression. Because of the non-deterministic nature of the artificial neural network and random forest models, several small models are built on a small number of samples from the dataset to find reasonable tuning parameters, and the optimal models are chosen among the trained candidates by their RMSE and R² values. The performance of the resulting models is evaluated quantitatively and visually in detail. The quantitative and visual results of the implementation show that the artificial neural network and random forest algorithms are feasible for large datasets. Compared with the ordinary linear regression model (RMSE = 65556.95, R² = 0.7327), the performance of the artificial neural network (RMSE = 36945.95, R² = 0.9151) and random forest (RMSE = 30705.78, R² = 0.9417) models is greatly improved, but the model training process is more complex and more time-consuming. The right choice between models depends on the characteristics of the dataset and the modelling goal, as well as on the cross-validation technique and the quantitative evaluation of the models. (An illustrative sketch of this workflow is given after the metadata fields below.) (nb_NO)
dc.language.iso: eng (nb_NO)
dc.publisher: University of Stavanger, Norway (nb_NO)
dc.relation.ispartofseries: Masteroppgave/UIS-TN-IDE/2015;
dc.subject: information technology (nb_NO)
dc.subject: Random Forest (nb_NO)
dc.subject: computer technology (nb_NO)
dc.subject: machine learning (nb_NO)
dc.subject: exploratory data analysis (nb_NO)
dc.subject: regression model (nb_NO)
dc.subject: ordinary linear regression (nb_NO)
dc.subject: artificial neural networks (nb_NO)
dc.title: Using machine learning for exploratory data analysis and predictive models on large datasets (nb_NO)
dc.type: Master thesis (nb_NO)
dc.subject.nsi: VDP::Technology: 500::Information and communication technology: 550::Computer technology: 551 (nb_NO)
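
As an illustration of the workflow summarised in the abstract, the following is a minimal R sketch, not the thesis's actual code: it assumes a data frame named dataset with a numeric response column named target, and the package choices (randomForest, nnet), the 80/20 hold-out split and the tuning values are assumptions made for the example.

library(randomForest)   # random forest regression
library(nnet)           # single-hidden-layer neural network

set.seed(42)

# Hold out 20% of the (assumed) data frame `dataset` for evaluation
idx   <- sample(seq_len(nrow(dataset)), size = floor(0.8 * nrow(dataset)))
train <- dataset[idx, ]
test  <- dataset[-idx, ]

# Evaluation metrics used in the thesis: RMSE and R^2
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# 1) Ordinary linear regression (baseline)
olr <- lm(target ~ ., data = train)

# 2) Artificial neural network (illustrative size/decay; inputs should normally be scaled)
ann <- nnet(target ~ ., data = train, size = 10, linout = TRUE,
            decay = 0.01, maxit = 500, MaxNWts = 10000)

# 3) Random forest (illustrative number of trees)
rf <- randomForest(target ~ ., data = train, ntree = 500)

# Compare the three models on the held-out data
models <- list(OLR = olr, ANN = ann, RF = rf)
for (name in names(models)) {
  pred <- as.numeric(predict(models[[name]], newdata = test))
  cat(sprintf("%s: RMSE = %.2f, R2 = %.4f\n",
              name, rmse(test$target, pred), r2(test$target, pred)))
}

In the thesis itself, a resampling scheme such as k-fold cross-validation, as described in the abstract, would take the place of the single hold-out split used in this sketch when choosing tuning parameters.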


Associated file(s)


This item appears in the following collection(s)
