IMPACT OF DATA PRE-PROCESSING TECHNIQUES ON MACHINE LEARNING MODELS
Master thesis
Permanent link
https://hdl.handle.net/11250/3028476
Publication date
2022
Collections
- Studentoppgaver (TN-IER)
Abstract
The Volve dataset, which contains time series values from the sensors used at the Volve drilling site, has many flaws that make it hard for machine learning models to learn from it and to provide useful insights and predictions. Three flaws are highlighted: missing data, different sampling frequencies, and too many attributes (high-dimensional data). To address these issues in the time series data, a data pre-processing pipeline is proposed. It first removes noise with a rolling mean. It then applies gap analysis to remove the columns whose gaps cannot be filled by data imputation methods. The remaining gaps are filled with the KNN imputer, which imputes the missing values in the data. Next, the data are resampled to make the sampling rate consistent, because the time series prediction model requires a constant sampling rate. For hyper-parameter tuning of the resampling method, AIC and BIC values were computed over a grid of hyper-parameters. After resampling, the top parameters were selected on the basis of Pearson correlation, after which AIC and BIC were used to select the 3 most relevant parameters. These 3 parameters were then used to train three models: RNN + MLP, LSTM + MLP, and LSTM + RNN + MLP. On the basis of mean absolute error (MAE), RNN + MLP was selected as the best model.
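The first four pipeline steps described above (rolling mean, gap analysis, KNN imputation, resampling) can be illustrated with a minimal Python sketch using pandas and scikit-learn. This is not the thesis code. The function names `longest_gap` and `preprocess`, the parameters `window`, `max_gap`, `n_neighbors`, and `rule`, and their default values are all hypothetical, since the abstract does not state the actual settings. Two further assumptions are made: smoothed values are masked so that original gaps stay visible to the gap analysis, and bins left empty by resampling are filled by time-based interpolation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def longest_gap(series: pd.Series) -> int:
    """Return the length of the longest run of consecutive missing values."""
    is_nan = series.isna()
    # A new run starts wherever the missing/present flag changes.
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Summing the flags per run gives the run length for gaps and 0 otherwise.
    return int(is_nan.groupby(run_id).sum().max())


def preprocess(df: pd.DataFrame, window: int = 3, max_gap: int = 5,
               n_neighbors: int = 2, rule: str = "1min") -> pd.DataFrame:
    """Hypothetical pipeline; df must have a DatetimeIndex and numeric columns."""
    # 1. Noise removal: centered rolling mean over `window` samples.
    #    Assumption: keep originally missing samples missing, so that
    #    smoothing does not hide gaps from the next step.
    smoothed = (
        df.rolling(window=window, min_periods=1, center=True)
        .mean()
        .where(df.notna())
    )

    # 2. Gap analysis: drop columns whose longest gap exceeds `max_gap`.
    keep = [c for c in smoothed.columns if longest_gap(smoothed[c]) <= max_gap]
    kept = smoothed[keep]

    # 3. KNN imputation of the remaining missing values.
    imputer = KNNImputer(n_neighbors=n_neighbors)
    imputed = pd.DataFrame(
        imputer.fit_transform(kept), index=kept.index, columns=kept.columns
    )

    # 4. Resampling to a constant rate. Empty bins are interpolated in time
    #    (an assumption; the abstract does not say how they are handled).
    return imputed.resample(rule).mean().interpolate(method="time")
```

Calling `preprocess(df)` on a sensor table with irregular timestamps returns a table on a regular grid with no missing values, provided every retained column has at least one observed value. Selecting the most correlated columns afterwards could be done with `DataFrame.corr(method="pearson")`, which is not shown here.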