IMPACT OF DATA PRE-PROCESSING TECHNIQUES ON MACHINE LEARNING MODELS

2022

The Volve dataset, which contains time series readings from the sensors used at the Volve
drilling site, has many flaws that make it hard for machine learning models to learn from it
and to provide useful insights and future predictions. Three flaws are highlighted: missing
data, inconsistent sampling frequencies, and too many attributes (high-dimensional data). To
address these issues in the time series data, a data preprocessing pipeline is proposed. The
pipeline first removes noise with a rolling mean, then applies gap analysis to remove the
columns whose gaps cannot be filled by data imputation methods. The remaining gaps are filled
with the KNN imputer, which imputes the missing values in the data. Next, the data are
resampled to make the sampling rate consistent, because the time series prediction model
requires a constant sampling rate. To tune the hyper-parameters of the resampling method, AIC
and BIC values were computed over a grid of hyper-parameters. After resampling, the top
parameters were selected on the basis of Pearson correlation, after which AIC and BIC were
used to select the 3 most relevant parameters. These 3 parameters were then used to train
three models: RNN + MLP, LSTM + MLP, and LSTM + RNN + MLP. On the basis of mean absolute
error (MAE), RNN + MLP was selected as the best model.
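
The preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: it assumes the sensor data sit in a pandas DataFrame with a DatetimeIndex, and the function names (longest_gap, preprocess, top_by_pearson) and every default value (window, max_gap, n_neighbors, rule, k) are hypothetical choices made only for illustration. The AIC/BIC grid search and the model training are left out.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def longest_gap(series: pd.Series) -> int:
    """Return the length of the longest run of consecutive NaNs."""
    is_nan = series.isna()
    # Every observed value starts a new group, so summing the NaN flags
    # per group yields the length of each run of missing values.
    runs = is_nan.groupby((~is_nan).cumsum()).sum()
    return int(runs.max()) if len(runs) else 0


def preprocess(df: pd.DataFrame, window: int = 5, max_gap: int = 10,
               n_neighbors: int = 5, rule: str = "1min") -> pd.DataFrame:
    """Sketch of the pipeline; all default values are assumptions."""
    # 1. Noise removal with a rolling mean. The mask keeps the original
    #    gaps so that smoothing does not silently fill them.
    smoothed = df.rolling(window=window, min_periods=1).mean().where(df.notna())

    # 2. Gap analysis: drop columns whose longest gap is too long to impute.
    keep = [c for c in smoothed.columns if longest_gap(smoothed[c]) <= max_gap]
    kept = smoothed[keep]

    # 3. Fill the remaining gaps with the KNN imputer.
    imputer = KNNImputer(n_neighbors=n_neighbors)
    imputed = pd.DataFrame(imputer.fit_transform(kept),
                           index=kept.index, columns=kept.columns)

    # 4. Resample to a constant rate; bins without readings are interpolated.
    return imputed.resample(rule).mean().interpolate(method="time")


def top_by_pearson(df: pd.DataFrame, target: str, k: int = 3) -> list:
    """Rank columns by absolute Pearson correlation with a target column."""
    corr = df.corr(method="pearson")[target].drop(target).abs()
    return corr.sort_values(ascending=False).head(k).index.tolist()
```

In this sketch the gap threshold is counted in samples rather than in seconds, which is only meaningful when the raw sampling is roughly regular; a time-based threshold may suit irregular sensor data better.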
