IMPACT OF DATA PRE-PROCESSING TECHNIQUES ON MACHINE LEARNING MODELS

Abstract

The Volve dataset contains time series readings from the sensors used at the Volve drilling site. It has several flaws that make it hard for machine learning models to learn from the data and to provide useful insights and predictions. Three flaws are highlighted: missing data, inconsistent sampling frequencies, and too many attributes (high-dimensional data). To address these issues, a data preprocessing pipeline is proposed. The pipeline first removes noise with a rolling mean. It then applies gap analysis to remove the columns whose gaps cannot be filled by data imputation methods. The remaining gaps are filled by a KNN imputer, which imputes the missing values. Next, the data is resampled to make the sampling rate consistent, since the time series prediction models require a constant sampling rate. To tune the hyper-parameters of the resampling method, AIC and BIC values are computed over a grid of hyper-parameters. After resampling, the top parameters are selected on the basis of Pearson correlation, and AIC and BIC are then used to select the 3 most relevant parameters. These 3 parameters are used to train three models: RNN + MLP, LSTM + MLP, and LSTM + RNN + MLP. The best model, selected on the basis of mean absolute error (MAE), is RNN + MLP.
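
As a rough illustration of three of the stages described above (rolling-mean smoothing, gap analysis, and Pearson-based ranking), the following minimal sketch uses only the Python standard library. The function names, the window size, the `max_gap` threshold, and the toy data are all assumptions made for illustration and are not taken from the study; the KNN imputation, resampling, and AIC/BIC steps are omitted.

```python
import math
from statistics import mean


def rolling_mean(values, window=3):
    """Trailing rolling mean; missing values (None) are skipped and stay missing."""
    out = []
    for i, v in enumerate(values):
        if v is None:
            out.append(None)  # do not fill gaps here, so gap analysis still sees them
            continue
        chunk = [x for x in values[max(0, i - window + 1): i + 1] if x is not None]
        out.append(mean(chunk))
    return out


def longest_gap(values):
    """Length of the longest run of consecutive missing values."""
    longest = current = 0
    for v in values:
        current = current + 1 if v is None else 0
        longest = max(longest, current)
    return longest


def drop_gappy_columns(columns, max_gap):
    """Keep only columns whose longest gap is at most max_gap (assumed threshold)."""
    return {name: vals for name, vals in columns.items()
            if longest_gap(vals) <= max_gap}


def pearson(x, y):
    """Pearson correlation over positions where both series have a value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no linear information
    return cov / (sx * sy)


def rank_by_correlation(columns, target, top_k):
    """Names of the top_k columns by absolute Pearson correlation with target."""
    scores = {name: abs(pearson(vals, columns[target]))
              for name, vals in columns.items() if name != target}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (hypothetical): one target and four candidate sensor columns.
columns = {
    "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "a": [2.0, 4.0, None, 8.0, 10.0, 12.0],
    "b": [1.0, None, None, None, 5.0, 6.0],
    "c": [6.0, 5.0, 4.0, 4.0, 2.0, 1.0],
    "d": [1.0, 3.0, 2.0, 5.0, 4.0, 6.0],
}

smoothed = {name: rolling_mean(vals, window=3) for name, vals in columns.items()}
kept = drop_gappy_columns(smoothed, max_gap=2)
selected = rank_by_correlation(kept, "target", top_k=2)
```

Because the smoothing step leaves missing positions untouched, the gap analysis still sees the original gaps; in this toy example column `b` is dropped for its run of three missing values before any correlation is computed.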

    Similar works