The Volve dataset, which contains time series from the sensors used at the Volve drilling site,
has several flaws that make it hard for machine learning models to learn from it and to produce
useful insights and predictions. Three flaws are highlighted: missing data, inconsistent
sampling rates, and too many attributes (high-dimensional data). To address these issues in the
time series data, a data preprocessing pipeline
is proposed. It first reduces noise with a rolling mean, then applies gap analysis to remove the
columns whose gaps cannot be filled by data imputation methods. The remaining missing values
are imputed with a KNN imputer. The data are then resampled to a consistent sampling rate,
because the time series prediction models require a constant sampling rate. For hyper-parameter
tuning of the
resampling method, AIC and BIC values are computed over a grid of hyper-parameters. After
resampling, the top parameters are selected on the basis of Pearson correlation, after which AIC
and BIC are used to select the 3 most relevant parameters. These 3 parameters are then used to
train three models: RNN + MLP, LSTM + MLP, and LSTM + RNN + MLP. On the basis of mean absolute
error (MAE), RNN + MLP is selected as the best model.
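
As a rough illustration of three of the steps named above (rolling-mean smoothing, gap analysis, and Pearson-based ranking), the following minimal sketch uses only the Python standard library. It is not the pipeline's actual implementation: the function names, the window size, the gap threshold, and the use of None to mark missing samples are all assumptions made for illustration. KNN imputation, resampling, AIC/BIC selection, and the models themselves are not shown.

```python
import math


def rolling_mean(values, window):
    """Trailing rolling mean; missing samples (None) stay missing."""
    out = []
    for i, v in enumerate(values):
        if v is None:
            out.append(None)  # keep the gap so gap analysis still sees it
            continue
        chunk = [x for x in values[max(0, i - window + 1): i + 1] if x is not None]
        out.append(sum(chunk) / len(chunk))
    return out


def longest_gap(values):
    """Length of the longest run of consecutive missing samples."""
    longest = run = 0
    for v in values:
        run = run + 1 if v is None else 0
        longest = max(longest, run)
    return longest


def drop_gappy_columns(columns, max_gap):
    """Keep only columns whose longest gap is at most max_gap (assumed threshold)."""
    return {name: vals for name, vals in columns.items()
            if longest_gap(vals) <= max_gap}


def pearson(x, y):
    """Pearson correlation over the positions where both series are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a, _ in pairs))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for _, b in pairs))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def rank_by_correlation(columns, target, top_k):
    """Names of the top_k columns by absolute Pearson correlation with target."""
    scores = {name: abs(pearson(vals, columns[target]))
              for name, vals in columns.items() if name != target}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A typical use would smooth each column with rolling_mean, pass the result through drop_gappy_columns, impute and resample the surviving columns, and only then call rank_by_correlation against the prediction target.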