55 research outputs found

    Performance analysis of various machine learning algorithms for CO2 leak prediction and characterization in geo-sequestration injection wells

    The effective detection and prevention of CO2 leakage in active injection wells are paramount for safe carbon capture and storage (CCS) initiatives. This study assesses five fundamental machine learning algorithms, namely, Support Vector Regression (SVR), K-Nearest Neighbor Regression (KNNR), Decision Tree Regression (DTR), Random Forest Regression (RFR), and Artificial Neural Network (ANN), for use in developing a robust data-driven model to predict potential CO2 leakage incidents in injection wells. Leveraging wellhead and bottom-hole pressure and temperature data, the models aim to simultaneously predict the location and size of leaks. A representative dataset simulating various leak scenarios in a saline aquifer reservoir was utilized. The findings reveal crucial insights into the relationships between the variables considered and leakage characteristics. With its positive linear correlation with depth of leak, wellhead pressure could be a pivotal indicator of leak location, while the negative linear relationship with well bottom-hole pressure demonstrated the strongest association with leak size. Among the predictive models examined, the highest prediction accuracy was achieved by the KNNR model for both leak localization and sizing. This model displayed exceptional sensitivity to leak size, and was able to identify leak magnitudes representing as little as 0.0158% of the total main flow with relatively high levels of accuracy. Nonetheless, the study underscored that accurate leak sizing posed a greater challenge for the models compared to leak localization. Overall, the findings obtained can provide valuable insights into the development of efficient data-driven well-bore leak detection systems.
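
The abstract names KNNR as the most accurate model for predicting leak location and size together. As a rough illustration of how a k-nearest-neighbour regressor can return several targets at once, here is a minimal sketch. The feature and target values are invented for the example and are not from the study, and the function name `knn_regress` is hypothetical.

```python
import math

def knn_regress(train_X, train_Y, query, k=3):
    """Predict a vector of targets as the mean over the k nearest training points."""
    # Rank training points by Euclidean distance to the query (ties broken by index).
    ranked = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    nearest = [train_Y[i] for _, i in ranked[:k]]
    n_targets = len(nearest[0])
    return [sum(y[j] for y in nearest) / len(nearest) for j in range(n_targets)]

# Made-up, already-scaled features: (wellhead pressure, bottom-hole pressure).
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
# Made-up targets: (leak depth, leak size).
train_Y = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (100.0, 1000.0)]

prediction = knn_regress(train_X, train_Y, query=(0.1, 0.1), k=3)
print(prediction)
```

Because the method relies on distances, inputs measured in different units (pressure and temperature, for instance) would normally be scaled to comparable ranges first.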

    Extracting relevant predictive variables for COVID-19 severity prognosis: An exhaustive comparison of feature selection techniques

    With the COVID-19 pandemic having caused unprecedented numbers of infections and deaths, large research efforts have been undertaken to increase our understanding of the disease and the factors which determine diverse clinical evolutions. Here we focused on a fully data-driven exploration regarding which factors (clinical or otherwise) were most informative for SARS-CoV-2 pneumonia severity prediction via machine learning (ML). In particular, feature selection techniques (FS), designed to reduce the dimensionality of data, allowed us to characterize which of our variables were the most useful for ML prognosis. We conducted a multi-centre clinical study, enrolling n = 1548 patients hospitalized due to SARS-CoV-2 pneumonia, where 792, 238, and 598 patients experienced low, medium and high-severity evolutions, respectively. Up to 106 patient-specific clinical variables were collected at admission, although 14 of them had to be discarded for containing ⩾60% missing values. Alongside 7 socioeconomic attributes and 32 exposures to air pollution (chronic and acute), these became d = 148 features after variable encoding. We addressed this ordinal classification problem both as an ML classification task and as a regression task. Two imputation techniques for missing data were explored, along with a total of 166 unique FS algorithm configurations: 46 filters, 100 wrappers and 20 embedded methods. Of these, 21 setups achieved satisfactory bootstrap stability (⩾0.70) with reasonable computation times: 16 filters, 2 wrappers, and 3 embedded methods. The subsets of features selected by each technique showed modest Jaccard similarities across them. However, they consistently pointed out the importance of certain explanatory variables.
Namely: patient’s C-reactive protein (CRP), pneumonia severity index (PSI), respiratory rate (RR) and oxygen levels –saturation SpO2, quotients SpO2/RR and arterial SatO2/FiO2–, the neutrophil-to-lymphocyte ratio (NLR) –to a certain extent, also neutrophil and lymphocyte counts separately–, lactate dehydrogenase (LDH), and procalcitonin (PCT) levels in blood. A remarkable agreement has been found a posteriori between our strategy and independent clinical research works investigating risk factors for COVID-19 severity. Hence, these findings stress the suitability of this type of fully data-driven approach for knowledge extraction, as a complement to clinical perspectives. This research is supported by the Spanish State Research Agency AEI under the project S3M1P4R PID2020-115882RB-I00, as well as by the Basque Government EJ-GV under the grant ‘Artificial Intelligence in BCAM’ 2019/00432, under the strategy ‘Mathematical Modelling Applied to Health’, and under the BERC 2018–2021 and 2022–2025 programmes, and also by the Spanish Ministry of Science and Innovation: BCAM Severo Ochoa accreditation CEX2021-001142-S / MICIN / AEI / 10.13039/501100011033. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
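
The abstract compares the feature subsets chosen by different FS techniques through their Jaccard similarity. A minimal sketch of that comparison follows. The three subsets are hypothetical and only reuse variable names mentioned above, and averaging over all pairs is an assumption about how one might summarise the result, not a description of the study's procedure.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty subsets count as identical
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(subsets):
    """Average Jaccard similarity over every unordered pair of subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical selections by three FS techniques (illustrative only).
selected = {
    "filter": {"CRP", "PSI", "RR", "SpO2"},
    "wrapper": {"CRP", "PSI", "NLR", "LDH"},
    "embedded": {"CRP", "SpO2", "LDH", "PCT"},
}

similarity = mean_pairwise_jaccard(list(selected.values()))
shared = set.intersection(*selected.values())
print(f"mean pairwise Jaccard: {similarity:.3f}, selected by all: {sorted(shared)}")
```

In this toy case the pairwise overlap is modest even though one variable is chosen by every technique, which mirrors the pattern the abstract describes.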

    Guaranteed Coverage Prediction Intervals with Gaussian Process Regression

    Gaussian Process Regression (GPR) is a popular regression method which, unlike most Machine Learning techniques, provides estimates of uncertainty for its predictions. These uncertainty estimates, however, are based on the assumption that the model is well-specified, an assumption that is violated in most practical applications, since the required knowledge is rarely available. As a result, the produced uncertainty estimates can become very misleading; for example, the prediction intervals (PIs) produced for the 95% confidence level may cover much less than 95% of the true labels. To address this issue, this paper introduces an extension of GPR based on a Machine Learning framework called Conformal Prediction (CP). This extension guarantees the production of PIs with the required coverage even when the model is completely misspecified. The proposed approach combines the advantages of GPR with the valid coverage guarantee of CP, while the experimental results demonstrate its superiority over existing methods. Comment: 12 pages. This work has been submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
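
To make the coverage idea concrete, here is a minimal sketch of split (inductive) conformal prediction, one common CP variant. The abstract does not say which variant or nonconformity score the paper uses, so this should not be read as the authors' method. The `predict` function is a deliberately misspecified stand-in for a fitted GPR mean, and the data generator is hypothetical.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]

def predict(x):
    # Stand-in for a fitted model's point prediction (e.g. a GPR posterior mean).
    # Deliberately misspecified: it ignores x entirely.
    return 0.0

def sample(n, rng):
    """Draw n (x, y) pairs from a hypothetical noisy sine curve."""
    data = []
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0)
        data.append((x, math.sin(2.0 * x) + rng.gauss(0.0, 0.3)))
    return data

rng = random.Random(0)
calibration = sample(500, rng)  # held out from model fitting
test_set = sample(2000, rng)

alpha = 0.05  # target miscoverage, i.e. 95% intervals
scores = [abs(y - predict(x)) for x, y in calibration]  # absolute residuals
q = conformal_quantile(scores, alpha)

def interval(x):
    """Prediction interval: point prediction plus or minus the calibrated quantile."""
    centre = predict(x)
    return centre - q, centre + q

coverage = sum(interval(x)[0] <= y <= interval(x)[1] for x, y in test_set) / len(test_set)
print(f"half-width {q:.3f}, empirical coverage {coverage:.3f}")
```

The guarantee behind this construction is marginal and rests on the calibration and test points being exchangeable; it does not require the model to be correct. The absolute-residual score used here gives intervals of constant width, whereas a score scaled by the model's predictive standard deviation is one common way to obtain adaptive widths.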

    Instantaneous failure mode remaining useful life estimation using non-uniformly sampled measurements from a reciprocating compressor valve failure

    One of the major targets in industry is minimisation of downtime and cost, and maximisation of availability and safety, with maintenance considered a key aspect in achieving this objective. The concept of Condition Based Maintenance and Prognostics and Health Management (CBM/PHM), which is founded on the principles of diagnostics and prognostics, is a step in this direction as it offers a proactive means for scheduling maintenance. Reciprocating compressors are vital components in the oil and gas industry, though their maintenance cost is known to be relatively high. Compressor valves are the weakest part, being the most frequently failing component and accounting for almost half of the maintenance cost. To date, there has been limited information in the open literature on estimating the Remaining Useful Life (RUL) of reciprocating compressors. This paper compares the prognostic performance of several methods (multiple linear regression, polynomial regression, Self-Organising Map (SOM), K-Nearest Neighbours Regression (KNNR)), in relation to their accuracy and precision, using actual valve failure data captured from an operating industrial compressor. The SOM technique is employed for the first time as a standalone tool for RUL estimation. Furthermore, two variations on estimating RUL, based on SOM and KNNR respectively, are proposed. Finally, an ensemble method combining the output of all the aforementioned algorithms is proposed and tested. Principal components analysis and statistical process control were implemented to create T^2 and Q metrics, which were proposed as health indicators reflecting degradation processes and were employed for direct RUL estimation for the first time. It was shown that even when RUL is relatively short due to the instantaneous nature of the failure mode, it is feasible to obtain good RUL estimates using the proposed techniques.
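
The abstract does not define the T^2 and Q metrics. For orientation, the definitions commonly used in PCA-based statistical process control are given below; the symbols are assumptions introduced here, and the paper's exact construction may differ. Let x be a mean-centred (and typically scaled) observation of d variables, P the d-by-k matrix of retained principal-component loadings, and Λ the diagonal matrix of the corresponding eigenvalues λ_1, ..., λ_k.

```latex
\[
  t = P^{\top} x, \qquad
  T^2 = \sum_{i=1}^{k} \frac{t_i^2}{\lambda_i} = x^{\top} P \Lambda^{-1} P^{\top} x, \qquad
  Q = \lVert x - P P^{\top} x \rVert^2 .
\]
```

T^2 measures how far an observation lies from the centre within the retained principal subspace, while Q (also called the squared prediction error) measures how much of the observation the retained components fail to reconstruct.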

    Tesla Stock Close Price Prediction Using KNNR, DTR, SVR, and RFR

    The growth of every nation's economy depends heavily on the stock market. Because of their strong predictive power, K-Nearest Neighbor Regression (KNNR), Random Forest Regression (RFR), Decision Tree Regression (DTR), and Support Vector Regression (SVR) are commonly used. Among previous studies, the chosen state of the art is the Bi-LSTM model of Shah et al. (2021). However, that model shows a high Mean Squared Error (MSE) of 5% when predicting Tesla stock. We therefore propose SVR, KNNR, DTR, and RFR models in the hope of reducing the error. For these models, however, suitable parameters are difficult to find. For that reason, we adopted parameter values commonly used in other studies: SVR with an RBF kernel, C = [1; 10; 100], gamma = [0.1; 0.2], and epsilon = 0.1; KNNR with K = [1, 3, 5, 7, 9]; DTR with criterion = ['squared_error'; 'friedman_mse']; and RFR with 0 < n_estimators < 100, to enhance the predictive capability of each proposed model. To verify this, the proposed models are compared with one another, and the best of them is then compared with the state-of-the-art Bi-LSTM. The dataset is the Tesla stock historical data from Yahoo Finance. The results show that RFR with n_estimators = 87 is superior, with 0.005452 RMSE, 0.000029 MSE, and 0.999296 R2, whereas Bi-LSTM achieves about 0.11773 RMSE, 0.01386 MSE, and 0.791263 R2. Based on this study, it can be concluded that RFR with n_estimators = 87 has better predictive performance than the other models and the state of the art of previous studies. Doi: 10.28991/HEF-2022-03-04-01
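
The three reported scores are related by standard definitions, sketched below. The toy series is invented for illustration, and the final lines only check that the RMSE and MSE figures quoted in the abstract agree with each other, since RMSE is the square root of MSE.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values, not taken from the study.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(mse(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))

# Squaring the RMSE values quoted in the abstract should recover the quoted MSE values.
rfr_mse_from_rmse = 0.005452 ** 2
bilstm_mse_from_rmse = 0.11773 ** 2
print(rfr_mse_from_rmse, bilstm_mse_from_rmse)
```

Squaring the quoted RMSE values gives roughly 0.0000297 and 0.01386, consistent with the MSE figures in the abstract up to rounding.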

    Reciprocating compressor prognostics of an instantaneous failure mode utilising temperature only measurements

    Reciprocating compressors are critical components in the oil and gas sector, though their maintenance cost is known to be relatively high. Compressor valves are the weakest component, being the most frequent failure mode and accounting for almost half the maintenance cost. One of the major targets in industry is minimisation of downtime and cost, while maximising availability and safety of a machine, with maintenance considered a key aspect in achieving this objective. The concept of Condition Based Maintenance and Prognostics and Health Management (CBM/PHM), which is founded on the principles of diagnostics and prognostics, is a step in this direction as it offers a proactive means for scheduling maintenance. Although diagnostics is an established area for reciprocating compressors, to date there is limited information in the open literature regarding prognostics, especially given that failures can be instantaneous in nature. This work presents an analysis of the prognostic performance of several methods (multiple linear regression, polynomial regression, K-Nearest Neighbours Regression (KNNR)), in relation to their accuracy and variability, using actual temperature-only data from a valve failure, an instantaneous failure mode, on an operating industrial compressor. Furthermore, a variation for Remaining Useful Life (RUL) estimation based on KNNR, along with an ensemble technique merging the results of all the aforementioned methods, is proposed. Prior to analysis, principal components analysis and statistical process control were employed to create T^2 and Q metrics, which were proposed as health indicators reflecting the degradation process of the valve failure mode and are used for direct RUL estimation for the first time. Results demonstrated that even when RUL is relatively short due to the instantaneous nature of the failure mode, it is feasible to obtain good RUL estimates using the proposed techniques.