
    An Ensemble Model of QSAR Tools for Regulatory Risk Assessment

    Quantitative structure-activity relationships (QSARs) are theoretical models that relate a quantitative measure of chemical structure to a physical property or a biological effect. QSAR predictions can be used in chemical risk assessment to protect human and environmental health, which makes them interesting to regulators, especially in the absence of experimental data. For compatibility with regulatory use, QSAR models should be transparent, reproducible and optimized to minimize the number of false negatives. In silico QSAR tools are gaining wide acceptance as a faster alternative to otherwise time-consuming clinical and animal testing methods. However, different QSAR tools often make conflicting predictions for a given chemical and may also vary in their predictive performance across different chemical datasets. In a regulatory context, conflicting predictions raise interpretation, validation and adequacy concerns. To address these concerns, ensemble learning techniques from machine learning can be used to integrate predictions from multiple tools. By leveraging various underlying QSAR algorithms and training datasets, the resulting consensus prediction should yield better overall predictive ability. We present a novel ensemble QSAR model using Bayesian classification. The model exposes a cut-off parameter that lets the user select the desired trade-off between model sensitivity and specificity. The predictive performance of the ensemble model is compared with that of four in silico tools (Toxtree, Lazar, OECD Toolbox, and Danish QSAR) in predicting carcinogenicity for a dataset of air toxins (332 chemicals) and a subset of the Gold Carcinogenic Potency Database (480 chemicals). 
Leave-one-out cross-validation results show that the ensemble model achieves the best trade-off between sensitivity and specificity (accuracy: 83.8 % and 80.4 %, and balanced accuracy: 80.6 % and 80.8 %) and the highest inter-rater agreement [kappa (κ): 0.63 and 0.62] for both datasets. The ROC curves demonstrate the utility of the cut-off feature in the predictive ability of the ensemble model. This feature gives regulators additional control in grading a chemical based on the severity of the toxic endpoint under study.
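The abstract above does not give implementation details, so the following is only a minimal sketch of one way a Bayesian ensemble with a cut-off could work. It assumes each tool returns a binary call (1 = carcinogen, 0 = non-carcinogen), combines the calls with a naive Bayes model using Laplace smoothing, and flags a chemical when the posterior probability reaches the cut-off. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from math import exp, log


def fit_naive_bayes(tool_calls, labels, alpha=1.0):
    """Estimate class priors and per-tool P(call = 1 | class) with Laplace smoothing."""
    n_tools = len(tool_calls[0])
    model = {"prior": {}, "p_pos": {}}
    for c in (0, 1):
        rows = [x for x, y in zip(tool_calls, labels) if y == c]
        model["prior"][c] = (len(rows) + alpha) / (len(labels) + 2 * alpha)
        model["p_pos"][c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_tools)
        ]
    return model


def posterior_positive(model, calls):
    """Posterior probability of the positive class given one chemical's tool calls."""
    log_score = {}
    for c in (0, 1):
        s = log(model["prior"][c])
        for j, v in enumerate(calls):
            p = model["p_pos"][c][j]
            s += log(p if v == 1 else 1.0 - p)
        log_score[c] = s
    m = max(log_score.values())  # subtract the max for numerical stability
    e0, e1 = exp(log_score[0] - m), exp(log_score[1] - m)
    return e1 / (e0 + e1)


def classify(model, calls, cutoff=0.5):
    """Flag the chemical as positive when the posterior reaches the cut-off."""
    return 1 if posterior_positive(model, calls) >= cutoff else 0


def sensitivity_specificity(model, tool_calls, labels, cutoff):
    """Sensitivity and specificity of the ensemble at a given cut-off."""
    tp = fn = tn = fp = 0
    for calls, y in zip(tool_calls, labels):
        pred = classify(model, calls, cutoff)
        if y == 1:
            tp += pred
            fn += 1 - pred
        else:
            fp += pred
            tn += 1 - pred
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: four tools, six chemicals (not from the paper).
toy_calls = [(1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1),
             (0, 0, 0, 1), (0, 1, 0, 0), (0, 0, 0, 0)]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_model = fit_naive_bayes(toy_calls, toy_labels)

for cutoff in (0.2, 0.5, 0.8):
    print(cutoff, sensitivity_specificity(toy_model, toy_calls, toy_labels, cutoff))
```

Lowering the cut-off can only turn negative calls into positive ones, so sensitivity never decreases and specificity never increases as the cut-off drops. That is the lever a regulator would use to reduce false negatives. A real evaluation would score held-out chemicals (for example with leave-one-out cross-validation) rather than the training set used here.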

    Assessing the potential of the dart model to discrete return lidar simulation—application to fuel type mapping

    Fuel type is one of the key factors for analyzing the potential of fire ignition and propagation in agricultural and forest environments. The increase of three-dimensional datasets provided by active sensors, such as LiDAR (Light Detection and Ranging), has improved the classification of fuel types through empirical modelling. Empirical methods are site and sensor specific, while Radiative Transfer Model (RTM) approaches generalize more broadly. The aim of this work is to analyze the suitability of the Discrete Anisotropic Radiative Transfer (DART) model to replicate low-density small-footprint Airborne Laser Scanning (ALS) measurements and subsequent fuel type classification. Field data measured in 104 plots are used as ground truth to simulate the LiDAR response based on the sensor and flight characteristics of low-density ALS data captured by the Spanish National Plan for Aerial Orthophotography (PNOA) on two dates (2011 and 2016). The accuracy assessment of the DART simulations is performed using Spearman rank correlation coefficients between the simulated metrics and the ALS-PNOA ones. The results show that 32% of the computed metrics exceeded a correlation value of 0.80 between simulated and ALS-PNOA metrics in 2011 and 28% in 2016. The highest correlations were related to high height percentiles and to canopy variability metrics such as the standard deviation and the Rumple diversity index, reaching correlation values over 0.94. Two metric selection approaches and the Support Vector Machine classification method with variants were compared to classify fuel types. The best-fitted classification model, trained with the DART simulated sample and validated with ALS-PNOA data, was obtained using the Support Vector Machine method with a radial kernel. The overall accuracy of the classification after validation was 88% and 91% for 2011 and 2016, respectively. 
DART demonstrates its value for simulating generalizable 3D data for fuel type classification, providing relevant information for forest managers in fire prevention and extinction.
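The accuracy assessment described above rests on Spearman rank correlation between simulated and observed metrics. As an illustrative sketch only (the helper names and the sample values are hypothetical, and this is not the authors' code), the coefficient can be computed as the Pearson correlation of average ranks, and the share of metrics above a threshold such as 0.80 follows directly:

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def share_above(correlations, threshold=0.80):
    """Fraction of metrics whose correlation exceeds the threshold."""
    return sum(1 for r in correlations if r > threshold) / len(correlations)


# Hypothetical per-plot values of one height metric (not from the paper).
simulated = [12.1, 8.4, 15.0, 3.2, 9.9]
observed = [11.5, 8.9, 14.2, 4.0, 9.1]
print(spearman(simulated, observed))
```

Because the coefficient depends only on ranks, it rewards simulated metrics that order the plots the same way as the ALS data even when absolute values are biased.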

    A Powerful Paradigm for Cardiovascular Risk Stratification Using Multiclass, Multi-Label, and Ensemble-Based Machine Learning Paradigms: A Narrative Review

    Background and Motivation: Cardiovascular disease (CVD) causes the highest mortality globally. With escalating healthcare costs, early non-invasive CVD risk assessment is vital. Conventional methods have shown poor performance compared to more recent and fast-evolving Artificial Intelligence (AI) methods. The proposed study reviews the three most recent paradigms for CVD risk assessment, namely multiclass, multi-label, and ensemble-based methods in (i) office-based and (ii) stress-test laboratories. Methods: A total of 265 CVD-based studies were selected using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) model. Due to its popularity and recent development, the study analyzed the above three paradigms using machine learning (ML) frameworks. We comprehensively review these three methods using attributes such as architecture, applications, pros and cons, scientific validation, clinical evaluation, and AI risk-of-bias (RoB) in the CVD framework. These ML techniques were then extended under mobile and cloud-based infrastructure. Findings: The most popular biomarkers used were office-based, laboratory-based, image-based phenotypes, and medication usage. Surrogate carotid scanning for coronary artery risk prediction has shown promising results. Ground truth (GT) selection for AI-based training, along with scientific and clinical validation, is very important for CVD stratification to avoid RoB. It was observed that the most popular classification paradigm is multiclass, followed by ensemble and multi-label. The use of deep learning techniques in CVD risk stratification is at a very early stage of development. Mobile and cloud-based AI technologies are more likely to be the future. Conclusions: AI-based methods for CVD risk assessment are most promising and successful. The choice of GT is most vital in AI-based models to prevent RoB. 
The amalgamation of image-based strategies with conventional risk factors provides the highest stability when using the three CVD paradigms in non-cloud and cloud-based frameworks.

    A comparison of magnetic resonance imaging and neuropsychological examination in the diagnostic distinction of Alzheimer’s disease and behavioral variant frontotemporal dementia

    The clinical distinction between Alzheimer's disease (AD) and behavioral variant frontotemporal dementia (bvFTD) remains challenging and largely dependent on the experience of the clinician. This study investigates whether objective machine learning algorithms using supportive neuroimaging and neuropsychological clinical features can aid the distinction between the two diseases. Retrospective neuroimaging and neuropsychological data of 166 participants (54 AD; 55 bvFTD; 57 healthy controls) were analyzed via a Naïve Bayes classification model. A subgroup of patients (n = 22) had pathologically confirmed diagnoses. Results show that a combination of gray matter atrophy and neuropsychological features allowed a correct classification of 61.47% of cases at clinical presentation. More importantly, there was a clear dissociation between imaging and neuropsychological features, with the latter having the greater diagnostic accuracy (51.38% vs. 62.39%, respectively). These findings indicate that, at presentation, machine learning classification of bvFTD and AD is mostly based on cognitive and not imaging features. This clearly highlights the urgent need to develop better biomarkers for both diseases, but also emphasizes the value of machine learning in determining the predictive diagnostic features in neurodegeneration.

    Applications of machine learning in petroleum system characterization using mud gas data

    Mud gas data are continuous measurements of the different compounds of the gas released from the formation while drilling. While these data are cheap and extensive, as they are always recorded for safety reasons, they are mainly used qualitatively for fluid and reservoir characterization. Only recently have mud gas data started to be used quantitatively. The limited use of mud gas data is generally attributed to the difficulty of relating them directly to reservoir and fluid properties. The motivation of this study is to assess the potential use of mud gas data for quicker, better, and more cost-effective assessment of fluid and reservoir properties, which would be beneficial for both exploration and development. Potentially, this could allow for better understanding of the reservoir, better development strategies, identification of missed net pay leading to new exploration and development opportunities, lower data acquisition costs and more … In this master's project, 40 wells in quad 35 of the Norwegian sector of the northern North Sea were investigated using data analytics to explore the mud gas data and relate them to reservoir properties by evaluating model outputs and how they relate to real physical behavior. We focused on three predictive tasks (two classifications and one regression): 1) identifying hydrocarbon-bearing zones in the reservoir rock through supervised classification using features from mud gas data; 2) identifying fluid type (gas or oil) in hydrocarbon-bearing zones through supervised classification using features from mud gas data; 3) estimating permeability in hydrocarbon-bearing zones by combining mud gas data with logging data through supervised prediction. First, data were exported from different sources and loaded into R and Jupyter Notebook. The data were resampled in accordance with the mud data resolution. Some of the features were then log transformed and scaled to reduce skewness and improve prediction capability. 
Then, multiple machine learning models (linear with subset selection, Lasso, Ridge, Random Forest, Boosting, SVM and Neural Network) were generated and compared. For each model, hyperparameters were tuned based on either a validation set or cross-validation. Then, a more realistic approach was developed to rank the different models: leave-one-well-out cross-validation (LOWOCV), which is less sensitive to systematic error from sampling and experience bias. For the first and second classification tasks, a separate well was also used as a test set. The first classification (distinguishing water from hydrocarbon) provided a moderately good prediction. The training set for this task consists of 2952 samples and the test set of 281. Most of the models provided comparable results. The best model, 'logistic regression with feature selection', had a LOWOCV AUC=0.82, F1 score=0.55, accuracy=0.77. The same model was the best for the test set, with AUC=0.87, F1 score=0.82, accuracy=0.96. The modest performance in distinguishing water from hydrocarbon using mud gas data is probably due to the presence of low-saturation gas in some sand intervals. The second classification showed much better results and distinguished gas from oil very well. The training set consists of 611 samples and the test set of 102. The best model, 'logistic regression with feature selection', had a LOWOCV AUC=0.86, F1 score=0.87 and accuracy=0.84, and a test set AUC=0.99, F1 score=0.94 and accuracy=0.93. The main features important for the prediction are the different gas ratios, which is in line with the findings of Pixler (1969). The third prediction had a sample size of 282 observations and showed that adding mud gas data to porosity and shale volume improved the prediction of permeability compared with the conventional approach of using only porosity or porosity with shale volume. 
The best model was linear regression with subset selection, and the LOWOCV R² improved from 0.65 to 0.85 when mud gas features such as C1 and C1/(DEPTH*MW) were added. Physically, C1/(DEPTH*MW) or (C1/MW) could be considered a gas rate corrected by the mud weight or by depth*mud weight. However, the sample size for this task is small, which makes it difficult to generalize this finding. The results of the different models for the three tasks were loaded into Petrel software to allow better use and visualization by geoscientists. While the data set is limited, we succeeded in obtaining valuable results from mud gas data, which could be a base for further work and investigations.
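The leave-one-well-out scheme described above can be sketched as follows. This is a hedged illustration, not the thesis code: the helper names, the single-feature least-squares line, the pooling of out-of-fold predictions into one R², and the toy values are all assumptions made for the example.

```python
def leave_one_well_out(well_ids):
    """Yield (held_out_well, train_indices, test_indices) for each distinct well."""
    for well in sorted(set(well_ids)):
        train = [i for i, w in enumerate(well_ids) if w != well]
        test = [i for i, w in enumerate(well_ids) if w == well]
        yield well, train, test


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def lowocv_r2(x, y, well_ids):
    """Pooled out-of-fold R²: every sample is predicted by a model
    that never saw any sample from the same well."""
    preds = [0.0] * len(y)
    for _, train, test in leave_one_well_out(well_ids):
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = slope * x[i] + intercept
    return r_squared(y, preds)


# Hypothetical toy data: one feature, three wells (not from the thesis).
wells = ["A", "A", "B", "B", "C"]
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [3.1, 4.9, 7.2, 8.8, 11.0]
print(lowocv_r2(feature, target, wells))
```

Splitting by well rather than by sample keeps neighbouring depth samples from the same well out of both the training and test folds at once, which is why the scheme gives a more realistic estimate of performance on a new well than random cross-validation.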