5 research outputs found

    Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO

    Variable selection is of crucial significance in QSAR modeling since it increases the model predictive ability and reduces noise. The selection of the right variables is far more complicated than the development of predictive models. In this study, eight continuous and categorical data sets were employed to explore the applicability of two distinct variable selection methods: random forests (RF) and the least absolute shrinkage and selection operator (LASSO). Variable selection was performed (1) by using recursive random forests to rule out a quarter of the least important descriptors at each iteration and (2) by using LASSO modeling with 10-fold inner cross-validation to tune its penalty λ for each data set. Along with regular statistical parameters of model performance, we proposed the highest pairwise correlation rate, the average pairwise Pearson’s correlation coefficient, and the Tanimoto coefficient to evaluate the optimal variables selected by RF and LASSO more extensively. Results showed that variable selection could allow a tremendous reduction of noisy descriptors (at most 96% with the RF method in this study) and apparently enhance the model’s predictive performance as well. Furthermore, random forests showed the property of gathering important predictors without restricting their pairwise correlation, which is contrary to LASSO. The mutual exclusion of highly correlated variables in LASSO modeling tends to skip important variables that are highly related to response endpoints and thus undermines the model’s predictive performance. The optimal variables selected by RF share low similarity with those selected by LASSO (e.g., the Tanimoto coefficients were smaller than 0.20 in seven out of eight data sets). We found that the differences between RF and LASSO predictive performances mainly resulted from the variables selected by different strategies rather than from the learning algorithms. Our study showed that the right selection of variables is more important than the learning algorithm for modeling. We hope that a standard procedure could be developed based on these proposed statistical metrics to select the truly important variables for model interpretation, as well as for further use to facilitate drug discovery and environmental toxicity assessment.
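    The elimination schedule and the set-similarity metric described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `recursive_elimination`, `importance_fn`, and `tanimoto` are hypothetical names, the importance scores are assumed to come from any model (for example a refitted random forest), and the Tanimoto coefficient is assumed to be the usual set form, intersection size over union size.

```python
def recursive_elimination(features, importance_fn, min_features=1, drop_fraction=0.25):
    """Return the nested feature subsets produced by repeatedly dropping
    the least important fraction (a quarter by default) of the features.

    importance_fn is a hypothetical callable: it takes the current feature
    list and returns a dict mapping each feature to an importance score.
    """
    current = list(features)
    subsets = [list(current)]
    while len(current) > min_features:
        scores = importance_fn(current)  # re-score on the reduced set
        ranked = sorted(current, key=lambda f: scores[f], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))  # always drop at least one
        n_keep = max(min_features, len(ranked) - n_drop)
        current = ranked[:n_keep]
        subsets.append(list(current))
    return subsets


def tanimoto(selected_a, selected_b):
    """Set-based Tanimoto coefficient between two selected-variable sets."""
    a, b = set(selected_a), set(selected_b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty selections are identical
    return len(a & b) / len(union)
```

    Under these assumptions, a coefficient below 0.20 means the two selections share fewer than one variable in five of their combined pool.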

    Reducing Wait Time Prediction In Hospital Emergency Room: Lean Analysis Using a Random Forest Model

    Most patients visiting emergency departments face long waiting times due to overcrowding, which is a major concern across hospitals in the United States. Emergency Department (ED) overcrowding is a common phenomenon across hospitals, which leads to issues for hospital management, such as increased patient dissatisfaction and an increase in the number of patients choosing to terminate their ED visit without being attended to by a healthcare professional. Patients who Leave Without Being Seen (LWBS) by doctors often cause a loss of revenue for hospitals, encouraging healthcare professionals to analyze ways to improve operational efficiency and reduce the operational expenses of an emergency department. To keep patients informed of the conditions in the emergency room, hospitals have recently started publishing wait times online. Posted wait times help patients choose the least overcrowded ED, benefiting patients with the shortest waiting time and allowing hospitals to allocate and plan resources appropriately. This requires an accurate and efficient method to model the experienced waiting time for patients visiting an emergency medical services unit. In this thesis, the author seeks to estimate the waiting time for low acuity patients within an ED setting, using regularized regression methods such as Lasso, Ridge, Elastic Net, SCAD and MCP, along with tree-based regression (Random Forest). To accurately capture the dynamic state of emergency rooms, queues of patients at various stages of the ED are used as candidate predictor variables, along with the patient's arrival time to account for diurnal variation. The best waiting time prediction model is selected based on the analysis of historical data from the hospital. The tree-based regression model predicts the wait time of low acuity patients in the ED more accurately than regularized regression, conventional rolling average, and quantile regression methods. Finally, the most influential predictors of patient wait time are identified for the best-performing model.
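    For context, the conventional rolling average baseline that the thesis compares against can be sketched as follows. This is an illustrative assumption about how such a baseline is typically computed, not the thesis code: the function name, the window size, and the choice to return `None` before any history exists are all hypothetical.

```python
from collections import deque


def rolling_average_forecasts(wait_times, window=4):
    """Forecast each patient's wait as the mean of the previous `window`
    observed waits (assumed baseline; the window size is illustrative)."""
    recent = deque(maxlen=window)  # keeps only the latest observed waits
    forecasts = []
    for observed in wait_times:
        # Forecast is made before the current wait is known.
        forecasts.append(sum(recent) / len(recent) if recent else None)
        recent.append(observed)
    return forecasts
```

    A baseline like this uses only recent waits, whereas the models in the abstract also use queue lengths at each ED stage and the arrival time.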

    Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models

    In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) are used to correlate the structure of a molecule to its biological property in drug design and toxicological studies. In this body of work, I started with two in-depth reviews into the application of machine learning based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning based QSAR studies. First, to improve the prediction accuracy of learning from imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, combined with bagging as an ensemble strategy, were evaluated. The Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods. SMOTEENN with bagging became less effective when the imbalance ratio (IR) exceeded a certain threshold (e.g., >40). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Lastly, current features used for QSAR based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent space embedding from multi-head self-attention. The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features have a correlation to biological activity. Overall, the findings in this study can be applied to improve the performance of machine learning based Quantitative Structure-Activity/Property Relationship (QSAR) efforts for enhanced drug discovery and toxicology assessments.
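    The imbalance ratio and three of the metrics named in this abstract can be made concrete with a small sketch. The definitions are assumptions, since the abstract does not state them: IR is taken as the majority class count divided by the minority class count, and precision, recall, and F-measure use their standard binary definitions with the active class labeled 1. The function names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute an imbalance ratio")
    return max(counts.values()) / min(counts.values())


def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard binary precision, recall, and F-measure for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

    With an IR above 40, as mentioned in the abstract, fewer than one compound in 41 is active, which is why accuracy alone says little and metrics focused on the active class are reported.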

    Machine learning for the prediction of drug-induced toxicity

    Knowledge of the toxicological properties of compounds (e.g. drugs, chemicals, and contaminants) is crucial for drug development and for the definition of toxicological thresholds and exposure limits. However, toxicological testing, either in vitro or in vivo, is time-consuming, labour-intensive and expensive. An alternative to the classic experiments is the use of computational (in silico) approaches, such as machine learning. For machine learning, it is assumed that substances with comparable structure or molecular features also exhibit comparable pharmacological or toxicological action. Based on the comparison of substances with known pharmacological or toxicological action to substances with unknown properties, models generated using machine learning methods are able to predict the action of the latter substances. The aim of this work was the development of predictive machine learning models for the estimation of the risk of hepatotoxicity and genotoxicity. These models were then applied to two different substance groups and the outcome was compared to available literature data. The acute hepatotoxic potential of over 600 different pyrrolizidine alkaloids (PAs) was evaluated using the methods Random Forest and artificial Neural Networks. The predicted qualitative hepatotoxicity of both models was highly correlated. Furthermore, specific structural motifs showed different hepatotoxic potential. Overall, the obtained results fitted well with already published in vitro and in vivo data on the acute hepatotoxic properties of PAs. The genotoxic/mutagenic potential of PAs was addressed using six different machine learning methods (LAZAR (Lazy Structure-Activity Relationships), Support Vector Machines, Random Forest and two Deep Learning Networks). Even though the models achieved only low to moderate accuracy rates, the best model clearly showed structure-specific differences in the predicted genotoxic potential. Furthermore, the acute hepatotoxic potential of 165 protein kinase inhibitors (PKIs) was predicted using Random Forest and artificial Neural Networks. The models confirmed clinical observations that PKIs have in general a high probability of inducing hepatotoxicity. However, interestingly, there seemed to be a target-specific difference, with inhibitors of Janus kinases having the lowest hepatotoxic probability of 60-67%. The greatest challenge is the performance of the models. This has to be validated, e.g. by cross-validation, before the model can be used on the substances of interest. Although group statements could be easily obtained, due caution has to be taken while interpreting the results of predictive models for single compounds and, if possible, comparison to already published data is advisable as a form of external validation.
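    The cross-validation step mentioned above can be sketched as a plain k-fold index splitter. This is a generic illustration, not the procedure used in the thesis: the function name and the default `k` are hypothetical, and the sketch neither shuffles nor stratifies, which real toxicity data sets would usually need.

```python
def k_fold_splits(n_samples, k=5):
    """Return k (train_indices, test_indices) pairs covering every sample
    exactly once as a test point. Folds are contiguous and unshuffled."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)  # the first `extra` folds get one more sample
    splits = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits
```

    Each model is then fitted on the training indices and scored on the held-out indices, and the k scores are averaged to estimate performance before the model is applied to new substances.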