13 research outputs found

    Development and Evaluation of ADME Models Using Proprietary and Opensource Data

    Absorption, Distribution, Metabolism and Elimination (ADME) properties are important factors in the drug discovery pipeline. Literature ADME data are often collected in large chemical databases such as ChEMBL, which might be an asset for improving the prediction of ADME properties. Pharmaceutical companies build ADME Quantitative Structure Property Relationship (QSPR) models using proprietary data, and the inclusion of literature data might therefore be a valuable source for the development of predictive models. The aim of this study was to investigate whether merging literature and proprietary data could improve the predictive ability of proprietary models and enlarge their applicability domain (AD). ADME predictive models for Caco-2 (A to B) permeability and LogD7.4 were built with data extracted from the Evotec and ChEMBL databases. Predictive models were developed for each property using three different training sets: proprietary compounds (Evotec models), literature compounds (ChEMBL models) and a merged set of proprietary and literature compounds (Evotec+ChEMBL models). Random Forest (RF), Partial Least Squares (PLS) and Support Vector Regression (SVR) were used to develop the models. The performance of the models was evaluated using two types of test sets: a diverse test set (20% of the available compounds, randomly selected) and a temporal test set (data published after the models were built). The descriptors used were physicochemical descriptors, structural Molecular ACCess System (MACCS) descriptors and Partial Equalisation of Orbital Electronegativity van der Waals Surface Area (PEOE-VSA) descriptors. The AD of the models was evaluated with four distance-to-model metrics: kNN with Euclidean distance, kNN with Manhattan distance, leverage and Mahalanobis distance. The ability of an existing Evotec Caco-2 permeability model to assess literature compounds (extracted from ChEMBL) was also evaluated. 
The literature test set was predicted with a higher RMSE than the internal compounds. Additionally, a number of literature compounds were found to be outside the AD of the Evotec model, highlighting an area for improvement in the proprietary Evotec models. Furthermore, the effect of including literature data in the existing Caco-2 permeability and LogD7.4 Evotec proprietary models was evaluated. RF was the highest performing algorithm for the Caco-2 permeability models and SVR for the LogD7.4 models. In addition, the leverage method proved to be the most appropriate for evaluating the models' AD. The permeability model built by merging literature and proprietary data (Evotec+ChEMBL model) predicted a literature temporal test set with an RMSE of 0.68, while the Evotec model showed an RMSE of 0.74. On the Evotec temporal test set the two models performed similarly, and the AD of the mixed model (incorporating both literature and proprietary data) was enlarged. Of the compounds in the proprietary temporal test set, 86.15% were within the AD of the Evotec+ChEMBL model, whereas 76.50% were within the AD of the Evotec model. Similarly, the LogD7.4 Evotec+ChEMBL model predicted a literature temporal test set with an RMSE of 0.77, while the Evotec model showed an RMSE of 0.83. On the Evotec temporal test set the two models again performed similarly, and the AD of the mixed model was enlarged. Of the compounds in the proprietary temporal test set, 94.86% were within the AD of the Evotec+ChEMBL model, whereas 88.49% were within the AD of the Evotec model. This study demonstrated that including public ADME data in proprietary models improved their performance and at the same time enlarged their AD. 
The methodology presented herein will be applied by Evotec computational scientists to rebuild the Caco-2 and LogD7.4 Evotec proprietary models, taking literature data into account as discussed in this thesis.
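
To make the distance-to-model idea concrete, the following is a minimal sketch of a kNN-based applicability-domain check with Euclidean and Manhattan distances, plus the RMSE used to compare models. It is not the thesis implementation: the function names, the toy descriptor vectors, the choice of k and the cut-off value are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two descriptor vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute per-descriptor differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_distance(query, training, k=3, metric=euclidean):
    # Mean distance from the query compound to its k nearest training compounds.
    distances = sorted(metric(query, row) for row in training)
    return sum(distances[:k]) / k

def rmse(observed, predicted):
    # Root mean squared error between observed and predicted values.
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical two-descriptor training set and query compounds.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
inside = (0.5, 0.5)
outside = (5.0, 5.0)
threshold = 1.5  # hypothetical cut-off; in practice derived from the training set

d_in = knn_distance(inside, training, k=3)
d_out = knn_distance(outside, training, k=3)
print("inside AD:", d_in <= threshold, "| outside AD:", d_out > threshold)
```

A compound whose mean kNN distance exceeds the cut-off would be flagged as outside the AD; merging additional training compounds can only shrink these distances, which is one way to read the enlarged AD reported above.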

    Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty.

    Measurements of protein-ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have these unavoidable errors influencing its performance. Ideally, such errors should be factored into modelling and output predictions, for example via the actual standard deviation of experimental measurements (σ) or the associated comparability of activity values between aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. In order to improve upon the current state of the art, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied toward in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated by taking into account various scenarios of experimental standard deviations in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way by the original RF algorithm. For example, when σ ranged between 0.4 and 0.6 log units and the ideal probability estimates were between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident. 
Finally, PRF models trained with putative inactives performed worse than PRF models trained without them. This could be because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with a low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.

    Computational analyses of mechanism of action (MoA): data, methods and integration.

    The elucidation of a compound's Mechanism of Action (MoA) is a challenging task in the drug discovery process, but it is important in order to rationalise phenotypic findings and to anticipate potential side-effects. Bioinformatic approaches, advances in machine learning techniques and the increasing deposition of high-throughput data in public databases have significantly contributed to recent advances in the field, but it is not straightforward to decide which data and methods are most suitable to use in a given case. In this review, we focus on these methods and data and their applications in generating MoA hypotheses for subsequent experimental validation. We discuss compound-specific data such as -omics, cell morphology and bioactivity data, as well as commonly used supplementary prior knowledge such as network and pathway data, and provide information on databases where these data can be accessed. In terms of methodologies, we discuss both well-established methods (connectivity mapping, pathway enrichment) and emerging methods (neural networks and multi-omics integration). Finally, we review case studies where the MoA of a compound was successfully suggested from computational analysis by incorporating multiple data modalities and/or methodologies. Our aim for this review is to provide researchers with insights into the benefits and drawbacks of both the data and methods in terms of level of understanding, biases and interpretation, and to highlight future avenues of investigation which we foresee will improve the field of MoA elucidation, including greater public access to -omics data and methodologies which are capable of data integration.

    Integrating cell morphology with gene expression and chemical structure to aid mitochondrial toxicity detection.

    Mitochondrial toxicity is an important safety endpoint in drug discovery. Models based solely on chemical structure for predicting mitochondrial toxicity are currently limited in accuracy, and their applicability domain is limited to the chemical space of the training compounds. In this work, we aimed to utilize both -omics and chemical data to push beyond the state of the art. We combined Cell Painting and Gene Expression data with chemical structural information from Morgan fingerprints for 382 chemical perturbants tested in the Tox21 mitochondrial membrane depolarization assay. We observed that mitochondrial toxicants differ from non-toxic compounds in morphological space and identified compound clusters having similar mechanisms of mitochondrial toxicity, thereby indicating that morphological space provides biological insights related to mechanisms of action of this endpoint. We further showed that models combining Cell Painting, Gene Expression features and Morgan fingerprints improved model performance on an external test set of 244 compounds by 60% (in terms of F1 score) and improved extrapolation to new chemical space. The performance of our combined models was comparable with dedicated in vitro assays for mitochondrial toxicity. Our results suggest that combining chemical descriptors with biological readouts enhances the detection of mitochondrial toxicants, with practical implications in drug discovery.
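
As a rough illustration of how such a comparison could be scored, the sketch below concatenates feature blocks per compound (a simple early-fusion scheme, which is an assumption; the abstract does not state how the modalities were combined) and computes the F1 score and a relative improvement. All labels, predictions and feature values are made-up toy data, and the helper names are hypothetical.

```python
def fuse(*blocks):
    # Early fusion: concatenate per-compound feature blocks into one vector.
    return [value for block in blocks for value in block]

def f1_score(y_true, y_pred):
    # F1 for the positive (toxic) class, computed from true/false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def relative_improvement(baseline, combined):
    # Fractional gain of the combined model over the baseline.
    return (combined - baseline) / baseline

# Toy data only: 1 = toxic, 0 = non-toxic.
y_true = [1, 1, 1, 0, 0, 0]
structure_only_pred = [1, 0, 0, 1, 0, 0]
combined_pred = [1, 1, 0, 1, 0, 0]

f1_structure = f1_score(y_true, structure_only_pred)
f1_combined = f1_score(y_true, combined_pred)
print(f1_structure, f1_combined, relative_improvement(f1_structure, f1_combined))
```

Whether the reported 60% is a relative or an absolute change in F1 is not stated in the abstract; the helper above computes the relative form.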

    Probabilistic Random Forest Improves Bioactivity Predictions Close to the Classification Threshold by Taking into Account Experimental Uncertainty

    In the context of small molecule property prediction, experimental errors are usually a neglected aspect during model generation. The main caveat to binary classification approaches is that they weight minority cases close to the threshold boundary equivalently in distinguishing between activity classes. For example, pXC50 activity values of 5.1 and 4.9 are treated as equally important in contributing to opposing activity classes (e.g., with a classification threshold of 5), even though experimental error may not afford such discriminatory accuracy. This is detrimental in practice, and it is therefore important both to evaluate the presence of experimental error in databases and to apply methodologies that account for variability in experiments and uncertainty near the decision boundary. In order to improve upon this, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF comprises a modification to the long-established Random Forest (RF) that takes into account uncertainties in the assigned classes (i.e., activity labels). This enables representing the activity in a framework in between the classification and regression architectures, with philosophical differences from either approach. Compared to classification, this approach enables better representation of factors increasing/decreasing inactivity. Conversely, compared to a classical regression framework, one can utilize all data (even delimited/operand/censored data far from a cut-off) while at the same time taking into account the granularity around the cut-off. The algorithm was applied toward ~550 target prediction tasks from ChEMBL and PubChem. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way by the original RF algorithm. 
In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold). The RF models gave errors smaller than the experimental uncertainty, which could indicate that they are overtrained and/or over-confident. Overall, we show that PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold. With this approach we present, to our knowledge for the first time, an application of probabilistic modelling of activity data for target prediction using the PRF algorithm.
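
The 5.1 versus 4.9 example can be made concrete with a small sketch that turns a measured pXC50 and an experimental standard deviation into a probabilistic activity label. It assumes normally distributed measurement error and uses the normal cumulative distribution function; the abstract does not give the formula, so the function name, the Gaussian assumption and the σ value are illustrative assumptions rather than the authors' stated method.

```python
import math

def active_probability(pxc50, threshold=5.0, sigma=0.5):
    # Probability that the true activity exceeds the threshold, assuming
    # Gaussian measurement error with standard deviation sigma (an assumption).
    z = (pxc50 - threshold) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

# A hard threshold at 5 labels these two compounds as opposite classes,
# whereas the probabilistic labels stay close to 0.5 for both.
p_just_above = active_probability(5.1)
p_just_below = active_probability(4.9)

# A measurement far from the threshold remains a confident label.
p_far_above = active_probability(7.0)

print(round(p_just_above, 3), round(p_just_below, 3), round(p_far_above, 5))
```

With a smaller sigma the labels approach the hard 0/1 assignment of a conventional classifier, so the benefit of a probabilistic label is concentrated near the threshold, consistent with the behaviour described above.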

    Multitask Bioactivity Predictions Using Structural Chemical and Cell Morphology Information

    Understanding the Mechanism-of-Action (MoA) of compounds and predicting potential drug targets play an important role in small-molecule drug discovery. The aim of this work was to compare chemical and cell morphology information for bioactivity prediction. The comparison was performed using bioactivity data from the ExCAPE database, image data from the Cell Painting data set (the largest publicly available data set of cell images, with ~30,000 compound perturbations) and Extended Connectivity Fingerprints (ECFPs), with the multitask Bayesian Matrix Factorisation (BMF) approach Macau. We found that the performance of BMF Macau and Random Forest (RF) was overall similar when ECFP fingerprints were used as compound descriptors. However, BMF Macau outperformed RF in 155 out of 224 target classes (69.20%) when image data were used as compound information. Using BMF Macau, 100 (about 45%) and 90 (about 40%) of the 224 targets were predicted with high predictive performance (AUC > 0.8) with ECFP data and image data as side information, respectively. Some targets were better predicted with image data as side information, such as β-catenin, and others with fingerprint-based side information, such as proteins belonging to the G-Protein Coupled Receptor 1 family, which could be rationalized from the underlying data distributions in each descriptor domain. In conclusion, both cell morphology changes and structural chemical information contain information about compound bioactivity, which is also partially complementary, and can hence contribute to in silico mechanism of action analysis.
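
The AUC criterion and the target fractions quoted above can be checked with a short sketch. The pairwise AUC below is a generic rank-based formulation, not code from the study, and the labels and scores are toy values; only the counts 155, 100, 90 and 224 come from the abstract.

```python
def roc_auc(labels, scores):
    # Fraction of (active, inactive) pairs in which the active is scored higher;
    # ties count as half a win.
    actives = [s for l, s in zip(labels, scores) if l == 1]
    inactives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if a > i else 0.5 if a == i else 0.0
        for a in actives
        for i in inactives
    )
    return wins / (len(actives) * len(inactives))

def percent(count, total):
    # Percentage rounded to two decimals.
    return round(100.0 * count / total, 2)

# Toy example for a single target.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Fractions of the 224 targets reported in the abstract.
macau_beats_rf = percent(155, 224)
high_auc_ecfp = percent(100, 224)
high_auc_image = percent(90, 224)
print(toy_auc, macau_beats_rf, high_auc_ecfp, high_auc_image)
```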