
    Extending in Silico Protein Target Prediction Models to Include Functional Effects.

    In silico protein target deconvolution is frequently used for mechanism-of-action investigations; however, existing protocols usually do not predict compound functional effects, such as activation or inhibition, upon binding to their protein counterparts. This study is therefore concerned with including functional effects in target prediction. To this end, we assembled a bioactivity training set for 332 targets, comprising 817,239 active data points with unknown functional effect (binding data) and 20,761,260 inactive compounds, along with 226,045 activating and 1,032,439 inhibiting data points from functional screens. Chemical space analysis of the data first showed some separation between compound sets (binding and inhibiting compounds were more similar to each other than binding and activating or activating and inhibiting compounds), providing a rationale for implementing functional prediction models. We employed three different architectures to predict functional response, ranging from simple random forest models ('Arch1') to cascaded models that use separate binding and functional effect classification steps ('Arch2' and 'Arch3'), which differ in the way their training sets were generated. Fivefold stratified cross-validation showed that cascading predictions provides superior precision and recall on an internal test set. We next prospectively validated the architectures using a temporal set of 153,467 in-house data points (extracted 4 months after the initial data extraction). Arch3 performed with the highest target class averaged precision and recall scores of 71% and 53%, which we attribute to its use of inactive background sets. Distance-based applicability domain (AD) analysis showed that Arch3 also extrapolates best into novel areas of chemical space; based on the results presented here, we therefore propose it as the most suitable architecture for the functional effect prediction of small molecules. We finally conclude that including functional effects could provide vital insight in future studies, for example to annotate cases of unanticipated functional changeover, as outlined by our CHRM1 case study. LM thanks the Biotechnology and Biological Sciences Research Council (BBSRC) (BB/K011804/1) and AstraZeneca (grant number RG75821).
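The cascaded architectures described above chain a binding classifier to a functional effect classifier. A minimal sketch of that idea, assuming random forest models at both stages: this is not the authors' code, and the fingerprints, labels, and stage names below are invented toy data for illustration only.

```python
# Hypothetical two-stage cascade in the spirit of 'Arch2'/'Arch3':
# stage 1 predicts binding vs. inactive; stage 2 predicts the functional
# effect (activating vs. inhibiting) only for predicted binders.
# All data here is random toy data, not real bioactivity data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 64))    # toy binary fingerprints
binds = rng.integers(0, 2, size=400)      # 1 = binds the target
effect = rng.integers(0, 2, size=400)     # 1 = activating, 0 = inhibiting

stage1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, binds)
# Stage 2 is trained only on compounds with a known functional effect.
mask = binds == 1
stage2 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[mask], effect[mask])

def predict_functional(x):
    """Return 'inactive', 'activating', or 'inhibiting' for one compound."""
    if stage1.predict(x.reshape(1, -1))[0] == 0:
        return "inactive"
    return "activating" if stage2.predict(x.reshape(1, -1))[0] == 1 else "inhibiting"

print(predict_functional(X[0]))
```

Splitting the problem this way lets each stage use its own training set, which is how the three architectures in the study differ.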

    Prediction of the Chemical Context for Buchwald-Hartwig Coupling Reactions

    We present machine learning models for predicting the chemical context for Buchwald-Hartwig coupling reactions, i.e., what chemicals to add to the reactants to give a productive reaction. Using reaction data from in-house electronic lab notebooks, we train two models: one based on single-label data and one based on multi-label data. Both models show excellent top-3 accuracy of approximately 90%, which suggests strong predictivity. Furthermore, there seems to be an advantage to including multi-label data, because the multi-label model shows higher accuracy and better sensitivity for the individual contexts than the single-label model. Although the models are performant, we also show that such models need to be re-trained periodically, as there is a strong temporal characteristic to the usage of different contexts. A model trained on historical data will therefore decrease in usefulness with time as newer and better contexts emerge and replace older ones. We hypothesize that such significant transitions in context usage will likely affect any model predicting chemical contexts trained on historical data. Consequently, training context prediction models warrants careful planning of what data is used for training and how often the model needs to be re-trained.
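The top-3 accuracy metric used above counts a prediction as correct if the true context appears among the model's three highest-ranked suggestions. A minimal sketch of the computation, with invented scores for illustration:

```python
# Top-k accuracy: a sample is a "hit" if its true class index is among
# the k highest-scoring classes. Scores below are made up for illustration.
import numpy as np

def top_k_accuracy(scores, true_labels, k=3):
    """scores: (n_samples, n_classes) array; true_labels: list of class indices."""
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best classes
    hits = [t in row for t, row in zip(true_labels, top_k)]
    return sum(hits) / len(hits)

scores = np.array([[0.1, 0.5, 0.2, 0.2],
                   [0.6, 0.1, 0.2, 0.1],
                   [0.2, 0.2, 0.3, 0.3]])
print(top_k_accuracy(scores, [0, 0, 1], k=3))
```

With k=1 this reduces to ordinary accuracy; larger k is the natural choice when the model suggests a ranked shortlist of contexts for a chemist to try.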

    Target prediction utilising negative bioactivity data covering large chemical space.

    BACKGROUND: In silico analyses are increasingly being used to support mode-of-action investigations; however, many such approaches do not utilise the large amounts of inactive data held in chemogenomic repositories. This work concerns the integration of such bioactivity data into the target prediction of orphan compounds, producing probabilities of activity and inactivity for a range of targets. To this end, a novel human bioactivity data set was constructed through the assimilation of over 195 million bioactivity data points deposited in the ChEMBL and PubChem repositories, and the subsequent application of a sphere-exclusion selection algorithm to oversample presumed inactive compounds. RESULTS: A Bernoulli Naïve Bayes algorithm was trained on the data and evaluated using fivefold cross-validation, achieving a mean recall and precision of 67.7% and 63.8% for active compounds and 99.6% and 99.7% for inactive compounds, respectively. We show that the performance of the models is considerably influenced by the underlying intraclass training similarity, the size of a given class of compounds, and the degree of additional oversampling. The method was also validated using compounds extracted from WOMBAT, producing average precision-recall AUC and BEDROC scores of 0.56 and 0.85, respectively. Inactive data points used for this test are based on presumed inactivity, so these scores give an approximate indication of the true extrapolative ability of the models. A distance-based applicability domain analysis was also conducted, indicating that an average Tanimoto coefficient distance of 0.3 or greater between a test and training set can be used to give a global measure of confidence in model predictions. A final comparison to a method trained solely on active data from ChEMBL gave precision-recall AUC and BEDROC scores of 0.45 and 0.76.
    CONCLUSIONS: The inclusion of inactive data for model training produces models with superior AUC and improved early-recognition capabilities, although internal and external validation show differing performance across the breadth of models. The realised target prediction protocol is available at https://github.com/lhm30/PIDGIN. Graphical abstract: the inclusion of large-scale negative training data for in silico target prediction improves the precision-recall AUC and BEDROC scores of target models. The authors thank Krishna C. Bulusu for proofreading the manuscript. LHM thanks BBSRC and AstraZeneca for their funding. GD thanks EPSRC and Eli Lilly for funding. This is the final version of the article. It first appeared from Springer via http://dx.doi.org/10.1186/s13321-015-0098-
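The Tanimoto-distance applicability domain check described above can be sketched in a few lines. This is a simplified nearest-neighbour variant (the study uses an average distance to the training set), not the PIDGIN implementation; fingerprints are represented here as sets of "on" bit indices, and the values are invented:

```python
# Distance-based applicability domain: flag a query as in-domain when its
# Tanimoto distance to the nearest training compound is at most ~0.3
# (threshold taken from the analysis above; data below is invented).
def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of bit indices."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 1.0

train = [{1, 2, 3, 4}, {2, 3, 5}, {10, 11, 12}]   # toy training fingerprints
query = {1, 2, 3, 9}                              # toy query fingerprint

nearest = min(tanimoto_distance(query, t) for t in train)
in_domain = nearest <= 0.3
print(nearest, in_domain)
```

Queries failing the check are not necessarily wrong, but their predictions carry less confidence because the model is extrapolating beyond its training chemistry.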

    Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty.

    Measurements of protein-ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have these unavoidable errors influencing its performance, and they should ideally be factored into modelling and output predictions, such as the actual standard deviation of experimental measurements (σ) or the comparability of activity values between aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. To improve upon the current state of the art, we herein present a novel approach to predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied to in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated by taking into account various scenarios of experimental standard deviations in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way in the original RF algorithm. For example, when σ ranged between 0.4 and 0.6 log units and ideal probability estimates were between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident. Finally, PRF models trained with putative inactives performed worse than PRF models without them, which could be because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.
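The core idea above, that a measurement's σ softens its binary activity label near the threshold, can be illustrated with a worked sketch. This is not the authors' PRF implementation; it only shows how a pXC50 value, its standard deviation, and an activity threshold combine into a class probability under an assumed normal measurement error:

```python
# Soft activity label from measurement uncertainty: assuming the true
# value is distributed N(pxc50, sigma), P(active) is the probability mass
# above the activity threshold, computed via the normal CDF.
from math import erf, sqrt

def p_active(pxc50, sigma, threshold=6.0):
    """P(true activity >= threshold) under N(pxc50, sigma) measurement error."""
    if sigma == 0:
        return 1.0 if pxc50 >= threshold else 0.0   # no uncertainty: hard label
    z = (pxc50 - threshold) / (sigma * sqrt(2))
    return 0.5 * (1 + erf(z))

print(p_active(6.0, 0.5))   # exactly at the threshold -> 0.5
print(p_active(7.0, 0.5))   # well above the threshold -> close to 1
```

This also makes the putative-inactives caveat concrete: a compound with no measured pXC50 that is given σ = 0 is treated as inactive with full certainty, which may not reflect reality.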

    Novel endosomolytic compounds enable highly potent delivery of antisense oligonucleotides

    The therapeutic and research potentials of oligonucleotides (ONs) have been hampered in part by their inability to effectively escape endosomal compartments to reach their cytosolic and nuclear targets. Splice-switching ONs (SSOs) can be used with endosomolytic small-molecule compounds to increase functional delivery. So far, development of these compounds has been hindered by a lack of high-resolution methods that can correlate SSO trafficking with SSO activity. Here we present an in-depth characterization of two novel endosomolytic compounds using a combination of microscopic and functional assays with high spatiotemporal resolution. This system allows the visualization of SSO trafficking, evaluation of endosomal membrane rupture, and quantification of SSO functional activity at the protein level in the presence of endosomolytic compounds. We confirm that the leakage of SSO into the cytosol occurs in parallel with the physical engorgement of LAMP1-positive late endosomes and lysosomes. We conclude that the new compounds interfere with SSO trafficking to the LAMP1-positive endosomal compartments while inducing endosomal membrane rupture and concurrent ON escape into the cytosol. The efficacy of these compounds advocates their use as novel, potent, and quick-acting transfection reagents for antisense ONs.

    A data science roadmap for open science organizations engaged in early-stage drug discovery

    The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. We, like many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how to best benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary, while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles, openly and at scale, and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.

    11th German Conference on Chemoinformatics (GCC 2015) : Fulda, Germany. 8-10 November 2015.


    Blinded predictions and post-hoc analysis of the second solubility challenge data : exploring training data and feature set selection for machine and deep learning models

    Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organised a “Second Solubility Challenge” in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the blind challenge in 2019 but have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms, and were trained on a relatively small dataset of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility datasets consisting of 2999 and 5697 molecules. The results show that several algorithms are able to obtain near state-of-the-art performance on the solubility challenge datasets, with the best model, a graph convolutional neural network, achieving an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training datasets. The results suggest that careful selection of high-quality training data from relevant regions of chemical space is critical for prediction accuracy, but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty of modelling complex chemical spaces from sparse training datasets.
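The headline figure above is an RMSE in log solubility units. A toy illustration of the metric, with invented predicted and measured log-solubility values (not the challenge data):

```python
# RMSE in log units: the root of the mean squared difference between
# predicted and measured log solubilities. Values below are invented.
from math import sqrt

def rmse(pred, true):
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

pred = [-3.1, -4.5, -2.0, -5.2]   # hypothetical predicted logS
true = [-3.0, -4.0, -2.8, -5.0]   # hypothetical measured logS
print(round(rmse(pred, true), 3))
```

An RMSE of 0.86 log units therefore means predictions are typically within roughly a factor of seven of the measured solubility, which puts the reported state of the art in context.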

    A Systematic Review of Longitudinal Trajectories of Mental Health Problems in Children with Neurodevelopmental Disabilities

    To review the longitudinal trajectories – and the factors influencing their development – of mental health problems in children with neurodevelopmental disabilities. Systematic review methods were employed. Searches of six databases used keywords and MeSH terms related to children with neurodevelopmental disabilities, mental health problems, and longitudinal research. After the removal of duplicates, reviewers independently screened records for inclusion, extracted data (outcomes and influencing factors), and evaluated the risk of bias. Findings were tabulated and synthesized using graphs and a narrative. Searches identified 94,662 unique records, from which 49 publications were included. The median publication year was 2015. Children with attention deficit hyperactivity disorder were the most commonly included population in retrieved studies. In almost 50% of studies, trajectories of mental health problems changed. Funded by the Swedish Research Council (2018-05824_VR).