
    Visual and computational analysis of structure-activity relationships in high-throughput screening data

    Novel analytic methods are required to assimilate the large volumes of structural and bioassay data generated by combinatorial chemistry and high-throughput screening programmes in the pharmaceutical and agrochemical industries. This paper reviews recent work in visualisation and data mining that can be used to develop structure-activity relationships from such chemical/biological datasets.

    Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models

    In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) are used to correlate the structure of a molecule with its biological properties in drug design and toxicological studies. In this body of work, I started with two in-depth reviews of the application of machine learning based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning based QSAR studies. First, to improve the prediction accuracy of learning from imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, combined with bagging as an ensemble strategy, were evaluated. Friedman's aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods. SMOTEENN with bagging became less effective when IR exceeded a certain threshold (e.g., >40). The ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Lastly, current features used for QSAR-based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent space embedding from a multi-head self-attention mechanism. The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features correlate with biological activity. Overall, the findings in this study can be applied to improve the performance of machine learning based Quantitative Structure-Activity/Property Relationship (QSAR) efforts for enhanced drug discovery and toxicology assessments.
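The abstract above quotes a threshold on "IR" without defining it. Assuming IR denotes the imbalance ratio in its usual sense (majority-class count divided by minority-class count), the following minimal sketch shows how that quantity could be computed for a labelled screening set. The function name and the toy label list are hypothetical and do not come from the thesis.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count.

    Assumption: this is the conventional definition of the imbalance
    ratio; the abstract itself does not spell out how IR is computed.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute an imbalance ratio")
    return max(counts.values()) / min(counts.values())


# Hypothetical screening set: 5 actives (1) among 250 inactives (0).
labels = [1] * 5 + [0] * 250
ir = imbalance_ratio(labels)

# 40 is the example threshold quoted in the abstract.
exceeds_threshold = ir > 40
```

With these toy labels the ratio is 250 / 5 = 50, so the set falls in the regime where the abstract reports SMOTEENN with bagging becoming less effective.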

    Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks

    Cellular senescence is a barrier to tumorigenesis in normal cells, and tumour cells undergo senescence responses to genotoxic stimuli, which is a potential target phenotype for cancer therapy. However, in this setting, mixed-mode responses are common, with apoptosis the dominant effect. Hence, more selective senescence inducers are required. Here we report a machine learning-based in silico screen to identify potential senescence agonists. We built profiles of differentially affected biological process networks from expression data obtained under induced telomere dysfunction conditions in colorectal cancer cells and matched these to a panel of 17 protein targets with confirmatory screening data in PubChem. We trained a neural network using 3517 compounds identified as active or inactive against these targets. The resulting classification model was used to screen a virtual library of ~2M lead-like compounds. 147 virtual hits were acquired for validation in growth inhibition and senescence-associated β-galactosidase (SA-β-gal) assays. Among the hits, a benzimidazolone compound, CB-20903630, had low micromolar IC50 for growth inhibition of HCT116 cells and selectively induced SA-β-gal activity in the entire treated cell population without cytotoxicity or apoptosis induction. Growth suppression was mediated by G1 blockade involving increased p21 expression and suppressed cyclin B1, CDK1 and CDC25C. Additionally, the compound inhibited growth of multicellular spheroids and caused severe retardation of population kinetics in long-term treatments. Preliminary structure-activity and structure clustering analyses are reported, and expression analysis of CB-20903630 against other cell cycle suppressor compounds suggested a PI3K/AKT-inhibitor-like profile in normal cells, with different pathways affected in cancer cells.

    Computational Methods Generating High-Resolution Views of Complex Structure-Activity Relationships

    The analysis of structure-activity relationships (SARs) of small bioactive compounds is a central task in medicinal chemistry and pharmaceutical research. The study of SARs is in principle not limited to computational methods; however, as data sets rapidly grow in size, advanced computational approaches become indispensable for SAR analysis. Activity landscapes are one of the preferred and widely used computational models to study large-scale SARs. Activity cliffs are cardinal features of activity landscape representations and are thought to contain high SAR information content. This work addresses major challenges in systematic SAR exploration and specifically focuses on the design of novel activity landscape models and comprehensive activity cliff analysis. In the first part of the thesis, two conceptually different activity landscape representations are introduced for compounds active against multiple targets. These models are designed to provide intuitive graphical access to compounds forming single and multi-target activity cliffs and displaying multi-target SAR characteristics. Further, a systematic analysis of the frequency and distribution of activity cliffs is carried out. In addition, a large-scale data mining effort is designed to quantify and analyze fingerprint-dependent changes in SAR information. The second part of this work is dedicated to the concept of activity cliffs and their utility in the practice of medicinal chemistry. To this end, a computational approach is introduced to search for detectable SAR advantages associated with activity cliffs. In addition, the extent to which activity cliffs might be utilized as starting points in practical compound optimization efforts is investigated. Finally, all activity cliff configurations formed by currently available bioactive compounds are thoroughly examined. These configurations are further classified and their frequency of occurrence and target distribution are determined. Furthermore, the activity cliff concept is extended to explore the relation between chemical structures and compound promiscuity. The notion of promiscuity cliffs is introduced to deduce structural modifications that might induce large-magnitude promiscuity effects.

    Development and Interpretation of Machine Learning Models for Drug Discovery

    In drug discovery, domain experts from different fields such as medicinal chemistry, biology, and computer science often collaborate to develop novel pharmaceutical agents. Computational models developed in this process must be correct and reliable, but at the same time interpretable. Their findings have to be accessible to experts from fields other than computer science so that they can be validated and improved with domain knowledge. Only then can interdisciplinary teams communicate their scientific results both precisely and intuitively. This work is concerned with the development and interpretation of machine learning models for drug discovery. To this end, it describes the design and application of computational models for specialized use cases, such as compound profiling and hit expansion. Novel insights into machine learning for ligand-based virtual screening are presented, and limitations in the modeling of compound potency values are highlighted. It is shown that compound activity can be predicted based on high-dimensional target profiles, without the presence of molecular structures. Moreover, support vector regression for potency prediction is carefully analyzed, and a systematic misprediction of highly potent ligands is discovered. Furthermore, a key aspect is the interpretation and chemically accessible representation of the models. Therefore, this thesis focuses especially on methods to better understand and communicate modeling results. To this end, two interactive visualizations for the assessment of naive Bayes and support vector machine models on molecular fingerprints are presented. These visual representations of virtual screening models are designed to provide an intuitive chemical interpretation of the results.

    Exploring QSAR models for activity-cliff prediction

    Introduction and Methodology: Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that QSAR models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. Results and Conclusions: We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that QSAR methods indeed frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.
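The activity-cliff definition in the abstract above (a pair of similar compounds with a large difference in binding affinity) can be illustrated with a small sketch. This is not the authors' procedure: the use of Tanimoto similarity on sets of fingerprint bits, the similarity threshold of 0.9, and the activity-difference threshold of 2 log units are all assumptions chosen for illustration, as are the function names and the toy fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)


def is_activity_cliff(fp_a, fp_b, act_a, act_b,
                      sim_threshold=0.9, diff_threshold=2.0):
    """Flag a compound pair as an activity cliff.

    Both thresholds are hypothetical: the pair must be structurally
    similar (Tanimoto >= sim_threshold) and differ strongly in activity
    (absolute difference >= diff_threshold, in log units such as pKi).
    """
    similar = tanimoto(fp_a, fp_b) >= sim_threshold
    large_gap = abs(act_a - act_b) >= diff_threshold
    return similar and large_gap


# Toy fingerprints: two compounds sharing 19 of 21 distinct on-bits.
fp_1 = set(range(20))
fp_2 = set(range(19)) | {99}

cliff = is_activity_cliff(fp_1, fp_2, act_a=8.5, act_b=6.0)      # gap of 2.5
non_cliff = is_activity_cliff(fp_1, fp_2, act_a=8.5, act_b=8.2)  # gap of 0.3
```

In this toy case the similarity is 19 / 21, roughly 0.905, so only the size of the activity gap decides whether the pair counts as a cliff.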