12 research outputs found

    Machine Learning Applications for Drug Repurposing

    The cost of bringing a drug to market is astounding and the failure rate is intimidating. Drug discovery has had limited success under the conventional reductionist one-drug-one-gene-one-disease paradigm, in which a single disease-associated gene is identified and a molecular binder to that specific target is subsequently designed. Under this simplistic paradigm, a drug molecule is assumed to interact only with its intended target. However, small-molecule drugs often interact with multiple targets, and those off-target interactions are not considered under the conventional paradigm. As a result, drug-induced side effects and adverse reactions are often neglected until a very late stage of drug discovery, when the discovery of side effects and potential drug resistance can decrease the value of the drug or even completely invalidate its use. Thus, a new paradigm in drug discovery is needed. Structural systems pharmacology is a new paradigm in which drug activities are studied with data-driven, large-scale models that take structures and drugs into account. Structural systems pharmacology will model, on a genome scale, the energetic and dynamic modifications of protein targets by drug molecules as well as the subsequent collective effects of drug-target interactions on phenotypic drug responses. To date, however, few experimental and computational methods can determine genome-wide protein-ligand interaction networks and the clinical outcomes mediated by them. As a result, the small-molecule ligands of most proteins have not been charted, and our understanding of drug actions is limited. To address this challenge, this dissertation seeks to develop and experimentally validate innovative computational methods to infer genome-wide protein-ligand interactions and multi-scale drug-phenotype associations, including drug-induced side effects. The hypothesis is that integrating data-driven bioinformatics tools with structure- and mechanism-based molecular modeling methods will lead to an optimal tool for accurately predicting drug actions and drug-associated phenotypic responses, such as side effects. This dissertation starts by reviewing the current status of computational drug discovery for complex diseases in Chapter 1. In Chapter 2, we present REMAP, a one-class collaborative filtering method to predict off-target interactions from a protein-ligand interaction network. In our later work, REMAP was integrated with structural genomics and statistical machine learning methods to design a dual-indication polypharmacological anticancer therapy. In Chapter 3, we extend REMAP, the core method in Chapter 2, into a multi-ranked collaborative filtering algorithm, WINTF, and present relevant mathematical justifications. Chapter 4 is an application of WINTF to repurpose an FDA-approved drug, diazoxide, as a potential treatment for triple-negative breast cancer, a deadly subtype of breast cancer. In Chapter 5, we present a multilayer extension of REMAP, applied to predict drug-induced side effects and the associated biological pathways. In Chapter 6, we close this dissertation by presenting a deep learning application that learns biochemical features from protein sequence representations using a natural language processing method.

    Improved genome-scale multitarget virtual screening via a novel collaborative filtering approach to cold-start problem

    The conventional one-drug-one-gene approach has had limited success in modern drug discovery. Polypharmacology, which focuses on searching for multi-targeted drugs to perturb disease-causing networks instead of designing selective ligands to target individual proteins, has emerged as a new drug discovery paradigm. Although many methods for single-target virtual screening have been developed to improve the efficiency of drug discovery, few of these algorithms are designed for polypharmacology. Here, we present a novel theoretical framework and a corresponding algorithm for genome-scale multitarget virtual screening based on the one-class collaborative filtering technique. Our method overcomes the sparseness of the protein-chemical interaction data by means of interaction matrix weighting and dual regularization from both chemicals and proteins. While the statistical foundation behind our method is general enough to encompass genome-wide drug off-target prediction, the program is specifically tailored to find protein targets for new chemicals with little to no available interaction data. We extensively evaluate our method using a number of the most widely accepted gene-specific and cross-gene family benchmarks and demonstrate that our method outperforms other state-of-the-art algorithms for predicting the interaction of new chemicals with multiple proteins. Thus, the proposed algorithm may provide a powerful tool for multi-target drug design.
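
    As a rough illustration of the idea in the abstract above (interaction matrix weighting plus regularization from both the chemical and the protein side), the sketch below factorizes a binary chemical-protein matrix with a low weight on unobserved pairs and a graph-Laplacian penalty built from similarity matrices. It is not the authors' implementation: the function name `weighted_one_class_mf`, every hyperparameter, the use of plain gradient descent, and the toy data are assumptions made for illustration.

```python
import numpy as np

def weighted_one_class_mf(R, S_chem, S_prot, rank=2, w_unobserved=0.1,
                          lam=0.1, alpha=0.1, lr=0.05, n_iter=2000, seed=0):
    """Hypothetical sketch: weighted low-rank factorization R ~ U @ V.T."""
    rng = np.random.default_rng(seed)
    n_chem, n_prot = R.shape
    U = 0.1 * rng.standard_normal((n_chem, rank))
    V = 0.1 * rng.standard_normal((n_prot, rank))
    # Known interactions get full weight; unobserved pairs are treated as
    # weak negatives rather than confirmed non-interactions.
    W = np.where(R > 0, 1.0, w_unobserved)
    # Graph Laplacians pull similar chemicals (proteins) toward similar factors.
    L_chem = np.diag(S_chem.sum(axis=1)) - S_chem
    L_prot = np.diag(S_prot.sum(axis=1)) - S_prot
    for _ in range(n_iter):
        E = W * (U @ V.T - R)  # weighted reconstruction error
        grad_U = E @ V + lam * U + alpha * (L_chem @ U)
        grad_V = E.T @ U + lam * V + alpha * (L_prot @ V)
        U -= lr * grad_U
        V -= lr * grad_V
    return U @ V.T

# Toy data (assumed): 4 chemicals x 3 proteins; chemical 3 has no known targets
# but is structurally similar to chemical 2.
R = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
S_chem = np.array([[0.0, 0.9, 0.0, 0.0],
                   [0.9, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.9],
                   [0.0, 0.0, 0.9, 0.0]])
S_prot = np.zeros((3, 3))  # no protein similarity information in this toy case
scores = weighted_one_class_mf(R, S_chem, S_prot)
```

    Inspecting `scores[3]` is one way to check how a chemical with no known interactions is ranked through its similarity to other chemicals, which is the cold-start setting the abstract refers to.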

    Exploration of chemical space with partial labeled noisy student self‑training and self‑supervised graph embedding

    Background: Drug discovery is time-consuming and costly. Machine learning, especially deep learning, shows great potential in quantitative structure–activity relationship (QSAR) modeling to accelerate the drug discovery process and reduce its cost. A big challenge in developing robust and generalizable deep learning models for QSAR is the lack of a large amount of data with high-quality and balanced labels. To address this challenge, we developed a self-training method, Partially LAbeled Noisy Student (PLANS), and a novel self-supervised graph embedding, Graph-Isomorphism-Network Fingerprint (GINFP), which uses unlabeled data to learn chemical compound representations that carry substructure information. The representations can be used for predicting chemical properties such as binding affinity, toxicity, and others. PLANS-GINFP allows us to exploit millions of unlabeled chemical compounds as well as labeled and partially labeled pharmacological data to improve the generalizability of neural network models. Results: We evaluated the performance of PLANS-GINFP for predicting Cytochrome P450 (CYP450) binding activity in a CYP450 dataset and chemical toxicity in the Tox21 dataset. The extensive benchmark studies demonstrated that PLANS-GINFP could improve the performance in both cases by a large margin. Both PLANS-based self-training and GINFP-based self-supervised learning contribute to the performance improvement. Conclusion: To better exploit chemical structures as an input for machine learning algorithms, we proposed a self-supervised graph neural network-based embedding method that can encode substructure information. Furthermore, we developed a model-agnostic self-training method, PLANS, that can be applied to any deep learning architecture to improve prediction accuracy. PLANS provides a way to better utilize partially labeled and unlabeled data. Comprehensive benchmark studies demonstrated the potential of both methods in predicting drug metabolism and toxicity profiles using sparse, noisy, and imbalanced data. PLANS-GINFP could serve as a general solution to improve predictive QSAR modeling.
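
    The self-training idea in the abstract above can be sketched, under heavy simplification, as a single teacher-student round: a teacher is fit on the observed labels only, its predictions fill in missing labels and label the unlabeled pool, and a student is then trained on everything with input noise. This is not the PLANS implementation; the multi-task logistic regression, the soft pseudo-labels, the Gaussian input noise, and all names (`fit_logistic`, `predict`, `noisy_student_round`) are assumptions chosen to keep the example small.

```python
import numpy as np

def fit_logistic(X, Y, mask, n_iter=300, lr=0.5, noise=0.0, seed=0):
    """Multi-task logistic regression; mask marks which labels contribute."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        # Input noise is only non-zero when training the student.
        Xn = X + noise * rng.standard_normal(X.shape)
        P = 1.0 / (1.0 + np.exp(-(Xn @ W)))
        grad = Xn.T @ (mask * (P - Y)) / max(mask.sum(), 1.0)
        W -= lr * grad
    return W

def predict(X, W):
    return 1.0 / (1.0 + np.exp(-(X @ W)))

def noisy_student_round(X_lab, Y_partial, X_unlab, noise=0.1):
    """One teacher -> student round; NaN in Y_partial marks a missing label."""
    observed = ~np.isnan(Y_partial)
    Y_obs = np.where(observed, Y_partial, 0.0)
    teacher = fit_logistic(X_lab, Y_obs, observed.astype(float))
    # Soft pseudo-labels: fill missing entries and label the unlabeled pool.
    Y_filled = np.where(observed, Y_partial, predict(X_lab, teacher))
    Y_pseudo = predict(X_unlab, teacher)
    X_all = np.vstack([X_lab, X_unlab])
    Y_all = np.vstack([Y_filled, Y_pseudo])
    student = fit_logistic(X_all, Y_all, np.ones_like(Y_all), noise=noise)
    return teacher, student

# Synthetic data (assumed): 2 tasks, about 30% of labels hidden.
rng = np.random.default_rng(0)
X_lab = rng.standard_normal((40, 5))
true_W = rng.standard_normal((5, 2))
Y_full = (X_lab @ true_W > 0).astype(float)
Y_partial = Y_full.copy()
Y_partial[rng.random(Y_full.shape) < 0.3] = np.nan
X_unlab = rng.standard_normal((200, 5))
teacher, student = noisy_student_round(X_lab, Y_partial, X_unlab)
probs = predict(X_unlab, student)
```

    In practice the student would become the next teacher and the round would repeat; a single round is shown here only to make the data flow explicit.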

    Crowdsourced mapping of unexplored target space of kinase inhibitors

    Despite decades of intensive search for compounds that modulate the activity of particular protein targets, a large proportion of the human kinome remains as yet undrugged. Effective approaches are therefore required to map the massive space of unexplored compound-kinase interactions for novel and potent activities. Here, we carry out a crowdsourced benchmarking of predictive algorithms for kinase inhibitor potencies across multiple kinase families tested on unpublished bioactivity data. We find the top-performing predictions are based on various models, including kernel learning, gradient boosting and deep learning, and their ensemble leads to a predictive accuracy exceeding that of single-dose kinase activity assays. We design experiments based on the model predictions and identify unexpected activities even for under-studied kinases, thereby accelerating experimental mapping efforts. The open-source prediction algorithms together with the bioactivities between 95 compounds and 295 kinases provide a resource for benchmarking prediction algorithms and for extending the druggable kinome. The IDG-DREAM Challenge carried out crowdsourced benchmarking of predictive algorithms for kinase inhibitor activities on unpublished data. This study provides a resource to compare emerging algorithms and prioritize new kinase activities to accelerate drug discovery and repurposing efforts

    Rational discovery of dual-indication multi-target PDE/Kinase inhibitor for precision anti-cancer therapy using structural systems pharmacology.

    Many complex diseases such as cancer are associated with multiple pathological manifestations. Moreover, the therapeutics used to treat them often lead to serious side effects. Thus, there is a need to develop multi-indication therapeutics that can simultaneously target multiple clinical indications of interest and mitigate the side effects. However, the conventional one-drug-one-gene drug discovery paradigm and the emerging polypharmacology approach rarely tackle the challenge of multi-indication drug design. For the first time, we propose a one-drug-multi-target-multi-indication strategy. We develop a novel structural systems pharmacology platform, 3D-REMAP, that uses ligand binding site comparison and protein-ligand docking to augment sparse chemical genomics data for the machine learning model of genome-scale chemical-protein interaction prediction. Experimentally validated predictions systematically show that 3D-REMAP outperforms state-of-the-art ligand-based, receptor-based, and machine learning methods alone. As a proof of concept, we use drug repurposing enabled by 3D-REMAP to design a dual-indication anti-cancer therapy. The repurposed drug can demonstrate anti-cancer activity for cancers that do not have effective treatment as well as reduce the risk of heart failure that is associated with all types of existing anti-cancer therapies. We predict that levosimendan, a PDE inhibitor for heart failure, inhibits serine/threonine-protein kinase RIOK1 and other kinases. Subsequent experiments and systems biology analyses confirm this prediction, and suggest that levosimendan is active against multiple cancers, notably lymphoma, through the direct inhibition of RIOK1 and the RNA processing pathway. We further develop machine learning models to predict the responses of cancer cell lines and of patients to levosimendan. Our findings suggest that levosimendan can be a promising novel lead compound for the development of safe, effective, and precision multi-indication anti-cancer therapy. This study demonstrates the potential of structural systems pharmacology in designing polypharmacology for precision medicine. It may facilitate transforming the conventional one-drug-one-gene-one-disease drug discovery process and the single-indication polypharmacology approach into a new one-drug-multi-target-multi-indication paradigm for complex diseases.

    Large-Scale Off-Target Identification Using Fast and Accurate Dual Regularized One-Class Collaborative Filtering and Its Application to Drug Repurposing

    Target-based screening is one of the major approaches in drug discovery. Besides the intended target, unexpected drug off-target interactions often occur, and many of them have not been recognized and characterized. The off-target interactions can be responsible for either therapeutic or side effects. Thus, identifying the genome-wide off-targets of lead compounds or existing drugs will be critical for designing effective and safe drugs, and for providing new opportunities for drug repurposing. Although many computational methods have been developed to predict drug-target interactions, they are either less accurate than the one proposed here or computationally too intensive, which limits their capability for large-scale off-target identification. In addition, the performance of most machine learning-based algorithms has mainly been evaluated on predicting off-target interactions within the same gene family for hundreds of chemicals. It is not clear how these algorithms perform in detecting off-targets across gene families on a proteome scale. Here, we present a fast and accurate off-target prediction method, REMAP, which is based on a dual regularized one-class collaborative filtering algorithm, to explore continuous chemical space, protein space, and their interactome on a large scale. When tested in a reliable, extensive, and cross-gene family benchmark, REMAP outperforms the state-of-the-art methods. Furthermore, REMAP is highly scalable. It can screen a dataset of 200 thousand chemicals against 20 thousand proteins within 2 hours. Using the reconstructed genome-wide target profile as the fingerprint of a chemical compound, we predicted that seven FDA-approved drugs can be repurposed as novel anti-cancer therapies. The anti-cancer activity of six of them is supported by experimental evidence. Thus, REMAP is a valuable addition to the existing in silico toolbox for drug target identification, drug repurposing, phenotypic screening, and side effect prediction. The software and benchmark are available at https://github.com/hansaimlim/REMAP.
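
    The abstract above describes using a compound's reconstructed genome-wide target profile as its fingerprint. One simple way to compare such fingerprints is cosine similarity between rows of a predicted score matrix. The choice of cosine similarity, the function name `profile_similarity`, and the toy scores below are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def profile_similarity(scores):
    """Cosine similarity between rows of a chemical x protein score matrix."""
    scores = np.asarray(scores, dtype=float)
    norms = np.linalg.norm(scores, axis=1, keepdims=True)
    # Rows that are all zero keep a zero profile instead of dividing by zero.
    unit = scores / np.where(norms == 0.0, 1.0, norms)
    return unit @ unit.T

# Toy predicted profiles (assumed): 3 chemicals x 2 proteins.
toy_scores = [[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 1.0]]
sim = profile_similarity(toy_scores)
```

    Chemicals whose predicted profiles point in the same direction (rows 0 and 1 here) receive a similarity of 1, which is the kind of signal a clustering step could use to group drugs by predicted target profile.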

    Performance comparison for REMAP (green), PRW (blue), and NRLMF (orange).

    NT2 (2 known targets per chemical) datasets were used for varying number of ligands (A) and chemical structural similarity (B). The performance measurement is explained in the section on measuring the prediction accuracy of REMAP by TPR vs. cutoff rank. (A) Performance comparison on the datasets with varying number of ligands per protein. For example, the x-axis label L11to15 means that the proteins of interest have between 11 and 15 known binding chemicals. (B) Performance comparison on the datasets grouped by the range of chemical structural similarity between the tested chemicals and the trained chemicals. For instance, the x-axis label Tc0.6to0.7 means that, for each tested chemical, at least one trained chemical was found with similarity in the range 0.6 to 0.7, and no trained chemical was found with similarity greater than 0.7. All TPR values are based on 10-fold cross-validation. Error bars represent s.e.m. Asterisks represent statistical significance based on a t-test of the 10 TPR values (* for p < 0.05, ** for p < 0.001).
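
    The caption above reports TPR against a cutoff rank, averaged over folds with s.e.m. error bars. A minimal sketch of such a measurement follows, assuming that each held-out interaction is ranked among all candidate proteins by predicted score and counts as a hit when its rank is within the cutoff. The exact protocol in the paper may differ; the function names and toy scores are hypothetical.

```python
import math
import statistics

def tpr_at_cutoff(score_rows, true_indices, cutoff):
    """Fraction of held-out targets ranked within the top `cutoff` proteins."""
    hits = 0
    for scores, true_idx in zip(score_rows, true_indices):
        # Rank 1 is the highest score; ties are resolved in favour of the
        # held-out target (an assumption of this sketch).
        rank = 1 + sum(1 for s in scores if s > scores[true_idx])
        if rank <= cutoff:
            hits += 1
    return hits / len(true_indices)

def sem(values):
    """Standard error of the mean, e.g. over per-fold TPR values."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Toy scores (assumed): one row per held-out interaction, one column per protein.
score_rows = [[0.9, 0.1, 0.3],
              [0.2, 0.8, 0.5],
              [0.4, 0.6, 0.7]]
true_indices = [0, 2, 0]
tpr_top1 = tpr_at_cutoff(score_rows, true_indices, cutoff=1)
tpr_top2 = tpr_at_cutoff(score_rows, true_indices, cutoff=2)
```

    With these toy scores the held-out targets are ranked 1st, 2nd, and 3rd, so the TPR rises from 1/3 at cutoff 1 to 2/3 at cutoff 2.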

    The known uses and target information for the anti-cancer drug cluster in Fig 8B obtained from DrugBank.

    The known targets are given as UniProt accessions. The target information from UniProt is in S1 Table (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1005135#pcbi.1005135.s009).

    The overall process of REMAP. The rectangular boxes with capitalized symbols are matrices, and the smaller boxes and ovals are chemicals and proteins, respectively, in the simplified network representation (top-left corner).

    Solid lines within the network represent connectivity (edges), and the arrows represent mathematical processes. Red squares represent single similarity values, and blue bars in U and V represent row and column vectors. Lower-case c and p represent chemicals and proteins, respectively. The letter symbols are annotated in Table 1 (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1005135#pcbi.1005135.t001).

    Average running times of REMAP using a single core node with 2.88 GB of memory. All running times are in seconds.

    (A) Average running times on the ZINC dataset (12,384 chemicals and 3,500 proteins) as a function of the low rank (r). The linear fit (orange line) has R^2 = 0.9856. (B) Average running times as a function of the number of proteins (columns), from 1,000 to 20,000. The number of chemicals (rows) was fixed at 200,000. Error bars represent s.e.m., with n ≥ 15 for (A) and n ≥ 30 for (B).
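
    The caption above summarizes the scaling of running time with a linear fit and its R^2. For readers who want to reproduce that kind of summary on their own timings, a minimal sketch follows. The function name `linear_fit_r2` is hypothetical, and the numbers in the example are made-up placeholders, not measurements from the paper.

```python
import numpy as np

def linear_fit_r2(x, y):
    """Least-squares line y = slope * x + intercept and its R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))       # unexplained variation
    ss_tot = float(np.sum((y - y.mean()) ** 2))  # total variation
    return slope, intercept, 1.0 - ss_res / ss_tot

# Placeholder values (assumed): rank r versus running time in seconds.
ranks = [100, 200, 300, 400, 500]
times = [12.0, 23.5, 36.1, 47.8, 60.2]
slope, intercept, r2 = linear_fit_r2(ranks, times)
```

    An R^2 close to 1 indicates that running time grows approximately linearly with the chosen variable over the measured range; it says nothing about behavior outside that range.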