657 research outputs found

    Machine learning approaches for computer aided drug discovery

    Pharmaceutical drug discovery is expensive, time-consuming and scientifically challenging. To increase the efficiency of the pre-clinical drug discovery pathway, computational methods and, most recently, machine learning-based methods are increasingly used as powerful tools to aid early-stage drug discovery. In this thesis, I present three complementary computer-aided drug discovery methods, with a focus on aiding hit discovery and hit-to-lead optimisation. In addition, this thesis focuses on exploring different molecular representations used to featurise machine learning models, in order to explore how best to capture valuable information about proteins, ligands and 3D protein-ligand complexes to build more robust, more interpretable and more accurate machine learning models. First, I developed ligand-based models using a Gaussian Process (GP) as an easy-to-implement tool to guide exploration of chemical space for the optimisation of protein-ligand binding affinity. I explored different topological fingerprint and autoencoder representations for Bayesian optimisation (BO) and showed that BO is a powerful tool to help medicinal chemists prioritise which new compounds to make for single-target as well as multi-target optimisation. The algorithm achieved high enrichment of top compounds for both single-target and multi-objective optimisation when tested on a well-known benchmark dataset of the drug target matrix metalloproteinase-12 and on a real, ongoing drug optimisation dataset targeting four bacterial metallo-β-lactamases. Next, I present the development of a knowledge-based approach to drug design, combining new protein-ligand interaction fingerprints with a fragment-based drug discovery approach to understand SARS-CoV-2 Mpro-substrate specificity and to design novel small molecule inhibitors in silico. In combination with a fragment-based drug discovery approach, I show how this knowledge-based, interaction fingerprint-driven approach can reveal fruitful fragment-growth design strategies. Lastly, I expand on the knowledge-based contact fingerprints to create a ligand-shaped molecular graph representation (Protein Ligand Interaction Graphs, PLIGs) to develop novel graph-based deep learning protein-ligand binding affinity scoring functions. PLIGs encode all intermolecular interactions in a protein-ligand complex within the node features of the graph and are therefore simple and fully interpretable. I explored a variety of Graph Neural Network (GNN) architectures in combination with PLIGs and found Graph Attention Networks to perform slightly better than other GNN architectures, performing amongst the best known protein-ligand binding affinity scoring functions.
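
The idea of storing protein-ligand contacts in the node features of a ligand-shaped graph can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `contact_node_features`, the simplified element-based type list and the 6.0 Å cutoff are all assumptions chosen for illustration. Each ligand atom receives a vector counting the protein atoms of each type found within the cutoff.

```python
import math
from collections import Counter

# Hypothetical, simplified protein atom typing (by element only).
PROTEIN_ATOM_TYPES = ("C", "N", "O", "S")
# Illustrative distance cutoff in angstroms; an assumption, not a value from the abstract.
CUTOFF = 6.0


def contact_node_features(ligand_atoms, protein_atoms, cutoff=CUTOFF):
    """Return one contact-count vector per ligand atom.

    ligand_atoms and protein_atoms are lists of (element, (x, y, z)) tuples.
    Each vector counts protein atoms of every type within `cutoff` of the ligand atom.
    """
    features = []
    for _, lig_xyz in ligand_atoms:
        counts = Counter(
            p_type
            for p_type, p_xyz in protein_atoms
            if math.dist(lig_xyz, p_xyz) <= cutoff
        )
        features.append([counts.get(t, 0) for t in PROTEIN_ATOM_TYPES])
    return features


# Toy coordinates, made up for the example.
ligand = [("C", (0.0, 0.0, 0.0)), ("O", (1.4, 0.0, 0.0))]
protein = [("N", (3.0, 0.0, 0.0)), ("O", (0.0, 5.9, 0.0)), ("C", (20.0, 0.0, 0.0))]
node_features = contact_node_features(ligand, protein)
```

Because every entry in a node vector is a plain contact count, a reader can trace any feature back to specific protein atoms, which is the sense in which such a representation is interpretable.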

    Structure-based drug discovery with deep learning

    Artificial intelligence (AI) in the form of deep learning bears promise for drug discovery and chemical biology, e.g., to predict protein structure and molecular bioactivity, plan organic synthesis, and design molecules de novo. While most of the deep learning efforts in drug discovery have focused on ligand-based approaches, structure-based drug discovery has the potential to tackle unsolved challenges, such as affinity prediction for unexplored protein targets, binding-mechanism elucidation, and the rationalization of related chemical kinetic properties. Advances in deep learning methodologies and the availability of accurate predictions for protein tertiary structure advocate for a renaissance in structure-based approaches for drug discovery guided by AI. This review summarizes the most prominent algorithmic concepts in structure-based deep learning for drug discovery, and forecasts opportunities, applications, and challenges ahead.

    From Static to Dynamic Structures: Improving Binding Affinity Prediction with a Graph-Based Deep Learning Model

    Accurate prediction of protein-ligand binding affinities is an essential challenge in structure-based drug design. Despite recent advances in data-driven methods for affinity prediction, their accuracy is still limited, partially because they only take advantage of static crystal structures, while the actual binding affinities are generally described by the thermodynamic ensembles between proteins and ligands. One effective way to approximate such a thermodynamic ensemble is to use molecular dynamics (MD) simulation. Here, we curated an MD dataset containing 3,218 different protein-ligand complexes, and further developed Dynaformer, a graph-based deep learning model. Dynaformer was able to accurately predict binding affinities by learning the geometric characteristics of the protein-ligand interactions from the MD trajectories. In silico experiments demonstrated that our model exhibits state-of-the-art scoring and ranking power on the CASF-2016 benchmark dataset, outperforming the methods hitherto reported. Moreover, we performed a virtual screening on heat shock protein 90 (HSP90) using Dynaformer that identified 20 candidates, and we further experimentally validated their binding affinities. We demonstrated that our approach is more efficient, identifying 12 hit compounds (two in the submicromolar range), including several newly discovered scaffolds. We anticipate that this new synergy between large-scale MD datasets and deep learning models will provide a new route toward accelerating the early drug discovery process.

    Machine Learning for Kinase Drug Discovery

    Cancer is one of the major public health issues, claiming several million lives every year. Although anti-cancer drugs have been developed and are globally administered, mild to severe side effects are known to occur during treatment. Computer-aided drug discovery has become a cornerstone for unveiling treatments of existing as well as emerging diseases. Computational methods aim not only to speed up the drug design process, but also to reduce time-consuming, costly experiments, as well as in vivo animal testing. In this context, over the last decade especially, deep learning began to play a prominent role in the prediction of molecular activity, property and toxicity. However, there are still major challenges when applying deep learning models in drug discovery. Those challenges include data scarcity for physicochemical tasks, the difficulty of interpreting the predictions made by deep neural networks, and the necessity of open-source and robust workflows to ensure reproducibility and reusability. In this thesis, after reviewing the state of the art in deep learning applied to virtual screening, we address the previously mentioned challenges as follows. Regarding data scarcity in the context of deep learning applied to small molecules, we developed data augmentation techniques based on the SMILES encoding. This linear string notation enumerates the atoms present in a compound by following a path along the molecular graph. Multiple SMILES for a single compound can be obtained by traversing the graph using different paths. We applied the developed augmentation techniques to three different deep learning models, including convolutional and recurrent neural networks, and to four property and activity data sets. The results show that augmentation improves model accuracy independently of the deep learning model, as well as of the data set size. Moreover, we computed the uncertainty of a model by using augmentation at inference time. In this regard, we have shown that the more confident the model is in its prediction, the smaller the error, implying that a given prediction can be trusted and is close to the target value. The software and associated documentation allow predictions to be made for novel compounds and have been made freely available. Trusting predictions blindly from algorithms may have serious consequences in areas of healthcare. In this context, better understanding how a neural network classifies a compound based on its input features is highly beneficial, helping to de-risk and optimize compounds. In this research project, we decomposed the inner layers of a deep neural network to identify the toxic substructures, the toxicophores, of a compound that led to the toxicity classification. Using molecular fingerprints (vectors that indicate the presence or absence of a particular atomic environment), we were able to map a toxicity score to each of these substructures. Moreover, we developed a method to visualize in 2D the toxicophores within a compound, the so-called cytotoxicity maps, which could be of great use to medicinal chemists in identifying ways to modify molecules to eliminate toxicity. Not only does the deep learning model reach state-of-the-art results, but the identified toxicophores confirm known toxic substructures, as well as expanding the set of potential candidates. In order to speed up the drug discovery process, access to robust and modular workflows is extremely advantageous. In this context, the fully open-source TeachOpenCADD project was developed. Significant tasks in both cheminformatics and bioinformatics are implemented in a pedagogical fashion, allowing the material to be used for teaching as well as the starting point for novel research. In this framework, a special pipeline is dedicated to kinases, a family of proteins which are known to be involved in diseases such as cancer. The aim is to gain insights into off-targets, i.e. proteins that are unintentionally affected by a compound, and that can cause adverse effects in treatments. Four measures of kinase similarity are implemented, taking into account sequence and structural information, as well as protein-ligand interaction and ligand profiling data. The workflow provides clustering of a set of kinases, which can be further analyzed to understand off-target effects of inhibitors. Results show that analyzing kinases from several perspectives is crucial for insight into off-target prediction and for gaining a global perspective of the kinome. These novel methods can be exploited in the discovery of new drugs, more specifically for diseases involving the dysregulation of kinases, such as cancer.
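
The use of SMILES augmentation at inference time to estimate uncertainty can be sketched as follows. This is an illustrative assumption about how such an estimate might be computed, not the thesis software: `toy_model` is a hypothetical stand-in for a trained string-based model, and the three equivalent SMILES for ethanol are written out by hand rather than generated by a cheminformatics toolkit. The spread of the predictions across variants serves as a simple uncertainty proxy.

```python
import statistics


def toy_model(smiles: str) -> float:
    """Hypothetical stand-in for a trained model that reads a SMILES string.

    Its output deliberately depends on how the string is written, so that
    different SMILES of the same compound yield different predictions.
    """
    return 0.1 * len(smiles) + 0.05 * smiles.index("O")


def predict_with_uncertainty(smiles_variants, model):
    """Average the predictions over several SMILES of one compound.

    Returns (mean prediction, population standard deviation); the standard
    deviation is used here as an illustrative measure of model uncertainty.
    """
    preds = [model(s) for s in smiles_variants]
    return statistics.mean(preds), statistics.pstdev(preds)


# Three valid SMILES strings for the same molecule (ethanol),
# each corresponding to a different traversal of the molecular graph.
ethanol_variants = ["CCO", "OCC", "C(O)C"]
mean_pred, spread = predict_with_uncertainty(ethanol_variants, toy_model)
```

A small spread means the model gives nearly the same answer however the compound is written, which is the kind of confidence the abstract relates to lower error.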

    3D Convolutional Neural Networks for Computational Drug Discovery

    This thesis describes aspects of the implementation and application of voxel-based convolutional neural networks (CNNs) to problems in computational drug discovery. It opens by justifying the novelty of this approach by presenting a more mainstream approach to the common tasks of virtual screening and binding pose prediction, augmented with more simplistic machine learning methods, and demonstrating their suboptimal performance when applied prospectively. It then describes my contributions to our group’s development of voxel-based CNNs as we honed their implementation and training strategy, and reports our library that facilitates featurization and training using this approach. It continues with a prospective assessment of their performance, analogous to the first prospective evaluation, with the addition of a novel CNN-based pose sampling strategy. Next, it makes a foray into model explanation, first in an oblique fashion, by examining the transferability of models to tasks that are distinct from but related to the tasks for which they were trained, and by a comparison with an approach based on exploiting dataset bias using other machine learning methods. Finally, it describes the implementation of a more direct approach to model explanation, by using a trained network to perform optimization of inputs with respect to the network as a whole or individual nodes and analyzing the content of the result as well as its utility as a pseudo-pharmacophore.