6 research outputs found

    Machine Learning Small Molecule Properties in Drug Discovery

    Machine learning (ML) is a promising approach for predicting small molecule properties in drug discovery. Here, we provide a comprehensive overview of various ML methods introduced for this purpose in recent years. We review a wide range of properties, including binding affinities, solubility, and ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity). We discuss existing popular datasets and molecular descriptors and embeddings, such as chemical fingerprints and graph-based neural networks. We also highlight the challenges of predicting and optimizing multiple properties during the hit-to-lead and lead optimization stages of drug discovery, and briefly explore possible multi-objective optimization techniques that can be used to balance diverse properties while optimizing lead candidates. Finally, techniques to provide an understanding of model predictions, especially for critical decision-making in drug discovery, are assessed. Overall, this review provides insights into the landscape of ML models for small molecule property predictions in drug discovery. So far, there are multiple diverse approaches, but their performances are often comparable. Neural networks, while more flexible, do not always outperform simpler models. This shows that the availability of high-quality training data remains crucial for training accurate models, and that there is a need for standardized benchmarks, additional performance metrics, and best practices to enable richer comparisons between the different techniques and models.
    Comment: 46 pages, 1 figure

    Machine Learning for Kinase Drug Discovery

    Cancer is one of the major public health issues, causing several million deaths every year. Although anti-cancer drugs have been developed and are globally administered, mild to severe side effects are known to occur during treatment. Computer-aided drug discovery has become a cornerstone for unveiling treatments of existing as well as emerging diseases. Computational methods aim not only to speed up the drug design process, but also to reduce time-consuming, costly experiments, as well as in vivo animal testing. In this context, over the last decade especially, deep learning began to play a prominent role in the prediction of molecular activity, property and toxicity. However, there are still major challenges when applying deep learning models in drug discovery. Those challenges include data scarcity for physicochemical tasks, the difficulty of interpreting the predictions made by deep neural networks, and the necessity of open-source and robust workflows to ensure reproducibility and reusability. In this thesis, after reviewing the state-of-the-art in deep learning applied to virtual screening, we address the previously mentioned challenges as follows.
    Regarding data scarcity in the context of deep learning applied to small molecules, we developed data augmentation techniques based on the SMILES encoding. This linear string notation enumerates the atoms present in a compound by following a path along the molecule graph. Multiple SMILES for a single compound can be obtained by traversing the graph along different paths. We applied the developed augmentation techniques to three different deep learning models, including convolutional and recurrent neural networks, and to four property and activity data sets. The results show that augmentation improves the model accuracy independently of the deep learning model, as well as of the data set size. Moreover, we computed the uncertainty of a model by using augmentation at inference time. In this regard, we have shown that the more confident the model is in its prediction, the smaller the error, implying that a confident prediction can be trusted and is close to the target value. The software and associated documentation allow making predictions for novel compounds and have been made freely available.
    Trusting predictions blindly from algorithms may have serious consequences in areas of healthcare. In this context, better understanding how a neural network classifies a compound based on its input features is highly beneficial, as it helps to de-risk and optimize compounds. In this research project, we decomposed the inner layers of a deep neural network to identify the toxic substructures of a compound, the toxicophores, that led to the toxicity classification. Using molecular fingerprints (vectors that indicate the presence or absence of a particular atomic environment), we were able to map a toxicity score to each of these substructures. Moreover, we developed a method to visualize in 2D the toxicophores within a compound, the so-called cytotoxicity maps, which could be of great use to medicinal chemists in identifying ways to modify molecules to eliminate toxicity. Not only does the deep learning model reach state-of-the-art results, but the identified toxicophores also confirm known toxic substructures and suggest new potential candidates.
    In order to speed up the drug discovery process, access to robust and modular workflows is extremely advantageous. In this context, the fully open-source TeachOpenCADD project was developed. Significant tasks in both cheminformatics and bioinformatics are implemented in a pedagogical fashion, allowing the material to be used for teaching as well as a starting point for novel research. In this framework, a special pipeline is dedicated to kinases, a family of proteins known to be involved in diseases such as cancer. The aim is to gain insights into off-targets, i.e. proteins that are unintentionally affected by a compound and that can cause adverse effects in treatments. Four measures of kinase similarity are implemented, taking into account sequence and structural information, as well as protein-ligand interaction and ligand profiling data. The workflow provides a clustering of a set of kinases, which can be further analyzed to understand off-target effects of inhibitors. Results show that analyzing kinases from several perspectives is crucial for insight into off-target prediction and for gaining a global perspective of the kinome. These novel methods can be exploited in the discovery of new drugs, more specifically for diseases involving the dysregulation of kinases, such as cancer.
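
    The SMILES augmentation and inference-time uncertainty described in the abstract above can be illustrated with a minimal sketch. This is not the thesis software: the toy writer below handles only acyclic molecules with single bonds, and the names write_smiles, enumerate_smiles, predict_with_uncertainty and toy_model are hypothetical, chosen for illustration. A cheminformatics toolkit would normally be used for real molecules.

```python
from statistics import mean, pstdev


def write_smiles(atoms, bonds, start):
    """Toy depth-first SMILES writer.

    Assumption: the molecule is acyclic and all bonds are single, so no
    ring-closure digits or bond symbols are needed.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def visit(atom, parent):
        children = [n for n in neighbors[atom] if n != parent]
        out = atoms[atom]
        # Every branch except the last one is wrapped in parentheses.
        for child in children[:-1]:
            out += "(" + visit(child, atom) + ")"
        if children:
            out += visit(children[-1], atom)
        return out

    return visit(start, None)


def enumerate_smiles(atoms, bonds):
    """Collect the distinct strings obtained by starting at each atom."""
    return sorted({write_smiles(atoms, bonds, s) for s in range(len(atoms))})


def predict_with_uncertainty(model, smiles_variants):
    """Average the predictions over all variants; the spread across
    variants serves as a simple uncertainty estimate."""
    preds = [model(s) for s in smiles_variants]
    return mean(preds), pstdev(preds)


# Ethanol: C0 - C1 - O2
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
variants = enumerate_smiles(atoms, bonds)
print(variants)


def toy_model(smiles):
    # Hypothetical stand-in for a trained network. It deliberately depends
    # on the string form, so different variants give different predictions.
    return float(len(smiles))


avg, spread = predict_with_uncertainty(toy_model, variants)
print(avg, spread)
```

    A model whose output does not change across variants yields a spread of zero. In this sketch, a larger spread is read as lower confidence in the prediction.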

    A Comprehensive Survey on Deep Graph Representation Learning

    Graph representation learning aims to effectively encode high-dimensional sparse graph-structured data into low-dimensional dense vectors, a fundamental task that has been widely studied in a range of fields, including machine learning and data mining. Classic graph embedding methods follow the basic idea that the embedding vectors of interconnected nodes in the graph should remain relatively close to each other, thereby preserving the structural information between the nodes in the graph. However, this is sub-optimal because: (i) traditional methods have limited model capacity, which limits the learning performance; (ii) existing techniques typically rely on unsupervised learning strategies and fail to couple with the latest learning paradigms; (iii) representation learning and downstream tasks depend on each other and should be jointly enhanced. With the remarkable success of deep learning, deep graph representation learning has shown great potential and advantages over shallow (traditional) methods, and a large number of deep graph representation learning techniques have been proposed in the past decade, especially graph neural networks. In this survey, we comprehensively review current deep graph representation learning algorithms by proposing a new taxonomy of the existing state-of-the-art literature. Specifically, we systematically summarize the essential components of graph representation learning and categorize existing approaches by graph neural network architecture and by the most recent advanced learning paradigms. Moreover, this survey also presents practical and promising applications of deep graph representation learning. Last but not least, we state new perspectives and suggest challenging directions that deserve further investigation in the future.
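
    The idea that interconnected nodes should end up with nearby vectors can be illustrated with one round of mean-aggregation message passing, a basic building block of many graph neural networks. The sketch below is an assumption-laden toy rather than a method from the survey: it has no learned weights, and the function name mean_aggregate and the example graph are hypothetical.

```python
def mean_aggregate(features, edges):
    """One message-passing round: each node's new vector is the mean of
    its own vector and the vectors of its neighbors (undirected graph)."""
    neighbors = {i: [i] for i in range(len(features))}  # include self
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    dim = len(features[0])
    updated = []
    for i in range(len(features)):
        group = neighbors[i]
        updated.append(
            [sum(features[j][d] for j in group) / len(group) for d in range(dim)]
        )
    return updated


# Two disconnected components: a chain 0-1-2 and a pair 3-4.
edges = [(0, 1), (1, 2), (3, 4)]
features = [[0.0], [1.0], [2.0], [10.0], [12.0]]

updated = mean_aggregate(features, edges)
print(updated)
```

    After one round, connected nodes have moved closer together while the two components stay apart. Practical graph neural networks add learned transformations and non-linearities on top of this aggregation step.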