    jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints

    Background: The decomposition of a chemical graph is a convenient approach to encode information of the corresponding organic compound. While several commercial toolkits exist to encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. It provides several options such as search depth, distance cut-offs, and atom and pharmacophore typing. Furthermore, it provides the functionality to combine, compare, or export the fingerprints into several formats. Results: We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit toolkit. We reimplemented popular fingerprinting algorithms such as depth-first search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g. CATS2D), radial fingerprints (e.g. Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the all-shortest path fingerprint, which includes only the subset of shortest paths from the full set of paths of the depth-first search fingerprint. As an application of jCompoundMapper, we provide a command-line executable binary. We measured the conversion speed and number of features for each encoding and described the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the large-scale Ames toxicity benchmark using a large-scale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an AUC ROC performance of 0.87 with a reimplementation of the extended connectivity fingerprint. This result is comparable to the performance achieved by a non-linear support vector machine using state-of-the-art descriptors. On the Sutherland QSAR data set, the best fingerprint encodings showed a comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published in the paper of Sutherland et al. Conclusions: jCompoundMapper is a library for chemical graph fingerprints with several tweaking possibilities and exporting options for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LGPL software license, the command-line interface, and the exporters should be useful for many applications in cheminformatics, such as benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similarity-based data mining.
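The relationship between the depth-first search fingerprint and the all-shortest path fingerprint described in the abstract above can be illustrated with a minimal Python sketch. This is not jCompoundMapper's API or implementation: the function names, the adjacency-dict graph format, and the use of bare element symbols as atom types (no bond orders, no hashing) are all assumptions made for illustration.

```python
from collections import deque


def simple_paths(adj, max_depth):
    """Enumerate all simple paths (tuples of atom indices) with at most max_depth bonds."""
    paths = []

    def dfs(path):
        paths.append(tuple(path))
        if len(path) - 1 == max_depth:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                path.append(nxt)
                dfs(path)
                path.pop()

    for start in adj:
        dfs([start])
    return paths


def bfs_distances(adj, source):
    """Topological (bond-count) distance from source to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_features(adj, labels, max_depth, shortest_only=False):
    """Return the set of path strings; optionally keep only shortest paths."""
    dist = {a: bfs_distances(adj, a) for a in adj}
    feats = set()
    for p in simple_paths(adj, max_depth):
        # A path is a shortest path if its bond count equals the graph distance
        # between its two end atoms.
        if shortest_only and dist[p[0]][p[-1]] != len(p) - 1:
            continue
        forward = "-".join(labels[i] for i in p)
        backward = "-".join(labels[i] for i in reversed(p))
        feats.add(min(forward, backward))  # direction-independent feature string
    return feats


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical example: a four-membered ring with three carbons and one oxygen.
labels = {0: "C", 1: "C", 2: "C", 3: "O"}
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

dfs_fp = path_features(adj, labels, max_depth=3)
sp_fp = path_features(adj, labels, max_depth=3, shortest_only=True)
print(sorted(dfs_fp - sp_fp))  # paths that wrap around the ring and are not shortest
```

In this toy ring, the three-bond paths connect atoms that are also direct neighbours, so they appear in the depth-first feature set but are dropped from the shortest-path subset.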

    Analyzing Learned Molecular Representations for Property Prediction

    Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 16 proprietary industrial datasets spanning a wide variety of chemical endpoints. In addition, we introduce a graph convolutional model that consistently matches or outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary datasets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows

    Interpretable molecular encodings and representations for machine learning tasks

    Molecular encodings and their usage in machine learning models have demonstrated significant breakthroughs in biomedical applications, particularly in the classification of peptides and proteins. To this end, we propose a new encoding method: Interpretable Carbon-based Array of Neighborhoods (iCAN). Designed to address machine learning models' need for more structured and less flexible input, it captures the neighborhoods of carbon atoms in a counting array and improves the utility of the resulting encodings for machine learning models. The iCAN method provides interpretable molecular encodings and representations, enabling the comparison of molecular neighborhoods, identification of repeating patterns, and visualization of relevance heat maps for a given data set. When reproducing a large biomedical peptide classification study, it outperforms its predecessor encoding. When extended to proteins, it outperforms a lead structure-based encoding on 71% of the data sets. Our method offers interpretable encodings that can be applied to all organic molecules, including exotic amino acids, cyclic peptides, and larger proteins, making it highly versatile across various domains and data sets. This work establishes a promising new direction for machine learning in peptide and protein classification in biomedicine and healthcare, potentially accelerating advances in drug discovery and disease diagnosis

    From Static to Dynamic Structures: Improving Binding Affinity Prediction with a Graph-Based Deep Learning Model

    Accurate prediction of protein-ligand binding affinities is an essential challenge in structure-based drug design. Despite recent advances in data-driven methods for affinity prediction, their accuracy is still limited, partially because they only take advantage of static crystal structures, while the actual binding affinities are generally depicted by the thermodynamic ensembles between proteins and ligands. One effective way to approximate such a thermodynamic ensemble is to use molecular dynamics (MD) simulation. Here, we curated an MD dataset containing 3,218 different protein-ligand complexes, and further developed Dynaformer, a graph-based deep learning model. Dynaformer was able to accurately predict the binding affinities by learning the geometric characteristics of the protein-ligand interactions from the MD trajectories. In silico experiments demonstrated that our model exhibits state-of-the-art scoring and ranking power on the CASF-2016 benchmark dataset, outperforming the methods hitherto reported. Moreover, we performed a virtual screening on the heat shock protein 90 (HSP90) using Dynaformer that identified 20 candidates, and further experimentally validated their binding affinities. We demonstrated that our approach is more efficient, identifying 12 hit compounds (two in the submicromolar range), including several newly discovered scaffolds. We anticipate this new synergy between large-scale MD datasets and deep learning models will provide a new route toward accelerating the early drug discovery process. Comment: totally reorganize the texts and figure

    Machine Learning for Kinase Drug Discovery

    Cancer is one of the major public health issues, causing several million losses every year. Although anti-cancer drugs have been developed and are globally administered, mild to severe side effects are known to occur during treatment. Computer-aided drug discovery has become a cornerstone for unveiling treatments of existing as well as emerging diseases. Computational methods aim not only to speed up the drug design process, but also to reduce time-consuming, costly experiments, as well as in vivo animal testing. In this context, over the last decade especially, deep learning began to play a prominent role in the prediction of molecular activity, property and toxicity. However, there are still major challenges when applying deep learning models in drug discovery. Those challenges include data scarcity for physicochemical tasks, the difficulty of interpreting the predictions made by deep neural networks, and the necessity of open-source and robust workflows to ensure reproducibility and reusability. In this thesis, after reviewing the state-of-the-art in deep learning applied to virtual screening, we address the previously mentioned challenges as follows: Regarding data scarcity in the context of deep learning applied to small molecules, we developed data augmentation techniques based on the SMILES encoding. This linear string notation enumerates the atoms present in a compound by following a path along the molecule graph. Multiple SMILES for a single compound can be obtained by traversing the graph using different paths. We applied the developed augmentation techniques to three different deep learning models, including convolutional and recurrent neural networks, and to four property and activity data sets. The results show that augmentation improves the model accuracy independently of the deep learning model, as well as of the data set size. Moreover, we computed the uncertainty of a model by using augmentation at inference time. In this regard, we have shown that the more confident the model is in its prediction, the smaller the error, implying that a given prediction can be trusted and is close to the target value. The software and associated documentation allow making predictions for novel compounds and have been made freely available. Trusting predictions blindly from algorithms may have serious consequences in areas of healthcare. In this context, better understanding how a neural network classifies a compound based on its input features is highly beneficial by helping to de-risk and optimize compounds. In this research project, we decomposed the inner layers of a deep neural network to identify the toxic substructures, the toxicophores, of a compound that led to the toxicity classification. Using molecular fingerprints (vectors that indicate the presence or absence of a particular atomic environment), we were able to map a toxicity score to each of these substructures. Moreover, we developed a method to visualize in 2D the toxicophores within a compound, the so-called cytotoxicity maps, which could be of great use to medicinal chemists in identifying ways to modify molecules to eliminate toxicity. Not only does the deep learning model reach state-of-the-art results, but the identified toxicophores confirm known toxic substructures, as well as suggest new potential candidates. In order to speed up the drug discovery process, access to robust and modular workflows is extremely advantageous. In this context, the fully open-source TeachOpenCADD project was developed. Significant tasks in both cheminformatics and bioinformatics are implemented in a pedagogical fashion, allowing the material to be used for teaching as well as the starting point for novel research. In this framework, a special pipeline is dedicated to kinases, a family of proteins which are known to be involved in diseases such as cancer. The aim is to gain insights into off-targets, i.e. proteins that are unintentionally affected by a compound, and that can cause adverse effects in treatments. Four measures of kinase similarity are implemented, taking into account sequence and structural information, as well as protein-ligand interaction and ligand profiling data. The workflow provides clustering of a set of kinases, which can be further analyzed to understand off-target effects of inhibitors. Results show that analyzing kinases from several perspectives is crucial for insight into off-target prediction, and for gaining a global perspective of the kinome. These novel methods can be exploited in the discovery of new drugs, and more specifically for diseases involving the dysregulation of kinases, such as cancer.
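The SMILES augmentation idea mentioned in the thesis abstract above (several strings for one compound, obtained by traversing the graph along different paths) can be sketched as follows. This is a toy illustration, not the thesis software: it assumes an acyclic molecule with single bonds and implicit hydrogens only (no ring closures, aromaticity, charges, or stereochemistry), and the function name and graph format are hypothetical. A real workflow would rely on a cheminformatics toolkit.

```python
from itertools import permutations


def smiles_variants(labels, adj):
    """Enumerate SMILES strings of an acyclic, single-bonded molecule.

    Every start atom and every ordering of branches at each atom gives one
    depth-first traversal; all children except the last are written as
    parenthesised branches.
    """

    def write(atom, parent):
        children = [n for n in adj[atom] if n != parent]
        if not children:
            return {labels[atom]}
        out = set()
        for order in permutations(children):
            partial = {labels[atom]}
            for i, child in enumerate(order):
                subs = write(child, atom)
                is_last = i == len(order) - 1
                partial = {
                    p + (s if is_last else "(" + s + ")")
                    for p in partial
                    for s in subs
                }
            out |= partial
        return out

    result = set()
    for start in adj:
        result |= write(start, None)
    return result


# Hypothetical example: isopropanol, heavy atoms only.
labels = {0: "C", 1: "C", 2: "C", 3: "O"}
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

variants = smiles_variants(labels, adj)
print(sorted(variants))
```

Each string in `variants` describes the same molecule, so a model can be trained on all of them, and the spread of its predictions across variants at inference time can serve as a rough confidence signal.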

    Graph-convolution neural network-based flexible docking utilizing coarse-grained distance matrix

    Prediction of protein-ligand complexes for flexible proteins remains a challenging problem in computational structural biology and drug design. Here we present two novel deep neural network approaches with significant improvement in efficiency and accuracy of binding mode prediction on a large and diverse set of protein systems compared to standard docking. Whereas the first graph convolutional network is used for re-ranking poses, the second approach aims to generate and rank poses independently of standard docking approaches. This novel approach relies on the prediction of distance matrices between ligand atoms and protein C_alpha atoms, thus incorporating side-chain flexibility implicitly.
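As a small illustration of the coarse-grained representation mentioned in the abstract above, the sketch below computes a distance matrix between ligand atoms and protein C_alpha atoms from known coordinates. It only shows what such a matrix looks like as a prediction target; the paper's network predicts these matrices, which is not attempted here. The function name and coordinates are made up for the example.

```python
import math


def ligand_calpha_distances(ligand_xyz, calpha_xyz):
    """Return a matrix D with D[i][j] = distance between ligand atom i and C_alpha j."""
    return [[math.dist(atom, ca) for ca in calpha_xyz] for atom in ligand_xyz]


# Hypothetical coordinates in angstroms.
ligand_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
calpha_xyz = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0)]

D = ligand_calpha_distances(ligand_xyz, calpha_xyz)
for row in D:
    print([round(d, 2) for d in row])
```

Because only one point per residue is used, side-chain positions never enter the matrix explicitly, which is the sense in which the representation is coarse-grained.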

    Subgraph Matching Kernels for Attributed Graphs

    We propose graph kernels based on subgraph matchings, i.e. structure-preserving bijections between subgraphs. While recently proposed kernels based on common subgraphs (Wale et al., 2008; Shervashidze et al., 2009) in general cannot be applied to attributed graphs, our approach allows mappings of subgraphs to be rated by a flexible scoring scheme that compares vertex and edge attributes by kernels. We show that subgraph matching kernels generalize several known kernels. To compute the kernel we propose a graph-theoretical algorithm inspired by a classical relation between common subgraphs of two graphs and cliques in their product graph observed by Levi (1973). Encouraging experimental results on a classification task of real-world graphs are presented. Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)
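The classical relation between common subgraphs and cliques in a product graph, referred to in the abstract above, can be sketched in a few lines of Python. This is a simplified illustration, not the authors' algorithm: it assumes discrete vertex labels, unlabeled edges, and unit weights, so the "kernel value" is just a count of label-preserving bijections between induced subgraphs with at most `max_size` vertices. The function names are hypothetical.

```python
from itertools import combinations


def modular_product(labels1, edges1, labels2, edges2):
    """Build the modular product of two vertex-labeled graphs.

    Product vertices are pairs (u, v) with equal labels. Two pairs are adjacent
    when they use distinct vertices in both graphs and agree on adjacency:
    u-x is an edge in graph 1 exactly when v-y is an edge in graph 2.
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    nodes = [(u, v) for u in labels1 for v in labels2 if labels1[u] == labels2[v]]
    adj = {n: set() for n in nodes}
    for (u, v), (x, y) in combinations(nodes, 2):
        if u != x and v != y:
            if (frozenset((u, x)) in e1) == (frozenset((v, y)) in e2):
                adj[(u, v)].add((x, y))
                adj[(x, y)].add((u, v))
    return adj


def count_cliques(adj, max_size):
    """Count all cliques (not only maximal ones) with 1..max_size vertices."""
    nodes = sorted(adj)

    def extend(candidates, size):
        total = 0
        for i, n in enumerate(candidates):
            total += 1  # the current clique extended by n
            if size + 1 < max_size:
                # Keep only later candidates adjacent to n, so each clique is counted once.
                rest = [m for m in candidates[i + 1:] if m in adj[n]]
                total += extend(rest, size + 1)
        return total

    return extend(nodes, 0)


# Hypothetical example: C-O matched against C-O-C.
g1_labels, g1_edges = {0: "C", 1: "O"}, [(0, 1)]
g2_labels, g2_edges = {0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)]

product = modular_product(g1_labels, g1_edges, g2_labels, g2_edges)
print(count_cliques(product, max_size=2))
```

Each clique in the product graph corresponds to one matching between induced subgraphs; replacing the unit count with a product of vertex and edge kernel values would be one way to move toward the flexible scoring scheme the abstract describes.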