Exploring the GDB-13 chemical space using deep generative models
Recent applications of recurrent neural networks (RNNs) enable training models that sample the chemical space. In this study we train RNNs on molecular string representations (SMILES) of a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained with 1 million structures (0.1% of the database) reproduces 68.9% of the entire database after training, when sampling 2 billion molecules. We also developed a method to assess the quality of the training process using negative log-likelihood plots. Furthermore, we use a mathematical model based on the “coupon collector problem” that compares the trained model to an upper bound, and thus we are able to quantify how much it has learned. We also suggest that this method can be used as a tool to benchmark the learning capabilities of any molecular generative model architecture. Additionally, an analysis of the generated chemical space was performed, which shows that, mostly due to the syntax of SMILES, complex molecules with many rings and heteroatoms are more difficult to sample.
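The coupon-collector upper bound mentioned in this abstract can be illustrated with a short calculation. This is a minimal sketch, not the authors' code: it assumes an idealized model that samples uniformly, with replacement, from exactly the 975 million GDB-13 molecules and never produces anything outside the database. The function name `expected_coverage` and the final ratio are illustrative assumptions.

```python
import math


def expected_coverage(space_size: int, num_samples: int) -> float:
    """Expected fraction of distinct items seen after uniform sampling with replacement."""
    # A given item is missed by one draw with probability (1 - 1/N),
    # so it is missed by all k draws with probability (1 - 1/N)**k.
    # Work in log space so the result stays accurate for very large N and k.
    log_miss = num_samples * math.log1p(-1.0 / space_size)
    return -math.expm1(log_miss)  # 1 - (1 - 1/N)**k


GDB13_SIZE = 975_000_000   # rounded database size quoted in the abstract
SAMPLES = 2_000_000_000    # number of sampled molecules quoted in the abstract
REPORTED = 0.689           # coverage reported for the trained model

ideal = expected_coverage(GDB13_SIZE, SAMPLES)
print(f"ideal uniform sampler covers {ideal:.1%} of the database")
# Illustrative ratio only (an assumption, not a metric defined in the abstract).
print(f"reported coverage relative to the ideal: {REPORTED / ideal:.1%}")
```

Under these assumptions even a perfect sampler cannot reach 100% coverage with 2 billion draws, which is why the reported 68.9% is better judged against this bound than against the full database.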
Randomized SMILES strings improve the quality of molecular generative models
Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that define how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models that use LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and represent the target chemical space more accurately. Specifically, a model was trained with randomized SMILES that was able to generate almost all molecules from GDB-13 with a quasi-uniform probability. Models trained with smaller samples show an even bigger improvement when trained with randomized SMILES. Additionally, models were trained on molecules obtained from ChEMBL and illustrate again that training with randomized SMILES leads to models having a better representation of the drug-like chemical space. Namely, the model trained with randomized SMILES was able to generate at least double the number of unique molecules with the same distribution of properties compared to one trained with canonical SMILES.
Graph networks for molecular design
Deep learning methods applied to chemistry can be used to accelerate the discovery of new molecules. This work introduces GraphINVENT, a platform developed for graph-based molecular design using graph neural networks (GNNs). GraphINVENT uses a tiered deep neural network architecture to probabilistically generate new molecules a single bond at a time. All models implemented in GraphINVENT can quickly learn to build molecules resembling the training set molecules without any explicit programming of chemical rules. The models have been benchmarked using the MOSES distribution-based metrics, showing that GraphINVENT models compare well with state-of-the-art generative models. This work compares six different GNN-based generative models in GraphINVENT, and shows that ultimately the gated-graph neural network performs best against the metrics considered here.
In silico generation of novel, drug-like chemical matter using the LSTM neural network
The exploration of novel chemical spaces is one of the most important tasks
of cheminformatics when supporting the drug discovery process. Properly
designed and trained deep neural networks can provide a viable alternative to
brute-force de novo approaches or various other machine-learning techniques for
generating novel drug-like molecules. In this article we present a method to
generate molecules using a long short-term memory (LSTM) neural network and
provide an analysis of the results, including a virtual screening test. Using
the network, one million drug-like molecules were generated in 2 hours. The
molecules are novel, diverse (contain numerous novel chemotypes), have good
physicochemical properties and have good synthetic accessibility, even though
these qualities were not specific constraints. Although novel, their structural
features and functional groups remain closely within the drug-like space
defined by the bioactive molecules from ChEMBL. Virtual screening using the
profile QSAR approach confirms that the potential of these novel molecules to
show bioactivity is comparable to the ChEMBL set from which they were derived.
The molecule generator written in Python used in this study is available on
request.
Comment: this version fixes some reference numbers.
Learning Extremal Representations with Deep Archetypal Analysis
Archetypes are typical population representatives in an extremal sense, where
typicality is understood as the most extreme manifestation of a trait or
feature. In linear feature space, archetypes approximate the data convex hull
allowing all data points to be expressed as convex mixtures of archetypes.
However, it might not always be possible to identify meaningful archetypes in a
given feature space. Learning an appropriate feature space and identifying
suitable archetypes simultaneously addresses this problem. This paper
introduces a generative formulation of the linear archetype model,
parameterized by neural networks. By introducing the distance-dependent
archetype loss, the linear archetype model can be integrated into the latent
space of a variational autoencoder, and an optimal representation with respect
to the unknown archetypes can be learned end-to-end. The reformulation of
linear Archetypal Analysis as a deep variational information bottleneck allows
the incorporation of arbitrarily complex side information during training.
Furthermore, an alternative prior, based on a modified Dirichlet distribution,
is proposed. The real-world applicability of the proposed method is
demonstrated by exploring archetypes of female facial expressions while using
multi-rater based emotion scores of these expressions as side information. A
second application illustrates the exploration of the chemical space of small
organic molecules. In this experiment, it is demonstrated that exchanging the
side information but keeping the same set of molecules, e.g., using as side
information the heat capacity of each molecule instead of the band gap energy,
will result in the identification of different archetypes. As an application,
these learned representations of chemical space might reveal distinct starting
points for de novo molecular design.
Comment: Under review for publication at the International Journal of Computer
Vision (IJCV). Extended version of our GCPR2019 paper "Deep Archetypal
Analysis".
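The statement in this abstract that data points can be expressed as convex mixtures of archetypes can be made concrete with a small example. This is a hedged sketch of the linear archetype model only, not the paper's deep variational formulation; the function name `convex_mixture` and the toy archetypes are assumptions chosen for illustration.

```python
from typing import List, Sequence


def convex_mixture(
    archetypes: Sequence[Sequence[float]],
    weights: Sequence[float],
    tol: float = 1e-9,
) -> List[float]:
    """Return sum_i weights[i] * archetypes[i] for non-negative weights that sum to 1."""
    if len(archetypes) != len(weights):
        raise ValueError("need exactly one weight per archetype")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(archetypes[0])
    # Each coordinate of the point is the weighted average of the
    # corresponding archetype coordinates.
    return [sum(w * z[d] for w, z in zip(weights, archetypes)) for d in range(dim)]


# Three hypothetical archetypes forming a triangle in a 2-D feature space.
toy_archetypes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
point = convex_mixture(toy_archetypes, [0.2, 0.3, 0.5])
print(point)
```

Any point produced this way lies inside the convex hull of the archetypes, which is the sense in which the archetypes approximate the extremes of the data.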
Development of Integrated Machine Learning and Data Science Approaches for the Prediction of Cancer Mutation and Autonomous Drug Discovery of Anti-Cancer Therapeutic Agents
Few technological ideas have captivated the minds of biochemical researchers to the degree that machine learning (ML) and artificial intelligence (AI) have. Over the last few years, advances in the ML field have driven the design of new computational systems that improve with experience and are able to model increasingly complex chemical and biological phenomena. In this dissertation, we capitalize on these achievements and use machine learning to study drug receptor sites and design drugs to target these sites. First, we analyze the significance of various single nucleotide variations and assess their rate of contribution to cancer. Following that, we use a portfolio of machine learning and data science approaches to design new drugs to target protein kinase inhibitors. We show that these techniques exhibit strong promise in aiding cancer research and drug discovery.
Development of Machine Learning Models for Generation and Activity Prediction of the Protein Tyrosine Kinase Inhibitors
The field of computational drug discovery and development continues to grow at a rapid pace, using generative machine learning approaches to present us with solutions to high dimensional and complex problems in drug discovery and design. In this work, we present a platform of Machine Learning-based approaches for generation and scoring of novel kinase inhibitor molecules. We utilized a binary Random Forest classification model to develop a Machine Learning-based scoring function to evaluate the generated molecules on Kinase Inhibition Likelihood. By training the model on several chemical features of each known kinase inhibitor, we were able to create a metric that captures the differences between a SRC Kinase Inhibitor and a non-SRC Kinase Inhibitor. We implemented the scoring function into a Biased and Unbiased Bayesian Optimization framework to generate molecules based on features of SRC Kinase Inhibitors. We then used similarity metrics such as Tanimoto Similarity to assess their closeness to known SRC Kinase Inhibitors. The molecules generated from this experiment demonstrated potential for belonging to the SRC Kinase Inhibitor family, though chemical synthesis would be needed to confirm the results. The top molecules generated from the Unbiased and Biased Bayesian Optimization experiments were calculated to have Tanimoto Similarity scores of 0.711 and 0.709, respectively, to known SRC Kinase Inhibitors. With calculated Kinase Inhibition Likelihood scores of 0.586 and 0.575, the top molecules generated from the Bayesian Optimization demonstrate a disconnect between the similarity scores to known SRC Kinase Inhibitors and the calculated Kinase Inhibition Likelihood score. It was found that implementing a bias into the Bayesian Optimization process had little effect on the quality of generated molecules.
In addition, several molecules generated from the Bayesian Optimization process were sent to the School of Pharmacy for chemical synthesis, which gives the experiment more concrete results. The results of this study demonstrated that generating molecules through Bayesian Optimization techniques could aid in the generation of molecules for a specific kinase family, but further expansions of the techniques would be needed for substantial results.
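The Tanimoto Similarity used in this abstract to compare generated molecules with known inhibitors can be sketched as follows. This is a minimal illustration, assuming each molecule is represented as the set of "on" bit positions of a binary fingerprint; the abstract does not say which fingerprint was used, and the bit sets below are hypothetical values, not real molecules.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bit positions."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Assumed convention: two empty fingerprints are given similarity 0.
        return 0.0
    return len(bits_a & bits_b) / union


# Hypothetical fingerprints: positions of the bits that are set.
generated = {1, 4, 7, 9, 12}
known_inhibitor = {1, 4, 7, 10, 12, 15}
print(f"Tanimoto similarity: {tanimoto(generated, known_inhibitor):.3f}")
```

A score of 1.0 means the two fingerprints are identical and 0.0 means they share no set bits, so scores near 0.71, as reported above, indicate substantial but incomplete overlap.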