Improving Chemical Autoencoder Latent Space and Molecular De novo Generation Diversity with Heteroencoders
Chemical autoencoders are attractive models as they combine chemical space
navigation with possibilities for de-novo molecule generation in areas of
interest. This enables them to produce focused chemical libraries around a
single lead compound for employment early in a drug discovery project. Here it
is shown that the choice of chemical representation, such as SMILES strings,
has a large influence on the properties of the latent space. It is further
explored to what extent translating between different chemical representations
influences the latent space similarity to the SMILES strings or circular
fingerprints. By employing SMILES enumeration for either the encoder or
decoder, it is found that the decoder has the largest influence on the
properties of the latent space. Training a sequence to sequence heteroencoder
based on recurrent neural networks (RNNs) with long short-term memory cells
(LSTM) to predict different enumerated SMILES strings from the same canonical
SMILES string gives the largest similarity between latent space distance and
molecular similarity measured as circular fingerprints similarity. Using the
output from the bottleneck in QSAR modelling of five molecular datasets shows
that heteroencoder-derived vectors markedly outperform autoencoder-derived
vectors as well as models built using ECFP4 fingerprints, underlining the
increased chemical relevance of the latent space. However, the use of
enumeration during training of the decoder leads to a marked increase in the
rate of decoding to a different molecule than the one encoded, a tendency that
can be counteracted with more complex network architectures.
Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain
Computer Assisted Synthesis Planning (CASP) has gained considerable interest as of late. Herein we investigate a template-based retrosynthetic planning tool, trained on a variety of datasets consisting of up to 17.5 million reactions. We demonstrate that models trained on datasets such as internal Electronic Laboratory Notebooks (ELN), and the publicly available United States Patent Office (USPTO) extracts, are sufficient for the prediction of full synthetic routes to compounds of interest in medicinal chemistry. As such we have assessed the models on 1731 compounds from 41 virtual libraries for which experimental results were known. Furthermore, we show that accuracy is a misleading metric for assessment of the policy network, and propose that the number of successfully applied templates, in conjunction with the overall ability to generate full synthetic routes, be examined instead. To this end we found that the specificity of the templates comes at the cost of generalizability, and overall model performance. This is supplemented by a comparison of the underlying datasets and their corresponding models.
Faster and more diverse de novo molecular optimization with double-loop reinforcement learning using augmented SMILES
Using generative deep learning models and reinforcement learning together can
effectively generate new molecules with desired properties. By employing a
multi-objective scoring function, thousands of high-scoring molecules can be
generated, making this approach useful for drug discovery and material science.
However, the application of these methods can be hindered by computationally
expensive or time-consuming scoring procedures, particularly when a large
number of function calls are required as feedback in the reinforcement learning
optimization. Here, we propose the use of double-loop reinforcement learning
with simplified molecular-input line-entry system (SMILES) augmentation to improve
the efficiency and speed of the optimization. By adding an inner loop that
augments the generated SMILES strings to non-canonical SMILES for use in
additional reinforcement learning rounds, we can both reuse the scoring
calculations on the molecular level, thereby speeding up the learning process,
as well as offer additional protection against mode collapse. We find that
employing between 5 and 10 augmentation repetitions is optimal for the scoring
functions tested and is further associated with an increased diversity in the
generated compounds, improved reproducibility of the sampling runs and the
generation of molecules of higher similarity to known ligands.
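The double-loop idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `double_loop_step`, `toy_score`, `toy_augment` and `toy_update` are hypothetical names, the scoring function is a dummy, and the augmentation is a hard-coded lookup of equivalent SMILES strings standing in for what a cheminformatics toolkit would generate. The point it shows is only the bookkeeping: each unique molecule is scored once, and that score is reused for every augmented (non-canonical) SMILES in the inner loop.

```python
from typing import Callable, Dict, List, Tuple

Pairs = List[Tuple[str, float]]


def double_loop_step(
    batch: List[str],
    score: Callable[[str], float],
    augment: Callable[[str], str],
    update: Callable[[Pairs], None],
    n_augment: int = 5,
) -> Dict[str, float]:
    """One outer-loop step followed by n_augment inner-loop updates."""
    # Expensive scoring happens once per unique molecule in the batch.
    scores = {smiles: score(smiles) for smiles in dict.fromkeys(batch)}
    # Outer loop: learn from the SMILES strings as they were sampled.
    update([(s, scores[s]) for s in batch])
    # Inner loop: learn from augmented SMILES, reusing the cached scores.
    for _ in range(n_augment):
        update([(augment(s), scores[s]) for s in batch])
    return scores


# --- Toy stand-ins (all hypothetical, for illustration only) ---
# Alternative SMILES strings for the same molecules.
ALTERNATIVES = {"CCO": "OCC", "CC(=O)O": "OC(C)=O"}
score_calls: List[str] = []
updates: List[Pairs] = []


def toy_score(smiles: str) -> float:
    score_calls.append(smiles)  # record every expensive call
    return len(smiles) / 10.0   # dummy score, not a real property


def toy_augment(smiles: str) -> str:
    return ALTERNATIVES.get(smiles, smiles)


def toy_update(pairs: Pairs) -> None:
    updates.append(pairs)       # a real agent would take a gradient step here


batch = ["CCO", "CC(=O)O", "CCO"]
scores = double_loop_step(batch, toy_score, toy_augment, toy_update, n_augment=5)
print(len(score_calls), len(updates))
```

With `n_augment=5`, the agent receives six updates per batch while the scoring function is called only once per unique molecule, which is where the claimed saving in function calls would come from.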
Autonomous Drug Design with Multi-Armed Bandits
Recent developments in artificial intelligence and automation support a new
drug design paradigm: autonomous drug design. Under this paradigm, generative
models can provide suggestions on thousands of molecules with specific
properties, and automated laboratories can potentially make, test and analyze
molecules with minimal human supervision. However, since still only a limited
number of molecules can be synthesized and tested, an obvious challenge is how
to efficiently select among provided suggestions in a closed-loop system. We
formulate this task as a stochastic multi-armed bandit problem with multiple
plays, volatile arms and similarity information. To solve this task, we adapt
previous work on multi-armed bandits to this setting, and compare our solution
with random sampling, greedy selection and decaying-epsilon-greedy selection
strategies. According to our simulation results, our approach has the potential
to perform better exploration and exploitation of the chemical space for
autonomous drug design.
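As a point of reference, the decaying-epsilon-greedy baseline named in this abstract could look roughly like the sketch below. It is not the authors' proposed method: it handles multiple plays per round but ignores volatile arms and similarity information, and the function name, the Bernoulli reward model, and the decay schedule `eps0 / (1 + decay * t)` are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def decaying_epsilon_greedy(
    true_means: List[float],
    rounds: int = 200,
    plays: int = 2,
    eps0: float = 1.0,
    decay: float = 0.05,
    seed: int = 0,
) -> Tuple[List[int], List[float], float]:
    """Pick `plays` distinct arms per round with a decaying exploration rate."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was played
    estimates = [0.0] * n_arms   # running mean reward per arm
    total_reward = 0.0
    for t in range(rounds):
        epsilon = eps0 / (1.0 + decay * t)  # assumed decay schedule
        if rng.random() < epsilon:
            # Explore: choose arms uniformly at random.
            chosen = rng.sample(range(n_arms), plays)
        else:
            # Exploit: choose the arms with the highest estimated reward.
            ranked = sorted(range(n_arms), key=lambda a: estimates[a], reverse=True)
            chosen = ranked[:plays]
        for arm in chosen:
            # Simulated Bernoulli outcome (e.g. "molecule was active").
            reward = 1.0 if rng.random() < true_means[arm] else 0.0
            counts[arm] += 1
            estimates[arm] += (reward - estimates[arm]) / counts[arm]
            total_reward += reward
    return counts, estimates, total_reward


counts, estimates, total = decaying_epsilon_greedy([0.2, 0.5, 0.8, 0.4])
print(counts, [round(e, 2) for e in estimates], total)
```

Early rounds are dominated by random exploration; as epsilon shrinks, selection concentrates on the arms with the best observed reward.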
Transformer-based molecular optimization beyond matched molecular pairs
Molecular optimization aims to improve the drug profile of a starting molecule. It is a fundamental problem in drug discovery but challenging due to (i) the requirement of simultaneous optimization of multiple properties and (ii) the large chemical space to explore. Recently, deep learning methods have been proposed to solve this task by mimicking the chemist's intuition in terms of matched molecular pairs (MMPs). Although MMPs are a widely used strategy among medicinal chemists, they offer limited capability in terms of exploring the space of structural modifications and therefore do not cover the complete space of solutions. Often more general transformations beyond the nature of MMPs are feasible and/or necessary, e.g. simultaneous modifications of the starting molecule at different places including the core scaffold. This study aims to provide a general methodology that offers more general structural modifications beyond MMPs. In particular, the same Transformer architecture is trained on different datasets. These datasets consist of a set of molecular pairs which reflect different types of transformations. Beyond MMP transformation, datasets reflecting general structural changes are constructed from ChEMBL based on two approaches: Tanimoto similarity (allows for multiple modifications) and scaffold matching (allows for multiple modifications but keeps the scaffold constant), respectively. We investigate how the model behavior can be altered by tailoring the dataset while using the same model architecture. Our results show that the models trained on differently prepared datasets transform a given starting molecule in a way that reflects the nature of the dataset used for training the model. These models could complement each other and unlock the capability for the chemists to pursue different options for improving a starting molecule.
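To make the Tanimoto-based pairing concrete, here is a minimal sketch of how molecular pairs might be selected by fingerprint similarity. The fingerprints are represented as plain sets of "on" bit indices; the molecule labels, bit values, and the 0.5 threshold are invented for illustration and are not taken from the paper, which does not state its threshold in this abstract.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: |A and B| / |A or B| over fingerprint on-bits."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similar_pairs(
    fingerprints: Dict[str, Set[int]], threshold: float
) -> List[Tuple[str, str]]:
    """Return all molecule pairs whose similarity meets the threshold."""
    return [
        (m1, m2)
        for m1, m2 in combinations(sorted(fingerprints), 2)
        if tanimoto(fingerprints[m1], fingerprints[m2]) >= threshold
    ]


# Hypothetical fingerprints: A and B share 3 of 5 distinct bits, C shares none.
fps = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {7, 8, 9}}
pairs = similar_pairs(fps, threshold=0.5)
print(pairs)
```

Raising or lowering the threshold changes how large a structural change each training pair is allowed to represent, which is the kind of dataset tailoring the abstract describes.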
Graph networks for molecular design
Deep learning methods applied to chemistry can be used to accelerate the discovery of new molecules. This work introduces GraphINVENT, a platform developed for graph-based molecular design using graph neural networks (GNNs). GraphINVENT uses a tiered deep neural network architecture to probabilistically generate new molecules a single bond at a time. All models implemented in GraphINVENT can quickly learn to build molecules resembling the training set molecules without any explicit programming of chemical rules. The models have been benchmarked using the MOSES distribution-based metrics, showing how GraphINVENT models compare well with state-of-the-art generative models. This work compares six different GNN-based generative models in GraphINVENT, and shows that ultimately the gated-graph neural network performs best against the metrics considered here.
LibINVENT: Reaction-based Generative Scaffold Decoration for in Silico Library Design
Because of the strong relationship between the desired molecular activity and its structural core, the screening of focused, core-sharing chemical libraries is a key step in lead optimization. Despite the plethora of current research focused on in silico methods for molecule generation, to our knowledge, no tool capable of designing such libraries has been proposed. In this work, we present a novel tool for de novo drug design called LibINVENT. It is capable of rapidly proposing chemical libraries of compounds sharing the same core while maximizing a range of desirable properties. To further help the process of designing focused libraries, the user can list specific chemical reactions that can be used for the library creation. LibINVENT is therefore a flexible tool for generating virtual chemical libraries for lead optimization in a broad range of scenarios. Additionally, the shared core ensures that the compounds in the library are similar, possess desirable properties, and can also be synthesized under the same or similar conditions. The LibINVENT code is freely available in our public repository at https://github.com/MolecularAI/Lib-INVENT. The code necessary for data preprocessing is further available at: https://github.com/MolecularAI/Lib-INVENT-dataset
A de novo molecular generation method using latent vector based generative adversarial network
Deep learning methods applied to drug discovery have been used to generate novel structures. In this study, we propose a new deep learning architecture, LatentGAN, which combines an autoencoder and a generative adversarial neural network for de novo molecular design. We applied the method in two scenarios: One to generate random drug-like compounds and another to generate target-biased compounds. Our results show that the method works well in both cases. Sampled compounds from the trained model can largely occupy the same chemical space as the training set and also generate a substantial fraction of novel compounds. Moreover, the drug-likeness score of compounds sampled from LatentGAN is also similar to that of the training set. Lastly, generated compounds differ from those obtained with a Recurrent Neural Network-based generative model approach, indicating that both methods can be used complementarily.