In silico generation of novel, drug-like chemical matter using the LSTM neural network
The exploration of novel chemical spaces is one of the most important tasks
of cheminformatics when supporting the drug discovery process. Properly
designed and trained deep neural networks can provide a viable alternative to
brute-force de novo approaches or various other machine-learning techniques for
generating novel drug-like molecules. In this article we present a method to
generate molecules using a long short-term memory (LSTM) neural network and
provide an analysis of the results, including a virtual screening test. Using
the network, one million drug-like molecules were generated in 2 hours. The
molecules are novel and diverse (containing numerous novel chemotypes), and
have good physicochemical properties and synthetic accessibility, even though
these qualities were not explicit constraints. Although novel, their structural
features and functional groups remain closely within the drug-like space
defined by the bioactive molecules from ChEMBL. Virtual screening using the
profile QSAR approach confirms that the potential of these novel molecules to
show bioactivity is comparable to the ChEMBL set from which they were derived.
The molecule generator used in this study, written in Python, is available
on request.
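As a concrete illustration of the generation loop such a network uses, the sketch below samples token sequences autoregressively until an end token is produced. The bigram table is a hypothetical stand-in for a trained LSTM's next-token distribution; the tiny vocabulary and probabilities are invented for illustration, not taken from the paper.

```python
import random

# Minimal sketch of the autoregressive sampling loop behind SMILES language
# models. A trained LSTM would supply the next-token distribution; here a
# hand-written bigram table over a tiny, hypothetical vocabulary stands in.
START, END = "^", "$"
BIGRAM = {
    "^": {"C": 0.8, "O": 0.1, "N": 0.1},
    "C": {"C": 0.5, "O": 0.2, "N": 0.1, "$": 0.2},
    "O": {"C": 0.5, "$": 0.5},
    "N": {"C": 0.5, "$": 0.5},
}

def sample_sequence(rng, max_len=20):
    """Sample tokens one at a time until the end token, as an LSTM decoder does."""
    tokens, prev = [], START
    while len(tokens) < max_len:
        dist = BIGRAM[prev]
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == END:
            break
        tokens.append(nxt)
        prev = nxt
    return "".join(tokens)

rng = random.Random(0)
samples = [sample_sequence(rng) for _ in range(10)]
```

Replacing the bigram lookup with a forward pass of a recurrent network (and conditioning on the full prefix rather than one token) yields the generator architecture the abstract describes.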
Randomized SMILES strings improve the quality of molecular generative models
Recurrent Neural Networks (RNNs) trained on sets of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1,000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that quantify how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models using LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and represent the target chemical space more accurately. Specifically, a model trained with randomized SMILES was able to generate almost all molecules from GDB-13 with quasi-uniform probability. Models trained with smaller samples show an even bigger improvement when trained with randomized SMILES. Additionally, models were trained on molecules obtained from ChEMBL and illustrate again that training with randomized SMILES leads to models having a better representation of the drug-like chemical space: the model trained with randomized SMILES was able to generate at least double the number of unique molecules with the same distribution of properties compared to one trained with canonical SMILES.
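The idea behind randomized SMILES can be illustrated with a self-contained toy: the same molecular graph admits several valid strings depending on the atom traversal order. In practice this is done with RDKit (e.g. `Chem.MolToSmiles(mol, doRandom=True)`); the tiny hand-rolled graph and writer below are assumptions made for self-containment and handle only this acyclic example.

```python
import random

# Toy illustration of randomized SMILES: one molecule, several valid strings.
ATOMS = ["C", "C", "O"]                 # ethanol; canonical SMILES is "CCO"
BONDS = {0: [1], 1: [0, 2], 2: [1]}     # adjacency list over atom indices

def randomized_smiles(seed):
    """Write a SMILES-like string by DFS from a random start atom.

    Valid only for acyclic (tree-shaped) toy molecules like the one above;
    real randomized SMILES generation should use RDKit.
    """
    rng = random.Random(seed)
    visited, out = set(), []

    def dfs(i):
        visited.add(i)
        out.append(ATOMS[i])
        nbrs = [j for j in BONDS[i] if j not in visited]
        rng.shuffle(nbrs)
        for k, j in enumerate(nbrs):
            if k < len(nbrs) - 1:       # all but the last neighbour branch
                out.append("(")
                dfs(j)
                out.append(")")
            else:
                dfs(j)

    dfs(rng.randrange(len(ATOMS)))      # random starting atom
    return "".join(out)

variants = {randomized_smiles(s) for s in range(50)}
# Ethanol yields strings such as "CCO", "OCC", "C(C)O", "C(O)C".
```

Training on many such variants per molecule, instead of one canonical string, is what the benchmarked randomized-SMILES models do.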
Improving Molecular Pretraining with Complementary Featurizations
Molecular pretraining, which learns molecular representations over massive
unlabeled data, has become a prominent paradigm to solve a variety of tasks in
computational chemistry and drug discovery. Recently, prosperous progress has
been made in molecular pretraining with different molecular featurizations,
including 1D SMILES strings, 2D graphs, and 3D geometries. However, the role of
molecular featurizations with their corresponding neural architectures in
molecular pretraining remains largely unexamined. In this paper, through two
case studies -- chirality classification and aromatic ring counting -- we first
demonstrate that different featurization techniques convey chemical information
differently. In light of this observation, we propose a simple and effective
MOlecular pretraining framework with COmplementary featurizations (MOCO). MOCO
comprehensively leverages multiple featurizations that complement each other
and outperforms existing state-of-the-art models that rely solely on one or
two featurizations on a wide range of molecular property prediction tasks.
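A minimal sketch of why featurizations can be complementary: the same chemical fact (here, ring count) is read off with very different ease from a 1D SMILES string and a 2D graph. The molecule and helper functions below are illustrative assumptions, not the paper's method.

```python
# One molecule, two featurizations. Cyclopropane:
smiles_1d = "C1CC1"                      # 1D string featurization
atoms_2d = ["C", "C", "C"]               # 2D graph featurization
edges_2d = [(0, 1), (1, 2), (2, 0)]

def ring_count_from_graph(n_atoms, edges):
    # For a connected molecule the cyclomatic number is edges - atoms + 1:
    # the ring count falls straight out of the graph structure.
    return len(edges) - n_atoms + 1

def ring_count_from_smiles(s):
    # From the string, the same fact requires parsing ring-closure digits
    # (each distinct digit marks one ring closure in simple SMILES).
    return len({c for c in s if c.isdigit()})

graph_rings = ring_count_from_graph(len(atoms_2d), edges_2d)    # 1
string_rings = ring_count_from_smiles(smiles_1d)                # 1
```

Both featurizations encode the ring, but the graph exposes it structurally while the string buries it in syntax; pretraining on both, as MOCO does, lets a model draw on whichever view carries the information most directly.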
Leveraging Reaction-aware Substructures for Retrosynthesis Analysis
Retrosynthesis analysis is a critical task in organic chemistry central to
many important industries. Previously, various machine learning approaches
have achieved promising results on this task by representing output molecules
as strings and autoregressively decoding them token by token with generative
models; text generation and machine translation models from natural language
processing have frequently been adopted for this purpose. The token-by-token
decoding approach, however, is
not intuitive from a chemistry perspective because some substructures are
relatively stable and remain unchanged during reactions. In this paper, we
propose a substructure-level decoding model, where the substructures are
reaction-aware and can be automatically extracted with a fully data-driven
approach. Our approach achieved improvement over previously reported models,
and we find that the performance can be further boosted if the accuracy of
substructure extraction is improved. The substructures extracted by our
approach can provide users with better insights for decision-making compared to
existing methods. We hope this work will generate interest in this
fast-growing and highly interdisciplinary area of retrosynthesis prediction
and other related topics.
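The contrast between token-by-token and substructure-level decoding can be sketched as follows. The product molecule and the substructure split are hypothetical examples chosen for illustration, not outputs of the paper's data-driven extractor.

```python
# Token-by-token vs substructure-level decoding of a product SMILES.
product = "CC(=O)OC"                    # methyl acetate, for illustration

# Character-level tokenization: one decoding step per token (8 steps here).
token_steps = list(product)

# Suppose the extractor identified the acetyl fragment "CC(=O)O" as a stable,
# reaction-aware substructure; the decoder then emits it in a single step,
# reducing 8 generation steps to 2 and keeping the stable fragment intact.
substructure_steps = ["CC(=O)O", "C"]
```

Both decodings reconstruct the same product, but the substructure-level one mirrors the chemical intuition that stable fragments pass through reactions unchanged.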