Exploring the GDB-13 chemical space using deep generative models
Recent applications of recurrent neural networks (RNNs) enable training models that sample the chemical space. In this study we train RNNs on molecular string representations (SMILES) using a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained with 1 million structures (0.1% of the database) reproduces 68.9% of the entire database after training, when sampling 2 billion molecules. We also developed a method to assess the quality of the training process using negative log-likelihood plots. Furthermore, we use a mathematical model based on the “coupon collector problem” that compares the trained model to an upper bound, and thus we are able to quantify how much it has learned. We also suggest that this method can be used as a tool to benchmark the learning capabilities of any molecular generative model architecture. Additionally, an analysis of the generated chemical space was performed, which shows that, mostly due to the syntax of SMILES, complex molecules with many rings and heteroatoms are more difficult to sample
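The coupon-collector comparison mentioned in this abstract can be illustrated with a short calculation. The sketch below is not the authors' exact formulation, which the abstract does not give. It assumes the upper bound is an ideal sampler that draws every database molecule uniformly at random with replacement, so the expected fraction of distinct molecules seen after k draws from a space of size N is 1 - (1 - 1/N)^k. The function name and the rounded constants are illustrative.

```python
import math


def expected_coverage(space_size: int, n_samples: int) -> float:
    """Expected fraction of distinct items seen after n_samples uniform
    draws with replacement from space_size items: 1 - (1 - 1/N)^k."""
    # log1p/expm1 keep the result accurate when 1/N is tiny.
    return -math.expm1(n_samples * math.log1p(-1.0 / space_size))


# Rounded figures taken from the abstract (assumed exact here for illustration).
GDB13_SIZE = 975_000_000       # molecules in GDB-13
SAMPLE_SIZE = 2_000_000_000    # molecules sampled from the trained model
REPORTED_COVERAGE = 0.689      # fraction of GDB-13 reproduced by the model

ideal = expected_coverage(GDB13_SIZE, SAMPLE_SIZE)
ratio = REPORTED_COVERAGE / ideal

print(f"ideal uniform coverage: {ideal:.3f}")
print(f"reported / ideal:       {ratio:.3f}")
```

Under these assumptions the ideal sampler covers roughly 87% of the database, so the reported 68.9% can be read against that ceiling rather than against 100%.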
Unsupervised Segmentation of Colonoscopy Images
Colonoscopy plays a crucial role in the diagnosis and prognosis of various
gastrointestinal diseases. Due to the challenges of collecting large-scale
high-quality ground truth annotations for colonoscopy images, and more
generally medical images, we explore using self-supervised features from vision
transformers in three challenging tasks for colonoscopy images. Our results
indicate that image-level features learned from DINO models achieve image
classification performance comparable to fully supervised models, and
patch-level features contain rich semantic information for object detection.
Furthermore, we demonstrate that self-supervised features combined with
unsupervised segmentation can be used to discover multiple clinically relevant
structures in a fully unsupervised manner, demonstrating the tremendous
potential of applying these methods in medical image analysis
Randomized SMILES strings improve the quality of molecular generative models
Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that define how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models that use LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and represent the target chemical space more accurately. Specifically, a model was trained with randomized SMILES that was able to generate almost all molecules from GDB-13 with a quasi-uniform probability. Models trained with smaller samples show an even bigger improvement when trained with randomized SMILES. Additionally, models were trained on molecules obtained from ChEMBL and illustrate again that training with randomized SMILES leads to models having a better representation of the drug-like chemical space. Namely, the model trained with randomized SMILES was able to generate at least double the number of unique molecules with the same distribution of properties compared to one trained with canonical SMILES
The Generated Databases (GDBs) as a Source of 3D-shaped Building Blocks for Use in Medicinal Chemistry and Drug Discovery
Drug discovery is in constant need of new molecules to develop drugs addressing unmet medical needs. To assess the chemical space available for drug design, our group investigates the generated databases (GDBs) listing all possible organic molecules up to a defined size, the largest of which is GDB-17 featuring 166.4 billion molecules up to 17 non-hydrogen atoms. While known drugs and bioactive compounds are mostly aromatic and planar, the GDBs contain a plethora of non-aromatic 3D-shaped molecules, which are very useful for drug discovery since they generally have more desirable absorption, distribution, metabolism, excretion and toxicity (ADMET) properties. Here we review GDB enumeration methods and the selection and synthesis of GDB molecules as modulators of ion channels. We summarize the constitution of GDB subsets focusing on fragments (FDB17), medicinal chemistry (GDBMedChem) and ChEMBL-like molecules (GDBChEMBL), and the ring system database GDB4c as a rich source of novel 3D-shaped chiral molecules containing quaternary centers, such as the recently reported trinorbornane
Exploring Chemical Space with Machine Learning
Chemical space is a concept to organize molecular diversity by postulating that different molecules occupy different regions of a mathematical space where the position of each molecule is defined by its properties. Our aim is to develop methods to explicitly explore chemical space in the area of drug discovery. Here we review our implementations of machine learning in this project, including our use of deep neural networks to enumerate the GDB13 database from a small sample set, to generate analogs of drugs and natural products after training with fragment-size molecules, and to predict the polypharmacology of molecules after training with known bioactive compounds from ChEMBL. We also discuss visualization methods for big data as means to keep track and learn from machine learning results. Computational tools discussed in this review are freely available at http://gdb.unibe.ch and https://github.com/reymond-group
A Potent and Selective Janus Kinase Inhibitor with a Chiral 3D‐Shaped Triquinazine Ring System from Chemical Space
The generated databases (GDBs) enumerate billions of possible molecules following simple rules of chemical stability and synthetic feasibility. Exploring the GDBs shows that many chiral, 3D‐shaped ring systems, often containing quaternary centers, have never been exploited for drug design. Shown herein is that such ring systems can be useful for medicinal chemistry by using the example of the enantioselective synthesis of triquinazine, a novel chiral piperazine analogue derived from angular triquinane. It is used to design a nanomolar and selective inhibitor of Janus Kinase 1 and is related to the marketed drug Tofacitinib, which is useful for treating autoimmune diseases
SMILES-based deep generative scaffold decorator for de-novo drug design
Molecular generative models trained with small sets of molecules represented as SMILES strings can generate large regions of the chemical space. Unfortunately, due to the sequential nature of SMILES strings, these models are not able to generate molecules given a scaffold (i.e., partially-built molecules with explicit attachment points). Herein we report a new SMILES-based molecular generative architecture that generates molecules from scaffolds and can be trained from any arbitrary molecular set. This approach is possible thanks to a new molecular set pre-processing algorithm that exhaustively slices all possible combinations of acyclic bonds of every molecule, combinatorially obtaining a large number of scaffolds with their respective decorations. Moreover, it serves as a data augmentation technique and can be readily coupled with randomized SMILES to obtain even better results with small sets. Two examples showcasing the potential of the architecture in medicinal and synthetic chemistry are described: First, models were trained with a training set obtained from a small set of Dopamine Receptor D2 (DRD2) active modulators and were able to meaningfully decorate a wide range of scaffolds and obtain molecular series predicted active on DRD2. Second, a larger set of drug-like molecules from ChEMBL was selectively sliced using synthetic chemistry constraints (RECAP rules). In this case, the resulting scaffolds with decorations were filtered to allow only those that included fragment-like decorations. This filtering process allowed models trained with this dataset to selectively decorate diverse scaffolds with fragments that were generally predicted to be synthesizable and attachable to the scaffold using known synthetic approaches. In both cases, the models were already able to decorate molecules using specific knowledge without the need to add it with other techniques, such as reinforcement learning. We envision that this architecture will become a useful addition to the existing architectures for de novo molecular generation
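The combinatorial slicing step described in this abstract can be sketched on a toy molecular graph. This is a minimal illustration only, not the published pre-processing algorithm: it enumerates every non-empty combination of acyclic bonds and lists the fragments left after cutting them, without choosing which fragment is the scaffold or marking attachment points. The graph encoding, the hand-supplied list of acyclic bonds, and the function names are all assumptions made for the example; a real pipeline would presumably rely on a cheminformatics toolkit.

```python
from itertools import combinations


def fragments_after_cut(n_atoms, bonds, cut):
    """Connected components (sorted tuples of atom indices) that remain
    after removing the bonds in `cut` from the molecular graph."""
    adjacency = {atom: set() for atom in range(n_atoms)}
    for a, b in bonds:
        if (a, b) not in cut:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        # Depth-first search collects one fragment.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(tuple(sorted(component)))
    return sorted(components)


def enumerate_slicings(n_atoms, bonds, acyclic_bonds):
    """Return (cut, fragments) for every non-empty subset of acyclic bonds."""
    result = []
    for size in range(1, len(acyclic_bonds) + 1):
        for cut in combinations(acyclic_bonds, size):
            result.append((cut, fragments_after_cut(n_atoms, bonds, set(cut))))
    return result


# Toy molecule: a three-membered ring (atoms 0, 1, 2) carrying a two-atom chain (3, 4).
BONDS = [(0, 1), (1, 2), (0, 2), (0, 3), (3, 4)]
ACYCLIC_BONDS = [(0, 3), (3, 4)]  # ring membership is supplied by hand here

slicings = enumerate_slicings(5, BONDS, ACYCLIC_BONDS)
for cut, fragments in slicings:
    print(cut, fragments)
```

A molecule with n acyclic bonds yields 2^n - 1 slicings in this sketch, which shows how a single structure can contribute many scaffold and decoration pairs and why the procedure also works as data augmentation.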