The Synthesizability of Molecules Proposed by Generative Models
The discovery of functional molecules is an expensive and time-consuming
process, exemplified by the rising costs of small molecule therapeutic
discovery. One class of techniques of growing interest for early-stage drug
discovery is de novo molecular generation and optimization, catalyzed by the
development of new deep learning approaches. These techniques can suggest novel
molecular structures intended to maximize a multi-objective function, e.g.,
suitability as a therapeutic against a particular target, without relying on
brute-force exploration of a chemical space. However, the utility of these
approaches is stymied by ignorance of synthesizability. To highlight the
severity of this issue, we use a data-driven computer-aided synthesis planning
program to quantify how often molecules proposed by state-of-the-art generative
models cannot be readily synthesized. Our analysis demonstrates that there are
several tasks for which these models generate unrealistic molecular structures
despite performing well on popular quantitative benchmarks. Synthetic
complexity heuristics can successfully bias generation toward
synthetically-tractable chemical space, although doing so necessarily detracts
from the primary objective. This analysis suggests that to improve the utility
of these models in real discovery workflows, new algorithm development is
warranted.
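The complexity-biasing trade-off described above can be made concrete with a scalarized reward that folds a synthetic-complexity heuristic (e.g., an SA-score-like estimate) into the primary objective. The sketch below is illustrative only: the function name, scores, and weight are invented, not the paper's method.

```python
def biased_reward(primary_score: float, complexity: float, weight: float = 0.5) -> float:
    """Scalarized reward trading the primary design objective against a
    synthetic-complexity penalty; a larger `weight` biases generation more
    strongly toward synthetically tractable chemical space."""
    return primary_score - weight * complexity

# Penalizing complexity necessarily lowers the achievable reward for
# hard-to-make candidates, illustrating the trade-off noted above:
penalized = biased_reward(0.9, 2.0)
unpenalized = biased_reward(0.9, 2.0, weight=0.0)
```

Setting `weight` to zero recovers the raw objective, which is exactly why biasing toward tractable structures "necessarily detracts from the primary objective."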
An algorithmic framework for synthetic cost-aware decision making in molecular design
Small molecules exhibiting desirable property profiles are often discovered
through an iterative process of designing, synthesizing, and testing sets of
molecules. The selection of molecules to synthesize from all possible
candidates is a complex decision-making process that typically relies on expert
chemist intuition. We propose a quantitative decision-making framework,
SPARROW, that prioritizes molecules for evaluation by balancing expected
information gain and synthetic cost. SPARROW integrates molecular design,
property prediction, and retrosynthetic planning to balance the utility of
testing a molecule with the cost of batch synthesis. We demonstrate through
three case studies that the developed algorithm captures the non-additive costs
inherent to batch synthesis, leverages common reaction steps and intermediates,
and scales to hundreds of molecules. SPARROW is open source and can be found at
http://github.com/coleygroup/sparrow
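The non-additive batch costs mentioned above arise because retrosynthetic routes can share reaction steps and intermediates, so a batch can be cheaper than the sum of its parts. A minimal sketch of that idea (not SPARROW's actual algorithm; the routes and costs are invented):

```python
def batch_cost(selected, routes, step_costs):
    """Cost of synthesizing the `selected` molecules as a batch, counting
    each shared reaction step only once (hence non-additive)."""
    steps = set()
    for mol in selected:
        steps.update(routes[mol])
    return sum(step_costs[s] for s in steps)

# Hypothetical routes: molecules A and B share reaction step "s2".
routes = {"A": {"s1", "s2"}, "B": {"s2", "s3"}, "C": {"s4"}}
step_costs = {"s1": 1.0, "s2": 2.0, "s3": 1.5, "s4": 3.0}

together = batch_cost({"A", "B"}, routes, step_costs)
separate = batch_cost({"A"}, routes, step_costs) + batch_cost({"B"}, routes, step_costs)
```

Because "s2" is counted once in the batch, `together` is cheaper than `separate`, which is the structure a batch-aware prioritization scheme can exploit.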
Computer-Aided Multi-Objective Optimization in Small Molecule Discovery
Molecular discovery is a multi-objective optimization problem that requires
identifying a molecule or set of molecules that balance multiple, often
competing, properties. Multi-objective molecular design is commonly addressed
by combining properties of interest into a single objective function using
scalarization, which imposes assumptions about relative importance and uncovers
little about the trade-offs between objectives. In contrast to scalarization,
Pareto optimization does not require knowledge of relative importance and
reveals the trade-offs between objectives. However, it introduces additional
considerations in algorithm design. In this review, we describe pool-based and
de novo generative approaches to multi-objective molecular discovery with a
focus on Pareto optimization algorithms. We show how pool-based molecular
discovery is a relatively direct extension of multi-objective Bayesian
optimization and how the plethora of different generative models extends from
single-objective to multi-objective optimization in similar ways using
non-dominated sorting in the reward function (reinforcement learning) or to
select molecules for retraining (distribution learning) or propagation (genetic
algorithms). Finally, we discuss some remaining challenges and opportunities in
the field, emphasizing the opportunity to adopt Bayesian optimization
techniques into multi-objective de novo design.
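The non-dominated sorting mentioned above reduces to a simple Pareto-domination test. A minimal sketch for two objectives to be maximized, with invented candidate scores (e.g., hypothetical potency and synthesizability pairs):

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every objective and
    strictly better in at least one (maximization convention)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Indices of the non-dominated points (the first Pareto front)."""
    return [i for i, a in enumerate(scores)
            if not any(dominates(b, a) for j, b in enumerate(scores) if j != i)]

# Four hypothetical molecules scored on (potency, synthesizability):
scores = [(0.9, 0.2), (0.6, 0.6), (0.3, 0.9), (0.5, 0.5)]
front = pareto_front(scores)  # (0.5, 0.5) is dominated by (0.6, 0.6)
```

Unlike a scalarized objective, the front retains all of the trade-off points; a reward function or retraining-selection step can then rank molecules by the front they fall on.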
Uncertainty Quantification Using Neural Networks for Molecular Property Prediction
Uncertainty quantification (UQ) is an important component of molecular
property prediction, particularly for drug discovery applications where model
predictions direct experimental design and where unanticipated imprecision
wastes valuable time and resources. The need for UQ is especially acute for
neural models, which are becoming increasingly standard yet are challenging to
interpret. While several approaches to UQ have been proposed in the literature,
there is no clear consensus on the comparative performance of these models. In
this paper, we study this question in the context of regression tasks. We
systematically evaluate several methods on five benchmark datasets using
multiple complementary performance metrics. Our experiments show that none of
the methods we tested is unequivocally superior to all others, and none
produces a particularly reliable ranking of errors across multiple datasets.
While we believe these results show that existing UQ methods are not sufficient
for all common use-cases and demonstrate the benefits of further research, we
conclude with a practical recommendation as to which existing techniques seem
to perform well relative to others.
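One of the standard UQ approaches compared in studies like this is the deep ensemble, where the spread of predictions across independently trained models serves as the uncertainty estimate. A stdlib-only sketch, with placeholder functions standing in for trained networks:

```python
import statistics

def ensemble_predict(models, x):
    """Mean prediction and population standard deviation across ensemble
    members; the spread is used as the uncertainty estimate."""
    preds = [m(x) for m in models]
    return statistics.mean(preds), statistics.pstdev(preds)

# Placeholder members standing in for networks trained with different seeds:
models = [lambda x: 1.0 * x, lambda x: 1.1 * x, lambda x: 0.9 * x]
mean, uncertainty = ensemble_predict(models, 2.0)
```

Whether such a spread actually ranks errors reliably is precisely the empirical question the evaluation above addresses.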
Generating Molecular Fragmentation Graphs with Autoregressive Neural Networks
The accurate prediction of tandem mass spectra from molecular structures has
the potential to unlock new metabolomic discoveries by augmenting the
community's libraries of experimental reference standards. Cheminformatic
spectrum prediction strategies use a "bond-breaking" framework to iteratively
simulate mass spectrum fragmentations, but these methods are (a) slow, due to
the need to exhaustively and combinatorially break molecules and (b)
inaccurate, as they often rely upon heuristics to predict the intensity of each
resulting fragment; neural network alternatives mitigate computational cost but
are black-box and not inherently more accurate. We introduce a
physically-grounded neural approach that learns to predict each breakage event
and score the most relevant subset of molecular fragments quickly and
accurately. We evaluate our model by predicting spectra from both public and
private standard libraries, demonstrating that our hybrid approach offers
state-of-the-art prediction accuracy, improved metabolite identification from a
database of candidates, and higher interpretability when compared to previous
breakage methods and black-box neural networks. The grounding of our approach
in physical fragmentation events shows especially high promise for elucidating
natural product molecules with more complex scaffolds.
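The exhaustive "bond-breaking" baseline critiqued above can be illustrated on a toy graph: enumerating fragments by deleting bond subsets grows combinatorially with the number of breaks, which is why those methods are slow. The graph below is invented; real tools operate on chemical structures and additionally score fragment intensities.

```python
from itertools import combinations

def connected_components(nodes, bonds):
    """Fragments (connected components) remaining after bonds are broken."""
    adj = {n: set() for n in nodes}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    comps, seen = [], set()
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            seen.add(cur)
            stack.extend(adj[cur] - comp)
        comps.append(frozenset(comp))
    return comps

def enumerate_fragments(nodes, bonds, max_breaks=2):
    """All fragments reachable by breaking up to `max_breaks` bonds; the
    number of bond subsets examined grows combinatorially in `max_breaks`."""
    frags = set()
    for k in range(max_breaks + 1):
        for broken in combinations(bonds, k):
            remaining = [b for b in bonds if b not in broken]
            frags.update(connected_components(nodes, remaining))
    return frags

# A 4-atom chain: breaking up to 2 of its 3 bonds yields every sub-chain.
nodes = [0, 1, 2, 3]
bonds = [(0, 1), (1, 2), (2, 3)]
frags = enumerate_fragments(nodes, bonds)
```

A learned model that predicts only the relevant breakage events avoids enumerating all of these subsets, which is the efficiency argument made above.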
De novo PROTAC design using graph-based deep generative models
PROteolysis TArgeting Chimeras (PROTACs) are an emerging therapeutic modality
for degrading a protein of interest (POI) by marking it for degradation by the
proteasome. Recent developments in artificial intelligence (AI) suggest that
deep generative models can assist with the de novo design of molecules with
desired properties, and their application to PROTAC design remains largely
unexplored. We show that a graph-based generative model can be used to propose
novel PROTAC-like structures from empty graphs. Our model can be guided towards
the generation of large molecules (30–140 heavy atoms) predicted to degrade a
POI through policy-gradient reinforcement learning (RL). Rewards during RL are
applied using a boosted tree surrogate model that predicts a molecule's
degradation potential for each POI. Using this approach, we steer the
generative model towards compounds with higher likelihoods of predicted
degradation activity. Despite being trained on sparse public data, the
generative model proposes molecules with substructures found in known
degraders. After fine-tuning, predicted activity against a challenging POI
increases from 50% to >80% with near-perfect chemical validity for sampled
compounds, suggesting this is a promising approach for the optimization of
large, PROTAC-like molecules for targeted protein degradation.
Comment: Presented at NeurIPS 2022 AI4Science Workshop
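The policy-gradient shaping described above can be illustrated with a toy REINFORCE loop: a surrogate reward model scores samples, and the gradient of the log-probability weighted by that reward pushes the policy toward rewarded outcomes. The one-parameter Bernoulli "policy" and surrogate below are invented stand-ins, not the paper's graph-based model or boosted-tree predictor.

```python
import math
import random

def surrogate_reward(sample: int) -> float:
    """Stand-in for a surrogate activity predictor: rewards outcome 1."""
    return 1.0 if sample == 1 else 0.0

def reinforce_step(theta: float, lr: float = 0.5) -> float:
    """One REINFORCE update for a Bernoulli policy p = sigmoid(theta):
    grad log p(sample) * reward increases the probability of rewarded samples."""
    p = 1.0 / (1.0 + math.exp(-theta))
    sample = 1 if random.random() < p else 0
    grad_logp = sample - p  # d/dtheta of log Bernoulli(sample; sigmoid(theta))
    return theta + lr * grad_logp * surrogate_reward(sample)

random.seed(0)
theta = 0.0
for _ in range(200):
    theta = reinforce_step(theta)
p_final = 1.0 / (1.0 + math.exp(-theta))  # probability of the rewarded outcome
```

After training, the policy samples the rewarded outcome with high probability, mirroring how the generative model is steered toward compounds with higher predicted degradation activity.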
Reaction profiles for quantum chemistry-computed [3 + 2] cycloaddition reactions
Bio-orthogonal click chemistry based on [3 + 2] dipolar cycloadditions has had a profound impact on the field of biochemistry, and significant effort has been devoted to identifying promising new candidate reactions for this purpose. To gauge whether a prospective reaction could be a suitable bio-orthogonal click reaction, information about both on- and off-target activation and reaction energies is highly valuable. Here, we use an automated workflow, based on the autodE program, to compute over 5000 reaction profiles for [3 + 2] cycloadditions involving both synthetic dipolarophiles and a set of biologically-inspired structural motifs. Based on a succinct benchmarking study, the B3LYP-D3(BJ)/def2-TZVP//B3LYP-D3(BJ)/def2-SVP level of theory was selected for the DFT calculations, and standard conditions and an (aqueous) SMD model were imposed to mimic physiological conditions. We believe that this data, as well as the presented workflow for high-throughput reaction profile computation, will be useful to screen for new bio-orthogonal reactions, as well as for the development of novel machine learning models for the prediction of chemical reactivity more broadly.
Data Sharing in Chemistry: Lessons Learned and a Case for Mandating Structured Reaction Data
The past decade has seen a number of impressive developments in predictive chemistry and reaction informatics driven by machine learning applications to computer-aided synthesis planning. While many of these developments have been made even with relatively small, bespoke datasets, in order to advance the role of AI in the field at scale, there must be significant improvements in the reporting of reaction data. Currently, the majority of publicly available data is reported in an unstructured format and heavily imbalanced toward high-yielding reactions, which influences the types of models that can be successfully trained. In this Perspective, we analyze several data curation and sharing initiatives that have seen success in chemistry and molecular biology. We discuss several factors that have contributed to their success and how we can take lessons from these case studies and apply them to reaction data. Finally, we spotlight the Open Reaction Database and summarize key actions the community can take toward making reaction data more findable, accessible, interoperable, and reusable (FAIR), including the use of mandates from funding agencies and publishers.