Automated patent extraction powers generative modeling in focused chemical spaces
Deep generative models have emerged as an exciting avenue for inverse
molecular design, with progress coming from the interplay between training
algorithms and molecular representations. One of the key challenges in their
applicability to materials science and chemistry has been the lack of access to
sizeable training datasets with property labels. Published patents contain the
first disclosure of new materials prior to their publication in journals, and
are a vast source of scientific knowledge that has remained relatively untapped
in the field of data-driven molecular design. Because patents are filed seeking
to protect specific uses, molecules in patents can be considered to be weakly
labeled into application classes. Furthermore, patents published by the US
Patent and Trademark Office (USPTO) are downloadable and have machine-readable
text and molecular structures. In this work, we train domain-specific
generative models using patent data sources by developing an automated pipeline
to go from USPTO patent digital files to the generation of novel candidates
with minimal human intervention. We test the approach on two in-class extracted
datasets, one in organic electronics and another in tyrosine kinase inhibitors.
We then evaluate generative models trained on these in-class datasets on two
categories of tasks (distribution learning and property optimization), identify
their strengths and limitations, and suggest possible explanations and remedies
that could be used to overcome these limitations in practice.
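As a concrete illustration of the kind of curation step such a pipeline needs between patent-extracted structures and model training, the sketch below canonicalizes and weakly labels extracted molecules with RDKit. The file names, column names, and the use of patent classification codes as application-class labels are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of curating patent-extracted structures into a weakly labeled
# training set. Input/output file names and columns are hypothetical.
import csv
from rdkit import Chem

def curate(raw_csv="uspto_extracted.csv", out_csv="weakly_labeled.csv"):
    seen = set()
    with open(raw_csv) as fin, open(out_csv, "w", newline="") as fout:
        reader = csv.DictReader(fin)              # expects columns: smiles, cpc_code
        writer = csv.writer(fout)
        writer.writerow(["smiles", "application_class"])
        for row in reader:
            mol = Chem.MolFromSmiles(row["smiles"])
            if mol is None:                       # drop unparsable structures
                continue
            can = Chem.MolToSmiles(mol)           # canonical form for deduplication
            if can in seen:
                continue
            seen.add(can)
            # weak label: the patent's classification code stands in for an application class
            writer.writerow([can, row["cpc_code"]])
```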
Bridging the Homogeneous-Heterogeneous Divide: Modeling Spin for Reactivity in Single Atom Catalysis
Single atom catalysts (SACs) are emergent catalytic materials that have the promise of merging the scalability of heterogeneous catalysts with the high activity and atom economy of homogeneous catalysts. Computational, first-principles modeling can provide essential insight into SAC mechanism and active site configuration, where the sub-nm-scale environment can challenge even the highest-resolution experimental spectroscopic techniques. Nevertheless, the very properties that make SACs attractive in catalysis, such as localized d electrons of the isolated transition metal center, make them challenging to study with conventional computational modeling using density functional theory (DFT). For example, Fe/N-doped graphitic SACs have exhibited spin-state dependent reactivity that remains poorly understood. However, spin-state ordering in DFT is very sensitive to the nature of the functional approximation chosen. In this work, we develop accurate benchmarks from correlated wavefunction theory (WFT) for relevant octahedral complexes. We use those benchmarks to evaluate optimal DFT functional choice for predicting spin state ordering in small octahedral complexes as well as models of pyridinic and pyrrolic nitrogen environments expected in larger SACs. Using these guidelines, we determine Fe/N-doped graphene SAC model properties and reactivity as well as their sensitivities to DFT functional choice. Finally, we conclude with broad recommendations for computational modeling of open-shell transition metal single-atom catalysts
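A minimal sketch of the benchmark comparison described above: adiabatic spin-splitting energies are formed from low-spin and high-spin DFT total energies and compared against a correlated-WFT reference to rank functionals. The energy inputs, functional names, and reference value are user-supplied placeholders, not data from this work.

```python
# Compare DFT spin-splitting energies against a correlated-WFT reference.
HARTREE_TO_KCAL = 627.509

def spin_splitting(e_ls_hartree, e_hs_hartree):
    """High-spin minus low-spin adiabatic gap, converted to kcal/mol."""
    return (e_hs_hartree - e_ls_hartree) * HARTREE_TO_KCAL

def rank_functionals(dft_energies, wft_reference_kcal):
    """dft_energies: {functional: (E_LS, E_HS)} in Hartree; reference in kcal/mol."""
    errors = {f: abs(spin_splitting(*e) - wft_reference_kcal)
              for f, e in dft_energies.items()}
    return sorted(errors.items(), key=lambda kv: kv[1])   # best-agreeing functional first
```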
Singlet versus Triplet Reactivity in an Mn(V)-Oxo Species: Testing Theoretical Predictions Against Experimental Evidence
Discerning the factors
that control the reactivity of high-valent
metal–oxo species is critical to both an understanding of metalloenzyme
reactivity and related transition metal catalysts. Computational studies
have suggested that an excited higher spin state in a number of metal–oxo
species can provide a lower energy barrier for oxidation reactions,
leading to the conclusion that this unobserved higher spin state complex
should be considered as the active oxidant. However, testing these
computational predictions by experiment is difficult and has rarely
been accomplished. Herein, we describe a detailed computational study
on the role of spin state in the reactivity of a high-valent manganese(V)–oxo
complex with para-Z-substituted thioanisoles and utilize experimental
evidence to distinguish between the theoretical results. The calculations
show an unusual change in mechanism occurs for the dominant singlet
spin state that correlates with the electron-donating property of
the para-Z substituent, while this change is not observed on the triplet
spin state. Minimum energy crossing point calculations predict small
spin–orbit coupling constants, making the spin state change
from low spin to high spin unlikely. The trends in reactivity for
the para-Z-substituted thioanisole derivatives provide an experimental
measure for the spin state reactivity in manganese–oxo corrolazine
complexes. Hence, the calculations show that the V-shaped Hammett
plot is reproduced by the singlet surface but not by the triplet state
trend. The substituent effect is explained with valence bond models,
which confirm a change from an electrophilic to a nucleophilic mechanism
through a change of substituent.
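A brief sketch of the Hammett analysis invoked above: the slope ρ is fit separately for electron-donating and electron-withdrawing para-Z substituents, and a change of sign between the two branches is what produces a V-shaped plot. The σ values and relative rate constants are user-supplied inputs, not data from this study.

```python
# Fit the two arms of a V-shaped Hammett plot (log(k_Z/k_H) versus sigma_para).
import numpy as np

def hammett_slopes(sigma_p, log_rel_rate):
    """Return rho for the electron-donating and electron-withdrawing branches.

    Each branch needs at least two substituents for the linear fit.
    """
    sigma_p = np.asarray(sigma_p, dtype=float)
    log_k = np.asarray(log_rel_rate, dtype=float)
    donating = sigma_p < 0.0
    rho_donating = np.polyfit(sigma_p[donating], log_k[donating], 1)[0]
    rho_withdrawing = np.polyfit(sigma_p[~donating], log_k[~donating], 1)[0]
    return rho_donating, rho_withdrawing   # opposite signs indicate a mechanism change
```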
Numerical Nuclear Second Derivatives on a Computing Grid: Enabling and Accelerating Frequency Calculations on Complex Molecular Systems
The
computation of nuclear second derivatives of energy, or the
nuclear Hessian, is an essential routine in quantum chemical investigations
of ground and transition states, thermodynamic calculations, and molecular
vibrations. Analytic nuclear Hessian computations require the resolution
of costly coupled-perturbed self-consistent field (CP-SCF) equations,
while numerical differentiation of analytic first derivatives has
an unfavorable 6N (N = number of
atoms) prefactor. Herein, we present a new method in which grid computing
is used to accelerate and/or enable the evaluation of the nuclear
Hessian via numerical differentiation: NUMFREQ@Grid. Nuclear Hessians
were successfully evaluated by NUMFREQ@Grid at the DFT level as well
as using RIJCOSX-ZORA-MP2 or RIJCOSX-ZORA-B2PLYP for a set of linear
polyacenes with systematically increasing size. For the larger members
of this group, NUMFREQ@Grid was found to outperform the wall clock
time of analytic Hessian evaluation; at the MP2 or B2PLYP levels, these
Hessians cannot even be evaluated analytically. We also evaluated
a 156-atom catalytically relevant open-shell transition metal complex
and found that NUMFREQ@Grid is faster (7.7 times shorter wall clock
time) and less demanding (4.4 times less memory requirement) than
an analytic Hessian. Capitalizing on the capabilities of parallel
grid computing, NUMFREQ@Grid can outperform analytic methods in terms
of wall time, memory requirements, and treatable system size. The
NUMFREQ@Grid method presented herein demonstrates how grid computing
can be used to facilitate embarrassingly parallel computational procedures
and serves as a pioneer for future implementations.
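A minimal sketch of the displacement-and-gradient scheme that such a grid approach parallelizes: each of the 6N displaced geometries yields one analytic gradient, and pairs of gradients are combined into Hessian columns by central differences. The `gradient_fn` callable and the local process pool are stand-ins for the quantum chemistry code and the computing grid; this is not the NUMFREQ@Grid implementation itself.

```python
# Numerical Hessian by central differences of analytic gradients,
# with the 6N gradient evaluations run as embarrassingly parallel jobs.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def numerical_hessian(gradient_fn, coords, step=5e-3, workers=4):
    """gradient_fn maps a flat 3N coordinate array to a flat 3N gradient."""
    coords = np.asarray(coords, dtype=float).ravel()      # 3N Cartesian coordinates
    n = coords.size
    displaced = []
    for j in range(n):                                    # 6N single-point gradient jobs
        for sign in (+1.0, -1.0):
            x = coords.copy()
            x[j] += sign * step
            displaced.append(x)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        grads = list(pool.map(gradient_fn, displaced))    # gradient_fn must be picklable
    hessian = np.empty((n, n))
    for j in range(n):
        g_plus, g_minus = np.asarray(grads[2 * j]), np.asarray(grads[2 * j + 1])
        hessian[:, j] = (g_plus - g_minus) / (2.0 * step)
    return 0.5 * (hessian + hessian.T)                    # symmetrize numerical noise
```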
Active Learning and Neural Network Potentials Accelerate Molecular Screening of Ether-based Solvate Ionic Liquids
Solvate ionic liquids (SIL) have promising applications as electrolyte materials. Despite the broad design space of oligoether ligands, most reported SILs are based on simple tri- and tetraglyme. Here, we describe a computational search for complex ethers that can better stabilize SILs. Through active learning, a neural network interatomic potential is trained from density functional theory data. The learned potential fulfills two key requirements: transferability across composition space, and high speed and accuracy to find low-energy ligand-ion poses across configurational space. Candidate ether ligands for Li+, Mg2+ and Na+ SILs with higher binding affinity and electrochemical stability than the reference compounds are identified. Lastly, their properties are related to the geometry of the coordination sphere.
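A small sketch of the final screening step, assuming a trained potential is available as an `energy(structure)` callable (a stand-in, not the paper's model): each candidate ether is scored by the binding energy of its most favourable ligand-ion pose relative to the separated ligand and bare cation, and candidates are ranked by that score.

```python
# Rank candidate ether ligands by ligand-ion binding energy.
def binding_energy(energy, complex_poses, ligand, ion):
    """Most favourable (most negative) binding energy over all sampled poses."""
    e_ligand = energy(ligand)
    e_ion = energy(ion)
    return min(energy(pose) - e_ligand - e_ion for pose in complex_poses)

def rank_candidates(energy, candidates, ion):
    """candidates: {name: (list_of_complex_poses, ligand_structure)}."""
    scores = {name: binding_energy(energy, poses, ligand, ion)
              for name, (poses, ligand) in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])   # strongest binders first
```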
Inverse Design of Ligands Using A Deep Generative Model Semi-supervised by A Data-driven Ligand Field Strength Metric
Transition metal (TM) complexes exhibit diverse structural and electronic properties. The properties of a TM complex can be tuned by modulating the ligand field strength (LFS) exerted by its ligands. Current quantification of the LFS of a ligand is mainly derived from experimental measurements on a subset of highly symmetrical TM complexes and is limited in ligand scope.
Herein, we report a data-driven method to quantify the LFS of ligands derived from experimental crystal structures of TM complexes. We first show that the experimental metal-ligand bond lengths of over 4000 mononuclear Fe, Co, and Mn complexes form bimodal distributions. Gaussian fits to these bimodal distributions are used to assign each TM complex a spin state label. These spin state labels can then be used to calculate the LFS of the ligands of the complexes.
Using the obtained data-driven LFS metric, we establish that a semi-supervised deep generative model, the junction tree variational autoencoder (JTVAE), can be employed to predict LFS values. Our model exhibits a mean absolute error (MAE) of 0.047 and a root mean squared error of 0.072 on the training set. The model also allows the generation of novel ligands with desirable LFS values.
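A minimal sketch of the spin-state labelling step described above, assuming one averaged metal-ligand bond length per complex and using a two-component Gaussian mixture (scikit-learn) in place of the paper's exact fitting procedure; the subsequent LFS calculation is not reproduced here.

```python
# Assign LS/HS labels from a bimodal distribution of mean metal-ligand bond lengths.
import numpy as np
from sklearn.mixture import GaussianMixture

def assign_spin_labels(mean_ml_bond_lengths):
    """mean_ml_bond_lengths: 1D array, one averaged M-L distance per complex."""
    x = np.asarray(mean_ml_bond_lengths, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    components = gmm.predict(x)
    low_spin_component = int(np.argmin(gmm.means_.ravel()))   # shorter bonds => low spin
    return np.where(components == low_spin_component, "LS", "HS")
```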
A Quantitative Uncertainty Metric Controls Error in Neural Network-Driven Chemical Discovery
Machine learning (ML) models, such as artificial neural networks, have emerged as a complement to high-throughput screening, enabling characterization of new compounds in seconds instead of hours. The promise of ML models to enable large-scale chemical space exploration can only be realized if it is straightforward to identify when molecules and materials are outside the model's domain of applicability. Established uncertainty metrics for neural network models are either costly to obtain (e.g., ensemble models) or rely on feature engineering (e.g., feature space distances), and each has limitations in estimating prediction errors for chemical space exploration. We introduce the distance to available data in the latent space of a neural network ML model as a low-cost, quantitative uncertainty metric that works for both inorganic and organic chemistry. The calibrated performance of this approach exceeds widely used uncertainty metrics and is readily applied to models of increasing complexity at no additional cost. Tightening latent distance cutoffs systematically drives down predicted model errors below training errors, thus enabling predictive error control in chemical discovery or identification of useful data points for active learning.
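A compact sketch of the latent-distance idea: for each query compound, compute the Euclidean distance from its latent representation to the nearest training point and flag queries beyond a chosen cutoff as outside the domain of applicability. The calibration of cutoffs against model errors described above is not shown.

```python
# Nearest-neighbour latent-space distance as an uncertainty proxy.
import numpy as np

def latent_distances(train_latent, query_latent):
    """Distance from each query point to its nearest training point in latent space."""
    train = np.asarray(train_latent, dtype=float)           # shape (n_train, d)
    query = np.asarray(query_latent, dtype=float)           # shape (n_query, d)
    diffs = query[:, None, :] - train[None, :, :]            # pairwise differences
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

def flag_out_of_domain(train_latent, query_latent, cutoff):
    """True where a query lies beyond the chosen latent-distance cutoff."""
    return latent_distances(train_latent, query_latent) > cutoff
```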
A Joint Semi-Supervised Variational Autoencoder and Transfer Learning Model for Designing Molecular Transition Metal Complexes
Deep generative models (DGMs) have shown great promise in the generation of organic molecules and inorganic materials with chemically sensible structures and optimized properties. However, they have seen little application to transition metal (TM) complexes, owing to the flexible coordination environments and multiple accessible oxidation and spin states of these complexes, despite their importance in fine chemical synthesis, commodity production, and optical applications. Herein, we propose a joint semi-supervised junction-tree variational autoencoder (SSVAE) and artificial neural network (ANN) classifier model, coined LiveTransForM (Ligand variational auto-encoder and Transfer learning For transition Metal complexes), for the design of octahedral TM complexes. LiveTransForM allows the design of ligands that build up TM complexes and the prediction of the spin states of the assembled complexes. We show that the accuracy of the classifier is improved when the latent variables from the SSVAE are used as input for the ANN model compared to those from the unsupervised VAE. Input augmentation using the three molecular axes also improves the accuracy of the classifier. A total of 58 complexes with predicted spin states are then generated by LiveTransForM, and the accuracy of their spin state labels is validated by density functional theory methods. Two design strategies, single mutation and seeded generation, are also introduced to allow the directed evolution of a parent complex towards a desirable spin state and the local modification of seed complexes with similar spin states, respectively.
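A small sketch of the downstream classifier component, assuming latent vectors from the SSVAE encoder and the three molecular-axis features are available as tensors; the layer sizes and class count are illustrative, and the SSVAE itself is not reproduced here.

```python
# Feed-forward classifier over SSVAE latent vectors augmented with axis features.
import torch
import torch.nn as nn

class SpinStateClassifier(nn.Module):
    def __init__(self, latent_dim, n_axis_features=3, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + n_axis_features, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),      # logits over spin-state classes
        )

    def forward(self, latent, axis_features):
        # concatenate latent variables with the molecular-axis augmentation
        return self.net(torch.cat([latent, axis_features], dim=-1))
```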