35 research outputs found

    Automated patent extraction powers generative modeling in focused chemical spaces

    Deep generative models have emerged as an exciting avenue for inverse molecular design, with progress coming from the interplay between training algorithms and molecular representations. One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels. Published patents contain the first disclosure of new materials prior to their publication in journals, and are a vast source of scientific knowledge that has remained relatively untapped in the field of data-driven molecular design. Because patents are filed seeking to protect specific uses, molecules in patents can be considered to be weakly labeled into application classes. Furthermore, patents published by the US Patent and Trademark Office (USPTO) are downloadable and have machine-readable text and molecular structures. In this work, we train domain-specific generative models using patent data sources by developing an automated pipeline to go from USPTO patent digital files to the generation of novel candidates with minimal human intervention. We test the approach on two in-class extracted datasets, one in organic electronics and another in tyrosine kinase inhibitors. We then evaluate the ability of generative models trained on these in-class datasets on two categories of tasks (distribution learning and property optimization), identify strengths and limitations, and suggest possible explanations and remedies that could be used to overcome these in practice

    Bridging the Homogeneous-Heterogeneous Divide: Modeling Spin for Reactivity in Single Atom Catalysis

    Single atom catalysts (SACs) are emergent catalytic materials that have the promise of merging the scalability of heterogeneous catalysts with the high activity and atom economy of homogeneous catalysts. Computational, first-principles modeling can provide essential insight into SAC mechanism and active site configuration, where the sub-nm-scale environment can challenge even the highest-resolution experimental spectroscopic techniques. Nevertheless, the very properties that make SACs attractive in catalysis, such as localized d electrons of the isolated transition metal center, make them challenging to study with conventional computational modeling using density functional theory (DFT). For example, Fe/N-doped graphitic SACs have exhibited spin-state dependent reactivity that remains poorly understood. However, spin-state ordering in DFT is very sensitive to the nature of the functional approximation chosen. In this work, we develop accurate benchmarks from correlated wavefunction theory (WFT) for relevant octahedral complexes. We use those benchmarks to evaluate optimal DFT functional choice for predicting spin state ordering in small octahedral complexes as well as models of pyridinic and pyrrolic nitrogen environments expected in larger SACs. Using these guidelines, we determine Fe/N-doped graphene SAC model properties and reactivity as well as their sensitivities to DFT functional choice. Finally, we conclude with broad recommendations for computational modeling of open-shell transition metal single-atom catalysts

    Singlet versus Triplet Reactivity in an Mn(V)-Oxo Species: Testing Theoretical Predictions Against Experimental Evidence

    Discerning the factors that control the reactivity of high-valent metal–oxo species is critical to both an understanding of metalloenzyme reactivity and related transition metal catalysts. Computational studies have suggested that an excited higher spin state in a number of metal–oxo species can provide a lower energy barrier for oxidation reactions, leading to the conclusion that this unobserved higher spin state complex should be considered as the active oxidant. However, testing these computational predictions by experiment is difficult and has rarely been accomplished. Herein, we describe a detailed computational study on the role of spin state in the reactivity of a high-valent manganese(V)–oxo complex with para-Z-substituted thioanisoles and utilize experimental evidence to distinguish between the theoretical results. The calculations show an unusual change in mechanism occurs for the dominant singlet spin state that correlates with the electron-donating property of the para-Z substituent, while this change is not observed on the triplet spin state. Minimum energy crossing point calculations predict small spin–orbit coupling constants making the spin state change from low spin to high spin unlikely. The trends in reactivity for the para-Z-substituted thioanisole derivatives provide an experimental measure for the spin state reactivity in manganese–oxo corrolazine complexes. Hence, the calculations show that the V-shaped Hammett plot is reproduced by the singlet surface but not by the triplet state trend. The substituent effect is explained with valence bond models, which confirm a change from an electrophilic to a nucleophilic mechanism through a change of substituent.

    Numerical Nuclear Second Derivatives on a Computing Grid: Enabling and Accelerating Frequency Calculations on Complex Molecular Systems

    The computation of nuclear second derivatives of energy, or the nuclear Hessian, is an essential routine in quantum chemical investigations of ground and transition states, thermodynamic calculations, and molecular vibrations. Analytic nuclear Hessian computations require the resolution of costly coupled-perturbed self-consistent field (CP-SCF) equations, while numerical differentiation of analytic first derivatives has an unfavorable 6N (N = number of atoms) prefactor. Herein, we present a new method in which grid computing is used to accelerate and/or enable the evaluation of the nuclear Hessian via numerical differentiation: NUMFREQ@Grid. Nuclear Hessians were successfully evaluated by NUMFREQ@Grid at the DFT level as well as using RIJCOSX-ZORA-MP2 or RIJCOSX-ZORA-B2PLYP for a set of linear polyacenes with systematically increasing size. For the larger members of this group, NUMFREQ@Grid was found to outperform the wall clock time of analytic Hessian evaluation; at the MP2 or B2PLYP levels, these Hessians cannot even be evaluated analytically. We also evaluated a 156-atom catalytically relevant open-shell transition metal complex and found that NUMFREQ@Grid is faster (7.7 times shorter wall clock time) and less demanding (4.4 times less memory requirement) than an analytic Hessian. Capitalizing on the capabilities of parallel grid computing, NUMFREQ@Grid can outperform analytic methods in terms of wall time, memory requirements, and treatable system size. The NUMFREQ@Grid method presented herein demonstrates how grid computing can be used to facilitate embarrassingly parallel computational procedures and paves the way for future implementations.
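
    The 6N prefactor mentioned in this abstract can be illustrated with a minimal sketch of a central-difference Hessian: each of the 3N Cartesian coordinates is displaced forward and backward, giving 6N independent gradient evaluations that can run concurrently. This is not the NUMFREQ@Grid implementation; the function name, the step size, and the use of a local thread pool in place of a computing grid are assumptions made for illustration.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def numerical_hessian(gradient, coords, step=1e-3, max_workers=4):
    """Central-difference Hessian built from an analytic gradient function.

    gradient: callable mapping a flat (3N,) coordinate array to a (3N,) gradient.
    coords:   flat (3N,) array of Cartesian coordinates.
    Returns the symmetrized Hessian and the number of gradient evaluations.
    (Hypothetical helper; step and max_workers are illustrative defaults.)
    """
    x0 = np.asarray(coords, dtype=float)
    n = x0.size  # 3N degrees of freedom

    # Build the 2 * 3N = 6N displaced geometries (+step and -step per coordinate).
    displaced = []
    for i in range(n):
        for sign in (1.0, -1.0):
            x = x0.copy()
            x[i] += sign * step
            displaced.append(x)

    # Every gradient is independent of the others, so the evaluations are
    # embarrassingly parallel; a thread pool stands in for a computing grid.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        grads = list(pool.map(gradient, displaced))

    hess = np.empty((n, n))
    for i in range(n):
        # Row i approximates d(gradient)/d(x_i) by a central difference.
        hess[i] = (grads[2 * i] - grads[2 * i + 1]) / (2.0 * step)

    # Symmetrize to average out finite-difference noise.
    return 0.5 * (hess + hess.T), len(displaced)
```

    For a quadratic model energy with gradient A·x, the sketch recovers A and uses 12 gradient evaluations for a two-atom (six-coordinate) system.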

    Active Learning and Neural Network Potentials Accelerate Molecular Screening of Ether-based Solvate Ionic Liquids

    Solvate Ionic Liquids (SIL) have promising applications as electrolyte materials. Despite the broad design space of oligoether ligands, most reported SILs are based on simple tri- and tetraglyme. Here, we describe a computational search for complex ethers that can better stabilize SILs. Through active learning, a neural network interatomic potential is trained from density functional theory data. The learned potential fulfills two key requirements: transferability across composition space, and high speed and accuracy to find low-energy ligand-ion poses across configurational space. Candidate ether ligands for Li+, Mg2+ and Na+ SILs with higher binding affinity and electrochemical stability than the reference compounds are identified. Lastly, their properties are related to the geometry of the coordination sphere.

    Active learning and neural network potentials accelerate molecular screening of ether-based solvate ionic liquids

    Solvate ionic liquids (SIL) have promising applications as electrolyte materials. Despite the broad design space of oligoether ligands, most reported SILs are based on simple tri- and tetraglyme. Here, we describe a computational search for complex ethers that can better stabilize SILs. Through active learning, a neural network interatomic potential is trained from density functional theory data. The learned potential fulfills two key requirements: transferability across composition space, and high speed and accuracy to find low-energy ligand-ion poses across configurational space. Candidate ether ligands for Li+, Mg2+ and Na+ SILs with higher binding affinity and electrochemical stability than the reference compounds are identified. Lastly, their properties are related to the geometry of the coordination sphere

    Inverse Design of Ligands Using A Deep Generative Model Semi-supervised by A Data-driven Ligand Field Strength Metric

    Transition metal (TM) complexes exhibit diverse structural and electronic properties. The properties of a TM complex can be tuned by modulating the ligand field strength (LFS) exerted by its ligands. Current quantification of the LFS of a ligand is mainly derived from experimental measurements on a subset of highly symmetrical TM complexes and is limited in ligand scope. Herein, we report a data-driven method to quantify the LFS of ligands assigned from experimental crystal structures of TM complexes. We first show that the experimental metal-ligand bond lengths of over 4000 mononuclear Fe, Co, and Mn complexes form bimodal distributions. Using Gaussian fits on the bimodal distributions, each TM complex is assigned a spin state label. These spin state labels can then be used to calculate the LFS of the ligands of the complexes. Using the obtained data-driven LFS metric, we establish that a semi-supervised deep generative model, junction tree variational autoencoder (JTVAE), can be employed to predict LFS values. Our model exhibits a mean absolute error (MAE) of 0.047 and root mean squared error of 0.072 on the training set. The model also allows the generation of novel ligands with desirable LFS values.
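
    The spin-state labeling step described in this abstract can be sketched as a two-component Gaussian fit to a bimodal bond-length distribution. The sketch below is not the authors' procedure: it uses a plain expectation-maximization fit, hypothetical function names, and the assumption that the shorter-bond mode corresponds to low spin (LS) and the longer-bond mode to high spin (HS).

```python
import numpy as np


def _component_density(x, weight, mu, sigma):
    """Weighted Gaussian density of each point under each component, shape (n, 2)."""
    z = (np.asarray(x, dtype=float)[:, None] - mu) / sigma
    return weight * np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def fit_two_gaussians(x, n_iter=200):
    """Fit a two-component 1D Gaussian mixture by expectation-maximization."""
    x = np.asarray(x, dtype=float)
    mu = np.percentile(x, [25, 75])      # start one mean in each half of the data
    sigma = np.full(2, x.std() + 1e-9)   # broad initial widths
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each bond length.
        dens = _component_density(x, weight, mu, sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and widths from the responsibilities.
        nk = resp.sum(axis=0)
        weight = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
    return weight, mu, sigma


def assign_spin_labels(x, weight, mu, sigma):
    """Label each bond length by its most probable component.

    Assumption: the component with the shorter mean bond length is low spin.
    """
    component = _component_density(x, weight, mu, sigma).argmax(axis=1)
    return np.where(component == np.argmin(mu), "LS", "HS")
```

    In practice the fit would be run separately for each metal and oxidation state, since the position of the two modes depends on both.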

    A Quantitative Uncertainty Metric Controls Error in Neural Network-Driven Chemical Discovery

    Machine learning (ML) models, such as artificial neural networks, have emerged as a complement to high-throughput screening, enabling characterization of new compounds in seconds instead of hours. The promise of ML models to enable large-scale, chemical space exploration can only be realized if it is straightforward to identify when molecules and materials are outside the model's domain of applicability. Established uncertainty metrics for neural network models are either costly to obtain (e.g., ensemble models) or rely on feature engineering (e.g., feature space distances), and each has limitations in estimating prediction errors for chemical space exploration. We introduce the distance to available data in the latent space of a neural network ML model as a low-cost, quantitative uncertainty metric that works for both inorganic and organic chemistry. The calibrated performance of this approach exceeds widely used uncertainty metrics and is readily applied to models of increasing complexity at no additional cost. Tightening latent distance cutoffs systematically drives down predicted model errors below training errors, thus enabling predictive error control in chemical discovery or identification of useful data points for active learning.
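
    The idea of using distance to available data in latent space as an uncertainty metric can be sketched as follows. This is an illustration rather than the paper's exact definition: the Euclidean metric, the averaging over the k nearest training points, the default k, and the function names are all assumptions.

```python
import numpy as np


def latent_distance(train_latent, query_latent, k=5):
    """Mean Euclidean distance from each query point to its k nearest
    training points in latent space (hypothetical helper)."""
    train = np.asarray(train_latent, dtype=float)
    query = np.atleast_2d(np.asarray(query_latent, dtype=float))
    # Pairwise distances, shape (n_query, n_train).
    dist = np.linalg.norm(query[:, None, :] - train[None, :, :], axis=-1)
    k = min(k, train.shape[0])
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)


def within_domain(train_latent, query_latent, cutoff, k=5):
    """True where a query lies within the chosen latent-distance cutoff.

    Tightening the cutoff keeps fewer, more trustworthy predictions.
    """
    return latent_distance(train_latent, query_latent, k) <= cutoff
```

    Queries rejected by the cutoff are natural candidates for new reference calculations in an active learning loop.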

    A quantitative uncertainty metric controls error in neural network-driven chemical discovery

    This journal is © The Royal Society of Chemistry. Machine learning (ML) models, such as artificial neural networks, have emerged as a complement to high-throughput screening, enabling characterization of new compounds in seconds instead of hours. The promise of ML models to enable large-scale chemical space exploration can only be realized if it is straightforward to identify when molecules and materials are outside the model's domain of applicability. Established uncertainty metrics for neural network models are either costly to obtain (e.g., ensemble models) or rely on feature engineering (e.g., feature space distances), and each has limitations in estimating prediction errors for chemical space exploration. We introduce the distance to available data in the latent space of a neural network ML model as a low-cost, quantitative uncertainty metric that works for both inorganic and organic chemistry. The calibrated performance of this approach exceeds widely used uncertainty metrics and is readily applied to models of increasing complexity at no additional cost. Tightening latent distance cutoffs systematically drives down predicted model errors below training errors, thus enabling predictive error control in chemical discovery or identification of useful data points for active learning.

    A Joint Semi-Supervised Variational Autoencoder and Transfer Learning Model for Designing Molecular Transition Metal Complexes

    Deep generative models (DGMs) have shown great promise in the generation of organic molecules and inorganic materials with chemically sensible structures and optimized properties. However, they have rarely been applied to transition metal (TM) complexes because of the flexible coordination environments and multiple accessible oxidation and spin states of these complexes, despite their importance in fine chemical synthesis, commodity production, and optical applications. Herein, we propose a joint semi-supervised junction-tree variational autoencoder (SSVAE) and artificial neural network (ANN) classifier model, named LiveTransForM (Ligand variational auto-encoder and Transfer learning For transition Metal complexes), for the design of octahedral TM complexes. LiveTransForM allows the design of ligands that build up TM complexes and the prediction of the spin states of the assembled complexes. We show that the accuracy of the classifier is improved when the latent variables from the SSVAE are used as input for the ANN model compared to those from the unsupervised VAE. Input augmentation using the three molecular axes also improves the accuracy of the classifier. LiveTransForM is then used to generate 58 complexes with predicted spin states, and the accuracy of their spin state labels is validated by density functional theory methods. Two design strategies, single mutation and seeded generation, are also introduced to allow the directed evolution of a parent complex towards a desirable spin state and local modification of seed complexes with similar spin states, respectively.