13 research outputs found
Cross-validation strategies in QSPR modelling of chemical reactions
In this article, we consider cross-validation of quantitative structure-property relationship models for reactions and show that the conventional k-fold cross-validation (CV) procedure gives an 'optimistically' biased assessment of prediction performance. To address this issue, we suggest two model cross-validation strategies: 'transformation-out' CV and 'solvent-out' CV. Unlike conventional k-fold cross-validation, which does not consider the nature of the objects, the proposed procedures provide an unbiased estimate of the predictive performance of the models for novel types of structural transformations in chemical reactions and for reactions proceeding under new conditions. Both strategies have been applied to predict the rate constants of bimolecular elimination and nucleophilic substitution reactions, and of Diels-Alder cycloaddition. All suggested cross-validation methodologies and a tutorial are implemented in the open-source software package CIMtools (https://github.com/cimm-kzn/CIMtools).
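The idea behind 'transformation-out' CV can be illustrated with a grouped split, in which all reactions sharing a transformation label are kept in the same fold. The following is a minimal standard-library sketch under stated assumptions: the function name `group_out_folds` and the toy labels are hypothetical, and this is not the CIMtools implementation.

```python
from collections import defaultdict


def group_out_folds(groups, n_folds):
    """Yield (train_idx, test_idx) pairs so that no group is split across folds.

    `groups` holds one label per reaction, e.g. a transformation type
    (or a solvent name for a 'solvent-out' style split).
    """
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)

    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for label in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[label])

    for i in range(n_folds):
        test = sorted(folds[i])
        held_out = set(test)
        train = [j for j in range(len(groups)) if j not in held_out]
        yield train, test


# Hypothetical labels: one transformation type per reaction.
transformations = ["SN2", "SN2", "E2", "E2", "E2", "DA", "DA", "SN2"]
splits = list(group_out_folds(transformations, n_folds=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Because every test fold contains only transformation types absent from its training set, the resulting score reflects performance on unseen transformations rather than on near-duplicates of training reactions. Passing solvent labels instead of transformation labels gives the analogous 'solvent-out' split.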
Comprehensive Analysis of Applicability Domains of QSPR Models for Chemical Reactions
The definition of a model's applicability domain (AD) is an active research topic in chemoinformatics. Although many AD definitions for models predicting properties of molecules (Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) models) have been described in the literature, none has been reported to date for chemical reactions (Quantitative Reaction-Property Relationships (QRPR)). A chemical reaction is a much more complex object than an individual molecule, and its yield and its thermodynamic and kinetic characteristics depend not only on the structures of reactants and products but also on experimental conditions. The performance of QRPR models largely depends on the way the chemical transformation is encoded. In this study, various AD definition methods extensively used in QSAR/QSPR studies of individual molecules, as well as several novel approaches suggested in this work for reactions, were benchmarked on several reaction datasets. Their ability to exclude wrong reaction types, increase coverage, improve model performance, and detect Y-outliers was tested. As a result, several "best" AD definitions for QRPR models predicting reaction characteristics have been identified and tested on a previously published external dataset with a clear AD definition problem.
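As a concrete illustration of what an AD definition does, one of the simplest definitions in the QSAR/QSPR literature is a bounding box over the descriptor ranges of the training set. The sketch below is illustrative only; the function names and the toy descriptor values are hypothetical, and it is not claimed to be one of the definitions selected in the study.

```python
def fit_bounding_box(train_X):
    """Return per-descriptor (min, max) ranges observed in the training set."""
    columns = zip(*train_X)
    return [(min(col), max(col)) for col in columns]


def in_domain(x, box):
    """True if every descriptor of x lies within the training range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(x, box))


# Hypothetical two-descriptor training set (one row per reaction).
train_X = [[0.1, 5.0], [0.4, 7.5], [0.3, 6.0]]
box = fit_bounding_box(train_X)

inside = in_domain([0.2, 6.5], box)   # both descriptors within range
outside = in_domain([0.9, 6.5], box)  # first descriptor exceeds the training maximum
print(inside, outside)
```

Predictions for objects flagged as outside the domain would be treated as unreliable, which is how an AD trades coverage against model performance.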
Discovery of Novel Chemical Reactions by Deep Generative Recurrent Neural Network
Here, we report an application of Artificial Intelligence techniques to generate novel chemical reactions of a given type. A sequence-to-sequence autoencoder was trained on the USPTO reaction database. Each reaction was converted into a single Condensed Graph of Reaction (CGR), which was then translated into purpose-built SMILES/CGR text strings. The autoencoder latent space was visualized on a two-dimensional generative topographic map, from which some zones populated by Suzuki coupling reactions were targeted. These served for the generation of novel reactions by sampling latent space points and decoding them to SMILES/CGR.
Automatized Assessment of Protective Group Reactivity: A Step Toward Big Reaction Data Analysis
We report a new method to assess the reactivity of protective groups (PGs) as a function of reaction conditions (catalyst, solvent) using raw reaction data. It is based on an intuitive similarity principle for chemical reactions: similar reactions proceed under similar conditions. Technically, reaction similarity can be assessed using the Condensed Graph of Reaction (CGR) approach, which represents an ensemble of reactants and products as a single molecular graph, i.e., as a pseudomolecule for which molecular descriptors or fingerprints can be calculated. CGR-based in-house tools were used to process data for 142,111 catalytic hydrogenation reactions extracted from the Reaxys database. Our results reveal some contradictions with the well-known Greene’s Reactivity Charts, which are based on manual expert analysis. Models developed in this study show high accuracy (ca. 90%) for predicting optimal experimental conditions of protective group deprotection.
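The similarity principle can be made concrete with a fingerprint comparison. The sketch below assumes that each reaction's CGR pseudomolecule has already been encoded as a set of "on" fingerprint bits and uses the Tanimoto coefficient as the similarity measure; the abstract does not state which measure was used, and the reaction names and bit sets are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def most_similar(query_fp, reference):
    """Return the name of the reference reaction closest to the query."""
    return max(reference, key=lambda name: tanimoto(query_fp, reference[name]))


# Hypothetical fingerprints of two reference reactions and one query.
reference = {
    "rxn_A": {1, 4, 7, 9},
    "rxn_B": {2, 3, 8},
}
query = {1, 4, 9, 12}

score_a = tanimoto(query, reference["rxn_A"])  # 3 shared bits out of 5 in the union
nearest = most_similar(query, reference)
print(score_a, nearest)
```

Under the similarity principle, the conditions recorded for the nearest reference reaction would be a first guess for the query reaction.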
Prediction of Optimal Conditions of Hydrogenation Reaction Using the Likelihood Ranking Approach
The selection of experimental conditions leading to a reasonable yield is an essential element of the automated development of a synthesis plan and the subsequent synthesis of the target compound. The classical QSPR approach, which requires a one-to-one correspondence between chemical structure and a target property, can be used for predicting optimal reaction conditions only on a limited scale, when only one condition component (e.g., catalyst or solvent) is considered. However, a particular reaction can proceed under several different conditions. In this paper, we describe the Likelihood Ranking Model, an artificial neural network that outputs a list of different conditions ranked according to their suitability for a given chemical transformation. Benchmarking calculations demonstrated that our model outperformed some popular approaches to the theoretical assessment of reaction conditions, such as k Nearest Neighbors and a recurrent artificial neural network predicting condition components (reagents, solvents, catalysts, and temperature). The ability of the Likelihood Ranking Model, trained on a dataset of hydrogenation reactions (~42,000 reactions) from the Reaxys® database, to propose conditions that lead to the desired product was validated experimentally on a set of three reactions with substantial selectivity issues.
Atom-to-atom Mapping: A Benchmarking Study of Popular Mapping Algorithms and Consensus Strategies
In this paper, we compare the most popular Atom-to-Atom Mapping (AAM) tools, which implement different AAM algorithms: ChemAxon [1], Indigo [2], RDTool [3], NameRXN (NextMove) [4], and RXNMapper [5]. The open-source RDTool program was optimized, and its modified version ("new RDTool") was considered together with several consensus mapping strategies. The Condensed Graph of Reaction approach was used to calculate chemical distances and to develop the "AAM Fixer" algorithm for automated correction of erroneous mapping. The benchmarking calculations were performed on a Golden dataset containing 1851 manually mapped and curated reactions. The best-performing program, RXNMapper, together with the AAM Fixer, was applied to map the USPTO database. The Golden dataset, the mapped USPTO database, and the optimized RDTool are available in the GitHub repository https://github.com/Laboratoire-de-Chemoinformatique.