What’s What: The (Nearly) Definitive Guide to Reaction Role Assignment
When analyzing chemical reactions it is essential to know which
molecules are actively involved in the reaction and which educts will
form the product molecules. Assigning reaction roles, like reactant,
reagent, or product, to the molecules of a chemical reaction might
be a trivial problem for hand-curated reaction schemes but it is more
difficult to automate, an essential step when handling large amounts
of reaction data. Here, we describe a new fingerprint-based and data-driven
approach to assign reaction roles which is also applicable to rather
unbalanced and noisy reaction schemes. Given a set of molecules involved
and knowing the product(s) of a reaction we assign the most probable
reactants and sort out the remaining reagents. Our approach was validated
using two different data sets: one hand-curated data set comprising
about 680 diverse reactions extracted from patents which span more
than 200 different reaction types and include up to 18 different reactants.
A second set consists of 50 000 randomly picked reactions from
US patents. The results of the second data set were compared to results
obtained using two different atom-to-atom mapping algorithms. For
both data sets our method assigns the reaction roles correctly for
the vast majority of the reactions, achieving an accuracy of 88% and
97% respectively. The median time needed, about 8 ms, indicates that
the algorithm is fast enough to be applied to large collections. The
new method is available as part of the RDKit toolkit and the data
sets and Jupyter notebooks used for evaluation of the new method are
available in the Supporting Information of this publication.
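
To make the idea of separating reactants from reagents concrete, here is a toy sketch. It is not the published method and not the RDKit implementation: it assumes each molecule is described by a hypothetical bag of feature counts (the labels such as "C=O" are made up for illustration) and greedily picks the candidates whose features explain the product. Everything left over is called a reagent.

```python
from collections import Counter


def assign_roles(candidates, product_fp):
    """Greedy toy heuristic: pick candidates whose features explain the product.

    candidates: dict mapping a molecule name to a Counter of feature counts.
    product_fp: Counter of feature counts for the known product(s).
    Returns (reactants, reagents) as lists of names.
    """
    remaining = Counter(product_fp)   # product features not yet explained
    unassigned = dict(candidates)
    reactants = []
    while unassigned:
        best_name, best_gain = None, 0
        for name in sorted(unassigned):   # sorted for deterministic ties
            # Gain = number of unexplained product features this candidate covers
            gain = sum((unassigned[name] & remaining).values())
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:             # nobody explains anything more
            break
        reactants.append(best_name)
        remaining -= unassigned.pop(best_name)   # keeps positive counts only
    return reactants, sorted(unassigned)


# Hypothetical amide coupling with one bystander solvent
candidates = {
    "acid": Counter({"C=O": 1, "C-C": 2, "C-OH": 1}),
    "amine": Counter({"C-N": 1, "C-C": 1}),
    "solvent": Counter({"C-Cl": 2}),
}
product = Counter({"C=O": 1, "C-C": 3, "C-N": 2})
reactants, reagents = assign_roles(candidates, product)
print(reactants, reagents)
```

A heuristic this simple is easily fooled, for example by a reagent that shares many features with the product, which is presumably why the approach described above is also data-driven.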
Get Your Atoms in Order: An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm
Finding a canonical ordering of the atoms in a molecule is a prerequisite
for generating a unique representation of the molecule. The canonicalization
of a molecule is usually accomplished by applying some sort of graph
relaxation algorithm, the most common of which is the Morgan algorithm.
There are known issues with that algorithm that lead to noncanonical
atom orderings as well as problems when it is applied to large molecules
like proteins. Furthermore, each cheminformatics toolkit or software
provides its own version of a canonical ordering, most based on unpublished
algorithms, which also complicates the generation of a universal unique
identifier for molecules. We present an alternative canonicalization
approach that uses a standard stable-sorting algorithm instead of
a Morgan-like index. Two new invariants that allow canonical ordering
of molecules with dependent chirality as well as those with highly
symmetrical cyclic graphs have been developed. The new approach proved
to be robust and fast when tested on the 1.45 million compounds of
the ChEMBL 20 data set in different scenarios like random renumbering
of input atoms or SMILES round tripping. Our new algorithm is able
to generate a canonical order of the atoms of protein molecules within
a few milliseconds. The novel algorithm is implemented in the open-source
cheminformatics toolkit RDKit. With this paper, we provide a reference
Python implementation of the algorithm that could easily be integrated
in any cheminformatics toolkit. This provides a first step toward
a common standard for canonical atom ordering to generate a universal
unique identifier for molecules other than InChI.
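
The general idea of ranking atoms by sorting can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the reference implementation from the paper or the RDKit code: atoms carry only an element label, the initial invariant is (label, degree), ranks are refined from sorted neighbor ranks, and remaining ties are broken by promoting the first tied atom. It ignores bond orders and chirality, and is not guaranteed to be canonical for highly symmetric ring systems, the cases for which the abstract says additional invariants were needed.

```python
from collections import Counter


def canonical_order(labels, bonds):
    """Return atom indices in a rank order derived only from graph structure."""
    n = len(labels)
    neighbors = [[] for _ in range(n)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def dense_ranks(keys):
        # Equal keys share a rank; ranks follow the sorted order of the keys
        position = {key: i for i, key in enumerate(sorted(set(keys)))}
        return [position[key] for key in keys]

    def refine(ranks):
        # Split rank classes using the sorted ranks of each atom's neighbors
        while True:
            keys = [(ranks[i], tuple(sorted(ranks[j] for j in neighbors[i])))
                    for i in range(n)]
            new_ranks = dense_ranks(keys)
            if new_ranks == ranks:
                return ranks
            ranks = new_ranks

    ranks = refine(dense_ranks([(labels[i], len(neighbors[i])) for i in range(n)]))
    while len(set(ranks)) < n:
        # Tie-break: promote the first atom of the lowest tied class, then refine
        counts = Counter(ranks)
        tied = min(r for r, c in counts.items() if c > 1)
        chosen = ranks.index(tied)
        ranks = [2 * r for r in ranks]
        ranks[chosen] -= 1
        ranks = refine(ranks)
    return sorted(range(n), key=lambda i: ranks[i])


def canonical_form(labels, bonds):
    """Relabel atoms by canonical order and return (labels, sorted bonds)."""
    order = canonical_order(labels, bonds)
    new_index = {old: new for new, old in enumerate(order)}
    atoms = [labels[i] for i in order]
    edges = sorted(tuple(sorted((new_index[a], new_index[b]))) for a, b in bonds)
    return atoms, edges


# Heavy atoms of 2-propanol, entered with two different atom numberings
form_a = canonical_form(["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)])
form_b = canonical_form(["O", "C", "C", "C"], [(0, 2), (1, 2), (2, 3)])
print(form_a == form_b)
```

Both numberings describe the same graph, so the two canonical forms should compare equal; this mirrors, at toy scale, the random-renumbering test mentioned in the abstract.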
Evidence of Water Molecules: A Statistical Evaluation of Water Molecules Based on Electron Density
Water molecules play important roles in many biological processes,
especially when mediating protein–ligand interactions. Dehydration
and the hydrophobic effect are of central importance for estimating
binding affinities. Due to the specific geometric characteristics
of hydrogen bond functions of water molecules, meaning two acceptor
and two donor functions in a tetrahedral arrangement, they have to
be modeled accurately. Despite many attempts in the past years, accurate
prediction of water molecules, structurally as well as energetically, remains
a grand challenge. One reason is certainly the lack of experimental
data, since energetic contributions of water molecules can only be
measured indirectly. However, on the structural side, the electron
density clearly shows the positions of stable water molecules. This
information has the potential to improve models on water structure
and energy in proteins and protein interfaces. On the basis of a high-resolution
subset of the Protein Data Bank, we have conducted an extensive statistical
analysis of 2.3 million water molecules, discriminating those water
molecules that are well resolved and those without much evidence of
electron density. In order to perform this classification, we introduce
a new measurement of electron density around an individual atom enabling
the automatic quantification of experimental support. On the basis
of this measurement, we present an analysis of water molecules with
a detailed profile of geometric and structural features. These data,
which are freely available, can be applied not only to modeling and
validation of new water models in structural biology but also to molecular
design.
Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach
Big data is one of the key transformative factors which increasingly influences
all aspects of modern life. Although this transformation brings vast
opportunities it also generates novel challenges, not the least of
which is organizing and searching this data deluge. The field of medicinal
chemistry is not different: more and more data are being generated,
for instance, by technologies such as DNA encoded libraries, peptide
libraries, text mining of large literature corpora, and new in silico
enumeration methods. Handling those huge sets of molecules effectively
is quite challenging and requires compromises that often come at the
expense of the interpretability of the results. In order to find an
intuitive and meaningful approach to organizing large molecular data
sets, we adopted a probabilistic framework called “topic modeling”
from the text-mining field. Here we present the first chemistry-related
implementation of this method, which allows large molecule sets to
be assigned to “chemical topics” and the relationships between
those topics to be investigated. In this first study, we thoroughly evaluate
this novel method in different experiments and discuss both its disadvantages
and advantages. We show very promising results in reproducing human-assigned
concepts using the approach to identify and retrieve chemical series
from sets of molecules. We have also created an intuitive visualization
of the chemical topics output by the algorithm. This is a huge benefit
compared to other unsupervised machine-learning methods, like clustering,
which are commonly used to group sets of molecules. Finally, we applied
the new method to the 1.6 million molecules of the ChEMBL22 data set
to test its robustness and efficiency. In about 1 h we built a 100-topic
model of this large data set in which we could identify interesting
topics like “proteins”, “DNA”, or “steroids”.
Along with this publication we provide our data sets and an open-source
implementation of the new method (CheTo) which will be part of an
upcoming version of the open-source cheminformatics toolkit RDKit.
Development of a Novel Fingerprint for Chemical Reactions and Its Application to Large-Scale Reaction Classification and Similarity
Fingerprint methods applied to molecules have proven to be useful for similarity determination and as inputs to machine-learning models. Here, we present the development of a new fingerprint for chemical reactions and validate its usefulness in building machine-learning models and in similarity assessment. Our final fingerprint is constructed as the difference of the atom-pair fingerprints of products and reactants and includes agents via calculated physicochemical properties. We validated the fingerprints on a large data set of reactions text-mined from granted United States patents from the last 40 years that have been classified using a substructure-based expert system. We applied machine learning to build a 50-class predictive model for reaction-type classification that correctly predicts 97% of the reactions in an external test set. Impressive accuracies were also observed when applying the classifier to reactions from an in-house electronic laboratory notebook. The performance of the novel fingerprint for assessing reaction similarity was evaluated by a cluster analysis that recovered 48 out of 50 of the reaction classes with a median F-score of 0.63 for the clusters. The data sets used for training and primary validation as well as all Python scripts required to reproduce the analysis are provided in the Supporting Information.
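
The "difference of fingerprints" construction can be illustrated with a minimal sketch. It assumes count-based fingerprints represented as Python Counters and uses made-up feature labels in place of real atom pairs; the agent descriptors mentioned in the abstract are left out. It is an illustration of the arithmetic only, not the authors' code.

```python
from collections import Counter


def difference_fingerprint(reactant_fps, product_fps):
    """Summed product feature counts minus summed reactant feature counts."""
    total = Counter()
    for fp in product_fps:
        total.update(fp)       # add product counts
    for fp in reactant_fps:
        total.subtract(fp)     # subtract reactant counts (may go negative)
    # Features that are unchanged by the reaction cancel out and are dropped
    return {feature: count for feature, count in total.items() if count != 0}


# Hypothetical esterification; the labels are illustrative, not real atom pairs
acid = Counter({"C-C": 1, "C=O": 1, "C-OH": 1})
alcohol = Counter({"C-OH": 1})
ester = Counter({"C-C": 1, "C=O": 1, "C-O-C": 1})

fp = difference_fingerprint([acid, alcohol], [ester])
print(fp)
```

Positive entries are features created by the reaction and negative entries are features consumed, so the unchanged scaffold does not contribute to the fingerprint.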
Additional file 1 of SIMPD: an algorithm for generating simulated time splits for validating machine learning approaches
Additional file 1. Additional tables and figures