22 research outputs found
Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation
Small organic molecules are often flexible, i.e., they can adopt
a variety of low-energy conformations in solution that exist in equilibrium
with each other. Two main search strategies are used to generate representative
conformational ensembles for molecules: systematic and stochastic.
In the first approach, each rotatable bond is sampled systematically
in discrete intervals, limiting its use to molecules with a small
number of rotatable bonds. Stochastic methods, on the other hand,
sample the conformational space of a molecule randomly and can thus
be applied to more flexible molecules. Different methods employ different
degrees of experimental data for conformer generation. So-called knowledge-based
methods use predefined libraries of torsional angles and ring conformations.
In the distance geometry approach, on the other hand, a smaller amount
of empirical information is used, i.e., ideal bond lengths, ideal
bond angles, and a few ideal torsional angles. Distance geometry is
a computationally fast method to generate conformers, but it has the
downside that purely distance-based constraints tend to lead to distorted
aromatic rings and sp² centers. To correct this, the resulting
conformations are often minimized with a force field, adding computational
complexity and run time. Here we present an alternative strategy that
combines the distance geometry approach with experimental torsion-angle
preferences obtained from small-molecule crystallographic data. The
torsional angles are described by a previously developed set of hierarchically
structured SMARTS patterns. The new approach is implemented in the
open-source cheminformatics library RDKit, and its performance is
assessed by comparing the diversity of the generated ensemble and
the ability to reproduce crystal conformations taken from the crystal
structures of small molecules and protein–ligand complexes.
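
The contrast between systematic and stochastic search described above can be illustrated with a minimal sketch. It is not the RDKit implementation and does not perform distance geometry; it only treats a conformer as a tuple of torsion angles, which is a simplifying assumption. The function names (`systematic_torsions`, `stochastic_torsions`, `grid_size`) are hypothetical and chosen for illustration.

```python
import itertools
import random


def systematic_torsions(n_rotatable, step_deg):
    """Enumerate every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return itertools.product(angles, repeat=n_rotatable)


def grid_size(n_rotatable, step_deg):
    """Number of grid points a systematic search would have to visit."""
    return len(range(0, 360, step_deg)) ** n_rotatable


def stochastic_torsions(n_rotatable, n_samples, seed=0):
    """Draw a fixed number of random torsion-angle tuples (illustrative only)."""
    rng = random.Random(seed)
    return [
        tuple(rng.uniform(0.0, 360.0) for _ in range(n_rotatable))
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    # The grid grows exponentially with the number of rotatable bonds ...
    for n in (2, 5, 10):
        print(n, "rotatable bonds:", grid_size(n, 30), "grid points")
    # ... whereas the cost of random sampling is set by the sample count.
    print(len(stochastic_torsions(10, 100)), "random samples for 10 bonds")
```

With a 30° step, the grid has 12 values per bond, so the number of combinations grows as 12 to the power of the number of rotatable bonds. This is why systematic sampling is limited to molecules with few rotatable bonds, while the cost of the stochastic variant is fixed by the chosen number of samples.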
Combining IC50 or Ki Values from Different Sources Is a Source of Significant Noise
As part of the ongoing quest to find or construct large data sets
for use in validating new machine learning (ML) approaches for bioactivity
prediction, it has become distressingly common for researchers to
combine literature IC50 data generated using different
assays into a single data set. It is well-known that there are many
situations where this is a scientifically risky thing to do, even
when the assays are against exactly the same target, but the risks
of assays being incompatible are even higher when pulling data from
large collections of literature data like ChEMBL. Here, we estimate
the amount of noise present in combined data sets using cases where
measurements for the same compound are reported in multiple assays
against the same target. This approach shows that IC50 assays
selected using minimal curation settings have poor agreement with
each other: almost 65% of the points differ by more than 0.3 log units,
27% differ by more than one log unit, and the correlation between
the assays, as measured by Kendall's τ, is only 0.51.
Requiring that most of the assay metadata in ChEMBL matches ("maximal
curation") in order to combine two assays improves the situation
(48% of the points differ by more than 0.3 log units, 13% by more
than one log unit, and Kendall's τ is 0.71) at the expense
of having smaller data sets. Surprisingly, our analysis shows similar
amounts of noise when combining data from different literature Ki assays. We suggest that good
scientific practice requires careful curation when combining data
sets from different assays and hope that our maximal curation strategy
will help to improve the quality of the data that are being used to
build and validate ML models for bioactivity prediction. To help achieve
this, the code and ChEMBL queries that we used for the maximal curation
approach are available as open-source software in our GitHub repository, https://github.com/rinikerlab/overlapping_assays
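
The agreement statistics quoted above (the fraction of points differing by more than a threshold in log units, and Kendall's τ) can be computed along the following lines. This is a minimal sketch, not the code from the linked repository: the paired values are made up for illustration, and the τ shown is the simple tau-a variant, which is an assumption since the abstract does not state which variant was used.

```python
from itertools import combinations


def fraction_exceeding(pairs, threshold):
    """Fraction of value pairs whose absolute difference exceeds threshold."""
    return sum(abs(a - b) > threshold for a, b in pairs) / len(pairs)


def kendall_tau_a(pairs):
    """Kendall's tau-a: (concordant - discordant) / number of point pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pairs) * (len(pairs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical pIC50 values for five compounds, each measured in two
# assays against the same target (values invented for illustration).
paired_pic50 = [(6.0, 6.2), (7.5, 7.0), (5.1, 6.4), (8.0, 8.1), (6.8, 5.5)]

if __name__ == "__main__":
    print("fraction > 0.3 log units:", fraction_exceeding(paired_pic50, 0.3))
    print("fraction > 1.0 log units:", fraction_exceeding(paired_pic50, 1.0))
    print("Kendall tau-a:", kendall_tau_a(paired_pic50))
```

For real data sets with many ties, a tie-corrected variant such as tau-b would probably be more appropriate.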
What's What: The (Nearly) Definitive Guide to Reaction Role Assignment
When analyzing chemical reactions it is essential to know which
molecules are actively involved in the reaction and which educts will
form the product molecules. Assigning reaction roles, like reactant,
reagent, or product, to the molecules of a chemical reaction might
be a trivial problem for hand-curated reaction schemes but it is more
difficult to automate, an essential step when handling large amounts
of reaction data. Here, we describe a new fingerprint-based and data-driven
approach to assign reaction roles which is also applicable to rather
unbalanced and noisy reaction schemes. Given a set of molecules involved
and knowing the product(s) of a reaction we assign the most probable
reactants and sort out the remaining reagents. Our approach was validated
using two different data sets: one hand-curated data set comprising
about 680 diverse reactions extracted from patents which span more
than 200 different reaction types and include up to 18 different reactants.
A second set consists of 50 000 randomly picked reactions from
US patents. The results of the second data set were compared to results
obtained using two different atom-to-atom mapping algorithms. For
both data sets our method assigns the reaction roles correctly for
the vast majority of the reactions, achieving an accuracy of 88% and
97% respectively. The median time needed, about 8 ms, indicates that
the algorithm is fast enough to be applied to large collections. The
new method is available as part of the RDKit toolkit and the data
sets and Jupyter notebooks used for evaluation of the new method are
available in the Supporting Information of this publication.
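
The general idea of choosing, from all molecules in a reaction, those that best account for the product can be sketched with a toy example. This is explicitly not the published algorithm or the RDKit implementation: element counts stand in for real fingerprints, the greedy selection is an assumption made for illustration, and `mismatch` and `assign_roles` are hypothetical names.

```python
from collections import Counter


def mismatch(product_fp, combined_fp):
    """Sum of absolute count differences between two count 'fingerprints'."""
    keys = set(product_fp) | set(combined_fp)
    return sum(abs(product_fp[k] - combined_fp[k]) for k in keys)


def assign_roles(candidates, product_fp):
    """Greedily pick candidates that reduce the mismatch with the product.

    Returns (reactants, reagents); anything not picked is labeled a reagent.
    """
    reactants, combined = [], Counter()
    remaining = dict(candidates)
    while remaining:
        best = min(
            remaining,
            key=lambda name: mismatch(product_fp, combined + remaining[name]),
        )
        if mismatch(product_fp, combined + remaining[best]) >= mismatch(
            product_fp, combined
        ):
            break  # no remaining candidate improves the match
        combined = combined + remaining.pop(best)
        reactants.append(best)
    return reactants, sorted(remaining)


# Toy amide formation; element counts (heavy atoms only) are a crude
# stand-in for fingerprints and are used here purely for illustration.
candidates = {
    "acetic acid": Counter({"C": 2, "O": 2}),
    "methylamine": Counter({"C": 1, "N": 1}),
    "triethylamine": Counter({"C": 6, "N": 1}),
    "dichloromethane": Counter({"C": 1, "Cl": 2}),
}
product = Counter({"C": 3, "N": 1, "O": 1})  # N-methylacetamide

if __name__ == "__main__":
    reactants, reagents = assign_roles(candidates, product)
    print("reactants:", reactants)
    print("reagents:", reagents)
```

In this toy case the acid and the amine are selected as reactants, while the base and the solvent are left over as reagents. The leftover oxygen from the acid shows how such a scheme can tolerate unbalanced reactions.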
Heterogeneous Classifier Fusion for Ligand-Based Virtual Screening: Or, How Decision Making by Committee Can Be a Good Thing
The concept of data fusion - the combination of information from different
sources describing the same object with the expectation to generate
a more accurate representation - has found application in a very broad
range of disciplines. In the context of ligand-based virtual screening
(VS), data fusion has been applied to combine knowledge from either
different active molecules or different fingerprints to improve similarity
search performance. Machine-learning (ML) methods based on fusion
of multiple homogeneous classifiers, in particular random forests,
have also been widely applied in the ML literature. The heterogeneous
version of classifier fusion - fusing the predictions from different
model types - has been less explored. Here, we investigate heterogeneous
classifier fusion for ligand-based VS using three different ML methods,
random forest (RF), naïve Bayes (NB), and logistic regression (LR), with
four 2D fingerprints, atom pairs, topological torsions, RDKit fingerprint,
and circular fingerprint. The methods are compared using a previously
developed benchmarking platform for 2D fingerprints which is extended
to ML methods in this article. The original data sets are filtered
for difficulty, and a new set of challenging data sets from ChEMBL
is added. Data sets were also generated for a second use case: starting
from a small set of related actives instead of diverse actives. The
final fused model consistently outperforms the other approaches across
the broad variety of targets studied, indicating that heterogeneous
classifier fusion is a very promising approach for ligand-based VS.
The new data sets together with the adapted source code for ML methods
are provided in the Supporting Information.
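
One simple way to fuse predictions from different model types is to average the rank each model assigns to a molecule. The sketch below assumes this mean-rank rule purely for illustration; the abstract does not specify the fusion rule, and the scores and molecule names are invented.

```python
def ranks(scores):
    """Rank molecules by descending score (1 = most likely active)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {mol: position + 1 for position, mol in enumerate(order)}


def fuse_by_mean_rank(model_scores):
    """Return molecules ordered by their mean rank across all models."""
    per_model = [ranks(scores) for scores in model_scores.values()]
    molecules = per_model[0].keys()
    fused = {
        mol: sum(r[mol] for r in per_model) / len(per_model)
        for mol in molecules
    }
    return sorted(fused, key=fused.get)


# Hypothetical predicted activity probabilities from three model types.
model_scores = {
    "RF": {"mol_a": 0.9, "mol_b": 0.4, "mol_c": 0.6, "mol_d": 0.1},
    "NB": {"mol_a": 0.7, "mol_b": 0.8, "mol_c": 0.3, "mol_d": 0.2},
    "LR": {"mol_a": 0.6, "mol_b": 0.5, "mol_c": 0.9, "mol_d": 0.1},
}

if __name__ == "__main__":
    print(fuse_by_mean_rank(model_scores))
```

Working on ranks rather than raw scores avoids having to calibrate the outputs of the different model types against each other. Each model ranks a different molecule first here, but the fused list places `mol_a` at the top because it ranks highly in all three.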