The Valence State Combination Model: A Generic Framework for Handling Tautomers and Protonation States
The consistent handling of molecules is probably the most basic and important
requirement in the field of cheminformatics. Reliable
results can only be obtained if the underlying calculations are independent
of the specific way molecules are represented in the input data. However,
ensuring consistency is a complex task with many pitfalls, an important
one being the fact that the same molecule can be represented by different
valence bond structures. In order to achieve reliability, a cheminformatics
system needs to solve two fundamental problems. First, different choices
of valence bond structures must be identified as the same molecule.
Second, for each molecule all valence bond structures relevant to
the context must be taken into consideration. The latter is especially
important with regard to tautomers and protonation states, as these
have considerable influence on physicochemical properties of molecules.
We present a comprehensive method for the rapid and consistent generation
of reasonable tautomers and protonation states for molecules relevant
in the context of drug design. This method is based on a generic scheme,
the Valence State Combination Model, which has been designed for the
enumeration and scoring of valence bond structures in large data sets.
In order to ensure our method’s consistency, we have developed
procedures which can serve as a general validation scheme for similar
approaches. The analysis of both the average number of generated structures
and the associated runtimes shows that our method is perfectly suited
for typical cheminformatics applications. By comparison with frequently
used and curated public data sets, we demonstrate that the tautomers
and protonation states produced by our method are chemically reasonable.
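The idea of enumerating and scoring combinations of alternative states can be illustrated with a toy sketch. This is not the Valence State Combination Model itself: the site names, alternatives, charges, penalty values, and the `enumerate_states` helper are all hypothetical and chosen only to show how per-site alternatives multiply and how a score can prune them.

```python
from itertools import product

# Hypothetical per-site alternatives: (label, formal charge, penalty score).
# All values are illustrative assumptions, not taken from the paper.
SITES = {
    "carboxylic acid": [("COOH", 0, 1.0), ("COO-", -1, 0.0)],
    "amine": [("NH2", 0, 1.0), ("NH3+", 1, 0.0)],
}

def enumerate_states(sites, max_penalty=1.0):
    """Combine one alternative per site and keep low-penalty combinations."""
    names = list(sites)
    results = []
    for combo in product(*(sites[name] for name in names)):
        penalty = sum(alt[2] for alt in combo)
        if penalty <= max_penalty:
            labels = tuple(alt[0] for alt in combo)
            charge = sum(alt[1] for alt in combo)
            results.append((penalty, labels, charge))
    # Lowest penalty first; ties are broken by label order.
    return sorted(results)

states = enumerate_states(SITES)
for penalty, labels, charge in states:
    print(penalty, labels, charge)
```

With two sites of two alternatives each there are four combinations; the penalty cutoff discards the one in which both sites are in their less favourable state. The number of combinations grows multiplicatively with the number of sites, which is why the number of generated structures and the runtime are worth measuring.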
Unique Ring Families: A Chemically Meaningful Description of Molecular Ring Topologies
The perception of a set of rings forms the basis for
a number of chemoinformatics applications, e.g. the systematic naming
of compounds, the calculation of molecular descriptors, the matching
of SMARTS expressions, and the generation of atomic coordinates. We
introduce the concept of unique ring families (URFs) as an extension
of the concept of relevant cycles (RCs). URFs are consistent
for different atom orders and represent an intuitive description of
the rings of a molecular graph. Furthermore, in contrast to RCs, URFs
are polynomial in number. We provide an algorithm to efficiently calculate
URFs in polynomial time and demonstrate their suitability for real-time
applications by providing computing time benchmarks for the PubChem
Database. For the first time, URFs combine three important
properties of chemical ring descriptions, namely
being unique, chemically meaningful, and efficient to compute. Therefore,
URFs are a valuable alternative to the commonly used concept of the
smallest set of smallest rings (SSSR) and would be well suited to become
the standard measure for ring topologies of small molecules.
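As background to the SSSR mentioned above: the number of rings in an SSSR equals the cyclomatic number of the molecular graph (bonds minus atoms plus connected components). The following minimal sketch computes only that count; it does not compute URFs or RCs, and the function name and the example graphs are assumptions made for illustration.

```python
def cyclomatic_number(num_atoms, bonds):
    """Return bonds - atoms + connected components for a molecular graph."""
    parent = list(range(num_atoms))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = num_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - num_atoms + components

# Heavy-atom graphs (hydrogens omitted), atoms indexed from 0.
benzene = [(i, (i + 1) % 6) for i in range(6)]
# Ten-membered perimeter plus one fusion bond gives two fused six-membered rings.
naphthalene = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5)]

print(cyclomatic_number(6, benzene))
print(cyclomatic_number(10, naphthalene))
```

The count is independent of atom order, but the choice of which rings make up an SSSR generally is not, which is the kind of ambiguity a unique ring description is meant to avoid.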
Reading PDB: Perception of Molecules from 3D Atomic Coordinates
The analysis of small molecule crystal structures is a common way
to gather valuable information for drug development. The necessary
structural data is usually provided in specific file formats containing
only element identities and three-dimensional atomic coordinates as
reliable chemical information. Consequently, the automated perception
of molecular structures from atomic coordinates has become a standard
task in cheminformatics. The molecules generated by such methods must
be both chemically valid and reasonable to provide a reliable basis
for subsequent calculations. This can be a difficult task since the
provided coordinates may deviate from ideal molecular geometries due
to experimental uncertainties or low resolution. Additionally, the
quality of the input data often differs significantly, thus making
it difficult to distinguish between actual structural features and
mere geometric distortions. We present a method for the generation
of molecular structures from atomic coordinates based on the recently
published NAOMI model. By making use of this consistent chemical description,
our method is able to generate reliable results even with input data
of low quality. Molecules from 363 Protein Data Bank (PDB) entries
could be perceived with a success rate of 98%, a result which could
not be achieved with previously described methods. The robustness
of our approach has been assessed by processing all small molecules
from the PDB and comparing them to reference structures. The complete
data set can be processed in less than 3 min, thus showing that our
approach is suitable for large-scale applications.
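To make the starting point of such perception methods concrete, here is a minimal sketch of a common baseline: deciding connectivity from element identities and interatomic distances alone. It is not the NAOMI-based method described above. The covalent radii are approximate literature values for a small subset of elements, and the tolerance, the `perceive_bonds` helper, and the water coordinates are assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist

# Approximate single-bond covalent radii in angstroms (illustrative subset).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.45  # assumed slack in angstroms to absorb coordinate errors

def perceive_bonds(atoms):
    """atoms: list of (element, (x, y, z)). Returns index pairs judged bonded."""
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + TOLERANCE
        if dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds

# Idealized water geometry, coordinates in angstroms.
water = [
    ("O", (0.0, 0.0, 0.0)),
    ("H", (0.957, 0.0, 0.0)),
    ("H", (-0.240, 0.927, 0.0)),
]
bonds = perceive_bonds(water)
print(bonds)
```

A distance cutoff yields connectivity only. Bond orders, charges, and aromaticity still have to be assigned afterwards, and distorted geometries can push distances across the cutoff, which is where a consistent chemical model becomes important.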
Fast Protein Binding Site Comparison via an Index-Based Screening Technology
We present TrixP, a new index-based method for fast protein
binding site comparison and function prediction. TrixP determines
binding site similarities based on the comparison of descriptors that
encode pharmacophoric and spatial features. To this end, it adopts the
efficient core components of TrixX, a structure-based virtual screening
technology for large compound libraries. TrixP extends this technology
with new components to enable the screening of protein libraries.
TrixP accounts for the inherent flexibility of proteins by employing
a partial shape matching routine. After the identification of structures
with matching pharmacophoric features and geometric shape, TrixP superimposes
the binding sites and, finally, assesses their similarity according
to the fit of pharmacophoric properties. TrixP is able to find analogies
between closely and distantly related binding sites. Recovery rates
of 81.8% for similar binding site pairs, accompanied by rejection rates
of 99.5% for dissimilar pairs on a test data set containing 1331 pairs,
confirm this ability. TrixP exclusively identifies members of the
same protein family at the top-ranking positions in a library consisting
of 9802 binding sites. Furthermore, 30 predicted kinase binding sites
can be classified almost perfectly into their known subfamilies.
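The recovery and rejection rates quoted above can be read as per-class accuracies over labelled pairs. The sketch below assumes that the recovery rate is the fraction of similar pairs recognized as similar and the rejection rate is the fraction of dissimilar pairs correctly rejected; the abstract does not spell out these definitions. The similarity scores, the threshold, and the function name are hypothetical and are not TrixP output.

```python
def recovery_and_rejection(pairs, threshold):
    """pairs: iterable of (score, is_similar). Returns (recovery, rejection)."""
    tp = fn = tn = fp = 0
    for score, is_similar in pairs:
        predicted_similar = score >= threshold
        if is_similar:
            tp += predicted_similar
            fn += not predicted_similar
        else:
            fp += predicted_similar
            tn += not predicted_similar
    recovery = tp / (tp + fn) if tp + fn else 0.0
    rejection = tn / (tn + fp) if tn + fp else 0.0
    return recovery, rejection

# Toy data: (hypothetical similarity score, known label).
toy_pairs = [
    (0.9, True), (0.8, True), (0.4, True),
    (0.7, False), (0.2, False), (0.1, False), (0.3, False),
]
recovery, rejection = recovery_and_rejection(toy_pairs, threshold=0.5)
print(recovery, rejection)
```

Reporting both rates together matters because either one can be made arbitrarily high on its own by moving the threshold.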