Chemical Structure Identification in Metabolomics: Computational Modeling of Experimental Features
The identification of compounds in complex mixtures remains challenging despite recent advances in analytical techniques. At present, no single method can detect and quantify the vast array of compounds that might be of interest in metabolomics studies. High performance liquid chromatography/mass spectrometry (HPLC/MS) is often considered the analytical method of choice for the analysis of biofluids. The positive identification of an unknown involves matching at least two orthogonal HPLC/MS measurements (exact mass, retention index, drift time, etc.) against an authentic standard. However, because authentic standards are of limited availability, an alternative approach involves matching known and measured features of the unknown compound with computationally predicted features for a set of candidate compounds downloaded from a chemical database. Computationally predicted features include retention index, ECOM50 (the energy required to decompose 50% of a selected precursor ion in a collision induced dissociation cell), drift time, whether the unknown compound is biological or synthetic, and a collision induced dissociation (CID) spectrum. Computational predictions are used to filter the initial “bin” of candidate compounds. The final output is a ranked list of candidates that best match the known and measured features. In this mini review, we discuss cheminformatics methods underlying this database search-filter identification approach.
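The search-filter workflow described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Candidate` fields, the tolerance values, the candidate names, and the choice to rank by retention-index error alone are all assumptions made for illustration, and the predicted values are made-up numbers rather than real predictions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one database candidate."""
    name: str
    exact_mass: float           # monoisotopic mass from the database (Da)
    predicted_ri: float         # computationally predicted retention index
    predicted_biological: bool  # output of a biological/synthetic classifier


def ppm_error(measured: float, theoretical: float) -> float:
    """Absolute mass error in parts per million."""
    return abs(measured - theoretical) / theoretical * 1e6


def filter_and_rank(candidates, measured_mass, measured_ri,
                    mass_tol_ppm=10.0, ri_window=50.0):
    """Drop candidates that fail any filter, then rank the survivors.

    The tolerances are illustrative defaults, not values from the review.
    """
    kept = []
    for c in candidates:
        if ppm_error(measured_mass, c.exact_mass) > mass_tol_ppm:
            continue  # exact mass does not match the measurement
        if abs(measured_ri - c.predicted_ri) > ri_window:
            continue  # predicted retention index is too far off
        if not c.predicted_biological:
            continue  # classified as synthetic
        kept.append(c)
    # Rank by closeness of predicted to measured retention index.
    return sorted(kept, key=lambda c: abs(measured_ri - c.predicted_ri))


# Made-up candidate "bin" for a single unknown.
candidate_bin = [
    Candidate("cand_a", 180.0634, 510.0, True),
    Candidate("cand_b", 180.0634, 495.0, True),
    Candidate("cand_c", 180.0634, 700.0, True),   # fails the RI filter
    Candidate("cand_d", 181.0700, 500.0, True),   # fails the mass filter
    Candidate("cand_e", 180.0634, 500.0, False),  # fails the biological filter
]

ranked = filter_and_rank(candidate_bin, measured_mass=180.0634, measured_ri=500.0)
for c in ranked:
    print(c.name)
```

A fuller version would add filters for the other predicted features named above (ECOM50, drift time, CID spectrum) and combine them into a single ranking score; how those are weighted is not specified here.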
BioSM: Metabolomics Tool for Identifying Endogenous Mammalian Biochemical Structures in Chemical Structure Space
The structural identification of unknown biochemical compounds
in complex biofluids continues to be a major challenge in metabolomics
research. Using LC/MS, there are currently two major options for solving
this problem: searching small biochemical databases, which often do
not contain the unknown of interest or searching large chemical databases
which include large numbers of nonbiochemical compounds. Searching
larger chemical databases (larger chemical space) increases the odds
of identifying an unknown biochemical compound, but only if nonbiochemical
structures can be eliminated from consideration. In this paper we
present BioSM, a cheminformatics tool that uses known endogenous mammalian
biochemical compounds (as scaffolds) and graph matching methods to
identify endogenous mammalian biochemical structures in chemical structure
space. The results of a comprehensive set of empirical experiments
suggest that BioSM identifies endogenous mammalian biochemical structures
with high accuracy. In a leave-one-out cross validation experiment,
BioSM correctly predicted 95% of 1388 Kyoto Encyclopedia of Genes
and Genomes (KEGG) compounds as endogenous mammalian biochemicals
using 1565 scaffolds. Analysis of two additional biological data sets
containing 2330 human metabolites (HMDB) and 2416 plant secondary
metabolites (KEGG) resulted in biochemical annotations of 89% and
72% of the compounds, respectively. When a data set of 3895 drugs
(DrugBank and USAN) was tested, 48% of these structures were predicted
to be biochemical. However, when a set of synthetic chemical compounds
(Chembridge and Chemsynthesis databases) was examined, only 29% of
the 458 207 structures were predicted to be biochemical. Moreover,
BioSM predicted that 34% of 883 199 randomly selected compounds
from PubChem were biochemical. We then expanded the scaffold list
to 3927 biochemical compounds and reevaluated the above data sets
to determine whether scaffold number influenced model performance.
Although there were significant improvements in model sensitivity
and specificity using the larger scaffold list, the data set comparison
results were very similar. These results suggest that additional biochemical
scaffolds will not further improve our representation of biochemical
structure space and that the model is reasonably robust. BioSM provides
a qualitative (yes/no) and quantitative (ranking) method for endogenous
mammalian biochemical annotation of chemical space and, thus, will
be useful in the identification of unknown biochemical structures
in metabolomics. BioSM is freely available at http://metabolomics.pharm.uconn.edu
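The leave-one-out experiment described in this abstract can be sketched in Python as follows. This is a toy stand-in, not BioSM itself: BioSM uses graph matching on chemical structures, whereas this sketch represents each compound as a set of made-up fragment labels and substitutes a Jaccard-similarity threshold for the graph match. The function names, the labels, and the 0.5 threshold are all hypothetical.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two fragment-label sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_biochemical(query, scaffolds, threshold=0.5):
    """Stand-in for scaffold matching: True if any scaffold is similar enough."""
    return any(jaccard(query, s) >= threshold for s in scaffolds)


def leave_one_out_recall(compounds, threshold=0.5):
    """Hold out each compound in turn and test it against the remaining ones."""
    hits = 0
    for i, query in enumerate(compounds):
        scaffolds = compounds[:i] + compounds[i + 1:]  # exclude the query itself
        if is_biochemical(query, scaffolds, threshold):
            hits += 1
    return hits / len(compounds)


# Made-up "known biochemical" set; the labels are illustrative only.
compounds = [
    frozenset({"OH", "COOH", "ring6"}),
    frozenset({"OH", "COOH", "ring6", "NH2"}),
    frozenset({"COOH", "NH2"}),
    frozenset({"Cl", "Br", "CF3"}),
]

recall = leave_one_out_recall(compounds)
print(f"fraction recovered: {recall:.2f}")
```

Holding the query out of the scaffold list matters: without it, every compound would trivially match itself and the reported fraction would be inflated.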
In Silico Enzymatic Synthesis of a 400 000 Compound Biochemical Database for Nontargeted Metabolomics
Current methods of structure identification in mass-spectrometry-based
nontargeted metabolomics rely on matching experimentally determined
features of an unknown compound to those of candidate compounds contained
in biochemical databases. A major limitation of this approach is the
relatively small number of compounds currently included in these databases.
If the correct structure is not present in a database, it cannot be
identified, and if it cannot be identified, it cannot be included
in a database. Thus, there is an urgent need to augment metabolomics
databases with rationally designed biochemical structures using alternative
means. Here we present the In Vivo/In Silico Metabolites Database
(IIMDB), a database of in silico enzymatically synthesized metabolites,
to partially address this problem. The database, which is available
at http://metabolomics.pharm.uconn.edu/iimdb/, includes ∼23 000 known compounds (mammalian metabolites,
drugs, secondary plant metabolites, and glycerophospholipids) collected
from existing biochemical databases plus more than 400 000
computationally generated human phase-I and phase-II metabolites of
these known compounds. IIMDB features a user-friendly web interface
and a programmer-friendly RESTful web service. Ninety-five percent
of the computationally generated metabolites in IIMDB were not found
in any existing database. However, 21 640 were identical to
compounds already listed in PubChem, HMDB, KEGG, or HumanCyc. Furthermore,
the vast majority of these in silico metabolites were scored as biological
using BioSM, a software program that identifies biochemical structures
in chemical structure space. These results suggest that in silico
biochemical synthesis represents a viable approach for significantly
augmenting biochemical databases for nontargeted metabolomics applications.