4 research outputs found
SIRIUS: decomposing isotope patterns for metabolite identification
Motivation: High-resolution mass spectrometry (MS) is among the most widely used technologies in metabolomics. Metabolites participate in almost all cellular processes, but most metabolites still remain uncharacterized. Determination of the sum formula is a crucial step in the identification of an unknown metabolite, as it reduces its possible structures to a hopefully manageable set.
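The formula-determination step described above can be illustrated with a small brute-force sketch. This is not SIRIUS's algorithm; the element set (CHNO), search bounds and tolerance are arbitrary examples chosen for illustration:

```python
# Illustrative brute-force enumeration of CHNO sum formulas matching a
# measured monoisotopic mass within a ppm tolerance. Monoisotopic element
# masses are standard values; the search bounds are arbitrary examples.

MONO = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052, "O": 15.9949146221}

def candidate_formulas(mass, ppm=5.0, max_c=30, max_h=60, max_n=10, max_o=10):
    tol = mass * ppm * 1e-6
    hits = []
    for c in range(1, max_c + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                base = c * MONO["C"] + n * MONO["N"] + o * MONO["O"]
                if base > mass + tol:
                    break  # adding more oxygen only increases the mass
                # solve for the hydrogen count directly instead of looping
                h = round((mass - base) / MONO["H"])
                if 0 <= h <= max_h and abs(base + h * MONO["H"] - mass) <= tol:
                    name = "".join(f"{el}{cnt}" for el, cnt in
                                   (("C", c), ("H", h), ("N", n), ("O", o)) if cnt)
                    hits.append((name, base + h * MONO["H"]))
    return hits

# e.g. glucose C6H12O6, monoisotopic mass ~180.06339
print(candidate_formulas(180.06339))
```

Even with four elements and tight mass accuracy, several candidates can survive; this is why isotope-pattern information and chemical heuristics are needed to rank them.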
Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry
BACKGROUND: Structure elucidation of unknown small molecules by mass spectrometry is a challenge despite advances in instrumentation. The first crucial step is to obtain correct elemental compositions. In order to automatically constrain the thousands of possible candidate structures, rules need to be developed to select the most likely and chemically correct molecular formulas. RESULTS: An algorithm for filtering molecular formulas is derived from seven heuristic rules: (1) restrictions on the number of elements, (2) LEWIS and SENIOR chemical rules, (3) isotopic patterns, (4) hydrogen/carbon ratios, (5) element ratios of nitrogen, oxygen, phosphorus, and sulphur versus carbon, (6) element ratio probabilities and (7) presence of trimethylsilylated compounds. Formulas are ranked according to their isotopic patterns and subsequently constrained by presence in public chemical databases. The seven rules were developed on 68,237 existing molecular formulas and were validated in four experiments. First, 432,968 formulas covering five million PubChem database entries were checked for consistency; only 0.6% of these compounds did not pass all rules. Next, the rules were shown to effectively reduce the set of all eight billion theoretically possible C, H, N, S, O, P formulas up to 2000 Da to only the 623 million most probable elemental compositions. Third, 6,000 pharmaceutical, toxic and natural compounds were selected from the DrugBank, TSCA and DNP databases. The correct formulas were retrieved as the top hit at 80-99% probability when assuming data acquisition with complete resolution of unique compounds, 5% absolute isotope ratio deviation and 3 ppm mass accuracy. Last, some exemplary compounds were analyzed by Fourier transform ion cyclotron resonance mass spectrometry and by gas chromatography-time-of-flight mass spectrometry. In each case, the correct formula was ranked as the top hit when combining the seven rules with database queries.
CONCLUSION: The seven rules enable automatic exclusion of molecular formulas that are either wrong or that contain an improbably high or low number of elements. The correct molecular formula is assigned with a probability of 98% if the formula exists in a compound database. For truly novel compounds that are not present in databases, the correct formula is found within the first three hits with a probability of 65-81%. The corresponding software and supplemental data are available for download from the authors' website.
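A minimal sketch of heuristic filtering in the spirit of rules (1), (4) and (5) above might look as follows; the numeric windows are illustrative placeholders, not the published thresholds:

```python
# Hedged sketch of heuristic molecular-formula filtering, loosely in the
# spirit of rules (1), (4) and (5). The numeric ranges are illustrative
# placeholders and should not be taken as the published thresholds.

def passes_heuristics(counts):
    """counts: element counts, e.g. {'C': 6, 'H': 12, 'O': 6}."""
    c = counts.get("C", 0)
    if c == 0:
        return False  # rule (1)-style restriction: require carbon
    # rule (4)-style check: hydrogen/carbon ratio in a plausible window
    if not (0.2 <= counts.get("H", 0) / c <= 3.1):
        return False
    # rule (5)-style check: heteroatom/carbon ratios (illustrative limits)
    limits = {"N": 1.3, "O": 1.2, "P": 0.3, "S": 0.8}
    for el, lim in limits.items():
        if counts.get(el, 0) / c > lim:
            return False
    return True

print(passes_heuristics({"C": 6, "H": 12, "O": 6}))  # glucose-like -> True
print(passes_heuristics({"C": 1, "H": 40}))          # implausible H/C -> False
```

In practice such yes/no filters are only a first pass; the abstract's full procedure ranks the surviving formulas by isotopic-pattern fit and database presence.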
Probabilistic Modelling of Liquid Chromatography Time-of-Flight Mass Spectrometry
Liquid Chromatography Time-of-Flight Mass Spectrometry (LC-TOFMS) is an
analytical platform that is widely used in the study of biological mixtures in the
rapidly growing fields of proteomics and metabolomics. The development of
statistical methods for the analysis of the very large data-sets that are typically
produced in LC-TOFMS experiments is a very active area of research. However, the
theoretical basis on which these methods are built is currently rather thin and as a
result, inferences regarding the samples analysed are generally drawn in a somewhat
qualitative fashion.
This thesis concerns the development of a statistical formalism that can be used to
describe and analyse the data produced in an LC-TOFMS experiment. This is done
through the derivation of a number of probability distributions, each corresponding to
a different level of approximation of the distribution of the empirically obtained data.
Using such probabilistic models, statistically rigorous methods are developed and
validated which are designed to address some of the central problems encountered in
the practical analysis of LC-TOFMS data, most notably those related to the
identification of unknown metabolites.
Unlike most existing bioinformatics techniques, this work aims for rigour rather than
generality. Consequently the methods developed are closely tailored to a particular
type of TOF mass spectrometer, although they do carry over to other TOF
instruments, albeit with important restrictions. And while the algorithms presented
may constitute useful analytical tools for the mass spectrometers to which they can be
applied, the broader implications of the general methodological approach that is taken
are also of central importance. In particular, it is arguable that the main value of this
work lies in its role as a proof-of-concept that detailed probabilistic modelling of
TOFMS data is possible and can be used in practice to address important data
analytical problems in a statistically rigorous manner.
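As a toy illustration of the probabilistic approach described above (not the thesis's actual derivation), one can score a candidate peak position by the Poisson log-likelihood of the observed counts in each m/z bin, with expected counts given by an assumed Gaussian peak shape plus background:

```python
# Toy illustration (not the thesis's model): compare candidate peak centres
# by the Poisson log-likelihood of observed detector counts, with expected
# counts from a Gaussian peak shape plus a flat background. All numbers
# below are synthetic.

import math

def expected_counts(mz_bins, centre, intensity=100.0, sigma=0.01, background=1.0):
    return [background + intensity * math.exp(-0.5 * ((mz - centre) / sigma) ** 2)
            for mz in mz_bins]

def poisson_loglik(observed, expected):
    # log P(k | lam) = k*log(lam) - lam - log(k!)
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1)
               for k, lam in zip(observed, expected))

mz_bins = [100.00 + 0.005 * i for i in range(9)]   # 100.000 .. 100.040
observed = [1, 2, 8, 40, 95, 42, 9, 2, 1]          # synthetic ion counts
# a centre at the true peak should score higher than an offset one
ll_true = poisson_loglik(observed, expected_counts(mz_bins, 100.020))
ll_off = poisson_loglik(observed, expected_counts(mz_bins, 100.030))
print(ll_true > ll_off)  # True
```

The point of such explicit likelihoods is that competing hypotheses (peak present vs. absent, one centre vs. another) can be compared quantitatively rather than by ad hoc thresholds.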
Bioinformatics solutions for confident identification and targeted quantification of proteins using tandem mass spectrometry
Proteins are the structural supports, signal messengers and molecular workhorses that underpin living processes in every cell. Understanding when and where proteins are expressed, and their structure and functions, is the realm of proteomics. Mass spectrometry (MS) is a powerful method for identifying and quantifying proteins; however, it produces very large datasets, so researchers rely on computational approaches to transform raw data into protein information. This project develops new bioinformatics solutions to support the next generation of proteomic MS research. Part I introduces the state of the art in proteomic bioinformatics in industry and academia. The business history and funding mechanisms are examined to fill a notable gap in the management research literature, and to explain events at the sponsor, GlaxoSmithKline. It reveals that public funding of proteomic science has yet to come to fruition and that only high-tech niche bioinformatics businesses can succeed in the current climate. Next, a comprehensive review of repositories for proteomic MS is performed, both to locate and compile a summary of dataset sources for the research activities in this project and as a novel summary for the community. Part II addresses the issue of false positive protein identifications produced by automated analysis with a proteomics pipeline. The work shows that by selecting a suitable decoy database design, a statistically significant improvement in identification accuracy can be made. Part III describes the development of computational resources for selecting multiple reaction monitoring (MRM) assays for quantifying proteins using MS. A tool for transition design, MRMaid (pronounced 'mermaid'), and a database of pre-published transitions, MRMaid-DB, are developed, saving practitioners time and leveraging existing resources for superior transition selection.
By improving the quality of identifications, and providing support for quantitative approaches, this project brings the field a small step closer to achieving the goal of systems biology.
EThOS - Electronic Theses Online Service, United Kingdom
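The decoy-database idea behind Part II can be sketched in a few lines (the scores and peptide-spectrum matches below are synthetic): matches to decoy sequences above a score threshold estimate the number of false positives among the target matches, giving a false discovery rate.

```python
# Minimal target-decoy FDR sketch (synthetic scores, not real search data).
# Matches to decoy sequences above a score threshold estimate the number
# of false positives among target matches: FDR ~ decoys / targets.

def fdr_at_threshold(psms, threshold):
    """psms: list of (score, is_decoy) tuples for peptide-spectrum matches."""
    targets = sum(1 for s, d in psms if s >= threshold and not d)
    decoys = sum(1 for s, d in psms if s >= threshold and d)
    return decoys / targets if targets else 0.0

psms = [(52.1, False), (48.7, False), (45.0, True), (44.2, False),
        (40.9, False), (38.5, True), (30.1, False), (28.8, True)]
print(fdr_at_threshold(psms, 40.0))  # 1 decoy / 4 targets = 0.25
```

The abstract's point is that the *design* of the decoy database (e.g. how sequences are reversed or shuffled) affects how well this estimate tracks the true error rate, and choosing it well measurably improves identification accuracy.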