4 research outputs found

    SIRIUS: decomposing isotope patterns for metabolite identification

    Motivation: High-resolution mass spectrometry (MS) is among the most widely used technologies in metabolomics. Metabolites participate in almost all cellular processes, but most metabolites still remain uncharacterized. Determination of the sum formula is a crucial step in the identification of an unknown metabolite, as it reduces its possible structures to a hopefully manageable set.
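    As a rough illustration of the sum-formula determination step described above (not the algorithm used by SIRIUS, which decomposes the full isotope pattern far more efficiently), the sketch below brute-forces the CHNOPS compositions whose monoisotopic mass lies within a ppm tolerance of a measured mass; the element bounds and tolerance are arbitrary assumptions chosen for the example.

```python
# Illustrative brute-force sketch, not the SIRIUS decomposition algorithm:
# enumerate CHNOPS formulas whose monoisotopic mass falls within a ppm
# tolerance of a measured mass. Element masses are standard monoisotopic
# values; the bounds and tolerance are assumptions for this example only.
from itertools import product

MONOISOTOPIC = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052,
                "O": 15.9949146221, "P": 30.97376151, "S": 31.97207069}

def decompose(mass, ppm=5.0, max_counts=None):
    """Return (formula, error_ppm) pairs whose mass lies within the tolerance."""
    bounds = max_counts or {"C": 20, "H": 40, "N": 5, "O": 10, "P": 2, "S": 2}
    tol = mass * ppm * 1e-6
    elements = list(MONOISOTOPIC)
    hits = []
    for counts in product(*(range(bounds[e] + 1) for e in elements)):
        m = sum(n * MONOISOTOPIC[e] for e, n in zip(elements, counts))
        if abs(m - mass) <= tol:
            formula = "".join(f"{e}{n}" for e, n in zip(elements, counts) if n)
            hits.append((formula, (m - mass) / mass * 1e6))
    return sorted(hits, key=lambda x: abs(x[1]))

# Example: glucose, C6H12O6, monoisotopic mass ~180.0634 Da
print(decompose(180.06339, ppm=5.0))
```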

    Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry

    BACKGROUND: Structure elucidation of unknown small molecules by mass spectrometry is a challenge despite advances in instrumentation. The first crucial step is to obtain correct elemental compositions. In order to automatically constrain the thousands of possible candidate structures, rules need to be developed to select the most likely and chemically correct molecular formulas. RESULTS: An algorithm for filtering molecular formulas is derived from seven heuristic rules: (1) restrictions on the number of elements, (2) LEWIS and SENIOR chemical rules, (3) isotopic patterns, (4) hydrogen/carbon ratios, (5) element ratios of nitrogen, oxygen, phosphorus, and sulphur versus carbon, (6) element ratio probabilities and (7) presence of trimethylsilylated compounds. Formulas are ranked according to their isotopic patterns and subsequently constrained by presence in public chemical databases. The seven rules were developed on 68,237 existing molecular formulas and were validated in four experiments. First, 432,968 formulas covering five million PubChem database entries were checked for consistency; only 0.6% of these compounds did not pass all rules. Next, the rules were shown to effectively reduce the complement of all eight billion theoretically possible C, H, N, S, O, P formulas up to 2000 Da to only 623 million of the most probable elemental compositions. Third, 6,000 pharmaceutical, toxic and natural compounds were selected from the DrugBank, TSCA and DNP databases. The correct formulas were retrieved as the top hit at 80–99% probability when assuming data acquisition with complete resolution of unique compounds, 5% absolute isotope ratio deviation and 3 ppm mass accuracy. Last, exemplary compounds were analyzed by Fourier transform ion cyclotron resonance mass spectrometry and by gas chromatography time-of-flight mass spectrometry. In each case, the correct formula was ranked as the top hit when combining the seven rules with database queries. CONCLUSION: The seven rules enable an automatic exclusion of molecular formulas which are either wrong or which contain an unlikely high or low number of elements. The correct molecular formula is assigned with a probability of 98% if the formula exists in a compound database. For truly novel compounds that are not present in databases, the correct formula is found in the first three hits with a probability of 65–81%. Corresponding software and supplemental data are available for download from the authors' website.
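    The sketch below shows how a simplified subset of such heuristic filters might look in code. It covers only element-count limits, a ring-plus-double-bond (SENIOR-style valence) check, and H/C and heteroatom/C ratio checks; the numeric thresholds are illustrative approximations rather than the published cut-offs, and the isotope-pattern scoring and database steps are omitted.

```python
# Simplified sketch of a formula filter in the spirit of the "Seven Golden
# Rules". Only a subset of the rules is shown, and the thresholds below are
# illustrative approximations, not the published cut-offs.

def passes_basic_rules(f):
    """f is a dict of element counts, e.g. {'C': 6, 'H': 12, 'O': 6}."""
    c, h = f.get("C", 0), f.get("H", 0)
    n, o = f.get("N", 0), f.get("O", 0)
    p, s = f.get("P", 0), f.get("S", 0)

    # Rule 1 (simplified): cap the number of atoms of each element.
    if c > 78 or h > 126 or n > 20 or o > 27 or p > 9 or s > 14:
        return False

    # Rule 2 (simplified SENIOR/valence check): ring-plus-double-bond
    # equivalents must not be negative.
    rdbe = c - h / 2 + n / 2 + 1
    if rdbe < 0:
        return False

    # Rule 4 (approximate): hydrogen/carbon ratio in a plausible range.
    if c and not (0.1 <= h / c <= 6.0):
        return False

    # Rule 5 (approximate): heteroatom-to-carbon ratios.
    if c and (n / c > 4 or o / c > 3 or p / c > 2 or s / c > 3):
        return False

    return True

print(passes_basic_rules({"C": 6, "H": 12, "O": 6}))   # glucose -> True
print(passes_basic_rules({"C": 1, "H": 20, "O": 1}))   # implausible -> False
```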

    Probabilistic Modelling of Liquid Chromatography Time-of-Flight Mass Spectrometry

    Liquid Chromatography Time-of-Flight Mass Spectrometry (LC-TOFMS) is an analytical platform that is widely used in the study of biological mixtures in the rapidly growing fields of proteomics and metabolomics. The development of statistical methods for the analysis of the very large datasets that are typically produced in LC-TOFMS experiments is a very active area of research. However, the theoretical basis on which these methods are built is currently rather thin and, as a result, inferences regarding the samples analysed are generally drawn in a somewhat qualitative fashion. This thesis concerns the development of a statistical formalism that can be used to describe and analyse the data produced in an LC-TOFMS experiment. This is done through the derivation of a number of probability distributions, each corresponding to a different level of approximation of the distribution of the empirically obtained data. Using such probabilistic models, statistically rigorous methods are developed and validated which are designed to address some of the central problems encountered in the practical analysis of LC-TOFMS data, most notably those related to the identification of unknown metabolites. Unlike most existing bioinformatics techniques, this work aims for rigour rather than generality. Consequently the methods developed are closely tailored to a particular type of TOF mass spectrometer, although they do carry over to other TOF instruments, albeit with important restrictions. And while the algorithms presented may constitute useful analytical tools for the mass spectrometers to which they can be applied, the broader implications of the general methodological approach that is taken are also of central importance. In particular, it is arguable that the main value of this work lies in its role as a proof-of-concept that detailed probabilistic modelling of TOFMS data is possible and can be used in practice to address important data analytical problems in a statistically rigorous manner.
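    As a toy illustration of the general idea, the sketch below scores candidate isotope patterns against observed ion counts under a simple Poisson noise model; this is a deliberately crude stand-in and not one of the distributions actually derived in the thesis.

```python
# Toy illustration of probabilistic modelling of MS data: treat the ion
# count in each isotope peak as a Poisson draw around an expected intensity
# and compare candidate explanations by log-likelihood. A simple stand-in,
# not the specific distributions derived in the thesis.
import math

def poisson_log_likelihood(observed_counts, expected_counts):
    """Sum of Poisson log-pmfs: log P(k | lam) = k*log(lam) - lam - log(k!)."""
    ll = 0.0
    for k, lam in zip(observed_counts, expected_counts):
        if lam <= 0:
            return float("-inf")
        ll += k * math.log(lam) - lam - math.lgamma(k + 1)
    return ll

# Two hypothetical isotope-pattern predictions (expected ion counts per peak)
# scored against one observed pattern; the higher log-likelihood fits better.
observed = [980, 112, 9]
candidate_a = [1000, 110, 8]    # hypothetical formula A
candidate_b = [1000, 60, 2]     # hypothetical formula B
print(poisson_log_likelihood(observed, candidate_a))
print(poisson_log_likelihood(observed, candidate_b))
```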

    Bioinformatics solutions for confident identification and targeted quantification of proteins using tandem mass spectrometry

    Proteins are the structural supports, signal messengers and molecular workhorses that underpin living processes in every cell. Understanding when and where proteins are expressed, and their structure and functions, is the realm of proteomics. Mass spectrometry (MS) is a powerful method for identifying and quantifying proteins; however, very large datasets are produced, so researchers rely on computational approaches to transform raw data into protein information. This project develops new bioinformatics solutions to support the next generation of proteomic MS research. Part I introduces the state of the art in proteomic bioinformatics in industry and academia. The business history and funding mechanisms are examined to fill a notable gap in the management research literature, and to explain events at the sponsor, GlaxoSmithKline. It reveals that public funding of proteomic science has yet to come to fruition and that exclusively high-tech niche bioinformatics businesses can succeed in the current climate. Next, a comprehensive review of repositories for proteomic MS is performed, to locate and compile a summary of sources of datasets for research activities in this project, and as a novel summary for the community. Part II addresses the issue of false positive protein identifications produced by automated analysis with a proteomics pipeline. The work shows that by selecting a suitable decoy database design, a statistically significant improvement in identification accuracy can be made. Part III describes the development of computational resources for selecting multiple reaction monitoring (MRM) assays for quantifying proteins using MS. A tool for transition design, MRMaid (pronounced 'mermaid'), and a database of pre-published transitions, MRMaid-DB, are developed, saving practitioners time and leveraging existing resources for superior transition selection. By improving the quality of identifications, and providing support for quantitative approaches, this project brings the field a small step closer to achieving the goal of systems biology.
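    The decoy-database work in Part II builds on the standard target-decoy strategy for controlling false positives; a minimal sketch of that general idea is shown below, with made-up scores and without reproducing the specific decoy designs evaluated in the thesis.

```python
# Minimal sketch of the standard target-decoy idea: estimate the false
# discovery rate among peptide-spectrum matches above a score threshold as
# (number of decoy hits) / (number of target hits). Scores here are made up
# for illustration.

def estimate_fdr(psms, threshold):
    """psms: list of (score, is_decoy) tuples from a concatenated search."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

def threshold_at_fdr(psms, max_fdr=0.01):
    """Lowest score cut-off whose estimated FDR stays within max_fdr."""
    for score in sorted({s for s, _ in psms}):
        if estimate_fdr(psms, score) <= max_fdr:
            return score
    return None

example = [(45.1, False), (44.0, False), (39.7, True), (38.2, False), (12.5, True)]
cutoff = threshold_at_fdr(example, max_fdr=0.35)
print(cutoff, estimate_fdr(example, cutoff))
```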