
    A novel scoring schema for peptide identification by searching protein sequence databases using tandem mass spectrometry data

    BACKGROUND: Tandem mass spectrometry (MS/MS) is a powerful tool for protein identification. Although great efforts have been made in scoring the correlation between tandem mass spectra and an amino acid sequence database, improvements could be made in three aspects: characterization of peaks in spectra, adoption of effective scoring functions, and assessment of the reliability of matching between peptides and spectra. RESULTS: A novel scoring function is presented, along with criteria to estimate the confidence of its results. By learning the types of product ions and the probability of generating them, a hypothetical spectrum was generated for each candidate peptide. Relative entropy was then introduced to measure the similarity between the hypothetical and the observed spectra. Based on extreme value distribution (EVD) theory, a threshold was chosen to distinguish a true peptide assignment from a random one. Tests on a public MS/MS dataset demonstrated that this method performs better than the well-known SEQUEST. CONCLUSION: Reliable identification of proteins from spectra promises a more efficient application of tandem mass spectrometry to proteomes of high complexity.
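The relative-entropy comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bin width, the pseudocount smoothing, the direction of the divergence (observed relative to hypothetical), and the function names `bin_spectrum`, `normalize`, and `relative_entropy` are all assumptions made for illustration.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Sum peak intensities into fixed-width m/z bins (assumed binning scheme)."""
    n_bins = int(max_mz / bin_width) + 1
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        if 0.0 <= mz <= max_mz:
            bins[int(mz / bin_width)] += intensity
    return bins


def normalize(bins, pseudocount=1e-6):
    """Turn binned intensities into a probability distribution.

    The pseudocount (an assumption) keeps every bin non-zero so the
    logarithm in relative_entropy is always defined.
    """
    smoothed = [b + pseudocount for b in bins]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in nats; lower means more similar."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical (m/z, intensity) peak lists, invented for this example.
observed = [(100.1, 10.0), (200.2, 50.0), (300.3, 40.0)]
close_candidate = [(100.0, 12.0), (200.0, 48.0), (300.0, 40.0)]
far_candidate = [(150.0, 30.0), (250.0, 30.0), (350.0, 40.0)]

p_obs = normalize(bin_spectrum(observed))
score_close = relative_entropy(p_obs, normalize(bin_spectrum(close_candidate)))
score_far = relative_entropy(p_obs, normalize(bin_spectrum(far_candidate)))
```

Under these assumptions the candidate whose predicted peaks fall in the same bins as the observed peaks receives the lower divergence. The EVD-based threshold mentioned in the abstract is not modelled here.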

    Computational Methods for Protein Identification from Mass Spectrometry Data

    Protein identification using mass spectrometry is an indispensable computational tool in the life sciences. A dramatic increase in the use of proteomic strategies to understand the biology of living systems generates an ongoing need for more effective, efficient, and accurate computational methods for protein identification. A wide range of computational methods, each with various implementations, is available to complement different proteomic approaches. A solid knowledge of the range of algorithms available and, more critically, of the accuracy and effectiveness of these techniques is essential to ensure that as many of the proteins as possible within any particular experiment are correctly identified. Here, we undertake a systematic review of the currently available methods and algorithms for interpreting, managing, and analyzing biological data associated with protein identification. We summarize the advances in computational solutions as they have responded to corresponding advances in mass spectrometry hardware. The evolution of scoring algorithms and metrics for automated protein identification is also discussed, with a focus on the relative performance of different techniques. We also consider the relative advantages and limitations of different techniques in particular biological contexts. Finally, we present our perspective on future developments in computational protein identification by considering the most recent literature on new and promising approaches to the problem, identifying areas yet to be explored, and noting the potential application of methods from other areas of computational biology.

    Exploring Information Technologies to Support Shotgun Proteomics

    Shotgun proteomics refers to the direct analysis of complex protein mixtures to create a profile of the proteins present in the cell. These profiles can be used to study the underlying biological basis of cancer development. Closely studying the profiles as the cancer proliferates reveals the molecular interactions in the cell and gives researchers clues about potential drug targets for treating the disease. A little more than a decade old, shotgun proteomics is a relatively new form of discovery, one that is data intensive and requires complex data analysis. Early studies indicated a gap between the ability to analyze biological samples with a mass spectrometer and the information systems available to process and analyze the resulting data. This thesis reflects on an automated proteomic information system at the University of Colorado Central Analytical Facility. Investigators there are using cutting-edge proteomic techniques to analyze melanoma cell lines responsible for skin cancer in patients. The thesis provides insight into key design processes in the development of an Oracle relational database and automation system to support high-throughput shotgun proteomics in the facility. It also discusses significant contributions, technologies, software, a data standard, and leaders in the field developing solutions and products in proteomics.

    Computational methods and tools for protein phosphorylation analysis

    Signaling pathways represent a central regulatory mechanism of biological systems, and a key event in their correct functioning is the reversible phosphorylation of proteins. Protein phosphorylation affects at least one-third of all proteins and is the most widely studied posttranslational modification. Phosphorylation analysis is still generally perceived as difficult or cumbersome and is not readily attempted by many, despite the high value of such information. Specifically, determining the exact location of a phosphorylation site is currently considered a major hurdle, so reliable approaches are necessary for the detection and localization of protein phosphorylation. The goal of this PhD thesis was to develop computational methods and tools for mass spectrometry-based protein phosphorylation analysis, particularly validation of phosphorylation sites. In the first two studies, we developed methods for improved identification of phosphorylation sites in MALDI-MS. In the first study this was achieved through the automatic combination of spectra from multiple matrices, while in the second study an optimized protocol for sample loading and washing conditions was suggested. In the third study, we proposed and evaluated the hypothesis that in ESI-MS, tandem CID and HCD spectra of phosphopeptides can be accurately predicted and used in spectral library searching. This novel strategy for phosphosite validation and identification offered accuracy that outperformed other popular existing methods and proved applicable to complex biological samples. Finally, we significantly improved the performance of our command-line prototype tool and added a graphical user interface, options for customizable simulation parameters, and filtering of selected spectra, peptides, or proteins. The new software, SimPhospho, is open-source and can be easily integrated into a phosphoproteomics data analysis workflow. Together, these bioinformatics methods and tools enable confident phosphosite assignment and improve reliable phosphoproteome identification and reporting.

    The Drosophila melanogaster PeptideAtlas facilitates the use of peptide data for improved fly proteomics and genome annotation

    Background: Crucial foundations of any quantitative systems biology experiment are correct genome and proteome annotations. Protein databases compiled from high quality empirical protein identifications that are in turn based on correct gene models increase the correctness, sensitivity, and quantitative accuracy of systems biology genome-scale experiments. Results: In this manuscript, we present the Drosophila melanogaster PeptideAtlas, a fly proteomics and genomics resource of unsurpassed depth. Based on peptide mass spectrometry data collected in our laboratory, the portal http://www.drosophila-peptideatlas.org allows querying fly protein data observed with respect to gene model confirmation and splice site verification as well as for the identification of proteotypic peptides suited for targeted proteomics studies. Additionally, the database provides consensus mass spectra for observed peptides along with qualitative and quantitative information about the number of observations of a particular peptide and the sample(s) in which it was observed. Conclusion: PeptideAtlas is an open access database for the Drosophila community that has several features and applications that support (1) reduction of the complexity inherently associated with performing targeted proteomic studies, (2) designing and accelerating shotgun proteomics experiments, (3) confirming or questioning gene models, and (4) adjusting gene models such that they are in line with observed Drosophila peptides. While the database consists of proteomic data, it is not required that the user is a proteomics expert.

    Novel Computational Methods for the Analysis and Interpretation of MS/MS Data in Metaproteomics

    Otto-von-Guericke-Universität Magdeburg, Faculty of Process and Systems Engineering, dissertation, 2016, by Dipl.-Bioinf. Thilo Muth. Bibliography: pages 151-17

    De novo sequencing of heparan sulfate saccharides using high-resolution tandem mass spectrometry

    Heparan sulfate (HS) is a class of linear, sulfated polysaccharides located on cell surfaces, in secretory granules, and in extracellular matrices, found in all animal organ systems. It consists of alternately repeating disaccharide units and is expressed in animal species ranging from hydra to higher vertebrates, including humans. HS binds and mediates the biological activities of over 300 proteins, including growth factors, enzymes, chemokines, cytokines, adhesion and structural proteins, lipoproteins, and amyloid proteins. The binding events largely depend on the fine structure of HS chains, that is, the arrangement of sulfate groups and other variations. With the activated electron dissociation (ExD) high-resolution tandem mass spectrometry technique, researchers acquire rich structural information about the HS molecule. Using this technique, covalent bonds of the HS oligosaccharide ions are dissociated in the mass spectrometer. However, this information is complex, owing to the large number of product ions, and contains a degree of ambiguity due to overlapping product ion masses and the lability of sulfate groups. As a result, manual interpretation of the spectra is a serious bottleneck to understanding the biological roles of HS. To solve this problem, I designed HS-SEQ, the first HS sequencing algorithm using high-resolution tandem mass spectrometry. HS-SEQ allows rapid and confident sequencing of HS chains from millions of candidate structures, and I validated its performance using multiple known pure standards. In many cases, HS oligosaccharides exist as mixtures of sulfation positional isomers. I therefore designed MULTI-HS-SEQ, an extended version of HS-SEQ targeting spectra that come from more than one HS sequence. I also developed several pre-processing and post-processing modules to support the automatic identification of HS structure. These methods and tools demonstrated the capacity for large-scale HS sequencing, which should contribute to clarifying the rich information encoded by HS chains as well as to developing tailored HS drugs that target a wide spectrum of diseases.

    Computational Strategies for Proteogenomics Analyses

    Proteogenomics is an area of proteomics concerning the detection of novel peptides and peptide variants nominated by genomics and transcriptomics experiments. While the term primarily refers to studies utilizing a customized protein database derived from select sequencing experiments, proteogenomics methods can also be applied in the quest to identify previously unobserved, or missing, proteins in a reference protein database. The identification of novel peptides is difficult, and results can be dominated by false positives if conventional computational and statistical approaches for shotgun proteomics are applied directly without consideration of the challenges involved in proteogenomics analyses. In this dissertation, I systematically distill the sources of false positives in peptide identification and present potential remedies, including computational strategies that are necessary to make these approaches feasible for large datasets. In the first part, I analyze high-scoring decoys, which are false identifications with high assigned confidences, using multiple peptide identification strategies to understand how they are generated and to develop strategies for reducing false positives. I also demonstrate that modified peptides can cause violations of the target-decoy assumptions, a cornerstone of error rate estimation in shotgun proteomics, leading to potential underestimation of the number of false positives. Second, I address computational bottlenecks in proteogenomics workflows through the development of two database search engines: EGADS and MSFragger. EGADS aims to address issues relating to the large sequence space involved in proteogenomics studies by using graphics processing units to accelerate both in-silico digestion and similarity scoring. MSFragger implements a novel fragment ion index and searching algorithm that vastly speeds up spectra similarity calculations. For the identification of modified peptides using the open search strategy, MSFragger is over 150X faster than conventional database search tools. Finally, I discuss refinements to the open search strategy for detecting modified peptides and tools for improved collation and annotation. Using the speed afforded by MSFragger, I perform open searching on several large-scale proteomics experiments, identifying modified peptides on an unprecedented scale and demonstrating its utility in diverse proteomics applications. The ability to rapidly and comprehensively identify modified peptides allows for the reduction of false positives in proteogenomics. It also has implications for discovery proteomics by allowing the detection of both common and rare (including novel) biological modifications that are often not considered in large-scale proteomics experiments. The ability to account for all chemically modified peptides may also improve protein abundance estimates in quantitative proteomics.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/138581/1/andykong_1.pd
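The target-decoy error estimation that this abstract refers to can be sketched generically as follows. This is a textbook-style illustration, not code from the dissertation or from MSFragger: the function name `fdr_threshold`, the `(score, is_decoy)` tuple layout, and the simple decoys/targets estimator are assumptions, and tied scores are not handled specially.

```python
def fdr_threshold(psms, max_fdr=0.01):
    """Find the most permissive score cutoff whose estimated FDR is within max_fdr.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    The FDR at a cutoff is estimated as (#decoys accepted) / (#targets accepted),
    which assumes a false target match is as likely as a decoy match.
    Returns (cutoff_score, n_targets_accepted), or (None, 0) if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = 0
    decoys = 0
    best = (None, 0)
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the lowest-scoring position that still meets the FDR limit.
        if targets > 0 and decoys / targets <= max_fdr:
            best = (score, targets)
    return best


# Hypothetical peptide-spectrum matches, invented for this example.
example_psms = [
    (10.0, False), (9.0, False), (8.0, False),
    (7.0, True), (6.0, False), (5.0, True),
]
cutoff, accepted = fdr_threshold(example_psms, max_fdr=0.25)
```

If modified peptides make false target matches more likely than decoy matches, as the abstract argues can happen, this estimator would undercount false positives, because its core assumption no longer holds.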

    Bioinformatics of Phosphoproteomics
