    Computational methods and tools for protein phosphorylation analysis

    Signaling pathways are a central regulatory mechanism of biological systems, and a key event in their correct functioning is the reversible phosphorylation of proteins. Protein phosphorylation affects at least one-third of all proteins and is the most widely studied posttranslational modification. Phosphorylation analysis is still generally perceived as difficult or cumbersome and is not readily attempted by many, despite the high value of such information. Specifically, determining the exact location of a phosphorylation site is currently considered a major hurdle, so reliable approaches are necessary for the detection and localization of protein phosphorylation. The goal of this PhD thesis was to develop computational methods and tools for mass spectrometry-based protein phosphorylation analysis, particularly validation of phosphorylation sites. In the first two studies, we developed methods for improved identification of phosphorylation sites in MALDI-MS. In the first study, this was achieved through the automatic combination of spectra from multiple matrices, while in the second study, an optimized protocol for sample loading and washing conditions was suggested. In the third study, we proposed and evaluated the hypothesis that in ESI-MS, tandem CID and HCD spectra of phosphopeptides can be accurately predicted and used in spectral library searching. This novel strategy for phosphosite validation and identification offered accuracy that outperformed other popular existing methods and proved applicable to complex biological samples. Finally, we significantly improved the performance of our command-line prototype tool and added a graphical user interface, along with options for customizable simulation parameters and for filtering of selected spectra, peptides or proteins. The new software, SimPhospho, is open-source and can be easily integrated into a phosphoproteomics data analysis workflow.
Together, these bioinformatics methods and tools enable confident phosphosite assignment and improve the reliability of phosphoproteome identification and reporting.


    This dissertation focuses on improvements to data analysis in mass spectrometry-based proteomics, which is the study of an organism’s full complement of proteins. One of the biggest surprises from the Human Genome Project was the relatively small number of genes (~20,000) encoded in our DNA. Since genes code for proteins, scientists expected more genes would be necessary to produce a diverse set of proteins to cover the many functions that support the complexity of life. Thus, there is intense interest in studying proteomics, including post-translational modifications (how proteins change after translation from their genes) and protein interactions (e.g., proteins binding together to form complex molecular machines), to fill the void in molecular diversity. The goal of mass spectrometry in proteomics is to determine the abundance and amino acid sequence of every protein in a biological sample. A mass spectrometer can determine mass/charge ratios and abundances for fragments of short peptides (which are subsequences of a protein); sequencing algorithms determine which peptides are most likely to have generated the fragmentation patterns observed in the mass spectrum, and protein identity is inferred from the peptides. My work improves the computational tools for mass spectrometry by removing limitations of existing algorithms, simulating mass spectrometry instruments to facilitate algorithm development, and creating algorithms that approximate isotope distributions, deconvolve chimeric spectra, and predict protein-protein interactions. While most sequencing algorithms attempt to identify a single peptide per mass spectrum, multiple peptides are often fragmented together. Here, I present a method to deconvolve these chimeric mass spectra into their individual peptide components by examining the isotopic distributions of their fragments. First, I derived the equation to calculate the theoretical isotope distribution of a peptide fragment.
Next, for cases where elemental compositions are not known, I developed methods to approximate the isotope distributions. Ultimately, I created a non-negative least squares model that deconvolved chimeric spectra and increased peptide-spectrum matches by 15-30%. To improve the operation of mass spectrometer instruments, I developed software that simulates liquid chromatography-mass spectrometry data and the subsequent execution of custom data acquisition algorithms. The software provides an opportunity for researchers to test, refine, and evaluate novel algorithms prior to implementation on a mass spectrometer. Finally, I created a logistic regression classifier for predicting protein-protein interactions defined by affinity purification and mass spectrometry (APMS). The classifier increased the area under the receiver operating characteristic curve by 16% compared to previous methods. Furthermore, I created a web application to facilitate APMS data scoring within the scientific community.
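
To make the isotope-distribution step concrete, here is a minimal sketch that computes the aggregated (nominal-mass) isotope distribution of a whole molecule by repeatedly convolving per-element isotope abundances. This is the generic textbook calculation, not the fragment-specific equation derived in the dissertation; the function names and the rounded abundance table are assumptions made for illustration.

```python
# Approximate natural isotope abundances, indexed by the number of extra
# neutrons relative to the lightest isotope (rounded, illustrative values).
ISOTOPES = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}


def convolve(a, b, max_len):
    """Discrete convolution of two abundance vectors, truncated to max_len."""
    out = [0.0] * min(len(a) + len(b) - 1, max_len)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < len(out):
                out[i + j] += x * y
    return out


def isotope_distribution(composition, max_peaks=6):
    """Return [P(M+0), P(M+1), ...] for a composition such as {"C": 6, "H": 12}."""
    dist = [1.0]
    for element, count in composition.items():
        # Adding one atom corresponds to one convolution with its isotope pattern.
        for _ in range(count):
            dist = convolve(dist, ISOTOPES[element], max_peaks)
    return dist
```

For example, `isotope_distribution({"C": 6, "H": 12, "O": 6})` returns the first six isotope peak probabilities for glucose. Theoretical patterns of this kind are what observed isotope envelopes can be compared against.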

    Molecular Formula Identification using High Resolution Mass Spectrometry: Algorithms and Applications in Metabolomics and Proteomics

    We investigate several theoretical and practical aspects of identifying the molecular formula of biomolecules using high-resolution mass spectrometry. Owing to recent advances in instrumentation, mass spectrometry (MS) has become one of the key technologies for the analysis of biomolecules in proteomics and metabolomics. It measures the masses of the molecules in a sample with high accuracy and is well suited to high-throughput data acquisition. One of the core tasks in MS-based proteomics and metabolomics is the identification of the molecules in the sample. In metabolomics, metabolites are subjected to structure elucidation, starting with the molecular formula of a molecule, i.e., the number of atoms of each element. This is the crucial step in identifying an unknown metabolite, because the determined formula reduces the number of possible molecular structures to a much smaller set, which can then be analyzed further with automated structure elucidation methods. After preprocessing, the output of a mass spectrometer is a list of peaks corresponding to the molecular masses and their intensities, i.e., the number of molecules with a given mass. In principle, the molecular formulas of small molecules can be identified from accurate masses alone. However, it has been found that, because of the large number of chemically legitimate formulas in the upper mass range, excellent mass accuracy alone is not sufficient for identification. High-resolution MS allows molecular masses and intensities to be determined with outstanding accuracy. In this thesis, we develop several algorithms and applications that use this information to identify the molecular formulas of biomolecules.
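
The mass-only identification problem described above can be sketched as a brute-force enumeration: list every combination of C, H, N and O atoms whose monoisotopic mass lies within a given tolerance of the measured mass. This is a naive illustration under stated assumptions (a four-element alphabet, no chemical validity filters, a hypothetical function name `candidate_formulas`), not one of the algorithms developed in the thesis.

```python
# Monoisotopic masses in daltons (standard values, rounded to 8 decimals).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def candidate_formulas(mass, ppm=5.0):
    """Enumerate (C, H, N, O) counts whose monoisotopic mass is within ppm of mass."""
    tol = mass * ppm * 1e-6
    hits = []
    for c in range(int((mass + tol) // MONO["C"]) + 1):
        rest_c = mass + tol - c * MONO["C"]
        for n in range(int(rest_c // MONO["N"]) + 1):
            rest_n = rest_c - n * MONO["N"]
            for o in range(int(rest_n // MONO["O"]) + 1):
                rest_o = rest_n - o * MONO["O"]
                # The tolerance is far smaller than one hydrogen mass, so only
                # the largest hydrogen count that still fits can match.
                h = int(rest_o // MONO["H"])
                total = (c * MONO["C"] + h * MONO["H"]
                         + n * MONO["N"] + o * MONO["O"])
                if abs(total - mass) <= tol:
                    hits.append((c, h, n, o))
    return hits
```

Calling `candidate_formulas(180.06339)` yields a candidate list that includes glucose, C6H12O6, as `(6, 12, 0, 6)`. Running the function on larger masses, or with more elements, shows how the candidate list grows, which is why additional information such as isotope intensities is needed.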

    Integration of Mass Spectrometry Data for Structural Biology

    Mass spectrometry (MS) is increasingly being used to probe the structure and dynamics of proteins and the complexes they form with other macromolecules. There are now several specialized MS methods, each with unique sample preparation, data acquisition, and data processing protocols. Collectively, these methods are referred to as structural MS and include cross-linking, hydrogen-deuterium exchange, hydroxyl radical footprinting, native, ion mobility, and top-down MS. Each of these provides a unique type of structural information, ranging from composition and stoichiometry through to residue-level proximity and solvent accessibility. Structural MS has proved particularly beneficial in studying protein classes for which analysis by classic structural biology techniques proves challenging, such as glycosylated or intrinsically disordered proteins. To capture the structural details for a particular system, especially larger multiprotein complexes, more than one structural MS method, combined with other structural and biophysical techniques, is often required. Key to integrating these diverse data are computational strategies and software solutions to facilitate this process. We provide a background to the structural MS methods and briefly summarize other structural methods and how these are combined with MS. We then describe current state-of-the-art approaches for the integration of structural MS data for structural biology. We quantify how often these methods are used together and provide examples where such combinations have been fruitful. To illustrate the power of integrative approaches, we discuss progress in solving the structures of the proteasome and the nuclear pore complex. We also discuss how information from structural MS, particularly pertaining to protein dynamics, is not currently utilized in integrative workflows and how such information can provide a more accurate picture of the systems studied.
We conclude by discussing new developments in the MS and computational fields that will further enable in-cell structural studies.

    De novo sequencing of heparan sulfate saccharides using high-resolution tandem mass spectrometry

    Heparan sulfate (HS) is a class of linear, sulfated polysaccharides located on cell surfaces, in secretory granules, and in extracellular matrices in all animal organ systems. It consists of alternating repeating disaccharide units and is expressed in animal species ranging from hydra to higher vertebrates, including humans. HS binds and mediates the biological activities of over 300 proteins, including growth factors, enzymes, chemokines, cytokines, adhesion and structural proteins, lipoproteins and amyloid proteins. The binding events largely depend on the fine structure - the arrangement of sulfate groups and other variations - on HS chains. With the activated electron dissociation (ExD) high-resolution tandem mass spectrometry technique, researchers acquire rich structural information about the HS molecule. Using this technique, covalent bonds of the HS oligosaccharide ions are dissociated in the mass spectrometer. However, this information is complex, owing to the large number of product ions, and contains a degree of ambiguity due to the overlapping of product ion masses and the lability of sulfate groups; as a result, manual interpretation of the spectra is very difficult. The interpretation of such data is therefore a serious bottleneck to understanding the biological roles of HS. In order to solve this problem, I designed HS-SEQ - the first HS sequencing algorithm using high-resolution tandem mass spectrometry. HS-SEQ allows rapid and confident sequencing of HS chains from millions of candidate structures, and I validated its performance using multiple known pure standards. In many cases, HS oligosaccharides exist as mixtures of sulfation positional isomers. I therefore designed MULTI-HS-SEQ, an extended version of HS-SEQ targeting spectra coming from more than one HS sequence. I also developed several pre-processing and post-processing modules to support the automatic identification of HS structure.
These methods and tools demonstrated the capacity for large-scale HS sequencing, which should contribute to clarifying the rich information encoded by HS chains as well as developing tailored HS drugs to target a wide spectrum of diseases.

    Optimized GeLC-MS/MS for Bottom-Up Proteomics

    Despite tremendous advances in mass spectrometry instrumentation and mass spectrometry-based methodologies, global protein profiling of organellar, cellular, tissue and body fluid proteomes in different organisms remains a challenging task due to the complexity of the samples and the wide dynamic range of protein concentrations. In addition, the large amounts of data produced make the results difficult to exploit. To overcome these issues, further advances in sample preparation, mass spectrometry instrumentation, and data processing and analysis are required. The presented study focuses first on improving the proteolytic digestion of proteins in the in-gel-based proteomic approach (GeLC-MS/MS). To this end, commonly used bovine trypsin (BT) was modified with oligosaccharides in order to overcome its main disadvantages, such as weak thermostability and fast autolysis at basic pH. Glycosylated trypsin derivatives maintained their cleavage specificity and showed better thermostability, greater autolysis resistance and less autolytic background than unmodified BT. In line with the “accelerated digestion protocol” (ADP) previously established in our laboratory, the modified enzymes were tested in in-gel digestion of proteins. The kinetics of in-gel digestion were studied by MALDI-TOF mass spectrometry using 18O-labeled peptides as internal standards, as well as by a label-free quantification approach that utilizes the intensities of peptide ions detected by nanoLC-MS/MS. In this kinetic study, the effect of temperature, enzyme concentration and digestion time on the yield of digestion products was characterized. The results showed that in-gel digestion of proteins by glycosylated trypsin conjugates was less efficient than the conventional digestion (CD) and reached at most 50 to 70% of the CD yield, suggesting that the attached sugar molecules limit free diffusion of the modified trypsins into the polyacrylamide gel pores.
Nevertheless, these thermostable and autolysis-resistant enzymes can be regarded as promising candidates for gel-free shotgun approaches. To address the reliability of proteomic data, I further focused on protein identifications with borderline statistical confidence produced by database searching. These hits are typically produced by matching a few MS/MS spectra of marginal quality to database peptide sequences and represent a significant bottleneck in proteomics. A method was developed for rapid validation of borderline hits. It takes advantage of the independent interpretation of the acquired tandem mass spectra by the de novo sequencing software PepNovo, followed by mass spectrometry-driven BLAST (MS BLAST) sequence similarity searching that utilizes all partially accurate, degenerate and redundant proposed peptide sequences. It was demonstrated that a combination of MASCOT, the de novo sequencing software PepNovo and MS BLAST, bundled by a simple scripted interface, enabled rapid and efficient validation of a large number of borderline hits produced by matching one or two MS/MS spectra with marginal statistical significance.

    NBPMF: Novel Network-Based Inference Methods for Peptide Mass Fingerprinting

    Proteins are large, complex molecules that perform a vast array of functions in every living cell. A proteome is the set of proteins produced in an organism, and proteomics is the large-scale study of proteomes. Several high-throughput technologies have been developed in proteomics, of which the most commonly applied are mass spectrometry (MS) based approaches. MS is an analytical technique for determining the composition of a sample. Recently it has become a primary tool for protein identification, quantification, and post-translational modification (PTM) characterization in proteomics research. There are two main ways to identify proteins: top-down and bottom-up. Top-down approaches subject intact protein ions and large fragment ions to tandem MS directly, while bottom-up methods are based on mass spectrometric analysis of peptides derived from proteolytic digestion, usually with trypsin. In bottom-up techniques, peptide mass fingerprinting (PMF) is widely used to identify proteins from MS datasets. Conventional PMF approaches, such as the probabilistic MOWSE algorithm, are based on the mass distribution of tryptic peptides. In this thesis, we developed a novel network-based inference software termed NBPMF. By analyzing the peptide-protein bipartite network, we designed new peptide-protein matching score functions. We present two methods: the static one, ProbS, is based on an independent probability framework; and the dynamic one, HeatS, models the input dataset as dependent peptides. Moreover, we use linear regression to adjust the matching score according to the masses of proteins. In addition, we consider the order of retention time to further correct the score function. In post-processing, we design two algorithms: assignment of peaks, and protein filtration. The former restricts each peak to a single peptide in order to reduce random matches, and the latter assumes each peak can be assigned to only one protein.
For result validation, we propose two new target-decoy search strategies to estimate the false discovery rate (FDR). Experiments on simulated, authentic, and simulated authentic datasets demonstrate that our NBPMF approaches perform significantly better than several state-of-the-art methods.
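
For readers unfamiliar with target-decoy validation, the following minimal sketch shows the commonly used basic estimator: the FDR at a score cutoff is approximated by the number of decoy hits at or above the cutoff divided by the number of target hits at or above it. This is the standard baseline, not either of the two new strategies proposed in the thesis; the function names and the score lists are hypothetical.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoy hits >= threshold) / (#target hits >= threshold)."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    if targets == 0:
        return 0.0  # nothing accepted, so no false discoveries to report
    return min(1.0, decoys / targets)


def threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest target score cutoff whose estimated FDR is <= max_fdr."""
    best = None
    # Walk cutoffs from strictest to most permissive and remember the last
    # (lowest) one that still satisfies the FDR bound.
    for t in sorted(set(target_scores), reverse=True):
        if fdr_at_threshold(target_scores, decoy_scores, t) <= max_fdr:
            best = t
    return best
```

In use, the target scores would come from searching the real protein database and the decoy scores from searching a reversed or shuffled version of it; identifications scoring at or above the returned cutoff are then accepted.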

    Quantification and Simulation of Liquid Chromatography-Mass Spectrometry Data

    Computational mass spectrometry is a fast-evolving field that has attracted increased attention over the last couple of years. The performance of software solutions determines the success of analysis to a great extent. New algorithms are required to reflect new experimental procedures and deal with new instrument generations. One essential component of algorithm development is the validation (as well as comparison) of software on a broad range of data sets. This requires a gold standard (or so-called ground truth), which is usually obtained by manual annotation of a real data set. Comprehensive manually annotated public data sets for mass spectrometry data are labor-intensive to produce, and their quality strongly depends on the skill of the human expert. Some parts of the data may even be impossible to annotate due to high levels of noise or other ambiguities. Furthermore, manually annotated data is usually not available for all steps in a typical computational analysis pipeline. We thus developed the most comprehensive simulation software to date, which can generate multiple levels of ground truth and features a plethora of settings to reflect experimental conditions and instrument settings. The simulator is used to generate several distinct types of data. The data are subsequently employed to evaluate existing algorithms. Additionally, we employ simulation to determine the influence of instrument attributes and sample complexity on the ability of algorithms to recover information. The results give valuable hints on how to optimize experimental setups. Furthermore, this thesis introduces two quantitative approaches, namely a decharging algorithm based on integer linear programming and a new workflow for identification of differentially expressed proteins for a large in vitro study on toxic compounds. Decharging infers the uncharged mass of a peptide (or protein) by clustering all its charge variants.
The latter occur frequently under certain experimental conditions. We employ simulation to show that decharging is robust against missing values even for high complexity data and that the algorithm outperforms other solutions in terms of mass accuracy and run time on real data. The last part of this thesis deals with a new state-of-the-art workflow for protein quantification based on isobaric tags for relative and absolute quantitation (iTRAQ). We devise a new approach to isotope correction, propose an experimental design, introduce new metrics of iTRAQ data quality, and confirm putative properties of iTRAQ data using a novel approach. All tools developed as part of this thesis are implemented in OpenMS, a C++ library for computational mass spectrometry.
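
The idea behind decharging can be illustrated with a small sketch: convert each observed peak to its neutral mass using M = z * (m/z - m_proton), then group peaks whose neutral masses agree. The greedy grouping below is a deliberately simple stand-in and is not the integer linear programming formulation used in the thesis. It also assumes that the charge of each peak is already known and that all ions are protonated; the function names are hypothetical.

```python
PROTON = 1.007276  # mass of a proton in daltons (rounded)


def neutral_mass(mz, charge):
    """Uncharged mass of a protonated ion observed at m/z with the given charge."""
    return charge * (mz - PROTON)


def group_charge_variants(peaks, tol=0.01):
    """Greedily cluster (mz, charge) peaks whose neutral masses agree within tol Da."""
    groups = []
    for mz, z in sorted(peaks, key=lambda p: neutral_mass(*p)):
        m = neutral_mass(mz, z)
        if groups and abs(m - groups[-1]["mass"]) <= tol:
            g = groups[-1]
            g["members"].append((mz, z))
            # Keep the running mean of the member masses as the group mass.
            g["mass"] += (m - g["mass"]) / len(g["members"])
        else:
            groups.append({"mass": m, "members": [(mz, z)]})
    return groups
```

Each returned group holds one inferred uncharged mass together with the charge variants assigned to it. A greedy pass like this can mis-assign peaks when masses are dense, which is one motivation for a global optimization approach.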