De novo sequencing of MS/MS spectra
Proteomics is the study of proteins, their time- and location-dependent expression profiles, as well as their modifications and interactions. Mass spectrometry is useful to investigate many of the questions asked in proteomics. Database search methods are typically employed to identify proteins from complex mixtures. However, databases are not often available or, despite their availability, some sequences are not readily found therein. To overcome this problem, de novo sequencing can be used to directly assign a peptide sequence to a tandem mass spectrometry spectrum. Many algorithms have been proposed for de novo sequencing and a selection of them is detailed in this article. Although a standard accuracy measure has not been agreed upon in the field, relative algorithm performance is discussed. The current state of de novo sequencing is assessed thereafter and, finally, examples are used to construct possible future perspectives of the field. © 2011 Expert Reviews Ltd. The Turkish Academy of Science (TÜBA
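As a rough illustration of what "directly assigning a peptide sequence to a spectrum" means, the sketch below reads residues off the mass gaps between consecutive peaks of an idealized, complete prefix-ion ladder. It is not any of the algorithms surveyed in the article; the function names `ladder` and `read_ladder`, the reduced residue table, and the 0.01 Da tolerance are all assumptions made for the example.

```python
# Monoisotopic residue masses in Da for a small subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}


def ladder(peptide):
    """Build an idealized prefix-mass ladder (hypothetical helper).

    Starts at 0.0 and adds one residue mass per position, so the list
    has len(peptide) + 1 entries and no noise or missing peaks.
    """
    masses = [0.0]
    for aa in peptide:
        masses.append(masses[-1] + RESIDUE_MASS[aa])
    return masses


def read_ladder(prefix_masses, tol=0.01):
    """Translate each gap between neighbouring peaks into a residue.

    A gap that matches exactly one residue within `tol` yields that
    residue; any other gap is reported as '?'.
    """
    peaks = sorted(prefix_masses)
    sequence = []
    for low, high in zip(peaks, peaks[1:]):
        gap = high - low
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol]
        sequence.append(matches[0] if len(matches) == 1 else "?")
    return "".join(sequence)


if __name__ == "__main__":
    print(read_ladder(ladder("GASP")))
```

Real spectra contain noise peaks, missing fragments, and several ion types, which is why practical de novo tools search over many candidate paths rather than reading a single ladder.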
Algorithms for peptide and PTM identification using Tandem mass spectrometry
Ph.D. (Doctor of Philosophy)
Systematic Evaluation of Protein Sequence Filtering Algorithms for Proteoform Identification Using Top-Down Mass Spectrometry
Complex proteoforms contain various primary structural alterations resulting from variations in genes, RNA, and proteins. Top-down mass spectrometry is commonly used for analyzing complex proteoforms because it provides whole sequence information of the proteoforms. Proteoform identification by top-down mass spectral database search is a challenging computational problem because the types and/or locations of some alterations in target proteoforms are in general unknown. Although spectral alignment and mass graph alignment algorithms have been proposed for identifying proteoforms with unknown alterations, they are extremely slow when aligning millions of spectra against tens of thousands of protein sequences in high-throughput proteome-level analyses. Many software tools in this area combine efficient protein sequence filtering algorithms and spectral alignment algorithms to speed up database search. As a result, the performance of these tools relies heavily on the sensitivity and efficiency of their filtering algorithms. Here, we propose two efficient approximate spectrum-based filtering algorithms for proteoform identification. We evaluated the performance of the proposed algorithms and four existing ones on simulated and real top-down mass spectrometry data sets. Experiments showed that the proposed algorithms outperformed the existing ones for complex proteoform identification. In addition, combining the proposed filtering algorithms and mass graph alignment algorithms identified many proteoforms missed by ProSightPC in proteome-level proteoform analyses.
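The general idea of filtering before alignment can be sketched as follows: score every database protein with a cheap count of matching masses and pass only the best few on to the expensive alignment step. This is a deliberately simplified stand-in, not one of the filtering algorithms proposed or evaluated in the paper; the names `prefix_masses`, `count_matches`, and `filter_candidates`, the toy database, and the 0.02 Da tolerance are assumptions for illustration.

```python
from bisect import bisect_left

# Monoisotopic residue masses in Da for a small subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}


def prefix_masses(protein):
    """Return cumulative residue masses; ascending because masses are positive."""
    total, out = 0.0, []
    for aa in protein:
        total += RESIDUE_MASS[aa]
        out.append(total)
    return out


def count_matches(query, theoretical, tol=0.02):
    """Count query masses that have a theoretical mass within `tol`."""
    hits = 0
    for mass in query:
        # First theoretical mass that is not below the tolerance window.
        i = bisect_left(theoretical, mass - tol)
        if i < len(theoretical) and theoretical[i] <= mass + tol:
            hits += 1
    return hits


def filter_candidates(query, database, keep=1, tol=0.02):
    """Rank proteins by match count and return the names of the top `keep`."""
    scored = [
        (count_matches(query, prefix_masses(seq), tol), name)
        for name, seq in database.items()
    ]
    # Highest score first; ties broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:keep]]


if __name__ == "__main__":
    database = {"P1": "GASPVT", "P2": "KEFLND"}  # toy sequences, not real proteins
    query = prefix_masses("GASPVT")[:4]
    print(filter_candidates(query, database))
```

A filter of this exact-match kind would miss proteoforms whose fragment masses are shifted by unknown alterations, which is the situation the approximate filters described above are meant to handle.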
Top-down analysis of protein samples by de novo sequencing techniques
Motivation: Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry, and implying a need for efficient methods for analyzing this kind of data.
Results: We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. The former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns.
Availability and Implementation: Freely available on the web at http://bioinf.spbau.ru/en/twister
De novo sequencing of heparan sulfate saccharides using high-resolution tandem mass spectrometry
Heparan sulfate (HS) is a class of linear, sulfated polysaccharides located on the cell surface, in secretory granules, and in extracellular matrices found in all animal organ systems. It consists of alternately repeating disaccharide units, expressed in animal species ranging from hydra to higher vertebrates including humans. HS binds and mediates the biological activities of over 300 proteins, including growth factors, enzymes, chemokines, cytokines, adhesion and structural proteins, lipoproteins and amyloid proteins. The binding events largely depend on the fine structure - the arrangement of sulfate groups and other variations - on HS chains.
With the activated electron dissociation (ExD) high-resolution tandem mass spectrometry technique, researchers acquire rich structural information about the HS molecule. Using this technique, covalent bonds of the HS oligosaccharide ions are dissociated in the mass spectrometer. However, this information is complex, owing to the large number of product ions, and contains a degree of ambiguity due to the overlapping of product ion masses and the lability of sulfate groups; as a result, there is a serious barrier to manual interpretation of the spectra. The interpretation of such data creates a serious bottleneck to the understanding of the biological roles of HS. To solve this problem, I designed HS-SEQ - the first HS sequencing algorithm using high-resolution tandem mass spectrometry. HS-SEQ allows rapid and confident sequencing of HS chains from millions of candidate structures, and I validated its performance using multiple known pure standards. In many cases, HS oligosaccharides exist as mixtures of sulfation positional isomers. I therefore designed MULTI-HS-SEQ, an extended version of HS-SEQ targeting spectra coming from more than one HS sequence. I also developed several pre-processing and post-processing modules to support the automatic identification of HS structure. These methods and tools demonstrated the capacity for large-scale HS sequencing, which should contribute to clarifying the rich information encoded by HS chains as well as developing tailored HS drugs to target a wide spectrum of diseases.
Complex Proteoform Identification Using Top-Down Mass Spectrometry
Indiana University-Purdue University Indianapolis (IUPUI)
Proteoforms are distinct protein molecule forms created by variations in genes, gene expression, and other biological processes. Many proteoforms contain multiple primary structural alterations, including amino acid substitutions, terminal truncations, and post-translational modifications. These primary structural alterations play a crucial role in determining protein functions: proteoforms from the same protein with different alterations may exhibit different functional behaviors. Because top-down mass spectrometry directly analyzes intact proteoforms and provides complete sequence information of proteoforms, it has become the method of choice for the identification of complex proteoforms. Although instruments and experimental protocols for top-down mass spectrometry have been advancing rapidly in the past several years, many computational problems in this area remain unsolved, and the development of software tools for analyzing such data is still at a very early stage. In this dissertation, we propose several novel algorithms for challenging computational problems in proteoform identification by top-down mass spectrometry. First, we present two approximate spectrum-based protein sequence filtering algorithms that quickly find a small number of candidate proteins from a large proteome database for a query mass spectrum. Second, we describe mass graph-based alignment algorithms that efficiently identify proteoforms with variable post-translational modifications and/or terminal truncations. Third, we propose a Markov chain Monte Carlo method for estimating the statistical significance of identified proteoform spectrum matches. These are the first efficient algorithms that take into account three types of alterations in proteoform identification: variable post-translational modifications, unexpected alterations, and terminal truncations. As a result, they are more sensitive and powerful than other existing methods that consider only one or two of the three types of alterations. All the proposed algorithms have been incorporated into TopMG, a complete software pipeline for complex proteoform identification. Experimental results showed that TopMG significantly increases the number of identifications compared with other existing methods in proteome-level top-down mass spectrometry studies. TopMG will facilitate the applications of top-down mass spectrometry in many areas, such as the identification and quantification of clinically relevant proteoforms and the discovery of new proteoform biomarkers.
2019-06-2
Development and application of software and algorithms for network approaches to proteomics data analysis
The cells making up all living organisms integrate external and internal signals to carry out the functions of life. Dysregulation of signaling can lead to a variety of grave diseases, including cancer [Slamon et al., 1987]. In order to understand signal transduction, one has to identify and characterize the main constituents of cellular signaling cascades. Proteins are involved in most cellular processes and form the major class of biomolecules responsible for signal transduction. Post-translational modifications (PTMs) of proteins can modulate their enzymatic activity and their protein-protein interactions (PPIs), which in turn can ultimately lead to changes in protein expression. Classical biochemistry has approached the study of proteins, PTMs and interactions from a reductionist view. The abundance, stability and localization of proteins were studied one protein at a time, following the one gene-one protein-one function paradigm [Beadle and Tatum, 1941]. Pathways were considered to be linear, where signals would be transmitted from a gene to proteins, eventually resulting in a specific phenotype. Establishing the crucial link between genotype and phenotype remains challenging despite great advances in omics technologies, such as liquid chromatography (LC)-mass spectrometry (MS), that allow for the system-wide interrogation of proteins.
Systems and network biology [Barabási and Oltvai, 2004, Bensimon et al., 2012, Jørgensen and Locard-Paulet, 2012, Choudhary and Mann, 2010] aims to transform modern biology by utilizing omics technologies to understand and uncover the various complex networks that govern the cell. The first detected large-scale biological networks have been found to be highly structured and non-random [Albert and Barabási, 2002]. Furthermore, these are assembled from functional and topological modules. The smallest topological modules are formed by the direct physical interactions within protein-protein and protein-RNA complexes. These molecular machines are able to perform a diverse array of cellular functions, such as transcription and degradation [Alberts, 1998]. Members of functional modules are not required to have a direct physical interaction. Instead, such modules also include proteins with temporal co-regulation throughout the cell cycle [Olsen et al., 2010], or following the circadian day-night rhythm [Robles et al., 2014]. The signaling pathways that make up the cellular network [Jordan et al., 2000] are assembled from a hierarchy of these smaller modules [Barabási and Oltvai, 2004]. The regulation of these modules through dynamic rewiring enables the cell to respond to internal and external stimuli. The main challenge in network biology is to develop techniques to probe the topology of various biological networks, to identify topological and functional modules, and to understand their assembly and dynamic rewiring. LC-MS has become a powerful experimental platform that addresses all these challenges directly [Bensimon et al., 2012], and has long been used to study a wide range of biomolecules that participate in the cellular network.
The field of proteomics in particular, which is concerned with the identification and characterization of the proteins in the cell, has been revolutionized by recent technological advances in MS. Proteomics experiments are used not only to quantify peptides and proteins, but also to uncover the edges of the cellular network, by screening for physical PPIs in a global [Hein et al., 2015] or condition-specific manner [Kloet et al., 2016]. Crucial for the interpretation of the large-scale data generated by MS experiments is the development of software tools that aid researchers in translating raw measurements into biological insights. The MaxQuant and Perseus platforms were designed for this exact purpose.
The aim of this thesis was to develop software tools for the analysis of MS-based proteomics data with a focus on network biology and to apply the developed tools to study cellular signaling. The first step was the extension of the Perseus software with network data structures and activities. The new network module allows for the side-by-side analysis of matrices and networks inside an interactive workflow and is described in article 1. We subsequently applied the newly developed software to study the circadian phosphoproteome of cortical synapses (see article 2). In parallel, we aimed to improve the analysis of large datasets by adapting the previously Windows-only MaxQuant software to the Linux operating system, which is more prevalent in high-performance computing environments (see article 3).