
    Integrative Analysis Frameworks for Improved Peptide and Protein Identifications from Tandem Mass Spectrometry Data.

    Tandem mass spectrometry (MS/MS) followed by database search is the method of choice for high-throughput protein identification in modern proteomic studies. Database searching methods employ spectral matching algorithms and statistical models to identify and quantify proteins in a sample. The major focus of these statistical methods is to assign probability scores to the identifications, distinguishing between high-confidence, reliable identifications that may be accepted (typically corresponding to a false discovery rate, FDR, of 1% or 5%) and lower-confidence, spurious identifications that are rejected. These identification probabilities are, in general, determined considering only evidence from the MS/MS data. However, considering the wealth of external (orthogonal) data available for most biological systems, integrating such orthogonal information into proteomics analysis pipelines is a promising approach to improve their sensitivity and rescue true positive identifications that were rejected for want of sufficient evidence supporting their presence. In this dissertation, approaches based on naive Bayes rescoring, search space restriction, and a hybrid approach that combines both are described for integrating orthogonal information into proteomic analysis pipelines. These methods have been applied to integrate transcript abundance data from RNA-seq and identification frequency data from the Global Proteome Machine database, GPMDB (one of the largest repositories of proteomic experiment results), into analysis pipelines, increasing the number of peptide and protein identifications from MS/MS data. Further, estimation of false discovery rates in very large proteomic datasets was also investigated. In very large datasets, usually resulting from integrating data from multiple experiments, some assumptions used in typical target-decoy based FDR estimation for smaller datasets no longer hold, resulting in artificially inflated error rates. Alternative approaches that allow accurate FDR estimation in these large-scale datasets have been described and benchmarked.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/116717/1/avinashs_1.pd
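
    The abstract names naive Bayes rescoring but not its exact form; as a minimal sketch, assuming conditional independence of the orthogonal features and using invented RNA-seq and GPMDB bins with made-up likelihood tables, a peptide-spectrum match posterior from the search engine could be updated as follows (Python):

        def rescore_psm(p_msms, orthogonal_features, likelihood_tables):
            """Fold orthogonal evidence into an MS/MS-only posterior under a
            naive Bayes (conditional independence) assumption, working in odds
            space so independent likelihood ratios simply multiply."""
            odds = p_msms / max(1.0 - p_msms, 1e-12)
            for feature, value in orthogonal_features.items():
                p_if_correct, p_if_incorrect = likelihood_tables[feature][value]
                odds *= p_if_correct / max(p_if_incorrect, 1e-12)
            return odds / (1.0 + odds)

        # Hypothetical borderline PSM whose gene is highly expressed in matched
        # RNA-seq data and frequently observed in GPMDB.
        tables = {"rnaseq": {"high": (0.6, 0.2), "low": (0.4, 0.8)},
                  "gpmdb": {"frequent": (0.7, 0.3), "rare": (0.3, 0.7)}}
        print(rescore_psm(0.80, {"rnaseq": "high", "gpmdb": "frequent"}, tables))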

    Anatomy and evolution of database search engines — a central component of mass spectrometry based proteomic workflows

    Sequence database search engines are bioinformatics algorithms that identify peptides from tandem mass spectra using a reference protein sequence database. Two decades of development, notably driven by advances in mass spectrometry, have provided scientists with more than 30 published search engines, each with its own properties. In this review, we present the common paradigm behind the different implementations, and its limitations for modern mass spectrometry datasets. We also detail how the search engines attempt to alleviate these limitations, and provide an overview of the different software frameworks available to the researcher. Finally, we highlight alternative approaches for the identification of proteomic mass spectrometry datasets, either as a replacement for, or as a complement to, sequence database search engines.
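
    The review describes the shared paradigm in prose; as a rough, generic illustration (not any particular engine's scoring function), candidate peptides from the reference database are fragmented in silico and compared with observed peaks within a mass tolerance, for instance with a simple shared-peak count (Python):

        # Standard monoisotopic residue masses (Da) and a toy shared-peak-count score.
        MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
                "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
                "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
                "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
                "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
        PROTON, WATER = 1.00728, 18.01056

        def theoretical_by_ions(peptide):
            """Singly charged b- and y-ion m/z values for an unmodified peptide."""
            masses = [MONO[aa] for aa in peptide]
            ions = []
            for i in range(1, len(peptide)):
                ions.append(sum(masses[:i]) + PROTON)          # b_i
                ions.append(sum(masses[i:]) + WATER + PROTON)  # y_(n-i)
            return ions

        def shared_peak_count(spectrum_mz, peptide, tol=0.02):
            """Count theoretical fragments matched by an observed peak within tol."""
            return sum(any(abs(mz - t) <= tol for mz in spectrum_mz)
                       for t in theoretical_by_ions(peptide))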

    Developing a bioinformatics framework for proteogenomics

    In the last 15 years, since the human genome was first sequenced, genome sequencing and annotation have continued to improve. However, genome annotation has not kept up with the accelerating rate of genome sequencing, and as a result there is now a large backlog of genomic data waiting to be interpreted both quickly and accurately. Advances in proteomics have given rise to a new field, termed proteogenomics, which uses peptide mass spectrometry data to help improve genome annotation, enabling the discovery of novel protein-coding genes as well as the refinement and validation of known and putative protein-coding genes. The annotation of genomes relies heavily on ab initio gene prediction programs and/or mapping of a range of RNA transcripts. Although these approaches provide insights into the gene content of genomes, they cannot distinguish protein-coding genes from putative non-coding RNA genes. This problem is further compounded by the fact that only 5% of the public protein sequence repository at UniProt/SwissProt has been curated and derived from actual protein evidence. This thesis contends that it is critically important to incorporate proteomics data into genome annotation pipelines to provide experimental protein-coding evidence. Although there have been major improvements in proteogenomics over the last decade, there are still numerous challenges to overcome. These key challenges include the loss of sensitivity when using inflated search spaces of putative sequences, how best to interpret novel identifications, and how best to control for false discoveries. This thesis addresses the existing gap between the use of genomic and proteomic sources for accurate genome annotation by applying a proteogenomics approach with a customised methodology. This new approach was applied in four case studies: a prokaryote (bacterium); a monocotyledonous plant (wheat); a dicotyledonous plant (grape); and human. The key contributions of this thesis are: a new methodology for proteogenomics analysis; 145 suggested gene refinements in Bradyrhizobium diazoefficiens (a nitrogen-fixing bacterium); 55 new gene predictions (57 protein isoforms) in Vitis vinifera (grape); 49 new gene predictions (52 protein isoforms) in Homo sapiens (human); and 67 new gene predictions (70 protein isoforms) in Triticum aestivum (bread wheat). Lastly, a number of possible improvements for the studies conducted in this thesis, and for proteogenomics as a whole, have been identified and discussed.
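
    The customised proteogenomics methodology itself is not reproduced here; as a sketch of how the inflated search spaces mentioned above typically arise, a database of putative protein sequences can be generated by six-frame translation of the genome (Python, assuming Biopython is available):

        from Bio.Seq import Seq  # Biopython, assumed available

        def six_frame_orfs(dna, min_len=30):
            """Translate a nucleotide sequence in all six frames and return
            stop-to-stop putative ORFs of at least min_len residues, the kind
            of expanded search space a proteogenomic database may contain."""
            seq = Seq(dna)
            orfs = []
            for strand in (seq, seq.reverse_complement()):
                for frame in range(3):
                    usable = (len(strand) - frame) // 3 * 3
                    protein = str(strand[frame:frame + usable].translate())
                    orfs.extend(p for p in protein.split("*") if len(p) >= min_len)
            return orfs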

    Development and Integration of Informatic Tools for Qualitative and Quantitative Characterization of Proteomic Datasets Generated by Tandem Mass Spectrometry

    Shotgun proteomic experiments provide qualitative and quantitative analytical information from biological samples ranging in complexity from simple bacterial isolates to higher eukaryotes such as plants and humans, and even to communities of microbial organisms. Improvements to instrument performance, sample preparation, and informatic tools are increasing the scope and volume of data that can be analyzed by mass spectrometry (MS). To accommodate these advances, it is becoming increasingly essential to choose and/or create tools that not only scale well but also make more informed decisions using additional features within the data. Incorporating novel and existing tools into a scalable, modular workflow not only provides more accurate, contextualized perspectives of processed data, but also generates detailed, standardized outputs that can be used for future studies dedicated to mining general analytical or biological features, anomalies, and trends. This research developed cyber-infrastructure that allows a user to seamlessly run multiple analyses, store the results, and share processed data with other users. The work represented in this dissertation demonstrates the successful implementation of an enhanced bioinformatics workflow designed to analyze raw data directly generated by MS instruments and to create fully annotated reports of qualitative and quantitative protein information for large-scale proteomics experiments. Answering biological questions at this scale requires several points of engagement between informatics and an analytical understanding of the underlying biochemistry of the system under observation; deriving meaningful information from analytical data is achieved by linking together the answers to more focused, logistical questions. This study focuses on the following aspects of proteomics experiments: spectrum-to-peptide matching, peptide-to-protein mapping, and protein quantification and differential expression. The interaction and usability of these analyses and other existing tools are also described. The constructed workflow allows high-throughput processing of massive datasets, so that data collected within the past decade can be standardized and updated with the most recent analyses.
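
    The abstract does not name the quantification metric used; as one common spectral-counting measure for the protein quantification step, the normalised spectral abundance factor (NSAF) can be sketched as follows, with protein names, counts, and lengths invented for illustration (Python):

        def nsaf(spectral_counts, protein_lengths):
            """Normalised spectral abundance factor: spectral count divided by
            protein length, normalised so values sum to 1 within a run."""
            saf = {prot: spectral_counts[prot] / protein_lengths[prot]
                   for prot in spectral_counts}
            total = sum(saf.values())
            return {prot: value / total for prot, value in saf.items()}

        # Hypothetical run: spectral counts and lengths (residues) for three proteins.
        print(nsaf({"P1": 120, "P2": 30, "P3": 30},
                   {"P1": 600, "P2": 150, "P3": 300}))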

    Computational Framework for Data-Independent Acquisition Proteomics.

    Mass spectrometry (MS) is one of the main techniques for high-throughput discovery- and targeted-based proteomics experiments. The most popular method for MS data acquisition has been the data-dependent acquisition (DDA) strategy, which primarily selects high-abundance peptides for MS/MS sequencing. DDA incorporates stochastic precursor selection to avoid repeatedly sequencing the same peptide, resulting in relatively irreproducible results for low-abundance peptides between experiments. Data-independent acquisition (DIA), in which peptide fragment signals are systematically acquired, is emerging as a promising alternative to address DDA's stochasticity. DIA results in more complex signals, posing computational challenges for complex samples and high-throughput analysis. As a result, targeted extraction, which requires pre-existing spectral libraries, has been the most commonly used approach for automated DIA data analysis. However, building spectral libraries requires additional analysis time and sample material, which are major barriers for most research groups. In my dissertation, I develop a computational tool called DIA-Umpire, which includes computational and signal processing algorithms to enable untargeted DIA identification and quantification analysis without any prior spectral library. In the first study, a signal feature detection algorithm is developed to extract and assemble peptide precursor and fragment signals into pseudo-MS/MS spectra that can be analyzed by existing DDA untargeted analysis tools. This novel step enables direct and untargeted (spectral library-free) DIA identification analysis, and its performance is shown using complex samples, including human cell lysate and glycoproteomics datasets. In the second study, a hybrid approach is developed to further improve DIA quantification sensitivity and reproducibility. The performance of the DIA-Umpire quantification approach is demonstrated using an affinity-purification mass spectrometry experiment for protein-protein interaction analysis. Lastly, in the third study, I improve the DIA-Umpire pipeline for data obtained from the Orbitrap family of mass spectrometers. Using public datasets, I show that the improved version of DIA-Umpire is capable of highly sensitive, untargeted analysis of DIA data generated on these instruments. The dissertation work addresses these barriers to DIA analysis and should facilitate the adoption of the DIA strategy for a broad range of discovery proteomics applications.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120699/1/tsouc_1.pd
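
    DIA-Umpire's actual feature detection and grouping (peak-curve detection, apex alignment, charge-state deconvolution) is considerably more elaborate than can be shown here; a correlation-only caricature of the precursor-fragment grouping step, assuming precursor and fragment extracted-ion chromatograms sampled on the same retention-time grid, might look like this (Python):

        import numpy as np

        def pseudo_msms(precursor_xic, fragment_xics, min_corr=0.8):
            """Assemble a pseudo-MS/MS peak list by keeping DIA fragment signals
            whose extracted-ion chromatograms co-elute (correlate) with a
            detected precursor feature."""
            kept = []
            for fragment_mz, xic in fragment_xics.items():
                if np.corrcoef(precursor_xic, xic)[0, 1] >= min_corr:
                    kept.append(fragment_mz)
            return sorted(kept)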

    Homology-Based Functional Proteomics By Mass Spectrometry and Advanced Informatic Methods

    Functional characterization of biochemically isolated proteins is a central task in the biochemical and genetic description of the biology of cells and tissues. Protein identification by mass spectrometry consists of associating an isolated protein with a specific gene or protein sequence in silico, thus inferring its specific biochemical function based upon previous characterizations of that protein or of a similar protein having that sequence identity. By performing this analysis on a large scale in conjunction with biochemical experiments, novel biological knowledge can be developed. The study presented here focuses on mass spectrometry-based proteomics of organisms with unsequenced genomes and corresponding developments in searching biological sequence databases with mass spectrometry data. Conventional methods to identify proteins by mass spectrometry have employed proteolytic digestion, fragmentation of the resultant peptides, and the correlation of acquired tandem mass spectra with database sequences, relying upon exact matching algorithms; i.e., the analyzed peptide had to already exist in a database in silico to be identified. One existing sequence-similarity protein identification method (MS BLAST, Shevchenko 2001) was applied and one novel method (MultiTag) was developed for searching protein and EST databases, enabling the recognition of proteins that are generally unrecognizable by conventional software but share significant sequence similarity (~60-90%) with database entries. These techniques and the available database sequences enabled the characterization of the Xenopus laevis microtubule-associated proteome and the Dunaliella salina soluble salt-induced proteome, both organisms with unsequenced genomes and minimal database sequence resources. These sequence-similarity methods extended protein identification capabilities by more than twofold compared to conventional methods, making existing methods virtually superfluous. The proteomics of Dunaliella salina demonstrated the utility of MS BLAST as an indispensable method for characterization of proteins in organisms with unsequenced genomes, and produced insight into Dunaliella's inherent resilience to high salinity. The Xenopus study was the first proteomics project to simultaneously use all three central methods of representing peptide tandem mass spectra for protein identification: sequence tags, amino acid sequences, and mass lists; it is also the largest proteomics study in Xenopus laevis yet completed, and it indicated a potential relationship between the mitotic spindle of dividing cells and the protein synthesis machinery. At the beginning of these experiments, protein identification was conceptualized as "conventional" versus "sequence-similarity" techniques; over the course of the experiments, however, a conceptual shift occurred, and the techniques developed and employed were extended to encompass variations in mass spectrometry instrumentation, alternative forms of mass spectrum representation, and the complexities of database resources, producing a more systematic description and utilization of the available resources for characterizing proteomes by mass spectrometry and advanced informatic approaches. The experiments demonstrated that proteomics technologies are only as powerful in the field of biology as the biochemical experiments are precise and meaningful.
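
    Neither MS BLAST nor MultiTag is reproduced here; the sketch below only illustrates the underlying idea of sequence-similarity searching, scoring de novo peptide candidates against database entries by local alignment rather than exact matching, using Biopython and assuming sequences restricted to the twenty standard amino acids (Python):

        from Bio.Align import PairwiseAligner, substitution_matrices  # Biopython, assumed available

        def similarity_hits(denovo_peptides, database_proteins, min_score=50):
            """Score database proteins against error-tolerant de novo peptide
            candidates by local alignment instead of exact string matching."""
            aligner = PairwiseAligner()
            aligner.mode = "local"
            aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
            aligner.open_gap_score = -11
            aligner.extend_gap_score = -1
            hits = []
            for accession, protein in database_proteins.items():
                score = sum(aligner.score(protein, pep) for pep in denovo_peptides)
                if score >= min_score:
                    hits.append((accession, score))
            return sorted(hits, key=lambda hit: -hit[1])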

    New Statistical Algorithms for the Analysis of Mass Spectrometry Time-Of-Flight Mass Data with Applications in Clinical Diagnostics

    Mass spectrometry (MS) based techniques have emerged as a standard for large-scale protein analysis. Ongoing progress in terms of more sensitive instruments and improved data analysis algorithms has led to a constant expansion of its fields of application. Recently, MS was introduced into clinical proteomics with the prospect of early disease detection using proteomic pattern matching. Analyzing biological samples (e.g. blood) by mass spectrometry generates mass spectra that represent the components (molecules) contained in a sample as masses and their respective relative concentrations. In this work, we are interested in those components that are constant within a group of individuals but differ markedly between individuals of two distinct groups. These distinguishing components, which depend on a particular medical condition, are generally called biomarkers. Since not all biomarkers found by the algorithms are of equal (discriminating) quality, we are only interested in a small biomarker subset that, as a combination, can be used as a fingerprint for a disease. Once a fingerprint for a particular disease (or medical condition) is identified, it can be used in clinical diagnostics to classify unknown spectra. In this thesis we have developed new algorithms for the automatic extraction of disease-specific fingerprints from mass spectrometry data. Special emphasis has been put on designing highly sensitive methods with respect to signal detection. Thanks to our statistically based approach, our methods are able to detect even signals, such as those of hormones, that lie below the noise level inherent in data acquired by common MS machines. To provide access to these new classes of algorithms to collaborating groups, we have created a web-based analysis platform that provides all necessary interfaces for data transfer, data analysis and result inspection. To prove the platform's practical relevance, it has been utilized in several clinical studies, two of which are presented in this thesis. In these studies it could be shown that our platform is superior to commercial systems with respect to fingerprint identification. As an outcome of these studies, several fingerprints for different cancer types (bladder, kidney, testicle, pancreas, colon and thyroid) have been detected and validated. The clinical partners in fact emphasize that these results would have been impossible with a less sensitive analysis tool (such as the currently available systems). In addition to the issue of reliably finding and handling signals in noise, we faced the problem of handling very large amounts of data, since an average dataset for an individual is about 2.5 gigabytes in size and we have data from hundreds to thousands of persons. To cope with these large datasets, we developed a new framework for a heterogeneous (quasi) ad-hoc Grid: an infrastructure that allows the integration of thousands of computing resources (e.g. desktop computers, computing clusters, or specialized hardware such as IBM's Cell processor in a PlayStation 3).
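
    The thesis's statistical algorithms are more sensitive than anything shown here; as a crude univariate stand-in for selecting discriminating masses, aligned peak intensities from two groups of spectra (hypothetical arrays with rows as spectra and columns as m/z features) could be ranked with Welch's t-test (Python):

        from scipy import stats

        def rank_discriminating_masses(case_intensities, control_intensities,
                                       mz_values, top_k=10):
            """Rank m/z features by how well their intensities separate two
            groups of spectra (smaller p-value = more discriminating)."""
            ranked = []
            for j, mz in enumerate(mz_values):
                _, p_value = stats.ttest_ind(case_intensities[:, j],
                                             control_intensities[:, j],
                                             equal_var=False)
                ranked.append((p_value, mz))
            return [mz for p_value, mz in sorted(ranked)[:top_k]]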

    Application of Pool-seq for variation detection and proteogenomic database creation in β-hemolytic streptococci.

    Proteogenomics is an emerging field that combines genomic (transcriptomic) and proteomic data with the aim of improving gene models and the identification of proteins. Technological advances in each domain increase the field's potential to foster further understanding of organisms. For instance, current low-cost, fast sequencing technologies have made it possible to sequence multiple representative samples of organisms, thus improving the comprehensiveness of the organisms' reference proteomes. At the same time, improvements in mass spectrometry techniques have led to an increase in the quality and quantity of proteomics data produced, which are utilized to update the annotation of coding sequences in genomes. Sequencing of pooled individual DNAs (Pool-seq) is one method for sequencing large numbers of samples cost effectively. It is a robust method that can accurately identify variations that exist between samples. Similar to other proteogenomics methods, such as sample-specific databases derived from RNA-seq data, the variants from Pool-seq experiments can be utilized to create variant protein databases and improve the completeness of protein reference databases used in mass spectrometry (MS)-based proteomics analysis. In this thesis work, the efficiency of Pool-seq in identifying variants and estimating allele frequencies from strains of three β-hemolytic bacteria (GAS, GGS and GBS) is investigated. Moreover, a novel Python package (‘PoolSeqProGen’) for creating variant protein databases from Pool-seq experiments was developed. To our knowledge, this was the first work to use Pool-seq for sequencing large numbers of β-hemolytic bacteria and to assess its efficiency on such genetically polymorphic bacteria. The ‘PoolSeqProGen’ tool is also the first and only tool available to create proteogenomic databases from Pool-seq data. For organisms such as the β-hemolytic bacteria GAS, GBS and GGS that have open pangenomes, the sequencing and annotation of multiple representative strains is paramount in advancing our understanding of these human pathogens and in developing mass spectrometry databases. Given the increasing use of MS in the diagnostics of infectious diseases, this in turn translates to better diagnosis and treatment of the diseases caused by these pathogens and to alleviating their devastating burden on the human population. In this thesis, it is demonstrated that Pool-seq can be used to cost-effectively and accurately identify variations that exist among strains of these polymorphic bacteria. In addition, the utility of the developed tool for extending single-genome-based databases, and thereby improving database completeness and peptide/protein identification using variants identified from Pool-seq experiments, is illustrated.
    Proteogenomics is an emerging field that combines genomics and proteomics to improve gene models and identify proteins. Technical progress in both fields increases the potential of this combined discipline for understanding the functions of different organisms. For example, today's inexpensive and fast sequencing technologies have made comprehensive sequencing of many different organisms possible, which naturally also improves the comprehensiveness of their reference proteomes. At the same time, advances in mass spectrometry have improved the quality and depth of proteomics analyses, enabling the validation of predicted sequence regions (e.g. new genes). Sequencing of pooled individual DNA samples (Pool-seq) makes it possible to sequence large numbers of samples very cost-effectively. It is a reliable method for accurately identifying variation between samples. Variants from Pool-seq experiments can be used to create variant protein databases and to improve the coverage of mass spectrometry-based protein databases. In this thesis, the efficiency of Pool-seq in identifying variants and estimating allele frequencies was investigated in strains of three β-hemolytic streptococcal bacteria (GAS, GGS and GBS). In addition, a new software package written in the Python programming language (‘PoolSeqProGen’) was developed for creating variant protein databases from Pool-seq experiments. This is the first study in which Pool-seq was used to sequence a large number of streptococci and to assess the method's efficiency in genetically polymorphic bacteria. The ‘PoolSeqProGen’ tool is also the first and only available tool for creating proteogenomic databases from Pool-seq data. When developing mass spectrometry databases for organisms with open pangenomes, such as the β-hemolytic streptococci GAS, GBS and GGS, sequencing and annotating multiple representative strains is of paramount importance. The increasing use of mass spectrometry in the diagnosis of infectious diseases improves the diagnosis of the diseases these microbes cause and thus also enables better targeting of treatment. This thesis shows that Pool-seq can be used to identify variation between polymorphic bacterial strains cost-effectively and accurately. Furthermore, the utility of the tool developed for extending single-genome-based databases is illustrated: it can improve database coverage and peptide and protein identification by using variants identified in Pool-seq experiments.
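
    PoolSeqProGen's actual implementation and FASTA naming conventions are not reproduced here; as a minimal illustration of the core idea, turning one Pool-seq single-nucleotide variant into an extra protein database entry, assuming Biopython and a hypothetical header scheme (Python):

        from Bio.Seq import Seq  # Biopython, assumed available

        def variant_protein_entry(gene_id, cds, pos, alt_base):
            """Apply a single-nucleotide variant (0-based position within the
            coding sequence) and return a FASTA-style variant protein entry,
            or None if the substitution is synonymous."""
            reference = str(Seq(cds).translate(to_stop=True))
            variant_cds = cds[:pos] + alt_base + cds[pos + 1:]
            variant = str(Seq(variant_cds).translate(to_stop=True))
            if variant == reference:
                return None  # synonymous change, nothing new for the database
            header = ">%s_snv%d%s" % (gene_id, pos + 1, alt_base)  # hypothetical naming
            return header + "\n" + variant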