3,370 research outputs found

    DART-ID increases single-cell proteome coverage.

    Analysis by liquid chromatography and tandem mass spectrometry (LC-MS/MS) can identify and quantify thousands of proteins in microgram-level samples, such as those comprised of thousands of cells. This process, however, remains challenging for smaller samples, such as the proteomes of single mammalian cells, because reduced protein levels reduce the number of confidently sequenced peptides. To alleviate this reduction, we developed Data-driven Alignment of Retention Times for IDentification (DART-ID). DART-ID implements principled Bayesian frameworks for global retention time (RT) alignment and for incorporating RT estimates towards improved confidence estimates of peptide-spectrum-matches. When applied to bulk or to single-cell samples, DART-ID increased the number of data points by 30-50% at 1% FDR, and thus decreased missing data. Benchmarks indicate excellent quantification of peptides upgraded by DART-ID and support their utility for quantitative analysis, such as identifying cell types and cell-type specific proteins. The additional datapoints provided by DART-ID boost the statistical power and double the number of proteins identified as differentially abundant in monocytes and T-cells. DART-ID can be applied to diverse experimental designs and is freely available at http://dart-id.slavovlab.net
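As a rough illustration of how RT evidence can be folded into a PSM confidence estimate, the sketch below applies Bayes' theorem to a spectral posterior error probability (PEP). It is not the DART-ID implementation: the function names, the normal densities used for correct and incorrect matches, and all parameter values are assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def update_pep(pep, rt_observed, rt_expected, sd_correct, rt_null_mean, sd_null):
    """Combine a spectral PEP with retention-time evidence via Bayes' theorem.

    pep          : prior probability that the PSM is incorrect (from the spectrum alone)
    rt_expected  : aligned reference RT of the peptide (assumed known)
    sd_correct   : assumed RT spread for correct matches
    rt_null_mean, sd_null : assumed broad RT distribution for incorrect matches
    """
    like_correct = normal_pdf(rt_observed, rt_expected, sd_correct)
    like_incorrect = normal_pdf(rt_observed, rt_null_mean, sd_null)
    numerator = pep * like_incorrect
    denominator = numerator + (1.0 - pep) * like_correct
    if denominator == 0.0:
        # Both likelihoods underflowed; fall back to the spectral PEP.
        return pep
    return numerator / denominator


# A PSM observed close to its expected RT becomes more confident,
# while one observed far from it becomes less confident.
near = update_pep(0.2, 30.0, 30.1, 0.5, 40.0, 15.0)
far = update_pep(0.2, 35.0, 30.1, 0.5, 40.0, 15.0)
```

Under these assumptions, a PSM whose observed RT agrees with the aligned reference gets a lower error probability, which is the mechanism by which previously sub-threshold PSMs can pass a fixed FDR cutoff.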

    Integrative Analysis Frameworks for Improved Peptide and Protein Identifications from Tandem Mass Spectrometry Data.

    Tandem mass spectrometry (MS/MS) followed by database search is the method of choice for high-throughput protein identification in modern proteomic studies. Database searching methods employ spectral matching algorithms and statistical models to identify and quantify proteins in a sample. The major focus of these statistical methods is to assign probability scores to the identifications to distinguish between high-confidence, reliable identifications that may be accepted (typically corresponding to a false discovery rate, FDR, of 1% or 5%) and lower-confidence, spurious identifications that are rejected. These identification probabilities are determined, in general, considering only evidence from the MS/MS data. However, considering the wealth of external (orthogonal) data available for most biological systems, integrating such orthogonal information into proteomics analysis pipelines can be a promising approach to improve the sensitivity of these analysis pipelines and rescue true positive identifications that were rejected for want of sufficient evidence supporting their presence. In this dissertation, approaches based on naive Bayes rescoring, search space restriction, and a hybrid approach that combines both are described for integrating orthogonal information in proteomic analysis pipelines. These methods have been applied for integrating transcript abundance data from RNA-seq and identification frequency data from the Global Proteome Machine database, GPMDB (one of the largest repositories of proteomic experiment results), into analysis pipelines, improving the number of peptide and protein identifications from MS/MS data. Further, estimation of false discovery rates in very large proteomic datasets was also investigated.
    In very large datasets, usually resulting from integrating data from multiple experiments, some assumptions used in typical target-decoy based FDR estimation in smaller datasets no longer hold true, resulting in artificially inflated error rates. Alternative approaches that would allow accurate FDR estimation in these large-scale datasets have been described and benchmarked.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/116717/1/avinashs_1.pd

    Improved results in proteomics by use of local and peptide-class specific false discovery rates

    Background: Proteomic protein identification results need to be compared across laboratories and platforms, and thus a reliable method is needed to estimate false discovery rates. The target-decoy strategy is platform-independent and thus a prime candidate for standardized reporting of data. In its current usage based on global population parameters, the method does not utilize individual peptide scores optimally. Results: Here we show that proteomic analyses benefit greatly from separate treatment of peptides that match proteins alone or in groups, based on locally estimated false discovery rates. Our implementation reduces the number of false positives and simultaneously increases the number of proteins identified. Importantly, single peptide identifications achieve defined confidence and the sequence coverage of proteins is optimized. As a result, we improve the number of proteins identified in a human serum analysis by 58% without compromising identification confidence. Conclusion: We show that proteins can reliably be identified with a single peptide and the sequence coverage for multi-peptide proteins can be increased when using an improved estimation of false discovery rates.
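To make the target-decoy idea concrete, the sketch below estimates a score cutoff from decoy counts and applies it separately to each peptide class. It is a simplified stand-in, not the authors' implementation: it uses a class-specific global estimate (decoys divided by targets at or above a cutoff) and does not compute local FDR within a score window. The function names and the `(score, is_decoy, peptide_class)` tuple layout are assumptions.

```python
from collections import defaultdict
from itertools import groupby


def score_cutoff(psms, max_fdr):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: iterable of (score, is_decoy) pairs; higher scores are better.
    The FDR at a cutoff is estimated as decoys / targets at or above it.
    Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    # Tied scores are counted together so the estimate matches the cutoff.
    for score, tied in groupby(ranked, key=lambda p: p[0]):
        for _, is_decoy in tied:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


def accept_by_class(psms, max_fdr):
    """Apply a separate target-decoy cutoff to each peptide class.

    psms: iterable of (score, is_decoy, peptide_class) triples.
    Returns {peptide_class: [accepted target scores]}.
    """
    by_class = defaultdict(list)
    for score, is_decoy, peptide_class in psms:
        by_class[peptide_class].append((score, is_decoy))
    accepted = {}
    for peptide_class, items in by_class.items():
        cutoff = score_cutoff(items, max_fdr)
        accepted[peptide_class] = (
            [] if cutoff is None
            else [s for s, d in items if not d and s >= cutoff]
        )
    return accepted


# Toy data: class "a" has few decoys among high scores, class "b" has many.
example = [
    (10, False, "a"), (9, False, "a"), (8, True, "a"), (7, False, "a"),
    (10, False, "b"), (9, True, "b"), (8, True, "b"),
]
result = accept_by_class(example, max_fdr=0.5)
```

Because each class gets its own cutoff, a class with a favourable score distribution is not penalised by decoys that accumulate in another class, which is the intuition behind treating peptide classes separately.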

    Reliable identification of protein-protein interactions by crosslinking mass spectrometry

    Protein-protein interactions govern most cellular pathways and processes, and multiple technologies have emerged to systematically map them. Assessing the error of interaction networks has been a challenge. Crosslinking mass spectrometry is currently widening its scope from structural analyses of purified multi-protein complexes towards systems-wide analyses of protein-protein interactions (PPIs). Using a carefully controlled large-scale analysis of Escherichia coli cell lysate, we demonstrate that false-discovery rates (FDR) for PPIs identified by crosslinking mass spectrometry can be reliably estimated. We present an interaction network comprising 590 PPIs at 1% decoy-based PPI-FDR. The structural information included in this network localises the binding site of the hitherto uncharacterised protein YacL to near the DNA exit tunnel on the RNA polymerase.
    Funding: TU Berlin, Open-Access-Mittel – 2021; DFG, 390540038, EXC 2008: Unifying Systems in Catalysis "UniSysCat"; DFG, 392923329, GRK 2473: Bioaktive Peptide - Innovative Aspekte zur Synthese und Biosynthese; DFG, 426290502, Erfassung der strukturellen Organisation des Mycoplasma pneumoniae Proteoms mittels in-Zell Crosslinking-Massenspektrometri

    Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies

    Proteogenomics has the potential to advance genome annotation through high-quality peptide identifications derived from mass spectrometry experiments, which demonstrate that a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structure that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. This error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exist five “incorrect” targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported.
    These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives.
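The database inflation described above can be seen in a minimal six-frame translation sketch: every nucleotide sequence yields six candidate protein sequences, of which at most one is expected to be the true coding frame. The function names are illustrative assumptions, the standard genetic code is used, and unrecognised codons are written as `X`.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, frame):
    """Translate seq starting at offset frame (0, 1, or 2); '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def six_frame_translate(seq):
    """Return the three forward-frame then three reverse-frame translations."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    return [translate(seq, f) for f in range(3)] + [translate(rev, f) for f in range(3)]


frames = six_frame_translate("ATGGCCTAA")
```

Only the first frame of this toy sequence starts with a methionine and ends at a stop codon; the other five entries are the kind of noncoding targets that enlarge the search space without adding true proteins.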