106 research outputs found

    Proteomic Parsimony through Bipartite Graph Analysis Improves Accuracy and Transparency

    Assembling peptides identified from LC−MS/MS spectra into a list of proteins is a critical step in analyzing shotgun proteomics data. As one peptide sequence can be mapped to multiple proteins in a database, naïve protein assembly can substantially overstate the number of proteins found in samples. We model the peptide−protein relationships in a bipartite graph and use efficient graph algorithms to identify protein clusters with shared peptides and to derive the minimal list of proteins. We test the effects of this parsimony analysis approach using MS/MS data sets generated from a defined human protein mixture, a yeast whole cell extract, and a human serum proteome after MARS column depletion. The results demonstrate that the bipartite parsimony technique not only simplifies protein lists but also improves the accuracy of protein identification. We use bipartite graphs for the visualization of the protein assembly results to render the parsimony analysis process transparent to users. Our approach also groups functionally related proteins together and improves the comprehensibility of the results. We have implemented the tool in the IDPicker package. The source code and binaries for this protein assembly pipeline are available under the Mozilla Public License at the following URL: http://www.mc.vanderbilt.edu/msrc/bioinformatics/. Keywords: parsimony analysis • bipartite graph • shotgun proteomics • LC−MS/MS • protein assembly
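
    The parsimony idea in this abstract can be illustrated with a minimal sketch. It is not the IDPicker implementation: it assumes a greedy set-cover heuristic, which is only an approximation of a minimal protein list, and the function name `parsimonious_proteins` and the toy protein and peptide labels are hypothetical.

```python
from collections import defaultdict


def parsimonious_proteins(protein_to_peptides):
    """Return a short list of protein groups that explains every peptide.

    Assumption: greedy set cover stands in for the graph algorithm,
    which the abstract does not specify.
    """
    # Proteins with identical peptide sets cannot be told apart by the
    # evidence, so collapse them into one group.
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)

    uncovered = set().union(*groups)
    # Fixed iteration order so that ties are broken alphabetically.
    ordered = sorted(groups, key=lambda peps: sorted(groups[peps]))
    chosen = []
    while uncovered:
        # Take the group that explains the most still-unexplained peptides.
        best = max(ordered, key=lambda peps: len(peps & uncovered))
        chosen.append(tuple(sorted(groups[best])))
        uncovered -= best
    return chosen


example = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3"},        # fully explained by A, so it is dropped
    "C": {"p3", "p4"},
    "D": {"p3", "p4"},        # indistinguishable from C
    "E": {"p5"},
}
result = parsimonious_proteins(example)
```

    In this toy input, protein B contributes no peptide that A does not already explain, and C and D are reported as a single group because they share exactly the same peptides.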

    DBDigger: Reorganized Proteomic Database Identification That Improves Flexibility and Speed

    Database search identification algorithms, such as Sequest and Mascot, constitute powerful enablers for proteomic tandem mass spectrometry. We introduce DBDigger, an algorithm that reorganizes the database identification process to remove a problematic bottleneck. Typically such algorithms determine which candidate sequences can be compared to each spectrum. Instead, DBDigger determines which spectra can be compared to each candidate sequence, enabling the software to generate candidate sequences only once for each HPLC separation rather than for each spectrum. This reorganization also reduces the number of times a spectrum must be predicted for a particular candidate sequence and charge state. As a result, DBDigger can accelerate some database searches by more than an order of magnitude. In addition, the software offers features to reduce the performance degradation introduced by posttranslational modification (PTM) searching. DBDigger allows researchers to specify the sequence context in which each PTM is possible. In the case of CNBr digests, for example, modified methionine residues can be limited to occur only at the C-termini of peptides. Use of “context-dependent” PTM searching reduces the performance penalty relative to traditional PTM searching. We characterize the performance possible with DBDigger, showcasing MASPIC, a new statistical scorer. We describe the implementation of these innovations in the hope that other researchers will employ them for rapid and highly flexible proteomic database search.
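
    The loop inversion described in this abstract (spectra per candidate, rather than candidates per spectrum) can be sketched as follows. This is an illustration only: matching on precursor mass within a fixed tolerance, the function name `spectra_for_candidates`, and the toy masses are all assumptions, since the abstract does not describe DBDigger's data structures.

```python
from bisect import bisect_left, bisect_right


def spectra_for_candidates(spectrum_masses, candidate_masses, tolerance):
    """Map each candidate index to the spectra within +/- tolerance of it."""
    # Sort the spectra by precursor mass once; every candidate then costs
    # two binary searches instead of a scan over all spectra.
    order = sorted(range(len(spectrum_masses)), key=spectrum_masses.__getitem__)
    sorted_masses = [spectrum_masses[i] for i in order]

    matches = {}
    for c, mass in enumerate(candidate_masses):
        lo = bisect_left(sorted_masses, mass - tolerance)
        hi = bisect_right(sorted_masses, mass + tolerance)
        # Report original spectrum indices, in ascending order.
        matches[c] = sorted(order[lo:hi])
    return matches


# Hypothetical precursor masses (Da) for four spectra and three candidates.
spectra = [1000.5, 1500.2, 1000.9, 2000.0]
candidates = [1000.7, 1999.8, 3000.0]
matches = spectra_for_candidates(spectra, candidates, tolerance=0.5)
```

    Each candidate sequence is visited once and compared only against the spectra whose masses fall in its window, which is the property the abstract credits for the speedup.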

    The bis-Electrophile Diepoxybutane Cross-Links DNA to Human Histones but Does Not Result in Enhanced Mutagenesis in Recombinant Systems

    1,2-Dibromoethane and 1,3-butadiene are suspected carcinogens present in the environment and have been used widely in industry. The mutagenic properties of 1,2-dibromoethane and the 1,3-butadiene oxidation product diepoxybutane are thought to be related to the bis-electrophilic character of these chemicals. The discovery that overexpression of O6-alkylguanine alkyltransferase (AGT) enhances bis-electrophile-induced mutagenesis prompted a search for other proteins that may act by a similar mechanism. A human liver screen for nuclear proteins that cross-link with DNA in the presence of 1,2-dibromoethane identified histones H2b and H3 as candidate proteins. Treatment of isolated histones H2b and H3 with diepoxybutane resulted in DNA−protein cross-links and produced protein adducts, and DNA−histone H2b cross-links were identified (immunochemically) in Escherichia coli cells expressing histone H2b. However, heterologous expression of histone H2b in E. coli failed to enhance bis-electrophile-induced mutagenesis. These results are similar to those found with the cross-link candidate glyceraldehyde 3-phosphate dehydrogenase (GAPDH) [Loecken, E. M., and Guengerich, F. P. (2008) Chem. Res. Toxicol. 21, 453−458], but in contrast to GAPDH, histone H2b bound DNA with even higher affinity than AGT. The extent of DNA cross-linking of isolated histone H2b was similar to that of AGT, suggesting that differences in post-cross-linking events explain the difference in mutagenesis.

    DirecTag: Accurate Sequence Tags from Peptide MS/MS through Statistical Scoring

    In shotgun proteomics, tandem mass spectra of peptides are typically identified through database search algorithms such as Sequest. We have developed DirecTag, an open-source algorithm to infer partial sequence tags directly from observed fragment ions. This algorithm is unique in its implementation of three separate scoring systems to evaluate each tag on the basis of peak intensity, m/z fidelity, and complementarity. In data sets from several types of mass spectrometers, DirecTag reproducibly exceeded the accuracy and speed of InsPecT and GutenTag, two previously published algorithms for this purpose. The source code and binaries for DirecTag are available from http://fenchurch.mc.vanderbilt.edu.

    QC Metrics from CPTAC Raw LC-MS/MS Data Interpreted through Multivariate Statistics

    Shotgun proteomics experiments integrate a complex sequence of processes, any of which can introduce variability. Quality metrics computed from LC-MS/MS data have relied upon identifying MS/MS scans, but a new mode for the QuaMeter software produces metrics that are independent of identifications. Rather than evaluating each metric independently, we have created a robust multivariate statistical toolkit that accommodates the correlation structure of these metrics and allows for hierarchical relationships among data sets. The framework enables visualization and structural assessment of variability. Study 1 for the Clinical Proteomics Technology Assessment for Cancer (CPTAC), which analyzed three replicates of two common samples at each of two time points among 23 mass spectrometers in nine laboratories, provided the data to demonstrate this framework, and CPTAC Study 5 provided data from complex lysates under Standard Operating Procedures (SOPs) to complement these findings. Identification-independent quality metrics enabled the differentiation of sites and run times through robust principal components analysis and subsequent factor analysis. Dissimilarity metrics revealed outliers in performance, and a nested ANOVA model revealed the extent to which all metrics or individual metrics were impacted by mass spectrometer and run time. Study 5 data revealed that even when SOPs have been applied, instrument-dependent variability remains prominent, although it may be reduced, while within-site variability is reduced significantly. Finally, identification-independent quality metrics were shown to be predictive of identification sensitivity in these data sets. QuaMeter and the associated multivariate framework are available from http://fenchurch.mc.vanderbilt.edu and http://homepages.uc.edu/~wang2x7/, respectively.

    MASPIC: Intensity-Based Tandem Mass Spectrometry Scoring Scheme That Improves Peptide Identification at High Confidence

    Algorithmic search engines bridge the gap between large tandem mass spectrometry data sets and the identification of proteins associated with biological samples. Improvements in these tools can greatly enhance biological discovery. We present a new scoring scheme for comparing tandem mass spectra with a protein sequence database. The MASPIC (Multinomial Algorithm for Spectral Profile-based Intensity Comparison) scorer converts an experimental tandem mass spectrum into an m/z profile of probability and then scores peak lists from potential candidate peptides using a multinomial distribution model. The MASPIC scoring scheme incorporates intensity, spectral peak density variations, and m/z error distribution associated with peak matches into a multinomial distribution. The scoring scheme was validated on two standard protein mixtures and an additional set of spectra collected on a complex ribosomal protein mixture from Rhodopseudomonas palustris. The results indicate a 5−15% improvement over Sequest for high-confidence identifications. The performance gap grows as sequence database size increases. Additional tests on spectra from proteinase-K digest data showed similar performance improvements, demonstrating the advantages of using MASPIC for studying proteins digested with less specific proteases. All these investigations show MASPIC to be a versatile and reliable system for peptide tandem mass spectral identification.
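
    For readers unfamiliar with the multinomial model this abstract refers to, the sketch below computes a generic multinomial log-probability. It is not the MASPIC scorer: how MASPIC defines its categories and probabilities is not stated here, so the idea of counting predicted peaks per category, the function name `multinomial_log_prob`, and the example numbers are assumptions.

```python
import math


def multinomial_log_prob(counts, probs):
    """Log-probability of observing `counts` under category odds `probs`.

    Hypothetical reading: counts[i] is how many predicted fragment peaks
    landed in category i of the spectrum's probability profile.
    """
    n = sum(counts)
    log_p = math.lgamma(n + 1)          # log(n!)
    for k, p in zip(counts, probs):
        if k == 0:
            continue                    # contributes log(1) = 0
        if p <= 0.0:
            return float("-inf")        # impossible outcome
        log_p += k * math.log(p) - math.lgamma(k + 1)
    return log_p


# Two peaks, one in each of two equally likely categories.
log_p = multinomial_log_prob([1, 1], [0.5, 0.5])
```

    A lower probability of the observed match pattern arising by chance would correspond to a stronger match, though how MASPIC turns such a probability into a final score is not described in the abstract.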
