Proteomic Parsimony through Bipartite Graph Analysis Improves Accuracy and Transparency
Assembling peptides identified from LC−MS/MS spectra into a list of proteins is a critical step in
analyzing shotgun proteomics data. As one peptide sequence can be mapped to multiple proteins in
a database, naïve protein assembly can substantially overstate the number of proteins found in samples.
We model the peptide−protein relationships in a bipartite graph and use efficient graph algorithms to
identify protein clusters with shared peptides and to derive the minimal list of proteins. We test the
effects of this parsimony analysis approach using MS/MS data sets generated from a defined human
protein mixture, a yeast whole cell extract, and a human serum proteome after MARS column depletion.
The results demonstrate that the bipartite parsimony technique not only simplifies protein lists but
also improves the accuracy of protein identification. We use bipartite graphs for the visualization of
the protein assembly results to render the parsimony analysis process transparent to users. Our
approach also groups functionally related proteins together and improves the comprehensibility of
the results. We have implemented the tool in the IDPicker package. The source code and binaries for
this protein assembly pipeline are available under Mozilla Public License at the following URL: http://www.mc.vanderbilt.edu/msrc/bioinformatics/.
Keywords: parsimony analysis • bipartite graph • shotgun proteomics • LC−MS/MS • protein assembly
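The parsimony idea described in this abstract can be illustrated with a small sketch. It assumes a hypothetical input, a dictionary mapping each protein accession to the set of peptides identified for it, and it uses a greedy set-cover heuristic as a stand-in for the minimal-list derivation. The abstract does not say which algorithm IDPicker uses, so this is an approximation for illustration only, and the function names are invented here.

```python
from collections import defaultdict


def protein_clusters(protein_to_peptides):
    """Group proteins into connected components of the peptide-protein bipartite graph."""
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    seen, clusters = set(), []
    for start in protein_to_peptides:
        if start in seen:
            continue
        seen.add(start)
        stack, cluster = [start], set()
        while stack:
            protein = stack.pop()
            cluster.add(protein)
            # Walk protein -> peptide -> protein edges to reach neighbours.
            for peptide in protein_to_peptides[protein]:
                for neighbor in peptide_to_proteins[peptide]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
        clusters.append(cluster)
    return clusters


def greedy_minimal_list(protein_to_peptides):
    """Greedy set cover: keep picking the protein that explains the most uncovered peptides."""
    uncovered = set()
    for peptides in protein_to_peptides.values():
        uncovered |= peptides
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical by accession).
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Hypothetical example data, not taken from the paper.
example = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepB"},
    "P3": {"pepC", "pepD"},
    "P4": {"pepE"},
}
```

In this toy input, P2 contributes no peptide that P1 does not already explain, so a parsimonious list can omit it, while P1, P2, and P3 still appear together as one cluster because they share peptides.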
DBDigger: Reorganized Proteomic Database Identification That Improves Flexibility and Speed
Database search identification algorithms, such as Sequest and Mascot, constitute powerful enablers for proteomic tandem mass spectrometry. We introduce DBDigger, an algorithm that reorganizes the database identification process to remove a problematic bottleneck. Typically such algorithms determine which candidate sequences can be compared to each spectrum. Instead, DBDigger determines which spectra can be compared to each candidate sequence, enabling the software to generate candidate sequences only once for each HPLC separation rather than for each spectrum. This reorganization also reduces the number of times a spectrum must be predicted for a particular candidate sequence and charge state. As a result, DBDigger can accelerate some database searches by more than an order of magnitude. In addition, the software offers features to reduce the performance degradation introduced by posttranslational modification (PTM) searching. DBDigger allows researchers to specify the sequence context in which each PTM is possible. In the case of CNBr digests, for example, modified methionine residues can be limited to occur only at the C-termini of peptides. Use of “context-dependent” PTM searching reduces the performance penalty relative to traditional PTM searching. We characterize the performance possible with DBDigger, showcasing MASPIC, a new statistical scorer. We describe the implementation of these innovations in the hope that other researchers will employ them for rapid and highly flexible proteomic database search.
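The reorganization described above, iterating over candidate sequences and looking up the spectra each one could explain, can be sketched as follows. This is a minimal illustration under stated assumptions, not DBDigger's implementation: the function names, the (id, neutral precursor mass) spectrum representation, and the 0.5 Da tolerance are all hypothetical, and PTM handling is omitted.

```python
import bisect

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


def match_spectra_to_candidates(candidates, spectra, tolerance=0.5):
    """For each candidate sequence, list the spectra whose precursor mass fits.

    spectra is a list of (spectrum_id, neutral_precursor_mass) pairs.
    """
    by_mass = sorted(spectra, key=lambda item: item[1])
    masses = [mass for _, mass in by_mass]
    matches = {}
    for sequence in candidates:  # each candidate is visited exactly once
        mass = peptide_mass(sequence)
        low = bisect.bisect_left(masses, mass - tolerance)
        high = bisect.bisect_right(masses, mass + tolerance)
        if low < high:
            matches[sequence] = [spectrum_id for spectrum_id, _ in by_mass[low:high]]
    return matches


# Hypothetical usage: two spectra fall inside the window for "GAS".
spectra = [("s1", 233.10), ("s2", 500.00), ("s3", 233.40)]
result = match_spectra_to_candidates(["GAS", "PEPTIDE"], spectra)
```

Sorting the spectra once and using binary search per candidate is one simple way to make the sequence-centric loop cheap; the abstract does not state which data structure the software actually uses.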
The bis-Electrophile Diepoxybutane Cross-Links DNA to Human Histones but Does Not Result in Enhanced Mutagenesis in Recombinant Systems
1,2-Dibromoethane and 1,3-butadiene are cancer suspects present in the environment and have been used widely in industry. The mutagenic properties of 1,2-dibromoethane and the 1,3-butadiene oxidation product diepoxybutane are thought to be related to the bis-electrophilic character of these chemicals. The discovery that overexpression of O6-alkylguanine alkyltransferase (AGT) enhances bis-electrophile-induced mutagenesis prompted a search for other proteins that may act by a similar mechanism. A human liver screen for nuclear proteins that cross-link with DNA in the presence of 1,2-dibromoethane identified histones H2b and H3 as candidate proteins. Treatment of isolated histones H2b and H3 with diepoxybutane resulted in DNA−protein cross-links and produced protein adducts, and DNA−histone H2b cross-links were identified (immunochemically) in Escherichia coli cells expressing histone H2b. However, heterologous expression of histone H2b in E. coli failed to enhance bis-electrophile-induced mutagenesis. These results are similar to those found with the cross-link candidate glyceraldehyde 3-phosphate dehydrogenase (GAPDH) [Loecken, E. M., and Guengerich, F. P. (2008) Chem. Res. Toxicol. 21, 453−458], but in contrast to GAPDH, histone H2b bound DNA with even higher affinity than AGT. The extent of DNA cross-linking of isolated histone H2b was similar to that of AGT, suggesting that differences in post-cross-linking events explain the difference in mutagenesis.
DirecTag: Accurate Sequence Tags from Peptide MS/MS through Statistical Scoring
In shotgun proteomics, tandem mass spectra of peptides are typically identified through database search algorithms such as Sequest. We have developed DirecTag, an open-source algorithm to infer partial sequence tags directly from observed fragment ions. This algorithm is unique in its implementation of three separate scoring systems to evaluate each tag on the basis of peak intensity, m/z fidelity, and complementarity. In data sets from several types of mass spectrometers, DirecTag reproducibly exceeded the accuracy and speed of InsPecT and GutenTag, two previously published algorithms for this purpose. The source code and binaries for DirecTag are available from http://fenchurch.mc.vanderbilt.edu
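As a rough illustration of what inferring a sequence tag from fragment ions involves, the sketch below enumerates chains of peaks whose consecutive m/z gaps match amino acid residue masses. It is a toy under stated assumptions, not DirecTag's algorithm: it assumes singly charged fragments from one ion series, uses a hypothetical 0.02 m/z tolerance, reports isoleucine as L because the two residues are isobaric, and omits the three scoring systems (peak intensity, m/z fidelity, complementarity) that the abstract describes.

```python
# Standard monoisotopic residue masses in daltons; I is omitted because it is
# isobaric with L and cannot be distinguished by mass alone.
RESIDUES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def find_tags(peaks, tag_length=3, tolerance=0.02):
    """Return every tag of tag_length residues supported by consecutive peak gaps."""
    peaks = sorted(peaks)

    def extend(index, tag):
        # A tag is complete once it holds tag_length residues.
        if len(tag) == tag_length:
            yield tag
            return
        for nxt in range(index + 1, len(peaks)):
            gap = peaks[nxt] - peaks[index]
            for residue, mass in RESIDUES.items():
                if abs(gap - mass) <= tolerance:
                    yield from extend(nxt, tag + residue)

    tags = set()
    for start in range(len(peaks)):
        tags.update(extend(start, ""))
    return sorted(tags)


# Hypothetical peak list: three gaps spelling P, E, T plus one noise peak.
peak_list = [200.0, 297.05276, 350.0, 426.09535, 527.14303]
tags = find_tags(peak_list)
```

A real tagging tool would rank the enumerated tags rather than return them all, which is where scoring of the kind described in the abstract comes in.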
QC Metrics from CPTAC Raw LC-MS/MS Data Interpreted through Multivariate Statistics
Shotgun proteomics experiments integrate a complex sequence of
processes, any of which can introduce variability. Quality metrics
computed from LC-MS/MS data have relied upon identifying MS/MS scans,
but a new mode for the QuaMeter software produces metrics that are
independent of identifications. Rather than evaluating each metric
independently, we have created a robust multivariate statistical toolkit
that accommodates the correlation structure of these metrics and allows
for hierarchical relationships among data sets. The framework enables
visualization and structural assessment of variability. Study 1 for
the Clinical Proteomics Technology Assessment for Cancer (CPTAC),
which analyzed three replicates of two common samples at each of two
time points among 23 mass spectrometers in nine laboratories, provided
the data to demonstrate this framework, and CPTAC Study 5 provided
data from complex lysates under Standard Operating Procedures (SOPs)
to complement these findings. Identification-independent quality metrics
enabled the differentiation of sites and run-times through robust
principal components analysis and subsequent factor analysis. Dissimilarity
metrics revealed outliers in performance, and a nested ANOVA model
revealed the extent to which all metrics or individual metrics were
impacted by mass spectrometer and run time. Study 5 data revealed
that even when SOPs have been applied, instrument-dependent variability
remains prominent, although it may be reduced, while within-site variability
is reduced significantly. Finally, identification-independent quality
metrics were shown to be predictive of identification sensitivity
in these data sets. QuaMeter and the associated multivariate framework
are available from http://fenchurch.mc.vanderbilt.edu and http://homepages.uc.edu/~wang2x7/, respectively
MASPIC: Intensity-Based Tandem Mass Spectrometry Scoring Scheme That Improves Peptide Identification at High Confidence
Algorithmic search engines bridge the gap between large
tandem mass spectrometry data sets and the identification
of proteins associated with biological samples. Improvements in these tools can greatly enhance biological
discovery. We present a new scoring scheme for comparing tandem mass spectra with a protein sequence database. The MASPIC (Multinomial Algorithm for Spectral
Profile-based Intensity Comparison) scorer converts an
experimental tandem mass spectrum into an m/z profile
of probability and then scores peak lists from potential
candidate peptides using a multinomial distribution model.
The MASPIC scoring scheme incorporates intensity,
spectral peak density variations, and m/z error distribution associated with peak matches into a multinomial
distribution. The scoring scheme was validated on two
standard protein mixtures and an additional set of spectra
collected on a complex ribosomal protein mixture from
Rhodopseudomonas palustris. The results indicate a
5−15% improvement over Sequest for high-confidence
identifications. The performance gap grows as sequence
database size increases. Additional tests on spectra from
proteinase-K digest data showed similar performance
improvements, demonstrating the advantages of using
MASPIC for studying proteins digested with less specific
proteases. All these investigations show MASPIC to be a
versatile and reliable system for peptide tandem mass
spectral identification
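The two steps named in this abstract, turning a spectrum into an m/z probability profile and scoring a candidate's predicted peaks under a multinomial model, can be sketched in simplified form. This is a toy illustration, not the published MASPIC scorer: the fixed-width binning, the pseudocount for empty bins, and the function names are all assumptions, and the peak-density and m/z-error terms mentioned in the abstract are left out.

```python
import math
from collections import Counter


def mz_probability_profile(peaks, bin_width=1.0, mz_max=1000.0, pseudocount=1e-3):
    """Convert (m/z, intensity) peaks into per-bin probabilities that sum to 1."""
    n_bins = int(mz_max / bin_width)
    # The pseudocount keeps empty bins from having zero probability.
    weights = [pseudocount] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:
            weights[index] += intensity
    total = sum(weights)
    return [weight / total for weight in weights]


def multinomial_log_prob(profile, predicted_mz, bin_width=1.0):
    """Log multinomial probability of the predicted peaks landing in their bins."""
    counts = Counter()
    for mz in predicted_mz:
        index = int(mz / bin_width)
        if 0 <= index < len(profile):
            counts[index] += 1
    n = sum(counts.values())
    # log(n!) - sum(log(k_i!)) + sum(k_i * log(p_i))
    log_prob = math.lgamma(n + 1)
    for index, k in counts.items():
        log_prob += k * math.log(profile[index]) - math.lgamma(k + 1)
    return log_prob


# Hypothetical spectrum and two candidate fragment lists.
observed = [(300.2, 50.0), (450.7, 30.0), (600.1, 20.0)]
profile = mz_probability_profile(observed)
good_score = multinomial_log_prob(profile, [300.3, 450.5, 600.4])
bad_score = multinomial_log_prob(profile, [310.3, 470.5, 620.4])
```

Under this sketch, a candidate whose predicted fragments fall in intense regions of the spectrum receives a higher log probability than one whose fragments fall in empty bins.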
