Recent studies have revealed a relationship between protein abundance and sampling statistics, such
as sequence coverage, peptide count, and spectral count, in label-free liquid chromatography−tandem
mass spectrometry (LC−MS/MS) shotgun proteomics. The use of sampling statistics offers a promising
method of measuring relative protein abundance and detecting differentially expressed or coexpressed
proteins. We performed a systematic analysis of various approaches to quantifying differential protein
expression in label-free LC−MS/MS data from the eukaryote Saccharomyces cerevisiae and the
prokaryote Rhodopseudomonas palustris. First, we showed that, among three sampling statistics, the spectral count has
the highest technical reproducibility, followed by the less-reproducible peptide count and relatively
nonreproducible sequence coverage. Second, we used spectral count statistics to measure differential
protein expression in pairwise experiments using five statistical tests: Fisher's exact test, G-test, AC
test, t-test, and LPE test. Given the S. cerevisiae data set with spiked proteins as a benchmark and the
false positive rate as a metric, our evaluation suggested that the Fisher's exact test, G-test, and AC test
can be used when the number of replicates is limited (one or two), whereas the t-test is useful when
three or more replicates are available. Third, we generalized the G-test to increase the sensitivity of detecting
differential protein expression under multiple experimental conditions. Out of 1622 identified R. palustris
proteins in the LC−MS/MS experiment, the generalized G-test detected 1119 differentially expressed
proteins under six growth conditions. Finally, we studied correlated expression of these 1119 proteins
by analyzing pairwise expression correlations and by delineating protein clusters according to expression
patterns. Through pairwise expression correlation analysis, we demonstrated that proteins co-located
in the same operon were much more strongly coexpressed than those from different operons.
Combining cluster analysis with existing protein functional annotations, we identified six protein clusters
with known biological significance. In summary, the proposed generalized G-test using spectral count
sampling statistics is a viable methodology for robust quantification of relative protein abundance and
for sensitive detection of biologically significant differential protein expression under multiple
experimental conditions in label-free shotgun proteomics.
Keywords: label-free • LC−MS/MS • shotgun proteomics • differential expression • correlated expression • clustering
• Saccharomyces cerevisiae • Rhodopseudomonas palustris
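
To make the idea of a G-test on spectral counts concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes the textbook likelihood-ratio form, G = 2 Σ O_i ln(O_i / E_i), in which the expected count of a protein in each condition is proportional to the total spectral count of that condition. With two conditions it corresponds to a pairwise comparison; with more conditions it gives one plausible multi-condition form, which may differ from the generalized G-test proposed here. The function names `g_test` and `chi2_sf` and the example counts are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    m = df // 2
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        total = sum(half ** k / math.factorial(k) for k in range(m))
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite correction sum.
    tail = sum(half ** (k - 0.5) / math.gamma(k + 0.5) for k in range(1, m + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


def g_test(counts, totals):
    """G-test for one protein across conditions.

    counts: spectral counts of the protein in each condition.
    totals: total spectral counts of each condition (assumed normalization).
    Returns (G statistic, degrees of freedom, p-value).
    """
    if len(counts) != len(totals) or len(counts) < 2:
        raise ValueError("need matching counts and totals for >= 2 conditions")
    grand_total = sum(totals)
    protein_total = sum(counts)
    g = 0.0
    for observed, total in zip(counts, totals):
        expected = protein_total * total / grand_total
        if observed > 0:  # a zero count contributes nothing (0 * ln 0 := 0)
            g += observed * math.log(observed / expected)
    g = max(2.0 * g, 0.0)  # guard against tiny negative rounding error
    df = len(counts) - 1
    return g, df, chi2_sf(g, df)


# Hypothetical pairwise example: 10 vs. 30 spectra in two equally sized runs.
g_stat, dof, p_value = g_test([10, 30], [1000, 1000])
print(f"G = {g_stat:.3f}, df = {dof}, p = {p_value:.4g}")
```

Passing six counts and six totals to the same function gives a test with five degrees of freedom. In a proteome-wide analysis the resulting p-values would still need correction for multiple testing, which this sketch omits.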