Addressing Statistical Biases in Nucleotide-Derived Protein Databases for Proteogenomic Search Strategies
Abstract
Proteogenomics has the potential to advance genome annotation
through high-quality peptide identifications derived from mass spectrometry
experiments, which demonstrate that a given gene or isoform is expressed
and translated at the protein level. This can advance our understanding
of genome function by revealing novel genes and gene structures that
have not yet been identified or validated. Because of the high-throughput
shotgun nature of most proteomics experiments, it is essential to
carefully control for false positives and prevent any potential misannotation.
Several statistical procedures that address this are in wide use
in proteomics, calculating false discovery rate (FDR) and posterior
error probability (PEP) values for groups and individual peptide-spectrum
matches (PSMs). These methods control for multiple testing and exploit
decoy databases to estimate statistical significance. Here, we show
that database choice has a major effect on these confidence estimates,
leading to significant differences in the number of PSMs reported.
We note that standard target:decoy approaches using six-frame translations
of nucleotide sequences, such as assembled transcriptome data, apparently
underestimate the confidence assigned to the PSMs. This error stems
from the inflated and unusual nature of the six-frame
database, where for every target sequence there exist five “incorrect”
targets that are unlikely to code for protein. The attendant FDR and
PEP estimates lead to fewer accepted PSMs at fixed thresholds, and
we show that this effect is a product of the database and statistical
modeling and not the search engine. A variety of approaches to limit
database size and remove noncoding target sequences are examined and
discussed in terms of the altered statistical estimates generated
and PSMs reported. These results are of importance to groups carrying
out proteogenomics, aiming to maximize the validation and discovery
of gene structure in sequenced genomes, while still controlling for
false positives.
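
To make the target:decoy reasoning above concrete, the following is a minimal sketch of a commonly used FDR estimate (decoys divided by targets above a score threshold, converted to q-values). It is an illustration under stated assumptions, not necessarily the procedure used in this work: the function names, the "higher score is better" convention, and the toy scores are all hypothetical.

```python
from typing import List, Tuple

# A PSM is modeled here as (score, is_decoy); higher score = better match.
# This representation is an assumption made for illustration only.
PSM = Tuple[float, bool]


def target_decoy_qvalues(psms: List[PSM]) -> List[Tuple[float, float]]:
    """Return (score, q_value) for each target PSM, best score first."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    running = []  # (score, is_decoy, estimated FDR at this threshold)
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Simple estimate: decoy hits approximate the false target hits.
        fdr = min(decoys / max(targets, 1), 1.0)
        running.append((score, is_decoy, fdr))

    # q-value: the lowest FDR at this threshold or any more permissive one.
    q_values = []
    best = 1.0
    for score, is_decoy, fdr in reversed(running):
        best = min(best, fdr)
        if not is_decoy:
            q_values.append((score, best))
    q_values.reverse()
    return q_values


def count_accepted(psms: List[PSM], threshold: float = 0.01) -> int:
    """Number of target PSMs whose q-value is at or below the threshold."""
    return sum(1 for _, q in target_decoy_qvalues(psms) if q <= threshold)


# Toy data: the same target PSMs searched against a small and a larger database.
small_db = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
# A larger search space yields an extra high-scoring decoy hit (hypothetical).
large_db = small_db + [(8.5, True)]

accepted_small = count_accepted(small_db, threshold=0.01)
accepted_large = count_accepted(large_db, threshold=0.01)
```

In this toy case the target scores are identical in both searches, yet the single additional high-scoring decoy lowers the number of target PSMs accepted at the same threshold. That is the kind of database-driven effect the abstract describes, shown here only qualitatively.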