HSPVdb—the Human Short Peptide Variation Database for improved mass spectrometry-based detection of polymorphic HLA-ligands

AI Nesvizhskii; Arnoud H. de Ru; Aurélie Viars; CA Bergen van; CC Oliveira; Chopie Hassan; D Stepniak; DN Perkins; E Spierings; F Reisinger; Harm Nijveen; HD Meiring; HS Hiemstra; J. H. Fred Falkenburg; Jack A. M. Leunissen; JH Falkenburg; JH Kessler; JK Eng; KD Pruitt; L Hambach; LC Eisenlohr; M Bleakley; Machiel de Jager; Michel G. D. Kester; N Hillen; N Salimi; NJ Edwards; O Ho; Peter A. van Veelen; PJ Kersey; R Storb; S Schandorff; ST Sherry; T Etzold; The UniProt Consortium; VH Engelhard; WA Marijt


Abstract

T cell epitopes derived from polymorphic proteins or from proteins encoded by alternative reading frames (ARFs) play an important role in (tumor) immunology. These peptides can be identified successfully with mass spectrometry. In a mass spectrometry-based approach, the recorded tandem mass spectra are matched against theoretical spectra generated from known protein sequence databases. Commonly used protein databases contain a minimal level of redundancy and are therefore not suitable data sources for searching polymorphic T cell epitopes, either in the normal reading frame or in ARFs. At the same time, these databases contain much non-polymorphic sequence information, which complicates the matching of recorded and theoretical spectra and increases the potential for false positives. Therefore, we created a database of peptides from ARFs and of peptide variation arising from single nucleotide polymorphisms (SNPs). It is based on the human mRNA sequences from the well-annotated Reference Sequence (RefSeq) database and the associated variation information from the Single Nucleotide Polymorphism Database (dbSNP). In this process, we removed all non-polymorphic information. Investigation of the SNP frequencies in dbSNP revealed that many entries are non-polymorphic “SNPs”. We removed those from our dedicated database, which resulted in a comprehensive, high-quality database that we named the Human Short Peptide Variation Database (HSPVdb). The value of HSPVdb is shown by the identification of the majority of published polymorphic SNP- and/or ARF-derived epitopes from a mass spectrometry-based proteomics workflow, and by the large variety of polymorphic peptides identified as potential T cell epitopes in the HLA-ligandome presented by the Epstein–Barr virus cells.
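To illustrate the idea of keeping only polymorphic sequence information, the following is a minimal sketch and not the actual HSPVdb pipeline. It translates a coding sequence, applies a single SNP, and returns the short reference and variant peptide windows only when the amino acid changes. It also shows translation of an alternative reading frame. The function names (`translate`, `variant_peptide`), the `flank` parameter, and the toy sequence are hypothetical and chosen for illustration. SNPs that introduce or remove a stop codon are simply skipped in this simplification.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a nucleotide string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def variant_peptide(cds, snp_pos, alt_base, flank=10):
    """Return (reference_window, variant_window) around a nonsynonymous SNP.

    snp_pos is a 0-based nucleotide position in the coding sequence.
    Returns None for synonymous SNPs and for SNPs that fall outside the
    translated region (for example, because a stop codon is introduced).
    """
    mutated = cds[:snp_pos] + alt_base + cds[snp_pos + 1:]
    ref, alt = translate(cds), translate(mutated)
    aa_pos = snp_pos // 3
    if aa_pos >= len(ref) or aa_pos >= len(alt) or ref[aa_pos] == alt[aa_pos]:
        return None  # non-polymorphic at the peptide level: not stored
    lo, hi = max(0, aa_pos - flank), aa_pos + flank + 1
    return ref[lo:hi], alt[lo:hi]


# Toy coding sequence (hypothetical): translates to MAKLEF in the normal frame.
cds = "ATGGCCAAGCTGGAATTTTAA"
nonsyn = variant_peptide(cds, 7, "G", flank=2)   # AAG -> AGG, Lys -> Arg
syn = variant_peptide(cds, 5, "T", flank=2)      # GCC -> GCT, Ala -> Ala
arf = translate(cds[1:])                         # +1 alternative reading frame
print(nonsyn, syn, arf)
```

In this sketch only the windows around amino acid changes are retained, which mirrors the stated goal of discarding non-polymorphic sequence so that the search space for spectral matching stays small.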