Pool-seq analysis for the identification of polymorphisms in bacterial strains and utilization of the variants for protein database creation

Abstract

Pooled sequencing (Pool-seq) is the sequencing of a single library that contains DNA pooled from different samples. It is a cost-effective alternative to individual whole-genome sequencing. In this study, we used Pool-seq to sequence 100 Streptococcus pyogenes strains in two pools to identify polymorphisms and create variant protein databases for shotgun proteomics analysis. We evaluated the efficacy of the pooling strategy and of the four variant calling tools using individual sequence data from six of the pooled strains as well as 3407 publicly available strains from the European Nucleotide Archive. In addition to the raw sequence data from the public repository, we extracted polymorphisms from 19 publicly available complete S. pyogenes genomes and compared these variations against our pools. In total, 78955 variants (76981 SNPs and 1725 INDELs) were identified from the two pools. Of these, ∼60.5% and 95.7% were also found in the complete genomes and the European Nucleotide Archive data, respectively. Collectively, the four variant calling tools recovered the majority (∼96.5%) of the variants found in the six individual strains, suggesting that Pool-seq is a robust approach for variation discovery. Variants from the pools that fell in coding regions and had nonsynonymous effects constituted 24% and were used to create variant protein databases for shotgun proteomics analysis. These variant databases improved protein identification in mass spectrometry analysis.
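
The recovery figure reported above (the share of individually sequenced variants that the four callers collectively found in the pools) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Variant` record, the function names `union_calls` and `recovery_rate`, and the toy calls are all hypothetical, and the sketch assumes variants are matched exactly on contig, position, reference allele, and alternate allele.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Variant(NamedTuple):
    """Hypothetical variant key: exact match on contig, position and alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def union_calls(call_sets: Iterable[set[Variant]]) -> set[Variant]:
    """Merge the calls of several variant callers into one set (set union)."""
    merged: set[Variant] = set()
    for calls in call_sets:
        merged |= calls
    return merged


def recovery_rate(pool_calls: set[Variant], individual_calls: set[Variant]) -> float:
    """Fraction of individually called variants that also appear in the pool calls."""
    if not individual_calls:
        raise ValueError("individual_calls is empty; recovery rate is undefined")
    return len(pool_calls & individual_calls) / len(individual_calls)


if __name__ == "__main__":
    # Toy data for illustration only; these are not values from the study.
    caller_a = {Variant("contig1", 10, "A", "G"), Variant("contig1", 20, "C", "T")}
    caller_b = {Variant("contig1", 20, "C", "T"), Variant("contig1", 30, "G", "GA")}
    pool = union_calls([caller_a, caller_b])

    individual = {
        Variant("contig1", 10, "A", "G"),
        Variant("contig1", 30, "G", "GA"),
        Variant("contig1", 40, "T", "C"),
    }
    print(f"Recovered {recovery_rate(pool, individual):.1%} of individual-strain variants")
```

Exact matching on alleles is the simplest choice; in practice INDELs may need normalization (for example left-alignment) before comparison, which this sketch does not attempt.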
