Encoded haplotype data as input to ipPCA can better resolve population clustering

Abstract

Background Studies in population genetics are based mainly on the analysis of genetic variation among populations. With the advent of advanced genotyping technology, large numbers of single nucleotide polymorphisms (SNPs) can be used to capture the underlying population variation. Iterative pruning principal component analysis (ipPCA) is a powerful tool for clustering subpopulations based on their SNP profiles. However, when several similar populations are analyzed together, differentiating them can become very challenging. Haplotypes are known to capture more segregation information and to offer higher power than individual SNPs, but because haplotype inference is computationally complex, they have not been widely used. Recently, haplotype sharing (HS) was reported as a good alternative method for evaluating variation among populations. HS interrogates the entire set of genotypes without estimating haplotype blocks, making it computationally efficient while retaining the population profile. Adopting the HS technique and introducing a new haplotype encoding as the input to ipPCA can yield very good population clustering outcomes.

Results In this study, we transformed indigenous Thai SNP genotyping data, obtained from the Pan-Asian SNP Consortium, into encoded haplotype profiles. The dataset includes 13 indigenous populations (245 individuals), with approximately 54K SNPs per individual. An encoded haplotype matrix was constructed by inferring overlapping haplotypes with a sliding-window approach in BEAGLE, an efficient haplotype inference tool. We fed this matrix to ipPCA to cluster the individuals into subgroups using only their genetic profiles. We compared the standard ipPCA protocol with the one that uses the encoded haplotype matrix, in terms of the number of subpopulations clustered and the accuracy of assigning each individual to the correct subpopulation. With the encoded haplotype matrix as input, ipPCA resolved exactly 13 subpopulations with 99.18% individual assignment accuracy, whereas conventional ipPCA identified only 10 subpopulations with 93.47% individual assignment accuracy.

Conclusions Our results demonstrate the great potential of using the encoded haplotype matrix with ipPCA for population genetics studies. This new protocol can facilitate the clustering of individuals using only their genetic profiles.
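The abstract does not spell out the encoding scheme, so the following is only a minimal sketch of what an overlapping sliding-window haplotype encoding might look like. It assumes the genotypes have already been phased upstream (for example by BEAGLE) into two allele strings per individual. The function name `encode_haplotypes`, the `window` and `step` parameters, and the choice to store haplotype copy counts (0, 1, or 2) are all illustrative assumptions, not the authors' published method.

```python
def encode_haplotypes(phased, window=3, step=1):
    """Build a count matrix from phased haplotypes (illustrative only).

    phased: list of (hap_a, hap_b) tuples, one per individual; each
            haplotype is a string of allele codes of equal length.
    Returns (matrix, columns): columns[j] is (window_start, haplotype),
    and matrix[i][j] is how many copies (0, 1, or 2) of that local
    haplotype individual i carries in that window.
    """
    n_snps = len(phased[0][0])
    starts = range(0, n_snps - window + 1, step)

    # One column per distinct local haplotype observed in each window.
    columns = []
    index = {}
    for start in starts:
        observed = sorted({hap[start:start + window]
                           for pair in phased for hap in pair})
        for local in observed:
            index[(start, local)] = len(columns)
            columns.append((start, local))

    # Count the copies of each local haplotype carried by each individual.
    matrix = [[0] * len(columns) for _ in phased]
    for i, pair in enumerate(phased):
        for hap in pair:
            for start in starts:
                matrix[i][index[(start, hap[start:start + window])]] += 1
    return matrix, columns


# Toy data: three individuals, four SNPs, alleles coded "0"/"1".
toy = [("0101", "0101"), ("0101", "1010"), ("1010", "1010")]
matrix, columns = encode_haplotypes(toy, window=2, step=1)
```

Under these assumptions, the resulting matrix (individuals by window-specific haplotypes) would stand in for the SNP genotype matrix as the numeric input to a PCA-based clustering method such as ipPCA. Because consecutive windows overlap, each column carries local linkage information that a single SNP column does not.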