Inference of population structure from genetic data plays an important role
in population and medical genetics studies. The traditional EIGENSTRAT method
has been widely used for computing and selecting top principal components that
capture population structure information (Price et al., 2006). With the
advancement and decreasing cost of sequencing technology, whole-genome
sequencing data provide much richer information about the underlying population
structures. However, the EIGENSTRAT method was originally developed for
analyzing array-based genotype data and thus may not perform well on sequencing
data for two reasons. First, the number of genetic variants p is much larger
than the sample size n in sequencing data, so the sample-to-marker ratio n/p
is nearly zero, violating the assumption of the Tracy-Widom test
used in the EIGENSTRAT method. Second, the EIGENSTRAT method may not handle
linkage disequilibrium (LD) well in sequencing data. To resolve these two
issues, we propose a new statistical method, ERStruct, that estimates the
number of latent sub-populations from sequencing data. We
use the ratio of successive eigenvalues as a more robust test statistic and
approximate its null distribution using modern random matrix theory.
Simulation studies show that ERStruct outperforms the traditional Tracy-Widom
test on
sequencing data. We further demonstrate the performance of ERStruct on two
public data sets from the HapMap 3 and the 1000 Genomes projects. We also
implement ERStruct as a MATLAB toolbox, which is publicly available on GitHub
at https://github.com/bglvly/ERStruct.
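
To make the eigenvalue-ratio idea concrete, the following Python sketch computes
successive eigenvalue ratios on simulated genotypes. It is an illustration
only, not the ERStruct procedure or the MATLAB toolbox API: the function names,
the simulation settings, the ratio direction (lambda_{k+1} / lambda_k), and the
"smallest ratio" rule are all assumptions made here. ERStruct itself tests the
ratios against a null distribution approximated by random matrix theory, which
this sketch does not attempt.

```python
import numpy as np


def simulate_genotypes(n_per_pop=50, n_pops=3, n_markers=5000, seed=0):
    """Hypothetical toy simulation: n_pops sub-populations with shifted allele frequencies."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.1, 0.9, size=n_markers)
    # Each sub-population perturbs the shared base allele frequencies.
    shifts = rng.normal(0.0, 0.1, size=(n_pops, n_markers))
    pop_freqs = np.clip(base + shifts, 0.05, 0.95)
    labels = np.repeat(np.arange(n_pops), n_per_pop)
    # Genotypes are counts of the alternate allele (0, 1, or 2).
    return rng.binomial(2, pop_freqs[labels])


def eigenvalue_ratios(genotypes):
    """Return the nonzero eigenvalues (descending) and the ratios lambda_{k+1} / lambda_k."""
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                 # center each marker
    sd = X.std(axis=0)
    keep = sd > 0                          # drop monomorphic markers
    X = X[:, keep] / sd[keep]              # scale each marker to unit variance
    # The n x n matrix is cheap to decompose when n is much smaller than p.
    gram = X @ X.T / X.shape[1]
    eigvals = np.linalg.eigvalsh(gram)[::-1]
    eigvals = eigvals[:-1]                 # centering forces one (near-)zero eigenvalue
    ratios = eigvals[1:] / eigvals[:-1]
    return eigvals, ratios


eigvals, ratios = eigenvalue_ratios(simulate_genotypes())
# Crude heuristic (an assumption of this sketch): the sharpest drop marks the
# last "structure" eigenvalue; K sub-populations give K - 1 such eigenvalues.
n_top = int(np.argmin(ratios)) + 1
print("top eigenvalues:", np.round(eigvals[:5], 3))
print("estimated number of sub-populations:", n_top + 1)
```

The n x n matrix is used instead of the p x p covariance matrix because both
share the same nonzero eigenvalues, and n is far smaller than p for sequencing
data.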