Geometry of the sample frequency spectrum and the perils of demographic inference
The sample frequency spectrum (SFS), which describes the distribution of
mutant alleles in a sample of DNA sequences, is a widely used summary statistic
in population genetics. The expected SFS has a strong dependence on the
historical population demography and this property is exploited by popular
statistical methods to infer complex demographic histories from DNA sequence
data. Most, if not all, of these inference methods exhibit pathological
behavior, however. Specifically, they often display runaway behavior in
optimization, where the inferred population sizes and epoch durations can
degenerate to 0 or diverge to infinity, and show undesirable sensitivity of the
inferred demography to perturbations in the data. The goal of this paper is to
provide theoretical insights into why such problems arise. To this end, we
characterize the geometry of the expected SFS for piecewise-constant
demographic histories and use our results to show that the aforementioned
pathological behavior of popular inference methods is intrinsic to the geometry
of the expected SFS. We provide explicit descriptions and visualizations for a
toy model with sample size 4, and generalize our intuition to arbitrary sample
sizes n using tools from convex and algebraic geometry. We also develop a
universal characterization result which shows that the expected SFS of a sample
of size n under an arbitrary population history can be recapitulated by a
piecewise-constant demography with only k(n) epochs, where k(n) is between n/2
and 2n-1. The set of expected SFS for piecewise-constant demographies with
fewer than k(n) epochs is open and non-convex, which causes the above phenomena
for inference from data.
Comment: 21 pages, 5 figures