Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum
The sample frequency spectrum (SFS) of DNA sequences from a collection of
individuals is a summary statistic which is commonly used for parametric
inference in population genetics. Despite the popularity of SFS-based inference
methods, currently little is known about the information-theoretic limit on the
estimation accuracy as a function of sample size. Here, we show that using the
SFS to estimate the size history of a population has a minimax error of at
least $O(1/\log s)$, where $s$ is the number of independent segregating sites
used in the analysis. This rate is exponentially worse than known convergence
rates for many classical estimation problems in statistics. Another surprising
aspect of our theoretical bound is that it does not depend on the dimension of
the SFS, which is related to the number of sampled individuals. This means
that, for a fixed number of segregating sites considered, using more
individuals does not help to reduce the minimax error bound. Our result
pertains to populations that have experienced a bottleneck, and we argue that
it can be expected to apply to many populations in nature.
Comment: 17 pages, 1 figure
The Kalmanson Complex
Let X be a finite set of cardinality n. The Kalmanson complex K_n is the
simplicial complex whose vertices are non-trivial X-splits, and whose facets
are maximal circular split systems over X. In this paper we examine K_n from
three perspectives. In addition to the T-theoretic description, we show that
K_n has a geometric realization as the Kalmanson conditions on a finite metric.
A third description arises in terms of binary matrices which possess the
circular ones property. We prove the equivalence of these three definitions.
This leads to a simplified proof of the well-known equivalence between
Kalmanson and circular decomposable metrics, as well as a partial description
of the f-vector of K_n.
Comment: Improved exposition. 24 pages, 2 figures, 1 table
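As an illustration of the Kalmanson conditions mentioned in the abstract above, the following minimal sketch checks whether a distance matrix satisfies them for the ordering in which its rows are given. It assumes the standard four-point form of the conditions: for all i < j < k < l, both d(i,j) + d(k,l) and d(i,l) + d(j,k) are at most d(i,k) + d(j,l). The function names `is_kalmanson` and `line_metric` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def is_kalmanson(d, tol=1e-12):
    """Return True if the symmetric distance matrix d satisfies the
    Kalmanson conditions with respect to the ordering 0, 1, ..., n-1.

    Assumed form of the conditions, for all i < j < k < l:
        d[i][j] + d[k][l] <= d[i][k] + d[j][l]
        d[i][l] + d[j][k] <= d[i][k] + d[j][l]
    """
    n = len(d)
    # combinations() yields index tuples in increasing order, so i < j < k < l.
    for i, j, k, l in combinations(range(n), 4):
        crossing = d[i][k] + d[j][l]
        if d[i][j] + d[k][l] > crossing + tol:
            return False
        if d[i][l] + d[j][k] > crossing + tol:
            return False
    return True


def line_metric(points):
    """Distance matrix of points on the real line (illustrative helper)."""
    return [[abs(p - q) for q in points] for p in points]


# Points listed in their natural order along a line satisfy the conditions.
print(is_kalmanson(line_metric([0, 1, 2, 3])))
# The same points listed in a scrambled order violate them for that ordering.
print(is_kalmanson(line_metric([0, 2, 1, 3])))
```

The check only tests one fixed ordering. Deciding whether some ordering exists for which a metric is Kalmanson is a separate problem that this sketch does not address.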
Multi-locus analysis of genomic time series data from experimental evolution.
Genomic time series data generated by evolve-and-resequence (E&R) experiments offer a powerful window into the mechanisms that drive evolution. However, standard population genetic inference procedures do not account for sampling serially over time, and new methods are needed to make full use of modern experimental evolution data. To address this problem, we develop a Gaussian process approximation to the multi-locus Wright-Fisher process with selection over a time course of tens of generations. The mean and covariance structure of the Gaussian process are obtained by computing the corresponding moments in discrete-time Wright-Fisher models conditioned on the presence of a linked selected site. This enables our method to account for the effects of linkage and selection, both along the genome and across sampled time points, in an approximate but principled manner. We first use simulated data to demonstrate the power of our method to correctly detect, locate and estimate the fitness of a selected allele from among several linked sites. We study how this power changes for different values of selection strength, initial haplotypic diversity, population size, sampling frequency, experimental duration, number of replicates, and sequencing coverage depth. In addition to providing quantitative estimates of selection parameters from experimental evolution data, our model can be used by practitioners to design E&R experiments with requisite power. We also explore how our likelihood-based approach can be used to infer other model parameters, including effective population size and recombination rate. Then, we apply our method to analyze genome-wide data from a real E&R experiment designed to study the adaptation of D. melanogaster to a new laboratory environment with alternating cold and hot temperatures.
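For readers unfamiliar with the underlying model, here is a minimal sketch of a discrete-time Wright-Fisher process with selection. It is deliberately much simpler than the method in the abstract above: it follows a single locus in a haploid population and involves no linkage and no Gaussian process approximation. The function name `simulate_wright_fisher` and all parameter values are assumptions chosen for illustration.

```python
import random


def simulate_wright_fisher(pop_size, s, p0, generations, rng):
    """Simulate allele-frequency trajectories at one haploid locus.

    pop_size    : number of haploid individuals (constant over time)
    s           : selection coefficient; the focal allele has fitness 1 + s
    p0          : initial frequency of the focal allele
    generations : number of generations to simulate
    rng         : a random.Random instance, passed in for reproducibility
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Deterministic selection step: frequency after fitness weighting.
        p_sel = p * (1 + s) / (1 + s * p)
        # Random drift step: binomial resampling of pop_size offspring,
        # written as a sum of Bernoulli draws to stay within the stdlib.
        count = sum(1 for _ in range(pop_size) if rng.random() < p_sel)
        p = count / pop_size
        freqs.append(p)
    return freqs


# A time course of tens of generations, as in an E&R experiment.
trajectory = simulate_wright_fisher(
    pop_size=1000, s=0.05, p0=0.1, generations=30, rng=random.Random(0)
)
print(len(trajectory), trajectory[0])
```

Sampling such a trajectory at a few time points, across several replicates, gives a toy version of the serially sampled data that the abstract describes.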
Inference of Population History using Coalescent HMMs: Review and Outlook
Studying how diverse human populations are related is of historical and
anthropological interest, in addition to providing a realistic null model for
testing for signatures of natural selection or disease associations.
Furthermore, understanding the demographic histories of other species is
playing an increasingly important role in conservation genetics. A number of
statistical methods have been developed to infer population demographic
histories using whole-genome sequence data, with recent advances focusing on
allowing for more flexible modeling choices, scaling to larger data sets, and
increasing statistical power. Here we review coalescent hidden Markov models, a
powerful class of population genetic inference methods that can effectively
utilize linkage disequilibrium information. We highlight recent advances, give
advice for practitioners, point out potential pitfalls, and present possible
future research directions.
Comment: 12 pages, 2 figures
Efficiently inferring the demographic history of many populations with allele count data.
The sample frequency spectrum (SFS), or histogram of allele counts, is an important summary statistic in evolutionary biology, and is often used to infer the history of population size changes, migrations, and other demographic events affecting a set of populations. The expected multipopulation SFS under a given demographic model can be efficiently computed when the populations in the model are related by a tree, scaling to hundreds of populations. Admixture, back-migration, and introgression are common natural processes that violate the assumption of a tree-like population history, however, and until now the expected SFS could be computed for only a handful of populations when the demographic history is not a tree. In this article, we present a new method for efficiently computing the expected SFS and linear functionals of it, for demographies described by general directed acyclic graphs. This method can scale to more populations than previously possible for complex demographic histories including admixture. We apply our method to an 8-population SFS to estimate the timing and strength of a proposed "basal Eurasian" admixture event in human history. We implement and release our method in a new open-source software package momi2.
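To make "histogram of allele counts" concrete, the sketch below computes an observed single-population SFS from a small 0/1 genotype matrix. It is an illustration only and does not use momi2 or reproduce its API. The function name `sample_frequency_spectrum` and the input layout (one row per site, one column per sampled haplotype, with 1 marking the derived allele) are assumptions.

```python
from collections import Counter


def sample_frequency_spectrum(sites):
    """Return the observed SFS for n sampled haplotypes.

    sites : list of rows, one per site; each row holds n entries that are
            0 (ancestral allele) or 1 (derived allele).
    The result is a list of length n - 1 whose entry k - 1 is the number of
    sites at which exactly k haplotypes carry the derived allele. Sites with
    0 or n derived copies are not segregating and are left out.
    """
    if not sites:
        return []
    n = len(sites[0])
    if any(len(row) != n for row in sites):
        raise ValueError("every site must have the same number of haplotypes")
    derived_counts = Counter(sum(row) for row in sites)
    return [derived_counts.get(k, 0) for k in range(1, n)]


example_sites = [
    [0, 0, 0, 1],  # derived allele seen once
    [1, 0, 0, 0],  # derived allele seen once
    [1, 1, 0, 0],  # derived allele seen twice
    [1, 1, 1, 0],  # derived allele seen three times
    [0, 0, 0, 0],  # not segregating, ignored
]
print(sample_frequency_spectrum(example_sites))  # prints [2, 1, 1]
```

A multipopulation SFS generalizes this to a multi-dimensional array with one axis per population. The single-population version is shown because it keeps the idea of counting derived alleles per site easy to see.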
SMaSH: A Benchmarking Toolkit for Human Genome Variant Calling
Motivation: Computational methods are essential to extract actionable
information from raw sequencing data, and to thus fulfill the promise of
next-generation sequencing technology. Unfortunately, computational tools
developed to call variants from human sequencing data disagree on many of their
predictions, and current methods to evaluate accuracy and computational
performance are ad-hoc and incomplete. Agreement on benchmarking variant
calling methods would stimulate development of genomic processing tools and
facilitate communication among researchers.
Results: We propose SMaSH, a benchmarking methodology for evaluating human
genome variant calling algorithms. We generate synthetic datasets, organize and
interpret a wide range of existing benchmarking data for real genomes, and
propose a set of accuracy and computational performance metrics for evaluating
variant calling methods on this benchmarking data. Moreover, we illustrate the
utility of SMaSH to evaluate the performance of some leading single nucleotide
polymorphism (SNP), indel, and structural variant calling algorithms.
Availability: We provide free and open access online to the SMaSH toolkit,
along with detailed documentation, at smash.cs.berkeley.edu