39 research outputs found

Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum

Author: Song Yun S.
Terhorst Jonathan
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 15/05/2015

The sample frequency spectrum (SFS) of DNA sequences from a collection of individuals is a summary statistic which is commonly used for parametric inference in population genetics. Despite the popularity of SFS-based inference methods, currently little is known about the information-theoretic limit on the estimation accuracy as a function of sample size. Here, we show that using the SFS to estimate the size history of a population has a minimax error of at least $O(1/\log s)$, where $s$ is the number of independent segregating sites used in the analysis. This rate is exponentially worse than known convergence rates for many classical estimation problems in statistics. Another surprising aspect of our theoretical bound is that it does not depend on the dimension of the SFS, which is related to the number of sampled individuals. This means that, for a fixed number $s$ of segregating sites considered, using more individuals does not help to reduce the minimax error bound. Our result pertains to populations that have experienced a bottleneck, and we argue that it can be expected to apply to many populations in nature.

Comment: 17 pages, 1 figure

arXiv.org e-Print Archive

Crossref

PubMed Central

eScholarship - University of California

The Kalmanson Complex

Author: Terhorst Jonathan
Publication date: 01/01/2011

Let X be a finite set of cardinality n. The Kalmanson complex K_n is the simplicial complex whose vertices are non-trivial X-splits, and whose facets are maximal circular split systems over X. In this paper we examine K_n from three perspectives. In addition to the T-theoretic description, we show that K_n has a geometric realization as the Kalmanson conditions on a finite metric. A third description arises in terms of binary matrices which possess the circular ones property. We prove the equivalence of these three definitions. This leads to a simplified proof of the well-known equivalence between Kalmanson and circular decomposable metrics, as well as a partial description of the f-vector of K_n.

Comment: Improved exposition. 24 pages, 2 figures, 1 table

arXiv.org e-Print Archive

Multi-locus analysis of genomic time series data from experimental evolution.

Author: Schlötterer Christian
Song Yun S
Terhorst Jonathan
Publication venue: eScholarship, University of California
Publication date: 01/04/2015

Genomic time series data generated by evolve-and-resequence (E&R) experiments offer a powerful window into the mechanisms that drive evolution. However, standard population genetic inference procedures do not account for sampling serially over time, and new methods are needed to make full use of modern experimental evolution data. To address this problem, we develop a Gaussian process approximation to the multi-locus Wright-Fisher process with selection over a time course of tens of generations. The mean and covariance structure of the Gaussian process are obtained by computing the corresponding moments in discrete-time Wright-Fisher models conditioned on the presence of a linked selected site. This enables our method to account for the effects of linkage and selection, both along the genome and across sampled time points, in an approximate but principled manner. We first use simulated data to demonstrate the power of our method to correctly detect, locate and estimate the fitness of a selected allele from among several linked sites. We study how this power changes for different values of selection strength, initial haplotypic diversity, population size, sampling frequency, experimental duration, number of replicates, and sequencing coverage depth. In addition to providing quantitative estimates of selection parameters from experimental evolution data, our model can be used by practitioners to design E&R experiments with requisite power. We also explore how our likelihood-based approach can be used to infer other model parameters, including effective population size and recombination rate. Then, we apply our method to analyze genome-wide data from a real E&R experiment designed to study the adaptation of D. melanogaster to a new laboratory environment with alternating cold and hot temperatures.

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

FigShare

Inference of Population History using Coalescent HMMs: Review and Outlook

Author: Song Yun S.
Spence Jeffrey P.
Steinrücken Matthias
Terhorst Jonathan
Publication date: 08/07/2018

Studying how diverse human populations are related is of historical and anthropological interest, in addition to providing a realistic null model for testing for signatures of natural selection or disease associations. Furthermore, understanding the demographic histories of other species is playing an increasingly important role in conservation genetics. A number of statistical methods have been developed to infer population demographic histories using whole-genome sequence data, with recent advances focusing on allowing for more flexible modeling choices, scaling to larger data sets, and increasing statistical power. Here we review coalescent hidden Markov models, a powerful class of population genetic inference methods that can effectively utilize linkage disequilibrium information. We highlight recent advances, give advice for practitioners, point out potential pitfalls, and present possible future research directions.

Comment: 12 pages, 2 figures

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Efficiently inferring the demographic history of many populations with allele count data.

Author: Durbin Richard
Kamm Jack
Song Yun S
Terhorst Jonathan
Publication venue: J Am Stat Assoc
Publication date: 01/01/2020

The sample frequency spectrum (SFS), or histogram of allele counts, is an important summary statistic in evolutionary biology, and is often used to infer the history of population size changes, migrations, and other demographic events affecting a set of populations. The expected multipopulation SFS under a given demographic model can be efficiently computed when the populations in the model are related by a tree, scaling to hundreds of populations. Admixture, back-migration, and introgression are common natural processes that violate the assumption of a tree-like population history, however, and until now the expected SFS could be computed for only a handful of populations when the demographic history is not a tree. In this article, we present a new method for efficiently computing the expected SFS and linear functionals of it, for demographies described by general directed acyclic graphs. This method can scale to more populations than previously possible for complex demographic histories including admixture. We apply our method to an 8-population SFS to estimate the timing and strength of a proposed "basal Eurasian" admixture event in human history. We implement and release our method in a new open-source software package momi2.

Apollo (Cambridge)

SMaSH: A Benchmarking Toolkit for Human Genome Variant Calling

Author: Bresler Ma'ayan
Curtis Kristal
Hartl Christopher
Jordan Michael I.
Liptrap Jesse
Newcomb Julie
Patterson David
Song Yun S.
Talwalkar Ameet
Terhorst Jonathan
Publication date: 05/01/2014

Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms. Availability: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu

arXiv.org e-Print Archive

Crossref

PubMed Central

eScholarship - University of California

Communication-Efficient Distributed Dual Coordinate Ascent.

Author: Hofmann Thomas
Jaggi Martin
Jordan Michael I.
Krishnan Sanjay
Smith Virginia
Takáč Martin
Terhorst Jonathan
Publication date: 21/06/2017

Infoscience - École polytechnique fédérale de Lausanne