Bayesian modeling of recombination events in bacterial populations
Background: We consider the discovery of recombinant segments jointly with their origins within multilocus DNA sequences from bacteria representing heterogeneous populations of fairly closely related species. The currently available methods for recombination detection capable of probabilistic characterization of uncertainty have a limited applicability in practice as the number of
strains in a data set increases.
Results: We introduce a Bayesian spatial structural model representing the continuum of origins over sites within the observed sequences, including a probabilistic characterization of uncertainty related to the origin of any particular site. To enable a statistically accurate and practically feasible approach to the analysis of large-scale data sets representing a single genus, we have developed a novel software tool (BRAT, Bayesian Recombination Tracker) implementing the model and the
corresponding learning algorithm, which is capable of identifying the posterior optimal structure and of estimating the marginal posterior probabilities of putative origins over the sites.
Conclusion: A multitude of challenging simulation scenarios and an analysis of real data from seven
housekeeping genes of 120 strains of genus Burkholderia are used to illustrate the possibilities
offered by our approach. The software is freely available for download at http://web.abo.fi/fak/mnf//mate/jc/software/brat.html
Inference of Population History using Coalescent HMMs: Review and Outlook
Studying how diverse human populations are related is of historical and
anthropological interest, in addition to providing a realistic null model for
testing for signatures of natural selection or disease associations.
Furthermore, understanding the demographic histories of other species is
playing an increasingly important role in conservation genetics. A number of
statistical methods have been developed to infer population demographic
histories using whole-genome sequence data, with recent advances focusing on
allowing for more flexible modeling choices, scaling to larger data sets, and
increasing statistical power. Here we review coalescent hidden Markov models, a
powerful class of population genetic inference methods that can effectively
utilize linkage disequilibrium information. We highlight recent advances, give
advice for practitioners, point out potential pitfalls, and present possible
future research directions.
Comment: 12 pages, 2 figures
Regression approaches for Approximate Bayesian Computation
This book chapter introduces regression approaches and regression adjustment
for Approximate Bayesian Computation (ABC). Regression adjustment adjusts
parameter values after rejection sampling in order to account for the imperfect
match between simulations and observations. Imperfect match between simulations
and observations can be more pronounced when there are many summary statistics,
a phenomenon known as the curse of dimensionality. Because of this imperfect
match, credibility intervals obtained with regression approaches can be
inflated compared to true credibility intervals. The chapter presents the main
concepts underlying regression adjustment. A theorem that compares theoretical
properties of posterior distributions obtained with and without regression
adjustment is presented. Last, a practical application of regression adjustment
in population genetics shows that regression adjustment shrinks posterior
distributions compared to rejection approaches, which is a solution to avoid
inflated credibility intervals.
Comment: Book chapter, published in Handbook of Approximate Bayesian Computation 201
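The regression adjustment described in this abstract can be illustrated with a minimal sketch. It assumes the simplest variant, a linear least-squares adjustment applied after rejection sampling; the function name `abc_regression_adjust`, the `accept_frac` parameter, and the toy model are hypothetical and are not taken from the chapter.

```python
import numpy as np

def abc_regression_adjust(theta, stats, s_obs, accept_frac=0.1):
    """Rejection ABC followed by a linear regression adjustment (illustrative only)."""
    theta = np.asarray(theta, dtype=float)
    stats = np.asarray(stats, dtype=float).reshape(len(theta), -1)
    s_obs = np.atleast_1d(np.asarray(s_obs, dtype=float))

    # Rejection step: keep the simulations whose summaries are closest to s_obs.
    dist = np.linalg.norm(stats - s_obs, axis=1)
    n_keep = max(2, int(accept_frac * len(theta)))
    keep = np.argsort(dist)[:n_keep]
    th, s = theta[keep], stats[keep]

    # Regression step: least-squares fit of theta on the centred summaries.
    X = np.column_stack([np.ones(n_keep), s - s_obs])
    coef, *_ = np.linalg.lstsq(X, th, rcond=None)

    # Adjustment: shift each accepted theta to its predicted value at s = s_obs.
    adjusted = th - (s - s_obs) @ coef[1:]
    return th, adjusted

# Toy model (an assumption for illustration): theta ~ Uniform(0, 10),
# summary statistic = theta + standard normal noise, observed summary = 5.
rng = np.random.default_rng(0)
theta_sim = rng.uniform(0.0, 10.0, size=20000)
stats_sim = theta_sim + rng.normal(0.0, 1.0, size=20000)
accepted, adjusted = abc_regression_adjust(theta_sim, stats_sim, s_obs=5.0, accept_frac=0.2)
print(accepted.std(), adjusted.std())
```

Because the adjusted values are the intercept plus the least-squares residuals, their sample variance cannot exceed that of the accepted values, which matches the shrinkage the abstract reports.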
A nonparametric HMM for genetic imputation and coalescent inference
Genetic sequence data are well described by hidden Markov models (HMMs) in
which latent states correspond to clusters of similar mutation patterns. Theory
from statistical genetics suggests that these HMMs are nonhomogeneous (their
transition probabilities vary along the chromosome) and have large support for
self transitions. We develop a new nonparametric model of genetic sequence
data, based on the hierarchical Dirichlet process, which supports these self
transitions and nonhomogeneity. Our model provides a parameterization of the
genetic process that is more parsimonious than other more general nonparametric
models which have previously been applied to population genetics. We provide
truncation-free MCMC inference for our model using a new auxiliary sampling
scheme for Bayesian nonparametric HMMs. In a series of experiments on male X
chromosome data from the Thousand Genomes Project and also on data simulated
from a population bottleneck we show the benefits of our model over the popular
finite model fastPHASE, which can itself be seen as a parametric truncation of
our model. We find that the number of HMM states found by our model is
correlated with the time to the most recent common ancestor in population
bottlenecks. This work demonstrates the flexibility of Bayesian nonparametrics
applied to large and complex genetic data.
Methods for Assessing Population Relationships and History Using Genomic Data
Genetic data contain a record of our evolutionary history. The availability of
large-scale datasets of human populations from various geographic areas and
timescales, coupled with advances in the computational methods to analyze
these data, has transformed our ability to use genetic data to learn about
our evolutionary past. Here, we review some of the widely used statistical
methods to explore and characterize population relationships and history
using genomic data. We describe the intuition behind commonly used approaches, their interpretation, and important limitations. For illustration, we
apply some of these techniques to genome-wide autosomal data from 929 individuals representing 53 worldwide populations that are part of the Human
Genome Diversity Project. Finally, we discuss the new frontiers in genomic
methods to learn about population history. In sum, this review highlights
the power (and limitations) of DNA to infer features of human evolutionary
history, complementing the knowledge gleaned from other disciplines, such
as archaeology, anthropology, and linguistics.
A Likelihood-Free Inference Framework for Population Genetic Data using Exchangeable Neural Networks
An explosion of high-throughput DNA sequencing in the past decade has led to
a surge of interest in population-scale inference with whole-genome data.
Recent work in population genetics has centered on designing inference methods
for relatively simple model classes, and few scalable general-purpose inference
techniques exist for more realistic, complex models. To achieve this, two
inferential challenges need to be addressed: (1) population data are
exchangeable, calling for methods that efficiently exploit the symmetries of
the data, and (2) computing likelihoods is intractable as it requires
integrating over a set of correlated, extremely high-dimensional latent
variables. These challenges are traditionally tackled by likelihood-free
methods that use scientific simulators to generate datasets and reduce them to
hand-designed, permutation-invariant summary statistics, often leading to
inaccurate inference. In this work, we develop an exchangeable neural network
that performs summary statistic-free, likelihood-free inference. Our framework
can be applied in a black-box fashion across a variety of simulation-based
tasks, both within and outside biology. We demonstrate the power of our
approach on the recombination hotspot testing problem, outperforming the
state-of-the-art.
Comment: 9 pages, 8 figures
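The exchangeability requirement in point (1) can be made concrete with a small sketch. It assumes the generic permutation-invariant construction (a shared per-individual layer, then a symmetric pooling operation, then a readout) and is not the architecture of the paper; `exchangeable_forward` and all weight shapes are hypothetical.

```python
import numpy as np

def exchangeable_forward(X, W1, b1, W2, b2):
    """X has one row per sampled individual and one column per site."""
    # The same weights are applied to every individual (row).
    hidden = np.maximum(0.0, X @ W1 + b1)
    # Symmetric pooling: the mean over individuals ignores their order.
    pooled = hidden.mean(axis=0)
    # Readout on the pooled representation.
    return pooled @ W2 + b2

rng = np.random.default_rng(1)
n_individuals, n_sites, n_hidden, n_out = 8, 20, 16, 2
X = rng.integers(0, 2, size=(n_individuals, n_sites)).astype(float)
W1 = rng.normal(size=(n_sites, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_out))
b2 = np.zeros(n_out)

out = exchangeable_forward(X, W1, b1, W2, b2)
shuffled = exchangeable_forward(X[rng.permutation(n_individuals)], W1, b1, W2, b2)
print(np.allclose(out, shuffled))
```

Reordering the individuals leaves the output unchanged up to floating-point rounding, so no hand-designed permutation-invariant summary statistic is needed to obtain the symmetry.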
Statistical Population Genomics
This open access volume presents state-of-the-art inference methods in population genomics, focusing on data analysis based on rigorous statistical techniques. After introducing general concepts related to the biology of genomes and their evolution, the book covers state-of-the-art methods for the analysis of genomes in populations, including demography inference, population structure analysis and detection of selection, using both model-based inference and simulation procedures. Last but not least, it offers an overview of the current knowledge acquired by applying such methods to a large variety of eukaryotic organisms. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, pointers to the relevant literature, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Statistical Population Genomics aims to promote and ensure successful applications of population genomic methods to an increasing number of model systems and biological questions.
Decoding coalescent hidden Markov models in linear time
In many areas of computational biology, hidden Markov models (HMMs) have been
used to model local genomic features. In particular, coalescent HMMs have been
used to infer ancient population sizes, migration rates, divergence times, and
other parameters such as mutation and recombination rates. As more loci,
sequences, and hidden states are added to the model, however, the runtime of
coalescent HMMs can quickly become prohibitive. Here we present a new algorithm
for reducing the runtime of coalescent HMMs from quadratic in the number of
hidden time states to linear, without making any additional approximations. Our
algorithm can be incorporated into various coalescent HMMs, including the
popular method PSMC for inferring variable effective population sizes. Here we
implement this algorithm to speed up our demographic inference method diCal,
which is equivalent to PSMC when applied to a sample of two haplotypes. We
demonstrate that the linear-time method can reconstruct a population size
change history more accurately than the quadratic-time method, given similar
computation resources. We also apply the method to data from the 1000 Genomes
project, inferring a high-resolution history of size changes in the European
population.
Comment: 18 pages, 5 figures. To appear in the Proceedings of the 18th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2014). The final publication is available at link.springer.co
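To see where the quadratic cost mentioned in this abstract comes from, here is a sketch of the generic textbook forward algorithm for an HMM. It is not the linear-time algorithm of the paper and not code from diCal or PSMC; the function and argument names are hypothetical.

```python
import numpy as np

def forward_loglik(init, trans, emit, obs):
    """Log-likelihood of an observation sequence under an HMM with K hidden states.

    init:  (K,)   initial state probabilities
    trans: (K, K) transition probabilities, rows sum to 1
    emit:  (K, M) emission probabilities, rows sum to 1
    obs:   sequence of integer observation symbols in range(M)
    """
    alpha = init * emit[:, obs[0]]
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha = alpha / scale  # rescale to avoid numerical underflow
    for x in obs[1:]:
        # The vector-matrix product touches all K * K transitions:
        # this is the per-locus cost that is quadratic in K.
        alpha = (alpha @ trans) * emit[:, x]
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = alpha / scale
    return loglik
```

With a dense transition matrix, each locus costs on the order of K squared operations. A linear-time method has to exploit structure in the transition matrix so that the full product is never formed.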
VolcanoFinder: Genomic scans for adaptive introgression
Recent research shows that introgression between closely related species is an important source of adaptive alleles for a wide range of taxa. Typically, detection of adaptive introgression from genomic data relies on comparative analyses that require sequence data from both the recipient and the donor species. However, in many cases, the donor is unknown or the data are not currently available. Here, we introduce a genome-scan method, VolcanoFinder, to detect recent events of adaptive introgression using polymorphism data from the recipient species only. VolcanoFinder detects adaptive introgression sweeps from the pattern of excess intermediate-frequency polymorphism they produce in the flanking region of the genome, a pattern that appears as a volcano shape in pairwise genetic diversity. Using coalescent theory, we derive analytical predictions for these patterns. Based on these results, we develop a composite-likelihood test to detect signatures of adaptive introgression relative to the genomic background. Simulation results show that VolcanoFinder has high statistical power to detect these signatures, even for older sweeps and for soft sweeps initiated by multiple migrant haplotypes. Finally, we apply VolcanoFinder to detect archaic introgression in European and sub-Saharan African human populations and uncover interesting candidates in both populations, such as TSHR in Europeans and TCHH-RPTN in Africans. We discuss their biological implications and provide guidelines for identifying and circumventing artifactual signals during empirical applications of VolcanoFinder.
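The pairwise genetic diversity in which the volcano pattern appears can be computed from haplotype data as sketched below. This only illustrates the standard diversity statistic in windows; it is not VolcanoFinder's composite-likelihood test, and the function names, the 0/1 haplotype encoding, and the windowing scheme are assumptions made for the example.

```python
import numpy as np

def per_site_diversity(haps):
    """Average number of pairwise differences contributed by each site.

    haps: (n, L) array of 0/1 alleles, one row per haplotype.
    """
    haps = np.asarray(haps)
    n = haps.shape[0]
    k = haps.sum(axis=0)  # count of the '1' allele at each site
    # k * (n - k) pairs differ at a site, out of n * (n - 1) / 2 pairs in total.
    return k * (n - k) / (n * (n - 1) / 2.0)

def windowed_diversity(haps, positions, window, start, end):
    """Per-base diversity in consecutive half-open windows covering [start, end)."""
    pi_site = per_site_diversity(haps)
    edges = np.arange(start, end + window, window)
    idx = np.digitize(np.asarray(positions), edges) - 1
    totals = np.zeros(len(edges) - 1)
    for i, p in zip(idx, pi_site):
        if 0 <= i < len(totals):  # ignore sites outside the covered range
            totals[i] += p
    return edges[:-1], totals / window
```

Plotting the windowed values against the window start positions is one simple way to look for a dip in diversity flanked by regions of elevated diversity.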