163 research outputs found
The space of ultrametric phylogenetic trees
The reliability of a phylogenetic inference method from genomic sequence data
is ensured by its statistical consistency. Bayesian inference methods produce a
sample of phylogenetic trees from the posterior distribution given sequence
data. Hence the question of statistical consistency of such methods is
equivalent to the consistency of the summary of the sample. More generally,
statistical consistency is ensured by the tree space used to analyse the
sample.
In this paper, we consider two standard parameterisations of phylogenetic
time-trees used in evolutionary models: inter-coalescent interval lengths and
absolute times of divergence events. For each of these parameterisations we
introduce a natural metric space on ultrametric phylogenetic trees. We compare
the introduced spaces with existing models of tree space and formulate several
formal requirements that a metric space on phylogenetic trees must possess in
order to be a satisfactory space for statistical analysis, and justify them. We
show that only a few known constructions of the space of phylogenetic trees
satisfy these requirements. However, our results suggest that these basic
requirements are not enough to distinguish between the two metric spaces we
introduce and that the choice between metric spaces requires additional
properties to be considered. In particular, the summary tree minimising the
squared distance to the trees in the sample might differ between
parameterisations. This suggests that further fundamental insight is needed
into the problem of statistical consistency of phylogenetic inference methods.
Comment: Minor changes. This version has been published in JTB. 27 pages, 9
figures
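The two parameterisations named in the abstract above can be illustrated with a minimal sketch. It assumes a single fixed ranked topology and a plain Euclidean distance on each coordinate vector; both are simplifying assumptions for illustration and not the metric spaces the paper constructs. The function names are hypothetical.

```python
from itertools import accumulate
from math import dist


def intervals_to_times(intervals):
    """Absolute divergence times are cumulative sums of the interval lengths."""
    return list(accumulate(intervals))


def interval_distance(a, b):
    """Euclidean distance in inter-coalescent interval coordinates (assumed metric)."""
    return dist(a, b)


def time_distance(a, b):
    """Euclidean distance in absolute divergence time coordinates (assumed metric)."""
    return dist(intervals_to_times(a), intervals_to_times(b))


# Three made-up trees on four tips sharing one ranked topology,
# each written as interval lengths from the tips towards the root.
ref = [1.0, 1.0, 1.0]
x = [2.0, 1.0, 1.0]  # longer first interval: shifts every later divergence time
y = [1.0, 1.0, 2.5]  # longer last interval: shifts only the root time

print(interval_distance(ref, x), interval_distance(ref, y))
print(time_distance(ref, x), time_distance(ref, y))
```

In this toy example x is the closer tree to ref in interval coordinates, while y is the closer tree in time coordinates, so the ordering of neighbours already depends on the parameterisation.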
Bayesian phylogenetic estimation of fossil ages
Recent advances have allowed for both morphological fossil evidence and
molecular sequences to be integrated into a single combined inference of
divergence dates under the rule of Bayesian probability. In particular the
fossilized birth-death tree prior and the Lewis-Mk model of discrete
morphological evolution allow for the estimation of both divergence times and
phylogenetic relationships between fossil and extant taxa. We exploit this
statistical framework to investigate the internal consistency of these models
by producing phylogenetic estimates of the age of each fossil in turn, within
two rich and well-characterized data sets of fossil and extant species
(penguins and canids). We find that the estimation accuracy of fossil ages is
generally high, with credible intervals seldom excluding the true age and median
relative error in the two data sets of 5.7% and 13.2%, respectively. The median
relative standard deviation (RSD) was 9.2% and 7.2%, respectively, suggesting good
precision, although with some outliers. In fact, in the two data sets we analyze,
the phylogenetic estimate of fossil age is on average < 2 My from the midpoint
age of the geological strata from which the fossil was excavated. The high level of
internal consistency found in our analyses suggests that the Bayesian
statistical model employed is an adequate fit for both the geological and
morphological data, and provides evidence from real data that the framework
used can accurately model the evolution of discrete morphological traits coded
from fossil and extant taxa. We anticipate that this approach will have diverse
applications beyond divergence time dating, including dating fossils that are
temporally unconstrained, testing of the "morphological clock", and for
uncovering potential model misspecification and/or data errors when
controversial phylogenetic hypotheses are obtained based on combined divergence
dating analyses.
Comment: 28 pages, 8 figures
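The accuracy and precision summaries quoted in the abstract above can be sketched as follows. The definitions are assumptions made for illustration: relative error is taken as the absolute difference between the posterior median and the reference age, divided by the reference age, and RSD as the posterior standard deviation divided by the posterior mean. The paper may define them differently, and the sample values are made up.

```python
from statistics import mean, median, stdev


def relative_error(posterior_ages, reference_age):
    """|posterior median - reference age| / reference age (assumed definition)."""
    return abs(median(posterior_ages) - reference_age) / reference_age


def relative_sd(posterior_ages):
    """Sample standard deviation divided by the mean (assumed definition of RSD)."""
    return stdev(posterior_ages) / mean(posterior_ages)


# Hypothetical posterior sample of one fossil's age, in My.
samples = [9.0, 10.0, 11.0, 10.5, 9.5]
reference_age = 10.5  # hypothetical midpoint of the stratigraphic range

print(f"relative error: {relative_error(samples, reference_age):.3f}")
print(f"RSD: {relative_sd(samples):.3f}")
```

Taking the median of these per-fossil values over all fossils in a data set would give summaries of the kind reported above.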
BEAST: Bayesian evolutionary analysis by sampling trees
Background: The evolutionary analysis of molecular sequence variation is a statistical enterprise. This is reflected in the increased use of probabilistic models for phylogenetic inference, multiple sequence alignment, and molecular population genetics. Here we present BEAST: a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree. A large number of popular stochastic models of sequence evolution are provided and tree-based models suitable for both within- and between-species sequence data are implemented.
Results: BEAST version 1.4.6 consists of 81000 lines of Java source code, 779 classes and 81 packages. It provides models for DNA and protein sequence evolution, highly parametric coalescent analysis, relaxed clock phylogenetics, non-contemporaneous sequence data, statistical alignment and a wide range of options for prior distributions. BEAST source code is object-oriented, modular in design and freely available at http://beast-mcmc.googlecode.com/ under the GNU LGPL license.
Conclusion: BEAST is a powerful and flexible evolutionary analysis package for molecular sequence variation. It also provides a resource for the further development of new models and statistical methods of evolutionary analysis.
Calibrated Tree Priors for Relaxed Phylogenetics and Divergence Time Estimation
The use of fossil evidence to calibrate divergence time estimation has a long
history. More recently Bayesian MCMC has become the dominant method of
divergence time estimation and fossil evidence has been re-interpreted as the
specification of prior distributions on the divergence times of calibration
nodes. These so-called "soft calibrations" have become widely used but the
statistical properties of calibrated tree priors in a Bayesian setting have not
been carefully investigated. Here we clarify that calibration densities, such
as those defined in BEAST 1.5, do not represent the marginal prior distribution
of the calibration node. We illustrate this with a number of analytical results
on small trees. We also describe an alternative construction for a calibrated
Yule prior on trees that allows direct specification of the marginal prior
distribution of the calibrated divergence time, with or without the restriction
of monophyly. This method requires the computation of the Yule prior
conditional on the height of the divergence being calibrated. Unfortunately, a
practical solution for multiple calibrations remains elusive. Our results
suggest that direct estimation of the prior induced by specifying multiple
calibration densities should be a prerequisite of any divergence time dating
analysis.
Simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth-death SIR model
The evolution of RNA viruses such as HIV, Hepatitis C and Influenza virus
occurs so rapidly that the viruses' genomes contain information on past
ecological dynamics. Hence, we develop a phylodynamic method that enables the
joint estimation of epidemiological parameters and phylogenetic history. Based
on a compartmental susceptible-infected-removed (SIR) model, this method
provides separate information on incidence and prevalence of infections.
Detailed information on the interaction of host population dynamics and
evolutionary history can inform decisions on how to contain or entirely avoid
disease outbreaks.
We apply our Birth-Death SIR method (BDSIR) to two viral data sets. First,
five human immunodeficiency virus type 1 clusters sampled in the United Kingdom
between 1999 and 2003 are analyzed. The estimated basic reproduction ratios
range from 1.9 to 3.2 among the clusters. All clusters show a decline in the
growth rate of the local epidemic in the middle or end of the 1990s.
The analysis of a hepatitis C virus (HCV) genotype 2c data set shows that the
local epidemic in the Córdoban city Cruz del Eje originated around 1906
(median), coinciding with an immigration wave from Europe to central Argentina
that dates from 1880–1920. The estimated time of epidemic peak is around 1970.
Comment: Journal link:
http://rsif.royalsocietypublishing.org/content/11/94/20131106.ful
Bayesian inference of population size history from multiple loci
Background: Effective population size (N_e) is related to genetic variability and is a basic parameter in many models of population genetics. A number of methods for inferring current and past population sizes from genetic data have been developed since JFC Kingman introduced the n-coalescent in 1982. Here we present the Extended Bayesian Skyline Plot, a non-parametric Bayesian Markov chain Monte Carlo algorithm that extends a previous coalescent-based method in several ways, including the ability to analyze multiple loci.
Results: Through extensive simulations we show the accuracy and limitations of inferring population size as a function of the amount of data, including recovering information about evolutionary bottlenecks. We also analyzed two real data sets to demonstrate the behavior of the new method: a single-gene Hepatitis C virus data set sampled from Egypt and a 10-locus Drosophila ananassae data set representing 16 different populations.
Conclusion: The results demonstrate the essential role of multiple loci in recovering population size dynamics. Multi-locus data from a small number of individuals can precisely recover past bottlenecks in population size which cannot be characterized by analysis of a single locus. We also demonstrate that sequence data quality is important because even moderate levels of sequencing errors result in a considerable decrease in estimation accuracy for realistic levels of population genetic variability.
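To make the link between coalescent intervals and population size concrete, here is a minimal sketch of a simple moment estimator under Kingman's coalescent. It is not the Extended Bayesian Skyline Plot described above. It assumes a haploid population, time measured in generations, a single locus, and a known genealogy, so that the expected waiting time while k lineages remain is N / C(k, 2). The function name and the interval values are hypothetical.

```python
from math import comb


def skyline_estimates(coalescent_intervals):
    """Per-interval population size estimates from one genealogy.

    coalescent_intervals: waiting times between successive coalescent events,
    ordered from the present backwards, so the first interval has n lineages.
    Under the assumptions above E[w_k] = N / C(k, 2), giving N_hat = w_k * C(k, 2).
    """
    n = len(coalescent_intervals) + 1
    estimates = []
    for i, waiting_time in enumerate(coalescent_intervals):
        k = n - i  # number of lineages present during this interval
        estimates.append(waiting_time * comb(k, 2))
    return estimates


# Made-up intervals (in generations) for a genealogy of four sequences.
print(skyline_estimates([10.0, 30.0, 100.0]))  # [60.0, 90.0, 100.0]
```

Each estimate rests on a single waiting time and is therefore very noisy, which is one intuition for why pooling information across multiple loci helps.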
Bayesian random local clocks, or one rate to rule them all
Background: Relaxed molecular clock models allow divergence time dating and "relaxed phylogenetic" inference, in which a time tree is estimated in the face of unequal rates across lineages. We present a new method for relaxing the assumption of a strict molecular clock using Markov chain Monte Carlo to implement Bayesian model averaging over random local molecular clocks. The new method approaches the problem of rate variation among lineages by proposing a series of local molecular clocks, each extending over a subregion of the full phylogeny. Each branch in a phylogeny (subtending a clade) is a possible location for a change of rate from one local clock to a new one. Thus, including both the global molecular clock and the unconstrained model results, there are a total of 2^(2n-2) possible rate models available for averaging with 1, 2, ..., 2n-2 different rate categories.
Results: We propose an efficient method to sample this model space while simultaneously estimating the phylogeny. The new method conveniently allows a direct test of the strict molecular clock, in which one rate rules them all, against a large array of alternative local molecular clock models. We illustrate the method's utility on three example data sets involving mammal, primate and influenza evolution. Finally, we explore methods to visualize the complex posterior distribution that results from inference under such models.
Conclusions: The examples suggest that large sequence datasets may only require a small number of local molecular clocks to reconcile their branch lengths with a time scale. All of the analyses described here are implemented in the open access software package BEAST 1.5.4 (http://beast-mcmc.googlecode.com/).
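The model count quoted above follows from a short counting argument: a rooted binary tree with n tips has 2n-2 branches, and each branch either carries a rate change or does not. The sketch below only reproduces that arithmetic; the function names are hypothetical and are not part of BEAST.

```python
from math import comb


def num_branches(n_tips):
    """A rooted binary tree with n tips has 2n - 2 branches."""
    return 2 * n_tips - 2


def num_rate_models(n_tips):
    """Each branch independently has or lacks a rate change: 2^(2n-2) models."""
    return 2 ** num_branches(n_tips)


def num_models_with_changes(n_tips, k):
    """Number of models with exactly k branches carrying a rate change."""
    return comb(num_branches(n_tips), k)


n = 4
print(num_rate_models(n))  # 64
# The strict clock is the single model with no rate changes.
print(num_models_with_changes(n, 0))  # 1
# Summing over every possible number of changes recovers the total.
print(sum(num_models_with_changes(n, k) for k in range(num_branches(n) + 1)))  # 64
```

The count doubles with every added branch, which is why the abstract stresses the need for an efficient way to sample this model space.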