3,360 research outputs found
Consistency and convergence rate of phylogenetic inference via regularization
It is common in phylogenetics to have some, perhaps partial, information
about the overall evolutionary tree of a group of organisms and wish to find an
evolutionary tree of a specific gene for those organisms. There may not be
enough information in the gene sequences alone to accurately reconstruct the
correct "gene tree." Although the gene tree may deviate from the "species tree"
due to a variety of genetic processes, in the absence of evidence to the
contrary it is parsimonious to assume that they agree. A common statistical
approach in these situations is to develop a likelihood penalty to incorporate
such additional information. Recent studies using simulation and empirical data
suggest that a likelihood penalty quantifying concordance with a species tree
can significantly improve the accuracy of gene tree reconstruction compared to
using sequence data alone. However, the consistency of such an approach has not
yet been established, nor have convergence rates been bounded. Because
phylogenetics is a non-standard inference problem, the standard theory does not
apply. In this paper, we propose a penalized maximum likelihood estimator for
gene tree reconstruction, where the penalty is the square of the
Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species
tree. We prove that this method is consistent, and derive its convergence rate
for estimating the discrete gene tree structure and continuous edge lengths
(representing the amount of evolution that has occurred on that branch)
simultaneously. We find that the regularized estimator is "adaptive fast
converging," meaning that it can reconstruct all edges of length greater than
any given threshold from gene sequences of polynomial length. Our method does
not require the species tree to be known exactly; in fact, our asymptotic
theory holds for any such guide tree.Comment: 34 pages, 5 figures. To appear on The Annals of Statistic
The space of ultrametric phylogenetic trees
The reliability of a phylogenetic inference method from genomic sequence data
is ensured by its statistical consistency. Bayesian inference methods produce a
sample of phylogenetic trees from the posterior distribution given sequence
data. Hence the question of statistical consistency of such methods is
equivalent to the consistency of the summary of the sample. More generally,
statistical consistency is ensured by the tree space used to analyse the
sample.
In this paper, we consider two standard parameterisations of phylogenetic
time-trees used in evolutionary models: inter-coalescent interval lengths and
absolute times of divergence events. For each of these parameterisations we
introduce a natural metric space on ultrametric phylogenetic trees. We compare
the introduced spaces with existing models of tree space and formulate several
formal requirements that a metric space on phylogenetic trees must possess in
order to be a satisfactory space for statistical analysis, and justify them. We
show that only a few known constructions of the space of phylogenetic trees
satisfy these requirements. However, our results suggest that these basic
requirements are not enough to distinguish between the two metric spaces we
introduce and that the choice between metric spaces requires additional
properties to be considered. Particularly, that the summary tree minimising the
square distance to the trees from the sample might be different for different
parameterisations. This suggests that further fundamental insight is needed
into the problem of statistical consistency of phylogenetic inference methods.Comment: Minor changes. This version has been published in JTB. 27 pages, 9
figure
Latent tree models
Latent tree models are graphical models defined on trees, in which only a
subset of variables is observed. They were first discussed by Judea Pearl as
tree-decomposable distributions to generalise star-decomposable distributions
such as the latent class model. Latent tree models, or their submodels, are
widely used in: phylogenetic analysis, network tomography, computer vision,
causal modeling, and data clustering. They also contain other well-known
classes of models like hidden Markov models, Brownian motion tree model, the
Ising model on a tree, and many popular models used in phylogenetics. This
article offers a concise introduction to the theory of latent tree models. We
emphasise the role of tree metrics in the structural description of this model
class, in designing learning algorithms, and in understanding fundamental
limits of what and when can be learned
The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.Comment: 41 pages, 4 figure
- …