Tropical Geometry of Phylogenetic Tree Space: A Statistical Perspective
Phylogenetic trees are the fundamental mathematical representation of
evolutionary processes in biology. As data objects, they are characterized by
the challenges associated with "big data," as well as the complication that
their discrete geometric structure results in a non-Euclidean phylogenetic tree
space, which poses computational and statistical limitations. We propose and
study a novel framework for analyzing sets of phylogenetic trees based on tropical
geometry. In particular, we focus on characterizing our framework for
statistical analyses of evolutionary biological processes represented by
phylogenetic trees. Our setting exhibits analytic, geometric, and topological
properties that are desirable for theoretical studies in probability and
statistics, as well as increased computational efficiency over the current
state-of-the-art. We demonstrate our approach on seasonal influenza data.
Comment: 28 pages, 5 figures, 1 table
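As a rough illustration of the geometry this abstract refers to, the sketch below computes a tropical distance between two trees. The abstract does not name a metric, so the sketch assumes the standard tropical metric on the tropical projective torus, d(u, v) = max_i(u_i - v_i) - min_i(u_i - v_i), applied to vectors of pairwise leaf-to-leaf distances. The function name and the example vectors are hypothetical.

```python
def tropical_distance(u, v):
    """Tropical distance between two equal-length vectors.

    Assumption: each tree is encoded as a vector of pairwise
    leaf-to-leaf distances, listed in a fixed order of leaf pairs.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [a - b for a, b in zip(u, v)]
    # The value is unchanged when a constant is added to every
    # coordinate of u or of v (projective invariance).
    return max(diffs) - min(diffs)


# Two hypothetical ultrametrics on three leaves, ordered (d12, d13, d23).
u = [2.0, 4.0, 4.0]
v = [4.0, 2.0, 4.0]
print(tropical_distance(u, v))
```

Shifting every coordinate of `u` by the same constant leaves the result unchanged, which is why the metric is defined on equivalence classes of vectors rather than on the vectors themselves.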
Computing the Distribution of a Tree Metric
The Robinson-Foulds (RF) distance is by far the most widely used measure of
dissimilarity between trees. Although the distribution of these distances has
been investigated for twenty years, an algorithm that is explicitly polynomial
time has yet to be described for computing this distribution (which is also the
distribution of trees around a given tree under the popular Robinson-Foulds
metric). In this paper we derive a polynomial-time algorithm for this
distribution. We show how the distribution can be approximated by a Poisson
distribution determined by the proportion of leaves that lie in 'cherries' of
the given tree. We also describe how our results can be used to derive
normalization constants that are required in a recently-proposed maximum
likelihood approach to supertree construction.
Comment: 16 pages, 3 figures
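For readers unfamiliar with the quantities in this abstract, here is a minimal sketch of the Robinson-Foulds distance and of cherry counting. It is not the polynomial-time algorithm for the distribution that the paper derives. It assumes unrooted trees written as nested tuples of leaf labels (rooted arbitrarily) and takes the RF distance to be the number of nontrivial splits found in exactly one of the two trees. All function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Collect the nontrivial splits (bipartitions) of an unrooted tree.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Skip trivial splits (a single leaf against the rest) and the root.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of nontrivial splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(nontrivial_splits(tree_a) ^ nontrivial_splits(tree_b))


def cherry_count(tree):
    """Count cherries: a split with a side of size two marks one cherry."""
    n = len(leaf_set(tree))
    return sum((len(s) == 2) + (n - len(s) == 2)
               for s in nontrivial_splits(tree))


tree_a = ((("a", "b"), "c"), ("d", "e"))
tree_b = ((("a", "c"), "b"), ("d", "e"))
print(robinson_foulds(tree_a, tree_b))  # the trees differ in one split each
print(cherry_count(tree_a))             # cherries {a, b} and {d, e}
```

Counting cherries through splits rather than through sibling leaves in the tuple avoids missing a cherry that straddles the arbitrary root.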
On the inference of large phylogenies with long branches: How long is too long?
Recent work has highlighted deep connections between sequence-length
requirements for high-probability phylogeny reconstruction and the related
problem of the estimation of ancestral sequences. In [Daskalakis et al.'09],
building on the work of [Mossel'04], a tight sequence-length requirement was
obtained for the CFN model. In particular the required sequence length for
high-probability reconstruction was shown to undergo a sharp transition (from
$O(\log n)$ to $\mathrm{poly}(n)$, where $n$ is the number of leaves) at the
"critical" branch length \critmlq (if it exists) of the ancestral
reconstruction problem.
Here we consider the GTR model. For this model, recent results of [Roch'09]
show that the tree can be accurately reconstructed with sequences of length
$O(\log n)$ when the branch lengths are below \critksq, known as the
Kesten-Stigum (KS) bound. Although for the CFN model \critmlq = \critksq, it
is known that for the more general GTR models one has \critmlq \geq \critksq
with a strict inequality in many cases. Here, we show that this phenomenon also
holds for phylogenetic reconstruction by exhibiting a family of symmetric
models and a phylogenetic reconstruction algorithm which recovers the tree
from $O(\log n)$-length sequences for some branch lengths in the range
(\critksq,\critmlq). Second, we prove that phylogenetic reconstruction under
GTR models requires a polynomial sequence-length for branch lengths above
\critmlq.
Tracing evolutionary links between species
The idea that all life on earth traces back to a common beginning dates back
at least to Charles Darwin's Origin of Species. Ever since, biologists
have tried to piece together parts of this 'tree of life' based on what we can
observe today: fossils, and the evolutionary signal that is present in the
genomes and phenotypes of different organisms. Mathematics has played a key
role in helping transform genetic data into phylogenetic (evolutionary) trees
and networks. Here, I will explain some of the central concepts and basic
results in phylogenetics, which benefit from several branches of mathematics,
including combinatorics, probability and algebra.
Comment: 18 pages, 6 figures (Invited review paper (draft version) for AMM
Learning Latent Tree Graphical Models
We study the problem of learning a latent tree graphical model where samples
are available only from a subset of variables. We propose two consistent and
computationally efficient algorithms for learning minimal latent trees, that
is, trees without any redundant hidden nodes. Unlike many existing methods, the
observed nodes (or variables) are not constrained to be leaf nodes. Our first
algorithm, recursive grouping, builds the latent tree recursively by
identifying sibling groups using so-called information distances. One of the
main contributions of this work is our second algorithm, which we refer to as
CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree
over the observed variables is constructed. This global step groups the
observed nodes that are likely to be close to each other in the true latent
tree, thereby guiding subsequent recursive grouping (or equivalent procedures)
on much smaller subsets of variables. This results in more accurate and
efficient learning of latent trees. We also present regularized versions of our
algorithms that learn latent tree approximations of arbitrary distributions. We
compare the proposed algorithms to other methods by performing extensive
numerical experiments on various latent tree graphical models such as hidden
Markov models and star graphs. In addition, we demonstrate the applicability of
our methods on real-world datasets by modeling the dependency structure of
monthly stock returns in the S&P index and of the words in the 20 newsgroups
dataset.
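The pre-processing step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes jointly Gaussian variables, takes the information distance between two variables to be -log|rho_ij| (with rho_ij their correlation coefficient), and builds the tree over the observed variables as a minimum spanning tree over those distances. Function names and the example correlation matrix are hypothetical.

```python
import math


def information_distances(corr):
    """Map a correlation matrix to distances d_ij = -log|rho_ij|.

    Assumption: Gaussian variables. Uncorrelated pairs get distance inf.
    """
    n = len(corr)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                r = abs(corr[i][j])
                dist[i][j] = -math.log(r) if r > 0 else math.inf
    return dist


def minimum_spanning_tree(dist):
    """Prim's algorithm; returns a sorted list of edges (i, j) with i < j."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leaving the current tree.
        a, b = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda edge: dist[edge[0]][edge[1]],
        )
        edges.append((min(a, b), max(a, b)))
        in_tree.add(b)
    return sorted(edges)


# Hypothetical example: a Markov chain 0 - 1 - 2 - 3 in which neighbouring
# variables have correlation 0.8, so corr[i][j] = 0.8 ** |i - j|.
corr = [[0.8 ** abs(i - j) for j in range(4)] for i in range(4)]
tree_edges = minimum_spanning_tree(information_distances(corr))
print(tree_edges)
```

Because the distances in this example add up along the chain, the spanning tree recovers the chain itself. A full method would then refine the neighbourhood of each internal node to introduce hidden nodes.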