Sequence-Length Requirement of Distance-Based Phylogeny Reconstruction: Breaking the Polynomial Barrier
We introduce a new distance-based phylogeny reconstruction technique which
provably achieves, at sufficiently short branch lengths, a polylogarithmic
sequence-length requirement -- improving significantly over previous polynomial
bounds for distance-based methods. The technique is based on an averaging
procedure that implicitly reconstructs ancestral sequences.
By the same token, we extend previous results on phase transitions in
phylogeny reconstruction to general time-reversible models. More precisely, we
show that in the so-called Kesten-Stigum zone (roughly, a region of the
parameter space where ancestral sequences are well approximated by "linear
combinations" of the observed sequences) sequences of length poly(log n)
suffice for reconstruction when branch lengths are discretized. Here n is the
number of extant species.
Our results challenge, to some extent, the conventional wisdom that estimates
of evolutionary distances alone carry significantly less information about
phylogenies than full sequence datasets.
Mean and Variance of Phylogenetic Trees
We describe the use of the Frechet mean and variance in the
Billera-Holmes-Vogtmann (BHV) treespace to summarize and explore the diversity
of a set of phylogenetic trees. We show that the Frechet mean is comparable to
other summary methods, and, despite its stickiness property, is more likely to
be binary than the majority-rules consensus tree. We show that the Frechet
variance is faster and more precise than commonly used variance measures. The
Frechet mean and variance are more theoretically justified, and more robust,
than previous estimates of this type, and can be estimated reasonably
efficiently, providing a foundation for building more advanced statistical
methods and leading to applications such as mean hypothesis testing.
Comment: 26 pages, 12 figures; revisions include new dataset, improved
exposition
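For reference, the Frechet mean and variance named in the abstract above are usually defined as below. This is a sketch under assumed notation, not taken from the paper: T_1, ..., T_n denote the sample trees, \mathcal{T} the BHV treespace, and d the geodesic distance on it. The 1/n normalisation is one common convention and may differ from the one the authors use.

```latex
% Requires amsmath for \operatorname*.
\[
\mu \;=\; \operatorname*{arg\,min}_{T \in \mathcal{T}} \sum_{i=1}^{n} d(T, T_i)^2,
\qquad
\sigma^2 \;=\; \frac{1}{n} \sum_{i=1}^{n} d(\mu, T_i)^2 .
\]
```

Because BHV treespace is a Hadamard space, the sum of squared distances is strictly convex along geodesics, so the minimiser \mu is unique.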
Fruit flies and moduli: interactions between biology and mathematics
Possibilities for using geometry and topology to analyze statistical problems
in biology raise a host of novel questions in geometry, probability, algebra,
and combinatorics that demonstrate the power of biology to influence the future
of pure mathematics. This expository article is a tour through some biological
explorations and their mathematical ramifications. The article starts with
evolution of novel topological features in wing veins of fruit flies, which are
quantified using the algebraic structure of multiparameter persistent homology.
The statistical issues involved highlight mathematical implications of sampling
from moduli spaces. These lead to geometric probability on stratified spaces,
including the sticky phenomenon for Frechet means and the origin of this
mathematical area in the reconstruction of phylogenetic trees.
Comment: 10 pages, 2 figures (consisting of 5 .jpg images); accepted at
Notices of the American Mathematical Society
A scale-free method for testing the proportionality of branch lengths between two phylogenetic trees
We introduce a scale-free method for testing the proportionality of branch
lengths between two phylogenetic trees that have the same topology and contain
the same set of taxa. This method scales both trees to a total length of 1 and
sums up the differences for each branch. Compared to previous methods, ours
yields a fully symmetrical score that measures proportionality without being
affected by scale. We call this score the normalized tree distance (NTD). Based
on real data, we demonstrate that NTD scores are distributed unimodally, in a
manner similar to a lognormal distribution. The NTD score can be used to, for
example, detect co-evolutionary processes and measure the accuracy of branch
length estimates.
Comment: 13 pages of main text, 2 tables and 4 figures
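The scoring step described in the abstract above (scale both trees to total length 1, then sum the per-branch differences) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `normalized_tree_distance`, the dict-of-branch-lengths input, and the use of absolute differences with no further normalising constant are all assumptions made here.

```python
def normalized_tree_distance(lengths_a, lengths_b):
    """Sketch of an NTD-style score for two trees with the same topology.

    Each argument maps a branch identifier to its length. Both trees are
    rescaled to total length 1, then the absolute per-branch differences
    are summed (absolute values are an assumption of this sketch).
    """
    if lengths_a.keys() != lengths_b.keys():
        raise ValueError("trees must share the same set of branches")

    total_a = sum(lengths_a.values())
    total_b = sum(lengths_b.values())
    if total_a <= 0 or total_b <= 0:
        raise ValueError("total tree length must be positive")

    # Compare each branch as a fraction of its own tree's total length.
    return sum(
        abs(lengths_a[branch] / total_a - lengths_b[branch] / total_b)
        for branch in lengths_a
    )


if __name__ == "__main__":
    tree_a = {"e1": 1.0, "e2": 1.0, "e3": 2.0}
    tree_b = {"e1": 2.0, "e2": 2.0, "e3": 4.0}  # tree_a scaled by 2
    tree_c = {"e1": 1.0, "e2": 1.0, "e3": 1.0}
    print(normalized_tree_distance(tree_a, tree_b))
    print(normalized_tree_distance(tree_a, tree_c))
```

Under these assumptions, rescaling either tree leaves the score unchanged and swapping the arguments gives the same value, which matches the scale-free and symmetric properties the abstract claims.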
Dynamic Geodesics in Treespace via Parametric Maximum Flow
Shortest paths in treespace, which represent minimal deformations between
trees, are unique and can be computed in polynomial time. The ability to
quickly compute shortest paths has enabled new approaches for statistical
analysis of populations of trees and phylogenetic inference. This paper gives a
new algorithm for updating geodesic paths when the end points are dynamic. Such
algorithms will be especially useful when optimizing for objectives that are
functions of distances from a search point to other points, e.g. for finding a
tree which has the minimum average distance to a collection of trees. Our
method for updating treespace shortest paths is based on parametric sensitivity
analysis of the maximum flow subproblems that are optimized when solving for a
treespace geodesic.
The space of ultrametric phylogenetic trees
The reliability of a phylogenetic inference method from genomic sequence data
is ensured by its statistical consistency. Bayesian inference methods produce a
sample of phylogenetic trees from the posterior distribution given sequence
data. Hence the question of statistical consistency of such methods is
equivalent to the consistency of the summary of the sample. More generally,
statistical consistency is ensured by the tree space used to analyse the
sample.
In this paper, we consider two standard parameterisations of phylogenetic
time-trees used in evolutionary models: inter-coalescent interval lengths and
absolute times of divergence events. For each of these parameterisations we
introduce a natural metric space on ultrametric phylogenetic trees. We compare
the introduced spaces with existing models of tree space and formulate several
formal requirements that a metric space on phylogenetic trees must possess in
order to be a satisfactory space for statistical analysis, and justify them. We
show that only a few known constructions of the space of phylogenetic trees
satisfy these requirements. However, our results suggest that these basic
requirements are not enough to distinguish between the two metric spaces we
introduce and that the choice between metric spaces requires additional
properties to be considered. In particular, the summary tree minimising the
square distance to the trees from the sample might be different for different
parameterisations. This suggests that further fundamental insight is needed
into the problem of statistical consistency of phylogenetic inference methods.
Comment: Minor changes. This version has been published in JTB. 27 pages, 9
figures
Computing medians and means in Hadamard spaces
The geometric median as well as the Frechet mean of points in an Hadamard
space are important in both theory and applications. Surprisingly, no
algorithms for their computation are hitherto known. To address this issue, we
use a split version of the proximal point algorithm for minimizing a sum of
convex functions and prove that this algorithm produces a sequence converging
to a minimizer of the objective function, which extends a recent result of D.
Bertsekas (2001) into Hadamard spaces. The method is quite robust and not only
does it yield algorithms for the median and the mean, but it also applies to
various other optimization problems. We moreover show that another algorithm
for computing the Frechet mean can be derived from the law of large numbers due
to K.-T. Sturm (2002). In applications, computing medians and means is probably
most needed in tree space, which is an instance of an Hadamard space, invented
by Billera, Holmes, and Vogtmann (2001) as a tool for averaging phylogenetic
trees. It turns out, however, that it can also be used to model numerous other
tree-like structures. Since there now exists a polynomial-time algorithm for
computing geodesics in tree space due to M. Owen and S. Provan (2011), we
obtain efficient algorithms for computing medians and means, which can be
directly used in practice.
Comment: Corrected version. Accepted in SIAM Journal on Optimization
Topological metrizations of trees, and new quartet methods of tree inference
Topological phylogenetic trees can be assigned edge weights in several
natural ways, highlighting different aspects of the tree. Here the rooted
triple and quartet metrizations are introduced, and applied to formulate novel
fast methods of inferring large trees from rooted triple and quartet data.
These methods can be applied in new statistically consistent procedures for
inference of a species tree from gene trees under the multispecies coalescent
model.
Comment: Final version
An algorithm for constructing principal geodesics in phylogenetic treespace
Most phylogenetic analyses result in a sample of trees, but summarizing and
visualizing these samples can be challenging. Consensus trees often provide
limited information about a sample, and so methods such as consensus networks,
clustering and multidimensional scaling have been developed and applied to tree
samples. This paper describes a stochastic algorithm for constructing a
principal geodesic or line through treespace which is analogous to the first
principal component in standard Principal Components Analysis. A principal
geodesic summarizes the most variable features of a sample of trees, in terms
of both tree topology and branch lengths, and it can be visualized as an
animation of smoothly changing trees. The algorithm performs a stochastic
search through parameter space for a geodesic which minimises the sum of
squared projected distances of the data points. This procedure aims to identify
the globally optimal principal geodesic, though convergence to locally optimal
geodesics is possible. The methodology is illustrated by constructing principal
geodesics for experimental and simulated data sets, demonstrating the insight
into samples of trees that can be gained and how the method improves on a
previously published approach. A Java package called GeoPhytter for
constructing and visualising principal geodesics is freely available from
www.ncl.ac.uk/~ntmwn/geophytter.
Comment: 6 figures, IEEE/ACM Transactions on Computational Biology and
Bioinformatics, Vol. 11, No. 2, 201
Hypothesis Testing For Network Data in Functional Neuroimaging
In recent years, it has become common practice in neuroscience to use
networks to summarize relational information in a set of measurements,
typically assumed to be reflective of either functional or structural
relationships between regions of interest in the brain. One of the most basic
tasks of interest in the analysis of such data is the testing of hypotheses, in
answer to questions such as "Is there a difference between the networks of
these two groups of subjects?" In the classical setting, where the unit of
interest is a scalar or a vector, such questions are answered through the use
of familiar two-sample testing strategies. Networks, however, are not Euclidean
objects, and hence classical methods do not directly apply. We address this
challenge by drawing on concepts and techniques from geometry, and
high-dimensional statistical inference. Our work is based on a precise
geometric characterization of the space of graph Laplacian matrices and a
nonparametric notion of averaging due to Fréchet. We motivate and illustrate
our resulting methodologies for testing in the context of networks derived from
functional neuroimaging data on human subjects from the 1000 Functional
Connectomes Project. In particular, we show that this global test is more
statistically powerful than a mass-univariate approach. In addition, we have
also provided a method for visualizing the individual contribution of each edge
to the overall test statistic.
Comment: 34 pages. 5 figures