2,625 research outputs found

    Sequence-Length Requirement of Distance-Based Phylogeny Reconstruction: Breaking the Polynomial Barrier

    We introduce a new distance-based phylogeny reconstruction technique which provably achieves, at sufficiently short branch lengths, a polylogarithmic sequence-length requirement -- improving significantly over previous polynomial bounds for distance-based methods. The technique is based on an averaging procedure that implicitly reconstructs ancestral sequences. By the same token, we extend previous results on phase transitions in phylogeny reconstruction to general time-reversible models. More precisely, we show that in the so-called Kesten-Stigum zone (roughly, a region of the parameter space where ancestral sequences are well approximated by "linear combinations" of the observed sequences) sequences of length poly(log n) suffice for reconstruction when branch lengths are discretized. Here n is the number of extant species. Our results challenge, to some extent, the conventional wisdom that estimates of evolutionary distances alone carry significantly less information about phylogenies than full sequence datasets.

    Mean and Variance of Phylogenetic Trees

    We describe the use of the Fréchet mean and variance in the Billera-Holmes-Vogtmann (BHV) treespace to summarize and explore the diversity of a set of phylogenetic trees. We show that the Fréchet mean is comparable to other summary methods, and, despite its stickiness property, is more likely to be binary than the majority-rules consensus tree. We show that the Fréchet variance is faster and more precise than commonly used variance measures. The Fréchet mean and variance are more theoretically justified, and more robust, than previous estimates of this type, and can be estimated reasonably efficiently, providing a foundation for building more advanced statistical methods and leading to applications such as mean hypothesis testing. Comment: 26 pages, 12 figures; revisions include new dataset, improved exposition

    Fruit flies and moduli: interactions between biology and mathematics

    Possibilities for using geometry and topology to analyze statistical problems in biology raise a host of novel questions in geometry, probability, algebra, and combinatorics that demonstrate the power of biology to influence the future of pure mathematics. This expository article is a tour through some biological explorations and their mathematical ramifications. The article starts with evolution of novel topological features in wing veins of fruit flies, which are quantified using the algebraic structure of multiparameter persistent homology. The statistical issues involved highlight mathematical implications of sampling from moduli spaces. These lead to geometric probability on stratified spaces, including the sticky phenomenon for Fréchet means and the origin of this mathematical area in the reconstruction of phylogenetic trees. Comment: 10 pages, 2 figures (consisting of 5 .jpg images); accepted at Notices of the American Mathematical Society

    A scale-free method for testing the proportionality of branch lengths between two phylogenetic trees

    We introduce a scale-free method for testing the proportionality of branch lengths between two phylogenetic trees that have the same topology and contain the same set of taxa. This method scales both trees to a total length of 1 and sums up the differences for each branch. Compared to previous methods, ours yields a fully symmetrical score that measures proportionality without being affected by scale. We call this score the normalized tree distance (NTD). Based on real data, we demonstrate that NTD scores are distributed unimodally, in a manner similar to a lognormal distribution. The NTD score can be used, for example, to detect co-evolutionary processes and measure the accuracy of branch length estimates. Comment: 13 pages of main text, 2 tables and 4 figures
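
    The scoring step described in this abstract (rescale each tree to total length 1, then sum the per-branch differences) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `normalized_tree_distance`, the dictionary representation of branches, and the use of the absolute difference without any further normalising constant are all assumptions made here.

```python
from typing import Dict, Hashable


def normalized_tree_distance(
    branches_a: Dict[Hashable, float],
    branches_b: Dict[Hashable, float],
) -> float:
    """Sum of per-branch differences after rescaling both trees to length 1.

    Each argument maps a branch identifier (hypothetical; for example the
    set of taxa below the branch) to its length. Both trees must share the
    same topology, so the two key sets must be identical.
    """
    if branches_a.keys() != branches_b.keys():
        raise ValueError("trees must have the same set of branches")

    total_a = sum(branches_a.values())
    total_b = sum(branches_b.values())
    if total_a <= 0 or total_b <= 0:
        raise ValueError("total branch length must be positive")

    # Assumption: 'difference' is taken as the absolute difference.
    return sum(
        abs(branches_a[key] / total_a - branches_b[key] / total_b)
        for key in branches_a
    )


if __name__ == "__main__":
    tree_a = {"e1": 1.0, "e2": 2.0, "e3": 1.0}
    tree_b = {"e1": 3.0, "e2": 6.0, "e3": 3.0}  # tree_a scaled by 3
    tree_c = {"e1": 2.0, "e2": 1.0, "e3": 1.0}
    print(normalized_tree_distance(tree_a, tree_b))
    print(normalized_tree_distance(tree_a, tree_c))
```

    Under these assumptions, two trees whose branch lengths are exactly proportional score 0, and swapping the arguments does not change the score, which matches the scale-free and symmetry properties the abstract claims.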

    Dynamic Geodesics in Treespace via Parametric Maximum Flow

    Shortest paths in treespace, which represent minimal deformations between trees, are unique and can be computed in polynomial time. The ability to quickly compute shortest paths has enabled new approaches for statistical analysis of populations of trees and phylogenetic inference. This paper gives a new algorithm for updating geodesic paths when the end points are dynamic. Such algorithms will be especially useful when optimizing for objectives that are functions of distances from a search point to other points, e.g. for finding a tree which has the minimum average distance to a collection of trees. Our method for updating treespace shortest paths is based on parametric sensitivity analysis of the maximum flow subproblems that are optimized when solving for a treespace geodesic.

    The space of ultrametric phylogenetic trees

    The reliability of a phylogenetic inference method from genomic sequence data is ensured by its statistical consistency. Bayesian inference methods produce a sample of phylogenetic trees from the posterior distribution given sequence data. Hence the question of statistical consistency of such methods is equivalent to the consistency of the summary of the sample. More generally, statistical consistency is ensured by the tree space used to analyse the sample. In this paper, we consider two standard parameterisations of phylogenetic time-trees used in evolutionary models: inter-coalescent interval lengths and absolute times of divergence events. For each of these parameterisations we introduce a natural metric space on ultrametric phylogenetic trees. We compare the introduced spaces with existing models of tree space and formulate several formal requirements that a metric space on phylogenetic trees must possess in order to be a satisfactory space for statistical analysis, and justify them. We show that only a few known constructions of the space of phylogenetic trees satisfy these requirements. However, our results suggest that these basic requirements are not enough to distinguish between the two metric spaces we introduce and that the choice between metric spaces requires additional properties to be considered. In particular, the summary tree minimising the square distance to the trees from the sample might be different for different parameterisations. This suggests that further fundamental insight is needed into the problem of statistical consistency of phylogenetic inference methods. Comment: Minor changes. This version has been published in JTB. 27 pages, 9 figures

    Computing medians and means in Hadamard spaces

    The geometric median as well as the Fréchet mean of points in a Hadamard space are important in both theory and applications. Surprisingly, no algorithms for their computation are hitherto known. To address this issue, we use a split version of the proximal point algorithm for minimizing a sum of convex functions and prove that this algorithm produces a sequence converging to a minimizer of the objective function, which extends a recent result of D. Bertsekas (2001) into Hadamard spaces. The method is quite robust and not only does it yield algorithms for the median and the mean, but it also applies to various other optimization problems. We moreover show that another algorithm for computing the Fréchet mean can be derived from the law of large numbers due to K.-T. Sturm (2002). In applications, computing medians and means is probably most needed in tree space, which is an instance of a Hadamard space, invented by Billera, Holmes, and Vogtmann (2001) as a tool for averaging phylogenetic trees. It turns out, however, that it can also be used to model numerous other tree-like structures. Since there now exists a polynomial-time algorithm for computing geodesics in tree space due to M. Owen and S. Provan (2011), we obtain efficient algorithms for computing medians and means, which can be directly used in practice. Comment: Corrected version. Accepted in SIAM Journal on Optimization
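
    The law-of-large-numbers route to the Fréchet mean mentioned in this abstract can be sketched as an inductive mean: start at the first sample, then move along the geodesic toward the k-th sample by a fraction 1/k. The code below is an illustration under stated assumptions, not the paper's algorithm as published. The names `inductive_mean` and `euclidean_geodesic` are hypothetical, and the geodesic used is the straight line in Euclidean space (itself a Hadamard space), standing in for a tree-space geodesic routine, which is not implemented here.

```python
import random
from typing import Callable, Iterable, Tuple

Point = Tuple[float, ...]
Geodesic = Callable[[Point, Point, float], Point]


def euclidean_geodesic(p: Point, q: Point, t: float) -> Point:
    """Point at fraction t along the straight segment from p to q.

    Stand-in for a geodesic in a general Hadamard space (assumption).
    """
    return tuple((1.0 - t) * a + t * b for a, b in zip(p, q))


def inductive_mean(
    samples: Iterable[Point],
    geodesic: Geodesic = euclidean_geodesic,
) -> Point:
    """Inductive mean of a stream of samples.

    S_1 = x_1, and S_k is the point at fraction 1/k along the geodesic
    from S_{k-1} to x_k. Only the geodesic depends on the space.
    """
    iterator = iter(samples)
    try:
        estimate = tuple(next(iterator))
    except StopIteration:
        raise ValueError("need at least one sample") from None

    for k, x in enumerate(iterator, start=2):
        estimate = geodesic(estimate, tuple(x), 1.0 / k)
    return estimate


if __name__ == "__main__":
    points = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
    # Draw i.i.d. samples (with replacement) from the finite point set.
    rng = random.Random(0)
    draws = rng.choices(points, k=5000)
    print(inductive_mean(draws))
```

    In the Euclidean case the inductive mean of a sequence is exactly its running arithmetic mean, which gives a simple sanity check; in tree space one would pass a geodesic function built on a polynomial-time geodesic algorithm instead.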

    Topological metrizations of trees, and new quartet methods of tree inference

    Topological phylogenetic trees can be assigned edge weights in several natural ways, highlighting different aspects of the tree. Here the rooted triple and quartet metrizations are introduced, and applied to formulate novel fast methods of inferring large trees from rooted triple and quartet data. These methods can be applied in new statistically consistent procedures for inference of a species tree from gene trees under the multispecies coalescent model. Comment: Final version

    An algorithm for constructing principal geodesics in phylogenetic treespace

    Most phylogenetic analyses result in a sample of trees, but summarizing and visualizing these samples can be challenging. Consensus trees often provide limited information about a sample, and so methods such as consensus networks, clustering and multidimensional scaling have been developed and applied to tree samples. This paper describes a stochastic algorithm for constructing a principal geodesic or line through treespace which is analogous to the first principal component in standard Principal Components Analysis. A principal geodesic summarizes the most variable features of a sample of trees, in terms of both tree topology and branch lengths, and it can be visualized as an animation of smoothly changing trees. The algorithm performs a stochastic search through parameter space for a geodesic which minimises the sum of squared projected distances of the data points. This procedure aims to identify the globally optimal principal geodesic, though convergence to locally optimal geodesics is possible. The methodology is illustrated by constructing principal geodesics for experimental and simulated data sets, demonstrating the insight into samples of trees that can be gained and how the method improves on a previously published approach. A Java package called GeoPhytter for constructing and visualising principal geodesics is freely available from www.ncl.ac.uk/~ntmwn/geophytter. Comment: 6 figures, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 11, No. 2, 201

    Hypothesis Testing For Network Data in Functional Neuroimaging

    In recent years, it has become common practice in neuroscience to use networks to summarize relational information in a set of measurements, typically assumed to be reflective of either functional or structural relationships between regions of interest in the brain. One of the most basic tasks of interest in the analysis of such data is the testing of hypotheses, in answer to questions such as "Is there a difference between the networks of these two groups of subjects?" In the classical setting, where the unit of interest is a scalar or a vector, such questions are answered through the use of familiar two-sample testing strategies. Networks, however, are not Euclidean objects, and hence classical methods do not directly apply. We address this challenge by drawing on concepts and techniques from geometry and high-dimensional statistical inference. Our work is based on a precise geometric characterization of the space of graph Laplacian matrices and a nonparametric notion of averaging due to Fréchet. We motivate and illustrate our resulting methodologies for testing in the context of networks derived from functional neuroimaging data on human subjects from the 1000 Functional Connectomes Project. In particular, we show that this global test is more statistically powerful than a mass-univariate approach. In addition, we have also provided a method for visualizing the individual contribution of each edge to the overall test statistic. Comment: 34 pages. 5 figures