
    On the inference of large phylogenies with long branches: How long is too long?

    Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In [Daskalakis et al.'09], building on the work of [Mossel'04], a tight sequence-length requirement was obtained for the CFN model. In particular, the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from O(\log n) to poly(n), where n is the number of leaves) at the "critical" branch length \critmlq (if it exists) of the ancestral reconstruction problem. Here we consider the GTR model. For this model, recent results of [Roch'09] show that the tree can be accurately reconstructed with sequences of length O(\log n) when the branch lengths are below \critksq, known as the Kesten-Stigum (KS) bound. Although for the CFN model \critmlq = \critksq, it is known that for the more general GTR models one has \critmlq \geq \critksq with a strict inequality in many cases. Here, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models Q and a phylogenetic reconstruction algorithm which recovers the tree from O(\log n)-length sequences for some branch lengths in the range (\critksq,\critmlq). Second, we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence length for branch lengths above \critmlq.

    Phase transition in the sample complexity of likelihood-based phylogeny inference

    Reconstructing evolutionary trees from molecular sequence data is a fundamental problem in computational biology. Stochastic models of sequence evolution are closely related to spin systems that have been extensively studied in statistical physics, and that connection has led to important insights on the theoretical properties of phylogenetic reconstruction algorithms as well as the development of new inference methods. Here, we study maximum likelihood, a classical statistical technique which is perhaps the most widely used in phylogenetic practice because of its superior empirical accuracy. At the theoretical level, except for its consistency, that is, the guarantee of eventual correct reconstruction as the size of the input data grows, much remains to be understood about the statistical properties of maximum likelihood in this context. In particular, the best bounds on the sample complexity or sequence-length requirement of maximum likelihood, that is, the amount of data required for correct reconstruction, are exponential in the number, n, of tips, far from known lower bounds based on information-theoretic arguments. Here we close the gap by proving a new upper bound on the sequence-length requirement of maximum likelihood that matches up to constants the known lower bound for some standard models of evolution. More specifically, for the r-state symmetric model of sequence evolution on a binary phylogeny with bounded edge lengths, we show that the sequence-length requirement behaves logarithmically in n when the expected amount of mutation per edge is below what is known as the Kesten-Stigum threshold. In general, the sequence-length requirement is polynomial in n. Our results imply moreover that the maximum likelihood estimator can be computed efficiently on randomly generated data provided sequences are as above. Comment: To appear in Probability Theory and Related Fields

    Phase transitions in Phylogeny

    We apply the theory of Markov random fields on trees to derive a phase transition in the number of samples needed in order to reconstruct phylogenies. We consider the Cavender-Farris-Neyman model of evolution on trees, where all the inner nodes have degree at least 3, and the net transition on each edge is bounded by e. Motivated by a conjecture by M. Steel, we show that if 2(1 - 2e)^2 > 1, then for balanced trees, the topology of the underlying tree, having n leaves, can be reconstructed from O(log n) samples (characters) at the leaves. On the other hand, we show that if 2(1 - 2e)^2 < 1, then there exist topologies which require at least poly(n) samples for reconstruction. Our results are the first rigorous results to establish the role of phase transitions for Markov random fields on trees, as studied in probability, statistical physics and information theory, in the study of phylogenies in mathematical biology. Comment: To appear in Transactions of the AMS
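As a worked illustration of the threshold in the abstract above, the following minimal Python sketch evaluates the condition 2(1 - 2e)^2 > 1 and solves the boundary case 2(1 - 2e)^2 = 1 for e. The function names are hypothetical, and reading e as a per-edge transition probability in [0, 1/2] is an assumption drawn from the abstract's wording, not a statement of the paper's exact setup.

```python
import math


def ks_condition_holds(e: float) -> bool:
    """Return True when 2 * (1 - 2e)^2 > 1, the condition quoted in the abstract."""
    return 2.0 * (1.0 - 2.0 * e) ** 2 > 1.0


def critical_transition_probability() -> float:
    """Solve 2 * (1 - 2e)^2 = 1 for e in [0, 1/2].

    (1 - 2e) = 1 / sqrt(2)  =>  e = (1 - 1 / sqrt(2)) / 2.
    """
    return (1.0 - 1.0 / math.sqrt(2.0)) / 2.0


# Assumed interpretation: e is the maximum per-edge transition probability.
e_star = critical_transition_probability()
print(f"critical e is approximately {e_star:.4f}")
print("e = 0.10 satisfies the condition:", ks_condition_holds(0.10))
print("e = 0.20 satisfies the condition:", ks_condition_holds(0.20))
```

Under this reading, values of e below roughly 0.146 fall in the O(log n) regime described in the abstract, and values above it fall in the regime where some topologies need poly(n) samples.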

    On the variational distance of two trees

    A widely studied model for generating sequences is to "evolve" them on a tree according to a symmetric Markov process. We prove that model trees tend to be maximally "far apart" in terms of variational distance. Comment: Published at http://dx.doi.org/10.1214/105051606000000196 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Computational phylogenetics and the classification of South American languages

    In recent years, South Americanist linguists have embraced computational phylogenetic methods to resolve the numerous outstanding questions about the genealogical relationships among the languages of the continent. We provide a critical review of the methods and language classification results that have accumulated thus far, emphasizing the superiority of character-based methods over distance-based ones and the importance of developing adequate comparative datasets for producing well-resolved classifications.

    Global Alignment of Molecular Sequences via Ancestral State Reconstruction

    Molecular phylogenetic techniques do not generally account for such common evolutionary events as site insertions and deletions (known as indels). Instead, tree-building algorithms and ancestral state inference procedures typically rely on substitution-only models of sequence evolution. In practice these methods are extended beyond this simplified setting with the use of heuristics that produce global alignments of the input sequences, an important problem which has no rigorous model-based solution. In this paper we consider a new version of the multiple sequence alignment problem in the context of stochastic indel models. More precisely, we introduce the following trace reconstruction problem on a tree (TRPT): a binary sequence is broadcast through a tree channel where we allow substitutions, deletions, and insertions; we seek to reconstruct the original sequence from the sequences received at the leaves of the tree. We give a recursive procedure for this problem with strong reconstruction guarantees at low mutation rates, providing also an alignment of the sequences at the leaves of the tree. The TRPT problem without indels has been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a bootstrapping step towards obtaining optimal phylogenetic reconstruction methods. The present work sets up a framework for extending these works to evolutionary models with indels.
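To make the tree channel in the abstract above concrete, here is a minimal Python sketch of the forward (data-generating) side only: a binary sequence is passed down a complete binary tree, with substitutions, deletions, and insertions applied on every edge. The per-site independent probabilities `p_sub`, `p_del`, `p_ins`, the complete binary tree shape, and the function names are all assumptions made for illustration; the abstract does not specify the indel model, and this sketch does not implement the paper's recursive reconstruction procedure.

```python
import random
from typing import List


def mutate(seq: List[int], p_sub: float, p_del: float, p_ins: float,
           rng: random.Random) -> List[int]:
    """Pass a binary sequence through one edge of the (assumed) channel."""
    out: List[int] = []
    for bit in seq:
        if rng.random() < p_ins:
            out.append(rng.randint(0, 1))  # insert a uniform random bit before this site
        if rng.random() < p_del:
            continue                       # delete this site
        if rng.random() < p_sub:
            bit ^= 1                       # substitute: flip the bit
        out.append(bit)
    return out


def broadcast(root: List[int], depth: int, p_sub: float, p_del: float,
              p_ins: float, rng: random.Random) -> List[List[int]]:
    """Return the sequences at the 2**depth leaves of a complete binary tree."""
    level = [root]
    for _ in range(depth):
        # Each node sends an independently mutated copy to each of its two children.
        level = [mutate(s, p_sub, p_del, p_ins, rng)
                 for s in level for _child in range(2)]
    return level


rng = random.Random(0)
root_seq = [rng.randint(0, 1) for _ in range(20)]
leaves = broadcast(root_seq, depth=3, p_sub=0.02, p_del=0.02, p_ins=0.02, rng=rng)
print(len(leaves), "leaf sequences with lengths", [len(s) for s in leaves])
```

Because of insertions and deletions, the leaf sequences generally differ in length, which is why reconstructing the root sequence also involves aligning the leaves.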

    Robust reconstruction on trees is determined by the second eigenvalue

    Consider a Markov chain on an infinite tree T=(V,E) rooted at \rho. In such a chain, once the initial root state \sigma(\rho) is chosen, each vertex iteratively chooses its state from the one of its parent by an application of a Markov transition rule (and all such applications are independent). Let \mu_j denote the resulting measure for \sigma(\rho)=j. The measure \mu_j is defined on configurations \sigma=(\sigma(x))_{x\in V}\in A^V, where A is some finite set. Let \mu_j^n denote the restriction of \mu_j to the sigma-algebra generated by the variables \sigma(x), where x is at distance exactly n from \rho. Letting \alpha_n=\max_{i,j\in A}d_{TV}(\mu_i^n,\mu_j^n), where d_{TV} denotes total variation distance, we say that the reconstruction problem is solvable if \liminf_{n\to\infty}\alpha_n>0. Reconstruction solvability roughly means that the nth level of the tree contains a nonvanishing amount of information on the root of the tree as n\to\infty. In this paper we study the problem of robust reconstruction. Let \nu be a nondegenerate distribution on A and \epsilon>0. Let \sigma be chosen according to \mu_j^n and \sigma' be obtained from \sigma by letting, for each node independently, \sigma'(v)=\sigma(v) with probability 1-\epsilon and \sigma'(v) be an independent sample from \nu otherwise. We denote by \mu_j^n[\nu,\epsilon] the resulting measure on \sigma'. The measure \mu_j^n[\nu,\epsilon] is a perturbation of the measure \mu_j^n. Comment: Published at http://dx.doi.org/10.1214/009117904000000153 in the Annals of Probability (http://www.imstat.org/aop/) by the Institute of Mathematical Statistics (http://www.imstat.org)
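The perturbation step and the total variation distance used in the abstract above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, distributions are represented as plain dictionaries from states to probabilities, and the level-n configuration \sigma is passed in as a list rather than sampled from a Markov chain on a tree.

```python
import random
from typing import Dict, Hashable, List


def perturb(sigma: List[Hashable], nu: Dict[Hashable, float], epsilon: float,
            rng: random.Random) -> List[Hashable]:
    """Build sigma' from sigma: each node keeps its state with probability
    1 - epsilon, and otherwise takes an independent sample from nu."""
    states = list(nu.keys())
    weights = list(nu.values())
    result = []
    for s in sigma:
        if rng.random() >= epsilon:   # happens with probability 1 - epsilon
            result.append(s)
        else:
            result.append(rng.choices(states, weights=weights, k=1)[0])
    return result


def total_variation(p: Dict[Hashable, float], q: Dict[Hashable, float]) -> float:
    """d_TV(p, q) = (1/2) * sum over states of |p(a) - q(a)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


rng = random.Random(0)
sigma = [0, 1, 1, 0, 1, 0, 0, 1]          # a hypothetical level-n configuration
nu = {0: 0.5, 1: 0.5}                     # assumed noise distribution on A = {0, 1}
sigma_prime = perturb(sigma, nu, epsilon=0.25, rng=rng)
print(sigma_prime)
print(total_variation({0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}))
```

Computing \alpha_n exactly would require the full distributions \mu_i^n over level-n configurations, which grow exponentially with n; the helper here only covers distributions small enough to enumerate.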