On the inference of large phylogenies with long branches: How long is too long?

Authors: Mossel Elchanan; Roch Sebastien; Sly Allan
Publication date: 01/01/2010

Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In [Daskalakis et al.'09], building on the work of [Mossel'04], a tight sequence-length requirement was obtained for the CFN model. In particular, the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from O(\log n) to \hbox{poly}(n), where n is the number of leaves) at the "critical" branch length \critmlq (if it exists) of the ancestral reconstruction problem. Here we consider the GTR model. For this model, recent results of [Roch'09] show that the tree can be accurately reconstructed with sequences of length O(\log(n)) when the branch lengths are below \critksq, known as the Kesten-Stigum (KS) bound. Although for the CFN model \critmlq = \critksq, it is known that for the more general GTR models one has \critmlq \geq \critksq with a strict inequality in many cases. Here, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models Q and a phylogenetic reconstruction algorithm which recovers the tree from O(\log n)-length sequences for some branch lengths in the range (\critksq,\critmlq). Second, we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence-length for branch lengths above \critmlq.

arXiv.org e-Print Archive

Springer - Publisher Connector

ScholarlyCommons@Penn

Phase transition in the sample complexity of likelihood-based phylogeny inference

Authors: Roch Sebastien; Sly Allan
Publication date: 18/07/2017

Reconstructing evolutionary trees from molecular sequence data is a fundamental problem in computational biology. Stochastic models of sequence evolution are closely related to spin systems that have been extensively studied in statistical physics, and that connection has led to important insights on the theoretical properties of phylogenetic reconstruction algorithms as well as the development of new inference methods. Here, we study maximum likelihood, a classical statistical technique which is perhaps the most widely used in phylogenetic practice because of its superior empirical accuracy. At the theoretical level, except for its consistency, that is, the guarantee of eventual correct reconstruction as the size of the input data grows, much remains to be understood about the statistical properties of maximum likelihood in this context. In particular, the best bounds on the sample complexity or sequence-length requirement of maximum likelihood, that is, the amount of data required for correct reconstruction, are exponential in the number, n, of tips, far from known lower bounds based on information-theoretic arguments. Here we close the gap by proving a new upper bound on the sequence-length requirement of maximum likelihood that matches up to constants the known lower bound for some standard models of evolution. More specifically, for the r-state symmetric model of sequence evolution on a binary phylogeny with bounded edge lengths, we show that the sequence-length requirement behaves logarithmically in n when the expected amount of mutation per edge is below what is known as the Kesten-Stigum threshold. In general, the sequence-length requirement is polynomial in n. Our results imply moreover that the maximum likelihood estimator can be computed efficiently on randomly generated data provided sequences are as above. Comment: To appear in Probability Theory and Related Field

arXiv.org e-Print Archive

Princeton University Open Access Repository

Phase transitions in Phylogeny

Author: Mossel Elchanan
Publication date: 01/01/2003

We apply the theory of Markov random fields on trees to derive a phase transition in the number of samples needed in order to reconstruct phylogenies. We consider the Cavender-Farris-Neyman model of evolution on trees, where all the inner nodes have degree at least 3, and the net transition on each edge is bounded by e. Motivated by a conjecture by M. Steel, we show that if 2(1 - 2e)^2 > 1, then for balanced trees, the topology of the underlying tree, having n leaves, can be reconstructed from O(log n) samples (characters) at the leaves. On the other hand, we show that if 2(1 - 2e)^2 < 1, then there exist topologies which require at least poly(n) samples for reconstruction. Our results are the first rigorous results to establish the role of phase transitions for Markov random fields on trees, as studied in probability, statistical physics and information theory, in the study of phylogenies in mathematical biology. Comment: To appear in Transactions of the AM
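The threshold condition in this abstract, 2(1 - 2e)(1 - 2e) > 1 (equivalently 2(1 - 2e)^2 > 1), can be solved for e directly. The following is a minimal sketch, not code from the paper; the function names `ks_margin` and `critical_e` are hypothetical and chosen only for illustration.

```python
import math


def ks_margin(e: float) -> float:
    """Return 2 * (1 - 2e)^2 - 1.

    Positive values correspond to the O(log n) regime stated in the
    abstract; negative values to the poly(n) regime.
    """
    theta = 1.0 - 2.0 * e
    return 2.0 * theta * theta - 1.0


def critical_e() -> float:
    """Solve 2 * (1 - 2e)^2 = 1 for e in [0, 1/2]."""
    return 0.5 * (1.0 - 1.0 / math.sqrt(2.0))


if __name__ == "__main__":
    e_star = critical_e()
    print(f"critical e = {e_star:.4f}")
    for e in (0.10, 0.20):
        regime = "O(log n)" if ks_margin(e) > 0 else "poly(n)"
        print(f"e = {e:.2f}: margin = {ks_margin(e):+.2f}, regime = {regime}")
```

Under this reading, the transition sits at e = (1 - 1/\sqrt{2})/2, roughly 0.146: an edge bound of 0.10 falls in the logarithmic regime, while 0.20 falls in the polynomial one.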

arXiv.org e-Print Archive

CiteSeerX

ScholarlyCommons@Penn

On the variational distance of two trees

Authors: Steel M. A.; Székely L. A.
Publication venue: Institute of Mathematical Statistics
Publication date: 10/10/2006

A widely studied model for generating sequences is to "evolve" them on a tree according to a symmetric Markov process. We prove that model trees tend to be maximally "far apart" in terms of variational distance. Comment: Published at http://dx.doi.org/10.1214/105051606000000196 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

Crossref

Computational phylogenetics and the classification of South American languages

Authors: Chousou‐Polydouri Natalia; Michael Lev
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

In recent years, South Americanist linguists have embraced computational phylogenetic methods to resolve the numerous outstanding questions about the genealogical relationships among the languages of the continent. We provide a critical review of the methods and language classification results that have accumulated thus far, emphasizing the superiority of character-based methods over distance-based ones and the importance of developing adequate comparative datasets for producing well-resolved classifications.

Crossref

eScholarship - University of California

ZORA

Global Alignment of Molecular Sequences via Ancestral State Reconstruction

Authors: Andoni Alexandr; Daskalakis Constantinos; Hassidim Avinatan; Roch Sebastien
Publication date: 01/01/2009

Molecular phylogenetic techniques do not generally account for such common evolutionary events as site insertions and deletions (known as indels). Instead, tree-building algorithms and ancestral state inference procedures typically rely on substitution-only models of sequence evolution. In practice these methods are extended beyond this simplified setting with the use of heuristics that produce global alignments of the input sequences, an important problem which has no rigorous model-based solution. In this paper we consider a new version of the multiple sequence alignment problem in the context of stochastic indel models. More precisely, we introduce the following trace reconstruction problem on a tree (TRPT): a binary sequence is broadcast through a tree channel where we allow substitutions, deletions, and insertions; we seek to reconstruct the original sequence from the sequences received at the leaves of the tree. We give a recursive procedure for this problem with strong reconstruction guarantees at low mutation rates, providing also an alignment of the sequences at the leaves of the tree. The TRPT problem without indels has been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a bootstrapping step towards obtaining optimal phylogenetic reconstruction methods. The present work sets up a framework for extending these works to evolutionary models with indels.
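To make the TRPT setup concrete, here is a minimal simulation sketch of the forward process only: a binary sequence broadcast down a complete binary tree through a channel with substitutions, deletions, and insertions. The abstract does not specify the channel, so the per-site independent probabilities `p_sub`, `p_del`, `p_ins`, the complete binary tree, and the function names are all assumptions made for illustration; this is not the paper's reconstruction procedure.

```python
import random


def indel_channel(seq, p_sub, p_del, p_ins, rng):
    """Pass one binary sequence along one edge (assumed channel).

    For each site, independently: insert a uniform random bit before it
    with probability p_ins, delete it with probability p_del, and
    otherwise flip it with probability p_sub.
    """
    out = []
    for bit in seq:
        if rng.random() < p_ins:
            out.append(rng.randint(0, 1))
        if rng.random() < p_del:
            continue
        if rng.random() < p_sub:
            bit = 1 - bit
        out.append(bit)
    return out


def broadcast(root_seq, depth, p_sub, p_del, p_ins, seed=0):
    """Return the 2**depth leaf sequences of a complete binary tree."""
    rng = random.Random(seed)
    level = [list(root_seq)]
    for _ in range(depth):
        # Every node sends an independently corrupted copy to two children.
        level = [
            indel_channel(s, p_sub, p_del, p_ins, rng)
            for s in level
            for _child in range(2)
        ]
    return level


if __name__ == "__main__":
    for leaf in broadcast([0, 1, 1, 0, 1, 0, 0, 1], 2, 0.05, 0.05, 0.05):
        print(leaf)
```

Because deletions and insertions change sequence lengths, the leaves are generally not aligned site by site, which is why the reconstruction procedure described in the abstract also has to produce an alignment.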

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

Robust reconstruction on trees is determined by the second eigenvalue

Authors: Janson Svante; Mossel Elchanan
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2003

Consider a Markov chain on an infinite tree T=(V,E) rooted at \rho. In such a chain, once the initial root state \sigma(\rho) is chosen, each vertex iteratively chooses its state from that of its parent by an application of a Markov transition rule (and all such applications are independent). Let \mu_j denote the resulting measure for \sigma(\rho)=j. The measure \mu_j is defined on configurations \sigma=(\sigma(x))_{x\in V}\in A^V, where A is some finite set. Let \mu_j^n denote the restriction of \mu_j to the sigma-algebra generated by the variables \sigma(x), where x is at distance exactly n from \rho. Letting \alpha_n=\max_{i,j\in A}d_{TV}(\mu_i^n,\mu_j^n), where d_{TV} denotes total variation distance, we say that the reconstruction problem is solvable if \liminf_{n\to\infty}\alpha_n>0. Reconstruction solvability roughly means that the nth level of the tree contains a nonvanishing amount of information on the root of the tree as n\to\infty. In this paper we study the problem of robust reconstruction. Let \nu be a nondegenerate distribution on A and \epsilon>0. Let \sigma be chosen according to \mu_j^n and \sigma' be obtained from \sigma by letting, for each node independently, \sigma'(v)=\sigma(v) with probability 1-\epsilon and \sigma'(v) be an independent sample from \nu otherwise. We denote by \mu_j^n[\nu,\epsilon] the resulting measure on \sigma'. The measure \mu_j^n[\nu,\epsilon] is a perturbation of the measure \mu_j^n. Comment: Published at http://dx.doi.org/10.1214/009117904000000153 in the Annals of Probability (http://www.imstat.org/aop/) by the Institute of Mathematical Statistics (http://www.imstat.org
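The quantities \alpha_n and \mu_j^n[\nu,\epsilon] defined above can be computed exactly for small cases. The sketch below assumes a specific instance that the abstract does not fix: A = {0, 1}, a complete binary tree, and a symmetric transition rule that flips the parent's state with probability `flip`. It enumerates all level-n configurations, so it is only practical for n up to about 4; the function names are hypothetical.

```python
from itertools import product


def level_likelihood(config, root_state, flip, eps=0.0, nu=(0.5, 0.5)):
    """P(perturbed level-n configuration = config | root = root_state).

    Uses bottom-up dynamic programming on a complete binary tree.
    Each level-n vertex keeps its true state with probability 1 - eps
    and is otherwise replaced by an independent sample from nu.
    """
    # Per-vertex vector: P(observed value s | true state x) for x in {0, 1}.
    level = [
        tuple((1.0 - eps) * (1.0 if x == s else 0.0) + eps * nu[s] for x in (0, 1))
        for s in config
    ]
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            vec = []
            for parent in (0, 1):
                p = 1.0
                for child in (level[i], level[i + 1]):
                    p *= (1.0 - flip) * child[parent] + flip * child[1 - parent]
                vec.append(p)
            merged.append(tuple(vec))
        level = merged
    return level[0][root_state]


def alpha(depth, flip, eps=0.0, nu=(0.5, 0.5)):
    """Total variation distance between the level-`depth` measures for roots 0 and 1."""
    total = 0.0
    for config in product((0, 1), repeat=2 ** depth):
        total += abs(
            level_likelihood(config, 0, flip, eps, nu)
            - level_likelihood(config, 1, flip, eps, nu)
        )
    return 0.5 * total


if __name__ == "__main__":
    for n in range(4):
        print(n, alpha(n, flip=0.1), alpha(n, flip=0.1, eps=0.2))
```

With `eps=0` this is \alpha_n for the unperturbed measures; a positive `eps` gives the distance between the perturbed measures \mu_0^n[\nu,\epsilon] and \mu_1^n[\nu,\epsilon]. Both are non-increasing in n, since each level is obtained from the previous one through a channel that does not depend on the root.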

arXiv.org e-Print Archive

CiteSeerX

Crossref

ScholarlyCommons@Penn