On the inference of large phylogenies with long branches: How long is too long?
Recent work has highlighted deep connections between sequence-length
requirements for high-probability phylogeny reconstruction and the related
problem of the estimation of ancestral sequences. In [Daskalakis et al.'09],
building on the work of [Mossel'04], a tight sequence-length requirement was
obtained for the CFN model. In particular the required sequence length for
high-probability reconstruction was shown to undergo a sharp transition (from
O(log n) to poly(n), where n is the number of leaves) at the
"critical" branch length \critmlq (if it exists) of the ancestral
reconstruction problem.
Here we consider the GTR model. For this model, recent results of [Roch'09]
show that the tree can be accurately reconstructed with sequences of length
O(log n) when the branch lengths are below \critksq, known as the
Kesten-Stigum (KS) bound. Although for the CFN model \critmlq = \critksq, it
is known that for the more general GTR models one has \critmlq \geq \critksq
with a strict inequality in many cases. Here, we show that this phenomenon also
holds for phylogenetic reconstruction by exhibiting a family of symmetric
models and a phylogenetic reconstruction algorithm which recovers the tree
from O(log n)-length sequences for some branch lengths in the range
(\critksq,\critmlq). Second, we prove that phylogenetic reconstruction under
GTR models requires a polynomial sequence length for branch lengths above
\critmlq.
Phase transition in the sample complexity of likelihood-based phylogeny inference
Reconstructing evolutionary trees from molecular sequence data is a
fundamental problem in computational biology. Stochastic models of sequence
evolution are closely related to spin systems that have been extensively
studied in statistical physics and that connection has led to important
insights on the theoretical properties of phylogenetic reconstruction
algorithms as well as the development of new inference methods. Here, we study
maximum likelihood, a classical statistical technique which is perhaps the most
widely used in phylogenetic practice because of its superior empirical
accuracy.
At the theoretical level, except for its consistency, that is, the guarantee
of eventual correct reconstruction as the size of the input data grows, much
remains to be understood about the statistical properties of maximum likelihood
in this context. In particular, the best bounds on the sample complexity or
sequence-length requirement of maximum likelihood, that is, the amount of data
required for correct reconstruction, are exponential in the number, n, of
tips---far from known lower bounds based on information-theoretic arguments.
Here we close the gap by proving a new upper bound on the sequence-length
requirement of maximum likelihood that matches up to constants the known lower
bound for some standard models of evolution.
More specifically, for the r-state symmetric model of sequence evolution on
a binary phylogeny with bounded edge lengths, we show that the sequence-length
requirement behaves logarithmically in n when the expected amount of mutation
per edge is below what is known as the Kesten-Stigum threshold. In general, the
sequence-length requirement is polynomial in n. Our results imply moreover
that the maximum likelihood estimator can be computed efficiently on randomly
generated data provided sequences are as above.
Comment: To appear in Probability Theory and Related Fields
Phase transitions in Phylogeny
We apply the theory of Markov random fields on trees to derive a phase
transition in the number of samples needed in order to reconstruct phylogenies.
We consider the Cavender-Farris-Neyman model of evolution on trees, where all
the inner nodes have degree at least 3, and the net transition on each edge is
bounded by e. Motivated by a conjecture by M. Steel, we show that if
2(1 - 2e)(1 - 2e) > 1, then for balanced trees, the topology of the underlying tree,
having n leaves, can be reconstructed from O(log n) samples (characters) at the
leaves. On the other hand, we show that if 2(1 - 2e)(1 - 2e) < 1, then
there exist topologies which require at least poly(n) samples for
reconstruction.
Our results are the first rigorous results to establish the role of phase
transitions for Markov random fields on trees, as studied in probability,
statistical physics and information theory, in the study of phylogenies in
mathematical biology.
Comment: To appear in Transactions of the AMS
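The threshold condition 2(1 - 2e)(1 - 2e) > 1 in the abstract above can be solved for e directly. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and the only input is the inequality as stated in the abstract.

```python
import math


def ks_margin(e: float) -> float:
    """Return 2(1 - 2e)(1 - 2e) - 1 for a per-edge transition bound e."""
    return 2.0 * (1.0 - 2.0 * e) * (1.0 - 2.0 * e) - 1.0


def critical_e() -> float:
    """Solve 2(1 - 2e)^2 = 1 for e in [0, 1/2): e = (1 - 1/sqrt(2)) / 2."""
    return (1.0 - 1.0 / math.sqrt(2.0)) / 2.0


def regime(e: float) -> str:
    """Label the regime named in the abstract for a given bound e."""
    margin = ks_margin(e)
    if margin > 0:
        return "O(log n) samples (balanced trees)"
    if margin < 0:
        return "poly(n) samples for some topologies"
    return "boundary case (not covered by the abstract)"


if __name__ == "__main__":
    print(critical_e())   # roughly 0.146
    print(regime(0.10))
    print(regime(0.20))
```

Under this reading, the logarithmic regime corresponds to e below roughly 0.146.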
On the variational distance of two trees
A widely studied model for generating sequences is to "evolve" them on a
tree according to a symmetric Markov process. We prove that model trees tend to
be maximally "far apart" in terms of variational distance.
Comment: Published at http://dx.doi.org/10.1214/105051606000000196 in the
Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute
of Mathematical Statistics (http://www.imstat.org)
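To make "variational distance between two trees" concrete, the sketch below computes it exactly for a single site on two four-leaf topologies. The assumptions are mine, not the paper's: a two-state symmetric process with the same flip probability p on every edge, a uniform root state, and the convention that includes the factor 1/2. It illustrates the quantity only and says nothing about the paper's proof.

```python
import itertools


def edge_prob(parent: int, child: int, p: float) -> float:
    """Two-state symmetric edge: the state flips with probability p."""
    return 1.0 - p if parent == child else p


def pattern_distribution(split, p: float) -> dict:
    """Exact distribution of the 4 leaf states for the quartet tree ab|cd.

    split = ((a, b), (c, d)) gives the leaf indices attached to the two
    internal nodes u and v, which are joined by one internal edge.
    """
    (a, b), (c, d) = split
    dist = {}
    for pattern in itertools.product((0, 1), repeat=4):
        total = 0.0
        # Sum over the hidden states of the internal nodes u and v.
        for su, sv in itertools.product((0, 1), repeat=2):
            prob = 0.5 * edge_prob(su, sv, p)  # uniform state at u
            prob *= edge_prob(su, pattern[a], p) * edge_prob(su, pattern[b], p)
            prob *= edge_prob(sv, pattern[c], p) * edge_prob(sv, pattern[d], p)
            total += prob
        dist[pattern] = total
    return dist


def variational_distance(P: dict, Q: dict) -> float:
    """Half the L1 distance between two distributions on the same support."""
    return 0.5 * sum(abs(P[x] - Q[x]) for x in P)


if __name__ == "__main__":
    tree_a = ((0, 1), (2, 3))  # topology 01|23
    tree_b = ((0, 2), (1, 3))  # topology 02|13
    for p in (0.05, 0.1, 0.3):
        P = pattern_distribution(tree_a, p)
        Q = pattern_distribution(tree_b, p)
        print(p, variational_distance(P, Q))
```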
Computational phylogenetics and the classification of South American languages
In recent years, South Americanist linguists have embraced computational phylogenetic methods to resolve the numerous outstanding questions about the genealogical relationships among the languages of the continent. We provide a critical review of the methods and language classification results that have accumulated thus far, emphasizing the superiority of character-based methods over distance-based ones and the importance of developing adequate comparative datasets for producing well-resolved classifications.
Global Alignment of Molecular Sequences via Ancestral State Reconstruction
Molecular phylogenetic techniques do not generally account for such common
evolutionary events as site insertions and deletions (known as indels). Instead
tree building algorithms and ancestral state inference procedures typically
rely on substitution-only models of sequence evolution. In practice these
methods are extended beyond this simplified setting with the use of heuristics
that produce global alignments of the input sequences--an important problem
which has no rigorous model-based solution. In this paper we consider a new
version of the multiple sequence alignment problem in the context of stochastic
indel models. More precisely, we introduce the following trace reconstruction
problem on a tree (TRPT): a binary sequence is broadcast through a tree
channel where we allow substitutions, deletions, and insertions; we seek to
reconstruct the original sequence from the sequences received at the leaves of
the tree. We give a recursive procedure for this problem with strong
reconstruction guarantees at low mutation rates, providing also an alignment of
the sequences at the leaves of the tree. The TRPT problem without indels has
been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a
bootstrapping step towards obtaining optimal phylogenetic reconstruction
methods. The present work sets up a framework for extending these works to
evolutionary models with indels.
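The tree channel described above can be simulated in a few lines. This is a minimal sketch under assumptions that are not from the paper: a complete binary tree, and a hypothetical per-site channel in which each site is independently deleted with probability p_del, otherwise flipped with probability p_sub, and followed by one inserted uniform random bit with probability p_ins. The paper's actual indel model and its recursive reconstruction procedure are not reproduced here.

```python
import random


def mutate(seq, p_sub, p_del, p_ins, rng):
    """Pass one binary sequence through one edge of the (assumed) channel."""
    out = []
    for bit in seq:
        if rng.random() < p_del:
            pass  # this site is deleted on this edge
        else:
            # Substitution: flip the bit with probability p_sub.
            out.append(bit ^ 1 if rng.random() < p_sub else bit)
        if rng.random() < p_ins:
            out.append(rng.randint(0, 1))  # insert one uniform random bit
    return out


def broadcast(root_seq, depth, p_sub, p_del, p_ins, seed=0):
    """Broadcast root_seq down a complete binary tree; return the leaf sequences."""
    rng = random.Random(seed)
    level = [list(root_seq)]
    for _level_index in range(depth):
        level = [
            mutate(parent, p_sub, p_del, p_ins, rng)
            for parent in level
            for _child in range(2)
        ]
    return level


if __name__ == "__main__":
    leaves = broadcast([1, 0, 1, 1, 0, 0, 1, 0], depth=3,
                       p_sub=0.05, p_del=0.02, p_ins=0.02, seed=42)
    for leaf in leaves:
        print("".join(map(str, leaf)))
```

With indels switched on, the leaf sequences generally differ in length, which is why reconstructing the root also calls for an alignment of the leaves.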
Robust reconstruction on trees is determined by the second eigenvalue
Consider a Markov chain on an infinite tree T=(V,E) rooted at \rho. In such a
chain, once the initial root state \sigma(\rho) is chosen, each vertex
iteratively chooses its state from the one of its parent by an application of a
Markov transition rule (and all such applications are independent). Let \mu_j
denote the resulting measure for \sigma(\rho)=j. The resulting measure \mu_j is
defined on configurations \sigma=(\sigma(x))_{x\in V}\in A^V, where A is some
finite set. Let \mu_j^n denote the restriction of \mu_j to the sigma-algebra
generated by the variables \sigma(x), where x is at distance exactly n from
\rho. Letting \alpha_n=max_{i,j\in A}d_{TV}(\mu_i^n,\mu_j^n), where d_{TV}
denotes total variation distance, we say that the reconstruction problem is
solvable if lim inf_{n\to\infty}\alpha_n>0. Reconstruction solvability roughly
means that the nth level of the tree contains a nonvanishing amount of
information on the root of the tree as n\to\infty. In this paper we study the
problem of robust reconstruction. Let \nu be a nondegenerate distribution on A
and \epsilon >0. Let \sigma be chosen according to \mu_j^n and \sigma' be
obtained from \sigma by letting for each node independently,
\sigma(v)=\sigma'(v) with probability 1-\epsilon and \sigma'(v) be an
independent sample from \nu otherwise. We denote by \mu_j^n[\nu,\epsilon ] the
resulting measure on \sigma'. The measure \mu_j^n[\nu,\epsilon ] is a
perturbation of the measure \mu_j^n.
Comment: Published at http://dx.doi.org/10.1214/009117904000000153 in the
Annals of Probability (http://www.imstat.org/aop/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
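The objects defined in this abstract (the level-n configuration drawn from \mu_j^n and its perturbation \mu_j^n[\nu,\epsilon]) can be sampled as follows. This is a minimal sketch, assuming a d-ary tree and a transition matrix M given as a list of rows. The function names are hypothetical, and nothing here reproduces the paper's result.

```python
import random


def sample_level(M, root_state, depth, d, rng):
    """Sample the states at distance `depth` from the root of a d-ary tree.

    M[i][j] is the probability that a child takes state j given parent state i.
    """
    states = list(range(len(M)))
    level = [root_state]
    for _step in range(depth):
        level = [
            rng.choices(states, weights=M[parent])[0]
            for parent in level
            for _child in range(d)
        ]
    return level


def perturb(sigma, nu, eps, rng):
    """Keep each state with probability 1 - eps, else resample it from nu."""
    states = list(range(len(nu)))
    return [
        s if rng.random() >= eps else rng.choices(states, weights=nu)[0]
        for s in sigma
    ]


def tv_distance(p, q):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    rng = random.Random(0)
    M = [[0.9, 0.1], [0.1, 0.9]]   # illustrative two-state chain
    sigma = sample_level(M, root_state=0, depth=5, d=2, rng=rng)
    sigma_prime = perturb(sigma, nu=[0.5, 0.5], eps=0.2, rng=rng)
    print(sum(sigma), sum(sigma_prime))
```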