241 research outputs found
On the inference of large phylogenies with long branches: How long is too long?
Recent work has highlighted deep connections between sequence-length
requirements for high-probability phylogeny reconstruction and the related
problem of the estimation of ancestral sequences. In [Daskalakis et al.'09],
building on the work of [Mossel'04], a tight sequence-length requirement was
obtained for the CFN model. In particular the required sequence length for
high-probability reconstruction was shown to undergo a sharp transition (from
$O(\log n)$ to $\mathrm{poly}(n)$, where $n$ is the number of leaves) at the
"critical" branch length \critmlq (if it exists) of the ancestral
reconstruction problem.
Here we consider the GTR model. For this model, recent results of [Roch'09]
show that the tree can be accurately reconstructed with sequences of length
$O(\log n)$ when the branch lengths are below \critksq, known as the
Kesten-Stigum (KS) bound. Although for the CFN model \critmlq = \critksq, it
is known that for the more general GTR models one has \critmlq \geq \critksq
with a strict inequality in many cases. Here, we show that this phenomenon also
holds for phylogenetic reconstruction by exhibiting a family of symmetric
models and a phylogenetic reconstruction algorithm which recovers the tree
from $O(\log n)$-length sequences for some branch lengths in the range
(\critksq,\critmlq). Second, we prove that phylogenetic reconstruction under
GTR models requires a polynomial sequence-length for branch lengths above
\critmlq.
Inferring ancestral sequences in taxon-rich phylogenies
Statistical consistency in phylogenetics has traditionally referred to the
accuracy of estimating phylogenetic parameters for a fixed number of species as
we increase the number of characters. However, as sequences are often of fixed
length (e.g. for a gene) although we are often able to sample more taxa, it is
useful to consider a dual type of statistical consistency where we increase the
number of species, rather than characters. This raises some basic questions:
what can we learn about the evolutionary process as we increase the number of
species? In particular, does having more species allow us to infer the
ancestral state of characters accurately? This question is particularly
relevant when sequence site evolution varies in a complex way from character to
character, as well as for reconstructing ancestral sequences. In this paper, we
assemble a collection of results to analyse various approaches for inferring
ancestral information with increasing accuracy as the number of taxa increases.
Comment: 32 pages, 5 figures, 1 table
Phase transition in the sample complexity of likelihood-based phylogeny inference
Reconstructing evolutionary trees from molecular sequence data is a
fundamental problem in computational biology. Stochastic models of sequence
evolution are closely related to spin systems that have been extensively
studied in statistical physics and that connection has led to important
insights on the theoretical properties of phylogenetic reconstruction
algorithms as well as the development of new inference methods. Here, we study
maximum likelihood, a classical statistical technique which is perhaps the most
widely used in phylogenetic practice because of its superior empirical
accuracy.
At the theoretical level, except for its consistency, that is, the guarantee
of eventual correct reconstruction as the size of the input data grows, much
remains to be understood about the statistical properties of maximum likelihood
in this context. In particular, the best bounds on the sample complexity or
sequence-length requirement of maximum likelihood, that is, the amount of data
required for correct reconstruction, are exponential in the number, $n$, of
tips---far from known lower bounds based on information-theoretic arguments.
Here we close the gap by proving a new upper bound on the sequence-length
requirement of maximum likelihood that matches up to constants the known lower
bound for some standard models of evolution.
More specifically, for the $r$-state symmetric model of sequence evolution on
a binary phylogeny with bounded edge lengths, we show that the sequence-length
requirement behaves logarithmically in $n$ when the expected amount of mutation
per edge is below what is known as the Kesten-Stigum threshold. In general, the
sequence-length requirement is polynomial in $n$. Our results imply moreover
that the maximum likelihood estimator can be computed efficiently on randomly
generated data provided sequences are as above.
Comment: To appear in Probability Theory and Related Fields
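As a rough illustration of the threshold mentioned in this abstract, the sketch below checks the standard Kesten-Stigum condition on a binary tree, $2\theta^2 > 1$, for a symmetric model. Everything in it is an assumption made for illustration and is not taken from the paper: the number of states is called `r`, the per-edge parameter `p` is the probability that the two endpoints of an edge carry different states, and the function names are hypothetical.

```python
import math


def second_eigenvalue(p: float, r: int = 2) -> float:
    """Second eigenvalue theta of the edge transition matrix.

    Assumes an r-state symmetric model in which the two endpoints of an
    edge differ with probability p, each other state being equally likely.
    """
    if r < 2:
        raise ValueError("need at least two states")
    if not 0.0 <= p <= (r - 1) / r:
        raise ValueError("p must lie in [0, (r - 1) / r]")
    return 1.0 - r * p / (r - 1)


def below_kesten_stigum(p: float, r: int = 2) -> bool:
    """True when 2 * theta**2 > 1, the Kesten-Stigum condition on a binary tree."""
    theta = second_eigenvalue(p, r)
    return 2.0 * theta ** 2 > 1.0


def critical_mutation_probability(r: int = 2) -> float:
    """Value of p at which 2 * theta**2 equals 1 exactly."""
    return (r - 1) / r * (1.0 - 1.0 / math.sqrt(2.0))


if __name__ == "__main__":
    for states in (2, 4):
        print(states, critical_mutation_probability(states))
```

Under these assumptions, an edge mutation probability below the printed critical value falls in the regime where the abstract states a logarithmic sequence-length requirement; the code only evaluates the threshold and says nothing about any reconstruction algorithm.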
BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction
A novel discrete mathematical approach is proposed as an additional tool for molecular systematics which does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms generating mathematical representations directly from DNA/RNA or protein sequences, followed by the output of numerical (scalar or vector) and visual characteristics (graphs). The binary encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (or ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to get an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called Boolean analysis, or BOOL-AN.
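The last step of the pipeline described in this abstract (vector data, then a distance matrix, then UPGMA) can be sketched in a few lines. This is a minimal illustration, not the BOOL-AN software: the binary vectors are made up rather than derived from an Iterative Canonical Form, Hamming distance is an assumed choice of metric, and the function names are hypothetical.

```python
from itertools import combinations


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def distance_table(vectors):
    """Map each unordered pair of labels to the distance between their vectors."""
    return {
        frozenset((a, b)): hamming(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


def upgma(names, dist):
    """Cluster labels by UPGMA and return the topology as nested tuples.

    dist maps frozenset({a, b}) to the distance between labels a and b.
    Branch lengths are not computed in this sketch.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    (tree,) = clusters
    return tree


if __name__ == "__main__":
    # Hypothetical binary descriptors for four taxa.
    vectors = {
        "A": (0, 0, 0, 0),
        "B": (0, 0, 0, 1),
        "C": (1, 1, 0, 0),
        "D": (1, 1, 1, 0),
    }
    print(upgma(list(vectors), distance_table(vectors)))
```

Neighbor-joining, which the abstract also mentions, would replace only the `upgma` step; the distance table would be built the same way.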
Phylogenetic mixtures: Concentration of measure in the large-tree limit
The reconstruction of phylogenies from DNA or protein sequences is a major
task of computational evolutionary biology. Common phenomena, notably
variations in mutation rates across genomes and incongruences between gene
lineage histories, often make it necessary to model molecular data as
originating from a mixture of phylogenies. Such mixed models play an
increasingly important role in practice. Using concentration of measure
techniques, we show that mixtures of large trees are typically identifiable. We
also derive sequence-length requirements for high-probability reconstruction.
Comment: Published at http://dx.doi.org/10.1214/11-AAP837 in the Annals of
Applied Probability (http://www.imstat.org/aap/) by the Institute of
Mathematical Statistics (http://www.imstat.org)