On the complexity of computing MP distance between binary phylogenetic trees
Within the field of phylogenetics there is great interest in distance
measures to quantify the dissimilarity of two trees. Recently, a new distance
measure has been proposed: the Maximum Parsimony (MP) distance. This is based
on the difference of the parsimony scores of a single character on both trees
under consideration, and the goal is to find the character which maximizes this
difference. Here we show that computation of MP distance on two binary
phylogenetic trees is NP-hard. This is a highly nontrivial extension of an
earlier NP-hardness proof for two multifurcating phylogenetic trees, and it is
particularly relevant given the prominence of binary trees in the phylogenetics
literature. As a corollary to the main hardness result we show that computation
of MP distance is also hard on binary trees if the number of states available
is bounded. In fact, via a different reduction we show that it is hard even if
only two states are available. Finally, as a first response to this hardness we
give a simple Integer Linear Program (ILP) formulation which is capable of
computing the MP distance exactly for small trees (and for larger trees when
only a small number of character states are available) and which is used to
computationally verify several auxiliary results required by the hardness
proofs. Comment: 37 pages, 8 figures
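The definition above (maximize, over single characters, the difference in parsimony score between the two trees) can be illustrated with a brute-force sketch. This is not the ILP from the paper. It assumes rooted binary trees written as nested tuples of leaf names, uses Fitch's algorithm for the parsimony score, and enumerates every character over a bounded number of states, so it is only feasible for very small trees. The names `fitch_score`, `leaves`, `mp_distance` and the `num_states` parameter are illustrative, not taken from the paper.

```python
from itertools import product


def fitch_score(tree, character):
    """Parsimony score of `character` (dict: leaf -> state) on a rooted binary tree."""
    def visit(node):
        if isinstance(node, str):  # a leaf: its state set is its observed state
            return {character[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:  # children can agree, no extra change needed
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1  # one change
    return visit(tree)[1]


def leaves(tree):
    """List the leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def mp_distance(t1, t2, num_states=2):
    """Brute-force maximum score difference over all characters with at most
    `num_states` states (exponential in the number of taxa)."""
    taxa = sorted(leaves(t1))
    if taxa != sorted(leaves(t2)):
        raise ValueError("trees must have the same leaf set")
    best = 0
    for states in product(range(num_states), repeat=len(taxa)):
        character = dict(zip(taxa, states))
        diff = abs(fitch_score(t1, character) - fitch_score(t2, character))
        best = max(best, diff)
    return best


t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
print(mp_distance(t1, t2))
```

The exhaustive loop visits num_states^n characters for n taxa, which is consistent with the abstract's point that exact computation is hard in general and needs something like an ILP beyond toy instances.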
On the Maximum Parsimony distance between phylogenetic trees
Within the field of phylogenetics there is great interest in distance
measures to quantify the dissimilarity of two trees. Here, based on an idea of
Bruen and Bryant, we propose and analyze a new distance measure: the Maximum
Parsimony (MP) distance. This is based on the difference of the parsimony
scores of a single character on both trees under consideration, and the goal is
to find the character which maximizes this difference. In this article we show
that this new distance is a metric and provides a lower bound to the well-known
Subtree Prune and Regraft (SPR) distance. We also show that to compute the MP
distance it is sufficient to consider only characters that are convex on one of
the trees, and prove several additional structural properties of the distance.
On the complexity side, we prove that calculating the MP distance is in general
NP-hard, and identify an interesting island of tractability in which the
distance can be calculated in polynomial time. Comment: 30 pages, 6 figures
Phylogenetic incongruence through the lens of Monadic Second Order logic
Within the field of phylogenetics there is growing interest in measures for
summarising the dissimilarity, or 'incongruence', of two or more phylogenetic
trees. Many of these measures are NP-hard to compute and this has stimulated a
considerable volume of research into fixed parameter tractable algorithms. In
this article we use Monadic Second Order logic (MSOL) to give alternative,
compact proofs of fixed parameter tractability for several well-known
incongruency measures. In doing so we wish to demonstrate the considerable
potential of MSOL - machinery still largely unknown outside the algorithmic
graph theory community - within phylogenetics. A crucial component of this work
is the observation that many of these measures, when bounded, imply the
existence of an 'agreement forest' of bounded size, which in turn implies that
an auxiliary graph structure, the display graph, has bounded treewidth. It is
this bound on treewidth that makes the machinery of MSOL available for proving
fixed parameter tractability. We give a variety of different MSOL formulations.
Some are based on explicitly encoding agreement forests, while some only use
them implicitly to generate the treewidth bound. Our formulations introduce a
number of "phylogenetics MSOL primitives" which will hopefully be of use to
other researchers.
Impacts of terraces on phylogenetic inference
Terraces are potentially large sets of trees with precisely the same
likelihood or parsimony score, which can be induced by missing sequences in
partitioned multi-locus phylogenetic data matrices. The set of trees on a
terrace can be characterized by enumeration algorithms or consensus methods
that exploit the pattern of partial taxon coverage in the data, independent of
the sequence data themselves. Terraces add ambiguity and complexity to
phylogenetic inference particularly in settings where inference is already
challenging: data sets with many taxa and relatively few loci. In this paper we
present five new findings about terraces and their impacts on phylogenetic
inference. First we clarify assumptions about model parameters that are
necessary for the existence of terraces. Second, we explore the dependence of
terrace size on partitioning scheme and indicate how to find the partitioning
scheme associated with the largest terrace containing a given tree. Third, we
highlight the impact of terraces on bootstrap estimates of confidence limits in
clades, and characterize the surprising result that the bootstrap proportion
for a clade can be entirely determined by the frequency of bipartitions on a
terrace, with some bipartitions receiving high support even when incorrect.
Fourth, we dissect some effects of prior distributions of edge lengths on the
computed posterior probabilities of clades on terraces, to understand an
example in which long edges "attract" each other in Bayesian inference. Fifth,
we show that even if data are not partitioned, patterns of missing data studied
in the terrace problem can lead to instances of apparent statistical
inconsistency when even a small element of heterotachy is introduced to the
model generating the sequence data. Finally, we discuss strategies for
remediation of some of these problems. Comment: 50 pages, 9 figures
Species Tree Estimation Using ASTRAL: Practical Considerations
ASTRAL is a method for reconstructing species trees after inferring a set of
gene trees and is increasingly used in phylogenomic analyses. It is
statistically consistent under the multi-species coalescent model, is scalable,
and has shown high accuracy in simulated and empirical studies. This chapter
discusses practical considerations in using ASTRAL, starting with a review of
published results and pointing to the strengths and weaknesses of species tree
estimation using ASTRAL. It then continues to detail the best ways to prepare
input gene trees, interpret ASTRAL outputs, and perform follow-up analyses.
Supertree Construction: Opportunities and Challenges
Supertree construction is the process by which a set of phylogenetic trees,
each on a subset of the overall set S of species, is combined into a tree on
the full set S. The traditional use of supertree methods is the assembly of a
large species tree from previously computed smaller species trees; however,
supertree methods are also used to address large-scale tree estimation using
divide-and-conquer (i.e., a dataset is divided into overlapping subsets, trees
are constructed on the subsets, and then combined using the supertree method).
Because most supertree methods are heuristics for NP-hard optimization
problems, the use of supertree estimation on large datasets is challenging,
both in terms of scalability and accuracy. In this paper, we describe the
current state of the art in supertree construction and the use of supertree
methods in divide-and-conquer strategies. Finally, we identify directions where
future research could lead to improved supertree methods. Comment: 28 pages, will be part of Festschrift volume for Bernard Mor
Most Compact Parsimonious Trees
Construction of phylogenetic trees has traditionally focused on binary trees
where all species appear on leaves, a problem for which numerous efficient
solutions have been developed. Certain application domains though, such as
viral evolution and transmission, paleontology, linguistics, and phylogenetic
stemmatics, often require phylogeny inference that involves placing input
species on ancestral tree nodes (live phylogeny), and polytomies. These
requirements, despite their prevalence, lead to computationally harder
algorithmic solutions and have been sparsely examined in the literature to
date. In this article we prove some unique properties of most parsimonious live
phylogenetic trees with polytomies, and describe novel algorithms to find
such trees without resorting to exhaustive enumeration of all possible tree
topologies. Comment: 7 pages, 4 figures, 1 table, submitted for peer review
Gap-weighted subsequences for automatic cognate identification and phylogenetic inference
In this paper, we describe the problem of cognate identification and its
relation to phylogenetic inference. We introduce subsequence based features for
discriminating cognates from non-cognates. We show that subsequence based
features perform better than the state-of-the-art string similarity measures
for the purpose of cognate identification. We use the cognate judgments for the
purpose of phylogenetic inference and observe that these classifiers infer a
tree which is close to the gold standard tree. The contribution of this paper
is the use of subsequence features for cognate identification and to employ the
cognate judgments for phylogenetic inference.
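As a rough illustration of what subsequence-based features can look like, the sketch below follows the standard gap-weighted string subsequence kernel: every length-p subsequence of a word contributes decay^span, where span is the distance covered in the word, so gappy matches count for less. The exact weighting and feature set used in the paper may differ. The function names and the parameters `p` and `decay` are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def gap_weighted_features(word, p=2, decay=0.5):
    """Map each length-p subsequence of `word` to its gap-weighted count.

    Each occurrence is weighted by decay ** span, where span is the number
    of positions from the first to the last matched character (inclusive).
    """
    features = defaultdict(float)
    for idx in combinations(range(len(word)), p):
        span = idx[-1] - idx[0] + 1
        subsequence = "".join(word[i] for i in idx)
        features[subsequence] += decay ** span
    return dict(features)


def similarity(word_a, word_b, p=2, decay=0.5):
    """Unnormalized dot product of the two words' feature vectors."""
    fa = gap_weighted_features(word_a, p, decay)
    fb = gap_weighted_features(word_b, p, decay)
    return sum(weight * fb[sub] for sub, weight in fa.items() if sub in fb)


print(gap_weighted_features("cat"))
print(similarity("cat", "cart"))
```

Feature vectors like these could then be fed to a classifier that separates cognate from non-cognate word pairs, which is the role the abstract assigns to them.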
Finding the most parsimonious or likely tree in a network with respect to an alignment
Phylogenetic networks are often constructed by merging multiple conflicting
phylogenetic signals into a directed acyclic graph. It is interesting to
explore whether a network constructed in this way induces biologically-relevant
phylogenetic signals that were not present in the input. Here we show that,
given a multiple alignment A for a set of taxa X and a rooted phylogenetic
network N whose leaves are labelled by X, it is NP-hard to locate the most
parsimonious phylogenetic tree displayed by N (with respect to A) even when the
level of N - the maximum number of reticulation nodes within a biconnected
component - is 1 and A contains only 2 distinct states. (If, additionally, gaps
are allowed the problem becomes APX-hard.) We also show that under the same
conditions, and assuming a simple binary symmetric model of character
evolution, finding the most likely tree displayed by the network is NP-hard.
These negative results contrast with earlier work on parsimony in which it is
shown that if A consists of a single column the problem is fixed parameter
tractable in the level. We conclude with a discussion of why, despite the
NP-hardness, both the parsimony and likelihood problem can likely be
well-solved in practice.
Gromov meets Phylogenetics - new Animals for the Zoo of Biocomputable Metrics on Tree Space
We present a new class of metrics for unrooted phylogenetic X-trees derived
from the Gromov-Hausdorff distance for (compact) metric spaces. These metrics
can be efficiently computed by linear or quadratic programming. They are robust
under NNI-operations, too. The local behavior of the metrics shows that they
are different from any formerly introduced metrics. The performance of the
metrics is briefly analysed on random weighted and unweighted trees as well as
random caterpillars.