41 research outputs found
Optimizing Phylogenetic Supertrees Using Answer Set Programming
The supertree construction problem is about combining several phylogenetic
trees with possibly conflicting information into a single tree that has all the
leaves of the source trees as its leaves and the relationships between the
leaves are as consistent with the source trees as possible. This leads to an
optimization problem that is computationally challenging and typically
heuristic methods, such as matrix representation with parsimony (MRP), are
used. In this paper we consider the use of answer set programming to solve the
supertree construction problem in terms of two alternative encodings. The first
is based on an existing encoding of trees using substructures known as
quartets, while the other novel encoding captures the relationships present in
trees through direct projections. We use these encodings to compute a
genus-level supertree for the family of cats (Felidae). Furthermore, we compare
our results to recent supertrees obtained by the MRP method.Comment: To appear in Theory and Practice of Logic Programming (TPLP),
Proceedings of ICLP 201
Polynomial supertree methods in phylogenomics: algorithms, simulations and software
One of the objectives in modern biology, especially phylogenetics, is to build larger clades of the Tree of Life. Large-scale phylogenetic analysis involves several serious challenges. The aim of this thesis is to contribute to some of the open problems in this context. In computational phylogenetics, supertree methods provide a way to reconstruct larger clades of the Tree of Life. We present a novel polynomial time approach for the computation of supertrees called FlipCut supertree. Our method combines the computation of minimum cuts from graph-based methods with a matrix representation method, namely Minimum Flip Supertrees. Here, the input trees are encoded in a 0/1/?-matrix. We present a heuristic to search for a minimum set of 0/1-flips such that the resulting matrix admits a directed perfect phylogeny. In contrast to other polynomial time approaches, our results can be interpreted in the sense that we try to minimize a global objective function, namely the number of flips in the input matrix. We extend our approach by using edge weights to weight the columns of the 0/1/?-matrix. In order to compare our new FlipCut supertree method with other recent polynomial supertree methods and matrix representation methods, we present a large scale simulation study using two different data sets. Our findings illustrate the trade-off between accuracy and running time in supertree construction, as well as the pros and cons of different supertree approaches. Furthermore, we present EPoS, a modular software framework for phylogenetic analysis and visualization. It fills the gap between command line-based algorithmic packages and visual tools without sufficient support for computational methods. By combining a powerful graphical user interface with a plugin system that allows simple integration of new algorithms, visualizations and data structures, we created a framework that is easy to use, to extend and that covers all important steps of a phylogenetic analysis
Fixed-parameter algorithms for some combinatorial problems in bioinformatics
Fixed-parameterized algorithmics has been developed in 1990s as an approach to solve NP-hard problem optimally in a guaranteed running time. It offers a new opportunity to solve NP-hard problems exactly even on large problem instances.
In this thesis, we apply fixed-parameter algorithms to cope with three NP-hard problems in bioinformatics:
Flip Consensus Tree Problem is a combinatorial problem arising in computational phylogenetics. Using the formulation of the Flip Consensus Tree Problem as a graph-modification problem, we present a set of data reduction rules and two fixed-parameter algorithms with respect to the number of modifications. Additionally, we discuss several heuristic improvements to accelerate the running time of our algorithms in practice. We also report computational results on phylogenetic data.
Weighted Cluster Editing Problem is a graph-modification problem, that arises in computational biology when clustering objects with respect to a given similarity or distance measure. We present one of our fixed-parameter algorithms with respect to the minimum modification cost and describe the idea of our fastest algorithm for this problem and its unweighted counterpart.
Bond Order Assignment Problem asks for a bond order assignment of a molecule graph that minimizes a penalty function. We prove several complexity results on this problem and give two exact fixed-parameter algorithms for the problem. Our algorithms base on the dynamic programming approach on a tree decomposition of the molecule graph. Our algorithms are fixed-parameter with respect to the treewidth of the molecule graph and the maximum atom valence. We implemented one of our algorithms with several heuristic improvements and evaluate our algorithm on a set of real molecule graphs. It turns out that our algorithm is very fast on this dataset and even outperforms a heuristic algorithm that is usually used in practice
Reweaving the tapestry: a supertree of birds
Supertrees are a useful method of constructing large-scale phylogenies by assembling numerous smaller phylogenies that have some, but not necessarily all, taxa in common. Birds are an obvious candidate for supertree construction as they are the most abundant land vertebrates on the planet and no comprehensive phylogeny of both extinct and extant species currently exists. In order to construct supertrees, primary analysis of characters is required. One such study, presented here, describes two new partial specimens belonging to the Primobucconidae from the Green River Formation of Wyoming (USA), which were assigned to the species Primobucco mcgrewi. Although incomplete, these specimens had preserved anatomical features not seen in other material. An attempt to further constrain their phylogenetic position was inconclusive, showing only that the Primobucconidae belong in a clade containing the extant Coraciiformes and related taxa. Over 700 such studies were used to construct a species-level supertree of Aves containing over 5000 taxa. The resulting tree shows the relationships between the main avian groups, with only a few novel clades, some of which can be explained by a lack of information regarding those taxa. The tree was constructed using a strict protocol which ensures robust, accurate and efficient data collection and processing; extending previous work by other authors. Before creating the species-level supertree the protocol was tested on the order Galliformes in order to determine the most efficient method of removing non-independent data. It was found that combining non-independent source trees via a âmini-supertreeâ analysis produced results more consistent with the input source data and, in addition, significantly reduced computational load. Another method for constructing large-scale trees is via a supermatrix, which is constructed from primary data collated into a single, large matrix. A molecular-only tree was constructed using both supertree and supermatrix methods, from the same data, again of the order Galliformes. Both methods performed equally as well in producing trees that fit the source data. The two methods could be considered complementary rather than conflicting as the supertree took a long time to construct but was very quick to calculate, but the supermatrix took longer to calculate, but was quicker to construct. Dependent upon the data at hand and the other factors involved, the choice of which method to use appears, from this small study, to be of little consequence. Finally an updated species-level supertree of the Dinosauria was also constructed and used to look at diversification rates in order to elucidate the âCretaceous explosion of terrestrial lifeâ. Results from this study show that this apparent burst in diversity at the end of the Cretaceous is a sampling artefact and in fact, dinosaurs show most of their major diversification shifts in the first third of their history
Fast and accurate supertrees: towards large scale phylogenies
Phylogenetics is the study of evolutionary relationships between biological entities; phylogenetic trees (phylogenies) are a visualization of these evolutionary relationships. Accurate approaches to reconstruct hylogenies from sequence data usually result in NPhard optimization problems, hence local search heuristics have to be applied in practice. These methods are highly accurate and fast enough as long as the input data is not too large. Divide-and-conquer techniques are a promising approach to boost scalability and accuracy of those local search heuristics on very large datasets. A divide-and-conquer method breaks down a large phylogenetic problem into smaller sub-problems that are computationally easier to solve. The sub-problems (overlapping trees) are then combined using a supertree method. Supertree methods merge a set of overlapping phylogenetic trees into a supertree containing all taxa of the input trees. The challenge in supertree reconstruction is the way of dealing with conflicting information in the input trees. Many different algorithms for different objective functions have been suggested to resolve these conflicts. In particular, there are methods that encode the source trees in a matrix and the supertree is constructed applying a local search heuristic to optimize the respective objective function. The most widely used supertree methods use such local search heuristics. However, to really improve the scalability of accurate tree reconstruction by divide-and-conquer approaches, accurate polynomial time methods are needed for the supertree reconstruction step. In this work, we present approaches for accurate polynomial time supertree reconstruction in particular Bad Clade Deletion (BCD), a novel heuristic supertree algorithm with polynomial running time. BCD uses minimum cuts to greedily delete a locally minimal number of columns from a matrix representation to make it compatible. Different from local search heuristics, it guarantees to return the directed perfect phylogeny for the input matrix, corresponding to the parent tree of the input trees if one exists. BCD can take support values of the source trees into account without an increase in complexity. We show how reliable clades can be used to restrict the search space for BCD and how those clades can be collected from the input data using the Greedy Strict Consensus Merger. Finally, we introduce a beam search extension for the BCD algorithm that keeps alive a constant number of partial solutions in each top-down iteration phase. The guaranteed worst-case running time of BCD with beam search extension is still polynomial. We present an exact and a randomized subroutine to generate suboptimal partial solutions. In our thorough evaluation on several simulated and biological datasets against a representative set of supertree methods we found that BCD is more accurate than the most accurate supertree methods when using support values and search space restriction on simulated data. Simultaneously BCD is faster than any other evaluated method. The beam search approach improved the accuracy of BCD on all evaluated datasets at the cost of speed. We found that BCD supertrees can boost maximum likelihood tree reconstruction when used as starting tree. Further, BCD could handle large scale datasets where local search heuristics did not converge in reasonable time. Due to its combination of speed, accuracy, and the ability to reconstruct the parent tree if one exists, BCD is a promising approach to enable outstanding scalability of divide-and-conquer approaches.Die Phylogenetik studiert die evolutionĂ€ren Beziehungen zwischen biologischen EntitĂ€ten. Phylogenetische BĂ€ume sind eine Visualisierung dieser Beziehungen. Akkurate AnsĂ€tze zur Rekonstruktion von Phylogenien aus Sequenzdaten fĂŒhren in der Regel zu NP-schweren
Optimierungsproblemen, sodass in der Praxis lokale Suchheuristiken angewendet werden mĂŒssen. Diese Methoden liefern akkurate BĂ€ume und sind schnell genug, solange die Eingabedaten nicht zu groĂ werden. Teile-und-herrsche-Verfahren sind ein vielversprechender Ansatz, um Skalierbarkeit und Genauigkeit dieser lokalen Suchheuristiken auf sehr
groĂen DatensĂ€tzen zu verbessern. Beim Teile-und-herrsche-Ansatz zerlegt man ein groĂes phylogenetisches Problem in kleinere Teilprobleme, die einfacher und schneller zu lösen sind. Die Teilprobleme, in diesem Fall ĂŒberlappende TeilbĂ€ume, mĂŒssen dann zu einem gesamtheitlichen Baum kombiniert werden. Superbaummethoden verschmelzen solche ĂŒberlappenden phylogenetischen BĂ€ume zu einem Superbaum, der alle Taxa der EingangsbĂ€ume enthĂ€lt. Die Herausforderung bei der Superbaumrekonstruktion besteht darin, mit widersprĂŒchlichen EingabebĂ€umen umzugehen. Es wurden viele verschiedene Algorithmen mit unterschiedlichen Zielfunktionen entwickelt, um solche WidersprĂŒche möglichst sinnvoll aufzulösen. Verfahren, die auf der Kodierung der EingabebĂ€ume als MatrixreprĂ€sentation basieren, sind am weitesten verbreitet. Die zum Auflösen der Konflikte verwendeten Zielfunktionen fĂŒhren in der Regel zu NP-schweren Optimierungsproblemen, sodass in der Praxis auch hier lokale Suchheuristiken zum Einsatz kommen. Da diese AnsĂ€tze nicht wesentlich besser mit der GröĂe der Eingabedaten skalieren als die direkte Rekonstruktion aus Sequenzdaten, werden fĂŒr die Superbaumrekonstruktion in Teile-undherrsche-AnsĂ€tzen akkurate Polynomialzeitmethoden benötigt. Diese Arbeit beschĂ€ftigt sich mit der akkuraten Rekonstruktion von SuperbĂ€umen in Polynomialzeit. Wir prĂ€sentieren Bad Clade Deletion (BCD), eine neue Polynomialzeitheuristik zur Superbaumrekonstruktion. BCD verwendet minimale Schnitte in Graphen, um eine minimale Anzahl von Spalten aus der MatrixreprĂ€sentation zu löschen, sodass diese konfliktfrei wird. Im Gegensatz zu lokalen Suchheuristiken garantiert BCD die Rekonstruktion einer perfekten Phylogenie, sofern eine solche fĂŒr die Eingabematrix existiert. BCD ermöglicht es, GĂŒtekriterien der EingabebĂ€ume zu berĂŒcksichtigen, ohne dass sich dadurch die KomplexitĂ€t erhöht. Weiterhin zeigen wir, wie zuverlĂ€ssige Kladen verwendet werden können, um den Suchraum fĂŒr BCD einzuschrĂ€nken und wie man diese mit Hilfe des Greedy Strict Consensus Mergers aus den Eingabedaten gewinnen kann. SchlieĂlich stellen wir eine Strahlensuche fĂŒr BCD vor. Diese erlaubt es eine bestimmte Anzahl suboptimaler Teillösungen (anstatt nur der optimalen) zu berĂŒcksichtigen, um so das Gesamtergebnis zu verbessern. Die Worst-Case-Laufzeit der Strahlensuche ist immer noch polynomiell. Zur Berechnung suboptimaler Teillösungen stellen wir einen exakten und einen randomisierten Algorithmus vor. In einer ausfĂŒhrlichen Evaluation auf mehreren simulierten und biologischen DatensĂ€tzen vergleichen wir BCD mit einer reprĂ€sentativen Auswahl an Superbaummethoden. Wir haben
herausgefunden, dass BCD bei Verwendung von GĂŒtekriterien und SuchraumbeschrĂ€nkung auf simulierten Daten genauer ist als die akkuratesten evaluierten Superbaummethoden. Gleichzeitig ist BCD deutlich schneller als alle evaluierten Methoden. Die Strahlensuche
verbessert die QualitĂ€t der BCD-BĂ€ume auf allen DatensĂ€tzen, allerdings auf Kosten der Laufzeit. Weiterhin fanden wir heraus, dass ein BCD-Superbaum, der als Startbaum verwendet wird, die QualitĂ€t einer Maximum-Likelihood-Baumrekonstruktion verbessern kann. AuĂerdem kann BCD DatensĂ€tze verarbeiten, die so groĂ sind, dass lokale
Suchheuristiken auf diesen nicht mehr in angemessener Zeit konvergieren. Aufgrund der Kombination aus Geschwindigkeit, Genauigkeit und der FĂ€higkeit, den Elternbaum zu rekonstruieren, sofern ein solcher existiert, ist BCD ein vielversprechender Ansatz um die Skalierbarkeit von Teile-und-herrsche-Methoden entscheidend zu verbessern
Synthesizing species trees from gene trees using the parameterized and graph-theoretic approaches
Gene trees describe how parts of the species have evolved over time, and it is assumed that gene trees have evolved along the branches of the species tree. However, some of gene trees are often discordant with the corresponding species tree due to the complicated evolution history of genes. To overcome this obstacle, median problems have emerged as a major tool for synthesizing species trees by reconciling discordance in a given collection of gene trees. Given a collection of gene trees and a cost function, the median problem seeks a tree, called median tree, that minimizes the overall cost to the gene trees. Median tree problems are typically NP-hard, and there is an increased interest in making such median tree problems available for large-scale species tree construction.
In this thesis work, we first show that the gene duplication median tree problem satisfied the weaker version of the Pareto property and propose a parameterized algorithm to solve the gene duplication median tree problem. Second, we design two efficient methods to handle the issues of applying the parameterized algorithm to unrooted gene trees which are sampled from the different species. Third, we introduce the graph-theoretic formulation of the Robinson-Foulds median tree problem and a new tree edit operation. Fourth, we propose a new metric between two phylogenetic trees and examine the statistical properties of the metric. Finally, we propose a new clustering criteria in a bipartite network and propose a new NP-hard problem and its ILP formulation
Developing and applying supertree methods in Phylogenomics and Macroevolution
Supertrees
can
be
used
to
combine
partially
overalapping
trees
and
generate
more
inclusive
phylogenies.
It
has
been
proposed
that
Maximum
Likelihood
(ML)
supertrees
method
(SM)
could
be
developed
using
an
exponential
probability
distribution
to
model
errors
in
the
input
trees
(given
a
proposed
supertree).
When
the
tree-Ââto-Ââtree
distances
used
in
the
ML
computation
are
symmetric
differences,
the
ML
SM
has
been
shown
to
be
equivalent
to
a
Majority-ÂâRule
consensus
SM,
and
hence,
exactly
as
the
latter,
it
has
the
desirable
property
of
being
a
median
tree
(with
reference
to
the
set
of
input
trees).
The
ability
to
estimate
the
likelihood
of
supertrees,
allows
implementing
Bayesian
(MCMC)
approaches,
which
have
the
advantage
to
allow
the
support
for
the
clades
in
a
supertree
to
be
properly
estimated.
I
present
here
the
L.U.St
software
package;
it
contains
the
first
implementation
of
a
ML
SM
and
allows
for
the
first
time
statistical
tests
on
supertrees.
I
also
characterized
the
first
implementation
of
the
Bayesian
(MCMC)
SM.
Both
the
ML
and
the
Bayesian
(MCMC)
SMs
have
been
tested
for
and
found
to
be
immune
to
biases.
The
Bayesian
(MCMC)
SM
is
applied
to
the
reanalyses
of
a
variety
of
datasets
(i.e.
the
datasets
for
the
Metazoa
and
the
Carnivora),
and
I
have
also
recovered
the
first
Bayesian
supertree-Ââbased
phylogeny
of
the
Eubacteria
and
the
Archaebacteria.
These
new
SMs
are
discussed,
with
reference
to
other,
well-Ââ
known
SMs
like
Matrix
Representation
with
Parsimony.
Both
the
ML
and
Bayesian
SM
offer
multiple
attractive
advantages
over
current
alternatives