201 research outputs found
A Fast Quartet Tree Heuristic for Hierarchical Clustering
The Minimum Quartet Tree Cost problem is to construct an optimal weight tree
from the weighted quartet topologies on objects, where
optimality means that the summed weight of the embedded quartet topologies is
optimal (so it can be the case that the optimal tree embeds all quartets as
nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized
hill climbing, for approximating the optimal weight tree, given the quartet
topology weights. The method repeatedly transforms a dendrogram, with all
objects involved as leaves, achieving a monotonic approximation to the exact
single globally optimal tree. The problem and the solution heuristic has been
extensively used for general hierarchical clustering of nontree-like
(non-phylogeny) data in various domains and across domains with heterogeneous
data. We also present a greatly improved heuristic, reducing the running time
by a factor of order a thousand to ten thousand. All this is implemented and
available, as part of the CompLearn package. We compare performance and running
time of the original and improved versions with those of UPGMA, BioNJ, and NJ,
as implemented in the SplitsTree package on genomic data for which the latter
are optimized.
Keywords: Data and knowledge visualization, Pattern
matching--Clustering--Algorithms/Similarity measures, Hierarchical clustering,
Global optimization, Quartet tree, Randomized hill-climbing,Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with
arXiv:cs/0606048 in cs.D
A New Quartet Tree Heuristic for Hierarchical Clustering
We consider the problem of constructing an an optimal-weight tree from the
3*(n choose 4) weighted quartet topologies on n objects, where optimality means
that the summed weight of the embedded quartet topologiesis optimal (so it can
be the case that the optimal tree embeds all quartets as non-optimal
topologies). We present a heuristic for reconstructing the optimal-weight tree,
and a canonical manner to derive the quartet-topology weights from a given
distance matrix. The method repeatedly transforms a bifurcating tree, with all
objects involved as leaves, achieving a monotonic approximation to the exact
single globally optimal tree. This contrasts to other heuristic search methods
from biological phylogeny, like DNAML or quartet puzzling, which, repeatedly,
incrementally construct a solution from a random order of objects, and
subsequently add agreement values.Comment: 22 pages, 14 figure
Reconstructing phylogenies from noisy quartets in polynomial time with a high success probability
<p>Abstract</p> <p>Background</p> <p>In recent years, quartet-based phylogeny reconstruction methods have received considerable attentions in the computational biology community. Traditionally, the accuracy of a phylogeny reconstruction method is measured by simulations on synthetic datasets with known "true" phylogenies, while little theoretical analysis has been done. In this paper, we present a new model-based approach to measuring the accuracy of a quartet-based phylogeny reconstruction method. Under this model, we propose three efficient algorithms to reconstruct the "true" phylogeny with a high success probability.</p> <p>Results</p> <p>The first algorithm can reconstruct the "true" phylogeny from the input quartet topology set without quartet errors in <it>O</it>(<it>n</it><sup>2</sup>) time by querying at most (<it>n </it>- 4) log(<it>n </it>- 1) quartet topologies, where <it>n </it>is the number of the taxa. When the input quartet topology set contains errors, the second algorithm can reconstruct the "true" phylogeny with a probability approximately 1 - <it>p </it>in <it>O</it>(<it>n</it><sup>4 </sup>log <it>n</it>) time, where <it>p </it>is the probability for a quartet topology being an error. This probability is improved by the third algorithm to approximately <inline-formula><m:math name="1748-7188-3-1-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mn>1</m:mn><m:mrow><m:mn>1</m:mn><m:mo>+</m:mo><m:msup><m:mi>q</m:mi><m:mn>2</m:mn></m:msup><m:mo>+</m:mo><m:mfrac><m:mn>1</m:mn><m:mn>2</m:mn></m:mfrac><m:msup><m:mi>q</m:mi><m:mn>4</m:mn></m:msup><m:mo>+</m:mo><m:mfrac><m:mn>1</m:mn><m:mrow><m:mn>16</m:mn></m:mrow></m:mfrac><m:msup><m:mi>q</m:mi><m:mn>5</m:mn></m:msup></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF"> MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaIXaqmaeaacqaIXaqmcqGHRaWkcqWGXbqCdaahaaqabeaacqaIYaGmaaGaey4kaSYaaSaaaeaacqaIXaqmaeaacqaIYaGmaaGaemyCae3aaWbaaeqabaGaeGinaqdaaiabgUcaRmaalaaabaGaeGymaedabaGaeGymaeJaeGOnaydaaiabdghaXnaaCaaabeqaaiabiwda1aaaaaaaaa@3D5A@</m:annotation></m:semantics></m:math></inline-formula>, where <inline-formula><m:math name="1748-7188-3-1-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>q</m:mi><m:mo>=</m:mo><m:mfrac><m:mi>p</m:mi><m:mrow><m:mn>1</m:mn><m:mo>−</m:mo><m:mi>p</m:mi></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF"> MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemyCaeNaeyypa0tcfa4aaSaaaeaacqWGWbaCaeaacqaIXaqmcqGHsislcqWGWbaCaaaaaa@3391@</m:annotation></m:semantics></m:math></inline-formula>, with running time of <it>O</it>(<it>n</it><sup>5</sup>), which is at least 0.984 when <it>p </it>< 0.05.</p> <p>Conclusion</p> <p>The three proposed algorithms are mathematically guaranteed to reconstruct the "true" phylogeny with a high success probability. The experimental results showed that the third algorithm produced phylogenies with a higher probability than its aforementioned theoretical lower bound and outperformed some existing phylogeny reconstruction methods in both speed and accuracy.</p
Constructing level-2 phylogenetic networks from triplets
Jansson and Sung showed that, given a dense set of input triplets T
(representing hypotheses about the local evolutionary relationships of triplets
of species), it is possible to determine in polynomial time whether there
exists a level-1 network consistent with T, and if so to construct such a
network. They also showed that, unlike in the case of trees (i.e. level-0
networks), the problem becomes NP-hard when the input is non-dense. Here we
further extend this work by showing that, when the set of input triplets is
dense, the problem is even polynomial-time solvable for the construction of
level-2 networks. This shows that, assuming density, it is tractable to
construct plausible evolutionary histories from input triplets even when such
histories are heavily non-tree like. This further strengthens the case for the
use of triplet-based methods in the construction of phylogenetic networks. We
also show that, in the non-dense case, the level-2 problem remains NP-hard
Festparameter-Algorithmen fuer die Konsens-Analyse Genomischer Daten
Fixed-parameter algorithms offer a constructive and powerful approach
to efficiently obtain solutions for NP-hard problems combining two
important goals: Fixed-parameter algorithms compute optimal solutions
within provable time bounds despite the (almost inevitable)
computational intractability of NP-hard problems. The essential idea
is to identify one or more aspects of the input to a problem as the
parameters, and to confine the combinatorial explosion of
computational difficulty to a function of the parameters such that the
costs are polynomial in the non-parameterized part of the input. This
makes especially sense for parameters which have small values in
applications. Fixed-parameter algorithms have become an established
algorithmic tool in a variety of application areas, among them
computational biology where small values for problem parameters are
often observed. A number of design techniques for fixed-parameter
algorithms have been proposed and bounded search trees are one of
them. In computational biology, however, examples of bounded search
tree algorithms have been, so far, rare.
This thesis investigates the use of bounded search tree algorithms for
consensus problems in the analysis of DNA and RNA data. More
precisely, we investigate consensus problems in the contexts of
sequence analysis, of quartet methods for phylogenetic reconstruction,
of gene order analysis, and of RNA secondary structure comparison. In
all cases, we present new efficient algorithms that incorporate the
bounded search tree paradigm in novel ways. On our way, we also obtain
results of parameterized hardness, showing that the respective
problems are unlikely to allow for a fixed-parameter algorithm, and we
introduce integer linear programs (ILP's) as a tool for classifying
problems as fixed-parameter tractable, i.e., as having fixed-parameter
algorithms. Most of our algorithms were implemented and tested on
practical data.Festparameter-Algorithmen bieten einen konstruktiven Ansatz zur
Loesung von kombinatorisch schwierigen, in der Regel NP-harten
Problemen, der zwei Ziele beruecksichtigt: innerhalb von beweisbaren
Laufzeitschranken werden optimale Ergebnisse berechnet. Die
entscheidende Idee ist dabei, einen oder mehrere Aspekte der
Problemeingabe als Parameter der Problems aufzufassen und die
kombinatorische Explosion der algorithmischen Schwierigkeit auf diese
Parameter zu beschraenken, so dass die Laufzeitkosten polynomiell in
Bezug auf den nicht-parametrisierten Teil der Eingabe sind. Gibt es
einen Festparameter-Algorithmus fuer ein kombinatorisches Problem,
nennt man das Problem festparameter-handhabbar. Die Entwicklung von
Festparameter-Algorithmen macht vor allem dann Sinn, wenn die
betrachteten Parameter im Anwendungsfall nur kleine Werte
annehmen. Festparameter-Algorithmen sind zu einem algorithmischen
Standardwerkzeug in vielen Anwendungsbereichen geworden, unter anderem
in der algorithmischen Biologie, wo in vielen Anwendungen kleine
Parameterwerte beobachtet werden koennen. Zu den bekannten Techniken
fuer den Entwurf von Festparameter-Algorithmen gehoeren unter anderem
groessenbeschraenkte Suchbaeume. In der algorithmischen Biologie gibt
es bislang nur wenige Beispiele fuer die Anwendung von
groessenbeschraenkten Suchbaeumen.
Diese Arbeit untersucht den Einsatz groessenbeschraenkter Suchbaeume
fuer NP-harte Konsens-Probleme in der Analyse von DNS- und
RNS-Daten. Wir betrachten Konsens-Probleme in der Analyse von
DNS-Sequenzdaten, in der Analyse von sogenannten Quartettdaten zur
Erstellung von phylogenetischen Hypothesen, in der Analyse von Daten
ueber die Anordnung von Genen und beim Vergleich von
RNS-Strukturdaten. In allen Faellen stellen wir neue effiziente
Algorithmen vor, in denen das Paradigma der groessenbeschraenkten
Suchbaeume auf neuartige Weise realisiert wird. Auf diesem Weg zeigen
wir auch Ergebnisse parametrisierter Haerte, die zeigen, dass fuer
die dabei betrachteten Probleme ein Festparameter-Algorithmus
unwahrscheinlich ist. Ausserdem fuehren wir ganzzahliges lineares
Programmieren als eine neue Technik ein, um die
Festparameter-Handhabbarkeit eines Problems zu zeigen. Die Mehrzahl
der hier vorgestellten Algorithmen wurde implementiert und auf
Anwendungsdaten getestet
A list of parameterized problems in bioinformatics
In this report we present a list of problems that originated in bionformatics. Our aim is to collect information on such problems that have been analyzed from the point of view of Parameterized Complexity. For every problem we give its definition and biological motivation together with known complexity results.Postprint (published version
- …