140 research outputs found
A list of parameterized problems in bioinformatics
In this report we present a list of problems that originated in bioinformatics. Our aim is to collect information on such problems that have been analyzed from the point of view of Parameterized Complexity. For every problem we give its definition and biological motivation together with known complexity results.
Postprint (published version)
Festparameter-Algorithmen fuer die Konsens-Analyse Genomischer Daten
Fixed-parameter algorithms offer a constructive and powerful approach
to efficiently obtain solutions for NP-hard problems combining two
important goals: Fixed-parameter algorithms compute optimal solutions
within provable time bounds despite the (almost inevitable)
computational intractability of NP-hard problems. The essential idea
is to identify one or more aspects of the input to a problem as the
parameters, and to confine the combinatorial explosion of
computational difficulty to a function of the parameters such that the
costs are polynomial in the non-parameterized part of the input. This
makes sense especially for parameters that have small values in
applications. Fixed-parameter algorithms have become an established
algorithmic tool in a variety of application areas, among them
computational biology where small values for problem parameters are
often observed. A number of design techniques for fixed-parameter
algorithms have been proposed and bounded search trees are one of
them. In computational biology, however, examples of bounded search
tree algorithms have been, so far, rare.
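
As an illustration of the bounded search tree idea described above, the following is a minimal sketch for the Closest String problem: given strings of equal length and a bound d, find a string within Hamming distance d of every input. The choice of problem, the function names, and the pruning rule are assumptions made for this example; the abstract does not say which algorithms the thesis uses. The recursion depth is at most d and each node branches at most d + 1 ways, so the search tree size depends only on the parameter d.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_string(strings, d):
    """Return a string within Hamming distance d of every input, or None.

    Illustrative sketch only (not taken from the thesis): a depth-bounded
    search tree that starts from the first input string and repairs it.
    """
    def search(candidate, budget):
        for s in strings:
            dist = hamming(candidate, s)
            if dist > d:
                # No repair is possible with the remaining budget.
                if budget == 0 or dist > d + budget:
                    return None
                diff = [i for i in range(len(s)) if candidate[i] != s[i]]
                # Any solution agrees with s on at least one of d + 1
                # mismatching positions, so branching on those suffices.
                for i in diff[:d + 1]:
                    repaired = candidate[:i] + s[i] + candidate[i + 1:]
                    result = search(repaired, budget - 1)
                    if result is not None:
                        return result
                return None
        return candidate  # every input is within distance d

    return search(strings[0], d)


inputs = ["AAAA", "TTTA"]
center = closest_string(inputs, 2)
print(center, [hamming(center, s) for s in inputs])
```

The combinatorial explosion is confined to d: the work per search-tree node is polynomial in the number and length of the strings.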
This thesis investigates the use of bounded search tree algorithms for
consensus problems in the analysis of DNA and RNA data. More
precisely, we investigate consensus problems in the contexts of
sequence analysis, of quartet methods for phylogenetic reconstruction,
of gene order analysis, and of RNA secondary structure comparison. In
all cases, we present new efficient algorithms that incorporate the
bounded search tree paradigm in novel ways. On our way, we also obtain
results of parameterized hardness, showing that the respective
problems are unlikely to allow for a fixed-parameter algorithm, and we
introduce integer linear programs (ILPs) as a tool for classifying
problems as fixed-parameter tractable, i.e., as having fixed-parameter
algorithms. Most of our algorithms were implemented and tested on
practical data.
Fixed-parameter algorithms offer a constructive approach to solving
combinatorially hard, usually NP-hard, problems that takes two goals
into account: optimal results are computed within provable
running-time bounds. The key idea is to treat one or more aspects of
the problem input as parameters of the problem and to confine the
combinatorial explosion of algorithmic difficulty to these parameters,
so that the running-time costs are polynomial in the
non-parameterized part of the input. If a combinatorial problem has a
fixed-parameter algorithm, the problem is called fixed-parameter
tractable. Developing fixed-parameter algorithms makes sense above all
when the parameters considered take only small values in the
application. Fixed-parameter algorithms have become a standard
algorithmic tool in many application areas, among them computational
biology, where small parameter values can be observed in many
applications. The known techniques for designing fixed-parameter
algorithms include bounded search trees. In computational biology
there have so far been only a few examples of the use of bounded
search trees.
This thesis investigates the use of bounded search trees for NP-hard
consensus problems in the analysis of DNA and RNA data. We consider
consensus problems in the analysis of DNA sequence data, in the
analysis of so-called quartet data for building phylogenetic
hypotheses, in the analysis of data on the order of genes, and in the
comparison of RNA structure data. In all cases we present new
efficient algorithms in which the bounded search tree paradigm is
realized in novel ways. Along the way we also obtain parameterized
hardness results, which show that a fixed-parameter algorithm is
unlikely for the problems considered there. In addition, we introduce
integer linear programming as a new technique for showing that a
problem is fixed-parameter tractable. Most of the algorithms presented
here were implemented and tested on application data.
Polyhedral geometry of Phylogenetic Rogue Taxa
It is well known among phylogeneticists that adding an extra taxon (e.g.
species) to a data set can alter the structure of the optimal phylogenetic tree
in surprising ways. However, little is known about this "rogue taxon" effect.
In this paper we characterize the behavior of balanced minimum evolution (BME)
phylogenetics on data sets of this type using tools from polyhedral geometry.
First we show that for any distance matrix there exist distances to a "rogue
taxon" such that the BME-optimal tree for the data set with the new taxon does
not contain any nontrivial splits (bipartitions) of the optimal tree for the
original data. Second, we prove a theorem which restricts the topology of
BME-optimal trees for data sets of this type, thus showing that a rogue taxon
cannot have an arbitrary effect on the optimal tree. Third, we construct
polyhedral cones computationally which give complete answers for BME rogue
taxon behavior when our original data fits a tree on four, five, and six taxa.
We use these cones to derive sufficient conditions for rogue taxon behavior for
four taxa, and to understand the frequency of the rogue taxon effect via
simulation.
Comment: In this version, we add quartet distances and fix Table 4
Reconstructing Phylogenetic Tree From Multipartite Quartet System
A phylogenetic tree is a graphical representation of the evolutionary history of a set of taxa, in which the leaves correspond to taxa and the non-leaves correspond to speciations. One important problem in phylogenetic analysis is to assemble a global phylogenetic tree from smaller pieces of phylogenetic trees, in particular quartet trees. Quartet Compatibility is the problem of deciding whether there is a phylogenetic tree inducing a given collection of quartet trees, and of constructing such a phylogenetic tree if it exists. It is known that Quartet Compatibility is NP-hard, but only a few results are known for polynomial-time solvable subclasses.
In this paper, we introduce two novel classes of quartet systems, called complete multipartite quartet systems and full multipartite quartet systems, and present polynomial-time algorithms for Quartet Compatibility for these systems. We also observe that complete/full multipartite quartet systems naturally arise from a limited situation of block-restricted measurement.
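
To make the notion of a tree "inducing" a quartet concrete, the following minimal sketch checks whether a given unrooted tree displays a quartet ab|cd. It only verifies a candidate tree; it does not solve Quartet Compatibility and is not the paper's algorithm. The edge-list representation, the function names, and the use of the four-point condition on unit-length edges are assumptions made for this example.

```python
from collections import deque


def build_adjacency(edges):
    """Adjacency lists of an undirected tree given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def distances_from(adj, source):
    """Edge-count distances from source to every node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def displays(adj, quartet):
    """True if the tree induces the quartet ((a, b), (c, d)), i.e. ab|cd.

    Four-point condition: ab|cd holds exactly when d(a,b) + d(c,d) is
    strictly smaller than both other pairings of the four leaves.
    """
    (a, b), (c, d) = quartet
    da, db, dc = (distances_from(adj, x) for x in (a, b, c))
    within = da[b] + dc[d]
    return within < da[c] + db[d] and within < da[d] + db[c]


# Hypothetical five-leaf caterpillar tree with internal nodes u, v, w.
tree = build_adjacency([("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"),
                        ("v", "w"), ("d", "w"), ("e", "w")])
print(displays(tree, (("a", "b"), ("c", "d"))))  # ab|cd is induced
print(displays(tree, (("a", "c"), ("b", "d"))))  # ac|bd is not
```

A collection of quartet trees is compatible exactly when some tree passes this check for every quartet in the collection; finding such a tree is the hard part.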
Designing weights for quartet-based methods when data are heterogeneous across lineages
Homogeneity across lineages is a general assumption in phylogenetics according to which nucleotide substitution rates are common to all lineages. Many phylogenetic methods relax this hypothesis but keep a simple enough model to make the process of sequence evolution more tractable. On the other hand, dealing successfully with the general case (heterogeneity of rates across lineages) is one of the key features of phylogenetic reconstruction methods based on algebraic tools. The goal of this paper is twofold. First, we present a new weighting system for quartets (ASAQ) based on algebraic and semi-algebraic tools, and thus especially suited to data evolving under heterogeneous rates. This method combines the weights of two previous methods by means of a test based on the positivity of the branch lengths estimated with the paralinear distance. ASAQ is statistically consistent when applied to data generated under the general Markov model, considers rate and base composition heterogeneity among lineages, and does not assume stationarity or time-reversibility. Second, we test and compare the performance of several quartet-based methods for phylogenetic tree reconstruction (namely QFM, wQFM, quartet puzzling, weight optimization and Willson's method) in combination with several systems of weights, including ASAQ weights and other weights based on algebraic and semi-algebraic methods or on the paralinear distance. These tests are applied to both simulated and real data and support weight optimization with ASAQ weights as a reliable and successful reconstruction method that improves upon the accuracy of global methods (such as neighbor-joining or maximum likelihood) in the presence of long branches or on mixtures of distributions on trees.
We would like to thank the reviewers of the paper for important contributions that improved the final version of the manuscript. MC, JFS and MGL were partially supported by Spanish State Research Agency grant PID2019-103849GB-I00. MC and JFS were also supported by AEI through the Severo Ochoa and María de Maeztu Program for Centers and Units of Excellence in R&D (project CEX2020-001084-M) and by the AGAUR project 2021 SGR 00603 Geometry of Manifolds and Applications, GEOMVAP.
Peer Reviewed. Postprint (published version)
Fixed-Parameter Algorithms for Computing Kemeny Scores - Theory and Practice
The central problem in this work is to compute a ranking of a set of elements
which is "closest to" a given set of input rankings of the elements. We define
"closest to" in an established way as having the minimum sum of Kendall-Tau
distances to each input ranking. Unfortunately, the resulting problem Kemeny
consensus is NP-hard for instances with n input rankings, n being an even
integer greater than three. Nevertheless this problem plays a central role in
many rank aggregation problems. It was shown that one can compute the
corresponding Kemeny consensus list in f(k) + poly(n) time, where f(k) is a
computable function in one of the parameters "score of the consensus", "maximum
distance between two input rankings", "number of candidates" and "average
pairwise Kendall-Tau distance" and poly(n) a polynomial in the input size. This
work will demonstrate the practical usefulness of the corresponding algorithms
by applying them to randomly generated and several real-world data sets. Thus, we
show that these fixed-parameter algorithms are not only of theoretical
interest. In a more theoretical part of this work we will develop an improved
fixed-parameter algorithm for the parameter "score of the consensus" having a
better upper bound for the running time than previous algorithms.
Comment: Studienarbeit
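
The definition of "closest to" used above can be made concrete with a minimal sketch that computes Kendall-Tau distances and finds a Kemeny consensus by brute force. This is not one of the algorithms from the work: it simply tries every ordering, so its running time is exponential only in the number of candidates, which is one of the parameters named in the abstract. All function names are assumptions made for this example.

```python
from itertools import combinations, permutations


def kendall_tau(r1, r2):
    """Number of candidate pairs that r1 and r2 rank in opposite order."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    return sum(1 for a, b in combinations(r1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)


def kemeny_score(ranking, votes):
    """Sum of Kendall-Tau distances from ranking to every input ranking."""
    return sum(kendall_tau(ranking, v) for v in votes)


def kemeny_consensus(votes):
    """Brute-force search over all orderings (illustrative sketch only)."""
    best = min(permutations(votes[0]), key=lambda p: kemeny_score(p, votes))
    return list(best), kemeny_score(best, votes)


votes = [["a", "b", "c"], ["a", "b", "c"], ["b", "a", "c"]]
print(kemeny_consensus(votes))  # (['a', 'b', 'c'], 1)
```

With m candidates the search inspects m! orderings, so this sketch is only practical for very small m.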
StarBEAST2 Brings Faster Species Tree Inference and Accurate Estimates of Substitution Rates
Fully Bayesian multispecies coalescent (MSC) methods like *BEAST estimate species trees from multiple sequence alignments. Today thousands of genes can be sequenced for a given study, but using that many genes with *BEAST is intractably slow. An alternative is to use heuristic methods which compromise accuracy or completeness in return for speed. A common heuristic is concatenation, which assumes that the evolutionary history of each gene tree is identical to the species tree. This is an inconsistent estimator of species tree topology, a worse estimator of divergence times, and induces spurious substitution rate variation when incomplete lineage sorting is present. Another class of heuristics directly motivated by the MSC avoids many of the pitfalls of concatenation but cannot be used to estimate divergence times. To enable fuller use of available data and more accurate inference of species tree topologies, divergence times, and substitution rates, we have developed a new version of *BEAST called StarBEAST2. To improve convergence rates we add analytical integration of population sizes, novel MCMC operators and other optimizations. Computational performance improved by 13.5× and 13.8×, respectively, when analyzing two empirical data sets, and by an average of 33.1× across 30 simulated data sets. To enable accurate estimates of per-species substitution rates, we introduce species tree relaxed clocks, and show that StarBEAST2 is a more powerful and robust estimator of rate variation than concatenation. StarBEAST2 is available through the BEAUti package manager in BEAST 2.4 and above.
This work was supported by a Rutherford Discovery Fellowship awarded to A.J.D. by the Royal Society of New Zealand. H.A.O. was supported by an Australian Laureate Fellowship awarded to Craig Moritz by the Australian Research Council (FL110100104).
Genome-wide RAD sequencing resolves the evolutionary history of serrate leaf Juniperus and reveals discordance with chloroplast phylogeny
Juniper (Juniperus) is an ecologically important conifer genus of the Northern Hemisphere, the members of which are often foundational tree species of arid regions. The serrate leaf margin clade is native to topographically variable regions in North America, where hybridization has likely played a prominent role in its diversification. Here we use a reduced-representation sequencing approach (ddRADseq) to generate a phylogenomic data set for 68 accessions representing all 22 species in the serrate leaf margin clade, as well as a number of close and distant relatives, to improve understanding of diversification in this group. Phylogenetic analyses using three methods (SVDquartets, maximum likelihood, and Bayesian) yielded highly congruent and well-resolved topologies. These phylogenies provided improved resolution relative to past analyses based on Sanger sequencing of nuclear and chloroplast DNA, and were largely consistent with taxonomic expectations based on geography and morphology. Calibration of a Bayesian phylogeny with fossil evidence produced divergence time estimates for the clade consistent with a late Oligocene origin in North America, followed by a period of elevated diversification between 12 and 5 Mya. Comparison of the ddRADseq phylogenies with a phylogeny based on Sanger-sequenced chloroplast DNA revealed five instances of pronounced discordance, illustrating the potential for chloroplast introgression, chloroplast transfer, or incomplete lineage sorting to influence organellar phylogeny. Our results improve understanding of the pattern and tempo of diversification in Juniperus, and highlight the utility of reduced-representation sequencing for resolving phylogenetic relationships in non-model organisms with reticulation and recent divergence.
Maximum likelihood estimation of species trees and anomaly zone detection using ranked gene trees
A phylogenetic tree represents the evolutionary relationships among a set of organisms. Gene trees can be used to reconstruct phylogenetic trees. The methods in this dissertation focus on gene tree topologies, with emphasis on ranked gene tree topologies. A ranked tree depicts the order in which nodes appear in the tree together with the topological relationships among gene lineages. One challenge that arises during phylogenetic inference is the existence of anomaly zones: regions of branch-length space in the species tree that can produce gene trees whose topologies differ from the species tree topology but are more probable than the gene tree matching the species tree. In this work, we show how the parameters of a constant-rate birth-death process used to simulate species trees affect the probability that the species tree lies in the anomaly zone. We prove that the probability that a species tree is in an anomaly zone approaches 1 as the number of species and the birth rate go to infinity in a pure birth process. We propose a heuristic approach to infer whether species trees lie in the different types of anomaly zones when it is intractable to compute the entire distribution of gene tree topologies.
In this dissertation, we develop the first maximum likelihood (ML) method that infers a species tree from ranked gene trees. We introduce the software PRANC, which can compute the probabilities of ranked gene trees under the coalescent process and infer an ML species tree. We propose methods to estimate a starting tree so that the ML species tree can be located quickly. To illustrate the proposed methods, we analyze two experimental studies of skinks and gibbons.