140 research outputs found

    A list of parameterized problems in bioinformatics

    In this report we present a list of problems that originated in bioinformatics. Our aim is to collect information on such problems that have been analyzed from the point of view of Parameterized Complexity. For every problem we give its definition and biological motivation together with known complexity results. Postprint (published version)

    Festparameter-Algorithmen fuer die Konsens-Analyse Genomischer Daten

    Fixed-parameter algorithms offer a constructive and powerful approach to solving NP-hard problems efficiently, combining two important goals: they compute optimal solutions, and they do so within provable time bounds, despite the (almost inevitable) computational intractability of NP-hard problems. The essential idea is to identify one or more aspects of the input to a problem as the parameters, and to confine the combinatorial explosion of computational difficulty to a function of the parameters, such that the costs are polynomial in the non-parameterized part of the input. This makes sense especially for parameters that take small values in applications. Fixed-parameter algorithms have become an established algorithmic tool in a variety of application areas, among them computational biology, where small values for problem parameters are often observed. A number of design techniques for fixed-parameter algorithms have been proposed, and bounded search trees are one of them. In computational biology, however, examples of bounded search tree algorithms have so far been rare. This thesis investigates the use of bounded search tree algorithms for consensus problems in the analysis of DNA and RNA data. More precisely, we investigate consensus problems in the contexts of sequence analysis, of quartet methods for phylogenetic reconstruction, of gene order analysis, and of RNA secondary structure comparison. In all cases, we present new efficient algorithms that incorporate the bounded search tree paradigm in novel ways. Along the way, we also obtain parameterized hardness results, showing that the respective problems are unlikely to allow for a fixed-parameter algorithm, and we introduce integer linear programs (ILPs) as a tool for classifying problems as fixed-parameter tractable, i.e., as having fixed-parameter algorithms. 
Most of our algorithms were implemented and tested on practical data.
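The bounded search tree idea described in this abstract can be illustrated with a small sketch. The abstract does not state which problems or algorithms the thesis covers in detail, so the choice of Closest String (find a string within Hamming distance d of every input string) is an assumption made here for illustration, and the names `closest_string` and `hamming_positions` are hypothetical. The sketch assumes all input strings have equal length and d is non-negative.

```python
from typing import List, Optional, Sequence


def hamming_positions(a: Sequence[str], b: Sequence[str]) -> List[int]:
    """Return the positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def closest_string(strings: List[str], d: int) -> Optional[str]:
    """Search for a string within Hamming distance d of every input string.

    Illustrative bounded search tree: the recursion depth is at most d and
    each node branches into at most d + 1 children, so the tree size depends
    only on the parameter d, not on the number or length of the strings.
    """

    def search(cand: List[str], budget: int) -> Optional[str]:
        for s in strings:
            diff = hamming_positions(cand, s)
            if len(diff) > d:
                if budget == 0:
                    return None  # no modifications left on this branch
                # Any valid solution agrees with s on at least one of
                # these d + 1 positions, so branching over them is enough.
                for p in diff[: d + 1]:
                    moved = cand.copy()
                    moved[p] = s[p]
                    found = search(moved, budget - 1)
                    if found is not None:
                        return found
                return None
        return "".join(cand)  # every input string is within distance d

    if not strings:
        return None
    # Start from the first input string; a solution differs from it
    # in at most d positions, which is the modification budget.
    return search(list(strings[0]), d)
```

For example, `closest_string(["AAA", "TTT"], 2)` returns some string within distance 2 of both inputs, while the same call with d = 1 returns `None` because no such string exists.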

    Polyhedral geometry of Phylogenetic Rogue Taxa

    It is well known among phylogeneticists that adding an extra taxon (e.g. species) to a data set can alter the structure of the optimal phylogenetic tree in surprising ways. However, little is known about this "rogue taxon" effect. In this paper we characterize the behavior of balanced minimum evolution (BME) phylogenetics on data sets of this type using tools from polyhedral geometry. First we show that for any distance matrix there exist distances to a "rogue taxon" such that the BME-optimal tree for the data set with the new taxon does not contain any nontrivial splits (bipartitions) of the optimal tree for the original data. Second, we prove a theorem which restricts the topology of BME-optimal trees for data sets of this type, thus showing that a rogue taxon cannot have an arbitrary effect on the optimal tree. Third, we construct polyhedral cones computationally which give complete answers for BME rogue taxon behavior when our original data fits a tree on four, five, and six taxa. We use these cones to derive sufficient conditions for rogue taxon behavior for four taxa, and to understand the frequency of the rogue taxon effect via simulation. Comment: In this version, we add quartet distances and fix Table 4.

    Reconstructing Phylogenetic Tree From Multipartite Quartet System

    A phylogenetic tree is a graphical representation of the evolutionary history of a set of taxa, in which the leaves correspond to taxa and the non-leaves correspond to speciations. One of the important problems in phylogenetic analysis is to assemble a global phylogenetic tree from smaller pieces of phylogenetic trees, in particular quartet trees. Quartet Compatibility is the problem of deciding whether there is a phylogenetic tree inducing a given collection of quartet trees, and of constructing such a phylogenetic tree if it exists. It is known that Quartet Compatibility is NP-hard, but only a few results are known for polynomial-time solvable subclasses. In this paper, we introduce two novel classes of quartet systems, called complete multipartite quartet systems and full multipartite quartet systems, and present polynomial-time algorithms for Quartet Compatibility for these systems. We also see that complete/full multipartite quartet systems naturally arise from a limited situation of block-restricted measurement.
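To make "a tree inducing a quartet tree" concrete, the following minimal sketch checks whether a given candidate tree displays a quartet ab|cd. It only verifies a tree that is already in hand; it does not decide Quartet Compatibility and is not the paper's algorithm. The nested-tuple tree encoding and the names `clusters` and `displays` are assumptions chosen for illustration, and all four quartet labels are assumed to be distinct leaves of the tree.

```python
from typing import FrozenSet, List, Tuple, Union

# A tree is either a leaf label or a tuple of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", ...]]
Quartet = Tuple[Tuple[str, str], Tuple[str, str]]


def clusters(tree: Tree) -> List[FrozenSet[str]]:
    """Collect the leaf set below every node of the tree."""
    found: List[FrozenSet[str]] = []

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            below = frozenset([node])
        else:
            below = frozenset().union(*(leaves(child) for child in node))
        found.append(below)
        return below

    leaves(tree)
    return found


def displays(tree: Tree, quartet: Quartet) -> bool:
    """Return True if some edge of the tree separates {a, b} from {c, d}.

    Each cluster corresponds to the edge above its node, which splits the
    leaves into the cluster and its complement.
    """
    (a, b), (c, d) = quartet
    left, right = {a, b}, {c, d}
    for side in clusters(tree):
        if left <= side and not (right & side):
            return True
        if right <= side and not (left & side):
            return True
    return False
```

With the tree `((("a", "b"), "c"), ("d", "e"))`, the quartet ab|cd is displayed, whereas ac|bd is not, since no edge separates {a, c} from {b, d}.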

    Designing weights for quartet-based methods when data are heterogeneous across lineages

    Homogeneity across lineages is a general assumption in phylogenetics according to which nucleotide substitution rates are common to all lineages. Many phylogenetic methods relax this hypothesis but keep a simple enough model to make the process of sequence evolution more tractable. On the other hand, dealing successfully with the general case (heterogeneity of rates across lineages) is one of the key features of phylogenetic reconstruction methods based on algebraic tools. The goal of this paper is twofold. First, we present a new weighting system for quartets (ASAQ) based on algebraic and semi-algebraic tools, and thus especially suited to data evolving under heterogeneous rates. This method combines the weights of two previous methods by means of a test based on the positivity of the branch lengths estimated with the paralinear distance. ASAQ is statistically consistent when applied to data generated under the general Markov model, considers rate and base composition heterogeneity among lineages, and assumes neither stationarity nor time-reversibility. Second, we test and compare the performance of several quartet-based methods for phylogenetic tree reconstruction (namely QFM, wQFM, quartet puzzling, weight optimization and Willson’s method) in combination with several systems of weights, including ASAQ weights and other weights based on algebraic and semi-algebraic methods or on the paralinear distance. These tests are applied to both simulated and real data and support weight optimization with ASAQ weights as a reliable and successful reconstruction method that improves upon the accuracy of global methods (such as neighbor-joining or maximum likelihood) in the presence of long branches or on mixtures of distributions on trees. We would like to thank the reviewers of the paper for important contributions that improved the final version of the manuscript.
MC, JFS and MGL were partially supported by Spanish State Research Agency grant PID2019-103849GB-I00. MC and JFS were also supported by AEI through the Severo Ochoa and María de Maeztu Program for Centers and Units of Excellence in R&D (project CEX2020-001084-M) and by the AGAUR project 2021 SGR 00603 Geometry of Manifolds and Applications, GEOMVAP. Peer Reviewed. Postprint (published version)

    Fixed-Parameter Algorithms for Computing Kemeny Scores - Theory and Practice

    The central problem in this work is to compute a ranking of a set of elements which is "closest to" a given set of input rankings of the elements. We define "closest to" in an established way as having the minimum sum of Kendall-Tau distances to each input ranking. Unfortunately, the resulting problem, Kemeny consensus, is NP-hard for instances with n input rankings, n being an even integer greater than three. Nevertheless, this problem plays a central role in many rank aggregation problems. It was shown that one can compute the corresponding Kemeny consensus list in f(k) + poly(n) time, where f(k) is a computable function of one of the parameters "score of the consensus", "maximum distance between two input rankings", "number of candidates" and "average pairwise Kendall-Tau distance", and poly(n) is a polynomial in the input size. This work demonstrates the practical usefulness of the corresponding algorithms by applying them to randomly generated data and to several real-world data sets. Thus, we show that these fixed-parameter algorithms are not only of theoretical interest. In a more theoretical part of this work, we develop an improved fixed-parameter algorithm for the parameter "score of the consensus" that has a better upper bound on the running time than previous algorithms. Comment: Studienarbeit
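The definitions in this abstract can be made concrete with a short sketch that computes the Kendall-Tau distance between two rankings and finds a Kemeny consensus by exhaustive search. This is not one of the fixed-parameter algorithms studied in the work; it is a brute force over all orderings, feasible only when the number of candidates is small. The function names are hypothetical, and every ranking is assumed to be a strict ordering of the same candidates.

```python
from itertools import combinations, permutations
from typing import List, Sequence, Tuple


def kendall_tau(r1: Sequence[str], r2: Sequence[str]) -> int:
    """Count the candidate pairs that the two rankings order differently."""
    pos1 = {x: i for i, x in enumerate(r1)}
    pos2 = {x: i for i, x in enumerate(r2)}
    return sum(
        1
        for x, y in combinations(r1, 2)
        if (pos1[x] < pos1[y]) != (pos2[x] < pos2[y])
    )


def kemeny_score(ranking: Sequence[str], votes: List[Sequence[str]]) -> int:
    """Sum of Kendall-Tau distances from one ranking to all input rankings."""
    return sum(kendall_tau(ranking, vote) for vote in votes)


def kemeny_consensus(votes: List[Sequence[str]]) -> Tuple[List[str], int]:
    """Return a ranking of minimum Kemeny score, found by trying all orderings."""
    candidates = votes[0]
    best = min(permutations(candidates), key=lambda p: kemeny_score(p, votes))
    return list(best), kemeny_score(best, votes)
```

For two votes a > b > c and one vote b > a > c, the consensus is a > b > c with score 1: it disagrees with the third vote on the single pair (a, b).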

    StarBEAST2 Brings Faster Species Tree Inference and Accurate Estimates of Substitution Rates

    Fully Bayesian multispecies coalescent (MSC) methods like *BEAST estimate species trees from multiple sequence alignments. Today thousands of genes can be sequenced for a given study, but using that many genes with *BEAST is intractably slow. An alternative is to use heuristic methods which compromise accuracy or completeness in return for speed. A common heuristic is concatenation, which assumes that the evolutionary history of each gene tree is identical to the species tree. This is an inconsistent estimator of species tree topology, a worse estimator of divergence times, and induces spurious substitution rate variation when incomplete lineage sorting is present. Another class of heuristics directly motivated by the MSC avoids many of the pitfalls of concatenation but cannot be used to estimate divergence times. To enable fuller use of available data and more accurate inference of species tree topologies, divergence times, and substitution rates, we have developed a new version of *BEAST called StarBEAST2. To improve convergence rates we add analytical integration of population sizes, novel MCMC operators and other optimizations. Computational performance improved by 13.5× and 13.8× respectively when analyzing two empirical data sets, and an average of 33.1× across 30 simulated data sets. To enable accurate estimates of per-species substitution rates, we introduce species tree relaxed clocks, and show that StarBEAST2 is a more powerful and robust estimator of rate variation than concatenation. StarBEAST2 is available through the BEAUTi package manager in BEAST 2.4 and above. This work was supported by a Rutherford Discovery Fellowship awarded to A.J.D. by the Royal Society of New Zealand. H.A.O. was supported by an Australian Laureate Fellowship awarded to Craig Moritz by the Australian Research Council (FL110100104).

    Genome-wide RAD sequencing resolves the evolutionary history of serrate leaf Juniperus and reveals discordance with chloroplast phylogeny

    Juniper (Juniperus) is an ecologically important conifer genus of the Northern Hemisphere, the members of which are often foundational tree species of arid regions. The serrate leaf margin clade is native to topologically variable regions in North America, where hybridization has likely played a prominent role in their diversification. Here we use a reduced-representation sequencing approach (ddRADseq) to generate a phylogenomic data set for 68 accessions representing all 22 species in the serrate leaf margin clade, as well as a number of close and distant relatives, to improve understanding of diversification in this group. Phylogenetic analyses using three methods (SVDquartets, maximum likelihood, and Bayesian) yielded highly congruent and well-resolved topologies. These phylogenies provided improved resolution relative to past analyses based on Sanger sequencing of nuclear and chloroplast DNA, and were largely consistent with taxonomic expectations based on geography and morphology. Calibration of a Bayesian phylogeny with fossil evidence produced divergence time estimates for the clade consistent with a late Oligocene origin in North America, followed by a period of elevated diversification between 12 and 5 Mya. Comparison of the ddRADseq phylogenies with a phylogeny based on Sanger-sequenced chloroplast DNA revealed five instances of pronounced discordance, illustrating the potential for chloroplast introgression, chloroplast transfer, or incomplete lineage sorting to influence organellar phylogeny. Our results improve understanding of the pattern and tempo of diversification in Juniperus, and highlight the utility of reduced-representation sequencing for resolving phylogenetic relationships in non-model organisms with reticulation and recent divergence.

    Maximum likelihood estimation of species trees and anomaly zone detection using ranked gene trees

    A phylogenetic tree represents the evolutionary relationships among a set of organisms. Gene trees can be used to reconstruct phylogenetic trees. The methods in this dissertation focus on gene tree topologies, with emphasis on ranked gene tree topologies. A ranked tree depicts the order in which nodes appear in the tree together with the topological relationships among gene lineages. One challenge that arises during phylogenetic inference is the existence of anomaly zones: regions of branch-length space in the species tree that can produce gene trees whose topologies differ from the species tree topology but are more probable than the gene tree matching the species tree. In this work, we show how the parameters of a constant-rate birth-death process used to simulate species trees affect the probability that the species tree lies in the anomaly zone. We prove that the probability that a species tree is in an anomaly zone approaches 1 as the number of species and the birth rate go to infinity in a pure birth process. We propose a heuristic approach to infer whether species trees lie in the different types of anomaly zones when it is intractable to compute the entire distribution of gene tree topologies. In this dissertation, we develop the first maximum likelihood (ML) method that infers a species tree from ranked gene trees. We introduce the software PRANC, which can compute the probabilities of ranked gene trees under the coalescent process and infer an ML species tree. We propose methods to estimate a starting tree so that the ML species tree can be located quickly. To illustrate the proposed methods, we analyze two experimental studies of skinks and gibbons.