4,122 research outputs found

    A Fast Quartet Tree Heuristic for Hierarchical Clustering

    Get PDF
    The Minimum Quartet Tree Cost problem is to construct an optimal weight tree from the 3(n4)3{n \choose 4} weighted quartet topologies on nn objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a dendrogram, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The problem and the solution heuristic has been extensively used for general hierarchical clustering of nontree-like (non-phylogeny) data in various domains and across domains with heterogeneous data. We also present a greatly improved heuristic, reducing the running time by a factor of order a thousand to ten thousand. All this is implemented and available, as part of the CompLearn package. We compare performance and running time of the original and improved versions with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package on genomic data for which the latter are optimized. Keywords: Data and knowledge visualization, Pattern matching--Clustering--Algorithms/Similarity measures, Hierarchical clustering, Global optimization, Quartet tree, Randomized hill-climbing,Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with arXiv:cs/0606048 in cs.D

    The similarity metric

    Full text link
    A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new ``normalized information distance'', based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the {\em similarity metric}. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages.Comment: 13 pages, LaTex, 5 figures, Part of this work appeared in Proc. 14th ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected, version to appear in IEEE Trans Inform. T

    Computational pan-genomics: status, promises and challenges

    Get PDF
    International audienceMany disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains

    Algorithms for Searching and Analyzing Sets of Evolutionary Trees

    Get PDF
    The evolutionary relationships between organisms are represented as phylogenetic trees. These trees have important implications for understanding biodiversity, tracking disease, and designing medicine. Since the evolutionary process that led to modern biodiversity was not directly recorded, phylogenetic trees are inferred from modern observations. Inferring accurate phylogenies is computationally difficult and many inference algorithms produce multiple phylogenetic trees of equal quality. The common method for presenting a set of trees is to summarize their common features into a single consensus tree. Consensus methods make it easy to tell which features are common to a set of trees, but how do you explore the hypotheses that are not the majority of trees? This question is best answered by a search algorithm. We present algorithms to query a set of trees based on their internal structure. Trees can be queried based on their bipartitions, quartets, clades, subtrees, or taxa, and we present a new concept which unifies edge based relationships for search functions. To extend the power of our search functions we provide the ability to combine the results of multiple searches using set operations. We also explore the differences between sets of trees. Clustering algorithms can detect if there are multiple distinct hypotheses within a set of trees. Decision tree depth and distinguishing bipartitions can be used to measure the similarity between sets of trees. For situations where a set of trees is made up of multiple distinct sets, we present p-support which is a measure to quantify the impact of the individual sets on a single consensus tree. The algorithms are presented within the context of TreeHouse. This is my open source platform for querying and analyzing sets of trees. One goal of TreeHouse was to unite query and analysis algorithms under a single user interface. The seamless interaction between fast filtering and analysis algorithms allows users to the explore their data in a way not easily accomplished elsewhere. We believe that the algorithms in this document and in TreeHouse can shed new light on often unexplored territory
    • …
    corecore