A Fast Quartet Tree Heuristic for Hierarchical Clustering

Author: Cilibrasi Rudi L.
Vitanyi Paul M. B.
Publication date: 12/09/2014

The Minimum Quartet Tree Cost problem is to construct an optimal weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a dendrogram, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The problem and the solution heuristic have been extensively used for general hierarchical clustering of nontree-like (non-phylogeny) data in various domains and across domains with heterogeneous data. We also present a greatly improved heuristic, reducing the running time by a factor of order a thousand to ten thousand. All this is implemented and available as part of the CompLearn package. We compare performance and running time of the original and improved versions with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized.

Keywords: Data and knowledge visualization, Pattern matching--Clustering--Algorithms/Similarity measures, Hierarchical clustering, Global optimization, Quartet tree, Randomized hill-climbing

Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with arXiv:cs/0606048 in cs.D
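The count of quartet topologies in the abstract above can be checked with a minimal Python sketch. Every set of four objects {a, b, c, d} admits three pairings (ab|cd, ac|bd, ad|bc), which gives three topologies for each of the n-choose-4 quartets. The function name `quartet_topologies` and the sample labels are illustrative assumptions, not part of the CompLearn package.

```python
from itertools import combinations
from math import comb


def quartet_topologies(objects):
    """Yield the three pairings ab|cd, ac|bd, ad|bc for every 4-subset."""
    for a, b, c, d in combinations(objects, 4):
        yield ((a, b), (c, d))
        yield ((a, c), (b, d))
        yield ((a, d), (b, c))


# Hypothetical object labels, used only to illustrate the count.
objects = ["u", "w", "x", "y", "z"]
topologies = list(quartet_topologies(objects))

# The number of topologies matches 3 * C(n, 4) from the abstract.
assert len(topologies) == 3 * comb(len(objects), 4)
```

Because the count grows on the order of n to the fourth power, evaluating every topology for each candidate tree becomes expensive quickly, which is consistent with the abstract's emphasis on a faster heuristic.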

arXiv.org e-Print Archive

CiteSeerX

CWI's Institutional Repository

The similarity metric

Author: Chen Xin
Li Ming
Li Xin
Ma Bin
Vitanyi Paul
Publication date: 01/01/2003

A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and that it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To demonstrate generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatically computed whole mitochondrial phylogeny tree. Second, we fully automatically compute the language tree of 52 different languages.

Comment: 13 pages, LaTeX, 5 figures. Part of this work appeared in Proc. 14th ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected, version to appear in IEEE Trans Inform. T
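Since Kolmogorov complexity is noncomputable, practical use of this distance replaces it with the output length of a real compressor. The Python sketch below illustrates one common form of that approximation, usually called the normalized compression distance. It assumes `zlib` (the DEFLATE algorithm that gzip also uses) as a stand-in compressor; the function names and sample inputs are illustrative and not taken from the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for K(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Approximate normalized distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: a repetitive text and an unrelated byte pattern.
text = b"the quick brown fox jumps over the lazy dog " * 50
other = bytes((i * 37 + 11) % 256 for i in range(2000))

same = ncd(text, text)        # close to 0: the second copy adds little
different = ncd(text, other)  # much larger: the inputs share no structure
```

Values near 0 suggest that one input helps compress the other, while values near 1 suggest they share little. Real compressors only approximate the theoretical distance, so the result depends on the compressor and on the input size.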

arXiv.org e-Print Archive

CiteSeerX

International Migration, Integration and Social Cohesion online publications

Recommended from our members

Improved methods for phylogenetics

Author: Nelesen Serita Marie
Publication date: 13/08/2010

Phylogenetics is the study of evolutionary relationships. It is a scientific endeavour to discover history, and it is not easy. Massive amounts of data together with computationally difficult optimization problems mean that heuristics are prevalent, and ever better techniques are sought. New approaches are valuable if they are more accurate, and even more so if they are faster than pre-existing methods. Improvements to existing algorithms, whether in space requirements or running time, are also worthwhile. This dissertation explores three new techniques, each of which is valuable according to these criteria. The first contribution is TASPI, a system for storing collections of phylogenetic trees and performing post-tree analyses. TASPI stores collections of trees more compactly than the previous method, and this compact structure lends itself to post-tree analyses. This results in the ability to compute strict and majority consensus trees faster than common alternatives. As an added benefit, TASPI is written in ACL2, which allows properties of the algorithms and data structures to be formally verified. The second contribution is an improved method to generate phylogenetic trees. A common methodology involves two steps: first estimating a Multiple Sequence Alignment (MSA), and then estimating a tree using that MSA. This method changes the way in which the MSA is estimated, which leads to improved accuracy of the resultant trees. In some cases, the time required is also reduced. The third contribution is BLuTGEN, a method by which a phylogenetic tree is estimated from sequence data without ever generating an MSA for the full dataset. BLuTGEN is as accurate as one of the best published tree estimation techniques (SATé), but takes a novel approach which allows it to be applied to much larger datasets.

Computer Science

Texas ScholarWorks

A Fast Quartet Tree Heuristic for Hierarchical Clustering

Author: Cilibrasi R. (Rudi)
Vitányi P.M.B. (Paul)
Publication date: 12/09/2014

CWI's Institutional Repository

Computational pan-genomics: status, promises and challenges

Author: Abeel Thomas
Alkan Can
Baaijens Jasmijn
Bakker Paul
Boeva Valentina
Bonnal Raoul
Chiaromonte Francesca
Chikhi Rayan
Ciccarelli Francesca
Cijvat Robin
Datema Erwin
Dijkstra Louis
Duijn Cornelia
Dutilh Bas
Eichler Evan
El-Kebir Mohammed
Ernst Corinna
Eskin Eleazar
Garrison Erik
Ghaffaari Ali
Guryev Victor
Kersey Paul
Klau Gunnar
Kloosterman Wigard
Korbel Jan
Lameijer Eric-Wubbo
Langmead Benjamin
Marschall Tobias
Martin Marcel
Marz Manja
Medvedev Paul
Mu John
Mäkinen Veli
Neerincx Pieter
Novak Adam
Ouwens Klaasjan
Paten Benedict
Peterlongo Pierre
Pisanti Nadia
Porubsky David
Rahmann Sven
Raphael Benjamin
Reinert Knut
Ridder Dick
Ridder Jeroen
Rivals Eric
Sanders Ashley
Schlesner Matthias
Schulz-Trieglaff Ole
Schönhuth Alexander
Sheikhizadeh Siavash
Shneider Carl
Smit Sandra
The Computational Pan-Genomics Consortium
Valenzuela Daniel
Vandin Fabio
Wang Jiayin
Wessels Lodewyk
Ye Kai
Zhang Ying
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2018

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

EUR Research Repository

HAL-MINES ParisTech

Archivio della ricerca della Scuola Superiore Sant'Anna

Radboud Repository

HAL-Rennes 1

Algorithms for Searching and Analyzing Sets of Evolutionary Trees

Author: Brammer Grant
Publication date: 09/01/2015

The evolutionary relationships between organisms are represented as phylogenetic trees. These trees have important implications for understanding biodiversity, tracking disease, and designing medicine. Since the evolutionary process that led to modern biodiversity was not directly recorded, phylogenetic trees are inferred from modern observations. Inferring accurate phylogenies is computationally difficult, and many inference algorithms produce multiple phylogenetic trees of equal quality. The common method for presenting a set of trees is to summarize their common features into a single consensus tree. Consensus methods make it easy to tell which features are common to a set of trees, but how do you explore the hypotheses that are not in the majority of trees? This question is best answered by a search algorithm. We present algorithms to query a set of trees based on their internal structure. Trees can be queried based on their bipartitions, quartets, clades, subtrees, or taxa, and we present a new concept which unifies edge-based relationships for search functions. To extend the power of our search functions we provide the ability to combine the results of multiple searches using set operations. We also explore the differences between sets of trees. Clustering algorithms can detect if there are multiple distinct hypotheses within a set of trees. Decision tree depth and distinguishing bipartitions can be used to measure the similarity between sets of trees. For situations where a set of trees is made up of multiple distinct sets, we present p-support, a measure that quantifies the impact of the individual sets on a single consensus tree. The algorithms are presented within the context of TreeHouse, my open source platform for querying and analyzing sets of trees. One goal of TreeHouse was to unite query and analysis algorithms under a single user interface. The seamless interaction between fast filtering and analysis algorithms allows users to explore their data in a way not easily accomplished elsewhere. We believe that the algorithms in this document and in TreeHouse can shed new light on often unexplored territory.
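The idea of querying a set of trees by bipartition and combining results with set operations can be illustrated with a minimal Python sketch. The representation is an assumption made for illustration: each unrooted tree on taxa A to E is stored as the set of its nontrivial bipartitions, and each bipartition is recorded as the side that does not contain taxon A. The names `split`, `search`, and the three sample trees are hypothetical and do not reflect TreeHouse's actual interface.

```python
def split(*taxa):
    """One side of a bipartition, stored as a hashable set of taxon names."""
    return frozenset(taxa)


# Hypothetical collection: tree name -> set of nontrivial bipartitions.
trees = {
    "t1": {split("C", "D", "E"), split("D", "E")},  # ((A,B),C,(D,E))
    "t2": {split("C", "D", "E"), split("C", "D")},  # ((A,B),E,(C,D))
    "t3": {split("B", "C"), split("D", "E")},       # ((B,C),A,(D,E))
}


def search(collection, bipartition):
    """Return the names of all trees that contain the given bipartition."""
    return {name for name, splits in collection.items() if bipartition in splits}


has_de = search(trees, split("D", "E"))        # trees grouping D with E
has_ab = search(trees, split("C", "D", "E"))   # trees grouping A with B

both = has_de & has_ab      # intersection: trees with both groupings
either = has_de | has_ab    # union: trees with at least one grouping
only_de = has_de - has_ab   # difference: D,E grouped but not A,B
```

Storing query results as plain sets means that intersection, union, and difference come directly from the language, so compound queries need no extra machinery.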

Texas A&M Repository

Challenges Building Online GIS Services to Support Global Biodiversity Mapping and Analysis: Lessons from the Mountain and Plains Database and Informatics project.

Publication venue: 'The University of Kansas'

Crossref

Computational pan-genomics: status, promises and challenges

Publication date: 01/01/2018

Dissertations of the University of Groningen