
    A Fast Quartet Tree Heuristic for Hierarchical Clustering

    The Minimum Quartet Tree Cost problem is to construct an optimal weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a dendrogram, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The problem and the solution heuristic have been extensively used for general hierarchical clustering of non-tree-like (non-phylogeny) data in various domains and across domains with heterogeneous data. We also present a greatly improved heuristic, reducing the running time by a factor of order a thousand to ten thousand. All this is implemented and available as part of the CompLearn package. We compare the performance and running time of the original and improved versions with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized.
    Keywords: Data and knowledge visualization, Pattern matching--Clustering--Algorithms/Similarity measures, Hierarchical clustering, Global optimization, Quartet tree, Randomized hill-climbing
    Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with arXiv:cs/0606048 in cs.D
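    To make the randomized hill climbing concrete, here is a minimal sketch, not the CompLearn implementation: for brevity it fixes a random unrooted binary tree shape and climbs only by swapping leaf labels, whereas the paper's heuristic also mutates the tree shape itself, and it maximizes the raw summed quartet weight rather than the paper's normalized score. The weights layout (a dict from each 4-set of leaves to a map from quartet splits to weights) and all function names are assumptions made for illustration.

```python
# Hedged sketch of randomized hill climbing on the quartet tree cost.
import itertools
import random
from collections import deque


def random_binary_tree(leaves):
    """Random unrooted binary tree over `leaves` (strings), as an adjacency dict.

    Internal nodes are tuples ("i", k) so they cannot collide with leaf names.
    """
    leaves = list(leaves)
    random.shuffle(leaves)
    adj = {v: set() for v in leaves}
    adj[("i", 0)] = set()
    for leaf in leaves[:3]:                      # join the first three leaves in a star
        adj[("i", 0)].add(leaf)
        adj[leaf].add(("i", 0))
    internal = 0
    for leaf in leaves[3:]:                      # attach each remaining leaf on a random edge
        u = random.choice([v for v in adj if adj[v]])
        w = random.choice(list(adj[u]))
        internal += 1
        new = ("i", internal)
        adj[u].discard(w)
        adj[w].discard(u)
        adj[new] = {u, w, leaf}
        adj[u].add(new)
        adj[w].add(new)
        adj[leaf] = {new}
    return adj


def path_len(adj, src, dst):
    """Number of edges on the unique tree path from src to dst (BFS)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        v, dist = frontier.popleft()
        if v == dst:
            return dist
        for nb in adj[v]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    raise ValueError("nodes are not connected")


def embedded_split(adj, a, b, c, d):
    """Quartet topology the tree embeds for {a,b,c,d}: the pairing with the shortest path sum."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    sums = [path_len(adj, *p) + path_len(adj, *q) for p, q in pairings]
    p, q = pairings[sums.index(min(sums))]
    return frozenset({frozenset(p), frozenset(q)})


def tree_cost(adj, leaves, weights):
    """Summed weight of the quartet topologies embedded by the tree (to be maximized)."""
    total = 0.0
    for a, b, c, d in itertools.combinations(leaves, 4):
        split = embedded_split(adj, a, b, c, d)
        total += weights.get(frozenset({a, b, c, d}), {}).get(split, 0.0)
    return total


def hill_climb(leaves, weights, steps=2000):
    """Randomized hill climbing: swap two leaf labels, keep the swap only if the cost improves."""
    adj = random_binary_tree(leaves)
    best = tree_cost(adj, leaves, weights)
    for _ in range(steps):
        x, y = random.sample(leaves, 2)
        (px,), (py,) = adj[x], adj[y]             # internal nodes x and y hang from
        adj[px].discard(x); adj[py].discard(y)    # swap the two attachment points
        adj[px].add(y); adj[py].add(x)
        adj[x], adj[y] = {py}, {px}
        cost = tree_cost(adj, leaves, weights)
        if cost > best:
            best = cost
        else:                                     # no improvement: undo the swap
            adj[px].discard(y); adj[py].discard(x)
            adj[px].add(x); adj[py].add(y)
            adj[x], adj[y] = {px}, {py}
    return adj, best


# Tiny illustration: four objects, one quartet, weight 1.0 on the ab|cd topology.
objs = ["a", "b", "c", "d"]
w = {frozenset(objs): {frozenset({frozenset({"a", "b"}), frozenset({"c", "d"})}): 1.0}}
tree, score = hill_climb(objs, w, steps=50)
print(score)   # typically 1.0 once a swap produces (or the start tree already embeds) ab|cd
```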

    Normalized Web Distance and Word Similarity

    There is a great deal of work in cognitive psychology, linguistics, and computer science, about using word (or phrase) frequencies in context in text corpora to develop measures for word similarity or word association, going back to at least the 1960s. The goal of this chapter is to introduce the normalized web distance (NWD) method to determine similarity between words and phrases. It is a general way to tap the amorphous low-grade knowledge available for free on the Internet, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available for all by using any search engine that can return aggregate page-count estimates for a large range of search-queries. In the paper introducing the NWD it was called `normalized Google distance (NGD),' but since Google doesn't allow computer searches anymore, we opt for the more neutral and descriptive NWD.
    Comment: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN 978-142008592
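    For reference, the NWD of two search terms x and y is computed from the page count f(x) for x, the page count f(y) for y, the page count f(x,y) for pages containing both, and the total number N of indexed pages, as NWD(x,y) = (max{log f(x), log f(y)} - log f(x,y)) / (log N - min{log f(x), log f(y)}). A minimal sketch follows; the page counts are assumed to be supplied by the caller, and how they are obtained from a search engine is outside the scope of the snippet.

```python
# Sketch of the NWD formula; f_x, f_y, f_xy are aggregate page counts and n is
# the assumed total number of indexed pages. The numbers below are illustrative,
# not real search-engine counts.
from math import log


def nwd(f_x: float, f_y: float, f_xy: float, n: float) -> float:
    """Normalized web distance from aggregate page counts."""
    if min(f_x, f_y, f_xy) <= 0 or n <= 1:
        raise ValueError("counts must be positive and n > 1")
    numerator = max(log(f_x), log(f_y)) - log(f_xy)
    denominator = log(n) - min(log(f_x), log(f_y))
    return numerator / denominator


print(nwd(f_x=10_000, f_y=8_000, f_xy=3_000, n=10**10))
```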

    Normalized Information Distance

    The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
    Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in: Information Theory and Statistical Learning, Eds. M. Dehmer, F. Emmert-Streib, Springer-Verlag, New-York, To appear
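    As a concrete instance of the compression-based realization, the normalized compression distance (NCD) replaces the uncomputable Kolmogorov complexity K by the length C of a compressed encoding: NCD(x,y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}. A minimal sketch, using zlib purely as a stand-in for whichever real-world compressor one prefers:

```python
# Normalized compression distance: approximate Kolmogorov complexity by the
# length of a zlib-compressed encoding; any other compressor can be swapped in.
import zlib


def c(data: bytes) -> int:
    """Compressed length, used as a stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


print(ncd(b"the quick brown fox jumps over the lazy dog",
          b"the quick brown fox jumps over the lazy cat"))   # small: very similar
print(ncd(b"the quick brown fox jumps over the lazy dog",
          b"01101001" * 6))                                   # larger: unrelated
```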

    Personalized Web Image Organization

    Due to the semantic gap problem, i.e. the visual content of an image may not represent its semantics well, existing efforts on web image organization usually transform this task into clustering the surrounding text. However, because the surrounding text is usually short and the words therein typically appear only once, existing text clustering algorithms can hardly exploit the statistical information for image representation and may suffer degraded performance at higher computational cost caused by learning from noisy tags. This chapter presents the use of the Probabilistic ART with user preference architecture, as introduced in Sects. 3.5 and 3.4, for personalized web image organization. The fused algorithm, named Probabilistic Fusion ART (PF-ART), groups images of similar semantics together and simultaneously mines the key tags/topics of individual clusters. Moreover, it performs semi-supervised learning using the tags users provide for images, giving users direct control over the generated clusters. An agglomerative merging strategy is then used to organize the clusters into a hierarchy with a multi-branch tree structure, rather than the binary tree produced by traditional hierarchical clustering algorithms. The entire two-step algorithm is called Personalized Hierarchical Theme-based Clustering (PHTC), for tag-based web image organization. Two large-scale real-world web image collections, namely the NUS-WIDE and Flickr datasets, are used to evaluate PHTC and compare it with existing algorithms in terms of clustering performance and time cost.
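    The multi-branch merging idea can be pictured with a short sketch. The code below is only an illustration of that idea, not the published PF-ART/PHTC procedure: clusters are represented by bag-of-tags count vectors, cosine similarity and a single merging threshold are assumptions, and the actual algorithm's learning and user-preference mechanisms are omitted entirely.

```python
# Hedged illustration of one agglomerative merging level: any number of clusters
# whose tag profiles are similar enough are grouped under the same parent, so the
# result is a multi-branch node rather than the pairwise merge of a binary tree.
from math import sqrt


def cosine(p, q):
    """Cosine similarity of two sparse tag-count dicts."""
    dot = sum(v * q.get(t, 0) for t, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def merge_level(clusters, threshold):
    """Greedily group clusters whose profiles exceed the similarity threshold."""
    parents, used = [], set()
    for i, (profile_i, children_i) in enumerate(clusters):
        if i in used:
            continue
        profile, children = dict(profile_i), [children_i]
        for j in range(i + 1, len(clusters)):
            if j in used:
                continue
            profile_j, children_j = clusters[j]
            if cosine(profile, profile_j) >= threshold:
                used.add(j)
                children.append(children_j)
                for t, v in profile_j.items():       # accumulate the merged tag counts
                    profile[t] = profile.get(t, 0) + v
        parents.append((profile, children))
    return parents


# Toy tag profiles standing in for the output clusters of the first step:
leaf_clusters = [
    ({"beach": 5, "sea": 3}, ["cluster_1"]),
    ({"sea": 4, "sunset": 2}, ["cluster_2"]),
    ({"city": 6, "night": 3}, ["cluster_3"]),
]
print(merge_level(leaf_clusters, threshold=0.3))     # the two sea-related clusters merge
```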