
    A Fast Quartet Tree Heuristic for Hierarchical Clustering

    The Minimum Quartet Tree Cost problem is to construct an optimal weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a dendrogram, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The problem and the solution heuristic have been extensively used for general hierarchical clustering of non-tree-like (non-phylogeny) data in various domains and across domains with heterogeneous data. We also present a greatly improved heuristic, reducing the running time by a factor of order a thousand to ten thousand. All this is implemented and available as part of the CompLearn package. We compare the performance and running time of the original and improved versions with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized.
    Keywords: Data and knowledge visualization, Pattern matching--Clustering--Algorithms/Similarity measures, Hierarchical clustering, Global optimization, Quartet tree, Randomized hill-climbing
    Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with arXiv:cs/0606048 in cs.D
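    To make the randomized hill climbing concrete, here is a minimal sketch, not the CompLearn implementation: for brevity it fixes a random unrooted binary tree shape and climbs only by swapping leaf labels, whereas the paper's heuristic also mutates the tree shape itself, and it maximizes the raw summed quartet weight rather than the paper's normalized score. The weights layout (a dict from each 4-set of leaves to a map from quartet splits to weights) and all function names are assumptions made for illustration.

```python
# Hedged sketch of randomized hill climbing on the quartet tree cost.
import itertools
import random
from collections import deque


def random_binary_tree(leaves):
    """Random unrooted binary tree over `leaves` (strings), as an adjacency dict.

    Internal nodes are tuples ("i", k) so they cannot collide with leaf names.
    """
    leaves = list(leaves)
    random.shuffle(leaves)
    adj = {v: set() for v in leaves}
    adj[("i", 0)] = set()
    for leaf in leaves[:3]:                      # join the first three leaves in a star
        adj[("i", 0)].add(leaf)
        adj[leaf].add(("i", 0))
    internal = 0
    for leaf in leaves[3:]:                      # attach each remaining leaf on a random edge
        u = random.choice([v for v in adj if adj[v]])
        w = random.choice(list(adj[u]))
        internal += 1
        new = ("i", internal)
        adj[u].discard(w)
        adj[w].discard(u)
        adj[new] = {u, w, leaf}
        adj[u].add(new)
        adj[w].add(new)
        adj[leaf] = {new}
    return adj


def path_len(adj, src, dst):
    """Number of edges on the unique tree path from src to dst (BFS)."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        v, dist = frontier.popleft()
        if v == dst:
            return dist
        for nb in adj[v]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    raise ValueError("nodes are not connected")


def embedded_split(adj, a, b, c, d):
    """Quartet topology the tree embeds for {a,b,c,d}: the pairing with the shortest path sum."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    sums = [path_len(adj, *p) + path_len(adj, *q) for p, q in pairings]
    p, q = pairings[sums.index(min(sums))]
    return frozenset({frozenset(p), frozenset(q)})


def tree_cost(adj, leaves, weights):
    """Summed weight of the quartet topologies embedded by the tree (to be maximized)."""
    total = 0.0
    for a, b, c, d in itertools.combinations(leaves, 4):
        split = embedded_split(adj, a, b, c, d)
        total += weights.get(frozenset({a, b, c, d}), {}).get(split, 0.0)
    return total


def hill_climb(leaves, weights, steps=2000):
    """Randomized hill climbing: swap two leaf labels, keep the swap only if the cost improves."""
    adj = random_binary_tree(leaves)
    best = tree_cost(adj, leaves, weights)
    for _ in range(steps):
        x, y = random.sample(leaves, 2)
        (px,), (py,) = adj[x], adj[y]             # internal nodes x and y hang from
        adj[px].discard(x); adj[py].discard(y)    # swap the two attachment points
        adj[px].add(y); adj[py].add(x)
        adj[x], adj[y] = {py}, {px}
        cost = tree_cost(adj, leaves, weights)
        if cost > best:
            best = cost
        else:                                     # no improvement: undo the swap
            adj[px].discard(y); adj[py].discard(x)
            adj[px].add(x); adj[py].add(y)
            adj[x], adj[y] = {px}, {py}
    return adj, best


# Tiny illustration: four objects, one quartet, weight 1.0 on the ab|cd topology.
objs = ["a", "b", "c", "d"]
w = {frozenset(objs): {frozenset({frozenset({"a", "b"}), frozenset({"c", "d"})}): 1.0}}
tree, score = hill_climb(objs, w, steps=50)
print(score)   # typically 1.0 once a swap produces (or the start tree already embeds) ab|cd
```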

    Normalized Web Distance and Word Similarity

    There is a great deal of work in cognitive psychology, linguistics, and computer science, about using word (or phrase) frequencies in context in text corpora to develop measures for word similarity or word association, going back to at least the 1960s. The goal of this chapter is to introduce the normalized web distance (NWD) method to determine similarity between words and phrases. It is a general way to tap the amorphous low-grade knowledge available for free on the Internet, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available for all by using any search engine that can return aggregate page-count estimates for a large range of search-queries. In the paper introducing the NWD it was called `normalized Google distance (NGD),' but since Google doesn't allow computer searches anymore, we opt for the more neutral and descriptive NWD.
    Comment: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN 978-142008592
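    For reference, the NWD of two search terms x and y is computed from the page count f(x) for x, the page count f(y) for y, the page count f(x,y) for pages containing both, and the total number N of indexed pages, as NWD(x,y) = (max{log f(x), log f(y)} - log f(x,y)) / (log N - min{log f(x), log f(y)}). A minimal sketch follows; the page counts are assumed to be supplied by the caller, and how they are obtained from a search engine is outside the scope of the snippet.

```python
# Sketch of the NWD formula; f_x, f_y, f_xy are aggregate page counts and n is
# the assumed total number of indexed pages. The numbers below are illustrative,
# not real search-engine counts.
from math import log


def nwd(f_x: float, f_y: float, f_xy: float, n: float) -> float:
    """Normalized web distance from aggregate page counts."""
    if min(f_x, f_y, f_xy) <= 0 or n <= 1:
        raise ValueError("counts must be positive and n > 1")
    numerator = max(log(f_x), log(f_y)) - log(f_xy)
    denominator = log(n) - min(log(f_x), log(f_y))
    return numerator / denominator


print(nwd(f_x=10_000, f_y=8_000, f_xy=3_000, n=10**10))
```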

    Normalized Information Distance

    The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
    Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in: Information Theory and Statistical Learning, Eds. M. Dehmer, F. Emmert-Streib, Springer-Verlag, New-York, To appear
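    As a concrete instance of the compression-based realization, the normalized compression distance (NCD) replaces the uncomputable Kolmogorov complexity K by the length C of a compressed encoding: NCD(x,y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}. A minimal sketch, using zlib purely as a stand-in for whichever real-world compressor one prefers:

```python
# Normalized compression distance: approximate Kolmogorov complexity by the
# length of a zlib-compressed encoding; any other compressor can be swapped in.
import zlib


def c(data: bytes) -> int:
    """Compressed length, used as a stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


print(ncd(b"the quick brown fox jumps over the lazy dog",
          b"the quick brown fox jumps over the lazy cat"))   # small: very similar
print(ncd(b"the quick brown fox jumps over the lazy dog",
          b"01101001" * 6))                                   # larger: unrelated
```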

    Personalized Web Image Organization

    Due to the semantic gap problem, i.e. the visual content of an image may not represent its semantics well, existing efforts on web image organization usually transform this task into clustering the surrounding text. However, because the surrounding text is usually short and the words therein typically appear only once, existing text clustering algorithms can hardly exploit the statistical information for image representation and may suffer degraded performance at higher computational cost caused by learning from noisy tags. This chapter presents the use of the Probabilistic ART with user preference architecture, as introduced in Sects. 3.5 and 3.4, for personalized web image organization. The fused algorithm, named Probabilistic Fusion ART (PF-ART), groups images of similar semantics together and simultaneously mines the key tags/topics of individual clusters. Moreover, it performs semi-supervised learning using the tags users provide for images, giving users direct control over the generated clusters. An agglomerative merging strategy is then used to organize the clusters into a hierarchy with a multi-branch tree structure, rather than the binary tree produced by traditional hierarchical clustering algorithms. The entire two-step algorithm is called Personalized Hierarchical Theme-based Clustering (PHTC), for tag-based web image organization. Two large-scale real-world web image collections, namely the NUS-WIDE and Flickr datasets, are used to evaluate PHTC and compare it with existing algorithms in terms of clustering performance and time cost.
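    The multi-branch merging idea can be pictured with a short sketch. The code below is only an illustration of that idea, not the published PF-ART/PHTC procedure: clusters are represented by bag-of-tags count vectors, cosine similarity and a single merging threshold are assumptions, and the actual algorithm's learning and user-preference mechanisms are omitted entirely.

```python
# Hedged illustration of one agglomerative merging level: any number of clusters
# whose tag profiles are similar enough are grouped under the same parent, so the
# result is a multi-branch node rather than the pairwise merge of a binary tree.
from math import sqrt


def cosine(p, q):
    """Cosine similarity of two sparse tag-count dicts."""
    dot = sum(v * q.get(t, 0) for t, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def merge_level(clusters, threshold):
    """Greedily group clusters whose profiles exceed the similarity threshold."""
    parents, used = [], set()
    for i, (profile_i, children_i) in enumerate(clusters):
        if i in used:
            continue
        profile, children = dict(profile_i), [children_i]
        for j in range(i + 1, len(clusters)):
            if j in used:
                continue
            profile_j, children_j = clusters[j]
            if cosine(profile, profile_j) >= threshold:
                used.add(j)
                children.append(children_j)
                for t, v in profile_j.items():       # accumulate the merged tag counts
                    profile[t] = profile.get(t, 0) + v
        parents.append((profile, children))
    return parents


# Toy tag profiles standing in for the output clusters of the first step:
leaf_clusters = [
    ({"beach": 5, "sea": 3}, ["cluster_1"]),
    ({"sea": 4, "sunset": 2}, ["cluster_2"]),
    ({"city": 6, "night": 3}, ["cluster_3"]),
]
print(merge_level(leaf_clusters, threshold=0.3))     # the two sea-related clusters merge
```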