166,527 research outputs found

    Hierarchical clustering with dot products recovers hidden tree structure

    Full text link
    In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN

    Soft clustering analysis of galaxy morphologies: A worked example with SDSS

    Full text link
    Context: The huge and still rapidly growing amount of galaxies in modern sky surveys raises the need of an automated and objective classification method. Unsupervised learning algorithms are of particular interest, since they discover classes automatically. Aims: We briefly discuss the pitfalls of oversimplified classification methods and outline an alternative approach called "clustering analysis". Methods: We categorise different classification methods according to their capabilities. Based on this categorisation, we present a probabilistic classification algorithm that automatically detects the optimal classes preferred by the data. We explore the reliability of this algorithm in systematic tests. Using a small sample of bright galaxies from the SDSS, we demonstrate the performance of this algorithm in practice. We are able to disentangle the problems of classification and parametrisation of galaxy morphologies in this case. Results: We give physical arguments that a probabilistic classification scheme is necessary. The algorithm we present produces reasonable morphological classes and object-to-class assignments without any prior assumptions. Conclusions: There are sophisticated automated classification algorithms that meet all necessary requirements, but a lot of work is still needed on the interpretation of the results.Comment: 18 pages, 19 figures, 2 tables, submitted to A

    Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm

    Full text link
    The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. However there are different interpretations in the literature and there are different implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method.Comment: 20 pages, 21 citations, 4 figure
    • …
    corecore