103 research outputs found

    Fast, Linear Time Hierarchical Clustering using the Baire Metric

    The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, contrasted with the standard quadratic time agglomerative hierarchical clustering algorithm. In this work we evaluate empirically this new approach to hierarchical clustering. We compare hierarchical clustering based on the Baire metric with (i) agglomerative hierarchical clustering, in terms of algorithm properties; (ii) generalized ultrametrics, in terms of definition; and (iii) fast clustering through k-means partitioning, in terms of quality of results. For the latter, we carry out an in-depth astronomical study. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey using, in this work, about half a million astronomical objects. We want to know how well the (more costly to determine) spectrometric redshifts can predict the (more easily obtained) photometric redshifts, i.e. we seek to regress the spectrometric on the photometric redshifts, and we use clusterwise regression for this. Comment: 27 pages, 6 tables, 10 figures
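As a rough illustration of the longest-common-prefix idea behind the Baire distance, here is a minimal sketch. It assumes each value has already been written as a fixed-width digit string, and it takes the distance to be `base ** -k` for a shared prefix of length `k`, with base 2 as one common convention. The function names and the sample digit strings are hypothetical and not taken from the paper.

```python
def longest_common_prefix(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    k = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        k += 1
    return k


def baire_distance(a: str, b: str, base: float = 2.0) -> float:
    """Baire-style distance: base ** -k, where k is the shared prefix length.

    Assumption: a and b are digit strings of equal width.
    Two values that differ in their first digit are at distance 1.
    """
    return base ** (-longest_common_prefix(a, b))


if __name__ == "__main__":
    # Hypothetical redshift-like values, written as four decimal digits.
    x, y, z = "0561", "0567", "0612"
    print(baire_distance(x, y), baire_distance(y, z), baire_distance(x, z))
```

Because the distance depends only on the shared prefix, any three values satisfy the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)), which is what makes the induced metric an ultrametric.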

    Fast redshift clustering with the Baire (ultra) metric

    The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, contrasted with the standard quadratic time agglomerative hierarchical clustering algorithm. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey using, in this work, about half a million astronomical objects. We want to know how well the (more costly to determine) spectrometric redshifts can predict the (more easily obtained) photometric redshifts, i.e. we seek to regress the spectrometric on the photometric redshifts, and we develop a clusterwise nearest neighbor regression procedure for this. Comment: 14 pages, 6 figures

    Methods of Hierarchical Clustering

    We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. Comment: 21 pages, 2 figures, 1 table, 69 references

    Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement

    We describe many vantage points on the Baire metric and its use in clustering data, or its use in preprocessing and structuring data in order to support search and retrieval operations. In some cases, we proceed directly to clusters and do not directly determine the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We offer insights also on practical implications of precision of data measurement. As a mechanism for treating multidimensional data, including very high dimensional data, we use random projections. Comment: 17 pages, 45 citations, 2 figures
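One way to picture reading a hierarchy off a single pass, without computing any pairwise distances, is to insert each digit string into a prefix tree. The sketch below is an illustrative assumption and not the authors' implementation: the nested-dictionary layout and the names `build_prefix_tree` and `clusters_at_depth` are hypothetical.

```python
def build_prefix_tree(digit_strings):
    """Insert every digit string into a prefix tree in one pass over the data.

    Each node records the indices of the items whose string passes through it,
    so a node at depth k is a cluster of items sharing their first k digits.
    """
    root = {"members": [], "children": {}}
    for idx, s in enumerate(digit_strings):
        node = root
        node["members"].append(idx)
        for ch in s:
            node = node["children"].setdefault(
                ch, {"members": [], "children": {}}
            )
            node["members"].append(idx)
    return root


def clusters_at_depth(root, depth):
    """Return {prefix: member indices} for all nodes at the given depth."""
    nodes = [("", root)]
    for _ in range(depth):
        nodes = [
            (prefix + ch, child)
            for prefix, node in nodes
            for ch, child in node["children"].items()
        ]
    return {prefix: node["members"] for prefix, node in nodes}


if __name__ == "__main__":
    data = ["0561", "0567", "0612", "1234"]  # hypothetical digit strings
    tree = build_prefix_tree(data)
    print(clusters_at_depth(tree, 1))
    print(clusters_at_depth(tree, 2))
```

Each item costs work proportional to the number of digits kept, so the total cost grows linearly with the number of items for a fixed precision. The number of digits retained sets the depth of the hierarchy, which is where the precision of data measurement enters.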

    Mumford dendrograms and discrete p-adic symmetries

    In this article, we present an effective encoding of dendrograms by embedding them into the Bruhat-Tits trees associated to $p$-adic number fields. As an application, we show how strings over a finite alphabet can be encoded in cyclotomic extensions of $\mathbb{Q}_p$ and discuss $p$-adic DNA encoding. The application leads to fast $p$-adic agglomerative hierarchic algorithms similar to the ones recently used e.g. by A. Khrennikov and others. From the viewpoint of $p$-adic geometry, to encode a dendrogram $X$ in a $p$-adic field $K$ means to fix a set $S$ of $K$-rational punctures on the $p$-adic projective line $\mathbb{P}^1$. To $\mathbb{P}^1\setminus S$ is associated in a natural way a subtree inside the Bruhat-Tits tree which recovers $X$, a method first used by F. Kato in 1999 in the classification of discrete subgroups of $\textrm{PGL}_2(K)$. Next, we show how the $p$-adic moduli space $\mathfrak{M}_{0,n}$ of $\mathbb{P}^1$ with $n$ punctures can be applied to the study of time series of dendrograms and those symmetries arising from hyperbolic actions on $\mathbb{P}^1$. In this way, we can associate to certain classes of dynamical systems a Mumford curve, i.e. a $p$-adic algebraic curve with totally degenerate reduction modulo $p$. Finally, we indicate some of our results in the study of general discrete actions on $\mathbb{P}^1$, and their relation to $p$-adic Hurwitz spaces. Comment: 14 pages, 6 figures

    Clustering through High Dimensional Data Scaling: Applications and Implementations

    To analyse very high dimensional data, or large data volumes, we study random projection. Since hierarchically clustered data can be scaled in one dimension, seriation or unidimensional scaling is our primary objective. Having determined a unidimensional scaling of the multidimensional data cloud, this is followed by clustering. In many past case studies we carried out such clustering, using the Baire, or longest common prefix, metric and, simultaneously, ultrametric. In this paper, we examine properties of the seriation, and of the induction of the clustering on the data summarization, through seriation. Simulations are described as well as a small, illustrative example using Fisher’s iris data.
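The pipeline described here (project to one dimension, then cluster on shared leading digits) might be sketched as follows. This is only an illustration under stated assumptions: the random axis is drawn with uniform [0, 1) coefficients, the projected values are rescaled to [0, 1) and truncated to a fixed number of decimal digits, and all function names and sample points are hypothetical.

```python
import random


def random_projection_1d(points, seed=0):
    """Project each point onto one random axis (assumed uniform [0, 1) weights)."""
    rng = random.Random(seed)
    axis = [rng.random() for _ in range(len(points[0]))]
    return [sum(a * x for a, x in zip(axis, p)) for p in points]


def to_digit_strings(values, precision=4):
    """Rescale values to [0, 1) and keep `precision` truncated decimal digits."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values coincide
    scale = 10 ** precision
    out = []
    for v in values:
        q = min(int((v - lo) / span * scale), scale - 1)
        out.append(f"{q:0{precision}d}")
    return out


def prefix_clusters(digit_strings, depth):
    """Group item indices by their first `depth` digits."""
    clusters = {}
    for idx, s in enumerate(digit_strings):
        clusters.setdefault(s[:depth], []).append(idx)
    return clusters


if __name__ == "__main__":
    # Two hypothetical, well-separated groups in three dimensions.
    pts = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.01),
           (10.0, 10.0, 10.0), (10.0, 10.01, 10.0)]
    strings = to_digit_strings(random_projection_1d(pts))
    print(prefix_clusters(strings, depth=1))
```

A single random axis can place distant points close together, so in practice one would expect to compare or combine several projections; that step is left out of this sketch.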

    Algorithms for Hierarchical Clustering: An Overview, II

    We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. This review adds to the earlier version, Murtagh and Contreras (2012).