Fast, Linear Time Hierarchical Clustering using the Baire Metric
The Baire metric induces an ultrametric on a dataset and is of linear
computational complexity, contrasted with the standard quadratic time
agglomerative hierarchical clustering algorithm. In this work we empirically
evaluate this new approach to hierarchical clustering. We compare
hierarchical clustering based on the Baire metric with (i) agglomerative
hierarchical clustering, in terms of algorithm properties; (ii) generalized
ultrametrics, in terms of definition; and (iii) fast clustering through k-means
partitioning, in terms of quality of results. For the latter, we carry out an
in-depth astronomical study. We apply the Baire distance to spectrometric and
photometric redshifts from the Sloan Digital Sky Survey using, in this work,
about half a million astronomical objects. We want to know how well the (more
costly to determine) spectrometric redshifts can predict the (more easily
obtained) photometric redshifts, i.e. we seek to regress the spectrometric on
the photometric redshifts, and we use clusterwise regression for this.
Comment: 27 pages, 6 tables, 10 figures
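As an illustration of the longest-common-prefix idea behind the Baire distance, here is a minimal Python sketch. It is not the authors' implementation. It assumes values in [0, 1) (for example, redshifts), a fixed number of decimal digits, base 10, and the convention that two values sharing exactly k leading digits are at distance 10^-k, with identical digit strings at distance 0. The names `digit_string`, `baire_distance` and `is_ultrametric` are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN
from itertools import permutations


def digit_string(x, precision):
    """First `precision` decimal digits of x in [0, 1), truncated, not rounded."""
    step = Decimal(1).scaleb(-precision)  # e.g. 0.0001 for precision 4
    q = Decimal(str(x)).quantize(step, rounding=ROUND_DOWN)
    return str(q).split(".")[1]


def common_prefix_length(a, b):
    """Number of leading characters shared by strings a and b."""
    k = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        k += 1
    return k


def baire_distance(x, y, precision=4):
    """Assumed convention: 10^-k for k shared leading digits, 0 if all digits agree."""
    a, b = digit_string(x, precision), digit_string(y, precision)
    if a == b:
        return 0.0
    return 10.0 ** -common_prefix_length(a, b)


def is_ultrametric(values, precision=4):
    """Check the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z))."""
    return all(
        baire_distance(x, z, precision)
        <= max(baire_distance(x, y, precision), baire_distance(y, z, precision))
        for x, y, z in permutations(values, 3)
    )
```

Computing the distance for one pair only scans the digits once, so no pairwise distance matrix is needed to see which values share a prefix; this is the property the linear-time claim relies on.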
Fast redshift clustering with the Baire (ultra) metric
The Baire metric induces an ultrametric on a dataset and is of linear
computational complexity, contrasted with the standard quadratic time
agglomerative hierarchical clustering algorithm. We apply the Baire distance to
spectrometric and photometric redshifts from the Sloan Digital Sky Survey
using, in this work, about half a million astronomical objects. We want to know
how well the (more costly to determine) spectrometric redshifts can predict
the (more easily obtained) photometric redshifts, i.e. we seek to regress the
spectrometric on the photometric redshifts, and we develop a clusterwise
nearest neighbor regression procedure for this.
Comment: 14 pages, 6 figures
Methods of Hierarchical Clustering
We survey agglomerative hierarchical clustering algorithms and discuss
efficient implementations that are available in R and other software
environments. We look at hierarchical self-organizing maps, and mixture models.
We review grid-based clustering, focusing on hierarchical density-based
approaches. Finally we describe a recently developed very efficient (linear
time) hierarchical clustering algorithm, which can also be viewed as a
hierarchical grid-based algorithm.
Comment: 21 pages, 2 figures, 1 table, 69 references
Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement
We describe many vantage points on the Baire metric and its use in clustering
data, or its use in preprocessing and structuring data in order to support
search and retrieval operations. In some cases, we proceed directly to clusters
and do not directly determine the distances. We show how a hierarchical
clustering can be read directly from one pass through the data. We offer
insights also on practical implications of precision of data measurement. As a
mechanism for treating multidimensional data, including very high dimensional
data, we use random projections.
Comment: 17 pages, 45 citations, 2 figures
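The idea of reading a hierarchy directly from one pass through the data, without computing distances, can be sketched as prefix bucketing. This is an illustrative Python sketch under stated assumptions, not the paper's code: each item is assumed to be already encoded as a digit string of equal length, and the function name `prefix_hierarchy` is hypothetical.

```python
from collections import defaultdict


def prefix_hierarchy(digit_strings, depth):
    """Bucket items by their prefixes of length 1..depth in a single pass.

    Returns a list of dicts; entry k-1 maps each prefix of length k to the
    indices of the items that start with it. Clusters at level k+1 are
    nested inside clusters at level k, which gives a hierarchy.
    """
    levels = [defaultdict(list) for _ in range(depth)]
    for idx, s in enumerate(digit_strings):  # one pass over the data
        for k in range(1, depth + 1):        # fixed work per item
            levels[k - 1][s[:k]].append(idx)
    return levels


if __name__ == "__main__":
    data = ["1234", "1278", "1299", "5000"]
    for k, level in enumerate(prefix_hierarchy(data, 3), start=1):
        print(k, dict(level))
```

The work per item depends only on the chosen depth, so the total cost grows linearly with the number of items. No pairwise distance is ever evaluated; the clusters are the buckets themselves.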
Mumford dendrograms and discrete p-adic symmetries
In this article, we present an effective encoding of dendrograms by embedding
them into the Bruhat-Tits trees associated to $p$-adic number fields. As an
application, we show how strings over a finite alphabet can be encoded in
cyclotomic extensions of $\mathbb{Q}_p$ and discuss $p$-adic DNA encoding. The
application leads to fast $p$-adic agglomerative hierarchic algorithms similar
to the ones recently used e.g. by A. Khrennikov and others. From the viewpoint
of $p$-adic geometry, to encode a dendrogram $X$ in a $p$-adic field $K$ means
to fix a set $S$ of $K$-rational punctures on the $p$-adic projective line
$\mathbb{P}^1$. To $\mathbb{P}^1 \setminus S$ is associated in a natural way a
subtree inside the Bruhat-Tits tree which recovers $X$, a method first used by
F. Kato in 1999 in the classification of discrete subgroups of
$\mathrm{PGL}_2(K)$.
Next, we show how the $p$-adic moduli space $\mathfrak{M}_{0,n}$ of
$\mathbb{P}^1$ with $n$ punctures can be applied to the study of time series of
dendrograms and those symmetries arising from hyperbolic actions on
$\mathbb{P}^1$. In this way, we can associate to certain classes of dynamical
systems a Mumford curve, i.e. a $p$-adic algebraic curve with totally
degenerate reduction modulo $p$.
Finally, we indicate some of our results in the study of general discrete
actions on $\mathbb{P}^1$, and their relation to $p$-adic Hurwitz spaces.
Comment: 14 pages, 6 figures
Clustering through High Dimensional Data Scaling: Applications and Implementations
To analyse very high dimensional data, or large data volumes, we study random projection. Since hierarchically clustered data can be scaled in one dimension, seriation or unidimensional scaling is our primary objective. Having determined a unidimensional scaling of the multidimensional data cloud, this is followed by clustering. In many past case studies we carried out such clustering, using the Baire, or longest common prefix, metric and, simultaneously, ultrametric. In this paper, we examine properties of the seriation, and of the induction of the clustering on the data summarization, through seriation. Simulations are described as well as a small, illustrative example using Fisher’s iris data.
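The pipeline described above (random projection to one dimension, seriation, then prefix-based clustering) might look roughly like the following Python sketch. It rests on several assumptions that are not taken from the paper: a single Gaussian random axis, min-max rescaling into [0, 1), and clustering on the first decimal digit only. All function names are hypothetical.

```python
import random


def random_projection_1d(points, seed=0):
    """Project each point onto one random Gaussian axis (assumed choice)."""
    rng = random.Random(seed)
    axis = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    return [sum(a * v for a, v in zip(axis, p)) for p in points]


def rescale_unit(values):
    """Min-max rescale into [0, 1) so every value has leading decimal digits."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    # The factor slightly below 1 keeps the maximum strictly under 1.0.
    return [(v - lo) / span * (1.0 - 1e-9) for v in values]


def seriation(values):
    """Indices of the items ordered by their one-dimensional coordinate."""
    return sorted(range(len(values)), key=lambda i: values[i])


def first_digit_clusters(values):
    """Group item indices by the first decimal digit of their scaled value."""
    clusters = {}
    for i, v in enumerate(values):
        clusters.setdefault(int(v * 10), []).append(i)
    return clusters


if __name__ == "__main__":
    pts = [[0.0, 0.0], [0.1, 0.1], [10.0, 10.0], [10.1, 10.1]]
    scaled = rescale_unit(random_projection_1d(pts))
    print(seriation(scaled), first_digit_clusters(scaled))
```

A single random axis can place distant points close together, so in practice one would likely combine several projections; the sketch uses one only to keep the steps visible.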
Algorithms for Hierarchical Clustering: An Overview, II
We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. This review adds to the earlier version, Murtagh and Contreras (2012).