The Diver's Distance
The visual assessment of clustering tendency (VAT) method, developed by J. C. Bezdek, R. J. Hathaway and J. M. Huband, uses a reordering of the rows and columns of a dissimilarity matrix; it then displays the ordered dissimilarity matrix (ODM) as a 2D gray-level image called an ordered dissimilarity image (ODI). Although successful in determining potential clustering structure of various data sets, the technique offers room for improvement. In this thesis, we propose a new proximity measure, called the diver's distance, which is defined based on concepts in graph theory. We then theoretically study the diver's distance and its properties. From the theoretical results, we develop an algorithm (ddVAT) to efficiently compute an ODM of diver's distances; its corresponding ODI proves to be more informative than the ODI obtained from VAT. Moreover, ddVAT turns out to be very efficient with linear clusters and very useful in cases where it is difficult to choose satisfactory cluster point representatives.
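VAT's reordering step can be sketched as a Prim-style traversal of the dissimilarity matrix: start from an endpoint of the largest dissimilarity, then repeatedly append the unselected point nearest to the already-selected set. A minimal sketch, assuming Euclidean dissimilarities on toy data (the `vat_order` name and the example points are illustrative, not from the thesis):

```python
import numpy as np

def vat_order(D):
    """Reorder a symmetric dissimilarity matrix VAT-style: start from an
    endpoint of the largest dissimilarity, then repeatedly append the
    unselected point closest to the already-selected set (a Prim-style
    minimum-spanning-tree traversal). Returns the permutation and the
    ordered dissimilarity matrix (ODM)."""
    n = D.shape[0]
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [int(i)]
    remaining = [j for j in range(n) if j != i]
    while remaining:
        # Distances from every ordered point to every remaining point.
        sub = D[np.ix_(order, remaining)]
        j = remaining[int(np.argmin(sub.min(axis=0)))]
        order.append(j)
        remaining.remove(j)
    perm = np.array(order)
    return perm, D[np.ix_(perm, perm)]

# Two well-separated groups: the ODM shows two dark diagonal blocks.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(size=(5, 2)), rng.normal(size=(5, 2)) + 10])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
perm, odm = vat_order(D)
```

Displaying `odm` as a gray-level image then gives the ODI; cluster structure appears as dark blocks along the diagonal.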
Kernel Metric Learning for Clustering Mixed-type Data
Distance-based clustering and classification are widely used in various
fields to group mixed numeric and categorical data. A predefined distance
measurement is used to cluster data points based on their dissimilarity. While
there exist numerous distance-based measures for data with pure numerical
attributes and several ordered and unordered categorical metrics, an optimal
distance for mixed-type data is an open problem. Many metrics convert numerical
attributes to categorical ones or vice versa. They handle the data points as a
single attribute type or calculate a distance between each attribute separately
and add them up. We propose a metric that uses mixed kernels to measure
dissimilarity, with cross-validated optimal kernel bandwidths. Our approach
improves clustering accuracy when utilized for existing distance-based
clustering algorithms on simulated and real-world datasets containing pure
continuous, categorical, and mixed-type data.
Comment: 23 pages, 5 tables, 2 figures
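The idea of a kernel-based mixed dissimilarity can be illustrated with a toy construction: a Gaussian kernel on the numeric attributes, a simplified Aitchison-Aitken-style kernel on the categorical ones, and the kernel-induced distance k(x,x) + k(y,y) - 2k(x,y). This is a hedged sketch, not the paper's metric: the bandwidths `h` and `lam` are fixed here, whereas the paper selects optimal bandwidths by cross-validation, and all function names are hypothetical.

```python
import numpy as np

def mixed_kernel(x_num, y_num, x_cat, y_cat, h=1.0, lam=0.3):
    """Toy mixed kernel: Gaussian on the numeric attributes times a
    simplified Aitchison-Aitken-style kernel on the categorical ones.
    Bandwidths h and lam are fixed for illustration; the paper's
    approach would cross-validate them."""
    k_num = np.exp(-np.sum((x_num - y_num) ** 2) / (2.0 * h ** 2))
    k_cat = np.prod(np.where(x_cat == y_cat, 1.0 - lam, lam))
    return k_num * k_cat

def kernel_dissimilarity(xn, yn, xc, yc, **kw):
    # Squared kernel-induced distance: k(x, x) + k(y, y) - 2 k(x, y).
    return (mixed_kernel(xn, xn, xc, xc, **kw)
            + mixed_kernel(yn, yn, yc, yc, **kw)
            - 2.0 * mixed_kernel(xn, yn, xc, yc, **kw))

# Identical records get zero dissimilarity; differing ones a positive value.
a_n, a_c = np.array([0.0, 1.0]), np.array(["red", "small"])
b_n, b_c = np.array([0.5, 1.0]), np.array(["blue", "small"])
d_same = kernel_dissimilarity(a_n, a_n, a_c, a_c)
d_diff = kernel_dissimilarity(a_n, b_n, a_c, b_c)
```

A matrix of such dissimilarities can then be fed to any distance-based clustering algorithm, which is the usage pattern the abstract describes.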
On morphological hierarchical representations for image processing and spatial data clustering
Hierarchical data representations in the context of classification and data
clustering were put forward during the fifties. Recently, hierarchical image
representations have gained renewed interest for segmentation purposes. In this
paper, we briefly survey fundamental results on hierarchical clustering and
then detail recent paradigms developed for the hierarchical representation of
images in the framework of mathematical morphology: constrained connectivity
and ultrametric watersheds. Constrained connectivity can be viewed as a way to
constrain an initial hierarchy in such a way that a set of desired constraints
are satisfied. The framework of ultrametric watersheds provides a generic scheme
for computing any hierarchical connected clustering, in particular when such a
hierarchy is constrained. The suitability of this framework for solving
practical problems is illustrated with applications in remote sensing.
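The correspondence between hierarchical clusterings and ultrametrics that underlies this framework can be illustrated with its simplest instance, single-linkage clustering: the heights at which components merge define an ultrametric. A generic sketch (a Kruskal-style union-find, not the paper's watershed construction):

```python
import numpy as np
from itertools import combinations

def single_linkage_ultrametric(D):
    """U[i, j] = smallest threshold t at which i and j lie in the same
    connected component of the graph with edges {d <= t}: the
    single-linkage ultrametric, the simplest instance of the
    hierarchy/ultrametric correspondence. Kruskal-style union-find."""
    n = D.shape[0]
    parent = list(range(n))
    members = {i: {i} for i in range(n)}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    U = np.zeros((n, n))
    for d, i, j in sorted((D[i, j], i, j) for i, j in combinations(range(n), 2)):
        ri, rj = find(i), find(j)
        if ri != rj:
            for a in members[ri]:
                for b in members[rj]:
                    U[a, b] = U[b, a] = d  # merge height of the two components
            parent[rj] = ri
            members[ri] |= members.pop(rj)
    return U

# Points 0, 1, 10 on a line: 0 and 1 merge at height 1, then 10 joins at 9.
pts = np.array([0.0, 1.0, 10.0])
U = single_linkage_ultrametric(np.abs(pts[:, None] - pts[None, :]))
```

Cutting the resulting ultrametric at any threshold yields one level of the nested hierarchy, which is the sense in which such a clustering is "connected".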
Anytime Hierarchical Clustering
We propose a new anytime hierarchical clustering method that iteratively
transforms an arbitrary initial hierarchy on the configuration of measurements
along a sequence of trees that, we prove, must terminate for a fixed data set
in a chain of nested partitions satisfying a natural homogeneity requirement.
Each recursive step re-edits the tree so as to improve a local measure of
cluster homogeneity that is compatible with a number of commonly used (e.g.,
single, average, complete) linkage functions. As an alternative to the standard
batch algorithms, we present numerical evidence to suggest that appropriate
adaptations of this method can yield decentralized, scalable algorithms
suitable for distributed/parallel computation of clustering hierarchies and
online tracking of clustering trees applicable to large, dynamically changing
databases and anomaly detection.
Comment: 13 pages, 6 figures, 5 tables, in preparation for submission to a
conference
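The flavor of a local homogeneity-improving re-edit can be illustrated on two sibling clusters: transfer a single point across if that lowers an average-linkage-style homogeneity score. This is a hypothetical illustrative move under assumed definitions, not the paper's exact re-editing operator, and `local_reedit` is an invented name.

```python
import numpy as np

def homogeneity(D, cluster):
    """Mean pairwise dissimilarity inside a cluster (an average-linkage-
    style homogeneity score; lower is more homogeneous, singletons = 0)."""
    idx = list(cluster)
    if len(idx) < 2:
        return 0.0
    sub = D[np.ix_(idx, idx)]
    return sub.sum() / (len(idx) * (len(idx) - 1))

def local_reedit(D, A, B):
    """One illustrative local move (hypothetical, not the paper's exact
    operator): transfer a single point between sibling clusters A and B
    if that lowers their combined homogeneity score."""
    best_score, best_A, best_B = homogeneity(D, A) + homogeneity(D, B), A, B
    for src, dst in ((A, B), (B, A)):
        if len(src) == 1:
            continue  # never empty a cluster
        for p in src:
            s, t = src - {p}, dst | {p}
            score = homogeneity(D, s) + homogeneity(D, t)
            if score < best_score:
                best_score = score
                best_A, best_B = (s, t) if src is A else (t, s)
    return best_A, best_B

# Point 2 (at 10.0) sits with the wrong sibling; one move fixes it.
pts = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(pts[:, None] - pts[None, :])
new_A, new_B = local_reedit(D, {0, 1, 2}, {3})
```

Because each accepted move strictly lowers a bounded score, repeating such moves terminates, which mirrors the convergence argument sketched in the abstract.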
Fast Algorithm and Implementation of Dissimilarity Self-Organizing Maps
In many real world applications, data cannot be accurately represented by
vectors. In those situations, one possible solution is to rely on dissimilarity
measures that enable sensible comparison between observations. Kohonen's
Self-Organizing Map (SOM) has been adapted to data described only through their
dissimilarity matrix. This algorithm provides both nonlinear projection and
clustering of non-vector data. Unfortunately, the algorithm suffers from a high
cost that makes it quite difficult to use with voluminous data sets. In this
paper, we propose a new algorithm that provides an important reduction of the
theoretical cost of the dissimilarity SOM without changing its outcome (the
results are exactly the same as the ones obtained with the original algorithm).
Moreover, we introduce implementation methods that result in very short running
times. Improvements deduced from the theoretical cost model are validated on
simulated and real world data (a word list clustering problem). We also
demonstrate that the proposed implementation methods reduce the running time of
the fast algorithm by a factor of up to 3 over a standard implementation.
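The core of a dissimilarity (or "median") SOM batch step can be sketched as follows: prototypes are constrained to be data points, and each unit's prototype is chosen to minimize the neighborhood-weighted sum of dissimilarities. A minimal sketch of that quantity, whose naive recomputation is what makes the standard algorithm costly; the function name and example are illustrative, and the paper's actual contribution is a faster way to evaluate these sums.

```python
import numpy as np

def median_som_step(D, assign, n_units, H):
    """One batch update of a dissimilarity ("median") SOM sketch: each
    unit's prototype is the data index minimizing the neighborhood-
    weighted sum of dissimilarities. H[u, v] is the neighborhood kernel
    between units; assign[j] is datum j's best-matching unit."""
    protos = np.empty(n_units, dtype=int)
    for u in range(n_units):
        w = H[u, assign]        # weight of each datum for unit u
        cost = D @ w            # cost[k] = sum_j w[j] * D[k, j]
        protos[u] = int(np.argmin(cost))
    return protos

# Four points on a line, two units: each unit picks a prototype in its group.
pts = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(pts[:, None] - pts[None, :])
H = np.array([[1.0, 0.1], [0.1, 1.0]])  # assumed neighborhood kernel
protos = median_som_step(D, np.array([0, 0, 1, 1]), 2, H)
```

Alternating this prototype update with re-assignment of data to their best units gives the batch dissimilarity SOM whose cost the paper reduces.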