Persistence-Based Clustering in Riemannian Manifolds
We present a novel clustering algorithm that combines a mode-seeking phase with a cluster merging phase. While mode detection is performed by a standard graph-based hill-climbing scheme, the novelty of our approach resides in its use of topological persistence theory to guide the merges between clusters. An interesting feature of our algorithm is that it provides additional feedback in the form of a finite set of points in the plane, called a persistence diagram, which provably reflects the prominence of each of the modes of the density. Such feedback is an invaluable tool in practice, as it enables the user to determine a set of parameter values that will make the algorithm compute a relevant clustering on the next run. In terms of generality, our approach requires only the knowledge of (approximate) pairwise distances between the data points, together with rough estimates of the density at these points. It is therefore applicable in virtually any metric space. At the same time, its complexity remains reasonable: although the size of the input distance matrix may be up to quadratic in the number of data points, a careful implementation uses only a linear amount of main memory and barely takes more time to run than that spent reading the input. Taking advantage of recent advances in topological persistence theory, we are able to give a theoretically sound notion of what the correct number of clusters is, and to prove that, under mild sampling conditions and a relevant choice of parameters (made possible in practice by the persistence diagram), our clustering scheme computes a set of clusters whose spatial locations are tied to those of the basins of attraction of the peaks of the density. These guarantees hold in a large variety of contexts, including when data points are distributed along some unknown Riemannian manifold.
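The two-phase scheme described in this abstract can be sketched in a few lines. This is a loose illustration, not the authors' implementation: the function name, the brute-force k-nearest-neighbour graph, and the single prominence threshold `tau` are assumptions made for the example, and the recorded (birth, death) pairs only approximate a true persistence diagram.

```python
import numpy as np

def persistence_clustering(points, density, k=5, tau=0.1):
    """Sketch: hill-climb on a kNN graph toward higher density, then merge
    clusters whose prominence (mode density minus saddle density) is < tau."""
    n = len(points)
    # brute-force k-nearest-neighbour graph (for clarity, not efficiency)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]

    order = np.argsort(-density)          # process points by decreasing density
    root = np.full(n, -1)                 # union-find parent array
    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]       # path halving
            i = root[i]
        return i

    diagram = []                          # (birth, death) pairs of merged modes
    for i in order:
        higher = [j for j in nbrs[i] if density[j] > density[i]]
        if not higher:
            root[i] = i                   # local mode: start a new cluster
            continue
        g = max(higher, key=lambda j: density[j])
        root[i] = find(g)                 # hill-climbing step
        for j in higher:                  # clusters meeting at i may merge here
            ri, rj = find(i), find(j)
            if ri != rj:
                lower, upper = (ri, rj) if density[ri] < density[rj] else (rj, ri)
                if density[lower] - density[i] < tau:
                    root[lower] = upper   # prominence below tau: merge
                    diagram.append((density[lower], density[i]))
    labels = np.array([find(i) for i in range(n)])
    return labels, diagram
```

Rerunning with a `tau` chosen from the gap in the returned diagram is exactly the feedback loop the abstract describes.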
A Topological Approach to Spectral Clustering
We propose two related unsupervised clustering algorithms which take as input
data assumed to be sampled from a uniform distribution supported on a metric
space, and output a clustering of the data based on the selection of a
topological model for the connected components of that space. Both algorithms
work by selecting a graph on the samples from a natural one-parameter family of
graphs, using a geometric criterion in the first case and an
information-theoretic criterion in the second. The estimated connected
components are identified with the kernel of the associated graph Laplacian,
which allows the algorithm to work without requiring the number of expected
clusters or other auxiliary data as input.
Optimal rates of convergence for persistence diagrams in Topological Data Analysis
Computational topology has recently seen important developments toward data
analysis, giving birth to the field of topological data analysis. Topological
persistence, or persistent homology, is a fundamental tool in this field. In
this paper, we study topological persistence in general metric spaces from a
statistical perspective. We show that the use of persistent homology fits
naturally into general statistical frameworks and that persistence diagrams
can be used as statistics with interesting convergence properties. Numerical
experiments are performed in various contexts to illustrate our results.
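For a concrete sense of a persistence diagram as a statistic of a metric sample, here is a standard construction (not this paper's method): the degree-0 diagram of the Vietoris-Rips filtration, computed from pairwise distances alone with a Kruskal-style union-find.

```python
import numpy as np

def h0_diagram(points):
    """Degree-0 persistence diagram of the Rips filtration: each merge of
    two components at scale w contributes a (0, w) point to the diagram."""
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    edges = sorted((d[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    diagram = []
    for w, i, j in edges:                 # sweep scales in increasing order
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            diagram.append((0.0, w))      # a component born at 0 dies at w
    return diagram                        # n-1 finite points (one class persists)
```

The diagram is a function of the sample, hence a statistic; the paper's convergence rates concern how such diagrams concentrate around the diagram of the underlying space as the sample grows.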
Recent advances in directional statistics
Mainstream statistical methodology is generally applicable to data observed
in Euclidean space. There are, however, numerous contexts of considerable
scientific interest in which the natural supports for the data under
consideration are Riemannian manifolds like the unit circle, torus, sphere and
their extensions. Typically, such data can be represented using one or more
directions, and directional statistics is the branch of statistics that deals
with their analysis. In this paper we provide a review of the many recent
developments in the field since the publication of Mardia and Jupp (1999),
still the most comprehensive text on directional statistics. Many of those
developments have been stimulated by interesting applications in fields as
diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics,
image analysis, text mining, environmetrics, and machine learning. We begin by
considering developments for the exploratory analysis of directional data
before progressing to distributional models, general approaches to inference,
hypothesis testing, regression, nonparametric curve estimation, methods for
dimension reduction, classification and clustering, and the modelling of time
series, spatial and spatio-temporal data. An overview of currently available
software for analysing directional data is also provided, and potential future
developments are discussed.
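The review itself is narrative, but the basic obstruction it addresses is easy to show: on the circle, the arithmetic mean of angles is wrong (the mean of 359° and 1° should be 0°, not 180°). A sketch of the classical mean direction and mean resultant length, two standard directional statistics (function names are ours):

```python
import numpy as np

def circular_mean(theta):
    """Mean direction of angles in radians: average the unit vectors
    (cos t, sin t) and take the angle of the resultant."""
    return np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())

def resultant_length(theta):
    """Mean resultant length in [0, 1]: near 1 for concentrated samples,
    near 0 for dispersed (e.g. antipodal) samples."""
    return np.hypot(np.sin(theta).mean(), np.cos(theta).mean())
```

Distributional models, tests, and regressions on the circle, torus and sphere build on these resultant-based quantities rather than on Euclidean moments.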
Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering
Hierarchical clustering is a popular method for analyzing data which
associates a tree to a dataset. Hartigan consistency has been used extensively
as a framework to analyze such clustering algorithms from a statistical point
of view. Still, as we show in this paper, a tree which is Hartigan consistent
with a given density can look very different from the correct limit tree.
Specifically, Hartigan consistency permits two types of undesirable
configurations which we term over-segmentation and improper nesting. Moreover,
Hartigan consistency is a limit property and does not directly quantify the
difference between trees.
In this paper we identify two limit properties, separation and minimality,
which address both over-segmentation and improper nesting and together imply
(but are not implied by) Hartigan consistency. We proceed to introduce a merge
distortion metric between hierarchical clusterings and show that convergence in
our distance implies both separation and minimality. We also prove that uniform
separation and minimality imply convergence in the merge distortion metric.
Furthermore, we show that our merge distortion metric is stable under
perturbations of the density.
Finally, we demonstrate applicability of these concepts by proving
convergence results for two clustering algorithms. First, we show convergence
(and hence separation and minimality) of the recent robust single linkage
algorithm of Chaudhuri and Dasgupta (2010). Second, we provide convergence
results on manifolds for topological split tree clustering.
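The merge distortion metric compares trees through merge heights: the merge height of two points is the highest density level at which they lie in the same cluster, and the metric is the sup-norm difference of these heights. A sketch under a simplifying assumption of ours (the hierarchical clustering is given sampled at a descending grid of positive levels, rather than as a continuous cluster tree):

```python
import numpy as np

def merge_heights(labels_at_levels, levels):
    """Merge-height matrix from cluster labels sampled at density levels
    given in descending order: m[i, j] is the highest level at which
    points i and j share a cluster (0 if they never do on the grid)."""
    n = labels_at_levels.shape[1]
    m = np.zeros((n, n))
    for lab, lvl in zip(labels_at_levels, levels):
        same = lab[:, None] == lab[None, :]
        m = np.where((m == 0) & same, lvl, m)   # record first (highest) level
    return m

def merge_distortion(m1, m2):
    """Sup-norm distance between two merge-height matrices."""
    return np.abs(m1 - m2).max()
```

Convergence of an algorithm's merge heights to those of the density's cluster tree in this metric then yields both separation and minimality, which is the paper's route to its guarantees.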
Quantization and clustering on Riemannian manifolds with an application to air traffic analysis
The goal of quantization is to find the best approximation of a probability distribution by a discrete measure with finite support. When dealing with empirical distributions, this boils down to finding the best summary of the data by a smaller number of points, and automatically yields a k-means-type clustering. In this paper, we introduce Competitive Learning Riemannian Quantization (CLRQ), an online quantization algorithm that applies when the data do not belong to a vector space, but rather to a Riemannian manifold. It can be seen as a density approximation procedure as well as a clustering method. Compared to many clustering algorithms, it requires few distance computations, which is particularly computationally advantageous in the manifold setting. We prove its convergence and show simulated examples on the sphere and the hyperbolic plane. We also provide an application to real data by using CLRQ to create summaries of images of covariance matrices estimated from air traffic images. These summaries are representative of the air traffic complexity and yield clusterings of the airspaces into zones that are homogeneous with respect to that criterion. They can then be compared using discrete optimal transport and be further used as inputs of a machine learning algorithm or as indexes in a traffic database.
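A competitive-learning quantization step on a manifold moves the winning centroid along the geodesic toward each incoming sample. The sketch below illustrates this on the unit sphere using its exponential and logarithm maps; it is a loose illustration with a fixed step size and random initialization, not the CLRQ algorithm of the paper (which includes a decreasing step-size schedule and convergence guarantees):

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere: walk from p along tangent v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p
    return np.cos(nv) * p + np.sin(nv) * v / nv

def sphere_log(p, q):
    """Log map: tangent vector at p toward q, of geodesic length."""
    w = q - np.dot(p, q) * p              # project q onto tangent space at p
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros_like(p)
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)) * w / nw

def clrq_sphere(samples, k, lr=0.1, seed=0):
    """Online competitive learning on the sphere: for each sample, the
    nearest centroid takes a step lr along the geodesic toward it."""
    rng = np.random.default_rng(seed)
    centers = samples[rng.choice(len(samples), k, replace=False)].copy()
    for x in samples:
        i = np.argmax(centers @ x)        # nearest centroid (max cosine)
        centers[i] = sphere_exp(centers[i], lr * sphere_log(centers[i], x))
    return centers
```

Each update touches a single centroid and needs one nearest-centroid search, which is the source of the low distance-computation count the abstract emphasizes.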