Clustering Stability: An Overview
A popular method for selecting the number of clusters is based on stability
arguments: one chooses the number of clusters such that the corresponding
clustering results are "most stable". In recent years, a series of papers has
analyzed the behavior of this method from a theoretical point of view. However,
the results are very technical and difficult to interpret for non-experts. In
this paper we give a high-level overview of the existing literature on
clustering stability. In addition to presenting the results in a slightly
informal but accessible way, we relate them to each other and discuss their
different implications.
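
As a rough illustration of the idea (not the specific protocol analyzed in
the surveyed papers), stability-based selection of the number of clusters can
be sketched as follows: repeatedly split the data, cluster both halves with
k-means, transfer the centroids of one half to the other, and pick the k with
the highest average agreement. The splitting scheme, the libraries and all
names below are assumptions made for illustration only.

# Illustrative sketch of stability-based selection of the number of clusters.
# Scheme (one of many possible): split the data in half, cluster both halves
# with k-means, label the second half with the centroids fitted on the first
# half, and measure agreement with the second half's own labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_score(X, k, n_repeats=20, random_state=0):
    rng = np.random.default_rng(random_state)
    n, scores = len(X), []
    for _ in range(n_repeats):
        idx = rng.permutation(n)
        half1, half2 = X[idx[: n // 2]], X[idx[n // 2 :]]
        km1 = KMeans(n_clusters=k, n_init=10).fit(half1)
        km2 = KMeans(n_clusters=k, n_init=10).fit(half2)
        # Agreement between half2's own clustering and the clustering
        # induced on half2 by the centroids learned on half1.
        scores.append(adjusted_rand_score(km2.labels_, km1.predict(half2)))
    return float(np.mean(scores))

# Choose the number of clusters whose results are "most stable".
X = np.random.default_rng(1).normal(size=(300, 2))
best_k = max(range(2, 8), key=lambda k: stability_score(X, k))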
Kernel functions based on triplet comparisons
Given only information in the form of similarity triplets "Object A is more
similar to object B than to object C" about a data set, we propose two ways of
defining a kernel function on the data set. While previous approaches construct
a low-dimensional Euclidean embedding of the data set that reflects the given
similarity triplets, we aim at defining kernel functions that correspond to
high-dimensional embeddings. These kernel functions can subsequently be used to
apply any kernel method to the data set.
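
A minimal sketch of one possible construction (an assumption for
illustration, not necessarily the definitions proposed in the paper):
represent each object A by a vector of its answers over a fixed collection of
pairs (B, C) and take scaled inner products of these vectors, which gives a
positive semi-definite kernel by construction.

# Illustrative kernel from triplet answers of the form
# "object a is more similar to object b than to object c".
import numpy as np

def triplet_kernel(answers, n_objects, pairs):
    # answers: dict mapping (a, b, c) -> +1 if a is closer to b than to c,
    #          -1 otherwise; unanswered triplets contribute 0.
    Phi = np.zeros((n_objects, len(pairs)))
    for j, (b, c) in enumerate(pairs):
        for a in range(n_objects):
            Phi[a, j] = answers.get((a, b, c), 0)
    # Gram matrix of the feature vectors, hence positive semi-definite.
    return Phi @ Phi.T / len(pairs)

# Toy example: four objects on a line at positions 0, 1, 2 and 5.
pos = np.array([0.0, 1.0, 2.0, 5.0])
pairs = [(b, c) for b in range(4) for c in range(4) if b != c]
answers = {(a, b, c): 1.0 if abs(pos[a] - pos[b]) < abs(pos[a] - pos[c]) else -1.0
           for a in range(4) for (b, c) in pairs}
K = triplet_kernel(answers, n_objects=4, pairs=pairs)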
Shortest path distance in random k-nearest neighbor graphs
Consider a weighted or unweighted k-nearest neighbor graph that has been
built on n data points drawn randomly according to some density p on R^d. We
study the convergence of the shortest path distance in such graphs as the
sample size tends to infinity. We prove that for unweighted kNN graphs, this
distance converges to an unpleasant distance function on the underlying space
whose properties are detrimental to machine learning. We also study the
behavior of the shortest path distance in weighted kNN graphs.
Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012).
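
For concreteness, the two graph variants compared in the paper can be built
and queried as follows; the sampling density, the choice of k and the library
calls are illustrative assumptions, not part of the paper.

# Shortest path distances in unweighted vs. weighted kNN graphs (sketch).
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))   # n sample points drawn from a density on R^d
k = 10

# Unweighted kNN graph: every edge has length 1, so the shortest path
# distance simply counts hops between vertices.
A_unweighted = kneighbors_graph(X, k, mode="connectivity")

# Weighted kNN graph: each edge carries the Euclidean distance between
# its endpoints.
A_weighted = kneighbors_graph(X, k, mode="distance")

D_unweighted = shortest_path(A_unweighted, directed=False, unweighted=True)
D_weighted = shortest_path(A_weighted, directed=False)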
A Tutorial on Spectral Clustering
In recent years, spectral clustering has become one of the most popular
modern clustering algorithms. It is simple to implement, can be solved
efficiently by standard linear algebra software, and very often outperforms
traditional clustering algorithms such as the k-means algorithm. At first
glance spectral clustering appears slightly mysterious, and it is not obvious
why it works at all and what it really does. The goal of this tutorial
is to give some intuition on those questions. We describe different graph
Laplacians and their basic properties, present the most common spectral
clustering algorithms, and derive those algorithms from scratch by several
different approaches. Advantages and disadvantages of the different spectral
clustering algorithms are discussed.
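
As an indication of what such an algorithm looks like in practice, here is a
minimal sketch of unnormalized spectral clustering on a fully connected
Gaussian similarity graph; the graph construction and all parameter values
are illustrative choices, not recommendations from the tutorial.

# Minimal sketch of unnormalized spectral clustering.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters, sigma=1.0):
    # Fully connected similarity graph with Gaussian edge weights.
    W = np.exp(-squareform(pdist(X, "sqeuclidean")) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # unnormalized graph Laplacian
    # Eigenvectors belonging to the n_clusters smallest eigenvalues of L.
    _, U = eigh(L, subset_by_index=[0, n_clusters - 1])
    # Cluster the rows of the spectral embedding U with k-means.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)

labels = spectral_clustering(np.random.default_rng(0).normal(size=(200, 2)), 2)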
Graph Laplacians and their convergence on random neighborhood graphs
Given a sample from a probability measure with support on a submanifold in
Euclidean space one can construct a neighborhood graph which can be seen as an
approximation of the submanifold. The graph Laplacian of such a graph is used
in several machine learning methods like semi-supervised learning,
dimensionality reduction and clustering. In this paper we determine the
pointwise limit of three different graph Laplacians used in the literature as
the sample size increases and the neighborhood size approaches zero. We show
that for a uniform measure on the submanifold all graph Laplacians have the
same limit up to constants. However, in the case of a non-uniform measure on
the submanifold only the so-called random walk graph Laplacian converges to
the weighted Laplace-Beltrami operator.
Comment: Improved presentation, typos corrected, to appear in JMLR.
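
The three graph Laplacians in question are commonly the unnormalized, the
random walk and the symmetric normalized Laplacian; a minimal sketch with
Gaussian neighborhood weights (an illustrative assumption, not the paper's
setup) is given below.

# The three standard graph Laplacians built from a weight matrix W.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def graph_laplacians(X, h=0.5):
    W = np.exp(-squareform(pdist(X, "sqeuclidean")) / (2 * h ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)                          # vertex degrees
    Id = np.eye(len(X))
    L_un = np.diag(d) - W                      # unnormalized Laplacian
    L_rw = Id - np.diag(1.0 / d) @ W           # random walk Laplacian
    L_sym = Id - np.diag(1.0 / np.sqrt(d)) @ W @ np.diag(1.0 / np.sqrt(d))
    return L_un, L_rw, L_sym                   # L_sym: symmetric normalized

L_un, L_rw, L_sym = graph_laplacians(np.random.default_rng(0).normal(size=(100, 3)))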
Consistent procedures for cluster tree estimation and pruning
For a density f on R^d, a high-density cluster is any connected component of
{x : f(x) >= lambda}, for some lambda > 0. The set of all high-density
clusters forms a hierarchy called the cluster tree of f. We present two
procedures for estimating the cluster tree given samples from f. The first is
a robust variant of the single linkage algorithm for hierarchical clustering.
The second is based on the k-nearest neighbor
graph of the samples. We give finite-sample convergence rates for these
algorithms which also imply consistency, and we derive lower bounds on the
sample complexity of cluster tree estimation. Finally, we study a tree pruning
procedure that guarantees, under milder conditions than usual, to remove
clusters that are spurious while recovering those that are salient.
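
To make the estimated objects concrete, here is an illustrative sketch (not
the procedures proposed in the paper): recover the high-density clusters at a
single level lambda by thresholding a kernel density estimate and taking
connected components of a k-nearest-neighbor graph on the retained points;
sweeping lambda from high to low traces out an estimated cluster tree.

# Estimated high-density clusters at one level lambda (sketch).
import numpy as np
from sklearn.neighbors import KernelDensity, kneighbors_graph
from scipy.sparse.csgraph import connected_components

def high_density_clusters(X, lam, k=10, bandwidth=0.3):
    kde = KernelDensity(bandwidth=bandwidth).fit(X)
    density = np.exp(kde.score_samples(X))
    keep = density >= lam            # sample points in {x : f_hat(x) >= lambda}
    G = kneighbors_graph(X[keep], min(k, int(keep.sum()) - 1),
                         mode="connectivity")
    n_comp, labels = connected_components(G, directed=False)
    return keep, labels              # labels index the estimated clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(100, 2)),
               rng.normal(2.0, 0.2, size=(100, 2))])
keep, labels = high_density_clusters(X, lam=0.05)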