
Fast k-means based on KNN Graph

Authors: Deng Cheng-Hao, Zhao Wan-Lei
Publication date: 04/05/2017

In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost can be prohibitively high when the data size and the number of clusters are large. It is well known that the processing bottleneck of k-means lies in finding the closest centroid for each sample in every iteration. In this paper, a novel solution to the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest neighbors graph. In each k-means iteration, a data sample is compared only to the clusters in which its nearest neighbors reside. Since the number of nearest neighbors considered is much smaller than k, the processing cost of this step becomes minor and independent of k. The processing bottleneck is therefore overcome. Most interestingly, the k-nearest neighbor graph is itself constructed by iteratively calling the fast k-means. Compared with existing fast k-means variants, the proposed algorithm achieves speed-ups of hundreds to thousands of times while maintaining high clustering quality. When tested on 10 million 512-dimensional data points, it takes only 5.2 hours to produce 1 million clusters. In contrast, traditional k-means would take 3 years to perform clustering at the same scale.
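The candidate-restriction idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`knn_graph`, `centroids_of`, `graph_kmeans`) are hypothetical, the neighbor graph is built by exact brute force rather than by the approximate, k-means-driven construction the abstract mentions, and the update is a plain Lloyd-style iteration. The only point it shows is that each sample is compared to the centroids of the clusters holding its graph neighbors, not to all k centroids.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_graph(points, n_neighbors):
    """Exact brute-force neighbor lists (assumption: stands in for the
    approximate k-NN graph described in the abstract)."""
    graph = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: sq_dist(p, points[j]))
        graph.append(others[:n_neighbors])
    return graph

def centroids_of(points, labels, k):
    """Mean of each cluster; None for a cluster with no members."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    return [[s / counts[c] for s in sums[c]] if counts[c] else None
            for c in range(k)]

def graph_kmeans(points, k, graph, n_iter=20, seed=0):
    """Lloyd-style k-means whose assignment step only looks at the
    clusters of a sample's graph neighbors (plus its current cluster)."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]
    for _ in range(n_iter):
        cents = centroids_of(points, labels, k)
        new_labels = []
        for i, p in enumerate(points):
            # Candidate clusters always have members, so cents[c] is not None.
            candidates = {labels[j] for j in graph[i]} | {labels[i]}
            new_labels.append(
                min(candidates, key=lambda c: sq_dist(p, cents[c])))
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids_of(points, labels, k)
```

With a neighbor list of fixed length, the assignment step in this sketch costs a number of distance computations per sample that is bounded by the list length plus one, whatever the value of k.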

arXiv.org e-Print Archive

Crossref

Consistent procedures for cluster tree estimation and pruning

Authors: Chaudhuri Kamalika, Dasgupta Sanjoy, Kpotufe Samory, von Luxburg Ulrike
Publication date: 05/06/2014

For a density $f$ on ${\mathbb R}^d$, a high-density cluster is any connected component of $\{x : f(x) \geq \lambda\}$, for some $\lambda > 0$. The set of all high-density clusters forms a hierarchy called the cluster tree of $f$. We present two procedures for estimating the cluster tree given samples from $f$. The first is a robust variant of the single linkage algorithm for hierarchical clustering. The second is based on the $k$-nearest neighbor graph of the samples. We give finite-sample convergence rates for these algorithms which also imply consistency, and we derive lower bounds on the sample complexity of cluster tree estimation. Finally, we study a tree pruning procedure that guarantees, under milder conditions than usual, to remove clusters that are spurious while recovering those that are salient.
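One level of such a cluster tree can be illustrated with a small Python sketch. It is a simplified illustration and not the procedure analyzed in the paper: the function names (`knn_lists`, `level_set_clusters`) and the edge rule are assumptions. It relies on the standard k-nearest-neighbor density heuristic that a small distance to the k-th neighbor indicates high density, so thresholding that distance at `radius` plays the role of the condition $f(x) \geq \lambda$. Retained points are then linked along k-NN edges and grouped into connected components.

```python
import math

def knn_lists(points, k):
    """For each point, its k nearest neighbors as (distance, index) pairs."""
    out = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        out.append(dists[:k])
    return out

def level_set_clusters(points, k, radius):
    """Connected components of the k-NN graph restricted to points whose
    k-th neighbor distance is at most `radius` (a proxy for high density)."""
    knn = knn_lists(points, k)
    kept = {i for i, nbrs in enumerate(knn) if nbrs[-1][0] <= radius}

    # Union-find over the retained points.
    parent = {i: i for i in kept}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in kept:
        for _, j in knn[i]:
            if j in kept:
                parent[find(i)] = find(j)

    groups = {}
    for i in kept:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())
```

Calling `level_set_clusters` with increasing values of `radius` retains more points and can only merge components, which mirrors how lower density levels sit higher up in the hierarchy.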

arXiv.org e-Print Archive

Princeton University Open Access Repository