Search CORE

804 research outputs found

Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

Author: Cohen Michael B.
Elder Sam
Musco Cameron
Musco Christopher
Persu Madalina
Publication venue
Publication date: 02/04/2015
Field of study

We show how to approximate a data matrix

\mathbf{A}

with a much smaller sketch

\mathbf{\tilde A}

that can be used to solve a general class of constrained k-rank approximation problems to within

(1+\epsilon)

error. Importantly, this class of problems includes

k

-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just

O(k)

dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For

k

-means dimensionality reduction, we provide

(1+\epsilon)

relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover' a good subspace for \bv{A}, but can be used directly to compute this subspace. Finally, for

k

-means clustering, we show how to achieve a

(9+\epsilon)

approximation by Johnson-Lindenstrauss projecting data points to just

O(\log k/\epsilon^2)

dimensions. This gives the first result that leverages the specific structure of

k

-means to achieve dimension independent of input size and sublinear in

k

arXiv.org e-Print Archive

CiteSeerX

Scalable and Robust Community Detection with Randomized Sketching

Author: Atia George
Beckus Andre
Karimian Adel
Rahmani Mostafa
Publication venue
Publication date: 24/05/2019
Field of study

This paper explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the full graph are inferred based on the clusters extracted from the sketch using a correlation-based retrieval step. Uniform random node sampling is shown to improve the computational complexity over clustering of the full graph when the cluster sizes are balanced. A new random degree-based node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm even when clusters are unbalanced. This algorithm improves the phase transitions for matrix-decomposition-based clustering with regard to computational complexity and minimum cluster size, which are shown to be nearly dimension-free in the low inter-cluster connectivity regime. A third sampling technique is shown to improve balance by randomly sampling nodes based on spatial distribution. We provide analysis and numerical results using a convex clustering algorithm based on matrix completion

arXiv.org e-Print Archive