804 research outputs found
Dimensionality Reduction for k-Means Clustering and Low Rank Approximation
We show how to approximate a data matrix with a much smaller
sketch that can be used to solve a general class of
constrained k-rank approximation problems to within error.
Importantly, this class of problems includes -means clustering and
unconstrained low rank approximation (i.e. principal component analysis). By
reducing data points to just dimensions, our methods generically
accelerate any exact, approximate, or heuristic algorithm for these ubiquitous
problems.
For -means dimensionality reduction, we provide relative
error results for many common sketching techniques, including random row
projection, column selection, and approximate SVD. For approximate principal
component analysis, we give a simple alternative to known algorithms that has
applications in the streaming setting. Additionally, we extend recent work on
column-based matrix reconstruction, giving column subsets that not only `cover'
a good subspace for \bv{A}, but can be used directly to compute this
subspace.
Finally, for -means clustering, we show how to achieve a
approximation by Johnson-Lindenstrauss projecting data points to just dimensions. This gives the first result that leverages the
specific structure of -means to achieve dimension independent of input size
and sublinear in
Scalable and Robust Community Detection with Randomized Sketching
This paper explores and analyzes the unsupervised clustering of large
partially observed graphs. We propose a scalable and provable randomized
framework for clustering graphs generated from the stochastic block model. The
clustering is first applied to a sub-matrix of the graph's adjacency matrix
associated with a reduced graph sketch constructed using random sampling. Then,
the clusters of the full graph are inferred based on the clusters extracted
from the sketch using a correlation-based retrieval step. Uniform random node
sampling is shown to improve the computational complexity over clustering of
the full graph when the cluster sizes are balanced. A new random degree-based
node sampling algorithm is presented which significantly improves upon the
performance of the clustering algorithm even when clusters are unbalanced. This
algorithm improves the phase transitions for matrix-decomposition-based
clustering with regard to computational complexity and minimum cluster size,
which are shown to be nearly dimension-free in the low inter-cluster
connectivity regime. A third sampling technique is shown to improve balance by
randomly sampling nodes based on spatial distribution. We provide analysis and
numerical results using a convex clustering algorithm based on matrix
completion
- …