Dimensionality Reduction for k-Means Clustering and Low Rank Approximation
We show how to approximate a data matrix $\mathbf{A}$ with a much smaller
sketch $\tilde{\mathbf{A}}$ that can be used to solve a general class of
constrained k-rank approximation problems to within $(1+\epsilon)$ error.
Importantly, this class of problems includes k-means clustering and
unconstrained low rank approximation (i.e. principal component analysis). By
reducing data points to just $O(k)$ dimensions, our methods generically
accelerate any exact, approximate, or heuristic algorithm for these ubiquitous
problems.
For k-means dimensionality reduction, we provide $(1+\epsilon)$ relative
error results for many common sketching techniques, including random row
projection, column selection, and approximate SVD. For approximate principal
component analysis, we give a simple alternative to known algorithms that has
applications in the streaming setting. Additionally, we extend recent work on
column-based matrix reconstruction, giving column subsets that not only "cover"
a good subspace for $\mathbf{A}$, but can be used directly to compute this
subspace.
Finally, for k-means clustering, we show how to achieve a $(9+\epsilon)$
approximation by Johnson-Lindenstrauss projecting data points to just
$O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the
specific structure of k-means to achieve dimension independent of input size
and sublinear in k.
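As a rough illustration of the Johnson-Lindenstrauss step described in this abstract, the sketch below projects a synthetic data set to a small number of dimensions and compares the k-means cost of one fixed clustering before and after projection. It is a minimal sketch under stated assumptions, not the paper's construction: the dense Gaussian map, the helper names `jl_project` and `kmeans_cost`, the synthetic data, and the constant 8 in the target dimension are all choices made here for illustration, and checking one clustering does not verify the approximation guarantee.

```python
import numpy as np

def jl_project(points, target_dim, seed=0):
    """Project the rows of `points` to `target_dim` dimensions.

    Assumption: a dense Gaussian map with N(0, 1/target_dim) entries,
    so squared norms are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    proj = rng.normal(0.0, 1.0 / np.sqrt(target_dim), size=(d, target_dim))
    return points @ proj

def kmeans_cost(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    cost = 0.0
    for j in range(k):
        members = points[labels == j]
        if len(members):
            cost += ((members - members.mean(axis=0)) ** 2).sum()
    return cost

# Synthetic data: k well-separated Gaussian blobs in d dimensions.
rng = np.random.default_rng(1)
k, n_per_cluster, d = 4, 50, 500
centers = rng.normal(0.0, 10.0, size=(k, d))
data = np.vstack([c + rng.normal(size=(n_per_cluster, d)) for c in centers])
labels = np.repeat(np.arange(k), n_per_cluster)

# Target dimension proportional to log(k) / eps^2; the factor 8 is arbitrary.
eps = 0.5
target_dim = int(np.ceil(np.log(k) / eps**2)) * 8

sketch = jl_project(data, target_dim)
ratio = kmeans_cost(sketch, labels, k) / kmeans_cost(data, labels, k)
print(f"{d} -> {target_dim} dimensions, cost ratio for the fixed clustering: {ratio:.3f}")
```

Any k-means solver could then be run on `sketch` instead of `data`; the target dimension depends on k and the accuracy parameter only, not on the number of points or the original dimension.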
Coresets for Fuzzy K-Means with Applications
The fuzzy K-means problem is a popular generalization of the well-known K-means problem to soft clusterings. We present the first coresets for fuzzy K-means with size linear in the dimension, polynomial in the number of clusters, and poly-logarithmic in the number of points. We show that these coresets can be employed in the computation of a (1+epsilon)-approximation for fuzzy K-means, improving previously presented results. We further show that our coresets can be maintained in an insertion-only streaming setting, where data points arrive one-by-one.
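To make the objective behind this abstract concrete, the sketch below evaluates a weighted fuzzy K-means cost for a fixed set of centers. It assumes the standard fuzzy c-means formulation (not spelled out in the abstract): each point spreads its membership over all centers, the memberships of a point sum to 1, and a fuzzifier `m > 1` controls the softness. The function name `fuzzy_kmeans_cost` and the toy data are hypothetical. A coreset is a small weighted point set whose cost approximates that of the full data for every choice of centers; the example only shows what a weight means, it does not construct a coreset.

```python
import numpy as np

def fuzzy_kmeans_cost(points, centers, weights=None, m=2.0):
    """Weighted fuzzy K-means cost with optimal memberships for fixed centers.

    Assumed objective: sum_i w_i * sum_j u_ij**m * ||x_i - c_j||**2,
    where the memberships u_ij of each point sum to 1.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    if weights is None:
        weights = np.ones(len(points))
    weights = np.asarray(weights, dtype=float)
    # Squared distances between every point and every center, shape (n, k).
    sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    sq = np.maximum(sq, 1e-12)  # guard against a point sitting on a center
    # Optimal memberships are proportional to sq ** (-1 / (m - 1)).
    inv = sq ** (-1.0 / (m - 1.0))
    memberships = inv / inv.sum(axis=1, keepdims=True)
    per_point = (memberships ** m * sq).sum(axis=1)
    return float((weights * per_point).sum())

centers = [[0.0, 0.0], [4.0, 0.0]]

# A point listed twice costs the same as that point carrying weight 2.
full_cost = fuzzy_kmeans_cost([[1.0, 0.0], [1.0, 0.0], [3.0, 1.0]], centers)
weighted_cost = fuzzy_kmeans_cost([[1.0, 0.0], [3.0, 1.0]], centers, weights=[2.0, 1.0])
print(full_cost, weighted_cost)
```

Weights let a few representative points stand in for many original ones, which is what makes a summary of this kind usable in a streaming setting.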
FPT Approximation for Constrained Metric k-Median/Means
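The metric k-median problem over a metric space $(\mathcal{X}, d)$ is
defined as follows: given a set $L \subseteq \mathcal{X}$ of facility locations
and a set $C \subseteq \mathcal{X}$ of clients, open a set $F \subseteq L$ of $k$
facilities such that the total service cost, defined as $\Phi(F, C) \equiv \sum_{x \in C} \min_{f \in F} d(x, f)$, is minimised. The metric k-means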
problem is defined similarly using squared distances. In many applications
there are additional constraints that any solution needs to satisfy. This gives
rise to different constrained versions of the problem, such as the r-gather,
fault-tolerant, and outlier k-means/k-median problems. Surprisingly, for many of
these constrained problems, no constant-approximation algorithm is known. We
give FPT algorithms with constant approximation guarantee for a range of
constrained k-median/means problems. For some of the constrained problems,
ours is the first constant factor approximation algorithm whereas for others,
we improve or match the approximation guarantee of previous works. We work
within the unified framework of Ding and Xu that allows us to simultaneously
obtain algorithms for a range of constrained problems. In particular, we obtain
a $(3+\epsilon)$-approximation and a $(9+\epsilon)$-approximation for the
constrained versions of the k-median and k-means problems, respectively, in
FPT time. In many practical settings of the k-median/means problem, one is
allowed to open a facility at any client location, i.e., $C \subseteq L$. For
this special case, our algorithm gives a $(2+\epsilon)$-approximation and a
$(4+\epsilon)$-approximation for the constrained versions of the k-median and
k-means problems, respectively, in FPT time. Since our algorithm is based on
a simple sampling technique, it can also be converted to a constant-pass
log-space streaming algorithm.
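The problem definition in this abstract can be sketched on a toy instance as follows. The code assumes the usual service cost, in which every client pays its distance (or squared distance, for k-means) to the nearest open facility, and it finds the best set of k facilities by exhaustive search. This is only an illustration of the unconstrained objective: it is not the FPT sampling algorithm of the paper, and the names `service_cost`, `brute_force_k_clustering`, and `line_metric` are hypothetical.

```python
from itertools import combinations

def service_cost(facilities, clients, dist, power=1):
    """Total cost: each client pays dist(client, nearest facility) ** power.

    power=1 gives the k-median cost, power=2 the k-means cost.
    """
    return sum(min(dist(x, f) for f in facilities) ** power for x in clients)

def brute_force_k_clustering(locations, clients, k, dist, power=1):
    """Try every k-subset of `locations`; exponential in k, for toy inputs only."""
    best = min(
        combinations(locations, k),
        key=lambda chosen: service_cost(chosen, clients, dist, power),
    )
    return best, service_cost(best, clients, dist, power)

def line_metric(a, b):
    """Distance between two points on the real line (an assumed toy metric)."""
    return abs(a - b)

locations = [0, 5, 10]  # candidate facility locations
clients = [1, 2, 9]

median_facilities, median_cost = brute_force_k_clustering(
    locations, clients, 2, line_metric
)
means_facilities, means_cost = brute_force_k_clustering(
    locations, clients, 2, line_metric, power=2
)
print(median_facilities, median_cost)  # (0, 10) 4
print(means_facilities, means_cost)    # (0, 10) 6
```

A constrained variant would restrict which assignments of clients to facilities are feasible (for example, a minimum number of clients per facility), so the nearest-facility assignment used here would no longer always be allowed.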