
    Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

    We show how to approximate a data matrix $\mathbf{A}$ with a much smaller sketch $\mathbf{\tilde A}$ that can be used to solve a general class of constrained $k$-rank approximation problems to within $(1+\epsilon)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+\epsilon)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for $\mathbf{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+\epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.
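As a rough illustration of the Johnson-Lindenstrauss idea mentioned in this abstract, the sketch below projects synthetic points through a scaled Gaussian matrix and compares the $k$-means cost of one fixed partition before and after projection. It is not the paper's construction: the data, the fixed partition, the helper names `kmeans_cost` and `jl_project`, and the target dimension of 64 are all assumptions chosen for the example, and the code does not demonstrate the $(9+\epsilon)$ guarantee or the $O(\log k/\epsilon^2)$ bound.

```python
import numpy as np

def kmeans_cost(X, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    cost = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        cost += np.sum((pts - pts.mean(axis=0)) ** 2)
    return cost

def jl_project(X, target_dim, rng):
    """Project the rows of X to target_dim dimensions with a scaled Gaussian matrix."""
    d = X.shape[1]
    Pi = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
    return X @ Pi

# Synthetic data (an assumption): k well-separated Gaussian blobs in d dimensions.
rng = np.random.default_rng(0)
k, n_per, d = 4, 100, 200
centers = rng.standard_normal((k, d)) * 5.0
X = np.vstack([c + rng.standard_normal((n_per, d)) for c in centers])
labels = np.repeat(np.arange(k), n_per)  # fixed partition: one cluster per blob

# Target dimension 64 is arbitrary, not derived from any bound in the abstract.
X_sketch = jl_project(X, target_dim=64, rng=rng)
ratio = kmeans_cost(X_sketch, labels) / kmeans_cost(X, labels)
print(X_sketch.shape, round(ratio, 3))
```

Because the projection is linear, the centroid of a projected cluster is the projection of the original centroid, so the cost ratio for a fixed partition depends only on how well the random matrix preserves squared norms. A ratio near 1 is what one would hope to see here.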

    Coresets for Fuzzy K-Means with Applications

    The fuzzy K-means problem is a popular generalization of the well-known K-means problem to soft clusterings. We present the first coresets for fuzzy K-means with size linear in the dimension, polynomial in the number of clusters, and poly-logarithmic in the number of points. We show that these coresets can be employed in the computation of a $(1+\epsilon)$-approximation for fuzzy K-means, improving previously presented results. We further show that our coresets can be maintained in an insertion-only streaming setting, where data points arrive one-by-one.

    FPT Approximation for Constrained Metric k-Median/Means

    The Metric $k$-median problem over a metric space $(\mathcal{X}, d)$ is defined as follows: given a set $L \subseteq \mathcal{X}$ of facility locations and a set $C \subseteq \mathcal{X}$ of clients, open a set $F \subseteq L$ of $k$ facilities such that the total service cost, defined as $\Phi(F, C) \equiv \sum_{x \in C} \min_{f \in F} d(x, f)$, is minimised. The metric $k$-means problem is defined similarly using squared distances. In many applications there are additional constraints that any solution needs to satisfy. This gives rise to different constrained versions of the problem such as $r$-gather, fault-tolerant, outlier $k$-means/$k$-median problem. Surprisingly, for many of these constrained problems, no constant-approximation algorithm is known. We give FPT algorithms with constant approximation guarantee for a range of constrained $k$-median/means problems. For some of the constrained problems, ours is the first constant factor approximation algorithm whereas for others, we improve or match the approximation guarantee of previous works. We work within the unified framework of Ding and Xu that allows us to simultaneously obtain algorithms for a range of constrained problems. In particular, we obtain a $(3+\varepsilon)$-approximation and $(9+\varepsilon)$-approximation for the constrained versions of the $k$-median and $k$-means problem respectively in FPT time. In many practical settings of the $k$-median/means problem, one is allowed to open a facility at any client location, i.e., $C \subseteq L$. For this special case, our algorithm gives a $(2+\varepsilon)$-approximation and $(4+\varepsilon)$-approximation for the constrained versions of $k$-median and $k$-means problem respectively in FPT time. Since our algorithm is based on simple sampling technique, it can also be converted to a constant-pass log-space streaming algorithm.
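To make the service cost $\Phi(F, C)$ from this abstract concrete, here is a minimal sketch that evaluates it and finds the best $k$ facilities by exhaustive search. This is not the FPT sampling algorithm the abstract describes and it handles none of the constrained variants. The function names, the instance on the real line, and the brute-force search are assumptions made purely for illustration.

```python
from itertools import combinations

def service_cost(F, C, d, squared=False):
    """Phi(F, C): each client pays its distance to the nearest open facility.

    With squared=True the distances are squared, giving the k-means variant.
    """
    p = 2 if squared else 1
    return sum(min(d(x, f) for f in F) ** p for x in C)

def brute_force_k_median(L, C, k, d, squared=False):
    """Try every size-k subset of L; exponential time, for tiny instances only."""
    return min(combinations(L, k), key=lambda F: service_cost(F, C, d, squared))

# Hypothetical instance on the real line with d(x, y) = |x - y|.
dist = lambda x, y: abs(x - y)
clients = [0.0, 1.0, 2.0, 10.0, 11.0]
facilities = [1.0, 5.0, 10.0]

best = brute_force_k_median(facilities, clients, k=2, d=dist)
print(best, service_cost(best, clients, dist))  # prints (1.0, 10.0) 3.0
```

Opening facilities at 1.0 and 10.0 serves the three left-hand clients at total cost 2 and the two right-hand clients at total cost 1, whereas either subset containing 5.0 costs 13.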