
    A simple D^2-sampling based PTAS for k-means and other Clustering Problems

    Given a set of points $P \subset \mathbb{R}^d$, the $k$-means clustering problem is to find a set of $k$ {\em centers} $C = \{c_1, \ldots, c_k\}$, $c_i \in \mathbb{R}^d$, such that the objective function $\sum_{x \in P} d(x, C)^2$, where $d(x, C)$ denotes the distance between $x$ and the closest center in $C$, is minimized. This is one of the most prominent objective functions studied in clustering. $D^2$-sampling \cite{ArthurV07} is a simple non-uniform technique for sampling points from a set. It works as follows: given a set of points $P \subseteq \mathbb{R}^d$, the first point is chosen uniformly at random from $P$. Each subsequent point is chosen from $P$ with probability proportional to the square of its distance to the nearest previously sampled point. $D^2$-sampling has been shown to have nice properties with respect to the $k$-means clustering problem. Arthur and Vassilvitskii \cite{ArthurV07} show that $k$ points chosen as centers from $P$ using $D^2$-sampling give an $O(\log k)$ approximation in expectation. Ailon et al. \cite{AJMonteleoni09} and Aggarwal et al. \cite{AggarwalDK09} extended the results of \cite{ArthurV07} to show that $O(k)$ points chosen as centers using $D^2$-sampling give an $O(1)$ approximation to the $k$-means objective with high probability. In this paper, we further demonstrate the power of $D^2$-sampling by giving a simple randomized $(1 + \epsilon)$-approximation algorithm that uses $D^2$-sampling at its core.
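
    As a concrete illustration of the sampling procedure described above, here is a minimal Python/NumPy sketch of $D^2$-sampling used as a seeding step, together with the $k$-means objective it targets. The function names (d2_sampling, kmeans_cost) and the array-based interface are illustrative assumptions, not code from the paper.

    import numpy as np

    def d2_sampling(P, k, seed=None):
        """Pick k centers from P (an n x d array) by D^2-sampling.

        The first center is uniform over P; each subsequent center is
        drawn with probability proportional to the squared distance of
        a point to the nearest center chosen so far. Assumes P has at
        least k distinct points (sketch; no degenerate-case handling).
        """
        rng = np.random.default_rng(seed)
        n = P.shape[0]
        centers = [P[rng.integers(n)]]             # first center: uniform
        d2 = ((P - centers[0]) ** 2).sum(axis=1)   # d(x, C)^2 for each x
        for _ in range(k - 1):
            idx = rng.choice(n, p=d2 / d2.sum())   # sample prop. to d(x, C)^2
            centers.append(P[idx])
            d2 = np.minimum(d2, ((P - P[idx]) ** 2).sum(axis=1))
        return np.stack(centers)

    def kmeans_cost(P, C):
        """k-means objective: sum over x in P of min_{c in C} ||x - c||^2."""
        sq = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        return sq.min(axis=1).sum()

    Sampling exactly $k$ centers this way is the k-means++ seeding of \cite{ArthurV07} ($O(\log k)$-approximate in expectation); sampling $O(k)$ centers gives the constant-factor guarantees of \cite{AJMonteleoni09} and \cite{AggarwalDK09} cited above.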

    The Hardness of Approximation of Euclidean k-means

    The Euclidean $k$-means problem is a classical problem that has been extensively studied in the theoretical computer science, machine learning, and computational geometry communities. In this problem, we are given a set of $n$ points in Euclidean space $\mathbb{R}^d$, and the goal is to choose $k$ centers in $\mathbb{R}^d$ so that the sum of squared distances of each point to its nearest center is minimized. The best approximation algorithms for this problem include a polynomial-time constant-factor approximation for general $k$ and a $(1+\epsilon)$-approximation that runs in time $\mathrm{poly}(n) \, 2^{O(k/\epsilon)}$. At the other extreme, the only known computational complexity result for this problem is NP-hardness [ADHP'09]. The main difficulty in obtaining hardness results stems from the Euclidean nature of the problem and the fact that any point in $\mathbb{R}^d$ can be a potential center. This gap in understanding left open the intriguing possibility that the problem might admit a PTAS for all $k, d$. In this paper we provide the first hardness of approximation result for the Euclidean $k$-means problem. Concretely, we show that there exists a constant $\epsilon > 0$ such that it is NP-hard to approximate the $k$-means objective to within a factor of $(1+\epsilon)$. We show this via an efficient reduction from the vertex cover problem on triangle-free graphs: given a triangle-free graph, the goal is to choose the smallest number of vertices incident on all the edges. Additionally, we prove that the current best hardness results for vertex cover carry over to triangle-free graphs. To show this, we transform $G$, a known hard vertex cover instance, by taking a graph product with a suitably chosen graph $H$, and we show via a spectral analysis that the size of the (normalized) maximum independent set is almost exactly preserved in the product graph, which may be of independent interest.
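
    The geometric heart of such a reduction can be sketched as follows, assuming the natural edge-indicator embedding used in this line of work: each edge $\{u, v\}$ becomes the 0/1 vector $e_u + e_v \in \mathbb{R}^{|V|}$. The helper name edges_to_points below is an illustrative assumption, not code from the paper. Clustering the resulting points with $k$ centers groups edges around shared vertices, so vertex cover size controls the $k$-means cost; triangle-freeness matters because any set of pairwise-intersecting edges in a triangle-free graph must form a star around a common vertex.

    import numpy as np

    def edges_to_points(n_vertices, edges):
        """Embed each edge {u, v} as the 0/1 vector e_u + e_v in R^n.

        Each row has exactly two ones, at the edge's endpoints, so two
        edges sharing a vertex are at squared distance 2, while two
        disjoint edges are at squared distance 4.
        """
        X = np.zeros((len(edges), n_vertices))
        for i, (u, v) in enumerate(edges):
            X[i, u] = 1.0
            X[i, v] = 1.0
        return X

    # Tiny example: the 4-cycle (triangle-free) on vertices 0..3 has a
    # vertex cover {0, 2} of size 2; its four edges embed into R^4.
    X = edges_to_points(4, [(0, 1), (1, 2), (2, 3), (3, 0)])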