
    The Hardness of Approximation of Euclidean k-means

    The Euclidean $k$-means problem is a classical problem that has been extensively studied in the theoretical computer science, machine learning and the computational geometry communities. In this problem, we are given a set of $n$ points in Euclidean space $\mathbb{R}^d$, and the goal is to choose $k$ centers in $\mathbb{R}^d$ so that the sum of squared distances of each point to its nearest center is minimized. The best approximation algorithms for this problem include a polynomial time constant factor approximation for general $k$ and a $(1+\epsilon)$-approximation which runs in time $\mathrm{poly}(n)\, 2^{O(k/\epsilon)}$. At the other extreme, the only known computational complexity result for this problem is NP-hardness [ADHP'09]. The main difficulty in obtaining hardness results stems from the Euclidean nature of the problem, and the fact that any point in $\mathbb{R}^d$ can be a potential center. This gap in understanding left open the intriguing possibility that the problem might admit a PTAS for all $k, d$. In this paper we provide the first hardness of approximation for the Euclidean $k$-means problem. Concretely, we show that there exists a constant $\epsilon > 0$ such that it is NP-hard to approximate the $k$-means objective to within a factor of $(1+\epsilon)$. We show this via an efficient reduction from the vertex cover problem on triangle-free graphs: given a triangle-free graph, the goal is to choose the fewest number of vertices which are incident on all the edges. Additionally, we give a proof that the current best hardness results for vertex cover can be carried over to triangle-free graphs. To show this we transform $G$, a known hard vertex cover instance, by taking a graph product with a suitably chosen graph $H$, and showing that the size of the (normalized) maximum independent set is almost exactly preserved in the product graph using a spectral analysis, which might be of independent interest.
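
The objective described in this abstract (sum of squared distances from each point to its nearest center) can be illustrated with a minimal sketch. The function names `sq_dist` and `kmeans_cost` and the toy data are assumptions made for illustration; they do not come from the paper.

```python
from typing import Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """k-means objective: each point pays the squared distance to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy instance (hypothetical data): two well-separated pairs on a line.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centers = [(1.0, 0.0), (11.0, 0.0)]  # centers need not be input points

print(kmeans_cost(points, centers))  # each point is at distance 1 from its center -> 4.0
```

Note that the centers here are midpoints rather than input points, which reflects the abstract's remark that any point in $\mathbb{R}^d$ can be a potential center.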

    On Variants of k-means Clustering

    \textit{Clustering problems} often arise in fields such as data mining and machine learning, where the goal is to partition a collection of objects into groups of similar objects with respect to a similarity (or dissimilarity) measure. Among clustering problems, \textit{$k$-means} clustering has received particular attention from researchers. Although $k$-means is very well studied, its status in the plane is still open. In particular, it is unknown whether it admits a PTAS in the plane. The best known approximation bound in polynomial time is $9+\epsilon$. In this paper, we consider the following variant of $k$-means. Given a set $C$ of points in $\mathbb{R}^d$ and a real $f > 0$, find a finite set $F$ of points in $\mathbb{R}^d$ that minimizes the quantity $f \cdot |F| + \sum_{p \in C} \min_{q \in F} \|p-q\|^2$. For any fixed dimension $d$, we design a local search PTAS for this problem. We also give a "bi-criterion" local search algorithm for $k$-means which uses $(1+\epsilon)k$ centers and yields a solution whose cost is at most $(1+\epsilon)$ times the cost of an optimal $k$-means solution. The algorithm runs in polynomial time for any fixed dimension. The contribution of this paper is twofold. On the one hand, we are able to handle the square of distances in an elegant manner, which yields a near-optimal approximation bound. This leads us towards a better understanding of the $k$-means problem. On the other hand, our analysis of local search might also be useful for other geometric problems. This is important considering that very little is known about the local search method for geometric approximation. Comment: 15 pages
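
The penalized objective from this abstract, with a per-center cost $f$ plus the sum of squared distances, can be sketched together with a deliberately simplified local search. This is not the paper's algorithm. As labeled assumptions, candidate centers are restricted to the input points and each move adds, removes, or swaps a single center. The names `penalized_cost`, `neighbors`, and `local_search` are hypothetical.

```python
from typing import FrozenSet, Iterator, List, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def penalized_cost(points: List[Point], facilities: FrozenSet[Point], f: float) -> float:
    """f * |F| plus the sum over points of the squared distance to the nearest facility."""
    service = sum(min(sq_dist(p, q) for q in facilities) for p in points)
    return f * len(facilities) + service


def neighbors(current: FrozenSet[Point], candidates: FrozenSet[Point]) -> Iterator[FrozenSet[Point]]:
    """All solutions reachable by one add, one remove, or one swap."""
    outside = candidates - current
    for b in outside:                      # add a center
        yield current | {b}
    if len(current) > 1:                   # remove a center (keep at least one)
        for a in current:
            yield current - {a}
    for a in current:                      # swap one center for another
        for b in outside:
            yield (current - {a}) | {b}


def local_search(points: List[Point], f: float) -> Tuple[FrozenSet[Point], float]:
    """Simplified local search; ASSUMPTION: centers are chosen among the input points."""
    candidates = frozenset(points)
    current = frozenset([points[0]])
    best = penalized_cost(points, current, f)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current, candidates):
            cost = penalized_cost(points, cand, f)
            if cost < best - 1e-12:        # accept the first strictly improving move
                current, best, improved = cand, cost, True
                break
    return current, best


# Toy instance (hypothetical data): two tight pairs far apart.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
facilities, cost = local_search(points, f=1.0)
print(len(facilities), cost)
```

On this toy instance, opening at least one center per pair is clearly worthwhile because the per-center cost $f = 1$ is small compared with the squared distance between the pairs. A much larger $f$ would favor fewer centers.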

    A bi-criteria approximation algorithm for $k$ Means

    We consider the classical $k$-means clustering problem in the setting of bi-criteria approximation, in which an algorithm is allowed to output $\beta k > k$ clusters, and must produce a clustering with cost at most $\alpha$ times the cost of the optimal set of $k$ clusters. We argue that this approach is natural in many settings, for which the exact number of clusters is a priori unknown, or unimportant up to a constant factor. We give new bi-criteria approximation algorithms, based on linear programming and local search, respectively, which attain a guarantee $\alpha(\beta)$ depending on the number $\beta k$ of clusters that may be opened. Our guarantee $\alpha(\beta)$ is always at most $9 + \epsilon$ and improves rapidly with $\beta$ (for example: $\alpha(2) < 2.59$, and $\alpha(3) < 1.4$). Moreover, our algorithms have only polynomial dependence on the dimension of the input data, and so are applicable in high-dimensional settings.
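
To make the bi-criteria notion concrete, the sketch below measures the ratio between the cost of a solution using $\beta k$ centers and the optimal cost with $k$ clusters on a tiny instance. It is only an illustration of the definition, not either of the paper's algorithms. The optimum is found by brute force over all assignments, which is feasible only for very small inputs. All names and data are assumptions.

```python
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def cluster_cost(cluster: List[Point]) -> float:
    """Cost of one cluster when served by its centroid (the optimal single center)."""
    dim = len(cluster[0])
    centroid = tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dim))
    return sum(sq_dist(p, centroid) for p in cluster)


def brute_force_opt(points: Sequence[Point], k: int) -> float:
    """Optimal k-means cost by trying all k^n assignments (tiny inputs only)."""
    best = float("inf")
    for labels in product(range(k), repeat=len(points)):
        clusters = [[p for p, lab in zip(points, labels) if lab == j] for j in range(k)]
        best = min(best, sum(cluster_cost(c) for c in clusters if c))
    return best


def bicriteria_ratio(points: Sequence[Point], centers: Sequence[Point], k: int) -> float:
    """Observed alpha: cost of the given centers divided by the optimal k-cluster cost."""
    return kmeans_cost(points, centers) / brute_force_opt(points, k)


# Toy 1-D instance (hypothetical data) with k = 2 and beta = 2, i.e. 4 centers.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
centers = [(0.5,), (2.0,), (10.5,), (12.0,)]
print(bicriteria_ratio(points, centers, k=2))
```

The observed ratio on a single instance can be well below 1, since the extra centers help. The guarantee $\alpha(\beta)$ in the abstract is a worst-case upper bound on this ratio.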