A simple D^2-sampling based PTAS for k-means and other Clustering Problems
Given a set of $n$ points $P \subseteq \mathbb{R}^d$, the $k$-means clustering
problem is to find a set $C$ of $k$ {\em centers} such that the objective function
$\sum_{p \in P} d(p, C)^2$, where $d(p, C)$ denotes the distance between $p$ and
the closest center in $C$, is minimized. This is one of the most prominent
objective functions that have been studied with respect to clustering.
$D^2$-sampling \cite{ArthurV07} is a simple non-uniform sampling technique
for choosing points from a set of points. It works as follows: given a set of
$n$ points $P \subseteq \mathbb{R}^d$, the first point is chosen uniformly at
random from $P$. Subsequently, a point from $P$ is chosen as the next sample
with probability proportional to the square of the distance of this point to
the nearest previously sampled point.
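The sampling procedure described above can be sketched as follows (a minimal illustration under the stated description, not code from the paper):

```python
import random

def d2_sample(points, k, seed=None):
    """Choose k centers from points by D^2-sampling: the first center is
    uniform at random; each subsequent center is drawn with probability
    proportional to the squared distance to the nearest center so far."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance of each point to its nearest chosen center.
        weights = [min(sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                       for c in centers)
                   for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(d2_sample(pts, 2, seed=0))
```

Because already-chosen points have weight zero, the samples tend to spread out across well-separated clusters, which is the intuition behind its use for seeding $k$-means.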
$D^2$-sampling has been shown to have nice properties with respect to the
$k$-means clustering problem. Arthur and Vassilvitskii \cite{ArthurV07} show
that $k$ points chosen as centers from $P$ using $D^2$-sampling give an
$O(\log k)$ approximation in expectation. Ailon et al. \cite{AJMonteleoni09}
and Aggarwal et al. \cite{AggarwalDK09} extended the results of \cite{ArthurV07}
to show that $O(k)$ points chosen as centers using $D^2$-sampling give a
constant-factor approximation to the $k$-means objective function with high
probability. In this paper, we further demonstrate the power of $D^2$-sampling
by giving a simple randomized $(1+\epsilon)$-approximation algorithm that uses
$D^2$-sampling at its core.
The Hardness of Approximation of Euclidean k-means
The Euclidean $k$-means problem is a classical problem that has been
extensively studied in the theoretical computer science, machine learning, and
computational geometry communities. In this problem, we are given a set of $n$
points in Euclidean space $\mathbb{R}^d$, and the goal is to choose $k$ centers
in $\mathbb{R}^d$ so that the sum of squared distances of each point to its
nearest center is minimized. The best approximation algorithms for this problem
include a polynomial time constant factor approximation for general $k$ and a
$(1+\epsilon)$-approximation which runs in time polynomial in $n$ but
exponential in $k$ and $1/\epsilon$. At the other extreme, the only known
computational complexity result for this problem is NP-hardness [ADHP'09]. The
main difficulty in obtaining hardness results stems from the Euclidean nature
of the problem, and the fact that any point in $\mathbb{R}^d$ can be a
potential center. This gap in understanding left open the intriguing
possibility that the problem might admit a PTAS for all $k$ and $d$.
In this paper we provide the first hardness of approximation result for the
Euclidean $k$-means problem. Concretely, we show that there exists a constant
$\epsilon > 0$ such that it is NP-hard to approximate the $k$-means objective
to within a factor of $(1+\epsilon)$. We show this via an efficient reduction
from the vertex cover problem on triangle-free graphs: given a triangle-free
graph, the goal is to choose the fewest vertices which are incident
on all the edges. Additionally, we give a proof that the current best hardness
results for vertex cover can be carried over to triangle-free graphs. To show
this we transform $G$, a known hard vertex cover instance, by taking a graph
product with a suitably chosen graph $H$, and show that the size of the
(normalized) maximum independent set is almost exactly preserved in the product
graph using a spectral analysis, which might be of independent interest.
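To make the reduction's source problem concrete (a minimal sketch, not from the paper; the example graph is made up), the vertex cover condition can be checked directly from its definition:

```python
def is_vertex_cover(edges, cover):
    """A set of vertices is a vertex cover iff every edge has at least
    one endpoint in the set."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)

# A 5-cycle is triangle-free; three vertices suffice to cover it,
# while two never do.
c5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(is_vertex_cover(c5, {0, 2, 3}))  # True
print(is_vertex_cover(c5, {0, 2}))     # False: edge (3, 4) is uncovered
```

The optimization version asks for the smallest such set; the reduction above maps hard instances of this question on triangle-free graphs to $k$-means instances.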