The Hardness of Approximation of Euclidean k-means
The Euclidean $k$-means problem is a classical problem that has been
extensively studied in the theoretical computer science, machine learning and
computational geometry communities. In this problem, we are given a set of
points in Euclidean space $\mathbb{R}^d$, and the goal is to choose $k$ centers in $\mathbb{R}^d$
so that the sum of squared distances of each point to its nearest center
is minimized. The best approximation algorithms for this problem include a
polynomial-time constant-factor approximation for general $k$ and a
$(1+\eps)$-approximation which runs in time exponential in $k$ and $1/\eps$. At
the other extreme, the only known computational complexity result for this
problem is NP-hardness [ADHP'09]. The main difficulty in obtaining hardness
results stems from the Euclidean nature of the problem, and the fact that any
point in $\mathbb{R}^d$ can be a potential center. This gap in understanding left open
the intriguing possibility that the problem might admit a PTAS for all $k$ and $d$.
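The objective defined above is straightforward to evaluate; a minimal sketch (the point set and centers are illustrative, not from the paper):

```python
# Evaluate the Euclidean k-means objective: each point is charged the
# squared distance to its nearest center. Centers may be arbitrary points
# of R^d; here we simply score a given candidate set on a toy instance.

def kmeans_cost(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                     for c in centers)
    return total

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers = [(0.5, 0.0), (10.5, 0.0)]
print(kmeans_cost(points, centers))  # 1.0 (four points at squared distance 0.25 each)
```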
In this paper we provide the first hardness of approximation result for the
Euclidean $k$-means problem. Concretely, we show that there exists a constant
$\eps > 0$ such that it is NP-hard to approximate the $k$-means objective
to within a factor of $(1+\eps)$. We show this via an efficient reduction
from the vertex cover problem on triangle-free graphs: given a triangle-free
graph, the goal is to choose the fewest vertices which are incident
on all the edges. Additionally, we give a proof that the current best hardness
results for vertex cover can be carried over to triangle-free graphs. To show
this we transform a known hard vertex cover instance by taking a graph
product with a suitably chosen graph, and show, using a spectral analysis, that
the size of the (normalized) maximum independent set is almost exactly
preserved in the product graph; this analysis might be of independent interest.
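The source problem of the reduction, vertex cover on triangle-free graphs, can be stated concretely. A brute-force sketch on a toy instance (the graph and all names are illustrative; this exponential-time search is only for tiny graphs, not the paper's reduction):

```python
# Minimum vertex cover: the fewest vertices incident on all edges.
# Brute force over all vertex subsets in increasing size order.

from itertools import combinations

def min_vertex_cover_size(n, edges):
    """Size of a smallest vertex cover of a graph on vertices 0..n-1."""
    for size in range(n + 1):
        for cover in combinations(range(n), size):
            s = set(cover)
            if all(u in s or v in s for u, v in edges):
                return size
    return n

# C5, a triangle-free 5-cycle; its minimum vertex cover has size 3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(min_vertex_cover_size(5, edges))  # 3
```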
On Variants of k-means Clustering
\textit{Clustering problems} often arise in fields like data mining and
machine learning, where the task is to group a collection of objects into similar groups with
respect to a similarity (or dissimilarity) measure. Among clustering
problems, \textit{$k$-means} clustering in particular has received much attention
from researchers. Despite the fact that $k$-means is a very well studied
problem, its status in the plane is still open. In particular, it is
unknown whether it admits a PTAS in the plane. The best known approximation
bound achievable in polynomial time is $9+\eps$.
In this paper, we consider the following variant of $k$-means. Given a set
$C$ of points in $\mathbb{R}^d$ and a real $f > 0$, find a finite set $F$ of
points in $\mathbb{R}^d$ that minimizes the quantity $f \cdot |F| + \sum_{p \in C} \min_{q \in F} \|p - q\|^2$. For any fixed dimension $d$, we design a local
search PTAS for this problem. We also give a "bi-criterion" local search
algorithm for $k$-means which uses $(1+\eps)k$ centers and yields a solution
whose cost is at most $(1+\eps)$ times the cost of an optimal $k$-means
solution. The algorithm runs in polynomial time for any fixed dimension.
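The local search idea can be sketched in its simplest single-swap form, with candidate centers restricted to the input points for brevity (the paper's PTAS allows arbitrary centers and more general moves; all names here are illustrative):

```python
# Single-swap local search for k-means: starting from any k centers,
# repeatedly replace one center by a non-center point whenever the swap
# lowers the objective, until no single swap improves it.

def cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)

def local_search(points, k):
    """Return a set of k centers that is locally optimal under single swaps."""
    centers = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for q in points:
                if q in centers:
                    continue
                trial = centers[:i] + [q] + centers[i + 1:]
                if cost(points, trial) < cost(points, centers):
                    centers, improved = trial, True
    return centers

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers = local_search(points, 2)
print(sorted(centers), cost(points, centers))  # [(1.0,), (10.0,)] 2.0
```

On this toy instance the search escapes the poor initial choice of the two leftmost points and ends with one center per cluster.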
The contribution of this paper is twofold. On the one hand, we are
able to handle the squares of distances in an elegant manner, which yields a near
optimal approximation bound. This leads us towards a better understanding of
the $k$-means problem. On the other hand, our analysis of local search might
also be useful for other geometric problems. This is important considering that
very little is known about the local search method for geometric approximation.
A bi-criteria approximation algorithm for $k$-Means
We consider the classical $k$-means clustering problem in the setting of
bi-criteria approximation, in which an algorithm is allowed to output $\beta k > k$
clusters, and must produce a clustering with cost at most $\alpha$ times the
cost of the optimal set of $k$ clusters. We argue that this approach is
natural in many settings in which the exact number of clusters is a priori
unknown, or unimportant up to a constant factor. We give new bi-criteria
approximation algorithms, based on linear programming and local search,
respectively, which attain a guarantee $\alpha(\beta)$ depending on the number
$\beta k$ of clusters that may be opened. Our guarantee $\alpha(\beta)$ is
always at most $9+\eps$ and improves rapidly with $\beta$. Moreover, our algorithms have only
polynomial dependence on the dimension of the input data, and so are applicable
in high-dimensional settings.
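The bi-criteria notion can be made concrete by comparing the best achievable cost with $k$ clusters against the best with $\beta k$ clusters. A brute-force illustration on a toy instance, with centers restricted to input points (everything here is illustrative; the paper's algorithms use LP rounding and local search, not enumeration):

```python
# Compare the optimal k-means cost with k clusters against the optimal
# cost when beta*k clusters may be opened, by exhaustive search over
# center subsets drawn from the input points (tiny instances only).

from itertools import combinations

def cost(points, centers):
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)

def best_cost(points, num_centers):
    """Cheapest clustering with the given number of centers among the points."""
    return min(cost(points, c) for c in combinations(points, num_centers))

points = [(0.0,), (1.0,), (2.0,), (9.0,), (10.0,), (11.0,)]
k = 2
opt_k = best_cost(points, k)        # best cost with k clusters
relaxed = best_cost(points, 2 * k)  # best cost when beta*k = 2k clusters are allowed
print(opt_k, relaxed)  # 4.0 2.0
```

Opening more clusters strictly lowers the achievable cost here, which is exactly the slack a bi-criteria guarantee trades on.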