
    A simple D^2-sampling based PTAS for k-means and other Clustering Problems

    Given a set of points $P \subset \mathbb{R}^d$, the $k$-means clustering problem is to find a set of $k$ {\em centers} $C = \{c_1, \ldots, c_k\}$, $c_i \in \mathbb{R}^d$, such that the objective function $\sum_{x \in P} d(x, C)^2$, where $d(x, C)$ denotes the distance between $x$ and the closest center in $C$, is minimized. This is one of the most prominent objective functions studied in clustering. $D^2$-sampling \cite{ArthurV07} is a simple non-uniform technique for sampling points from a set. It works as follows: given a set of points $P \subseteq \mathbb{R}^d$, the first point is chosen uniformly at random from $P$. Each subsequent point is chosen from $P$ with probability proportional to the square of its distance to the nearest previously sampled point. $D^2$-sampling has been shown to have nice properties with respect to the $k$-means clustering problem. Arthur and Vassilvitskii \cite{ArthurV07} show that $k$ points chosen as centers from $P$ using $D^2$-sampling give an $O(\log k)$ approximation in expectation. Ailon et al. \cite{AJMonteleoni09} and Aggarwal et al. \cite{AggarwalDK09} extended the results of \cite{ArthurV07} to show that $O(k)$ points chosen as centers using $D^2$-sampling give an $O(1)$ approximation to the $k$-means objective with high probability. In this paper, we further demonstrate the power of $D^2$-sampling by giving a simple randomized $(1 + \epsilon)$-approximation algorithm that uses $D^2$-sampling at its core.
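
    As a concrete illustration of the sampling procedure described above, here is a minimal Python/NumPy sketch of $D^2$-sampling used as a seeding step, together with the $k$-means objective it targets. The function names (d2_sampling, kmeans_cost) and the array-based interface are illustrative assumptions, not code from the paper.

    import numpy as np

    def d2_sampling(P, k, seed=None):
        """Pick k centers from P (an n x d array) by D^2-sampling.

        The first center is uniform over P; each subsequent center is
        drawn with probability proportional to the squared distance of
        a point to the nearest center chosen so far. Assumes P has at
        least k distinct points (sketch; no degenerate-case handling).
        """
        rng = np.random.default_rng(seed)
        n = P.shape[0]
        centers = [P[rng.integers(n)]]             # first center: uniform
        d2 = ((P - centers[0]) ** 2).sum(axis=1)   # d(x, C)^2 for each x
        for _ in range(k - 1):
            idx = rng.choice(n, p=d2 / d2.sum())   # sample prop. to d(x, C)^2
            centers.append(P[idx])
            d2 = np.minimum(d2, ((P - P[idx]) ** 2).sum(axis=1))
        return np.stack(centers)

    def kmeans_cost(P, C):
        """k-means objective: sum over x in P of min_{c in C} ||x - c||^2."""
        sq = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        return sq.min(axis=1).sum()

    Sampling exactly $k$ centers this way is the k-means++ seeding of \cite{ArthurV07} ($O(\log k)$-approximate in expectation); sampling $O(k)$ centers gives the constant-factor guarantees of \cite{AJMonteleoni09} and \cite{AggarwalDK09} cited above.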

    The Hardness of Approximation of Euclidean k-means

    The Euclidean $k$-means problem is a classical problem that has been extensively studied in the theoretical computer science, machine learning, and computational geometry communities. In this problem, we are given a set of $n$ points in Euclidean space $\mathbb{R}^d$, and the goal is to choose $k$ centers in $\mathbb{R}^d$ so that the sum of squared distances of each point to its nearest center is minimized. The best approximation algorithms for this problem include a polynomial-time constant-factor approximation for general $k$ and a $(1+\epsilon)$-approximation that runs in time $\mathrm{poly}(n) \, 2^{O(k/\epsilon)}$. At the other extreme, the only known computational complexity result for this problem is NP-hardness [ADHP'09]. The main difficulty in obtaining hardness results stems from the Euclidean nature of the problem and the fact that any point in $\mathbb{R}^d$ can be a potential center. This gap in understanding left open the intriguing possibility that the problem might admit a PTAS for all $k, d$. In this paper we provide the first hardness of approximation result for the Euclidean $k$-means problem. Concretely, we show that there exists a constant $\epsilon > 0$ such that it is NP-hard to approximate the $k$-means objective to within a factor of $(1+\epsilon)$. We show this via an efficient reduction from the vertex cover problem on triangle-free graphs: given a triangle-free graph, the goal is to choose the smallest number of vertices incident on all the edges. Additionally, we prove that the current best hardness results for vertex cover carry over to triangle-free graphs. To show this, we transform $G$, a known hard vertex cover instance, by taking a graph product with a suitably chosen graph $H$, and we show via a spectral analysis that the size of the (normalized) maximum independent set is almost exactly preserved in the product graph, which may be of independent interest.
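
    The geometric heart of such a reduction can be sketched as follows, assuming the natural edge-indicator embedding used in this line of work: each edge $\{u, v\}$ becomes the 0/1 vector $e_u + e_v \in \mathbb{R}^{|V|}$. The helper name edges_to_points below is an illustrative assumption, not code from the paper. Clustering the resulting points with $k$ centers groups edges around shared vertices, so vertex cover size controls the $k$-means cost; triangle-freeness matters because any set of pairwise-intersecting edges in a triangle-free graph must form a star around a common vertex.

    import numpy as np

    def edges_to_points(n_vertices, edges):
        """Embed each edge {u, v} as the 0/1 vector e_u + e_v in R^n.

        Each row has exactly two ones, at the edge's endpoints, so two
        edges sharing a vertex are at squared distance 2, while two
        disjoint edges are at squared distance 4.
        """
        X = np.zeros((len(edges), n_vertices))
        for i, (u, v) in enumerate(edges):
            X[i, u] = 1.0
            X[i, v] = 1.0
        return X

    # Tiny example: the 4-cycle (triangle-free) on vertices 0..3 has a
    # vertex cover {0, 2} of size 2; its four edges embed into R^4.
    X = edges_to_points(4, [(0, 1), (1, 2), (2, 3), (3, 0)])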