32 research outputs found
An -Time Algorithm for Computing Maximum Independent Set in Graphs with Bounded Degree 3
We give a polynomial-space algorithm for computing Maximum Independent Set in graphs with bounded degree 3, whose running time improves all previously known bounds for the problem.
A Quantum Approximation Scheme for k-Means
We give a quantum approximation scheme (i.e., a $(1+\varepsilon)$-approximation for every $\varepsilon > 0$) for the classical
$k$-means clustering problem in the QRAM model with a running time that has
only polylogarithmic dependence on the number of data points. More
specifically, given a dataset $V$ with $N$ points in $\mathbb{R}^d$ stored in a
QRAM data structure, our quantum algorithm runs in time $\tilde{O}\big(2^{\tilde{O}(k/\varepsilon)} \eta^2 d\big)$ and with high probability
outputs a set $C$ of $k$ centers such that $cost(V, C) \leq (1+\varepsilon) \cdot cost(V, C_{OPT})$. Here $C_{OPT}$ denotes the optimal $k$ centers,
$cost(.)$ denotes the standard $k$-means cost function (i.e., the sum of the
squared distances of points to the closest center), and $\eta$ is the aspect
ratio (i.e., the ratio of the maximum distance to the minimum distance). This is the
first quantum algorithm with a polylogarithmic running time that gives a
provable approximation guarantee of $(1+\varepsilon)$ for the $k$-means
problem. Also, unlike previous works on unsupervised learning, our quantum
algorithm does not require quantum linear algebra subroutines and has a running
time independent of parameters (e.g., condition number) that appear in such
procedures.
A simple D^2-sampling based PTAS for k-means and other Clustering Problems
Given a set of points $P \subseteq \mathbb{R}^d$, the $k$-means clustering
problem is to find a set $C$ of $k$ {\em centers} such that the objective function $\sum_{x \in P} d(x, C)^2$,
where $d(x, C)$ denotes the distance between $x$ and the closest center in $C$,
is minimized. This is one of the most prominent objective functions that have
been studied with respect to clustering.
$D^2$-sampling \cite{ArthurV07} is a simple non-uniform sampling technique
for choosing points from a set of points. It works as follows: given a set of
points $P \subseteq \mathbb{R}^d$, the first point is chosen uniformly at
random from $P$. Subsequently, a point from $P$ is chosen as the next sample
with probability proportional to the square of the distance of this point to
the nearest previously sampled points.
$D^2$-sampling has been shown to have nice properties with respect to the
$k$-means clustering problem. Arthur and Vassilvitskii \cite{ArthurV07} show
that $k$ points chosen as centers from $P$ using $D^2$-sampling give an
$O(\log k)$ approximation in expectation. Ailon et al. \cite{AJMonteleoni09}
and Aggarwal et al. \cite{AggarwalDK09} extended the results of \cite{ArthurV07}
to show that $O(k)$ points chosen as centers using $D^2$-sampling give a constant factor
approximation to the $k$-means objective function with high probability. In
this paper, we further demonstrate the power of $D^2$-sampling by giving a
simple randomized $(1+\epsilon)$-approximation algorithm that uses
$D^2$-sampling at its core.
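The sampling procedure described above is easy to state in code. Below is a minimal Python sketch of $D^2$-sampling together with the $k$-means objective it targets; the function names are illustrative, not from the paper.

```python
import random

def d2_sample(points, k):
    """Pick k points from `points` via D^2-sampling (the k-means++ seeding rule):
    the first center is uniform at random; each subsequent center is drawn with
    probability proportional to the squared distance to the nearest center so far."""
    centers = [random.choice(points)]
    while len(centers) < k:
        # Squared distance of each point to its nearest already-chosen center.
        weights = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
                   for p in points]
        # Draw the next center with probability proportional to these weights.
        centers.append(random.choices(points, weights=weights, k=1)[0])
    return centers

def kmeans_cost(points, centers):
    """Standard k-means objective: sum of squared distances to the closest center."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)
```

Running `d2_sample` $k$ times yields exactly the seeding analyzed by Arthur and Vassilvitskii; far-away points get proportionally larger weights, which is what drives the approximation guarantees cited above.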
Hardness of Approximation for Euclidean k-Median
The Euclidean k-median problem is defined in the following manner: given a set $P$ of $n$ points in $d$-dimensional Euclidean space $\mathbb{R}^d$, and an integer $k$, find a set $C \subset \mathbb{R}^d$ of $k$ points (called centers) such that the cost function $\Phi(C, P) \equiv \sum_{x \in P} \min_{c \in C} \|x - c\|_2$ is minimized. The Euclidean k-means problem is defined similarly by replacing the distance with squared Euclidean distance in the cost function. Various hardness of approximation results are known for the Euclidean k-means problem [Pranjal Awasthi et al., 2015; Euiwoong Lee et al., 2017; Vincent Cohen-Addad and Karthik C. S., 2019]. However, no hardness of approximation result was known for the Euclidean k-median problem. In this work, assuming the unique games conjecture (UGC), we provide the hardness of approximation result for the Euclidean k-median problem in $O(\log k)$ dimensional space. This solves an open question posed explicitly in the work of Awasthi et al. [Pranjal Awasthi et al., 2015].
Furthermore, we study the hardness of approximation for the Euclidean k-means/k-median problems in the bi-criteria setting where an algorithm is allowed to choose more than $k$ centers. That is, bi-criteria approximation algorithms are allowed to output $\beta k$ centers (for constant $\beta > 1$) and the approximation ratio is computed with respect to the optimal k-means/k-median cost. We show the hardness of bi-criteria approximation result for the Euclidean k-median problem for any $\beta < 1.015$, assuming UGC. We also show a similar hardness of bi-criteria approximation result for the Euclidean k-means problem with a stronger bound of $\beta < 1.28$, again assuming UGC.
FPT Approximation for Constrained Metric k-Median/Means
The metric $k$-median problem over a metric space $(\mathcal{X}, d)$ is
defined as follows: given a set $L \subseteq \mathcal{X}$ of facility locations
and a set $C \subseteq \mathcal{X}$ of clients, open a set $F \subseteq L$ of $k$
facilities such that the total service cost, defined as $\sum_{x \in C} \min_{f \in F} d(x, f)$, is minimised. The metric $k$-means
problem is defined similarly using squared distances. In many applications
there are additional constraints that any solution needs to satisfy. This gives
rise to different constrained versions of the problem, such as the $r$-gather,
fault-tolerant, and outlier $k$-means/$k$-median problems. Surprisingly, for many of
these constrained problems, no constant factor approximation algorithm is known. We
give FPT algorithms with constant approximation guarantees for a range of
constrained $k$-median/means problems. For some of the constrained problems,
ours is the first constant factor approximation algorithm, whereas for others,
we improve or match the approximation guarantee of previous works. We work
within the unified framework of Ding and Xu that allows us to simultaneously
obtain algorithms for a range of constrained problems. In particular, we obtain
a $(3+\varepsilon)$-approximation and a $(9+\varepsilon)$-approximation for the
constrained versions of the $k$-median and $k$-means problems respectively in
FPT time. In many practical settings of the $k$-median/means problem, one is
allowed to open a facility at any client location, i.e., $C \subseteq L$. For
this special case, our algorithm gives a $(2+\varepsilon)$-approximation and a
$(4+\varepsilon)$-approximation for the constrained versions of the $k$-median and
$k$-means problems respectively in FPT time. Since our algorithm is based on a
simple sampling technique, it can also be converted to a constant-pass,
log-space streaming algorithm.
Faster Algorithms for the Constrained k-Means Problem
The classical center based clustering problems such as k-means/median/center assume that the optimal clusters satisfy the locality property that the points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not follow such a locality property. For instance, consider the r-gather clustering problem, where there is an additional constraint that each of the clusters should have at least r points, or the capacitated clustering problem, where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters O_1, ..., O_k are an arbitrary partition of the dataset and the goal is to output k centers c_1, ..., c_k such that the objective function sum_{i=1}^{k} sum_{x in O_{i}} ||x - c_{i}||^2 is minimized. It is not difficult to argue that any algorithm (without knowing the optimal clusters) that outputs a single set of k centers will not behave well as far as optimizing the above objective function is concerned. However, this does not rule out the existence of algorithms that output a list of such k centers such that at least one of these k centers behaves well. Given an error parameter epsilon > 0, let l denote the size of the smallest list of k-centers such that at least one of the k-centers gives a (1+epsilon) approximation w.r.t. the objective function above. In this paper, we show an upper bound on l by giving a randomized algorithm that outputs a list of 2^{~O(k/epsilon)} k-centers. We also give a closely matching lower bound of 2^{~Omega(k/sqrt{epsilon})}. Moreover, our algorithm runs in time O(n * d * 2^{~O(k/epsilon)}). This is a significant improvement over the previous result of Ding and Xu, who gave an algorithm with running time O(n * d * (log{n})^{k} * 2^{poly(k/epsilon)}) that outputs a list of size O((log{n})^k * 2^{poly(k/epsilon)}).
Our techniques generalize to the k-median problem and to many other settings where non-Euclidean distance measures are involved.
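Since the optimal clusters can be an arbitrary partition, the natural way to consume such a list of candidate k-center sets is to evaluate each candidate against a constraint-feasible partition and keep the best. A small Python sketch of that evaluation step (function names are illustrative; the brute-force center-to-cluster matching is only sensible for small k):

```python
from itertools import permutations

def constrained_cost(partition, centers):
    """Objective from the constrained k-means formulation above:
    sum over clusters O_i of squared distances to the center matched to O_i.
    Tries every center-to-cluster matching, which is fine for small k."""
    def sq(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    best = float('inf')
    for perm in permutations(centers):
        cost = sum(sq(x, perm[i])
                   for i, cluster in enumerate(partition) for x in cluster)
        best = min(best, cost)
    return best

def best_from_list(partition, candidate_lists):
    """Given a list of candidate k-center sets (as produced by a list
    algorithm), return the cost of the best candidate for this partition."""
    return min(constrained_cost(partition, cand) for cand in candidate_lists)
```

The guarantee in the abstract says that for the list the algorithm outputs, `best_from_list` is within a (1+epsilon) factor of the optimum for the intended partition.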
On the Distribution of the Fourier Spectrum of Halfspaces
Bourgain showed that any noise stable Boolean function $f$ can be
well-approximated by a junta. In this note we give an exponential sharpening of
the parameters of Bourgain's result under the additional assumption that $f$ is
a halfspace.
Approximate Clustering with Same-Cluster Queries
Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?", where the answers to these queries are assumed to be consistent with a unique optimal solution. There are many clustering contexts where such same cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with additional margin assumption, can be solved efficiently if one is allowed to make O(k^2 log{k} + k log{n}) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is NP-hard.
In this paper, we extend the work of Ashtiani et al. to the approximation setting by showing that a small number of such same-cluster queries enables one to get a polynomial-time (1+eps)-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is NP-hard to approximate within a factor (1+c) for a fixed constant 0 < c < 1. The number of same-cluster queries used by the algorithm is poly(k/eps), which is independent of the size n of the dataset. Our algorithm is based on the D^2-sampling technique, also known as the k-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries, showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make Omega(k/poly log k) same-cluster queries. Our algorithm can be extended to the case where the query answers are wrong with some bounded probability. Another result we show for the k-means++ seeding is that a small modification of the k-means++ seeding within the SSAC framework converts it into a constant factor approximation algorithm instead of the well-known O(log k)-approximation algorithm.
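To make the query model concrete, here is a toy Python sketch in which exhaustive use of a same-cluster oracle recovers a partition. This is only an illustration of the kind of oracle the SSAC framework assumes, not the paper's algorithm (which combines a bounded number of queries with D^2-sampling); all names are illustrative.

```python
def cluster_with_queries(points, same_cluster):
    """Partition `points` using a same-cluster oracle:
    same_cluster(x, y) answers whether x and y lie in the same optimal cluster.
    Each new point is compared against one representative per known cluster,
    so a point costs at most (#clusters so far) queries."""
    clusters = []  # each cluster is a list; clusters[i][0] is its representative
    for p in points:
        for cluster in clusters:
            if same_cluster(cluster[0], p):
                cluster.append(p)
                break
        else:
            clusters.append([p])  # p starts a new cluster
    return clusters
```

This exhaustive scheme uses O(nk) queries; the point of the paper's results is that poly(k/eps) queries, independent of n, already suffice for a (1+eps)-approximation.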