    An $O^*(1.0821^n)$-Time Algorithm for Computing Maximum Independent Set in Graphs with Bounded Degree 3

    We give an $O^*(1.0821^n)$-time, polynomial-space algorithm for computing Maximum Independent Set in graphs with bounded degree 3. This improves on all previously known running time bounds for the problem.

    A Quantum Approximation Scheme for k-Means

    We give a quantum approximation scheme (i.e., a $(1 + \varepsilon)$-approximation for every $\varepsilon > 0$) for the classical $k$-means clustering problem in the QRAM model with a running time that has only polylogarithmic dependence on the number of data points. More specifically, given a dataset $V$ with $N$ points in $\mathbb{R}^d$ stored in a QRAM data structure, our quantum algorithm runs in time $\tilde{O} \left( 2^{\tilde{O}(\frac{k}{\varepsilon})} \eta^2 d\right)$ and with high probability outputs a set $C$ of $k$ centers such that $cost(V, C) \leq (1+\varepsilon) \cdot cost(V, C_{OPT})$. Here $C_{OPT}$ denotes the optimal $k$-centers, $cost(.)$ denotes the standard $k$-means cost function (i.e., the sum of the squared distances of points to their closest center), and $\eta$ is the aspect ratio (i.e., the ratio of maximum distance to minimum distance). This is the first quantum algorithm with a polylogarithmic running time that gives a provable approximation guarantee of $(1+\varepsilon)$ for the $k$-means problem. Also, unlike previous works on unsupervised learning, our quantum algorithm does not require quantum linear algebra subroutines and has a running time independent of parameters (e.g., condition number) that appear in such procedures.

    A simple D^2-sampling based PTAS for k-means and other Clustering Problems

    Given a set of points $P \subset \mathbb{R}^d$, the $k$-means clustering problem is to find a set of $k$ {\em centers} $C = \{c_1,...,c_k\}, c_i \in \mathbb{R}^d,$ such that the objective function $\sum_{x \in P} d(x,C)^2$, where $d(x,C)$ denotes the distance between $x$ and the closest center in $C$, is minimized. This is one of the most prominent objective functions that have been studied with respect to clustering. $D^2$-sampling \cite{ArthurV07} is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points $P \subseteq \mathbb{R}^d$, the first point is chosen uniformly at random from $P$. Subsequently, a point from $P$ is chosen as the next sample with probability proportional to the square of the distance of this point to the nearest previously sampled point. $D^2$-sampling has been shown to have nice properties with respect to the $k$-means clustering problem. Arthur and Vassilvitskii \cite{ArthurV07} show that $k$ points chosen as centers from $P$ using $D^2$-sampling give an $O(\log{k})$ approximation in expectation. Ailon et al. \cite{AJMonteleoni09} and Aggarwal et al. \cite{AggarwalDK09} extended the results of \cite{ArthurV07} to show that $O(k)$ points chosen as centers using $D^2$-sampling give an $O(1)$ approximation to the $k$-means objective function with high probability. In this paper, we further demonstrate the power of $D^2$-sampling by giving a simple randomized $(1 + \epsilon)$-approximation algorithm that uses $D^2$-sampling at its core.
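The $D^2$-sampling procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch of the seeding step, not the $(1 + \epsilon)$-approximation algorithm of the paper. The function names `d2_sample`, `squared_distance`, and `kmeans_cost` are hypothetical, points are assumed to be tuples of numbers, and the fallback to uniform sampling when every point coincides with a chosen center is an assumption made here to keep the sketch well defined.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample(points, k, rng=None):
    """Pick k centers from `points` by D^2-sampling (hypothetical helper)."""
    rng = rng or random.Random()
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    # nearest[i] is the squared distance from points[i] to its closest center.
    nearest = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(nearest) == 0:
            # Assumption: every point already coincides with a center,
            # so fall back to uniform sampling.
            centers.append(rng.choice(points))
            continue
        # Sample an index with probability proportional to nearest[i].
        idx = rng.choices(range(len(points)), weights=nearest, k=1)[0]
        centers.append(points[idx])
        nearest = [
            min(d, squared_distance(p, points[idx]))
            for d, p in zip(nearest, points)
        ]
    return centers


def kmeans_cost(points, centers):
    """The k-means objective: sum over x of d(x, C)^2."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)
```

A call such as `d2_sample(points, k, random.Random(0))` returns `k` points of the input, and `kmeans_cost` evaluates them against the objective $\sum_{x \in P} d(x,C)^2$.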

    Hardness of Approximation for Euclidean k-Median

    The Euclidean $k$-median problem is defined in the following manner: given a set $\mathcal{X}$ of $n$ points in $d$-dimensional Euclidean space $\mathbb{R}^d$, and an integer $k$, find a set $C \subset \mathbb{R}^d$ of $k$ points (called centers) such that the cost function $\Phi(C,\mathcal{X}) \equiv \sum_{x \in \mathcal{X}} \min_{c \in C} \|x-c\|_2$ is minimized. The Euclidean $k$-means problem is defined similarly by replacing the distance with squared Euclidean distance in the cost function. Various hardness of approximation results are known for the Euclidean $k$-means problem [Pranjal Awasthi et al., 2015; Euiwoong Lee et al., 2017; Vincent Cohen-Addad and Karthik C. S., 2019]. However, no hardness of approximation result was known for the Euclidean $k$-median problem. In this work, assuming the unique games conjecture (UGC), we provide a hardness of approximation result for the Euclidean $k$-median problem in $O(\log k)$-dimensional space. This solves an open question posed explicitly in the work of Awasthi et al. [Pranjal Awasthi et al., 2015]. Furthermore, we study the hardness of approximation for the Euclidean $k$-means/$k$-median problems in the bi-criteria setting where an algorithm is allowed to choose more than $k$ centers. That is, bi-criteria approximation algorithms are allowed to output $\beta k$ centers (for constant $\beta > 1$) and the approximation ratio is computed with respect to the optimal $k$-means/$k$-median cost. We show a hardness of bi-criteria approximation result for the Euclidean $k$-median problem for any $\beta < 1.015$, assuming UGC. We also show a similar hardness of bi-criteria approximation result for the Euclidean $k$-means problem with a stronger bound of $\beta < 1.28$, again assuming UGC.

    FPT Approximation for Constrained Metric k-Median/Means

    The metric $k$-median problem over a metric space $(\mathcal{X}, d)$ is defined as follows: given a set $L \subseteq \mathcal{X}$ of facility locations and a set $C \subseteq \mathcal{X}$ of clients, open a set $F \subseteq L$ of $k$ facilities such that the total service cost, defined as $\Phi(F, C) \equiv \sum_{x \in C} \min_{f \in F} d(x, f)$, is minimised. The metric $k$-means problem is defined similarly using squared distances. In many applications there are additional constraints that any solution needs to satisfy. This gives rise to different constrained versions of the problem such as the $r$-gather, fault-tolerant, and outlier $k$-means/$k$-median problems. Surprisingly, for many of these constrained problems, no constant-approximation algorithm is known. We give FPT algorithms with constant approximation guarantee for a range of constrained $k$-median/means problems. For some of the constrained problems, ours is the first constant factor approximation algorithm whereas for others, we improve or match the approximation guarantee of previous works. We work within the unified framework of Ding and Xu that allows us to simultaneously obtain algorithms for a range of constrained problems. In particular, we obtain a $(3+\varepsilon)$-approximation and $(9+\varepsilon)$-approximation for the constrained versions of the $k$-median and $k$-means problem respectively in FPT time. In many practical settings of the $k$-median/means problem, one is allowed to open a facility at any client location, i.e., $C \subseteq L$. For this special case, our algorithm gives a $(2+\varepsilon)$-approximation and $(4+\varepsilon)$-approximation for the constrained versions of the $k$-median and $k$-means problem respectively in FPT time. Since our algorithm is based on a simple sampling technique, it can also be converted to a constant-pass log-space streaming algorithm.
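To make the service cost $\Phi(F, C)$ concrete, the following sketch evaluates it for a given facility set and finds an optimal set of $k$ facilities by exhaustive search over all $k$-subsets of $L$. It is a brute-force baseline for tiny unconstrained instances only, not the FPT approximation algorithm of the paper. The names `service_cost` and `brute_force_k_median`, and the `power` parameter (1 for $k$-median, 2 for $k$-means), are hypothetical choices made for this illustration.

```python
from itertools import combinations


def service_cost(facilities, clients, dist, power=1):
    """Phi(F, C): each client pays its distance to the nearest open facility.

    power=1 gives the k-median cost, power=2 the k-means cost.
    """
    return sum(min(dist(x, f) for f in facilities) ** power for x in clients)


def brute_force_k_median(locations, clients, k, dist, power=1):
    """Try every k-subset of `locations`; exponential in k, for tiny inputs."""
    best = min(
        combinations(locations, k),
        key=lambda F: service_cost(F, clients, dist, power),
    )
    return list(best), service_cost(best, clients, dist, power)


if __name__ == "__main__":
    # Points on a line with the absolute difference as the metric.
    line_metric = lambda a, b: abs(a - b)
    print(brute_force_k_median([0, 5, 10], [0, 1, 9, 10], 2, line_metric))
```

The metric is passed in as a function `dist`, so the same sketch applies to any finite metric space, not only points on a line.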

    Faster Algorithms for the Constrained k-Means Problem

    The classical center-based clustering problems such as $k$-means/median/center assume that the optimal clusters satisfy the locality property that the points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not follow such a locality property. For instance, consider the $r$-gather clustering problem where there is an additional constraint that each of the clusters should have at least $r$ points, or the capacitated clustering problem where there is an upper bound on the cluster sizes. Consider a variant of the $k$-means problem that may be regarded as a general version of such problems. Here, the optimal clusters $O_1, ..., O_k$ are an arbitrary partition of the dataset and the goal is to output $k$ centers $c_1, ..., c_k$ such that the objective function $\sum_{i=1}^{k} \sum_{x \in O_{i}} ||x - c_{i}||^2$ is minimized. It is not difficult to argue that any algorithm (without knowing the optimal clusters) that outputs a single set of $k$ centers will not behave well as far as optimizing the above objective function is concerned. However, this does not rule out the existence of algorithms that output a list of such sets of $k$ centers such that at least one set in the list behaves well. Given an error parameter $\varepsilon > 0$, let $\ell$ denote the size of the smallest list of $k$-centers such that at least one of the $k$-centers gives a $(1+\varepsilon)$ approximation w.r.t. the objective function above. In this paper, we show an upper bound on $\ell$ by giving a randomized algorithm that outputs a list of $2^{\tilde{O}(k/\varepsilon)}$ $k$-centers. We also give a closely matching lower bound of $2^{\tilde{\Omega}(k/\sqrt{\varepsilon})}$. Moreover, our algorithm runs in time $O(n d \cdot 2^{\tilde{O}(k/\varepsilon)})$. This is a significant improvement over the previous result of Ding and Xu, who gave an algorithm with running time $O(n d \cdot (\log{n})^{k} \cdot 2^{poly(k/\varepsilon)})$ that outputs a list of size $O((\log{n})^k \cdot 2^{poly(k/\varepsilon)})$. Our techniques generalize to the $k$-median problem and to many other settings where non-Euclidean distance measures are involved.
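The objective above is taken with respect to a fixed partition $O_1, ..., O_k$, and for a fixed cluster the center minimizing the sum of squared distances is its centroid (a standard fact). The sketch below evaluates the objective for a known partition and picks the best entry from a list of candidate center sets. It only illustrates the list-based formulation and is not the paper's algorithm, which does not know the partition. The names `centroid`, `partition_cost`, and `best_from_list` are hypothetical, and the sketch assumes the $i$-th center in each candidate is meant for cluster $O_i$.

```python
def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def partition_cost(clusters, centers):
    """sum_i sum_{x in O_i} ||x - c_i||^2 for a fixed partition.

    Assumption: centers[i] is the center assigned to clusters[i].
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, c))
        for cluster, c in zip(clusters, centers)
        for x in cluster
    )


def best_from_list(clusters, candidates):
    """Pick the candidate center set with the smallest partition cost."""
    return min(candidates, key=lambda centers: partition_cost(clusters, centers))


if __name__ == "__main__":
    clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
    optimal = [centroid(c) for c in clusters]
    print(optimal, partition_cost(clusters, optimal))
```

Comparing each candidate against the centroid solution gives the approximation ratio that the list guarantee refers to.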

    On the Distribution of the Fourier Spectrum of Halfspaces

    Bourgain showed that any noise stable Boolean function $f$ can be well-approximated by a junta. In this note we give an exponential sharpening of the parameters of Bourgain's result under the additional assumption that $f$ is a halfspace.

    Approximate Clustering with Same-Cluster Queries

    Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?", where the answers to these queries are assumed to be consistent with a unique optimal solution. There are many clustering contexts where such same-cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the $k$-means clustering problem, with an additional margin assumption, can be solved efficiently if one is allowed to make $O(k^2 \log{k} + k \log{n})$ same-cluster queries. This is interesting since the $k$-means problem, even with the margin assumption, is NP-hard. In this paper, we extend the work of Ashtiani et al. to the approximation setting by showing that a few such same-cluster queries enable one to get a polynomial-time $(1+\varepsilon)$-approximation algorithm for the $k$-means problem without any margin assumption on the input dataset. Again, this is interesting since the $k$-means problem is NP-hard to approximate within a factor $(1+c)$ for a fixed constant $0 < c < 1$. The number of same-cluster queries used by the algorithm is $poly(k/\varepsilon)$, which is independent of the size $n$ of the dataset. Our algorithm is based on the $D^2$-sampling technique, also known as the $k$-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make $\Omega(k/poly \log k)$ same-cluster queries. Our algorithm can be extended to the case where the query answers are wrong with some bounded probability. Another result we show for the $k$-means++ seeding is that a small modification of the $k$-means++ seeding within the SSAC framework converts it to a constant factor approximation algorithm instead of the well-known $O(\log k)$-approximation algorithm.
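The same-cluster query model can be illustrated with a small simulation. The sketch below is a naive baseline, not the algorithm of Ashtiani et al. or of this paper: it recovers the full partition by comparing each point against one representative per discovered cluster, which costs at most $k$ queries per point, so the total grows with $n$, unlike the $poly(k/\varepsilon)$ bound stated above. The names `make_oracle` and `cluster_with_queries` are hypothetical, and the oracle is assumed to answer from a hidden ground-truth labelling without errors.

```python
def make_oracle(labels):
    """Build a same-cluster oracle over hidden labels, with a query counter."""
    stats = {"queries": 0}

    def same_cluster(i, j):
        # Answers "do points i and j belong to the same optimal cluster?"
        stats["queries"] += 1
        return labels[i] == labels[j]

    return same_cluster, stats


def cluster_with_queries(n, same_cluster):
    """Recover the partition of points 0..n-1 using only oracle answers."""
    representatives = []  # one point index per discovered cluster
    clusters = []         # clusters[c] holds the indices in cluster c
    for i in range(n):
        for c, rep in enumerate(representatives):
            if same_cluster(i, rep):
                clusters[c].append(i)
                break
        else:
            # No representative matched, so point i starts a new cluster.
            representatives.append(i)
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    oracle, stats = make_oracle([0, 1, 0, 2, 1])
    print(cluster_with_queries(5, oracle), stats["queries"])
```

Keeping the counter inside the oracle makes it easy to compare the query usage of different strategies on the same hidden labelling.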