5 research outputs found
Improved analysis of D2-sampling based PTAS for k-means and other Clustering problems
We give an improved analysis of the simple D2-sampling based PTAS for the
k-means clustering problem given by Jaiswal, Kumar, and Sen (Algorithmica,
2013). The improvement on the running time is from to .
Comment: arXiv admin note: substantial text overlap with arXiv:1201.420
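As an illustration of the sampling primitive named in the title, the following is a minimal sketch of plain D2-sampling (the seeding rule used by k-means++): each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. It is not the PTAS of Jaiswal, Kumar, and Sen; the function names `sq_dist`, `d2_sample_centers`, and `kmeans_cost` are hypothetical helpers introduced here.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample_centers(points, k, rng=None):
    """Pick k centers from `points` by D2-sampling (illustrative sketch)."""
    rng = rng or random.Random()
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Sample proportionally to squared distance from current centers.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers


def kmeans_cost(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

A call such as `d2_sample_centers(points, 2, random.Random(0))` returns two input points; a point that is already a center has weight zero and is not drawn again while any other point has positive weight.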
Faster Balanced Clusterings in High Dimension
The problem of constrained clustering has attracted significant attention in
the past decades. In this paper, we study the balanced k-center, k-median,
and k-means clustering problems where the size of each cluster is constrained
by the given lower and upper bounds. The problems are motivated by the
applications in processing large-scale data in high dimension. Existing methods
often need to compute complicated matchings (or min cost flows) to satisfy the
balance constraint, and thus suffer from high complexities especially in high
dimension. We develop an effective framework for the three balanced clustering
problems to address this issue, and our method is based on a novel spatial
partition idea in geometry. For the balanced k-center clustering, we provide
a -approximation algorithm that improves the existing approximation factors;
for the balanced k-median and k-means clusterings, our algorithms yield
constant and -approximation factors with any . More
importantly, our algorithms achieve linear or nearly linear running times when
k is a constant, and significantly improve the existing ones. Our results can
be easily extended to metric balanced clusterings and the running times are
sub-linear in terms of the complexity of -point metric
Simple and sharp analysis of k-means||
We present a simple analysis of k-means|| (Bahmani et al., PVLDB 2012) -- a
distributed variant of the k-means++ algorithm (Arthur and Vassilvitskii, SODA
2007). Moreover, the bound on the number of rounds is improved from
to , which we show to be tight
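To make the object of the analysis concrete, here is a rough single-machine sketch of the oversampling rounds of k-means|| as described by Bahmani et al.: in each round every point is selected independently with probability proportional to its squared distance from the current candidate set, scaled by an oversampling factor. The names `kmeans_parallel_candidates` and `candidate_weights`, and the parameters `ell` and `rounds`, are assumptions made for this illustration; the final step that reclusters the weighted candidates down to k centers is omitted.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_parallel_candidates(points, ell, rounds, rng=None):
    """Run the oversampling rounds of k-means|| (illustrative sketch).

    ell    -- oversampling factor (expected number of picks per round)
    rounds -- number of sampling rounds
    """
    rng = rng or random.Random()
    # Start from a single center chosen uniformly at random.
    candidates = [rng.choice(points)]
    d2 = [sq_dist(p, candidates[0]) for p in points]
    for _ in range(rounds):
        cost = sum(d2)
        if cost == 0:
            break  # every point already coincides with a candidate
        # Each point is picked independently; in a distributed setting this
        # loop would be split across machines.
        picked = [p for p, w in zip(points, d2)
                  if rng.random() < min(1.0, ell * w / cost)]
        candidates.extend(picked)
        d2 = [min([old] + [sq_dist(p, c) for c in picked])
              for old, p in zip(d2, points)]
    return candidates


def candidate_weights(points, candidates):
    """Weight each candidate by the number of points closest to it."""
    weights = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)),
                key=lambda i: sq_dist(p, candidates[i]))
        weights[j] += 1
    return weights
```

The weighted candidates would then be reduced to k centers, for example by a weighted run of k-means++; that step is left out to keep the sketch short.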
Speeding Up Constrained k-Means Through 2-Means
For the constrained 2-means problem, we present a
time
algorithm. It generates a collection of approximate center pairs such that one of the pairs in can induce a -approximation
for the problem. The existing approximation scheme for the constrained 2-means
problem takes time, and the
existing approximation scheme for the constrained k-means problem takes
time. Using the method
developed in this paper, we point out that every existing approximation scheme
for the constrained k-means so far with time can be
transformed to a new approximation scheme with time complexity
Streaming PTAS for Constrained k-Means
We generalise the results of Bhattacharya et al. (Journal of Computing
Systems, 62(1):93-115, 2018) for the list-k-means problem, defined as follows: for an
(unknown) partition of the dataset ,
find a list of k-center sets (each element in the list is a set of
centers) such that at least one of the k-center sets in the
list gives an -approximation with respect to the cost function
. The list-k-means problem is important for the
constrained k-means problem since algorithms for the former can be converted
to a PTAS for various versions of the latter. The consequences of
our generalisations are as follows:
- Streaming algorithm: Our D2-sampling based algorithm running in a single
iteration allows us to design a 2-pass, logspace streaming algorithm for the
list-k-means problem. This can be converted to a 4-pass, logspace streaming
PTAS for various constrained versions of the k-means problem.
- Faster PTAS under stability: Our generalisation is also useful in k-means
clustering scenarios where finding good centers becomes easy once good centers
for a few "bad" clusters have been chosen. One such scenario is clustering
under stability where the number of such bad clusters is a constant. Using the
above idea, we significantly improve the running time of the known algorithm
from to .
Comment: Changes from previous version: (i) added discussion on coreset, and
(ii) fixed few typo