5 research outputs found

    Improved analysis of D2-sampling based PTAS for k-means and other Clustering problems

    We give an improved analysis of the simple $D^2$-sampling based PTAS for the $k$-means clustering problem given by Jaiswal, Kumar, and Sen (Algorithmica, 2013). The improvement in the running time is from $O\left(nd \cdot 2^{\tilde{O}(k^2/\epsilon)}\right)$ to $O\left(nd \cdot 2^{\tilde{O}(k/\epsilon)}\right)$. Comment: arXiv admin note: substantial text overlap with arXiv:1201.420
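The $D^2$-sampling step at the heart of this PTAS is the k-means++ seeding rule: the first center is picked uniformly, and each subsequent center is drawn with probability proportional to its squared distance to the nearest center chosen so far. A minimal sketch in Python (function names and the use of plain tuples are illustrative, not taken from the paper):

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def d2_sample_centers(points, k, rng=random.Random(0)):
    """Pick k centers by D^2-sampling: the first center is uniform,
    each subsequent one is drawn with probability proportional to its
    squared distance to the nearest center chosen so far."""
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        d2 = [min(sq_dist(x, c) for c in centers) for x in points]
        total = sum(d2)
        if total == 0:  # every point coincides with a chosen center
            centers.append(rng.choice(points))
            continue
        # inverse-transform sampling over the d2 weights
        r = rng.random() * total
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers
```

The PTAS analysed in the paper repeats this sampling to build a list of candidate center sets; the sketch above shows only one draw of $k$ centers.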

    Faster Balanced Clusterings in High Dimension

    The problem of constrained clustering has attracted significant attention in the past decades. In this paper, we study the balanced $k$-center, $k$-median, and $k$-means clustering problems, where the size of each cluster is constrained by given lower and upper bounds. These problems are motivated by applications in processing large-scale, high-dimensional data. Existing methods often need to compute complicated matchings (or min-cost flows) to satisfy the balance constraint, and thus suffer from high complexity, especially in high dimension. To address this issue, we develop an effective framework for the three balanced clustering problems, based on a novel spatial partition idea from geometry. For balanced $k$-center clustering, we provide a $4$-approximation algorithm that improves on the existing approximation factors; for balanced $k$-median and $k$-means clustering, our algorithms yield constant and $(1+\epsilon)$-approximation factors for any $\epsilon > 0$. More importantly, our algorithms achieve linear or nearly linear running times when $k$ is a constant, significantly improving on the existing ones. Our results can be easily extended to metric balanced clusterings, and the running times are sub-linear in terms of the complexity of the $n$-point metric
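For comparison, the classic baseline for the unconstrained version of $k$-center is Gonzalez's farthest-first traversal, a 2-approximation. A short sketch (illustrative only; unlike the paper's 4-approximation, it does not enforce the lower/upper bounds on cluster sizes):

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def gonzalez_k_center(points, k):
    """Farthest-first traversal: a classic 2-approximation for the
    unconstrained k-center problem. It repeatedly adds the point
    farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points,
                       key=lambda x: min(sq_dist(x, c) for c in centers))
        centers.append(farthest)
    return centers
```

The balance constraint is exactly what breaks this greedy argument: a far-away point may belong to a cluster that is already full, which is why constrained methods traditionally fall back on matchings or min-cost flows.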

    Simple and sharp analysis of k-means||

    We present a simple analysis of k-means|| (Bahmani et al., PVLDB 2012) -- a distributed variant of the k-means++ algorithm (Arthur and Vassilvitskii, SODA 2007). Moreover, the bound on the number of rounds is improved from $O(\log n)$ to $O(\log n / \log\log n)$, which we show to be tight
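The rounds being bounded here are k-means||'s oversampling rounds: in each round, every point independently joins the candidate set $C$ with probability $\min(1, \ell \cdot d^2(x, C) / \phi_X(C))$, where $\ell$ is the oversampling factor and $\phi_X(C)$ the current total cost. A hedged sketch of that loop (the final reclustering of $C$ down to $k$ centers is omitted, and names are illustrative):

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans_parallel_oversample(points, ell, rounds, rng=random.Random(0)):
    """k-means||-style oversampling: each round, every point is added
    to the candidate set C independently with probability
    min(1, ell * d^2(x, C) / phi), where phi is the current total cost.
    The full algorithm then reclusters C down to k centers (omitted)."""
    C = [rng.choice(points)]
    for _ in range(rounds):
        d2 = [min(sq_dist(x, c) for c in C) for x in points]
        phi = sum(d2)
        if phi == 0:  # every point already coincides with a candidate
            break
        for x, w in zip(points, d2):
            if rng.random() < min(1.0, ell * w / phi):
                C.append(x)
    return C
```

In the distributed setting each round is one pass over the partitioned data, so reducing the number of rounds from $O(\log n)$ to $O(\log n / \log\log n)$ directly reduces the number of synchronisation steps.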

    Speeding Up Constrained $k$-Means Through 2-Means

    For the constrained 2-means problem, we present an $O\left(dn + d(\frac{1}{\epsilon})^{O(\frac{1}{\epsilon})}\log n\right)$ time algorithm. It generates a collection $U$ of approximate center pairs $(c_1, c_2)$ such that one of the pairs in $U$ induces a $(1+\epsilon)$-approximation for the problem. The existing approximation scheme for the constrained 2-means problem takes $O((\frac{1}{\epsilon})^{O(\frac{1}{\epsilon})}dn)$ time, and the existing approximation scheme for the constrained $k$-means problem takes $O((\frac{k}{\epsilon})^{O(\frac{k}{\epsilon})}dn)$ time. Using the method developed in this paper, we point out that every existing approximation scheme for the constrained $k$-means problem with running time $C(k, n, d, \epsilon)$ can be transformed into a new approximation scheme with time complexity $C(k, n, d, \epsilon)/k^{\Omega(\frac{1}{\epsilon})}$
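The output of such a scheme is a collection $U$ of candidate pairs, at least one of which is near-optimal; a downstream step evaluates each pair under the constraint and keeps the cheapest. A hedged sketch of that selection step, with a hypothetical `assign` callback standing in for the (unspecified) constraint-respecting assignment routine:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def best_pair(points, candidate_pairs, assign):
    """Given a collection U of candidate center pairs, return the pair
    with minimum 2-means cost under `assign`, a hypothetical callback
    mapping (points, pair) to a constraint-respecting 0/1 label list."""
    def cost(pair):
        labels = assign(points, pair)
        return sum(sq_dist(x, pair[lab]) for x, lab in zip(points, labels))
    return min(candidate_pairs, key=cost)

def nearest_assign(points, pair):
    """Unconstrained example assignment: each point to its nearer center."""
    return [0 if sq_dist(x, pair[0]) <= sq_dist(x, pair[1]) else 1
            for x in points]
```

Because `assign` is pluggable, the same selection loop works for any constraint (balance, chromatic, fault tolerance, etc.) for which a feasible assignment can be computed.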

    Streaming PTAS for Constrained k-Means

    We generalise the results of Bhattacharya et al. (Journal of Computing Systems, 62(1):93-115, 2018) for the list-$k$-means problem, defined as follows: for an (unknown) partition $X_1, ..., X_k$ of the dataset $X \subseteq \mathbb{R}^d$, find a list of $k$-center sets (each element in the list is a set of $k$ centers) such that at least one of the $k$-center sets $\{c_1, ..., c_k\}$ in the list gives a $(1+\varepsilon)$-approximation with respect to the cost function $\min_{\textrm{permutation } \pi} \left[ \sum_{i=1}^{k} \sum_{x \in X_i} ||x - c_{\pi(i)}||^2 \right]$. The list-$k$-means problem is important for the constrained $k$-means problem, since algorithms for the former can be converted into PTAS for various versions of the latter. The consequences of our generalisations are as follows:
    - Streaming algorithm: our $D^2$-sampling based algorithm, which runs in a single iteration, allows us to design a 2-pass, logspace streaming algorithm for the list-$k$-means problem. This can be converted into a 4-pass, logspace streaming PTAS for various constrained versions of the $k$-means problem.
    - Faster PTAS under stability: our generalisation is also useful in $k$-means clustering scenarios where finding good centers becomes easy once good centers for a few "bad" clusters have been chosen. One such scenario is clustering under stability, where the number of such bad clusters is a constant. Using the above idea, we significantly improve the running time of the known algorithm from $O(dn^3)(k \log{n})^{poly(\frac{1}{\beta}, \frac{1}{\varepsilon})}$ to $O\left(dn^3 k^{\tilde{O}_{\beta\varepsilon}(\frac{1}{\beta\varepsilon})}\right)$.
    Comment: Changes from previous version: (i) added discussion on coreset, and (ii) fixed few typo
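The cost function in the definition above, with its minimum over all permutations matching centers to the parts of the partition, can be computed directly for small $k$ by brute force (a sketch; function names are illustrative):

```python
from itertools import permutations

def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def list_kmeans_cost(partition, centers):
    """Evaluate min over permutations pi of
    sum_i sum_{x in X_i} ||x - c_{pi(i)}||^2.
    Brute force over k! permutations, so only sensible for small k."""
    k = len(partition)
    return min(
        sum(sq_dist(x, centers[pi[i]])
            for i in range(k) for x in partition[i])
        for pi in permutations(range(k))
    )
```

The permutation is what distinguishes list-$k$-means from plain $k$-means: since the partition is fixed in advance (by the constraint) rather than induced by nearest centers, the centers must be matched to the parts in the best possible order.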