    Noisy k-Means++ Revisited


    Improved Algorithms for Clustering with Outliers

    Clustering is a fundamental problem in unsupervised learning. In many real-world applications, the data to be clustered often contains various types of noise, which should be removed from the learning process. To address this issue, we consider two variants of such clustering problems, called k-median with m outliers and k-means with m outliers. Existing techniques for both problems either incur relatively large approximation ratios or can only efficiently handle a small number of outliers. In this paper, we present an improved solution to each of them for the case where k is a fixed number and m can be quite large. In particular, we give the first PTAS for the k-median problem with outliers in Euclidean space R^d for possibly large m and d. Our algorithm runs in O(nd((1/ε)(k+m))^((k/ε)^O(1))) time, which considerably improves the previous result, with running time O(nd(m+k)^O(m+k) + ((1/ε)k log n)^O(1)), given by [Feldman and Schulman, SODA 2012]. For the k-means with outliers problem, we introduce a (6+ε)-approximation algorithm for general metric spaces with running time O(n(β(1/ε)(k+m))^k) for some constant β>1. Our algorithm first uses the k-means++ technique to sample O((1/ε)(k+m)) points from the input and then selects the k centers from them. Compared to the more involved existing techniques, our algorithms are much simpler, using only random sampling, yet achieve better approximation ratios.
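
    To make the sampling step concrete, here is a minimal sketch of the D^2-sampling (k-means++ style) step, assuming Euclidean input; the pool size (1/ε)(k+m) follows the abstract, while all names are illustrative and the final selection of the k centers from the pool is the part developed in the paper.

```python
import numpy as np

def d2_oversample(points, k, m, eps, seed=None):
    """Sketch: draw about (1/eps)*(k+m) candidate centers, each new
    candidate sampled with probability proportional to its squared
    distance to the candidates chosen so far (k-means++ style)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    pool_size = int(np.ceil((1.0 / eps) * (k + m)))
    chosen = [int(rng.integers(n))]
    # Squared distance of every point to its closest chosen candidate.
    d2 = np.sum((points - points[chosen[0]]) ** 2, axis=1)
    for _ in range(pool_size - 1):
        total = d2.sum()
        if total <= 0:  # every point already coincides with a candidate
            break
        nxt = int(rng.choice(n, p=d2 / total))
        chosen.append(nxt)
        d2 = np.minimum(d2, np.sum((points - points[nxt]) ** 2, axis=1))
    return points[chosen]
```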

    Distributed k-Means with Outliers in General Metrics

    Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the k-means problem, which, given a set P of points from a metric space and a parameter k < |P|, requires finding a subset S ⊂ P of k points, dubbed centers, which minimizes the sum of all squared distances of points in P from their closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as the computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known polynomial-time sequential (possibly bicriteria) approximation algorithm, where γ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.
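
    The three rounds can be sketched schematically, with plain Python functions standing in for MapReduce reducers; the local summaries below (uniformly sampled proxies weighted by how many local points they represent) are deliberately simplified placeholders, not the paper's actual coreset construction.

```python
import numpy as np

def three_round_sketch(P, k, z, num_partitions, seed=None):
    """Schematic of a 3-round distributed coreset pipeline for
    k-means with z outliers (round structure only)."""
    rng = np.random.default_rng(seed)
    # Round 1: shuffle and partition the input across reducers.
    parts = np.array_split(rng.permutation(P), num_partitions)
    coreset_pts, coreset_wts = [], []
    # Round 2: each reducer emits a small weighted local summary.
    for part in parts:
        idx = rng.choice(len(part), size=min(k + z, len(part)), replace=False)
        proxies = part[idx]
        owner = ((part[:, None, :] - proxies[None, :, :]) ** 2).sum(-1).argmin(1)
        coreset_pts.append(proxies)
        coreset_wts.append(np.bincount(owner, minlength=len(proxies)))
    # Round 3: a single reducer would now run a weighted sequential
    # k-means-with-outliers approximation on the union of summaries.
    return np.vstack(coreset_pts), np.concatenate(coreset_wts)
```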

    Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces

    Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-median and k-means variants which, given a set P of points from a metric space and a parameter k < |P|, require identifying a set S of k centers minimizing, respectively, the sum of the distances and of the squared distances of all points in P from their closest centers. Our specific focus is on general metric spaces, for which it is reasonable to require that the centers belong to the input set (i.e., S ⊆ P). We present coreset-based 3-round distributed approximation algorithms for the above problems using the MapReduce computational model. The algorithms are rather simple and obliviously adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Remarkably, the algorithms attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations, and they are very space efficient for small D, requiring local memory sizes substantially sublinear in the input size. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance guarantees in general metric spaces.
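
    For concreteness, the two objectives, together with the weighted variant needed when a solution is scored on a coreset, can be written as one small helper; this is a generic sketch, not code from the paper.

```python
import numpy as np

def clustering_cost(P, centers, squared=False, weights=None):
    """Sum over points of the (squared) distance to the closest center:
    squared=False gives the k-median objective, squared=True the k-means
    objective; `weights` scores a weighted coreset with the same code."""
    d = np.sqrt(((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
    per_point = (d ** 2 if squared else d).min(axis=1)
    return (per_point * weights).sum() if weights is not None else per_point.sum()
```

    Since centers are required to belong to the input set, any candidate `centers` passed to this helper would simply be a subset of the rows of P.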

    A Faster k-means++ Algorithm

    K-means++ is an important algorithm for choosing initial cluster centers for the k-means clustering algorithm. In this work, we present a new algorithm that solves the k-means++ problem with near-optimal running time. Given n data points in R^d, the current state-of-the-art algorithm runs in Õ(k) iterations, and each iteration takes Õ(ndk) time. The overall running time is thus Õ(ndk^2). We propose a new algorithm, FastKmeans++, that takes only Õ(nd + nk^2) time in total.
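
    For a practical point of reference, scikit-learn exposes a standard k-means++ seeding routine (the FastKmeans++ algorithm above is not part of scikit-learn); a minimal usage sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

# Synthetic data purely for illustration.
X = np.random.default_rng(0).normal(size=(10_000, 16))

# Standard k-means++ seeding: returns the chosen centers and their indices.
centers, indices = kmeans_plusplus(X, n_clusters=8, random_state=0)
print(centers.shape)  # (8, 16)
```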

    An Empirical Evaluation of k-Means Coresets

    Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high-performance coresets for clustering problems such as k-means, in both theory and practice. Curiously, no prior work has compared the quality of the available k-means coresets. In this paper we perform such an evaluation. There is currently no known algorithm for measuring the distortion of a candidate coreset, and we provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
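
    A heuristic in this spirit can be phrased as a cost comparison: the sketch below reports the worst observed cost ratio of a weighted coreset over a finite set of candidate solutions, whereas the true distortion takes a supremum over all solutions, which is exactly the part argued above to be computationally difficult.

```python
import numpy as np

def empirical_distortion(P, C, w, solutions):
    """Worst k-means cost ratio between the full data P and a weighted
    coreset (C, w) over a finite list of candidate center sets. This is
    a lower bound on the true distortion, which maximizes over all
    solutions."""
    def cost(points, centers, weights=None):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        return (d2 * weights).sum() if weights is not None else d2.sum()
    ratios = [max(cost(P, S) / cost(C, S, w), cost(C, S, w) / cost(P, S))
              for S in solutions]
    return max(ratios)
```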