Improved Algorithms for Clustering with Outliers
Clustering is a fundamental problem in unsupervised learning. In many real-world applications, the data to be clustered often contains various types of noise, which should be removed from the learning process. To address this issue, we consider in this paper two variants of such clustering problems, called k-median with m outliers and k-means with m outliers. Existing techniques for both problems either incur relatively large approximation ratios or can only efficiently handle a small number of outliers. In this paper, we present an improved solution to each of them for the case where k is a fixed number and m could be quite large. In particular, we give the first PTAS for the k-median problem with outliers in Euclidean space R^d for possibly high m and d. Our algorithm runs in O(nd((1/epsilon)(k+m))^((k/epsilon)^O(1))) time, which considerably improves the previous result (with running time O((nd(m+k)^O(m+k) + (1/epsilon)k log n)^O(1))) given by [Feldman and Schulman, SODA 2012]. For the k-means with outliers problem, we introduce a (6+epsilon)-approximation algorithm for general metric spaces with running time O(n(beta(1/epsilon)(k+m))^k) for some constant beta > 1. Our algorithm first uses the k-means++ technique to sample O((1/epsilon)(k+m)) points from the input and then selects the k centers from them. Compared to the more involved existing techniques, our algorithms are much simpler, i.e., they use only random sampling, yet achieve better performance ratios.
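The sampling step described above can be illustrated with a minimal sketch. The function names (`d2_sample`, `robust_cost`) and the brute-force candidate handling are illustrative assumptions, not the paper's actual implementation; the sketch only shows the k-means++-style D^2-sampling of candidates and the outlier-robust cost that discards the m farthest points.

```python
import random

def d2_sample(points, s, rng):
    """k-means++-style D^2-sampling: draw s candidates, each new point
    chosen with probability proportional to its squared distance to the
    nearest candidate picked so far (first pick is uniform)."""
    candidates = [rng.choice(points)]
    while len(candidates) < s:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in candidates)
              for p in points]
        total = sum(d2)
        if total == 0:  # all points already covered exactly
            candidates.append(rng.choice(points))
            continue
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                candidates.append(p)
                break
    return candidates

def robust_cost(points, centers, m):
    """k-means cost that disregards the m points farthest from their
    closest center, as in the k-means-with-m-outliers objective."""
    d2 = sorted(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
                for p in points)
    return sum(d2[:max(0, len(d2) - m)])
```

In the paper's framework one would draw on the order of (1/epsilon)(k+m) candidates and then select the k centers among them; the sketch above covers only the sampling and the robust objective.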
Distributed k-Means with Outliers in General Metrics
Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the k-means problem, which, given a set P of points from a metric space and a parameter k < |P|, requires finding a subset S ⊂ P of k points, dubbed centers, which minimizes the sum of all squared distances of points in P from their closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known polynomial-time sequential (possibly bicriteria) approximation algorithm, where γ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.
Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces
Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-median and k-means variants which, given a set P of points from a metric space and a parameter k < |P|, require identifying a set S of k centers minimizing, respectively, the sum of the distances and of the squared distances of all points in P from their closest centers. Our specific focus is on general metric spaces, for which it is reasonable to require that the centers belong to the input set (i.e., S ⊆ P). We present coreset-based 3-round distributed approximation algorithms for the above problems using the MapReduce computational model. The algorithms are rather simple and obliviously adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Remarkably, the algorithms attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations, and they are very space efficient for small D, requiring local memory sizes substantially sublinear in the input size. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance guarantees in general metric spaces.
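The coreset-based 3-round pattern can be sketched schematically. This is a generic composable-coreset illustration under simplifying assumptions, not the paper's algorithm: each partition is summarized by a few D^2-sampled proxy points weighted by how many local points they represent (rounds 1-2), and the final round solves weighted k-median on the small union, restricting centers to input points as the abstract requires. All function names are hypothetical, and the final round uses brute-force enumeration, which is only feasible for tiny coresets.

```python
import random
from itertools import combinations

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def local_summary(points, t, rng):
    """Rounds 1-2 (map side): summarize one partition by t proxy points
    chosen via D^2-sampling, each weighted by the number of local points
    assigned to it."""
    proxies = [rng.choice(points)]
    while len(proxies) < t:
        d2 = [min(squared_dist(p, c) for c in proxies) for p in points]
        total = sum(d2) or 1.0
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                proxies.append(p)
                break
        else:
            proxies.append(rng.choice(points))
    weights = [0] * len(proxies)
    for p in points:
        weights[min(range(len(proxies)),
                    key=lambda i: squared_dist(p, proxies[i]))] += 1
    return list(zip(proxies, weights))

def distributed_kmedian(partitions, k, t, seed=0):
    """Round 3 (reduce side): union the local coresets, then solve
    weighted k-median on the union with centers drawn from coreset
    points (brute force here, purely for illustration)."""
    rng = random.Random(seed)
    coreset = [wp for part in partitions for wp in local_summary(part, t, rng)]
    best, best_cost = None, float("inf")
    for centers in combinations([p for p, _ in coreset], k):
        cost = sum(w * min(squared_dist(p, c) for c in centers) ** 0.5
                   for p, w in coreset)
        if cost < best_cost:
            best, best_cost = list(centers), cost
    return best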
A Faster k-means++ Algorithm
K-means++ is an important algorithm for choosing initial cluster centers for the k-means clustering algorithm. In this work, we present a new algorithm that can solve the k-means++ problem with near-optimal running time. Given data points in , the current state-of-the-art algorithm runs in  iterations, and each iteration takes  time. The overall running time is thus . We propose a new algorithm, FastKmeans++, that only takes  time in total.
An Empirical Evaluation of k-Means Coresets
Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high-performance coresets for clustering problems such as k-means in both theory and practice. Curiously, there exists no work comparing the quality of available k-means coresets.
In this paper we perform such an evaluation. There is currently no known algorithm for measuring the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows for an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
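Since no algorithm is known to measure a coreset's distortion exactly, a natural heuristic, sketched below under our own assumptions (the function names and the choice of candidate solutions are illustrative, not the paper's benchmark), is to compare the weighted coreset cost against the true cost on a finite set of candidate center sets; the worst ratio observed is a lower bound on the true worst-case distortion.

```python
def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum of (weighted) squared distances of
    each point to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p, w in zip(points, weights))

def empirical_distortion(data, coreset, solutions):
    """Heuristic lower bound on distortion: worst multiplicative gap
    between coreset cost and true cost over candidate center sets.
    The true distortion (sup over all solutions) is at least this."""
    worst = 1.0
    for centers in solutions:
        true = kmeans_cost(data, centers)
        approx = kmeans_cost([p for p, _ in coreset], centers,
                             [w for _, w in coreset])
        if true > 0 and approx > 0:
            worst = max(worst, approx / true, true / approx)
    return worst
```

Because the bound only probes finitely many solutions, it can understate the true distortion, which is consistent with the paper's point that exact measurement appears computationally difficult.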