Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially.
Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-center variant which, given a set S of points from some metric space and a parameter k < |S|, requires identifying a subset of k centers minimizing the maximum distance of any point of S from its closest center. We present MapReduce and streaming algorithms for this problem and its variant with outliers, which yield solutions whose approximation ratios are a mere additive term ε away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) D. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield better-quality solutions than the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones.
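The objective discussed in this abstract can be stated compactly in code. The following is a minimal illustrative sketch (function name and the `dist` callback are my own, not from the paper): the k-center cost with z outliers is the (z+1)-th largest distance of any point to its closest center, i.e., the clustering radius once the z farthest points are disregarded.

```python
def kcenter_cost(points, centers, z, dist):
    """k-center objective with z outliers: the covering radius after the
    z farthest points are disregarded (z = 0 gives plain k-center)."""
    d = sorted(min(dist(p, c) for c in centers) for p in points)
    return d[len(points) - 1 - z]
```

With z = 0 this reduces to the standard k-center objective that the sequential and distributed algorithms above approximate.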
Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction
We study the problem of k-center clustering with outliers in arbitrary metrics and Euclidean space. Though a number of methods have been developed in the past decades, it is still quite challenging to design a quality-guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez's algorithm, for solving the problem of ordinary k-center clustering. Based on some novel observations, we show that this greedy strategy can actually handle k-center clustering with outliers efficiently, in terms of both clustering quality and time complexity. We further show that the greedy approach yields a small coreset for the problem in doubling metrics, which reduces the time complexity significantly. Our algorithms are easy to implement in practice. We test our method on both synthetic and real datasets. The experimental results suggest that our algorithms can achieve near-optimal solutions and yield lower running times compared with existing methods.
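The greedy method cited here, Gonzalez's algorithm (farthest-first traversal), is the classic 2-approximation for ordinary k-center; the paper's contribution is showing that this strategy also handles outliers. A minimal sketch of the plain algorithm, with illustrative names and a caller-supplied metric (the outlier-aware variant in the paper modifies the selection rule):

```python
def gonzalez(points, k, dist):
    """Farthest-first traversal: repeatedly add as a new center the point
    farthest from the current centers. 2-approximation for plain k-center."""
    centers = [points[0]]
    # d[j] = distance of points[j] to its closest center so far
    d = [dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers
```

Maintaining the closest-center distances incrementally gives O(nk) time overall, which is what makes the greedy approach attractive as a building block for coresets.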
Distributed k-Means with Outliers in General Metrics
Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the k-means problem, which, given a set P of points from a metric space and a parameter k < |P|, requires finding a subset S ⊂ P of k points, dubbed centers, which minimizes the sum of all squared distances of points in P from their closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known polynomial-time sequential (possibly bicriteria) approximation algorithm, where γ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.
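The outlier-tolerant k-means objective defined in this abstract can be written down directly. A minimal illustrative sketch (names and the `dist` callback are my own): sum the squared distances to the closest center after dropping the z largest contributions.

```python
def kmeans_cost_with_outliers(points, centers, z, dist):
    """k-means objective with z outliers: sum of squared distances of
    points to their closest center, disregarding the z farthest points."""
    sq = sorted(min(dist(p, c) for c in centers) ** 2 for p in points)
    return sum(sq[: len(points) - z])
```

Setting z = 0 recovers the standard k-means cost; a single distant point illustrates how even one outlier can otherwise dominate the sum.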
Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces
Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-median and k-means variants which, given a set P of points from a metric space and a parameter k < |P|, require identifying a set S of k centers minimizing, respectively, the sum of the distances and of the squared distances of all points in P from their closest centers. Our specific focus is on general metric spaces, for which it is reasonable to require that the centers belong to the input set (i.e., S ⊆ P). We present coreset-based 3-round distributed approximation algorithms for the above problems using the MapReduce computational model. The algorithms are rather simple and obliviously adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Remarkably, the algorithms attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations, and they are very space efficient for small D, requiring local memory sizes substantially sublinear in the input size. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance guarantees in general metric spaces.
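To illustrate the coreset idea underlying these distributed algorithms, the local (per-partition) round can be sketched as follows. This is a simplified, hypothetical sketch, not the paper's actual procedure: it selects local centers by farthest-first traversal and weights each by its assignment count, whereas the papers use their own selection rules with proven guarantees.

```python
def build_coreset(partition, k, dist):
    """Local round of a coreset-based distributed scheme (simplified):
    pick k representatives from this partition and weight each one by
    the number of partition points it represents."""
    centers = [partition[0]]
    for _ in range(k - 1):
        far = max(partition, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(far)
    weights = [0] * len(centers)
    for p in partition:
        i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        weights[i] += 1
    return list(zip(centers, weights))
```

In the final round, the union of the weighted coresets from all partitions, which is much smaller than the input, is clustered by a sequential approximation algorithm, which is where the near-sequential approximation ratios come from.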
Scalable Distributed Approximation of Internal Measures for Clustering Evaluation
The most widely used internal measure for clustering evaluation is the
silhouette coefficient, whose naive computation requires a quadratic number of
distance calculations, which is clearly unfeasible for massive datasets.
Surprisingly, there are no known general methods to efficiently approximate the
silhouette coefficient of a clustering with rigorously provable high accuracy.
In this paper, we present the first scalable algorithm to compute such a
rigorous approximation for the evaluation of clusterings based on any metric
distances. Our algorithm hinges on a Probability Proportional to Size (PPS)
sampling scheme, and, for any fixed ε, δ ∈ (0, 1), it
approximates the silhouette coefficient within a mere additive error O(ε)
with probability 1 − δ, using a very small number of
distance calculations. We also prove that the algorithm can be adapted to
obtain rigorous approximations of other internal measures of clustering
quality, such as cohesion and separation. Importantly, we provide a distributed
implementation of the algorithm using the MapReduce model, which runs in
constant rounds and requires only sublinear local space at each worker, which
makes our estimation approach applicable to big data scenarios. We perform an
extensive experimental evaluation of our silhouette approximation algorithm,
comparing its performance to a number of baseline heuristics on real and
synthetic datasets. The experiments provide evidence that, unlike other
heuristics, our estimation strategy not only provides tight theoretical
guarantees but is also able to return highly accurate estimations while running
in a fraction of the time required by the exact computation, and that its
distributed implementation is highly scalable, thus enabling the computation of
internal measures for very large datasets for which the exact computation is
prohibitive.
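For reference, the naive quadratic computation that this algorithm avoids looks as follows. This is a minimal sketch for points in Euclidean space with my own naming conventions; the paper's estimator replaces the exact per-point average distances below with PPS-sampled estimates, which is what removes the quadratic cost.

```python
import math

def silhouette(points, labels):
    """Naive O(n^2) silhouette coefficient: for each point, a = mean distance
    to its own cluster, b = mean distance to the nearest other cluster,
    s = (b - a) / max(a, b); return the average of s over all points."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton clusters contribute s(i) = 0 by convention
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(p, points[j]) for j in idxs) / len(idxs)
            for c, idxs in clusters.items() if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)
```

Every point participates in n − 1 distance computations, hence the quadratic total that makes the exact version unfeasible for massive datasets.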