Center-based clustering is a pivotal primitive for unsupervised
learning and data analysis. A popular variant is the k-means problem,
which, given a set P of points from a metric space and a parameter
k < |P|, requires finding a subset S ⊂ P of k points, dubbed centers,
which minimizes the sum of all squared distances of points in P from
their closest center. A more general formulation, introduced to deal with
noisy datasets, features a further parameter z and allows up to z points of
P (outliers) to be disregarded when computing the aforementioned sum.
We present a distributed coreset-based 3-round approximation algorithm
for k-means with z outliers for general metric spaces, using MapReduce
as a computational model. Our distributed algorithm requires sublinear
local memory per reducer, and yields a solution whose approximation
ratio is within an additive term O(γ) of the one achievable by the
best known polynomial-time sequential (possibly bicriteria) approximation
algorithm, where γ can be made arbitrarily small. An important
feature of our algorithm is that it obliviously adapts to the intrinsic
complexity of the dataset, captured by its doubling dimension D. To the
best of our knowledge, no previous distributed approaches were able to
attain similar quality-performance tradeoffs for general metrics.
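
To make the objective defined above concrete, the following minimal sketch evaluates the cost of a candidate set of centers for k-means with z outliers in a general metric space. It illustrates the problem definition only and is not the distributed coreset-based algorithm; the function name `kmeans_outliers_cost` and its parameters (`points`, `centers`, `z`, `dist`) are hypothetical names chosen for this illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def kmeans_outliers_cost(
    points: Sequence[T],
    centers: Sequence[T],
    z: int,
    dist: Callable[[T, T], float],
) -> float:
    """Cost of `centers` for k-means with z outliers (illustrative only).

    Each point contributes the squared distance to its closest center;
    the z points with the largest contributions are treated as outliers
    and excluded from the sum. `dist` is any metric on the points
    (an assumption of this sketch: it is supplied by the caller).
    """
    if not centers:
        raise ValueError("at least one center is required")
    if z < 0:
        raise ValueError("z must be non-negative")

    # Squared distance from every point to its closest center, ascending.
    squared = sorted(min(dist(p, c) for c in centers) ** 2 for p in points)

    # Disregard the z farthest points (the outliers).
    keep = max(len(squared) - z, 0)
    return sum(squared[:keep])


# Points on the real line with the absolute difference as the metric.
P = [0, 1, 2, 10, 11, 100]
S = [1, 10]  # k = 2 centers chosen from P
line_metric = lambda a, b: abs(a - b)

cost_plain = kmeans_outliers_cost(P, S, 0, line_metric)     # z = 0: plain k-means
cost_outliers = kmeans_outliers_cost(P, S, 1, line_metric)  # z = 1: drops point 100
```

In this toy instance the single far-away point 100 accounts for almost all of the plain k-means cost (8103), whereas allowing z = 1 outlier reduces the cost to 3, which is the kind of sensitivity to noise that motivates the formulation with outliers.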