
    Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction

    We study the problem of k-center clustering with outliers in arbitrary metrics and in Euclidean space. Though a number of methods have been developed over the past decades, it remains quite challenging to design a quality-guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez's algorithm, for solving the ordinary k-center clustering problem. Based on some novel observations, we show that this greedy strategy can in fact handle k-center clustering with outliers efficiently, in terms of both clustering quality and time complexity. We further show that the greedy approach yields a small coreset for the problem in doubling metrics, which reduces the time complexity significantly. Our algorithms are easy to implement in practice. We test our method on both synthetic and real datasets. The experimental results suggest that our algorithms achieve near-optimal solutions with lower running times than existing methods.
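
    For context, here is a minimal Python sketch of Gonzalez's farthest-first traversal for ordinary k-center, the greedy subroutine the paper builds on. It uses Euclidean distance for concreteness (the paper also covers arbitrary metrics), and the outlier-handling variant adds further machinery not reproduced here.

    import math

    def gonzalez(points, k):
        """Pick k centers greedily: each new center is the point farthest
        from the centers chosen so far; a classic 2-approximation for k-center."""
        centers = [points[0]]                       # arbitrary first center
        dist = [math.dist(p, centers[0]) for p in points]
        for _ in range(k - 1):
            i = max(range(len(points)), key=dist.__getitem__)
            centers.append(points[i])
            # maintain each point's distance to its closest chosen center
            dist = [min(dist[j], math.dist(points[j], points[i]))
                    for j in range(len(points))]
        return centers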

    Improved Approximation and Scalability for Fair Max-Min Diversification

    Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups, and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i \in [m]$ such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial-time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-\epsilon$ for any constant $\epsilon$. We also present a linear-time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories, and any constant $\epsilon>0$, we present a $1+\epsilon$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time, where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+\mathrm{poly}(k)$ at the expense of only picking $(1-\epsilon) k_i$ points from category $i\in [m]$. Finally, we present algorithms suitable for processing massive data sets, including single-pass data stream algorithms and composable coresets for distributed processing.
    Comment: To appear in ICDT 202
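
    To make the objective concrete, the following is a brute-force Python reference implementation that directly encodes the definition above (exponential time, tiny inputs only). It is not one of the paper's algorithms, and the names `points`, `category`, and `ks` are illustrative; it assumes each group has at least the requested number of points and that the total selection size is at least two.

    import math
    from itertools import combinations, product

    def fair_max_min_bruteforce(points, category, ks, dist=math.dist):
        """points: list of points; category[i]: group of points[i];
        ks[g]: number of points required from group g.
        Returns the selection maximizing the minimum pairwise distance."""
        groups = {g: [i for i, c in enumerate(category) if c == g] for g in ks}
        best, best_div = None, -1.0
        # enumerate every way to pick ks[g] indices from each group g
        for choice in product(*(combinations(groups[g], ks[g]) for g in ks)):
            sel = [i for part in choice for i in part]
            div = min(dist(points[a], points[b]) for a, b in combinations(sel, 2))
            if div > best_div:
                best, best_div = sel, div
        return [points[i] for i in best], best_div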

    Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces

    Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-median and k-means variants which, given a set $P$ of points from a metric space and a parameter $k<|P|$, require identifying a set $S$ of $k$ centers minimizing, respectively, the sum of the distances and of the squared distances of all points in $P$ from their closest centers. Our specific focus is on general metric spaces, for which it is reasonable to require that the centers belong to the input set (i.e., $S \subseteq P$). We present coreset-based 3-round distributed approximation algorithms for the above problems using the MapReduce computational model. The algorithms are rather simple and obliviously adapt to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space. Remarkably, the algorithms attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations, and they are very space-efficient for small $D$, requiring local memory sizes substantially sublinear in the input size. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance guarantees in general metric spaces.
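
    Below is a schematic Python sketch of the coreset-based round structure described above, simulated sequentially. The local selection rule, the coreset size, and the final solver are simplified placeholders (the sketch has two phases rather than the paper's three rounds, and uses Euclidean distance for concreteness); the paper's actual algorithms differ in these details.

    import math, random
    from itertools import combinations

    def farthest_first(points, t):
        """Greedy selection (Gonzalez); a stand-in for the local coreset rule."""
        sel = [points[0]]
        d = [math.dist(p, sel[0]) for p in points]
        while len(sel) < min(t, len(points)):
            i = max(range(len(points)), key=d.__getitem__)
            sel.append(points[i])
            d = [min(d[j], math.dist(points[j], points[i])) for j in range(len(points))]
        return sel

    def distributed_kmedian(points, k, num_parts=4):
        pts = random.sample(points, len(points))      # random partition
        parts = [pts[i::num_parts] for i in range(num_parts)]
        coreset = []                                  # (point, weight) pairs
        for part in parts:                            # phase 1: local coresets
            reps = farthest_first(part, 4 * k)        # illustrative coreset size
            w = [0] * len(reps)
            for p in part:                            # weight = points represented
                j = min(range(len(reps)), key=lambda r: math.dist(p, reps[r]))
                w[j] += 1
            coreset += list(zip(reps, w))
        # phase 2: weighted k-median on the small coreset, centers restricted
        # to coreset points (general metric setting); exhaustive placeholder
        # for the best sequential approximation, feasible only for tiny k
        cost = lambda C: sum(wt * min(math.dist(p, c) for c in C) for p, wt in coreset)
        return min(combinations([p for p, _ in coreset], k), key=cost)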

    A New Coreset Framework for Clustering

    Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the distances, raised to the power $z$, of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of solutions, also known as \emph{coresets}, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean spaces and doubling metrics to minor-free metrics and the general metric case.
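
    For concreteness, the objective and the standard coreset guarantee referred to above can be written as follows (standard notation, not specific to this paper's framework):

    \[ \mathrm{cost}_z(P, C) = \sum_{p \in P} \min_{c \in C} d(p, c)^z , \]

    and a weighted set $S$ with weights $w$ is an $\epsilon$-coreset if, for every set $C$ of $k$ centers,

    \[ \Big| \sum_{s \in S} w(s) \min_{c \in C} d(s, c)^z - \mathrm{cost}_z(P, C) \Big| \le \epsilon \cdot \mathrm{cost}_z(P, C) . \]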

    Distributed Clustering in General Metrics via Coresets

    Center-based clustering is a fundamental primitive for data analysis and is very challenging for large datasets. We developed coreset-based, space- and round-efficient MapReduce algorithms to solve the k-center, k-median, and k-means variants in general metrics. Remarkably, the algorithms obliviously adapt to the doubling dimension of the metric space, and attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations.
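
    Since the guarantees here and in the previous abstract are parameterized by the doubling dimension, it helps to recall the standard definition: a metric space has doubling dimension $D$ if every ball of radius $r$ can be covered by at most $2^D$ balls of radius $r/2$; for instance, $\mathbb{R}^d$ under any $\ell_p$ norm has doubling dimension $\Theta(d)$.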

    MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension

    Given a dataset of points in a metric space and an integer $k$, a diversity maximization problem requires determining a subset of $k$ points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most of the past research on diversity maximization has focused on the sequential setting. In this work we present space- and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models and analyze their approximation guarantees for the relevant class of metric spaces of bounded doubling dimension. Like other approaches in the literature, our algorithms rely on the determination of high-quality core-sets, i.e., (much) smaller subsets of the input which contain good approximations to the optimal solution for the whole input. For a variety of diversity objective functions, our algorithms attain an $(\alpha+\epsilon)$-approximation ratio, for any constant $\epsilon>0$, where $\alpha$ is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same diversity objective. This improves substantially over the approximation ratios attainable in Streaming and MapReduce by state-of-the-art algorithms for general metric spaces. We provide extensive experimental evidence of the effectiveness of our algorithms on both real-world and synthetic datasets, scaling up to over a billion points.
    Comment: Extended version of http://www.vldb.org/pvldb/vol10/p469-ceccarello.pdf, PVLDB Volume 10, No. 5, January 201
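
    As an illustration of the composable core-set pattern mentioned above, here is a sequentially simulated Python sketch for the remote-edge (max-min) diversity objective. The partitioning, the core-set size, and the final exhaustive solver are illustrative simplifications, not the paper's exact algorithms, and Euclidean distance stands in for a general doubling metric.

    import math
    from itertools import combinations

    def farthest_first(points, t):
        """Greedy farthest-first selection, a classic local core-set rule
        for diversity objectives."""
        sel = [points[0]]
        d = [math.dist(p, sel[0]) for p in points]
        while len(sel) < min(t, len(points)):
            i = max(range(len(points)), key=d.__getitem__)
            sel.append(points[i])
            d = [min(d[j], math.dist(points[j], points[i])) for j in range(len(points))]
        return sel

    def maxmin_diversity_composable(parts, k):
        """parts: the input split across workers. Each part contributes a
        k-point core-set; their union is then solved by a sequential routine
        (here an exhaustive placeholder, feasible only for tiny k)."""
        union = [p for part in parts for p in farthest_first(part, k)]
        diversity = lambda S: min(math.dist(a, b) for a, b in combinations(S, 2))
        return max(combinations(union, k), key=diversity)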