Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction
We study the problem of k-center clustering with outliers in arbitrary metrics and Euclidean space. Though a number of methods have been developed over the past decades, it is still quite challenging to design a quality-guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez's algorithm, for solving the ordinary k-center clustering problem. Based on some novel observations, we show that this greedy strategy can in fact handle k-center clustering with outliers efficiently, in terms of both clustering quality and time complexity. We further show that the greedy approach yields a small coreset for the problem in doubling metrics, which reduces the time complexity significantly. Our algorithms are easy to implement in practice. We test our method on both synthetic and real datasets. The experimental results suggest that our algorithms can achieve near-optimal solutions with lower running times compared to existing methods.
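The ordinary k-center greedy that inspires this work, Gonzalez's farthest-first traversal, can be sketched as follows (a minimal illustration of the classical 2-approximation only; the paper's outlier handling and coreset machinery are not reproduced here):

```python
import math

def gonzalez_k_center(points, k):
    """Classical farthest-first greedy: a 2-approximation for ordinary
    k-center. Repeatedly picks the point farthest from the current centers."""
    centers = [points[0]]                           # arbitrary first center
    d = [math.dist(p, centers[0]) for p in points]  # distance to nearest center
    for _ in range(k - 1):
        i = max(range(len(points)), key=lambda j: d[j])  # farthest point
        centers.append(points[i])
        # update each point's distance to its nearest center
        d = [min(d[j], math.dist(points[j], points[i]))
             for j in range(len(points))]
    return centers, max(d)  # centers and the resulting clustering radius
```

On two well-separated pairs on a line with k = 2, the greedy picks one center per pair and achieves radius 1.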
On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications
Fair clustering is a constrained variant of clustering where the goal is to
partition a set of colored points, such that the fraction of points of any
color in every cluster is more or less equal to the fraction of points of this
color in the dataset. This variant was recently introduced by Chierichetti et
al. [NeurIPS, 2017] in a seminal work and became widely popular in the
clustering literature. In this paper, we propose a new construction of coresets
for fair clustering based on random sampling. The new construction allows us to
obtain the first coreset for fair clustering in general metric spaces. For
Euclidean spaces, we obtain the first coreset whose size does not depend
exponentially on the dimension. Our coreset results solve open questions
proposed by Schmidt et al. [WAOA, 2019] and Huang et al. [NeurIPS, 2019]. The
new coreset construction helps to design several new approximation and
streaming algorithms. In particular, we obtain the first true
constant-approximation algorithm for metric fair clustering, whose running time
is fixed-parameter tractable (FPT). In the Euclidean case, we derive the first
(1+ε)-approximation algorithm for fair clustering whose time complexity is
near-linear and does not depend exponentially on the dimension of the space.
Besides, our coreset construction scheme is fairly general and gives rise to
coresets for a wide range of constrained clustering problems. This leads to
improved constant-approximations for these problems in general metrics and
near-linear-time (1+ε)-approximations in the Euclidean metric.
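To make the coreset notion concrete, here is a minimal sketch (not the paper's construction, which relies on a more careful random-sampling scheme): a coreset is a small weighted point set whose weighted clustering cost approximates the cost of the full data set for any fixed set of centers. The uniform sampler below is the simplest unbiased summary of this kind; the function names are illustrative.

```python
import math
import random

def clustering_cost(points, centers, weights=None):
    """Weighted k-median-style cost: sum of w(p) * dist(p, nearest center)."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def uniform_coreset(points, size, seed=0):
    """Uniform sample with weight n/size per point: an unbiased estimator
    of the cost, though real coresets use importance sampling to get
    worst-case guarantees."""
    rng = random.Random(seed)
    sample = rng.sample(points, size)
    w = len(points) / size
    return sample, [w] * len(sample)
```

The weighted cost of the sample is, in expectation, the cost of the full set, which is the property a true coreset guarantees uniformly over all center sets.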
Improved Approximation and Scalability for Fair Max-Min Diversification
Given an n-point metric space where each point belongs to one of m different
categories or groups, and a set of integers k_1, ..., k_m, the fair Max-Min
diversification problem is to select k_i points belonging to each category i,
such that the minimum pairwise distance between selected points is maximized.
The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated
by the need to down-sample large data sets in various applications so that the
derived sample achieves a balance over diversity, i.e., the minimum distance
between a pair of selected points, and fairness, i.e., ensuring enough points
of each category are included. We prove the following results:
1. We first consider general metric spaces. We present a randomized
polynomial-time algorithm that returns a constant-factor approximation to the
diversity but only satisfies the fairness constraints in expectation. Building
upon this result, we present an approximation algorithm that is guaranteed to
satisfy the fairness constraints up to a factor of 1 − ε for any constant
ε > 0. We also present a linear-time algorithm returning a constant-factor
approximation with exact fairness, improving upon the best previously known
approximation factor.
2. We then focus on Euclidean metrics. We first show that the problem can be
solved exactly in one dimension. For constant dimensions, a constant number of
categories, and any constant ε > 0, we present a (1 − ε)-approximation
algorithm that runs in polynomial time. The running time can be improved
further at the expense of picking slightly fewer points from each category.
Finally, we present algorithms suitable for processing massive data sets,
including single-pass data stream algorithms and composable coresets for
distributed processing.
Comment: To appear in ICDT 202
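For intuition, the max-min objective and the fairness quotas can be checked by brute force on tiny instances (an illustrative sketch only — the paper's algorithms avoid this exponential search):

```python
import math
from itertools import combinations, product

def diversity(points):
    """Minimum pairwise distance of a selection (the max-min objective)."""
    return min(math.dist(p, q) for p, q in combinations(points, 2))

def fair_max_min_brute_force(groups, quotas):
    """Exhaustive search: try every way of picking quotas[i] points from
    groups[i] and keep the selection with the largest diversity.
    Exponential -- only for sanity-checking tiny instances."""
    best, best_div = None, -1.0
    per_group = [list(combinations(g, q)) for g, q in zip(groups, quotas)]
    for choice in product(*per_group):
        selection = [p for part in choice for p in part]
        d = diversity(selection)
        if d > best_div:
            best, best_div = selection, d
    return best, best_div
```

Note how tightening a quota forces the optimum down: demanding a second point from a crowded category can only shrink the minimum pairwise distance.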
An Empirical Evaluation of k-Means Coresets
Coresets are among the most popular paradigms for summarizing data. In particular, many high-performance coresets exist for clustering problems such as k-means, in both theory and practice. Curiously, there is no existing work comparing the quality of available k-means coresets.
In this paper we perform such an evaluation. No algorithm is currently known for measuring the distortion of a candidate coreset, and we provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
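As a concrete illustration of the measurement problem: the distortion of a coreset is the worst-case relative error of its weighted cost, taken over all candidate center sets, and it is that maximization over all solutions which is hard. The sketch below (hypothetical helper names) only estimates distortion over a finite list of candidate solutions, i.e., the kind of heuristic evaluation a benchmark enables:

```python
import math

def kmeans_cost(points, centers, weights=None):
    """(Weighted) k-means cost: sum of w(p) * squared distance
    to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) ** 2 for c in centers)
               for p, w in zip(points, weights))

def empirical_distortion(points, coreset, weights, candidate_solutions):
    """Distortion estimated over a finite set of candidate center sets.
    The true distortion maximizes over ALL k-center sets, which is the
    computationally difficult part."""
    return max(abs(kmeans_cost(coreset, C, weights) / kmeans_cost(points, C) - 1)
               for C in candidate_solutions)
```

A coreset can be exact for one candidate solution and noticeably off for another, which is why evaluating over a few solutions gives only a lower bound on the true distortion.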
Streaming Algorithms for Diversity Maximization with Fairness Constraints
Diversity maximization is a fundamental problem with wide applications in
data summarization, web search, and recommender systems. Given a set V of n
elements, it asks to select a subset S of k elements from V with maximum
diversity, as quantified by the dissimilarities among the elements in S. In
this paper, we focus on the diversity maximization problem with fairness
constraints in the streaming setting. Specifically, we consider the max-min
diversity objective, which selects a subset S that maximizes the minimum
distance (dissimilarity) between any pair of distinct elements within it.
Assuming that the set V is partitioned into m disjoint groups by some
sensitive attribute, e.g., sex or race, ensuring fairness requires that the
selected subset S contains k_i elements from each group i.
A streaming algorithm should process V sequentially in one pass and return a
subset with maximum diversity while guaranteeing the fairness constraint.
Although diversity maximization has been extensively studied, the only known
algorithms that can work with the max-min diversity objective and fairness
constraints are very inefficient for data streams. Since diversity
maximization is NP-hard in general, we propose two approximation algorithms
for fair diversity maximization in data streams: the first is specific to the
case of two groups, and the second achieves an approximation guarantee for an
arbitrary number of groups. Experimental results on real-world and synthetic
datasets show that both algorithms provide solutions of comparable quality to
the state-of-the-art algorithms while running several orders of magnitude
faster in the streaming setting.
Comment: 13 pages, 11 figures; published in ICDE 202
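A distance-threshold pass is the basic building block behind many streaming diversification algorithms; the sketch below shows the idea in its simplest form (illustrative only — the paper's algorithms are more refined, e.g. running several thresholds in parallel and post-processing to actually meet the quotas, which a single greedy pass like this does not guarantee):

```python
import math

def fair_stream_threshold(stream, quotas, tau):
    """One-pass sketch: keep an arriving element if its group still needs
    points and it lies at distance >= tau from everything kept so far.

    stream: iterable of (point, group) pairs
    quotas: dict mapping group -> desired number of points
    tau:    distance threshold (a guess at the optimal diversity)
    """
    kept = []            # list of (point, group) pairs retained so far
    need = dict(quotas)  # group -> how many points we still want
    for p, g in stream:
        if need.get(g, 0) > 0 and all(math.dist(p, q) >= tau
                                      for q, _ in kept):
            kept.append((p, g))
            need[g] -= 1
    return kept
```

Every pair of kept elements is at distance at least tau by construction, so the selection's diversity is at least the threshold; the quality of the result hinges on choosing tau well, which is why practical algorithms maintain many thresholds simultaneously.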