Improved Algorithms for Clustering with Outliers
Clustering is a fundamental problem in unsupervised learning. In many real-world applications, the data to be clustered often contains various types of noise, which should be removed from the learning process. To address this issue, we consider in this paper two variants of such clustering problems, called k-median with m outliers and k-means with m outliers. Existing techniques for both problems either incur relatively large approximation ratios or can only efficiently handle a small number of outliers. In this paper, we present an improved solution to each of them for the case where k is a fixed number and m could be quite large. In particular, we give the first PTAS for the k-median problem with outliers in Euclidean space R^d for possibly high m and d. Our algorithm runs in O(nd((1/epsilon)(k+m))^((k/epsilon)^O(1))) time, which considerably improves the previous result (with running time O((nd(m+k)^O(m+k) + (1/epsilon)k log n)^O(1))) given by [Feldman and Schulman, SODA 2012]. For the k-means with outliers problem, we introduce a (6+epsilon)-approximation algorithm for general metric spaces with running time O(n(beta(1/epsilon)(k+m))^k) for some constant beta > 1. Our algorithm first uses the k-means++ technique to sample O((1/epsilon)(k+m)) points from the input and then selects the k centers from them. Compared to the more involved existing techniques, our algorithms are much simpler, i.e., they use only random sampling, yet achieve better performance ratios.
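The sampling step described above can be illustrated with a minimal sketch. The function names (`d2_sample`, `robust_cost`) and the brute-force candidate handling are illustrative assumptions, not the paper's actual implementation; the sketch only shows the k-means++-style D^2-sampling of candidates and the outlier-robust cost that discards the m farthest points.

```python
import random

def d2_sample(points, s, rng):
    """k-means++-style D^2-sampling: draw s candidates, each new point
    chosen with probability proportional to its squared distance to the
    nearest candidate picked so far (first pick is uniform)."""
    candidates = [rng.choice(points)]
    while len(candidates) < s:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in candidates)
              for p in points]
        total = sum(d2)
        if total == 0:  # all points already covered exactly
            candidates.append(rng.choice(points))
            continue
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                candidates.append(p)
                break
    return candidates

def robust_cost(points, centers, m):
    """k-means cost that disregards the m points farthest from their
    closest center, as in the k-means-with-m-outliers objective."""
    d2 = sorted(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
                for p in points)
    return sum(d2[:max(0, len(d2) - m)])
```

In the paper's framework one would draw on the order of (1/epsilon)(k+m) candidates and then select the k centers among them; the sketch above covers only the sampling and the robust objective.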
Distributed k-Means with Outliers in General Metrics
Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the k-means problem, which, given a set P of points from a metric space and a parameter k < |P|, requires finding a subset S ⊂ P of k points, dubbed centers, which minimizes the sum of all squared distances of points in P from their closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known polynomial-time sequential (possibly bicriteria) approximation algorithm, where γ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.
Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces
Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-median and k-means variants which, given a set P of points from a metric space and a parameter k < |P|, require identifying a set S of k centers minimizing, respectively, the sum of the distances and of the squared distances of all points in P from their closest centers. Our specific focus is on general metric spaces, for which it is reasonable to require that the centers belong to the input set (i.e., S ⊆ P). We present coreset-based 3-round distributed approximation algorithms for the above problems using the MapReduce computational model. The algorithms are rather simple and obliviously adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Remarkably, the algorithms attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations, and they are very space efficient for small D, requiring local memory sizes substantially sublinear in the input size. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance guarantees in general metric spaces.
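The coreset-based 3-round pattern can be sketched schematically. This is a generic composable-coreset illustration under simplifying assumptions, not the paper's algorithm: each partition is summarized by a few D^2-sampled proxy points weighted by how many local points they represent (rounds 1-2), and the final round solves weighted k-median on the small union, restricting centers to input points as the abstract requires. All function names are hypothetical, and the final round uses brute-force enumeration, which is only feasible for tiny coresets.

```python
import random
from itertools import combinations

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def local_summary(points, t, rng):
    """Rounds 1-2 (map side): summarize one partition by t proxy points
    chosen via D^2-sampling, each weighted by the number of local points
    assigned to it."""
    proxies = [rng.choice(points)]
    while len(proxies) < t:
        d2 = [min(squared_dist(p, c) for c in proxies) for p in points]
        total = sum(d2) or 1.0
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                proxies.append(p)
                break
        else:
            proxies.append(rng.choice(points))
    weights = [0] * len(proxies)
    for p in points:
        weights[min(range(len(proxies)),
                    key=lambda i: squared_dist(p, proxies[i]))] += 1
    return list(zip(proxies, weights))

def distributed_kmedian(partitions, k, t, seed=0):
    """Round 3 (reduce side): union the local coresets, then solve
    weighted k-median on the union with centers drawn from coreset
    points (brute force here, purely for illustration)."""
    rng = random.Random(seed)
    coreset = [wp for part in partitions for wp in local_summary(part, t, rng)]
    best, best_cost = None, float("inf")
    for centers in combinations([p for p, _ in coreset], k):
        cost = sum(w * min(squared_dist(p, c) for c in centers) ** 0.5
                   for p, w in coreset)
        if cost < best_cost:
            best, best_cost = list(centers), cost
    return best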
A Faster k-means++ Algorithm
K-means++ is an important algorithm for choosing initial cluster centers for the k-means clustering algorithm. In this work, we present a new algorithm that can solve the k-means++ problem with near-optimal running time. Given data points in , the current state-of-the-art algorithm runs in  iterations, and each iteration takes  time. The overall running time is thus . We propose a new algorithm, FastKmeans++, that only takes  time in total.
An Empirical Evaluation of k-Means Coresets
Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high-performance coresets for clustering problems such as k-means in both theory and practice. Curiously, there exists no work comparing the quality of available k-means coresets.
In this paper we perform such an evaluation. There is currently no known algorithm for measuring the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows for an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
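Since no algorithm is known to measure a coreset's distortion exactly, a natural heuristic, sketched below under our own assumptions (the function names and the choice of candidate solutions are illustrative, not the paper's benchmark), is to compare the weighted coreset cost against the true cost on a finite set of candidate center sets; the worst ratio observed is a lower bound on the true worst-case distortion.

```python
def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum of (weighted) squared distances of
    each point to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p, w in zip(points, weights))

def empirical_distortion(data, coreset, solutions):
    """Heuristic lower bound on distortion: worst multiplicative gap
    between coreset cost and true cost over candidate center sets.
    The true distortion (sup over all solutions) is at least this."""
    worst = 1.0
    for centers in solutions:
        true = kmeans_cost(data, centers)
        approx = kmeans_cost([p for p, _ in coreset], centers,
                             [w for _, w in coreset])
        if true > 0 and approx > 0:
            worst = max(worst, approx / true, true / approx)
    return worst
```

Because the bound only probes finitely many solutions, it can understate the true distortion, which is consistent with the paper's point that exact measurement appears computationally difficult.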