    Improved Algorithms for Time Decay Streams

    In the time-decay model for data streams, elements of an underlying data set arrive sequentially, with recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline coreset construction and yields a time-decay coreset for polynomial time-decay functions. We also consider the exponential time-decay model for $k$-median clustering, where we provide a constant-factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores $O(k \log(h\Delta) + h)$ points, where $h$ is the half-life of the decay function and $\Delta$ is the aspect ratio of the data set. Our techniques extend to $k$-means clustering and M-estimators as well.
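    The "online facility location algorithm" referenced here is the classical Meyerson-style subroutine. The following is a minimal Python sketch of that subroutine under simplifying assumptions (a fixed facility cost f, uniform weights, and no time decay); it illustrates the building block, not the paper's full time-decay algorithm.

    import math
    import random

    def online_facility_location(stream, f):
        # Meyerson-style rule: each arriving point either joins its
        # nearest open facility, or opens a new facility at its own
        # location with probability min(1, distance / f).
        facilities = []
        for p in stream:
            if not facilities:
                facilities.append(p)
                continue
            d = min(math.dist(p, c) for c in facilities)
            if random.random() < min(1.0, d / f):
                facilities.append(p)
        return facilities

    # Example: 1,000 Gaussian points arriving one at a time.
    pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
    print(len(online_facility_location(pts, f=0.5)))

    The facility cost f trades off the number of opened centers against the assignment cost; in the exponential time-decay model, the algorithm additionally has to account for point weights that decay with age according to the half-life h.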

    On Generalization Bounds for Projective Clustering

    Given a set of points, clustering consists of finding a partition of the points into $k$ clusters such that each point is as close as possible to the center to which it is assigned. Most commonly, centers are points themselves, which leads to the famous $k$-median and $k$-means objectives. One may also choose centers to be $j$-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of $n$ samples $P$ drawn independently from some unknown but fixed distribution $\mathcal{D}$, how quickly does a solution computed on $P$ converge to the optimal clustering of $\mathcal{D}$? We give several near-optimal results. In particular: For center-based objectives, we show a convergence rate of $\tilde{O}(\sqrt{k/n})$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for $k$-means and extends them to other important objectives such as $k$-median. For subspace clustering with $j$-dimensional subspaces, we show a convergence rate of $\tilde{O}(\sqrt{kj^2/n})$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes $k$-means, we show that a convergence rate of $\Omega(\sqrt{kj/n})$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] are essentially optimal.
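    Read as a uniform convergence statement, the center-based bound says that the empirical cost of every candidate set of $k$ centers is simultaneously close to its population cost. The display below is a paraphrase of that form for $k$-means (bounded instances assumed; constants and logarithmic factors suppressed), not a verbatim theorem from the paper:

    \[
      \sup_{C \subset \mathbb{R}^d,\, |C| = k}
      \left| \frac{1}{n} \sum_{p \in P} \min_{c \in C} \| p - c \|^2
      \;-\; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \min_{c \in C} \| x - c \|^2 \Big] \right|
      \;\le\; \tilde{O}\!\left( \sqrt{\frac{k}{n}} \right)
    \]

    For $k$-median the squared norm is replaced by the norm itself, and for subspace clustering the minimum is taken over distances to the $j$-dimensional subspaces.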

    Coresets and streaming algorithms for the k-means problem and related clustering objectives

    The k-means problem seeks a clustering that minimizes the sum-of-squared-errors cost function: for input points P from the Euclidean space R^d and any solution consisting of k centers from R^d, the cost is the sum of the squared distances of each point to its closest center. This thesis studies concepts used for large input point sets. For inputs with many points, the term coreset refers to a reduced version with fewer but weighted points. For inputs with high-dimensional points, dimensionality reduction is used to reduce the number of dimensions. In both cases, the reduced version has to preserve the cost function up to an epsilon-fraction for all choices of k centers. We study coreset constructions and dimensionality reductions for the k-means problem. Further, we develop coreset constructions in the data stream model. Here, the data is so large that it should only be read once and cannot be stored in main memory. The input might even arrive as a stream of points in arbitrary order. Thus, a data stream algorithm has to continuously process the input while it arrives and can only store small summaries. In the second part of the thesis, the obtained results are extended to related clustering objectives. Projective clustering minimizes the squared distances to k subspaces instead of k points. Kernel k-means is an extension of k-means to scenarios where the target clustering is not linearly separable. In addition to extensions to these objectives, we study coreset constructions for a probabilistic clustering problem where input points are given as distributions over a finite set of locations.
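    The coreset guarantee described in this abstract is a statement about one cost function evaluated on two point sets: for every choice of k centers, the weighted cost of the summary must be within an epsilon-fraction of the cost of the full input. A minimal NumPy sketch of that cost (the helper name kmeans_cost is hypothetical):

    import numpy as np

    def kmeans_cost(points, centers, weights=None):
        # Sum of (weighted) squared distances from each point to its
        # closest center -- the k-means objective defined above.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        closest = d2.min(axis=1)  # squared distance to nearest center
        if weights is None:
            weights = np.ones(len(points))
        return float((weights * closest).sum())

    # A weighted summary (S, w) is an epsilon-coreset of P if, for every
    # set C of k centers:
    #   |kmeans_cost(S, C, w) - kmeans_cost(P, C)| <= eps * kmeans_cost(P, C)
    rng = np.random.default_rng(0)
    P = rng.normal(size=(1000, 2))
    C = rng.normal(size=(5, 2))
    print(kmeans_cost(P, C))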

    LIPIcs, Volume 244, ESA 2022, Complete Volume

    LIPIcs, Volume 248, ISAAC 2022, Complete Volume

    LIPIcs, Volume 274, ESA 2023, Complete Volume
