Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms
We present a technical survey of the state-of-the-art approaches to data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching, and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower-bounding techniques.
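As a toy illustration of the random-sampling flavor of data reduction (our own example, not taken from the survey): a small uniform sample can stand in for the full point set when estimating a clustering cost, here the simple 1-mean cost. All sizes below are illustrative choices.

```python
import numpy as np

# Toy illustration of data reduction by uniform random sampling (our
# example, not from the survey): a small sample stands in for the full
# point set when estimating the 1-mean cost.
rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 5))                    # full data set
S = P[rng.choice(len(P), size=500, replace=False)]  # uniform sample

def mean_cost(X, c):
    """Average squared distance of the points in X to center c."""
    return float(np.mean(np.sum((X - c) ** 2, axis=1)))

c_full = P.mean(axis=0)   # optimal 1-mean center on the full data
c_samp = S.mean(axis=0)   # center fitted on the sample only
# Solving on the sample is near-optimal when evaluated on the full data.
rel_err = mean_cost(P, c_samp) / mean_cost(P, c_full) - 1.0
```

Uniform sampling works well for this benign cost; the survey's sensitivity-sampling and sketching methods address harder objectives where uniform sampling fails.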
Approximation and Streaming Algorithms for Projective Clustering via Random Projections
Let P be a set of n points in R^d. In the projective
clustering problem, given k, q, and a norm ρ ∈ [1, ∞], we have to
compute a set F of k q-dimensional flats such that
(Σ_{p∈P} d(p, F)^ρ)^{1/ρ} is minimized; here d(p, F)
represents the (Euclidean) distance of p to the closest flat in
F. We let OPT_{k,q,ρ}(P) denote the minimal value, interpreted for
ρ = ∞ as max_{p∈P} d(p, F). When q = 0 and
ρ = 1, 2, and ∞, the problem corresponds to the k-median, k-means, and
k-center clustering problems, respectively.
For every ε ∈ (0, 1), ρ, and q, we show that the
orthogonal projection of P onto a randomly chosen flat of dimension
polynomial in q and 1/ε and logarithmic in n will ε-approximate
the optimal clustering cost. This result combines the concepts of geometric coresets and
subspace embeddings based on the Johnson-Lindenstrauss Lemma. As a consequence,
an orthogonal projection of P onto a randomly chosen subspace of such a dimension
ε-approximates projective clusterings for every k and ρ
simultaneously. Note that the dimension of this subspace is independent of the
number of clusters k.
Using this dimension reduction result, we obtain new approximation and
streaming algorithms for projective clustering problems. For example, given a
stream of n points in R^d, we show how to compute an ε-approximate
projective clustering for every k and ρ simultaneously using space
near-linear in n + d (up to factors polynomial in q and 1/ε and logarithmic
in n). Compared to standard streaming algorithms, whose space requirement
scales with the product of n and d, our approach is a significant improvement
when the number of input points and their dimension are of the same order of
magnitude.
Comment: Canadian Conference on Computational Geometry (CCCG 2015).
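To make the dimension-reduction idea concrete, here is a hedged sketch (not the paper's construction or parameters): a Gaussian random projection approximately preserves the 1-mean clustering cost, i.e. the k = 1, q = 0, ρ = 2 special case. The sizes n, d, m below are illustrative assumptions.

```python
import numpy as np

# Hedged illustration (not the paper's algorithm): project the points
# with a Gaussian Johnson-Lindenstrauss-style matrix and compare the
# 1-mean cost before and after the projection.
rng = np.random.default_rng(0)
n, d, m = 200, 100, 40                    # m: assumed target dimension
P = rng.normal(size=(n, d))               # input point set

G = rng.normal(size=(d, m)) / np.sqrt(m)  # random projection matrix
Q = P @ G                                 # projected point set

def one_mean_cost(X):
    """Sum of squared distances of the rows of X to their centroid."""
    return float(np.sum((X - X.mean(axis=0)) ** 2))

ratio = one_mean_cost(Q) / one_mean_cost(P)  # expected to be close to 1
```

The paper's contribution is far stronger than this toy check: one projection of dimension independent of k approximates the cost for all k and ρ simultaneously.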
Coresets for minimum enclosing balls over sliding windows
Coresets are important tools to generate concise summaries of massive
datasets for approximate analysis. A coreset is a small subset of points
extracted from the original point set such that certain geometric properties
are preserved with provable guarantees. This paper investigates the problem of
maintaining a coreset to preserve the minimum enclosing ball (MEB) for a
sliding window of points that are continuously updated in a data stream.
Although the problem has been extensively studied in batch and append-only
streaming settings, no efficient sliding-window solution is available yet. In
this work, we first introduce an algorithm, called AOMEB, to build a coreset
for MEB in an append-only stream. AOMEB improves the practical performance of
the state-of-the-art algorithm while having the same approximation ratio.
Furthermore, using AOMEB as a building block, we propose two novel algorithms,
namely SWMEB and SWMEB+, to maintain coresets for MEB over the sliding window
with constant approximation ratios. The proposed algorithms also support
coresets for MEB in a reproducing kernel Hilbert space (RKHS). Finally,
extensive experiments on real-world and synthetic datasets demonstrate that
SWMEB and SWMEB+ achieve speedups of up to four orders of magnitude over the
state-of-the-art batch algorithm, while providing coresets for MEB with small
errors relative to the optimal ones.
Comment: 28 pages, 10 figures, to appear in the 25th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining (KDD '19).
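As background for the coreset notion used here, a minimal batch sketch in the spirit of the classic Badoiu-Clarkson iteration (a baseline, not the paper's AOMEB or SWMEB algorithms) builds a small subset of points whose induced center (1 + eps)-covers the whole set:

```python
import numpy as np

def meb_coreset(P, eps, seed=0):
    """(1 + eps)-approximate minimum enclosing ball via the classic
    Badoiu-Clarkson iteration (batch baseline sketch, NOT the paper's
    AOMEB/SWMEB algorithms). Returns (center, coreset indices)."""
    rng = np.random.default_rng(seed)
    c = P[rng.integers(len(P))].astype(float)  # start at a random point
    core = set()
    for i in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        far = int(np.argmax(np.linalg.norm(P - c, axis=1)))  # farthest point
        core.add(far)
        c = c + (P[far] - c) / (i + 1)         # move center toward it
    return c, sorted(core)

rng = np.random.default_rng(1)
P = rng.normal(size=(500, 10))
center, core = meb_coreset(P, eps=0.1)
radius = np.linalg.norm(P - center, axis=1).max()  # ball at center covers P
```

The coreset has at most ceil(1/eps^2) points regardless of n or d; the sliding-window algorithms in the paper additionally handle the expiration of old points, which this batch sketch does not.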