Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms
We present a technical survey of state-of-the-art approaches to data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching, and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower-bounding techniques.
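Random projections are one of the sketching-style reductions such surveys cover. The sketch below is a minimal illustration, not code from the survey: it applies a Gaussian Johnson-Lindenstrauss projection and checks that a pairwise distance is approximately preserved; the sizes and the seed are arbitrary choices for the example.

```python
import numpy as np

# Minimal Johnson-Lindenstrauss-style random projection (illustrative only).
# Rows of X are points in R^d; we project to k << d dimensions with a
# Gaussian matrix scaled by 1/sqrt(k), which preserves pairwise distances
# up to a (1 +/- eps) factor with high probability for k = O(log(n) / eps^2).
rng = np.random.default_rng(0)
n, d, k = 200, 10_000, 512          # sizes chosen arbitrarily for the demo
X = rng.normal(size=(n, d))

P = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection matrix
Y = X @ P                                  # reduced data, shape (n, k)

# Compare one pairwise distance before and after projection.
i, j = 3, 17
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(f"distance ratio after projection: {proj / orig:.3f}")  # close to 1
```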
Scalable k-Means Clustering via Lightweight Coresets
Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct lightweight coresets for k-means clustering as well as soft and hard Bregman clustering. The algorithm is substantially faster than existing constructions, embarrassingly parallel, and the resulting coresets are smaller. We further show that the proposed approach naturally generalizes to statistical k-means clustering and that, compared to existing results, it can be used to compute smaller summaries for empirical risk minimization. In extensive experiments, we demonstrate that the proposed algorithm outperforms existing data summarization strategies in practice.
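The lightweight coreset construction itself is short. The sketch below is a hedged reconstruction from the abstract's description and the published algorithm (Bachem, Lucic, and Krause, KDD 2018), not the authors' code: points are sampled from a mixture of the uniform distribution and the squared distance to the data mean, and each sampled point is weighted by the inverse of its sampling probability.

```python
import numpy as np

def lightweight_coreset(X, m, rng=None):
    """Sample a weighted lightweight coreset of m points from X (n x d).

    Sampling distribution (Bachem et al., 2018):
        q(x) = 1/2 * 1/n + 1/2 * d(x, mean)^2 / sum_x' d(x', mean)^2
    with weights 1 / (m * q(x)).
    """
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    mu = X.mean(axis=0)
    sq_dists = ((X - mu) ** 2).sum(axis=1)
    q = 0.5 / n + 0.5 * sq_dists / sq_dists.sum()  # sums to 1
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])
    return X[idx], weights

# Example: the weighted pairs (C, w) can be fed to any weighted k-means solver.
X = np.random.default_rng(1).normal(size=(100_000, 10))
C, w = lightweight_coreset(X, m=1_000)
```

Because each point's probability depends only on the global mean, the construction needs just two passes over the data and parallelizes trivially, which matches the abstract's "embarrassingly parallel" claim.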
Coreset Markov Chain Monte Carlo
A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during inference in order to reduce computational cost. However, state-of-the-art methods for tuning coreset weights are expensive, require nontrivial user input, and impose constraints on the model. In this work, we propose a new method, Coreset MCMC, that simulates a Markov chain targeting the coreset posterior while simultaneously updating the coreset weights using those same draws. Coreset MCMC is simple to implement and tune, and can be used with any existing MCMC kernel. We analyze Coreset MCMC in a representative setting to obtain key insights about the convergence behaviour of the method. Empirical results demonstrate that Coreset MCMC provides higher-quality posterior approximations and reduced computational cost compared with other coreset construction methods. Further, compared with other general subsampling MCMC methods, we find that Coreset MCMC has higher sampling efficiency with competitively accurate posterior approximations.
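The abstract's core idea, running an MCMC kernel on the coreset posterior while adapting the weights from the same draws, fits in a short loop. The following is a much-simplified sketch of that structure for a toy Gaussian model, not the paper's implementation: several parallel Metropolis-Hastings chains target the weighted coreset posterior, and the weights take stochastic gradient steps on KL(coreset posterior || full posterior), whose gradient is a covariance that the chains' draws estimate. The model, step sizes, and iteration counts are arbitrary choices, and the full-data term is evaluated exactly here for clarity rather than subsampled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i ~ N(theta, 1), prior theta ~ N(0, 1).
n, m, K = 10_000, 50, 8                   # data size, coreset size, # chains
y = rng.normal(1.0, 1.0, size=n)
S = rng.choice(n, size=m, replace=False)  # uniformly chosen coreset support
w = np.full(m, n / m)                     # initialize weights to sum to n

def ell(theta, pts):
    # Per-point log-likelihoods (constants dropped) for scalar theta.
    return -0.5 * (pts - theta) ** 2

def log_coreset_post(theta):
    return -0.5 * theta**2 + w @ ell(theta, y[S])

thetas = rng.normal(size=K)               # one state per parallel chain
step, lr = 0.05, 1e-7
for t in range(2_000):
    # 1) One Metropolis-Hastings move per chain, targeting the coreset posterior.
    for k in range(K):
        prop = thetas[k] + step * rng.normal()
        if np.log(rng.uniform()) < log_coreset_post(prop) - log_coreset_post(thetas[k]):
            thetas[k] = prop
    # 2) Weight update from the same draws: d/dw_j KL(pi_w || pi) equals
    #    Cov_{pi_w}(l_j(theta), w.l(theta) - sum_i l_i(theta)), estimated
    #    here across the K chain states.
    L_core = np.stack([ell(th, y[S]) for th in thetas])                 # (K, m)
    resid = L_core @ w - np.array([ell(th, y).sum() for th in thetas])  # (K,)
    g = ((L_core - L_core.mean(0)) * (resid - resid.mean())[:, None]).mean(0)
    w = np.maximum(0.0, w - lr * g)       # descent step, weights kept nonnegative

print("posterior mean estimate:", thetas.mean())
```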
A Novel Sequential Coreset Method for Gradient Descent Algorithms
A wide range of optimization problems arising in machine learning can be solved by gradient descent algorithms, and a central question in this area is how to efficiently compress a large-scale dataset so as to reduce the computational complexity. The coreset is a popular data compression technique that has been extensively studied before. However, most existing coreset methods are problem-dependent and cannot be used as a general tool for a broader range of applications. A key obstacle is that they often rely on the pseudo-dimension and total sensitivity bound, which can be very high or hard to obtain. In this paper, based on the "locality" property of gradient descent algorithms, we propose a new framework, termed the "sequential coreset", which effectively avoids these obstacles. Moreover, our method is particularly suitable for sparse optimization, where the coreset size can be further reduced to be only poly-logarithmically dependent on the dimension. In practice, the experimental results suggest that our method can save a large amount of running time compared with the baseline algorithms.
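To make the setting concrete, the sketch below shows the generic pattern these methods target: gradient descent run on a small weighted subset instead of the full data, so each iteration costs time proportional to the coreset size rather than to n. It uses plain uniform sampling on a least-squares objective purely for illustration; the paper's contribution is a smarter, sequentially updated sampling scheme, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full least-squares problem: min_x (1/n) * sum_i (a_i . x - b_i)^2.
n, d, m = 100_000, 20, 2_000
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

# Weighted coreset via uniform sampling (illustration only; the sequential
# coreset framework would replace this with a locality-aware construction).
idx = rng.choice(n, size=m, replace=False)
w = np.full(m, n / m)                      # weights keep the subsample unbiased
As, bs = A[idx], b[idx]

# Gradient descent on the weighted coreset objective:
#   f_C(x) = (1/n) * sum_{j in C} w_j * (a_j . x - b_j)^2
x = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = (2.0 / n) * As.T @ (w * (As @ x - bs))   # O(m*d) per step, not O(n*d)
    x -= lr * grad

print("error vs. ground truth:", np.linalg.norm(x - x_true))
```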
- …