591 research outputs found
Coreset Clustering on Small Quantum Computers
Many quantum algorithms for machine learning require access to classical data
in superposition. However, for many natural data sets and algorithms, the
overhead required to load the data set in superposition can erase any potential
quantum speedup over classical algorithms. Recent work by Harrow introduces a
new paradigm in hybrid quantum-classical computing to address this issue,
relying on coresets to minimize the data loading overhead of quantum
algorithms. We investigate using this paradigm to perform -means clustering
on near-term quantum computers, by casting it as a QAOA optimization instance
over a small coreset. We compare the performance of this approach to classical
-means clustering both numerically and experimentally on IBM Q hardware. We
are able to find data sets where coresets work well relative to random sampling
and where QAOA could potentially outperform standard -means on a coreset.
However, finding data sets where both coresets and QAOA work well--which is
necessary for a quantum advantage over -means on the entire data
set--appears to be challenging
Large Scale Clustering with Variational EM for Gaussian Mixture Models
How can we efficiently find large numbers of clusters in large data sets with
high-dimensional data points? Our aim is to explore the current efficiency and
large-scale limits in fitting a parametric model for clustering to data
distributions. To do so, we combine recent lines of research which have
previously focused on separate specific methods for complexity reduction. We
first show theoretically how the clustering objective of variational EM (which
reduces complexity for many clusters) can be combined with coreset objectives
(which reduce complexity for many data points). Secondly, we realize a concrete
highly efficient iterative procedure which combines and translates the
theoretical complexity gains of truncated variational EM and coresets into a
practical algorithm. For very large scales, the high efficiency of parameter
updates then requires (A) highly efficient coreset construction and (B) highly
efficient initialization procedures (seeding) in order to avoid computational
bottlenecks. Fortunately very efficient coreset construction has become
available in the form of light-weight coresets, and very efficient
initialization has become available in the form of AFK-MC seeding. The
resulting algorithm features balanced computational costs across all
constituting components. In applications to standard large-scale benchmarks for
clustering, we investigate the algorithm's efficiency/quality trade-off.
Compared to the best recent approaches, we observe speedups of up to one order
of magnitude, and up to two orders of magnitude compared to the -means++
baseline. To demonstrate that the observed efficiency enables previously
considered unfeasible applications, we cluster the entire and unscaled 80 Mio.
Tiny Images dataset into up to 32,000 clusters. To the knowledge of the
authors, this represents the largest scale fit of a parametric data model for
clustering reported so far
Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms
We present a technical survey on the state of the art approaches in data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview on lower bounding techniques
Scalable k-Means Clustering via Lightweight Coresets
Coresets are compact representations of data sets such that models trained on
a coreset are provably competitive with models trained on the full data set. As
such, they have been successfully used to scale up clustering models to massive
data sets. While existing approaches generally only allow for multiplicative
approximation errors, we propose a novel notion of lightweight coresets that
allows for both multiplicative and additive errors. We provide a single
algorithm to construct lightweight coresets for k-means clustering as well as
soft and hard Bregman clustering. The algorithm is substantially faster than
existing constructions, embarrassingly parallel, and the resulting coresets are
smaller. We further show that the proposed approach naturally generalizes to
statistical k-means clustering and that, compared to existing results, it can
be used to compute smaller summaries for empirical risk minimization. In
extensive experiments, we demonstrate that the proposed algorithm outperforms
existing data summarization strategies in practice.Comment: To appear in the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (KDD
Training Gaussian Mixture Models at Scale via Coresets
How can we train a statistical mixture model on a massive data set? In this
work we show how to construct coresets for mixtures of Gaussians. A coreset is
a weighted subset of the data, which guarantees that models fitting the coreset
also provide a good fit for the original data set. We show that, perhaps
surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension
and the number of mixture components, while being independent of the data set
size. Hence, one can harness computationally intensive algorithms to compute a
good approximation on a significantly smaller data set. More importantly, such
coresets can be efficiently constructed both in distributed and streaming
settings and do not impose restrictions on the data generating process. Our
results rely on a novel reduction of statistical estimation to problems in
computational geometry and new combinatorial complexity results for mixtures of
Gaussians. Empirical evaluation on several real-world datasets suggests that
our coreset-based approach enables significant reduction in training-time with
negligible approximation error
- …