
    Training Gaussian Mixture Models at Scale via Coresets

    How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in the dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data-generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and on new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world datasets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
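    A minimal sketch of the coreset contract described above, not the paper's construction: a weighted subset on which the weighted log-likelihood of a candidate Gaussian mixture should track the log-likelihood on the full data set. The uniform subsample, the single-Gaussian test model, and all names below are placeholder assumptions for illustration only.

```python
# Minimal sketch (placeholder uniform subsample, NOT the paper's coreset construction):
# the weighted log-likelihood on a small weighted subset should approximate
# the log-likelihood of the same mixture model on the full data set.
import numpy as np
from scipy.stats import multivariate_normal

def weighted_gmm_loglik(X, w, pis, mus, covs):
    """sum_i w_i * log sum_k pi_k * N(x_i | mu_k, cov_k)"""
    dens = np.zeros(len(X))
    for pi_k, mu_k, cov_k in zip(pis, mus, covs):
        dens += pi_k * multivariate_normal.pdf(X, mean=mu_k, cov=cov_k)
    return float(np.sum(w * np.log(dens)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 2))                     # stand-in for a massive data set
idx = rng.choice(len(X), size=1_000, replace=False)   # naive placeholder subsample
C, w = X[idx], np.full(1_000, len(X) / 1_000)         # weights let the subset stand in for all points

# Score one candidate model (a single standard Gaussian) on the full data vs. the weighted subset.
pis, mus, covs = [1.0], [np.zeros(2)], [np.eye(2)]
print(weighted_gmm_loglik(X, np.ones(len(X)), pis, mus, covs))
print(weighted_gmm_loglik(C, w, pis, mus, covs))
```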

    Large Scale Clustering with Variational EM for Gaussian Mixture Models

    How can we efficiently find large numbers of clusters in large data sets with high-dimensional data points? Our aim is to explore the current efficiency and large-scale limits in fitting a parametric clustering model to data distributions. To do so, we combine recent lines of research that have previously focused on separate methods for complexity reduction. We first show theoretically how the clustering objective of variational EM (which reduces complexity for many clusters) can be combined with coreset objectives (which reduce complexity for many data points). Secondly, we realize a concrete, highly efficient iterative procedure that combines and translates the theoretical complexity gains of truncated variational EM and coresets into a practical algorithm. At very large scales, the high efficiency of parameter updates then requires (A) highly efficient coreset construction and (B) highly efficient initialization procedures (seeding) in order to avoid computational bottlenecks. Fortunately, very efficient coreset construction has become available in the form of lightweight coresets, and very efficient initialization has become available in the form of AFK-MC² seeding. The resulting algorithm features balanced computational costs across all of its constituent components. In applications to standard large-scale clustering benchmarks, we investigate the algorithm's efficiency/quality trade-off. Compared to the best recent approaches, we observe speedups of up to one order of magnitude, and up to two orders of magnitude compared to the k-means++ baseline. To demonstrate that the observed efficiency enables applications previously considered infeasible, we cluster the entire, unscaled 80 Million Tiny Images dataset into up to 32,000 clusters. To the knowledge of the authors, this represents the largest-scale fit of a parametric clustering model reported so far.
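    A rough sketch of the truncation idea behind such a variational E-step, assuming isotropic Gaussian components; the top-C selection and all names are illustrative, not the authors' exact update equations. Each point keeps responsibilities only for its few nearest clusters, so the per-point E-step cost scales with C rather than with the total number of clusters K.

```python
# Rough sketch of a truncated E-step (isotropic Gaussians assumed; not the
# authors' exact updates): each point keeps responsibilities only for its
# C nearest cluster means and renormalizes over that small set.
import numpy as np

def truncated_e_step(X, means, variances, pis, C=5):
    # squared Euclidean distance from every point to every cluster mean, shape (n, K)
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # indices of the C closest clusters per point, shape (n, C)
    top = np.argpartition(d2, C, axis=1)[:, :C]
    rows = np.arange(len(X))[:, None]
    # unnormalized log-responsibilities restricted to those C clusters
    log_r = np.log(pis[top]) - 0.5 * d2[rows, top] / variances[top]
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    return top, r / r.sum(axis=1, keepdims=True)   # sparse responsibilities, shape (n, C)
```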

    Scalable k-Means Clustering via Lightweight Coresets

    Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct lightweight coresets for k-means clustering as well as soft and hard Bregman clustering. The algorithm is substantially faster than existing constructions, embarrassingly parallel, and the resulting coresets are smaller. We further show that the proposed approach naturally generalizes to statistical k-means clustering and that, compared to existing results, it can be used to compute smaller summaries for empirical risk minimization. In extensive experiments, we demonstrate that the proposed algorithm outperforms existing data summarization strategies in practice. Comment: To appear in the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2018).
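    A short sketch of a lightweight-coreset-style construction, offered as a reading of the idea above rather than the paper's exact pseudocode: sampling probabilities mix a uniform term with a squared-distance-to-the-mean term, and each sampled point is reweighted by the inverse of its scaled sampling probability so that weighted clustering costs remain unbiased estimates of the full costs. Function and variable names are illustrative.

```python
# Sketch of a lightweight-coreset-style construction (a reading of the idea,
# not verified against the paper's pseudocode).
import numpy as np

def lightweight_coreset(X, m, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(X)
    d2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)    # squared distance to the data mean
    q = 0.5 / n + 0.5 * d2 / d2.sum()               # mix of uniform and distance-based terms
    idx = rng.choice(n, size=m, replace=True, p=q)  # importance sampling
    return X[idx], 1.0 / (m * q[idx])               # sampled points and their weights

X = np.random.default_rng(1).normal(size=(50_000, 16))
C, w = lightweight_coreset(X, m=500)
# C and w can then be passed to any clustering routine that accepts sample weights.
```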