
    Training Support Vector Machines using Coresets

    We present a novel coreset construction algorithm for solving classification tasks using Support Vector Machines (SVMs) in a computationally efficient manner. A coreset is a weighted subset of the original data points that provably approximates the original set. We show that coresets of size polylogarithmic in $n$ and polynomial in $d$ exist for a set of $n$ input points with $d$ features, and present an $(\epsilon,\delta)$-FPRAS for constructing coresets for scalable SVM training. Our method leverages the insight that data points are often redundant and uses an importance sampling scheme based on the sensitivity of each data point to construct coresets efficiently. We evaluate the performance of our algorithm in accelerating SVM training on real-world data sets and compare it to state-of-the-art coreset approaches. Our empirical results show that our approach outperforms a state-of-the-art coreset approach and uniform sampling, enabling computational speedups while achieving low approximation error.
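
    As a rough illustration of the sensitivity-based importance sampling described above, the Python sketch below samples points with probability proportional to a given vector of sensitivity upper bounds and assigns the usual inverse-probability weights. The `sensitivities` input and the final weighted `LinearSVC` fit are assumptions for illustration; the paper's SVM-specific sensitivity bounds and construction are not reproduced here.

        import numpy as np
        from sklearn.svm import LinearSVC

        def sensitivity_coreset(X, y, sensitivities, m, rng=None):
            """Sample a weighted coreset of m points with probability proportional
            to a given sensitivity upper bound per point (assumed precomputed)."""
            rng = np.random.default_rng(rng)
            p = sensitivities / sensitivities.sum()      # sampling distribution
            idx = rng.choice(len(X), size=m, replace=True, p=p)
            w = 1.0 / (m * p[idx])                       # inverse-probability weights
            return X[idx], y[idx], w

        # Hypothetical usage (X, y, sensitivities assumed given):
        # Xc, yc, wc = sensitivity_coreset(X, y, sensitivities, m=1000)
        # clf = LinearSVC().fit(Xc, yc, sample_weight=wc)   # train on the small weighted set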

    Coresets for Vector Summarization with Applications to Network Graphs

    We provide a deterministic data summarization algorithm that approximates the mean $\bar{p}=\frac{1}{n}\sum_{p\in P} p$ of a set $P$ of $n$ vectors in $\mathbb{R}^d$ by a weighted mean $\tilde{p}$ of a subset of $O(1/\epsilon)$ vectors, i.e., a subset whose size is independent of both $n$ and $d$. We prove that the squared Euclidean distance between $\bar{p}$ and $\tilde{p}$ is at most $\epsilon$ multiplied by the variance of $P$. We use this algorithm to maintain an approximate sum of vectors from an unbounded stream, using memory that is independent of $d$ and logarithmic in the number $n$ of vectors seen so far. Our main application is to extract and represent, in a compact way, friend groups and activity summaries of users from underlying data exchanges. For example, in the case of mobile networks we can use GPS traces to identify meetings, and in the case of social networks we can use information exchanges to identify friend groups. Our algorithm provably identifies the Heavy Hitter entries in a proximity (adjacency) matrix, and these Heavy Hitters are then used to build the compact group and activity summaries. We evaluate the algorithm on several large data sets. Comment: ICML'201
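
    A guarantee of this shape, a weighted subset of $O(1/\epsilon)$ vectors whose weighted mean is within $\epsilon$ times the variance of $\bar{p}$, is the kind of bound a Frank-Wolfe-style greedy selection over the convex hull of $P$ yields. The sketch below is such a generic greedy procedure with exact line search, offered only as an illustration; it is not the paper's exact deterministic algorithm, and "variance" is taken here as the mean squared distance to $\bar{p}$.

        import numpy as np

        def greedy_mean_summary(P, eps):
            """Greedily build a sparse convex combination of rows of P that
            approximates the mean; a Frank-Wolfe-style sketch, not the paper's algorithm."""
            n, d = P.shape
            mean = P.mean(axis=0)
            var = ((P - mean) ** 2).sum(axis=1).mean()   # mean squared distance to the mean
            j = int(np.argmin(((P - mean) ** 2).sum(axis=1)))
            w = np.zeros(n); w[j] = 1.0                  # start from the point closest to the mean
            cur = P[j].copy()
            while ((cur - mean) ** 2).sum() > eps * var:
                # greedy direction: the point most aligned with the residual toward the mean
                j = int(np.argmax((P - cur) @ (mean - cur)))
                s = P[j] - cur
                # exact line search for min_t ||cur + t*s - mean||^2 over t in [0, 1]
                t = float(np.clip(s @ (mean - cur) / (s @ s), 0.0, 1.0))
                w *= (1 - t); w[j] += t
                cur = cur + t * s
            return w   # nonzero entries define the weighted subset with mean ~ p-bar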

    Practical Coreset Constructions for Machine Learning

    We investigate coresets - succinct, small summaries of large data sets - such that solutions found on the summary are provably competitive with solutions found on the full data set. We provide an overview of the state of the art in coreset construction for machine learning. In Section 2, we present both the intuition behind and a theoretically sound framework for constructing coresets for general problems, and apply it to $k$-means clustering. In Section 3, we summarize existing coreset construction algorithms for a variety of machine learning problems, such as maximum likelihood estimation of mixture models, Bayesian non-parametric models, principal component analysis, regression, and general empirical risk minimization.
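
    One concrete, widely used instance of such a sensitivity-style framework for $k$-means mixes uniform sampling with sampling proportional to the squared distance from the overall data mean (the "lightweight coreset" recipe). The sketch below follows that recipe as an illustration of the general approach; it is not claimed to be the specific framework developed in the survey.

        import numpy as np

        def lightweight_kmeans_coreset(X, m, rng=None):
            """Sample a weighted coreset for k-means: half uniform, half proportional
            to squared distance from the data mean (a standard sensitivity proxy)."""
            rng = np.random.default_rng(rng)
            n = len(X)
            d2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
            q = 0.5 / n + 0.5 * d2 / d2.sum()        # sampling distribution over points
            idx = rng.choice(n, size=m, replace=True, p=q)
            weights = 1.0 / (m * q[idx])             # inverse-probability weights
            return X[idx], weights

        # The weighted subset can then be passed to any k-means solver that accepts
        # sample weights, e.g. sklearn.cluster.KMeans(...).fit(Xc, sample_weight=wc).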

    A Deterministic Nonsmooth Frank Wolfe Algorithm with Coreset Guarantees

    We present a new Frank-Wolfe (FW) type algorithm that is applicable to minimization problems with a nonsmooth convex objective. We provide convergence bounds and show that the scheme yields so-called coreset results for various machine learning problems, including 1-median, Balanced Development, Sparse PCA, Graph Cuts, and the $\ell_1$-norm-regularized Support Vector Machine (SVM), among others. This means that the algorithm provides approximate solutions to these problems with time complexity bounds that do not depend on the size of the input problem. Our framework, motivated by a growing body of work on sublinear algorithms for various data analysis problems, is entirely deterministic and makes no use of smoothing or proximal operators. Beyond these theoretical results, we show experimentally that the algorithm is very practical and in some cases also offers significant computational advantages on large problem instances. We provide an open-source implementation that can be adapted to other problems that fit the overall structure.
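
    To make the "FW iterates as coresets" connection concrete: in standard Frank-Wolfe over the probability simplex, the iterate after $T$ steps is a convex combination of at most $T+1$ vertices, i.e., it touches at most $T+1$ data points. The sketch below shows the vanilla smooth version only, with a user-supplied gradient; the paper's contribution is a deterministic variant that retains such guarantees for nonsmooth objectives, which this sketch does not capture.

        import numpy as np

        def frank_wolfe_simplex(grad_f, n, T):
            """Vanilla Frank-Wolfe over the probability simplex in R^n.
            grad_f(w) returns the gradient of a smooth convex objective at w.
            The returned iterate has at most T+1 nonzero coordinates (the 'coreset')."""
            w = np.zeros(n)
            w[0] = 1.0                       # arbitrary starting vertex
            for t in range(T):
                g = grad_f(w)
                j = int(np.argmin(g))        # linear minimization oracle: best vertex
                gamma = 2.0 / (t + 2)        # standard FW step size
                w = (1 - gamma) * w
                w[j] += gamma
            return w

        # Hypothetical example: least squares over the simplex, f(w) = ||A w - b||^2.
        # A = np.random.randn(20, 1000); b = np.random.randn(20)
        # w = frank_wolfe_simplex(lambda w: 2 * A.T @ (A @ w - b), A.shape[1], T=50)
        # np.count_nonzero(w)   # at most T + 1 active columns of A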

    Improved Coresets for Kernel Density Estimates

    We study the construction of coresets for kernel density estimates. That is, we show how to approximate the kernel density estimate described by a large point set with another kernel density estimate over a much smaller point set. For characteristic kernels (including the Gaussian and Laplace kernels), our approximation preserves the $L_\infty$ error between kernel density estimates within error $\epsilon$, with coreset size $2/\epsilon^2$, and depends on no other aspects of the data, such as the dimension, the diameter of the point set, or the bandwidth of the kernel, which are common to other approximations. When the dimension is unrestricted, we show this bound is tight for these kernels as well as a much broader class. This work provides a careful analysis of the iterative Frank-Wolfe algorithm adapted to this context, an algorithm called kernel herding. This analysis unites a broad line of work that spans statistics, machine learning, and geometry. When the dimension $d$ is constant, we demonstrate much tighter bounds on the size of the coreset, specifically for Gaussian kernels, showing that it is bounded by the size of the coreset for axis-aligned rectangles. Currently the best known constructive bound is $O(\frac{1}{\epsilon} \log^d \frac{1}{\epsilon})$, and non-constructively this can be improved by $\sqrt{\log \frac{1}{\epsilon}}$. This improves the best constant-dimension bounds polynomially for $d \geq 3$.
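
    The kernel herding idea mentioned above can be written as a simple greedy loop: at each step, pick the candidate point where the gap between the full kernel density estimate and the current coreset's estimate is largest. The Python sketch below restricts candidates to the input set and uses a Gaussian kernel; it conveys the selection rule only, and omits the step-size and weighting details of the variant analyzed in the paper.

        import numpy as np

        def gauss_kernel(A, B, bw=1.0):
            """Gaussian kernel matrix between rows of A and rows of B."""
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * bw ** 2))

        def kernel_herding(P, m, bw=1.0):
            """Greedy herding-style selection of m points whose uniform KDE tracks
            the KDE of the full set P; a sketch of the idea, not the analyzed algorithm."""
            K = gauss_kernel(P, P, bw)
            target = K.mean(axis=1)          # full KDE evaluated at every candidate
            selected = []
            running = np.zeros(len(P))       # sum of kernel columns of selected points
            for t in range(m):
                # pick the point where the coreset KDE most under-shoots the full KDE
                scores = target - running / max(t, 1)
                j = int(np.argmax(scores))
                selected.append(j)
                running += K[:, j]
            return P[selected]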

    Coresets and Sketches

    Geometric data summarization has become an essential tool both in geometric approximation algorithms and where geometry intersects with big data problems. In linear or near-linear time, large data sets can be compressed into a summary, and then more intricate algorithms can be run on the summaries, whose results approximate those of the full data set. Coresets and sketches are the two most important classes of these summaries. We survey five types of coresets and sketches: shape-fitting, density estimation, high-dimensional vectors, high-dimensional point sets / matrices, and clustering. Comment: Near-final version of Chapter 49 in Handbook on Discrete and Computational Geometry, 3rd edition

    Automated Scalable Bayesian Inference via Hilbert Coresets

    The automation of posterior inference in Bayesian data analysis has enabled experts and nonexperts alike to use more sophisticated models, engage in faster exploratory modeling and analysis, and ensure experimental reproducibility. However, standard automated posterior inference algorithms are not tractable at the scale of massive modern datasets, and modifications to make them so are typically model-specific, require expert tuning, and can break theoretical guarantees on inferential quality. Building on the Bayesian coresets framework, this work instead takes advantage of data redundancy to shrink the dataset itself as a preprocessing step, providing fully automated, scalable Bayesian inference with theoretical guarantees. We begin with an intuitive reformulation of Bayesian coreset construction as sparse vector sum approximation, and demonstrate that its automation and performance-based shortcomings arise from the use of the supremum norm. To address these shortcomings we develop Hilbert coresets, i.e., Bayesian coresets constructed under a norm induced by an inner product on the log-likelihood function space. We propose two Hilbert coreset construction algorithms, one based on importance sampling and one based on the Frank-Wolfe algorithm, along with theoretical guarantees on approximation quality as a function of coreset size. Since the exact computation of the proposed inner products is model-specific, we automate the construction with a random finite-dimensional projection of the log-likelihood functions. The resulting automated coreset construction algorithm is simple to implement, and experiments on a variety of models with real and synthetic datasets show that it provides high-quality posterior approximations and a significant reduction in the computational cost of inference.
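
    As a rough sketch of the random-projection-plus-importance-sampling route described above: represent each data point's log-likelihood function by a finite-dimensional vector of its values at parameter samples drawn from some tractable approximation to the posterior, then sample points with probability proportional to the norm of that vector. The `loglik` callable and the source of `theta_samples` are assumptions here, and the weighting details of the paper's two algorithms differ from this sketch.

        import numpy as np

        def hilbert_is_coreset(data, loglik, theta_samples, m, rng=None):
            """Importance-sampling Bayesian coreset sketch.
            loglik(x, theta) -> scalar log-likelihood of data point x at parameter theta.
            theta_samples: draws from a tractable posterior approximation (assumed given)."""
            rng = np.random.default_rng(rng)
            # finite-dimensional projection of each point's log-likelihood function
            V = np.array([[loglik(x, th) for th in theta_samples] for x in data])
            sigma = np.linalg.norm(V, axis=1)        # per-point norms in projected space
            p = sigma / sigma.sum()
            idx = rng.choice(len(data), size=m, replace=True, p=p)
            weights = 1.0 / (m * p[idx])             # inverse-probability weights
            return idx, weights                      # run inference on the weighted subset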

    Coreset-Based Adaptive Tracking

    We propose a method for learning from streaming visual data using a compact, constant-size representation of all the data seen up to a given moment. Specifically, we construct a 'coreset' representation of the streaming data using a parallelized algorithm; the coreset approximates the original set with respect to the squared distances between this set and all other points in its ambient space. We learn an adaptive object appearance model from the coreset tree in constant time and logarithmic space and use it for object tracking by detection. Our method obtains excellent results for object tracking on three standard datasets covering more than 100 videos. The ability to summarize data efficiently makes our method ideally suited for tracking in long videos in the presence of space and time constraints. We demonstrate this ability by outperforming a variety of algorithms on the TLD dataset with 2685 frames on average. This coreset-based learning approach can be applied both for real-time learning from small, varied data and for fast learning from big data. Comment: 8 pages, 5 figures, in submission to IEEE TPAMI (Transactions on Pattern Analysis and Machine Intelligence)
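
    A coreset tree over a stream is typically organized as merge-and-reduce: buffer incoming points, compress each full buffer into a small coreset, and repeatedly merge and re-compress coresets of equal level, so that only logarithmically many compressed buckets are alive at any time. The skeleton below assumes a problem-specific `compress(points, weights, size)` routine (hypothetical here) and is independent of the tracking application; it is a generic sketch of the data structure, not the paper's parallelized construction.

        import numpy as np

        class CoresetTree:
            """Merge-and-reduce streaming skeleton keeping O(log n) buckets.
            `compress(points, weights, size)` is a problem-specific coreset routine
            (assumed given); it must return (points, weights) with at most `size` items."""

            def __init__(self, compress, size):
                self.compress = compress
                self.size = size
                self.buffer = []             # raw points not yet compressed
                self.levels = {}             # level -> (points, weights)

            def add(self, x):
                self.buffer.append(x)
                if len(self.buffer) == self.size:
                    pts = np.asarray(self.buffer)
                    self.buffer = []
                    self._insert(0, pts, np.ones(len(pts)))

            def _insert(self, level, pts, wts):
                while level in self.levels:  # merge equal-level buckets, then re-compress
                    p2, w2 = self.levels.pop(level)
                    pts = np.vstack([pts, p2]); wts = np.concatenate([wts, w2])
                    pts, wts = self.compress(pts, wts, self.size)
                    level += 1
                self.levels[level] = (pts, wts)

            def coreset(self):
                """Union of all live buckets plus the current buffer."""
                parts = list(self.levels.values())
                if self.buffer:
                    parts.append((np.asarray(self.buffer), np.ones(len(self.buffer))))
                return (np.vstack([p for p, _ in parts]),
                        np.concatenate([w for _, w in parts]))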

    k-Means Clustering of Lines for Big Data

    The input to the $k$-means for lines problem is a set $L$ of $n$ lines in $\mathbb{R}^d$, and the goal is to compute a set of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum of squared distances over every line in $L$ to its nearest center. This is a straightforward generalization of the $k$-means problem, where the input is a set of $n$ points instead of lines. We suggest the first PTAS that computes a $(1+\epsilon)$-approximation to this problem in time $O(n \log n)$ for any constant approximation error $\epsilon \in (0, 1)$ and constant integers $k, d \geq 1$. This is done by proving that there is always a weighted subset (called a coreset) of $dk^{O(k)}\log(n)/\epsilon^2$ lines in $L$ that approximates the sum of squared distances from $L$ to any given set of $k$ points. Using the traditional merge-and-reduce technique, this coreset implies results for streaming a (possibly infinite) set of lines to $M$ machines in one pass (e.g., in the cloud) using memory, update time, and communication that are near-logarithmic in $n$, as well as deletion of any line, albeit using linear space. These results generalize to other distance functions, such as $k$-median (sum of distances), and to ignoring the farthest $m$ lines from the given centers in order to handle outliers. Experimental results on 10 machines on the Amazon EC2 cloud show that the algorithm performs well in practice. Open source code for all the algorithms and experiments is provided. This thesis is an extension of the following accepted paper: "$k$-Means Clustering of Lines for Big Data", by Yair Marom & Dan Feldman, Proceedings of NeurIPS 2019, to appear in December 2019.
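
    The central quantity here is the squared distance between a line and a point, which the coreset must preserve (in weighted sum) simultaneously for every candidate set of $k$ centers. Below is a small sketch of that computation and of evaluating the weighted cost of a line coreset against given centers; the coreset construction itself is the thesis's contribution and is not reproduced. Lines are represented as hypothetical `(p, u)` pairs with `u` a unit direction vector.

        import numpy as np

        def sq_dist_line_point(p, u, c):
            """Squared distance from point c to the line {p + t*u}, u a unit vector."""
            r = c - p
            return float(r @ r - (r @ u) ** 2)   # residual after projecting onto u

        def kmeans_lines_cost(lines, centers, weights=None):
            """Weighted k-means-for-lines cost: each line (p, u) contributes its squared
            distance to the nearest center, scaled by its coreset weight."""
            if weights is None:
                weights = np.ones(len(lines))
            cost = 0.0
            for (p, u), w in zip(lines, weights):
                cost += w * min(sq_dist_line_point(p, u, c) for c in centers)
            return cost

        # The coreset guarantee states that kmeans_lines_cost(coreset, C, coreset_weights)
        # is within a (1 +/- eps) factor of kmeans_lines_cost(all_lines, C) for every C.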

    Streamed Learning: One-Pass SVMs

    We present a streaming model for large-scale classification (in the context of the $\ell_2$-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The $\ell_2$-SVM is known to have an equivalent formulation in terms of the minimum enclosing ball (MEB) problem, and an efficient algorithm based on the idea of core sets exists (the Core Vector Machine, CVM). CVM learns a $(1+\varepsilon)$-approximate MEB for a set of points and yields an approximate solution to the corresponding SVM instance. However, CVM works in batch mode, requiring multiple passes over the data. This paper presents a single-pass SVM based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector in a way similar to using online stochastic gradient updates. Our algorithm performs polylogarithmic computation per example and requires very small, constant storage. Experimental results show that, even in such restrictive settings, we can learn efficiently in just one pass and obtain accuracies comparable to other state-of-the-art SVM solvers (batch and online). We also give an analysis of the algorithm and discuss some open issues and possible extensions.
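
    To give a flavor of the geometric primitive involved, a classic one-pass approximation of the minimum enclosing ball keeps only a center and a radius and, whenever a point falls outside the current ball, grows the ball just enough to contain both the old ball and the new point. The sketch below is this generic streaming MEB update, not the paper's SVM-specific update or its mapping from MEB to the SVM weight vector.

        import numpy as np

        def streaming_meb(points):
            """One-pass approximate minimum enclosing ball: keep (center, radius) and,
            for each point outside the ball, shift and grow the ball minimally so that
            it still contains the previous ball and the new point."""
            it = iter(points)
            center = np.asarray(next(it), dtype=float)
            radius = 0.0
            for x in it:
                x = np.asarray(x, dtype=float)
                dist = np.linalg.norm(x - center)
                if dist > radius:                        # x lies outside the current ball
                    new_radius = (dist + radius) / 2.0
                    center += (new_radius - radius) / dist * (x - center)
                    radius = new_radius
            return center, radius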