
    Training Support Vector Machines using Coresets

    We present a novel coreset construction algorithm for solving classification tasks using Support Vector Machines (SVMs) in a computationally efficient manner. A coreset is a weighted subset of the original data points that provably approximates the original set. We show that coresets of size polylogarithmic in $n$ and polynomial in $d$ exist for a set of $n$ input points with $d$ features, and present an $(\epsilon,\delta)$-FPRAS for constructing coresets for scalable SVM training. Our method leverages the insight that data points are often redundant and uses an importance sampling scheme based on the sensitivity of each data point to construct coresets efficiently. We evaluate the performance of our algorithm in accelerating SVM training on real-world data sets and compare it to state-of-the-art coreset approaches. Our empirical results show that our approach outperforms a state-of-the-art coreset approach and uniform sampling, enabling computational speedups while achieving low approximation error.
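
    As a rough illustration of the sensitivity-based importance sampling described above, the Python sketch below samples points with probability proportional to a given vector of sensitivity upper bounds and assigns the usual inverse-probability weights. The `sensitivities` input and the final weighted `LinearSVC` fit are assumptions for illustration; the paper's SVM-specific sensitivity bounds and construction are not reproduced here.

        import numpy as np
        from sklearn.svm import LinearSVC

        def sensitivity_coreset(X, y, sensitivities, m, rng=None):
            """Sample a weighted coreset of m points with probability proportional
            to a given sensitivity upper bound per point (assumed precomputed)."""
            rng = np.random.default_rng(rng)
            p = sensitivities / sensitivities.sum()      # sampling distribution
            idx = rng.choice(len(X), size=m, replace=True, p=p)
            w = 1.0 / (m * p[idx])                       # inverse-probability weights
            return X[idx], y[idx], w

        # Hypothetical usage (X, y, sensitivities assumed given):
        # Xc, yc, wc = sensitivity_coreset(X, y, sensitivities, m=1000)
        # clf = LinearSVC().fit(Xc, yc, sample_weight=wc)   # train on the small weighted set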

    Coresets for Vector Summarization with Applications to Network Graphs

    We provide a deterministic data summarization algorithm that approximates the mean $\bar{p}=\frac{1}{n}\sum_{p\in P} p$ of a set $P$ of $n$ vectors in $\mathbb{R}^d$ by a weighted mean $\tilde{p}$ of a subset of $O(1/\epsilon)$ vectors, i.e., a subset whose size is independent of both $n$ and $d$. We prove that the squared Euclidean distance between $\bar{p}$ and $\tilde{p}$ is at most $\epsilon$ multiplied by the variance of $P$. We use this algorithm to maintain an approximate sum of vectors from an unbounded stream, using memory that is independent of $d$ and logarithmic in the number $n$ of vectors seen so far. Our main application is to extract and represent, in a compact way, friend groups and activity summaries of users from underlying data exchanges. For example, in the case of mobile networks we can use GPS traces to identify meetings, and in the case of social networks we can use information exchanges to identify friend groups. Our algorithm provably identifies the Heavy Hitter entries in a proximity (adjacency) matrix, and these Heavy Hitters are then used to build the compact group and activity summaries. We evaluate the algorithm on several large data sets. Comment: ICML'201
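
    A guarantee of this shape, a weighted subset of $O(1/\epsilon)$ vectors whose weighted mean is within $\epsilon$ times the variance of $\bar{p}$, is the kind of bound a Frank-Wolfe-style greedy selection over the convex hull of $P$ yields. The sketch below is such a generic greedy procedure with exact line search, offered only as an illustration; it is not the paper's exact deterministic algorithm, and "variance" is taken here as the mean squared distance to $\bar{p}$.

        import numpy as np

        def greedy_mean_summary(P, eps):
            """Greedily build a sparse convex combination of rows of P that
            approximates the mean; a Frank-Wolfe-style sketch, not the paper's algorithm."""
            n, d = P.shape
            mean = P.mean(axis=0)
            var = ((P - mean) ** 2).sum(axis=1).mean()   # mean squared distance to the mean
            j = int(np.argmin(((P - mean) ** 2).sum(axis=1)))
            w = np.zeros(n); w[j] = 1.0                  # start from the point closest to the mean
            cur = P[j].copy()
            while ((cur - mean) ** 2).sum() > eps * var:
                # greedy direction: the point most aligned with the residual toward the mean
                j = int(np.argmax((P - cur) @ (mean - cur)))
                s = P[j] - cur
                # exact line search for min_t ||cur + t*s - mean||^2 over t in [0, 1]
                t = float(np.clip(s @ (mean - cur) / (s @ s), 0.0, 1.0))
                w *= (1 - t); w[j] += t
                cur = cur + t * s
            return w   # nonzero entries define the weighted subset with mean ~ p-bar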

    Practical Coreset Constructions for Machine Learning

    We investigate coresets - succinct, small summaries of large data sets - such that solutions found on the summary are provably competitive with solutions found on the full data set. We provide an overview of the state of the art in coreset construction for machine learning. In Section 2, we present both the intuition behind and a theoretically sound framework for constructing coresets for general problems, and apply it to $k$-means clustering. In Section 3, we summarize existing coreset construction algorithms for a variety of machine learning problems, such as maximum likelihood estimation of mixture models, Bayesian non-parametric models, principal component analysis, regression, and general empirical risk minimization.
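
    One concrete, widely used instance of such a sensitivity-style framework for $k$-means mixes uniform sampling with sampling proportional to the squared distance from the overall data mean (the "lightweight coreset" recipe). The sketch below follows that recipe as an illustration of the general approach; it is not claimed to be the specific framework developed in the survey.

        import numpy as np

        def lightweight_kmeans_coreset(X, m, rng=None):
            """Sample a weighted coreset for k-means: half uniform, half proportional
            to squared distance from the data mean (a standard sensitivity proxy)."""
            rng = np.random.default_rng(rng)
            n = len(X)
            d2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
            q = 0.5 / n + 0.5 * d2 / d2.sum()        # sampling distribution over points
            idx = rng.choice(n, size=m, replace=True, p=q)
            weights = 1.0 / (m * q[idx])             # inverse-probability weights
            return X[idx], weights

        # The weighted subset can then be passed to any k-means solver that accepts
        # sample weights, e.g. sklearn.cluster.KMeans(...).fit(Xc, sample_weight=wc).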

    A Deterministic Nonsmooth Frank Wolfe Algorithm with Coreset Guarantees

    We present a new Frank-Wolfe (FW) type algorithm that is applicable to minimization problems with a nonsmooth convex objective. We provide convergence bounds and show that the scheme yields so-called coreset results for various machine learning problems, including 1-median, Balanced Development, Sparse PCA, Graph Cuts, and the $\ell_1$-norm-regularized Support Vector Machine (SVM), among others. This means that the algorithm provides approximate solutions to these problems with time complexity bounds that do not depend on the size of the input problem. Our framework, motivated by a growing body of work on sublinear algorithms for various data analysis problems, is entirely deterministic and makes no use of smoothing or proximal operators. Beyond these theoretical results, we show experimentally that the algorithm is very practical and in some cases also offers significant computational advantages on large problem instances. We provide an open-source implementation that can be adapted to other problems that fit the overall structure.
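
    To make the "FW iterates as coresets" connection concrete: in standard Frank-Wolfe over the probability simplex, the iterate after $T$ steps is a convex combination of at most $T+1$ vertices, i.e., it touches at most $T+1$ data points. The sketch below shows the vanilla smooth version only, with a user-supplied gradient; the paper's contribution is a deterministic variant that retains such guarantees for nonsmooth objectives, which this sketch does not capture.

        import numpy as np

        def frank_wolfe_simplex(grad_f, n, T):
            """Vanilla Frank-Wolfe over the probability simplex in R^n.
            grad_f(w) returns the gradient of a smooth convex objective at w.
            The returned iterate has at most T+1 nonzero coordinates (the 'coreset')."""
            w = np.zeros(n)
            w[0] = 1.0                       # arbitrary starting vertex
            for t in range(T):
                g = grad_f(w)
                j = int(np.argmin(g))        # linear minimization oracle: best vertex
                gamma = 2.0 / (t + 2)        # standard FW step size
                w = (1 - gamma) * w
                w[j] += gamma
            return w

        # Hypothetical example: least squares over the simplex, f(w) = ||A w - b||^2.
        # A = np.random.randn(20, 1000); b = np.random.randn(20)
        # w = frank_wolfe_simplex(lambda w: 2 * A.T @ (A @ w - b), A.shape[1], T=50)
        # np.count_nonzero(w)   # at most T + 1 active columns of A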

    Improved Coresets for Kernel Density Estimates

    We study the construction of coresets for kernel density estimates. That is, we show how to approximate the kernel density estimate described by a large point set with another kernel density estimate over a much smaller point set. For characteristic kernels (including the Gaussian and Laplace kernels), our approximation preserves the $L_\infty$ error between kernel density estimates within error $\epsilon$, with coreset size $2/\epsilon^2$, and depends on no other aspects of the data, such as the dimension, the diameter of the point set, or the bandwidth of the kernel, which are common to other approximations. When the dimension is unrestricted, we show this bound is tight for these kernels as well as a much broader class. This work provides a careful analysis of the iterative Frank-Wolfe algorithm adapted to this context, an algorithm called kernel herding. This analysis unites a broad line of work that spans statistics, machine learning, and geometry. When the dimension $d$ is constant, we demonstrate much tighter bounds on the size of the coreset, specifically for Gaussian kernels, showing that it is bounded by the size of the coreset for axis-aligned rectangles. Currently the best known constructive bound is $O(\frac{1}{\epsilon} \log^d \frac{1}{\epsilon})$, and non-constructively this can be improved by $\sqrt{\log \frac{1}{\epsilon}}$. This improves the best constant-dimension bounds polynomially for $d \geq 3$.
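
    The kernel herding idea mentioned above can be written as a simple greedy loop: at each step, pick the candidate point where the gap between the full kernel density estimate and the current coreset's estimate is largest. The Python sketch below restricts candidates to the input set and uses a Gaussian kernel; it conveys the selection rule only, and omits the step-size and weighting details of the variant analyzed in the paper.

        import numpy as np

        def gauss_kernel(A, B, bw=1.0):
            """Gaussian kernel matrix between rows of A and rows of B."""
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * bw ** 2))

        def kernel_herding(P, m, bw=1.0):
            """Greedy herding-style selection of m points whose uniform KDE tracks
            the KDE of the full set P; a sketch of the idea, not the analyzed algorithm."""
            K = gauss_kernel(P, P, bw)
            target = K.mean(axis=1)          # full KDE evaluated at every candidate
            selected = []
            running = np.zeros(len(P))       # sum of kernel columns of selected points
            for t in range(m):
                # pick the point where the coreset KDE most under-shoots the full KDE
                scores = target - running / max(t, 1)
                j = int(np.argmax(scores))
                selected.append(j)
                running += K[:, j]
            return P[selected]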

    Coresets and Sketches

    Geometric data summarization has become an essential tool both in geometric approximation algorithms and where geometry intersects with big data problems. In linear or near-linear time, large data sets can be compressed into a summary, and then more intricate algorithms can be run on the summaries, whose results approximate those of the full data set. Coresets and sketches are the two most important classes of these summaries. We survey five types of coresets and sketches: shape-fitting, density estimation, high-dimensional vectors, high-dimensional point sets / matrices, and clustering. Comment: Near-final version of Chapter 49 in Handbook on Discrete and Computational Geometry, 3rd edition

    Automated Scalable Bayesian Inference via Hilbert Coresets

    The automation of posterior inference in Bayesian data analysis has enabled experts and nonexperts alike to use more sophisticated models, engage in faster exploratory modeling and analysis, and ensure experimental reproducibility. However, standard automated posterior inference algorithms are not tractable at the scale of massive modern datasets, and modifications to make them so are typically model-specific, require expert tuning, and can break theoretical guarantees on inferential quality. Building on the Bayesian coresets framework, this work instead takes advantage of data redundancy to shrink the dataset itself as a preprocessing step, providing fully automated, scalable Bayesian inference with theoretical guarantees. We begin with an intuitive reformulation of Bayesian coreset construction as sparse vector sum approximation, and demonstrate that its automation and performance-based shortcomings arise from the use of the supremum norm. To address these shortcomings we develop Hilbert coresets, i.e., Bayesian coresets constructed under a norm induced by an inner product on the log-likelihood function space. We propose two Hilbert coreset construction algorithms, one based on importance sampling and one based on the Frank-Wolfe algorithm, along with theoretical guarantees on approximation quality as a function of coreset size. Since the exact computation of the proposed inner products is model-specific, we automate the construction with a random finite-dimensional projection of the log-likelihood functions. The resulting automated coreset construction algorithm is simple to implement, and experiments on a variety of models with real and synthetic datasets show that it provides high-quality posterior approximations and a significant reduction in the computational cost of inference.
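
    As a rough sketch of the random-projection-plus-importance-sampling route described above: represent each data point's log-likelihood function by a finite-dimensional vector of its values at parameter samples drawn from some tractable approximation to the posterior, then sample points with probability proportional to the norm of that vector. The `loglik` callable and the source of `theta_samples` are assumptions here, and the weighting details of the paper's two algorithms differ from this sketch.

        import numpy as np

        def hilbert_is_coreset(data, loglik, theta_samples, m, rng=None):
            """Importance-sampling Bayesian coreset sketch.
            loglik(x, theta) -> scalar log-likelihood of data point x at parameter theta.
            theta_samples: draws from a tractable posterior approximation (assumed given)."""
            rng = np.random.default_rng(rng)
            # finite-dimensional projection of each point's log-likelihood function
            V = np.array([[loglik(x, th) for th in theta_samples] for x in data])
            sigma = np.linalg.norm(V, axis=1)        # per-point norms in projected space
            p = sigma / sigma.sum()
            idx = rng.choice(len(data), size=m, replace=True, p=p)
            weights = 1.0 / (m * p[idx])             # inverse-probability weights
            return idx, weights                      # run inference on the weighted subset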

    Coreset-Based Adaptive Tracking

    We propose a method for learning from streaming visual data using a compact, constant-size representation of all the data seen up to a given moment. Specifically, we construct a 'coreset' representation of the streaming data using a parallelized algorithm; the coreset approximates the original set with respect to the squared distances between this set and all other points in its ambient space. We learn an adaptive object appearance model from the coreset tree in constant time and logarithmic space and use it for object tracking by detection. Our method obtains excellent results for object tracking on three standard datasets covering more than 100 videos. The ability to summarize data efficiently makes our method ideally suited for tracking in long videos in the presence of space and time constraints. We demonstrate this ability by outperforming a variety of algorithms on the TLD dataset with 2685 frames on average. This coreset-based learning approach can be applied both for real-time learning from small, varied data and for fast learning from big data. Comment: 8 pages, 5 figures, in submission to IEEE TPAMI (Transactions on Pattern Analysis and Machine Intelligence)
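
    A coreset tree over a stream is typically organized as merge-and-reduce: buffer incoming points, compress each full buffer into a small coreset, and repeatedly merge and re-compress coresets of equal level, so that only logarithmically many compressed buckets are alive at any time. The skeleton below assumes a problem-specific `compress(points, weights, size)` routine (hypothetical here) and is independent of the tracking application; it is a generic sketch of the data structure, not the paper's parallelized construction.

        import numpy as np

        class CoresetTree:
            """Merge-and-reduce streaming skeleton keeping O(log n) buckets.
            `compress(points, weights, size)` is a problem-specific coreset routine
            (assumed given); it must return (points, weights) with at most `size` items."""

            def __init__(self, compress, size):
                self.compress = compress
                self.size = size
                self.buffer = []             # raw points not yet compressed
                self.levels = {}             # level -> (points, weights)

            def add(self, x):
                self.buffer.append(x)
                if len(self.buffer) == self.size:
                    pts = np.asarray(self.buffer)
                    self.buffer = []
                    self._insert(0, pts, np.ones(len(pts)))

            def _insert(self, level, pts, wts):
                while level in self.levels:  # merge equal-level buckets, then re-compress
                    p2, w2 = self.levels.pop(level)
                    pts = np.vstack([pts, p2]); wts = np.concatenate([wts, w2])
                    pts, wts = self.compress(pts, wts, self.size)
                    level += 1
                self.levels[level] = (pts, wts)

            def coreset(self):
                """Union of all live buckets plus the current buffer."""
                parts = list(self.levels.values())
                if self.buffer:
                    parts.append((np.asarray(self.buffer), np.ones(len(self.buffer))))
                return (np.vstack([p for p, _ in parts]),
                        np.concatenate([w for _, w in parts]))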

    k-Means Clustering of Lines for Big Data

    The input to the $k$-means for lines problem is a set $L$ of $n$ lines in $\mathbb{R}^d$, and the goal is to compute a set of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum of squared distances over every line in $L$ to its nearest center. This is a straightforward generalization of the $k$-means problem, where the input is a set of $n$ points instead of lines. We suggest the first PTAS that computes a $(1+\epsilon)$-approximation to this problem in time $O(n \log n)$ for any constant approximation error $\epsilon \in (0, 1)$ and constant integers $k, d \geq 1$. This is done by proving that there is always a weighted subset (called a coreset) of $dk^{O(k)}\log(n)/\epsilon^2$ lines in $L$ that approximates the sum of squared distances from $L$ to any given set of $k$ points. Using the traditional merge-and-reduce technique, this coreset implies results for streaming a (possibly infinite) set of lines to $M$ machines in one pass (e.g., in the cloud) using memory, update time, and communication that are near-logarithmic in $n$, as well as deletion of any line, albeit using linear space. These results generalize to other distance functions, such as $k$-median (sum of distances), and to ignoring the farthest $m$ lines from the given centers in order to handle outliers. Experimental results on 10 machines on the Amazon EC2 cloud show that the algorithm performs well in practice. Open source code for all the algorithms and experiments is provided. This thesis is an extension of the following accepted paper: "$k$-Means Clustering of Lines for Big Data", by Yair Marom & Dan Feldman, Proceedings of NeurIPS 2019, to appear in December 2019.
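
    The central quantity here is the squared distance between a line and a point, which the coreset must preserve (in weighted sum) simultaneously for every candidate set of $k$ centers. Below is a small sketch of that computation and of evaluating the weighted cost of a line coreset against given centers; the coreset construction itself is the thesis's contribution and is not reproduced. Lines are represented as hypothetical `(p, u)` pairs with `u` a unit direction vector.

        import numpy as np

        def sq_dist_line_point(p, u, c):
            """Squared distance from point c to the line {p + t*u}, u a unit vector."""
            r = c - p
            return float(r @ r - (r @ u) ** 2)   # residual after projecting onto u

        def kmeans_lines_cost(lines, centers, weights=None):
            """Weighted k-means-for-lines cost: each line (p, u) contributes its squared
            distance to the nearest center, scaled by its coreset weight."""
            if weights is None:
                weights = np.ones(len(lines))
            cost = 0.0
            for (p, u), w in zip(lines, weights):
                cost += w * min(sq_dist_line_point(p, u, c) for c in centers)
            return cost

        # The coreset guarantee states that kmeans_lines_cost(coreset, C, coreset_weights)
        # is within a (1 +/- eps) factor of kmeans_lines_cost(all_lines, C) for every C.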

    Streamed Learning: One-Pass SVMs

    We present a streaming model for large-scale classification (in the context of the $\ell_2$-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The $\ell_2$-SVM is known to have an equivalent formulation in terms of the minimum enclosing ball (MEB) problem, and an efficient algorithm based on the idea of core sets exists (the Core Vector Machine, CVM). CVM learns a $(1+\varepsilon)$-approximate MEB for a set of points and yields an approximate solution to the corresponding SVM instance. However, CVM works in batch mode, requiring multiple passes over the data. This paper presents a single-pass SVM based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector in a way similar to using online stochastic gradient updates. Our algorithm performs polylogarithmic computation per example and requires very small, constant storage. Experimental results show that, even in such restrictive settings, we can learn efficiently in just one pass and obtain accuracies comparable to other state-of-the-art SVM solvers (batch and online). We also give an analysis of the algorithm and discuss some open issues and possible extensions.
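
    To give a flavor of the geometric primitive involved, a classic one-pass approximation of the minimum enclosing ball keeps only a center and a radius and, whenever a point falls outside the current ball, grows the ball just enough to contain both the old ball and the new point. The sketch below is this generic streaming MEB update, not the paper's SVM-specific update or its mapping from MEB to the SVM weight vector.

        import numpy as np

        def streaming_meb(points):
            """One-pass approximate minimum enclosing ball: keep (center, radius) and,
            for each point outside the ball, shift and grow the ball minimally so that
            it still contains the previous ball and the new point."""
            it = iter(points)
            center = np.asarray(next(it), dtype=float)
            radius = 0.0
            for x in it:
                x = np.asarray(x, dtype=float)
                dist = np.linalg.norm(x - center)
                if dist > radius:                        # x lies outside the current ball
                    new_radius = (dist + radius) / 2.0
                    center += (new_radius - radius) / dist * (x - center)
                    radius = new_radius
            return center, radius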