Training Support Vector Machines using Coresets
We present a novel coreset construction algorithm for solving classification
tasks using Support Vector Machines (SVMs) in a computationally efficient
manner. A coreset is a weighted subset of the original data points that
provably approximates the original set. We show that coresets of size
polylogarithmic in n and polynomial in 1/\eps exist for a set of n input
points with d features, and present an (\eps, \delta)-FPRAS for
constructing coresets for scalable SVM training. Our method leverages the
insight that data points are often redundant and uses an importance sampling
scheme based on the sensitivity of each data point to construct coresets
efficiently. We evaluate the performance of our algorithm in accelerating SVM
training against real-world data sets and compare our algorithm to
state-of-the-art coreset approaches. Our empirical results show that our
approach outperforms a state-of-the-art coreset approach and uniform sampling
in enabling computational speedups while achieving low approximation error.
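The sensitivity-based importance sampling the abstract describes can be sketched generically as follows; this is an illustration of the sampling scheme, not the paper's exact construction, and `sensitivity_coreset` with its arguments is a hypothetical interface:

```python
import numpy as np

def sensitivity_coreset(X, sensitivities, m, rng=None):
    """Importance-sample a weighted coreset of m points.

    Each point is drawn with probability proportional to its
    (upper-bounded) sensitivity; the weight of a sampled point is the
    inverse of m times its sampling probability, which keeps the
    weighted cost an unbiased estimate of the full cost.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    p = sensitivities / sensitivities.sum()   # sampling distribution
    idx = rng.choice(n, size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])              # unbiasedness weights
    return X[idx], weights

# Toy check: with uniform sensitivities every weight equals n/m, and the
# weighted coreset mean is an unbiased estimate of the full mean.
X = np.arange(10.0).reshape(-1, 1)
C, w = sensitivity_coreset(X, np.ones(10), m=5, rng=0)
est = (w[:, None] * C).sum(axis=0) / len(X)
```

Redundant points (low sensitivity) are rarely sampled, which is exactly how the data redundancy mentioned above translates into small coresets.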
Coresets for Vector Summarization with Applications to Network Graphs
We provide a deterministic data summarization algorithm that approximates the
mean of a set of n vectors in
\REAL^d, by a weighted mean of a \emph{subset} of O(1/\eps)
vectors, i.e., of size independent of both n and d. We prove that the squared
Euclidean distance between the original mean and this weighted mean is at most \eps
multiplied by the variance of the set. We use this algorithm to maintain an
approximated sum of vectors from an unbounded stream, using memory that is
independent of d, and logarithmic in n, the number of vectors seen so far. Our main
application is to extract and represent in a compact way friend groups and
activity summaries of users from underlying data exchanges. For example, in the
case of mobile networks, we can use GPS traces to identify meetings, in the
case of social networks, we can use information exchange to identify friend
groups. Our algorithm provably identifies the {\it Heavy Hitter} entries in a
proximity (adjacency) matrix; these Heavy Hitters are what we use to extract
and represent the friend groups and activity summaries in a compact way. We
evaluate the algorithm on several large data sets.
Comment: ICML'201
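The core guarantee, approximating a mean by a sparse weighted subset with error bounded by \eps times the variance, can be illustrated with a Frank-Wolfe-style greedy sketch; this is a minimal illustrative version, not the paper's exact deterministic algorithm:

```python
import numpy as np

def mean_coreset(A, t):
    """Approximate the mean of the rows of A by a weighted mean of at
    most t of them: Frank-Wolfe on min_w ||A^T w - mean||^2 over the
    simplex adds at most one new row (atom) per iteration."""
    n, d = A.shape
    c = A.mean(axis=0)                  # target: the exact mean
    w = np.zeros(n)
    w[np.argmin(((A - c) ** 2).sum(axis=1))] = 1.0   # closest row
    for k in range(1, t):
        r = A.T @ w - c                 # current residual in R^d
        g = 2.0 * (A @ r)               # gradient w.r.t. w
        i = np.argmin(g)                # best simplex vertex
        gamma = 2.0 / (k + 2.0)
        w *= 1.0 - gamma
        w[i] += gamma
    return w                            # sparse weights; A.T @ w ~ mean

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
w = mean_coreset(A, t=100)
c = A.mean(axis=0)
err2 = float(((A.T @ w - c) ** 2).sum())
variance = float(((A - c) ** 2).sum(axis=1).mean())
```

The squared error decays like 1/t, which is the O(1/\eps) subset-size behavior the abstract states.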
Practical Coreset Constructions for Machine Learning
We investigate coresets - succinct, small summaries of large data sets - so
that solutions found on the summary are provably competitive with solutions
found on the full data set. We provide an overview of the state-of-the-art in
coreset construction for machine learning. In Section 2, we present both the
intuition behind and a theoretically sound framework to construct coresets for
general problems and apply it to k-means clustering. In Section 3 we
summarize existing coreset construction algorithms for a variety of machine
learning problems such as maximum likelihood estimation of mixture models,
Bayesian non-parametric models, principal component analysis, regression and
general empirical risk minimization.
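For the k-means instantiation of such a framework, sensitivities are commonly upper-bounded from a rough initial (bicriteria) solution; the two-term bound below is a simplified, hedged version of that standard recipe, with hypothetical function names:

```python
import numpy as np

def kmeans_sensitivities(X, centers):
    """Rough per-point sensitivity upper bounds for k-means from an
    initial solution: a point matters if it is far from its assigned
    center, or sits in a small cluster."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # n x k
    assign = d2.argmin(axis=1)
    cost_i = d2[np.arange(len(X)), assign]
    cluster_sz = np.bincount(assign, minlength=len(centers))
    # standard two-term upper bound (up to constants)
    return cost_i / max(cost_i.sum(), 1e-12) + 1.0 / cluster_sz[assign]

def kmeans_coreset(X, centers, m, rng=None):
    """Importance-sample m points with inverse-probability weights."""
    rng = np.random.default_rng(rng)
    s = kmeans_sensitivities(X, centers)
    p = s / s.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    return X[idx], 1.0 / (m * p[idx])

# Two well-separated clusters; sensitivities sum to 1 + #clusters.
X = np.array([[0., 0.], [1., 0.], [0., 1.],
              [10., 10.], [11., 10.], [10., 11.]])
centers = np.array([[0., 0.], [10., 10.]])
s = kmeans_sensitivities(X, centers)
C, w = kmeans_coreset(X, centers, m=4, rng=0)
```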
A Deterministic Nonsmooth Frank Wolfe Algorithm with Coreset Guarantees
We present a new Frank-Wolfe (FW) type algorithm that is applicable to
minimization problems with a nonsmooth convex objective. We provide convergence
bounds and show that the scheme yields so-called coreset results for various
Machine Learning problems including 1-median, Balanced Development, Sparse PCA,
Graph Cuts, and the \ell_1-norm-regularized Support Vector Machine (SVM),
among others. This means that the algorithm provides approximate solutions to
these problems in time complexity bounds that are not dependent on the size of
the input problem. Our framework, motivated by a growing body of work on
sublinear algorithms for various data analysis problems, is entirely
deterministic and makes no use of smoothing or proximal operators. Apart from
these theoretical results, we show experimentally that the algorithm is very
practical and in some cases also offers significant computational advantages on
large problem instances. We provide an open source implementation that can be
adapted for other problems that fit the overall structure.
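The coreset property of Frank-Wolfe, that each iterate is a convex combination of at most t+1 atoms returned by a linear minimization oracle (LMO), can be illustrated on a smooth toy instance; this generic loop is for illustration only and the paper's nonsmooth variant is more involved:

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, steps):
    """Generic Frank-Wolfe: each step calls the LMO over the feasible
    set and moves toward the returned atom, so after t steps the
    iterate is supported on at most t+1 atoms."""
    x = x0.copy()
    atoms = [x0.copy()]
    for t in range(steps):
        s = lmo(grad(x))                 # atom minimizing <grad, s>
        gamma = 2.0 / (t + 2.0)          # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s
        atoms.append(s)
    return x, atoms

# Toy use: minimize ||x - b||^2 over the probability simplex, whose
# LMO simply returns the vertex with the smallest gradient entry.
b = np.array([0.7, 0.2, 0.1])
grad = lambda x: 2.0 * (x - b)
lmo = lambda g: np.eye(len(g))[np.argmin(g)]
x, atoms = frank_wolfe(grad, lmo, np.ones(3) / 3.0, steps=200)
```

The O(1/t) convergence of this loop is what makes the time complexity independent of input size once the LMO is cheap.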
Improved Coresets for Kernel Density Estimates
We study the construction of coresets for kernel density estimates. That is,
we show how to approximate the kernel density estimate described by a large
point set by another kernel density estimate over a much smaller point set.
For characteristic kernels (including Gaussian and Laplace kernels), our
approximation preserves the L_\infty error between kernel density estimates
within error \eps, with coreset size O(1/\eps^2), but depends on no other
aspects of the data, including the dimension, the diameter of the point set,
or the bandwidth of the kernel, which are common to other approximations. When the dimension is
unrestricted, we show this bound is tight for these kernels as well as a much
broader set.
This work provides a careful analysis of the iterative Frank-Wolfe algorithm
adapted to this context, an algorithm called \emph{kernel herding}. This
analysis unites a broad line of work that spans statistics, machine learning,
and geometry.
When the dimension is constant, we demonstrate much tighter bounds on the
size of the coreset specifically for Gaussian kernels, showing that it is
bounded by the size of the coreset for axis-aligned rectangles, for which
both constructive and smaller non-constructive bounds are known. This
improves the best constant-dimension bounds polynomially.
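Kernel herding, the algorithm analyzed here, greedily adds the point whose kernel column best closes the gap between the full and coreset kernel density estimates; a minimal discrete version (with illustrative helper names) looks like this:

```python
import numpy as np

def gaussian_kernel(X, Y, bw=1.0):
    """Pairwise Gaussian kernel matrix between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

def kernel_herding(X, m, bw=1.0):
    """Greedily pick m points whose unweighted KDE tracks the full
    KDE: classic herding keeps the residual (t+1)*mu minus the sum of
    chosen kernel columns, and adds its argmax at each step."""
    K = gaussian_kernel(X, X, bw)
    mu = K.mean(axis=1)                 # full KDE at each data point
    chosen, running = [], np.zeros(len(X))
    for t in range(m):
        i = int(np.argmax((t + 1) * mu - running))
        chosen.append(i)
        running += K[:, i]
    return X[chosen]

# Two far-apart locations: herding first takes the denser one, then
# must cover the other to reduce the KDE gap.
X = np.array([[0.0], [0.0], [10.0]])
C = kernel_herding(X, 2)
```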
Coresets and Sketches
Geometric data summarization has become an essential tool in both geometric
approximation algorithms and where geometry intersects with big data problems.
In linear or near-linear time large data sets can be compressed into a summary,
and then more intricate algorithms can be run on the summaries whose results
approximate those of the full data set. Coresets and sketches are the two most
important classes of these summaries. We survey five types of coresets and
sketches: shape-fitting, density estimation, high-dimensional vectors,
high-dimensional point sets / matrices, and clustering.
Comment: Near-final version of Chapter 49 in Handbook on Discrete and Computational Geometry, 3rd edition.
Automated Scalable Bayesian Inference via Hilbert Coresets
The automation of posterior inference in Bayesian data analysis has enabled
experts and nonexperts alike to use more sophisticated models, engage in faster
exploratory modeling and analysis, and ensure experimental reproducibility.
However, standard automated posterior inference algorithms are not tractable at
the scale of massive modern datasets, and modifications to make them so are
typically model-specific, require expert tuning, and can break theoretical
guarantees on inferential quality. Building on the Bayesian coresets framework,
this work instead takes advantage of data redundancy to shrink the dataset
itself as a preprocessing step, providing fully-automated, scalable Bayesian
inference with theoretical guarantees. We begin with an intuitive reformulation
of Bayesian coreset construction as sparse vector sum approximation, and
demonstrate that its automation and performance-based shortcomings arise from
the use of the supremum norm. To address these shortcomings we develop Hilbert
coresets, i.e., Bayesian coresets constructed under a norm induced by an
inner-product on the log-likelihood function space. We propose two Hilbert
coreset construction algorithms---one based on importance sampling, and one
based on the Frank-Wolfe algorithm---along with theoretical guarantees on
approximation quality as a function of coreset size. Since the exact
computation of the proposed inner-products is model-specific, we automate the
construction with a random finite-dimensional projection of the log-likelihood
functions. The resulting automated coreset construction algorithm is simple to
implement, and experiments on a variety of models with real and synthetic
datasets show that it provides high-quality posterior approximations and a
significant reduction in the computational cost of inference.
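The importance-sampling construction combined with the random finite-dimensional projection might be sketched as follows; this is a simplification under assumed interfaces (the `loglik` signature and the exact weighting are illustrative, not the authors' construction):

```python
import numpy as np

def hilbert_is_coreset(loglik, X, thetas, m, rng=None):
    """Project each point's log-likelihood function onto its values at
    random parameter draws, then importance-sample points with
    probability proportional to the norm of that projection, with
    inverse-probability weights so the weighted sum is unbiased."""
    rng = np.random.default_rng(rng)
    V = np.array([[loglik(x, th) for th in thetas] for x in X])
    norms = np.linalg.norm(V, axis=1)       # projected Hilbert norms
    p = norms / norms.sum()
    counts = rng.multinomial(m, p)          # m i.i.d. draws, tallied
    idx = np.flatnonzero(counts)
    weights = counts[idx] / (m * p[idx])
    return idx, weights

# Toy model: Gaussian log-likelihood over a 1-d dataset, projected
# onto 5 random-ish parameter draws.
loglik = lambda x, th: -0.5 * (x - th) ** 2
X = np.linspace(-1.0, 1.0, 10)
thetas = np.linspace(-2.0, 2.0, 5)
idx, w = hilbert_is_coreset(loglik, X, thetas, m=20, rng=1)
```

The norm-proportional sampling is what replaces the supremum norm criticized above: influential log-likelihood functions are kept with high probability.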
Coreset-Based Adaptive Tracking
We propose a method for learning from streaming visual data using a compact,
constant size representation of all the data that was seen until a given
moment. Specifically, we use a parallelized algorithm to construct a
'coreset' representation of the streaming data: a small weighted subset that
approximates the original set with respect to the squared distances between
that set and any point in its ambient space. We learn an adaptive object appearance model from the coreset
tree in constant time and logarithmic space and use it for object tracking by
detection. Our method obtains excellent results for object tracking on three
standard datasets over more than 100 videos. The ability to summarize data
efficiently makes our method ideally suited for tracking in long videos in
presence of space and time constraints. We demonstrate this ability by
outperforming a variety of algorithms on the TLD dataset with 2685 frames on
average. This coreset based learning approach can be applied for both real-time
learning of small, varied data and fast learning of big data.
Comment: 8 pages, 5 figures, in submission to IEEE TPAMI (Transactions on Pattern Analysis and Machine Intelligence).
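The constant-time, logarithmic-space coreset tree referred to above follows the standard merge-and-reduce pattern; here is a generic sketch in which `reduce_fn` is a uniform-subsampling placeholder standing in for a real coreset construction:

```python
import numpy as np

class CoresetTree:
    """Merge-and-reduce streaming sketch: leaves are blocks of raw
    points; whenever two summaries of the same level exist they are
    merged and re-compressed (like a binary counter), so only
    O(log n) summaries are ever kept."""

    def __init__(self, reduce_fn, leaf_size):
        self.reduce_fn = reduce_fn
        self.leaf_size = leaf_size
        self.buffer = []
        self.levels = {}            # level -> summary array

    def add(self, x):
        self.buffer.append(x)
        if len(self.buffer) == self.leaf_size:
            self._push(0, np.array(self.buffer))
            self.buffer = []

    def _push(self, level, summary):
        while level in self.levels:     # carry, as in binary addition
            summary = self.reduce_fn(
                np.vstack([self.levels.pop(level), summary]))
            level += 1
        self.levels[level] = summary

    def summary(self):
        parts = list(self.levels.values())
        if self.buffer:
            parts.append(np.array(self.buffer))
        return np.vstack(parts)

# Stream 1000 points with 32-point leaves and 32-point reductions.
tree = CoresetTree(lambda S: S[np.linspace(0, len(S) - 1, 32).astype(int)],
                   leaf_size=32)
rng = np.random.default_rng(0)
for x in rng.normal(size=(1000, 2)):
    tree.add(x)
```

After 1000 insertions the tree holds 5 level summaries plus a partial buffer, so the full summary is far smaller than the stream.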
k-Means Clustering of Lines for Big Data
The input to the k-means for lines problem is a set L of n lines in
\REAL^d, and the goal is to compute a set of k centers (points) in
\REAL^d that minimizes the sum of squared distances over every line in L
and its nearest center. This is a straightforward generalization of the
k-means problem, where the input is a set of n points instead of lines.
We suggest the first PTAS that computes a (1+\eps)-approximation to
this problem for any constant approximation error
\eps \in (0, 1) and constant integers k and d. This is by proving
that there is always a weighted subset (called a coreset) of the lines in L that approximates the sum of squared distances
from L to any given set of k points.
Using the traditional merge-and-reduce technique, this coreset implies results
for a streaming set (possibly infinite) of lines distributed to M machines in one pass
(e.g. on the cloud), using memory, update time and communication that are
near-logarithmic in n, and supports deletion of any line, but using linear
space. These results generalize to other distance functions, such as
k-median (sum of distances), or to ignoring the farthest lines from the given
k centers to handle outliers.
Experimental results on 10 machines on Amazon EC2 cloud show that the
algorithm performs well in practice. Open source code for all the algorithms
and experiments is also provided.
This thesis is an extension of the following accepted paper: "k-Means
Clustering of Lines for Big Data", by Yair Marom & Dan Feldman, Proceedings of
the NeurIPS 2019 conference, to appear in December 2019.
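The cost the abstract optimizes, the sum over lines of the squared distance to the nearest center, reduces to point-to-line projections; a small sketch follows, where representing a line as a (point, unit direction) pair is an assumption of this illustration:

```python
import numpy as np

def line_point_dist2(p, u, c):
    """Squared distance from point c to the line {p + t*u},
    with u a unit-norm direction: subtract the projection onto u."""
    v = c - p
    perp = v - (v @ u) * u
    return float(perp @ perp)

def lines_cost(lines, centers):
    """Sum over lines of the squared distance to the nearest center."""
    return sum(min(line_point_dist2(p, u, c) for c in centers)
               for p, u in lines)

# The x-axis and the y-axis (through (0,1)), one center at (0,3):
# distances are 3 and 0, so the cost is 9.
lines = [(np.array([0.0, 0.0]), np.array([1.0, 0.0])),
         (np.array([0.0, 1.0]), np.array([0.0, 1.0]))]
centers = [np.array([0.0, 3.0])]
cost = lines_cost(lines, centers)
```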
Streamed Learning: One-Pass SVMs
We present a streaming model for large-scale classification (in the context
of L_2-SVM) by leveraging connections between learning and computational
geometry. The streaming model imposes the constraint that only a single pass
over the data is allowed. The L_2-SVM is known to have an equivalent
formulation in terms of the minimum enclosing ball (MEB) problem, and an
efficient algorithm based on the idea of \emph{core sets} exists (Core Vector
Machine, CVM). CVM learns a (1+\eps)-approximate MEB for a set of n
points and yields an approximate solution to the corresponding SVM instance.
However, CVM works in batch mode, requiring multiple passes over the data. This
paper presents a single-pass SVM which is based on the minimum enclosing ball
of streaming data. We show that the MEB updates for the streaming case can be
easily adapted to learn the SVM weight vector in a way similar to using online
stochastic gradient updates. Our algorithm performs polylogarithmic computation
at each example, and requires very small and constant storage. Experimental
results show that, even in such restrictive settings, we can learn efficiently
in just one pass and get accuracies comparable to other state-of-the-art SVM
solvers (batch and online). We also give an analysis of the algorithm and
discuss some open issues and possible extensions.
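The single-pass MEB maintenance underlying this approach can be illustrated by the classic online expansion rule; this is a simplified sketch of MEB updating only, not the paper's exact update for the SVM weight vector:

```python
import numpy as np

def streaming_meb(points):
    """One-pass enclosing-ball update: when a point falls outside the
    current ball, move the center halfway into the gap and grow the
    radius by the same amount, so every point seen so far stays
    covered (the old ball is contained in the new one)."""
    it = iter(points)
    c = np.asarray(next(it), dtype=float)
    r = 0.0
    for x in it:
        x = np.asarray(x, dtype=float)
        dist = np.linalg.norm(x - c)
        gap = dist - r
        if gap > 0:
            step = gap / 2.0
            c = c + step * (x - c) / dist
            r += step
    return c, r

# (0,0) then (2,0) gives the ball centered at (1,0) with radius 1;
# (1,1) lies exactly on its boundary, so nothing changes.
c, r = streaming_meb([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
```

Each update is O(d) per example with constant storage, matching the small-memory regime the abstract targets.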