Resource Efficient Large-Scale Machine Learning
Non-parametric models provide a principled way to learn non-linear functions. In particular, kernel methods are accurate prediction tools that rely on solid theoretical foundations. Although they enjoy optimal statistical properties, they have limited applicability in real-world large-scale scenarios because of their stringent computational requirements in terms of time and memory. Indeed, their computational costs scale at least quadratically with the number of points in the dataset, while many modern machine learning challenges require training on datasets of millions, if not billions, of points. In this thesis, we focus on scaling kernel methods, developing novel algorithmic solutions that incorporate budgeted computations. To derive these algorithms we mix ideas from statistics, optimization, and randomized linear algebra. We study the statistical and computational trade-offs for various non-parametric models, the key ingredient for deriving numerical solutions with resources tailored to the statistical accuracy allowed by the data. In particular, we study the estimator defined by stochastic gradients and random features, showing how all the free parameters provably govern both the statistical properties and the computational complexity of the algorithm. We then show how to blend the Nyström approximation and preconditioned conjugate gradient to derive a provably statistically optimal solver that easily scales to datasets of millions of points on a single machine. We also derive a provably accurate leverage score sampling algorithm that can further improve the latter solver. Finally, we show how the Nyström approximation with leverage scores can be used to scale Gaussian processes in a bandit optimization setting, deriving a provably accurate algorithm. The theoretical analysis and the new algorithms presented in this work represent a step towards a new generation of efficient non-parametric algorithms with minimal time and memory footprints.
FALKON: An Optimal Large Scale Kernel Method
Kernel methods provide a principled way to perform non-linear, nonparametric learning. They rely on solid functional-analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large-scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that can efficiently process millions of points. FALKON is derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers, and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially O(n) memory and O(n√n) time. An extensive experimental analysis on large-scale datasets shows that, even on a single machine, FALKON outperforms previous state-of-the-art solutions, which exploit parallel/distributed architectures. (NIPS 2017.)
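To make the recipe concrete, below is a minimal sketch of a FALKON-style solver: Nyström subsampling of m centers, the resulting normal equations solved by conjugate gradient, and a preconditioner built from the small m-by-m kernel block. The Gaussian kernel, the function and variable names, and the simplified pseudo-inverse preconditioner (a stand-in for the paper's Cholesky-based construction) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel between the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def nystrom_krr_cg(X, y, m=500, lam=1e-3, sigma=1.0, maxiter=30, seed=0):
    """Kernel ridge regression restricted to m Nystrom centers, solved by
    preconditioned conjugate gradient (a simplified FALKON-like solver)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, size=min(m, n), replace=False)]
    Knm = gaussian_kernel(X, centers, sigma)        # n x m
    Kmm = gaussian_kernel(centers, centers, sigma)  # m x m
    # Normal equations: (Knm^T Knm + lam * n * Kmm) alpha = Knm^T y.
    A = LinearOperator((len(centers), len(centers)),
                       matvec=lambda v: Knm.T @ (Knm @ v) + lam * n * (Kmm @ v))
    # Cheap preconditioner using Kmm only: approximates Knm^T Knm by (n/m) Kmm^2.
    # (FALKON builds an analogous preconditioner via Cholesky factors of Kmm.)
    P = np.linalg.pinv((n / len(centers)) * Kmm @ Kmm + lam * n * Kmm)
    M = LinearOperator(A.shape, matvec=lambda v: P @ v)
    alpha, _ = cg(A, Knm.T @ y, M=M, maxiter=maxiter)
    return lambda Xnew: gaussian_kernel(Xnew, centers, sigma) @ alpha
```

Only the m-by-m matrices are ever factorized; the n-by-m block enters through matrix-vector products, which is what keeps the memory footprint roughly linear in n.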
Learning with SGD and Random Features
Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large-scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini-batches and random features. The latter can be seen as a form of nonlinear sketching and can be used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as the number of features, iterations, step-size, and mini-batch size, control the learning properties of the solutions. We do this by deriving optimal finite-sample bounds under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
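As an illustration of the kind of estimator being analyzed, here is a minimal sketch of mini-batch SGD on random Fourier features for a Gaussian kernel, with no explicit penalty: the number of features, iterations, step-size, and mini-batch size are the only knobs. The function names, default values, and the squared loss are assumptions made for the sketch, not choices fixed by the paper.

```python
import numpy as np

def random_fourier_features(X, W, b):
    # Random Fourier features approximating a Gaussian kernel:
    # phi(x) = sqrt(2/M) * cos(W^T x + b), W ~ N(0, 1/sigma^2), b ~ Unif[0, 2*pi].
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

def sgd_random_features(X, y, n_features=300, sigma=1.0, step=0.5,
                        batch_size=32, n_iter=2000, seed=0):
    """Least-squares SGD on random features with no explicit penalty;
    regularization is implicit in the number of features, the number of
    iterations, the step-size and the mini-batch size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    w = np.zeros(n_features)
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=batch_size)
        Phi = random_fourier_features(X[idx], W, b)      # batch_size x M
        grad = Phi.T @ (Phi @ w - y[idx]) / batch_size   # squared-loss gradient
        w -= step * grad
    return lambda Xnew: random_fourier_features(Xnew, W, b) @ w
```

The point of the paper's analysis is precisely how these knobs jointly control the statistical error and the computational cost of such a solution.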
Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret
Gaussian processes (GP) are a well-studied Bayesian approach for the optimization of black-box functions. Despite their effectiveness in simple problems, GP-based algorithms hardly scale to high-dimensional functions, as their per-iteration time and space cost is at least quadratic in the number of dimensions d and iterations t. Given a set of A alternatives to choose from, the overall runtime O(t^3 A) is prohibitive. In this paper we introduce BKB (budgeted kernelized bandit), a new approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and, remarkably, no assumption on the input space or covariance of the GP. We combine a kernelized linear bandit algorithm (GP-UCB) with randomized matrix sketching based on leverage score sampling, and we prove that randomly sampling inducing points based on their posterior variance gives an accurate low-rank approximation of the GP, preserving variance estimates and confidence intervals. As a consequence, BKB does not suffer from variance starvation, an important problem faced by many previous sparse GP approximations. Moreover, we show that our procedure selects at most Õ(d_eff) points, where d_eff is the effective dimension of the explored space, which is typically much smaller than both d and t. This greatly reduces the dimensionality of the problem, thus leading to an Õ(T A d_eff^2) runtime and Õ(A d_eff) space complexity. (Accepted at COLT 2019.)
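A highly simplified sketch of the idea follows: a GP-UCB loop over a finite arm set, where the posterior is computed from a small inducing set that is resampled every round with inclusion probability proportional to the current approximate posterior variance. The RBF kernel, the DTC-style posterior formulas, and all parameter names are illustrative assumptions; the actual algorithm and its confidence widths are specified in the paper.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def sparse_posterior(Xq, X, y, Z, lam):
    # DTC-style approximate GP posterior at query points Xq, using only the
    # inducing points Z (never the full kernel matrix over all evaluations).
    Kzq, Kzx = rbf(Z, Xq), rbf(Z, X)
    Kzz = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
    A = Kzx @ Kzx.T + lam * Kzz
    mean = Kzq.T @ np.linalg.solve(A, Kzx @ y)
    var = (1.0                                            # k(x, x) = 1 for RBF
           - np.sum(Kzq * np.linalg.solve(Kzz, Kzq), axis=0)
           + lam * np.sum(Kzq * np.linalg.solve(A, Kzq), axis=0))
    return mean, np.maximum(var, 0.0)

def bkb_sketch(arms, reward_fn, T=100, lam=1.0, beta=2.0, qbar=2.0, seed=0):
    """Toy budgeted kernelized bandit loop: GP-UCB arm selection on a sparse
    posterior whose inducing set is resampled every round with probability
    proportional to the (approximate) posterior variance."""
    rng = np.random.default_rng(seed)
    X = [arms[rng.integers(len(arms))]]
    y = [reward_fn(X[0])]
    Z = np.array(X)                      # initial inducing set: the first point
    for _ in range(1, T):
        mean, var = sparse_posterior(arms, np.array(X), np.array(y), Z, lam)
        i = int(np.argmax(mean + beta * np.sqrt(var)))    # optimistic choice
        X.append(arms[i]); y.append(reward_fn(arms[i]))
        # Re-draw inducing points among evaluated arms; high-variance points
        # (the analogue of high leverage scores) are kept with higher probability.
        _, var_X = sparse_posterior(np.array(X), np.array(X), np.array(y), Z, lam)
        keep = rng.random(len(X)) < np.minimum(1.0, qbar * var_X)
        Z = np.array(X)[keep] if keep.any() else np.array(X)[-1:]
    return np.array(X), np.array(y)
```

The inducing set stays small because the posterior variance shrinks wherever the function has already been explored, which is the mechanism behind the bound on the number of selected points.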
On Fast Leverage Score Sampling and Optimal Learning
Leverage score sampling provides an appealing way to perform approximate computations for large matrices. Indeed, it allows one to derive faithful approximations with a complexity adapted to the problem at hand. Yet, performing leverage score sampling is a challenge in its own right, requiring further approximations. In this paper, we study the problem of leverage score sampling for positive definite matrices defined by a kernel. Our contribution is twofold: first, we provide a novel algorithm for leverage score sampling; second, we exploit the proposed method in statistical learning by deriving a novel solver for kernel ridge regression. Our main technical contribution is showing that the proposed algorithms are currently the most efficient and accurate for these problems.
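For intuition, the sketch below computes exact ridge leverage scores of a kernel matrix and uses them as Nyström sampling probabilities. Computing the exact scores costs O(n^3), which is precisely the cost that a fast sampling algorithm must avoid (for instance by estimating the scores from a small, previously sampled dictionary); this snippet only illustrates what the scores are and how they drive sampling, and its helper names are hypothetical.

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    """Exact ridge leverage scores l_i(lam) = [K (K + lam*n*I)^{-1}]_{ii}.
    Cubic in n -- shown only as a reference implementation."""
    n = K.shape[0]
    # By symmetry, diag(K A^{-1}) = diag(A^{-1} K) with A = K + lam*n*I.
    return np.diagonal(np.linalg.solve(K + lam * n * np.eye(n), K))

def leverage_score_nystrom(X, kernel, lam=1e-3, m=100, seed=0):
    """Pick m Nystrom centers with probability proportional to their ridge
    leverage scores; returns the centers and the sampled indices."""
    rng = np.random.default_rng(seed)
    K = kernel(X, X)
    scores = ridge_leverage_scores(K, lam)
    p = scores / scores.sum()
    idx = rng.choice(len(X), size=min(m, len(X)), replace=False, p=p)
    return X[idx], idx
```

Sampling centers this way concentrates the budget on points that the kernel matrix cannot represent well from the others, which is what makes the resulting Nyström approximation, and a kernel ridge regression solver built on it, accurate with few centers.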