8 research outputs found
Statistical and Computational Trade-Offs in Kernel K-Means
We investigate the efficiency of k-means in terms of both statistical and computational requirements. More precisely, we study a Nyström approach to kernel k-means. We analyze the statistical properties of the proposed method and show that it achieves the same accuracy as exact kernel k-means with only a fraction of the computations. Indeed, we prove under basic assumptions that sampling $\sqrt{n}$ Nyström landmarks greatly reduces computational costs without incurring any loss of accuracy. To the best of our knowledge, this is the first result of this kind for unsupervised learning.
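To make the Nyström approach concrete, here is a minimal sketch in Python with NumPy; it is not the authors' implementation. It assumes a Gaussian kernel, uniform landmark sampling, and plain Lloyd iterations with a farthest-first initialization. The function names (`gaussian_kernel`, `nystrom_features`, `kmeans`) and all default parameters are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def nystrom_features(X, m, sigma=1.0, seed=0):
    """Map X to at most m features built from m uniformly sampled landmarks."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    K_mm = gaussian_kernel(landmarks, landmarks, sigma)
    K_nm = gaussian_kernel(X, landmarks, sigma)
    # Apply K_mm^{-1/2} through an eigendecomposition, dropping
    # near-zero eigenvalues for numerical stability.
    w, V = np.linalg.eigh(K_mm)
    keep = w > 1e-10 * w.max()
    return K_nm @ (V[:, keep] / np.sqrt(w[keep]))

def kmeans(Z, k, n_iter=50, seed=0):
    """Lloyd's algorithm on explicit features (farthest-first initialization)."""
    rng = np.random.default_rng(seed)
    centers = [Z[rng.integers(len(Z))]]
    for _ in range(k - 1):
        d = np.min([((Z - c)**2).sum(1) for c in centers], axis=0)
        centers.append(Z[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :])**2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(0)
    # Final assignment with the updated centers.
    d = ((Z[:, None, :] - centers[None, :, :])**2).sum(-1)
    return d.argmin(1), centers
```

With n points, calling `nystrom_features(X, m=int(np.ceil(np.sqrt(len(X)))))` followed by `kmeans(Z, k)` runs k-means on an n-by-m feature matrix rather than on the full n-by-n kernel matrix, which is where the computational saving comes from.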
Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret
Gaussian processes (GP) are a well studied Bayesian approach for the
optimization of black-box functions. Despite their effectiveness in simple
problems, GP-based algorithms hardly scale to high-dimensional functions, as
their per-iteration time and space cost is at least quadratic in the number of
dimensions and iterations. Given a set of alternatives to choose
from, the overall runtime is prohibitive. In this paper we introduce
BKB (budgeted kernelized bandit), a new approximate GP algorithm for
optimization under bandit feedback that achieves near-optimal regret (and hence
near-optimal convergence rate) with near-constant per-iteration complexity and
remarkably no assumption on the input space or covariance of the GP.
We combine a kernelized linear bandit algorithm (GP-UCB) with randomized
matrix sketching based on leverage score sampling, and we prove that randomly
sampling inducing points based on their posterior variance gives an accurate
low-rank approximation of the GP, preserving variance estimates and confidence
intervals. As a consequence, BKB does not suffer from variance starvation, an
important problem faced by many previous sparse GP approximations. Moreover, we
show that our procedure selects a number of points bounded in terms of the
effective dimension of the explored space, which is typically much smaller
than both the number of dimensions and the number of iterations. This greatly
reduces the dimensionality of the problem, thus leading to lower runtime and
space complexity.
Comment: Accepted at COLT 2019. Corrected typos and improved comparison with
existing method.
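The link between posterior variance, leverage scores, and effective dimension can be illustrated with a small sketch. This is not the BKB algorithm: it uses exact posterior variances computed from the full kernel matrix, whereas the abstract describes an adaptive, approximate procedure. It assumes an RBF kernel, a regularization (noise) parameter `lam`, and one standard definition of the effective dimension, tr(K (K + lam I)^{-1}). All function names and the `oversample` factor are hypothetical.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """RBF kernel matrix between all pairs of rows of X."""
    sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    return np.exp(-sq / (2.0 * lengthscale**2))

def posterior_variances(K, lam):
    """Diagonal of the GP posterior covariance K - K (K + lam I)^{-1} K."""
    n = len(K)
    return np.diag(K - K @ np.linalg.solve(K + lam * np.eye(n), K)).copy()

def effective_dimension(K, lam):
    """Assumed definition: trace of K (K + lam I)^{-1}."""
    n = len(K)
    return np.trace(K @ np.linalg.inv(K + lam * np.eye(n)))

def sample_inducing_points(K, lam, oversample=2.0, seed=0):
    """Keep each point independently with probability proportional to its
    posterior variance (equivalently, its ridge leverage score)."""
    rng = np.random.default_rng(seed)
    probs = np.minimum(1.0, oversample * posterior_variances(K, lam) / lam)
    return np.flatnonzero(rng.random(len(K)) < probs)
```

Because K - K (K + lam I)^{-1} K = lam K (K + lam I)^{-1}, the posterior variances divided by `lam` sum to the effective dimension, so the expected number of sampled points in this sketch is at most `oversample` times that quantity.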
On Generalization Bounds for Projective Clustering
Given a set of points, clustering consists of finding a partition of a point
set into clusters such that the center to which a point is assigned is as
close as possible. Most commonly, centers are points themselves, which leads to
the famous k-median and k-means objectives. One may also choose centers to
be subspaces of a fixed dimension, which gives rise to subspace clustering. In
this paper, we consider learning bounds for these problems. That is, given a
set of samples drawn independently from some unknown but fixed distribution,
how quickly does a solution computed on the samples converge to the
optimal clustering of the distribution? We give several near-optimal results. In
particular,
For center-based objectives, we show a convergence rate that matches the
known optimal bounds
of [Fefferman, Mitter, and Narayanan, Journal of the Mathematical Society 2016]
and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for k-means
and extends them to other important objectives such as k-median.
For subspace clustering with subspaces of a fixed dimension, we also establish
a convergence rate. These are the first
provable bounds for most of these problems. For the specific case of projective
clustering, which generalizes k-means, we show a lower bound on the
convergence rate, thereby proving that the
bounds from [Fefferman, Mitter, and Narayanan, Journal of the Mathematical
Society 2016] are essentially optimal.
Resource Efficient Large-Scale Machine Learning
Non-parametric models provide a principled way to learn non-linear functions. In particular, kernel methods are accurate prediction tools that rely on solid theoretical foundations. Although they enjoy optimal statistical properties, they have limited applicability in real-world large-scale scenarios because of their stringent computational requirements in terms of time and memory. Indeed, their computational costs scale at least quadratically with the number of points in the dataset, and many modern machine learning challenges require training on datasets of millions if not billions of points. In this thesis, we focus on scaling kernel methods, developing novel algorithmic solutions that incorporate budgeted computations. To derive these algorithms we mix ideas from statistics, optimization, and randomized linear algebra. We study the statistical and computational trade-offs for various non-parametric models, the key component to derive numerical solutions with resources tailored to the statistical accuracy allowed by the data. In particular, we study the estimator defined by stochastic gradients and random features, showing how all the free parameters provably govern both the statistical properties and the computational complexity of the algorithm. We then see how to blend the Nyström approximation and preconditioned conjugate gradient to derive a provably statistically optimal solver that can easily scale to datasets of millions of points on a single machine. We also derive a provably accurate leverage score sampling algorithm that can further improve the latter solver. Finally, we see how the Nyström approximation with leverage scores can be used to scale Gaussian processes in a bandit optimization setting, deriving a provably accurate algorithm. The theoretical analysis and the new algorithms presented in this work represent a step towards building a new generation of efficient non-parametric algorithms with minimal time and memory footprints.
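As an illustration of the estimator defined by stochastic gradients and random features, the following is a minimal sketch rather than the algorithm analyzed in the thesis. It assumes random Fourier features for a Gaussian kernel, a squared loss, and mini-batch SGD with a constant step size. The function names and the specific free parameters exposed here (`n_features`, `step`, `batch_size`, `n_passes`) are hypothetical choices for illustration.

```python
import numpy as np

def random_fourier_features(X, W, b):
    """Random Fourier feature map approximating a Gaussian kernel."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def sgd_random_features(X, y, n_features=100, sigma=0.5, step=0.5,
                        batch_size=10, n_passes=30, seed=0):
    """Least squares on random features, trained with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the spectral density of the Gaussian kernel.
    W = rng.normal(0.0, 1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Z = random_fourier_features(X, W, b)
    w = np.zeros(n_features)
    for _ in range(n_passes):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            residual = Z[batch] @ w - y[batch]
            # Averaged gradient of the squared loss on the mini-batch.
            w -= step * Z[batch].T @ residual / len(batch)
    return lambda X_new: random_fourier_features(X_new, W, b) @ w
```

In this sketch the number of features controls memory and per-step cost, while the step size, mini-batch size, and number of passes control how long training runs; these are the kinds of free parameters whose joint effect on accuracy and cost the thesis studies.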