549 research outputs found
Learning with SGD and Random Features
Sketching and stochastic gradient methods are arguably the most common
techniques to derive efficient large scale learning algorithms. In this paper,
we investigate their application in the context of nonparametric statistical
learning. More precisely, we study the estimator defined by stochastic gradient
with mini-batches and random features. The latter can be seen as a form of
nonlinear sketching and used to define approximate kernel methods. The
considered estimator is not explicitly penalized/constrained and regularization
is implicit. Indeed, our study highlights how different parameters, such as
the number of features, iterations, step-size, and mini-batch size, control the
learning properties of the solutions. We do this by deriving optimal finite
sample bounds, under standard assumptions. The obtained results are
corroborated and illustrated by numerical experiments.
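A minimal Python sketch of the kind of estimator studied above, assuming
random Fourier features for a Gaussian kernel and mini-batch SGD on the
squared loss with no explicit penalty; the function name fit_sgd_rf, the
constant step size, and all default values are illustrative assumptions
rather than the paper's prescribed settings.

    import numpy as np

    def fit_sgd_rf(X, y, n_features=200, n_iters=500, batch_size=16,
                   step_size=1.0, sigma=1.0, seed=0):
        """Mini-batch SGD on random Fourier features with the squared loss;
        no explicit penalty, so regularization is implicit in the number of
        features, iterations, step size and mini-batch size."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # Random Fourier features approximating the Gaussian kernel.
        W = rng.normal(0.0, 1.0 / sigma, size=(d, n_features))
        b = rng.uniform(0.0, 2 * np.pi, size=n_features)
        phi = lambda A: np.sqrt(2.0 / n_features) * np.cos(A @ W + b)

        w = np.zeros(n_features)
        for _ in range(n_iters):
            idx = rng.integers(0, n, size=batch_size)
            Z = phi(X[idx])                      # (batch_size, n_features)
            w -= step_size * Z.T @ (Z @ w - y[idx]) / batch_size
        return w, phi

Predictions on new points X_test are then phi(X_test) @ w.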
Generalization Properties of Doubly Stochastic Learning Algorithms
Doubly stochastic learning algorithms are scalable kernel methods that
perform very well in practice. However, their generalization properties are not
well understood and their analysis is challenging since the corresponding
learning sequence may not be in the hypothesis space induced by the kernel. In
this paper, we provide an in-depth theoretical analysis for different variants
of doubly stochastic learning algorithms within the setting of nonparametric
regression in a reproducing kernel Hilbert space and considering the square
loss. Particularly, we derive convergence results on the generalization error
for the studied algorithms either with or without an explicit penalty term. To
the best of our knowledge, the derived results for the unregularized variants
are the first of this kind, while the results for the regularized variants
improve those in the literature. The novelties in our proof are a sample error
bound that requires controlling the trace norm of a cumulative operator, and a
refined analysis bounding the initial error.
Comment: 24 pages. To appear in Journal of Complexity.
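As a schematic point of reference (not a verbatim transcription of the
algorithms analyzed in the paper), a doubly stochastic iteration for the
square loss draws one random training point $(x_t, y_t)$ and one random
feature $\omega_t$ per step, with $k(x, x') =
\mathbb{E}_\omega[\phi_\omega(x)\phi_\omega(x')]$, and updates
\[
f_{t+1} \;=\; (1 - \gamma_t \lambda)\, f_t \;-\; \gamma_t \big(f_t(x_t) - y_t\big)\, \phi_{\omega_t}(x_t)\, \phi_{\omega_t}(\cdot),
\]
where $\gamma_t$ is the step size and $\lambda \ge 0$ is the explicit
penalty; setting $\lambda = 0$ gives the unregularized variants discussed
above. The iterates are combinations of the random features $\phi_{\omega_t}$
and hence need not lie in the hypothesis space induced by $k$, which is the
analytical difficulty mentioned above.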
Leveraging Low-Rank Relations Between Surrogate Tasks in Structured Prediction
We study the interplay between surrogate methods for structured prediction
and techniques from multitask learning designed to leverage relationships
between surrogate outputs. We propose an efficient algorithm based on trace
norm regularization which, differently from previous methods, does not require
explicit knowledge of the coding/decoding functions of the surrogate framework.
As a result, our algorithm can be applied to the broad class of problems in
which the surrogate space is large or even infinite dimensional. We study
excess risk bounds for trace norm regularized structured prediction, which
imply consistency and learning rates for our estimator. We also identify relevant
regimes in which our approach can enjoy better generalization performance than
previous methods. Numerical experiments on ranking problems indicate that
enforcing low-rank relations among surrogate outputs may indeed provide a
significant advantage in practice.
Comment: 42 pages, 1 table.
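For concreteness, trace (nuclear) norm regularization is typically handled
through a proximal step that soft-thresholds singular values; the snippet
below is a generic singular value thresholding operator in Python, offered
only to illustrate the regularizer and not as the algorithm proposed in the
paper.

    import numpy as np

    def svt(W, tau):
        """Proximal operator of tau * ||W||_* (trace/nuclear norm):
        soft-threshold the singular values of W."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt

    # One proximal-gradient step on a smooth loss with a trace norm penalty
    # (eta is the step size, lam the regularization parameter):
    #   W_next = svt(W - eta * grad_loss(W), eta * lam)

The soft-thresholding encourages low-rank solutions, which is how low-rank
relations between surrogate outputs can be enforced.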
Scalable Kernel Methods via Doubly Stochastic Gradients
The general perception is that kernel methods are not scalable, and neural
nets are the methods of choice for nonlinear learning problems. Or have we
simply not tried hard enough for kernel methods? Here we propose an approach
that scales up kernel methods using a novel concept called "doubly stochastic
functional gradients". Our approach relies on the fact that many kernel methods
can be expressed as convex optimization problems, and we solve the problems by
making two unbiased stochastic approximations to the functional gradient, one
using random training points and another using random functions associated with
the kernel, and then descending using this noisy functional gradient. We show
that a function produced by this procedure after $t$ iterations converges to
the optimal function in the reproducing kernel Hilbert space at rate $O(1/t)$,
and achieves a generalization performance of $O(1/\sqrt{t})$. This double
stochasticity also allows us to avoid keeping the support vectors and to
implement the algorithm with a small memory footprint, which is linear in the
number of iterations and independent of the data dimension. Our approach can readily scale
kernel methods up to the regimes which are dominated by neural nets. We show
that our method can achieve competitive performance to neural nets in datasets
such as 8 million handwritten digits from MNIST, 2.3 million energy materials
from MolecularSpace, and 1 million photos from ImageNet.
Comment: 32 pages, 22 figures.
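A compact Python sketch in the spirit of the procedure described above,
drawing one random training point and one random feature (here a random
Fourier feature for a Gaussian kernel) per iteration; the function name, the
fixed step size, and the explicit storage of all random features are
simplifications for illustration, and the seed-replay trick used by the full
method to save memory is omitted.

    import numpy as np

    def doubly_stochastic_sgd(X, y, n_iters=200, step=0.5, lam=1e-3,
                              sigma=1.0, seed=0):
        """Learns f(x) = sum_t alpha[t] * sqrt(2) * cos(Ws[t].x + bs[t])
        via doubly stochastic functional gradient steps on the squared loss."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        Ws = np.zeros((n_iters, d))   # random feature directions
        bs = np.zeros(n_iters)        # random feature offsets
        alpha = np.zeros(n_iters)     # functional-gradient coefficients

        def f(x, t):                  # current iterate evaluated at x
            if t == 0:
                return 0.0
            return np.dot(alpha[:t], np.sqrt(2.0) * np.cos(Ws[:t] @ x + bs[:t]))

        for t in range(n_iters):
            i = rng.integers(n)                        # random training point
            Ws[t] = rng.normal(0.0, 1.0 / sigma, d)    # random feature
            bs[t] = rng.uniform(0.0, 2 * np.pi)
            residual = f(X[i], t) - y[i]
            alpha[:t] *= 1.0 - step * lam              # shrinkage from penalty
            alpha[t] = -step * residual * np.sqrt(2.0) * np.cos(Ws[t] @ X[i] + bs[t])
        return Ws, bs, alpha

Evaluating f at prediction time reuses the stored (or regenerated) random
features together with the coefficients alpha.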
Density Estimation in Infinite Dimensional Exponential Families
In this paper, we consider an infinite dimensional exponential family,
$\mathcal{P}$, of probability densities, which are parametrized by functions
in a reproducing kernel Hilbert space, $H$, and show it to be quite rich in
the sense that a broad class of densities on $\mathbb{R}^d$ can be
approximated arbitrarily well in Kullback-Leibler (KL) divergence by elements
in $\mathcal{P}$. The main goal of the paper is to estimate an unknown
density, $p_0$, through an element in $\mathcal{P}$. Standard techniques like
maximum likelihood estimation (MLE) or pseudo MLE (based on the method of
sieves), which are based on minimizing the KL divergence between $p_0$ and
$\mathcal{P}$, do not yield practically useful estimators because of their
inability to efficiently handle the log-partition function. Instead, we
propose an estimator, $\hat{p}_n$, based on minimizing the \emph{Fisher
divergence}, $J(p_0\|p)$, between $p_0$ and $p \in \mathcal{P}$, which
involves solving a simple finite-dimensional linear system. When
$p_0 \in \mathcal{P}$, we show that the proposed estimator is consistent and
provide a convergence rate in Fisher divergence under the smoothness
assumption that $\log p_0 \in \mathcal{R}(C^\beta)$ for some $\beta > 0$,
where $C$ is a certain Hilbert-Schmidt operator on $H$ and
$\mathcal{R}(C^\beta)$ denotes the image of $C^\beta$. We also investigate
the misspecified case of $p_0 \notin \mathcal{P}$ and show that
$J(p_0\|\hat{p}_n) \to \inf_{p \in \mathcal{P}} J(p_0\|p)$ as $n \to \infty$,
and provide a rate for this convergence under a similar smoothness condition
as above. Through numerical simulations we demonstrate that the proposed
estimator outperforms the non-parametric kernel density estimator, and that
the advantage with the proposed estimator grows as the dimension $d$
increases.
Comment: 58 pages, 8 figures; fixed some errors and typos.
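For reference, the Fisher (score matching) divergence referred to above is,
in its standard form,
\[
J(p_0 \,\|\, p) \;=\; \frac{1}{2} \int_{\mathbb{R}^d} p_0(x)\, \big\| \nabla_x \log p_0(x) - \nabla_x \log p(x) \big\|_2^2 \, dx .
\]
Only $\nabla_x \log p$ enters, so the log-partition function of $p$ drops
out, which is what allows the minimization over the exponential family to
reduce to a finite-dimensional linear system.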
Exponential convergence of testing error for stochastic gradient methods
We consider binary classification problems with positive definite kernels and
square loss, and study the convergence rates of stochastic gradient methods. We
show that while the excess testing loss (squared loss) converges slowly to zero
as the number of observations (and thus iterations) goes to infinity, the
testing error (classification error) converges exponentially fast if low-noise
conditions are assumed.
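A minimal sketch of the mechanism, under a hard-margin instance of the
low-noise condition (this particular formulation is an illustrative
assumption, not necessarily the exact condition used in the paper): with
labels $y \in \{-1, +1\}$ and regression function $f^*(x) = \mathbb{E}[y \mid
x]$, suppose $|f^*(x)| \ge \delta$ almost surely for some $\delta > 0$. Then
\[
\mathbb{P}\big(\operatorname{sign}(\hat f_n(x)) \neq \operatorname{sign}(f^*(x))\big) \;\le\; \mathbb{P}\big(|\hat f_n(x) - f^*(x)| \ge \delta\big),
\]
so exponential concentration of the stochastic gradient iterate $\hat f_n$
around $f^*$ at the fixed margin $\delta$ translates into exponentially fast
decay of the classification error, even though the excess squared loss itself
decreases only at a polynomial rate.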