Optimal Rates for Multi-pass Stochastic Gradient Methods
We analyze the learning properties of the stochastic gradient method when multiple passes over
the data and mini-batches are allowed. We study how regularization properties are controlled by
the step-size, the number of passes and the mini-batch size. In particular, we consider the square
loss and show that for a universal step-size choice, the number of passes acts as a regularization
parameter, and optimal finite sample bounds can be achieved by early-stopping. Moreover, we
show that larger step-sizes are allowed when considering mini-batches. Our analysis is based
on a unifying approach, encompassing both batch and stochastic gradient methods as special
cases. As a byproduct, we derive optimal convergence results for batch gradient methods (even
in the non-attainable cases).
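The role of the number of passes as a regularization parameter can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the synthetic data, the reshuffling scheme, the step-size, the mini-batch size, and the use of a held-out set to pick the stopping pass.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mse(w, X, y):
    """Mean squared error of the linear predictor w on (X, y)."""
    return sum((dot(w, x) - t) ** 2 for x, t in zip(X, y)) / len(X)


def multipass_sgd(X, y, step_size, n_passes, batch_size, seed=0):
    """Mini-batch SGD on the square loss; returns the iterate after each pass."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    iterates = []
    for _ in range(n_passes):
        order = list(range(n))
        rng.shuffle(order)  # assumed sampling scheme: reshuffle every pass
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grad = [0.0] * d
            for i in batch:
                residual = dot(w, X[i]) - y[i]
                for j in range(d):
                    grad[j] += residual * X[i][j] / len(batch)
            w = [wj - step_size * gj for wj, gj in zip(w, grad)]
        iterates.append(list(w))
    return iterates


def make_data(n, w_true, noise, rng):
    """Hypothetical synthetic linear model with Gaussian inputs and noise."""
    X = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(n)]
    y = [dot(w_true, x) + rng.gauss(0.0, noise) for x in X]
    return X, y


rng = random.Random(1)
w_true = [1.0 / (j + 1) for j in range(20)]
X_train, y_train = make_data(40, w_true, 1.0, rng)
X_val, y_val = make_data(200, w_true, 1.0, rng)

iterates = multipass_sgd(X_train, y_train, step_size=0.01, n_passes=30, batch_size=4)
val_errors = [mse(w, X_val, y_val) for w in iterates]
# Early stopping: keep the pass count with the lowest held-out error.
best_pass = 1 + min(range(len(val_errors)), key=val_errors.__getitem__)
print("pass with lowest validation error:", best_pass)
```

No penalty term appears in the update. In this sketch the only regularization comes from the step-size and from how many passes are run before stopping.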
Generalization Properties of Doubly Stochastic Learning Algorithms
Doubly stochastic learning algorithms are scalable kernel methods that
perform very well in practice. However, their generalization properties are not
well understood and their analysis is challenging since the corresponding
learning sequence may not be in the hypothesis space induced by the kernel. In
this paper, we provide an in-depth theoretical analysis for different variants
of doubly stochastic learning algorithms within the setting of nonparametric
regression in a reproducing kernel Hilbert space and considering the square
loss. Particularly, we derive convergence results on the generalization error
for the studied algorithms either with or without an explicit penalty term. To
the best of our knowledge, the derived results for the unregularized variants
are the first of this kind, while the results for the regularized variants
improve those in the literature. The novelties in our proof are a sample error
bound that requires controlling the trace norm of a cumulative operator, and a
refined analysis of bounding initial error. Comment: 24 pages. To appear in Journal of Complexit
Learning with SGD and Random Features
Sketching and stochastic gradient methods are arguably the most common
techniques to derive efficient large scale learning algorithms. In this paper,
we investigate their application in the context of nonparametric statistical
learning. More precisely, we study the estimator defined by stochastic gradient
with mini batches and random features. The latter can be seen as a form of
nonlinear sketching and used to define approximate kernel methods. The
considered estimator is not explicitly penalized/constrained and regularization
is implicit. Indeed, our study highlights how different parameters, such as
number of features, iterations, step-size and mini-batch size control the
learning properties of the solutions. We do this by deriving optimal finite
sample bounds, under standard assumptions. The obtained results are
corroborated and illustrated by numerical experiments.
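An estimator of this kind can be sketched as follows. The abstract does not fix a particular feature map, so the random Fourier features for a Gaussian kernel used here are an assumption, as are the toy data, the bandwidth, and all parameter values.

```python
import math
import random


def make_feature_map(dim, n_features, bandwidth, rng):
    """Random Fourier features approximating a Gaussian kernel (assumed choice)."""
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(f * xi for f, xi in zip(freq, x)) + b)
                for freq, b in zip(freqs, phases)]

    return feature_map


def sgd_random_features(Z, y, step_size, n_passes, batch_size, rng):
    """Unpenalized mini-batch SGD on the square loss over the feature vectors Z."""
    n, m = len(Z), len(Z[0])
    w = [0.0] * m
    for _ in range(n_passes * (n // batch_size)):
        batch = [rng.randrange(n) for _ in range(batch_size)]  # with replacement
        grad = [0.0] * m
        for i in batch:
            residual = sum(wj * zj for wj, zj in zip(w, Z[i])) - y[i]
            for j in range(m):
                grad[j] += residual * Z[i][j] / batch_size
        w = [wj - step_size * gj for wj, gj in zip(w, grad)]
    return w


rng = random.Random(0)
# Hypothetical one-dimensional regression problem: y = sin(x) + noise.
X = [[rng.uniform(-3.0, 3.0)] for _ in range(200)]
y = [math.sin(x[0]) + rng.gauss(0.0, 0.1) for x in X]

feature_map = make_feature_map(dim=1, n_features=50, bandwidth=1.0, rng=rng)
Z = [feature_map(x) for x in X]
w = sgd_random_features(Z, y, step_size=0.5, n_passes=20, batch_size=5, rng=rng)

train_mse = sum((sum(wj * zj for wj, zj in zip(w, z)) - t) ** 2
                for z, t in zip(Z, y)) / len(y)
baseline_mse = sum(t ** 2 for t in y) / len(y)  # error of the zero predictor
print("train MSE:", train_mse, "baseline MSE:", baseline_mse)
```

The four quantities named in the abstract appear as `n_features`, `n_passes`, `step_size` and `batch_size`. Varying them one at a time is a simple way to see how each affects the fit.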
On the Regularizing Property of Stochastic Gradient Descent
Stochastic gradient descent is one of the most successful approaches for
solving large-scale problems, especially in machine learning and statistics. At
each iteration, it employs an unbiased estimator of the full gradient computed
from one single randomly selected data point. Hence, it scales well with
problem size and is very attractive for truly massive datasets, and holds
significant potential for solving large-scale inverse problems. In the recent
literature of machine learning, it was empirically observed that when equipped
with early stopping, it has a regularizing property. In this work, we rigorously
establish its regularizing property (under an a priori early stopping
rule), and also prove convergence rates under the canonical sourcewise
condition, for minimizing the quadratic functional for linear inverse problems.
This is achieved by combining tools from classical regularization theory and
stochastic analysis. Further, we analyze the preasymptotic weak and strong
convergence behavior of the algorithm. The theoretical findings shed light on
the performance of the algorithm, and are complemented with illustrative
numerical experiments. Comment: 22 pages, better presentatio
Optimal Rates for Spectral Algorithms with Least-Squares Regression over Hilbert Spaces
In this paper, we study regression problems over a separable Hilbert space
with the square loss, covering non-parametric regression over a reproducing
kernel Hilbert space. We investigate a class of spectral-regularized
algorithms, including ridge regression, principal component analysis, and
gradient methods. We prove optimal, high-probability convergence results in
terms of variants of norms for the studied algorithms, considering a capacity
assumption on the hypothesis space and a general source condition on the target
function. Consequently, we obtain almost sure convergence results with optimal
rates. Our results improve and generalize previous results, filling a
theoretical gap for the non-attainable cases.
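Spectral-regularized algorithms are commonly described through a filter function applied to the eigenvalues of the empirical covariance operator. The sketch below uses the textbook filters for the three methods named in the abstract. The exact class of filters treated in the paper is not given here, so the function names and forms are assumptions.

```python
def ridge_filter(sigma, lam):
    """Ridge regression (Tikhonov) filter: 1 / (sigma + lam)."""
    return 1.0 / (sigma + lam)


def cutoff_filter(sigma, lam):
    """Spectral cut-off (principal component regression): invert only sigma >= lam."""
    return 1.0 / sigma if sigma >= lam else 0.0


def gradient_filter(sigma, step_size, n_iters):
    """Gradient descent filter: step_size * sum_{k < n_iters} (1 - step_size * sigma)^k."""
    if sigma == 0.0:
        return step_size * n_iters
    return (1.0 - (1.0 - step_size * sigma) ** n_iters) / sigma


# sigma * g(sigma) is the fraction of each spectral component that is retained:
# close to 1 for large eigenvalues, close to 0 for small ones.
eigenvalues = [1.0, 0.5, 0.1, 0.01]
for sigma in eigenvalues:
    print(sigma,
          sigma * ridge_filter(sigma, lam=0.1),
          sigma * cutoff_filter(sigma, lam=0.1),
          sigma * gradient_filter(sigma, step_size=1.0, n_iters=10))
```

In this view the ridge parameter, the cut-off level and the inverse of the iteration count play the same role: each sets the scale below which eigenvalues are damped.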
Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
We consider stochastic gradient descent (SGD) for least-squares regression
with potentially several passes over the data. While several passes have been
widely reported to perform practically better in terms of predictive
performance on unseen data, the existing theoretical analysis of SGD suggests
that a single pass is statistically optimal. While this is true for
low-dimensional easy problems, we show that for hard problems, multiple passes
lead to statistically optimal predictions while a single pass does not; we also
show that in these hard models, the optimal number of passes over the data
increases with sample size. In order to define the notion of hardness and show
that our predictive performances are optimal, we consider potentially
infinite-dimensional models and notions typically associated to kernel methods,
namely, the decay of eigenvalues of the covariance matrix of the features and
the complexity of the optimal predictor as measured through the covariance
matrix. We illustrate our results on synthetic experiments with non-linear
kernel methods and on a classical benchmark with a linear model.
Kernel Conjugate Gradient Methods with Random Projections
We propose and study kernel conjugate gradient methods (KCGM) with random
projections for least-squares regression over a separable Hilbert space.
Considering two types of random projections generated by randomized sketches
and Nyström subsampling, we prove optimal statistical results with respect
to variants of norms for the algorithms under a suitable stopping rule.
Particularly, our results show that if the projection dimension is proportional
to the effective dimension of the problem, KCGM with randomized sketches can
generalize optimally, while achieving a computational advantage. As a
corollary, we derive optimal rates for classic KCGM in the case that the target
function may not be in the hypothesis space, filling a theoretical gap. Comment: 43 pages, 2 figure