37,595 research outputs found
Optimal Learning for Multi-pass Stochastic Gradient Methods
We analyze the learning properties of the stochastic gradient method when multiple
passes over the data and mini-batches are allowed. In particular, we consider
the square loss and show that for a universal step-size choice, the number of
passes acts as a regularization parameter, and optimal finite sample bounds can be
achieved by early-stopping. Moreover, we show that larger step-sizes are allowed
when considering mini-batches. Our analysis is based on a unifying approach,
encompassing both batch and stochastic gradient methods as special cases
Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
We consider stochastic gradient descent (SGD) for least-squares regression
with potentially several passes over the data. While several passes have been
widely reported to perform practically better in terms of predictive
performance on unseen data, the existing theoretical analysis of SGD suggests
that a single pass is statistically optimal. While this is true for
low-dimensional easy problems, we show that for hard problems, multiple passes
lead to statistically optimal predictions while single pass does not; we also
show that in these hard models, the optimal number of passes over the data
increases with sample size. In order to define the notion of hardness and show
that our predictive performances are optimal, we consider potentially
infinite-dimensional models and notions typically associated to kernel methods,
namely, the decay of eigenvalues of the covariance matrix of the features and
the complexity of the optimal predictor as measured through the covariance
matrix. We illustrate our results on synthetic experiments with non-linear
kernel methods and on a classical benchmark with a linear model
Learning with SGD and Random Features
Sketching and stochastic gradient methods are arguably the most common
techniques to derive efficient large scale learning algorithms. In this paper,
we investigate their application in the context of nonparametric statistical
learning. More precisely, we study the estimator defined by stochastic gradient
with mini batches and random features. The latter can be seen as form of
nonlinear sketching and used to define approximate kernel methods. The
considered estimator is not explicitly penalized/constrained and regularization
is implicit. Indeed, our study highlights how different parameters, such as
number of features, iterations, step-size and mini-batch size control the
learning properties of the solutions. We do this by deriving optimal finite
sample bounds, under standard assumptions. The obtained results are
corroborated and illustrated by numerical experiments
- …