2,899 research outputs found
Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent
Recently, a considerable amount of work has been devoted to studying the
algorithmic stability and generalization of stochastic gradient descent (SGD).
However, existing stability analyses impose restrictive assumptions on the
boundedness of gradients and on the strong smoothness and convexity of loss
functions. In this paper, we provide a fine-grained analysis of stability
and generalization for SGD by substantially relaxing these assumptions.
Firstly, we establish stability and generalization for SGD by removing the
existing bounded gradient assumptions. The key idea is the introduction of a
new stability measure called on-average model stability, for which we develop
novel bounds controlled by the risks of SGD iterates. This yields
generalization bounds depending on the behavior of the best model, and leads to
the first-ever-known fast bounds in the low-noise setting via a stability
approach. Secondly, the smoothness assumption is relaxed by considering loss
functions with Hölder continuous (sub)gradients, for which we show that optimal
bounds are still achieved by balancing computation and stability. To the best of
our knowledge, this gives the first-ever-known stability and generalization bounds
for SGD with even non-differentiable loss functions. Finally, we study learning
problems with (strongly) convex objectives but non-convex loss functions.
Comment: to appear in ICML 2020
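To make the on-average model stability notion concrete, the following sketch runs SGD twice on neighbouring datasets that differ in a single example (with the same sample path) and averages the distance between the resulting iterates over the replaced index. This only illustrates the kind of quantity such a stability measure controls; the least-squares loss, step size, and helper names are assumptions, not taken from the paper.

```python
import numpy as np

def sgd(w0, X, y, lr=0.01, epochs=5, seed=0):
    """Plain SGD on a least-squares loss; returns the final iterate."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x_i^T w - y_i)^2
            w = w - lr * grad
    return w

rng = np.random.default_rng(1)
n, d = 100, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w0 = np.zeros(d)

# Rerun SGD on datasets that differ from (X, y) in one example, keeping the
# same sample path, and average the distance between the coupled iterates.
dists = []
for j in range(n):
    X_j, y_j = X.copy(), y.copy()
    X_j[j], y_j[j] = rng.normal(size=d), rng.normal()
    dists.append(np.linalg.norm(sgd(w0, X, y) - sgd(w0, X_j, y_j)))

print("on-average model stability proxy:", np.mean(dists))
```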
A continuous-time analysis of distributed stochastic gradient
We analyze the effect of synchronization on distributed stochastic gradient
algorithms. By exploiting an analogy with dynamical models of biological quorum
sensing -- where synchronization between agents is induced through
communication with a common signal -- we quantify how synchronization can
significantly reduce the magnitude of the noise felt by the individual
distributed agents and by their spatial mean. This noise reduction is in turn
associated with a reduction in the smoothing of the loss function imposed by
the stochastic gradient approximation. Through simulations on model non-convex
objectives, we demonstrate that coupling can stabilize higher noise levels and
improve convergence. We provide a convergence analysis for strongly convex
functions by deriving a bound on the expected deviation of the spatial mean of
the agents from the global minimizer for an algorithm based on quorum sensing,
the same algorithm with momentum, and the Elastic Averaging SGD (EASGD)
algorithm. We discuss extensions to new algorithms which allow each agent to
broadcast its current measure of success and shape the collective computation
accordingly. We supplement our theoretical analysis with numerical experiments
on convolutional neural networks trained on the CIFAR-10 dataset, where we note
a surprising regularizing property of EASGD even when applied to the
non-distributed case. This observation suggests alternative second-order
in-time algorithms for non-distributed optimization that are competitive with
momentum methods.
Comment: 9/14/19: Final version, accepted for publication in Neural Computation. 4/7/19: Significant edits: addition of simulations, deep network results, and revisions throughout. 12/28/18: Initial submission.
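For reference, the elastic-averaging coupling discussed above can be sketched as follows: each agent takes a noisy gradient step plus an elastic pull toward a shared center variable, and the center drifts toward the agents. The quadratic objective, the noise level, and the constants eta and rho below are illustrative choices, not values from the paper.

```python
import numpy as np

def easgd(grad, x0, n_agents=8, steps=500, eta=0.05, rho=0.1, noise=0.5, seed=0):
    """Elastic Averaging SGD sketch: agents follow noisy gradients plus an
    elastic pull toward a center variable; the center drifts toward the agents."""
    rng = np.random.default_rng(seed)
    x = np.tile(x0, (n_agents, 1)).astype(float)   # per-agent parameters
    center = x0.astype(float).copy()               # shared center variable
    for _ in range(steps):
        g = np.array([grad(xi) + noise * rng.normal(size=xi.shape) for xi in x])
        new_x = x - eta * (g + rho * (x - center))
        center = center + eta * rho * (x - center).sum(axis=0)
        x = new_x
    return x, center

# Toy strongly convex objective f(x) = 0.5 * ||x||^2, so grad(x) = x.
agents, center = easgd(grad=lambda x: x, x0=np.array([5.0, -3.0]))
print("spatial mean of agents:", agents.mean(axis=0))
print("center variable:", center)
```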
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Recent works have cast some light on the mystery of why deep nets fit any
data and generalize despite being very overparametrized. This paper analyzes
training and generalization for a simple 2-layer ReLU net with random
initialization, and provides the following improvements over recent works:
(i) Using a tighter characterization of training speed than recent papers, an
explanation for why training a neural net with random labels leads to slower
training, as originally observed in [Zhang et al. ICLR'17].
(ii) Generalization bound independent of network size, using a data-dependent
complexity measure. Our measure distinguishes clearly between random labels and
true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent
papers require the sample complexity to increase (slowly) with the network size, while our
sample complexity is completely independent of the network size.
(iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets
trained via gradient descent.
The key idea is to track dynamics of training and generalization via
properties of a related kernel.
Comment: In ICML 2019
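The data-dependent complexity measure in item (ii) can be sketched numerically: for unit-norm inputs, form the infinite-width Gram matrix H^infty of the two-layer ReLU net and evaluate sqrt(2 y^T (H^infty)^{-1} y / n) for true versus random labels. The Gram-matrix expression below is the standard closed form for two-layer ReLU networks; the toy data and the linear labelling rule are invented for illustration and are not the paper's experiments.

```python
import numpy as np

def ntk_gram(X):
    """Infinite-width Gram matrix H^infty for a two-layer ReLU net
    (standard closed form; rows of X are assumed to be unit-norm)."""
    G = np.clip(X @ X.T, -1.0, 1.0)
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def complexity(X, y):
    """Data-dependent measure sqrt(2 * y^T (H^infty)^{-1} y / n)."""
    H = ntk_gram(X)
    return np.sqrt(2.0 * y @ np.linalg.solve(H, y) / len(y))

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
w = rng.normal(size=d)
y_true = np.sign(X @ w)                         # labels from a simple linear rule
y_rand = rng.choice([-1.0, 1.0], size=n)        # random labels

print("complexity, true labels:  ", complexity(X, y_true))
print("complexity, random labels:", complexity(X, y_rand))
```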
Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
Empirical studies show that gradient-based methods can learn deep neural
networks (DNNs) with very good generalization performance in the
over-parameterization regime, where DNNs can easily fit a random labeling of
the training data. Very recently, a line of work explains in theory that with
over-parameterization and proper random initialization, gradient-based methods
can find the global minima of the training loss for DNNs. However, existing
generalization error bounds are unable to explain the good generalization
performance of over-parameterized DNNs. The major limitation of most existing
generalization bounds is that they are based on uniform convergence and are
independent of the training algorithm. In this work, we derive an
algorithm-dependent generalization error bound for deep ReLU networks, and show
that under certain assumptions on the data distribution, gradient descent (GD)
with proper random initialization is able to train a sufficiently
over-parameterized DNN to achieve arbitrarily small generalization error. Our
work sheds light on explaining the good generalization performance of
over-parameterized deep neural networks.
Comment: 27 pages. This version simplifies the proof and improves the presentation in Version 3. In AAAI 2020
Fast Convergence in Learning Two-Layer Neural Networks with Separable Data
Normalized gradient descent has shown substantial success in speeding up the
convergence of exponentially-tailed loss functions (which includes exponential
and logistic losses) on linear classifiers with separable data. In this paper,
we go beyond linear models by studying normalized GD on two-layer neural nets.
We prove for exponentially-tailed losses that using normalized GD leads to
a linear rate of convergence of the training loss to the global optimum. This is
made possible by showing certain gradient self-boundedness conditions and a
log-Lipschitzness property. We also study generalization of normalized GD for
convex objectives via an algorithmic-stability analysis. In particular, we show
that normalized GD does not overfit during training by establishing finite-time
generalization bounds.
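A minimal sketch of the normalized GD update analyzed above, shown on a linear classifier with logistic loss and linearly separable toy data (the paper's analysis concerns two-layer networks; the data, step size, and iteration count here are illustrative assumptions): each step moves along the gradient direction with a unit-norm step, so progress does not stall as the exponentially-tailed loss and its gradient shrink.

```python
import numpy as np

def logistic_loss_and_grad(w, X, y):
    """Average logistic loss and gradient for labels y in {-1, +1} (stable form)."""
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))
    coeff = -y * np.exp(-np.logaddexp(0.0, margins))   # -y * sigmoid(-margin)
    grad = (coeff[:, None] * X).mean(axis=0)
    return loss, grad

def normalized_gd(X, y, steps=200, eta=0.5):
    """Gradient descent with normalized steps: w <- w - eta * grad / ||grad||."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        _, g = logistic_loss_and_grad(w, X, y)
        norm = np.linalg.norm(g)
        if norm < 1e-12:           # stationary point reached
            break
        w -= eta * g / norm
    return w

# Linearly separable toy data: the first coordinate carries the label with margin >= 1.
rng = np.random.default_rng(0)
n, d = 200, 5
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] = y * (1.0 + rng.random(n))

w = normalized_gd(X, y)
print("final training loss:", logistic_loss_and_grad(w, X, y)[0])
```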