The Implicit Bias of Gradient Descent on Separable Data
We examine gradient descent on unregularized logistic regression problems,
with homogeneous linear predictors on linearly separable datasets. We show the
predictor converges to the direction of the max-margin (hard margin SVM)
solution. The result also generalizes to other monotone decreasing loss
functions with an infimum at infinity, to multi-class problems, and to training
a weight layer in a deep network in a certain restricted setting. Furthermore,
we show this convergence is very slow, and only logarithmic in the convergence
of the loss itself. This can help explain the benefit of continuing to optimize
the logistic or cross-entropy loss even after the training error is zero and
the training loss is extremely small, and, as we show, even if the validation
loss increases. Our methodology can also aid in understanding implicit
regularization in more complex models and with other optimization methods.
Comment: Final JMLR version, with improved discussions over v3. Main
improvements in journal version over conference version (v2 appeared in
ICLR): We proved the measure zero case for main theorem (with implications
for the rates), and the multi-class cas
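The claim in this abstract can be illustrated with a minimal sketch, which is not the authors' code. It runs plain gradient descent on the unregularized logistic loss for a homogeneous linear predictor on a toy separable dataset. The dataset, step size, and function names (`gd_logistic`, `cosine`) are assumptions chosen for illustration. The points are stored as z_i = y_i * x_i, and for this particular toy set the hard-margin SVM direction is (1, 0), since the constraints w . z_i >= 1 are tight only for the first two points.

```python
import math

# Hypothetical toy data, stored as z_i = y_i * x_i (label already folded in).
# The first two points are the support vectors; the third is not.
POINTS = [(1.0, 2.0), (1.0, -2.0), (3.0, 1.0)]
# Hard-margin SVM direction for this toy set (derived by hand, see lead-in).
MAX_MARGIN_DIRECTION = (1.0, 0.0)


def gd_logistic(points, steps, lr=0.1, w0=(0.0, 0.0)):
    """Gradient descent on sum_i log(1 + exp(-w . z_i)) with no regularizer."""
    w = list(w0)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for z in points:
            margin = w[0] * z[0] + w[1] * z[1]
            # d/dw log(1 + exp(-margin)) = -z / (1 + exp(margin))
            coeff = -1.0 / (1.0 + math.exp(margin))
            grad[0] += coeff * z[0]
            grad[1] += coeff * z[1]
        w[0] -= lr * grad[0]
        w[1] -= lr * grad[1]
    return w


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))


if __name__ == "__main__":
    for steps in (10, 100, 1000, 10000):
        w = gd_logistic(POINTS, steps)
        # The norm keeps growing while the direction aligns with (1, 0).
        print(steps, math.hypot(w[0], w[1]), cosine(w, MAX_MARGIN_DIRECTION))
```

Running the script prints the weight norm and the cosine similarity to the max-margin direction at several iteration counts. On this toy problem the norm keeps increasing long after every point is classified correctly, while the direction settles toward (1, 0).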
Decentralized Learning with Separable Data: Generalization and Fast Algorithms
Decentralized learning offers privacy and communication efficiency when data
are naturally distributed among agents communicating over an underlying graph.
Motivated by overparameterized learning settings, in which models are trained
to zero training loss, we study algorithmic and generalization properties of
decentralized learning with gradient descent on separable data. Specifically,
for decentralized gradient descent (DGD) and a variety of loss functions that
asymptote to zero at infinity (including exponential and logistic losses), we
derive novel finite-time generalization bounds. This complements a long line of
recent work that studies the generalization performance and the implicit bias
of gradient descent over separable data, but has thus far been limited to
centralized learning scenarios. Notably, our generalization bounds match in
order their centralized counterparts. Critical to this, and of independent
interest, is establishing novel bounds on the training loss and the
rate-of-consensus of DGD for a class of self-bounded losses. Finally, on the
algorithmic front, we design improved gradient-based routines for decentralized
learning with separable data and empirically demonstrate orders-of-magnitude
speed-ups in terms of both training and generalization performance.
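As a rough illustration of the setting in this abstract, the sketch below implements one common textbook form of decentralized gradient descent: each agent averages its neighbors' models with a mixing matrix and then takes a step along its own local logistic-loss gradient. The abstract does not spell out the update rule, so this form, the three-agent mixing matrix, the toy data, and the names `dgd` and `consensus_gap` are all assumptions. The paper's exact variant and its improved routines may differ.

```python
import math

# Hypothetical toy data: agent i holds only the point z_i = y_i * x_i.
POINTS = [(1.0, 2.0), (1.0, -2.0), (3.0, 1.0)]
# Assumed doubly stochastic mixing matrix for three fully connected agents.
MIXING = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]


def local_gradient(w, z):
    """Gradient of log(1 + exp(-w . z)) with respect to w."""
    margin = w[0] * z[0] + w[1] * z[1]
    coeff = -1.0 / (1.0 + math.exp(margin))
    return (coeff * z[0], coeff * z[1])


def dgd(points, mixing, steps, lr=0.1):
    """Decentralized GD: mix neighbors' models, then take a local gradient step."""
    n = len(points)
    models = [(0.0, 0.0)] * n
    for _ in range(steps):
        new_models = []
        for i in range(n):
            # Consensus step: weighted average of all agents' current models.
            avg0 = sum(mixing[i][j] * models[j][0] for j in range(n))
            avg1 = sum(mixing[i][j] * models[j][1] for j in range(n))
            # Local step: gradient evaluated at agent i's own current model.
            g = local_gradient(models[i], points[i])
            new_models.append((avg0 - lr * g[0], avg1 - lr * g[1]))
        models = new_models
    return models


def consensus_gap(models):
    """Largest Euclidean distance between any two agents' models."""
    return max(
        math.hypot(a[0] - b[0], a[1] - b[1]) for a in models for b in models
    )


if __name__ == "__main__":
    final_models = dgd(POINTS, MIXING, 5000)
    print(final_models, consensus_gap(final_models))
```

On this toy problem each agent sees a single point, yet after mixing every agent's model separates all three points. The consensus gap shrinks as the local gradients decay, which is the kind of rate-of-consensus behavior the abstract refers to.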
Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
Empirical studies show that gradient-based methods can learn deep neural
networks (DNNs) with very good generalization performance in the
over-parameterization regime, where DNNs can easily fit a random labeling of
the training data. Very recently, a line of work explains in theory that with
over-parameterization and proper random initialization, gradient-based methods
can find the global minima of the training loss for DNNs. However, existing
generalization error bounds are unable to explain the good generalization
performance of over-parameterized DNNs. The major limitation of most existing
generalization bounds is that they are based on uniform convergence and are
independent of the training algorithm. In this work, we derive an
algorithm-dependent generalization error bound for deep ReLU networks, and show
that under certain assumptions on the data distribution, gradient descent (GD)
with proper random initialization is able to train a sufficiently
over-parameterized DNN to achieve arbitrarily small generalization error. Our
work sheds light on explaining the good generalization performance of
over-parameterized deep neural networks.
Comment: 27 pages. This version simplifies the proof and improves the
presentation in Version 3. In AAAI 202