Provable Generalization of SGD-trained Neural Networks of Any Width in the Presence of Adversarial Label Noise
We consider a one-hidden-layer leaky ReLU network of arbitrary width trained
by stochastic gradient descent (SGD) following an arbitrary initialization. We
prove that SGD produces neural networks that have classification accuracy
competitive with that of the best halfspace over the distribution for a broad
class of distributions that includes log-concave isotropic and hard margin
distributions. Equivalently, such networks can generalize when the data
distribution is linearly separable but corrupted with adversarial label noise,
despite the capacity to overfit. To the best of our knowledge, this is the
first work to show that overparameterized neural networks trained by SGD can
generalize when the data is corrupted with adversarial label noise. Comment: 30 pages, 10 figures
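To make the setting concrete, here is a toy sketch (my own construction, not the paper's): a one-hidden-layer leaky ReLU network with a fixed random output layer, trained by SGD on the logistic loss over linearly separable data in which a fraction of labels has been flipped. All hyperparameters (width, learning rate, noise rate) are illustrative.

```python
import numpy as np

# Illustrative sketch only: a one-hidden-layer leaky ReLU network trained
# by SGD on linearly separable data with a fraction of flipped labels.
# Width, learning rate, and noise rate are arbitrary choices, not the
# paper's constants.

rng = np.random.default_rng(0)

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.1):
    return np.where(z > 0, 1.0, alpha)

# Linearly separable data: label = sign(<w*, x>), then flip 10% of labels
# as a stand-in for adversarial label noise.
d, n, width = 5, 200, 50
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)
flip = rng.random(n) < 0.1
y[flip] *= -1

# Arbitrary initialization of the hidden layer; the output layer is a
# fixed random sign vector, a common simplification in this line of work.
W = rng.normal(size=(width, d)) * 0.5
a = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)

def forward(x):
    return a @ leaky_relu(W @ x)

# Plain SGD on the logistic loss, one sample per step.
lr = 0.05
for step in range(5000):
    i = rng.integers(n)
    x, yi = X[i], y[i]
    z = W @ x
    margin = yi * (a @ leaky_relu(z))
    g = -yi / (1.0 + np.exp(margin))  # d(loss)/d(output)
    W -= lr * g * (a * leaky_relu_grad(z))[:, None] * x[None, :]

# Accuracy against the *clean* labels: despite the noisy training labels,
# the network should be competitive with the underlying halfspace.
clean_acc = np.mean(np.sign([forward(x) for x in X]) == np.sign(X @ w_star))
```

Even though the network has the capacity to fit the flipped labels, its clean-label accuracy stays close to that of the best halfspace, which is the qualitative behavior the abstract describes.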
The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks
Despite their overwhelming capacity to overfit, deep neural networks trained
by specific optimization algorithms tend to generalize well to unseen data.
Recently, researchers have explained this by investigating the implicit
regularization effect of optimization algorithms. A notable advance is the work
of Lyu & Li (2019), which proves that gradient descent (GD) maximizes the
margin of homogeneous deep neural networks. Beyond GD, adaptive algorithms such
as AdaGrad, RMSProp, and Adam are popular owing to their fast training. However,
theoretical guarantees for the generalization of adaptive optimization
algorithms are still lacking. In this paper, we study the implicit
regularization of adaptive optimization algorithms when they are optimizing the
logistic loss on homogeneous deep neural networks. We prove that adaptive
algorithms that adopt an exponential moving average (EMA) strategy in the
conditioner (such as Adam and RMSProp) can maximize the margin of the neural
network, while AdaGrad, which directly sums historical squared gradients in its
conditioner, cannot. This indicates that the EMA strategy is superior for
generalization in the design of the conditioner. Technically, we provide a unified
framework for analyzing the convergence direction of adaptive optimization
algorithms by constructing a novel adaptive gradient flow and a surrogate
margin. Our experiments support the theoretical findings on the convergence
direction of adaptive optimization algorithms. Comment: ICML 2021 Long Talk
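The contrast the abstract draws between the two conditioner designs can be sketched in a few lines (the function names and hyperparameters are my own, for illustration): AdaGrad accumulates all historical squared gradients, while RMSProp/Adam keep an exponential moving average of them, which forgets old gradients.

```python
import numpy as np

# Illustrative sketch of the two conditioner updates contrasted above.
# AdaGrad: v_t = v_{t-1} + g_t^2 (sum grows without bound).
# RMSProp/Adam: v_t = beta * v_{t-1} + (1 - beta) * g_t^2 (EMA, forgets
# old gradients, so the effective step size need not vanish).

def adagrad_conditioner(v, g):
    return v + g * g

def ema_conditioner(v, g, beta=0.99):
    return beta * v + (1.0 - beta) * g * g

# Feed both the same shrinking gradient sequence, as arises when the
# logistic loss is driven toward zero, and compare the normalized steps.
g_seq = [1.0 / (t + 1) for t in range(1000)]
v_ada, v_ema = 0.0, 0.0
for g in g_seq:
    v_ada = adagrad_conditioner(v_ada, g)
    v_ema = ema_conditioner(v_ema, g)

eps = 1e-8
step_ada = g_seq[-1] / (np.sqrt(v_ada) + eps)
step_ema = g_seq[-1] / (np.sqrt(v_ema) + eps)
# The EMA conditioner tracks the recent gradient scale, so its normalized
# step stays on the order of 1; AdaGrad's accumulated sum crushes it.
```

This is only a caricature of the mechanism: the EMA conditioner keeps renormalizing by the current gradient scale, which is the property the unified adaptive-gradient-flow analysis exploits, while AdaGrad's unbounded accumulation is what prevents its margin-maximization guarantee.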