Generalization and Stability of Interpolating Neural Networks with Minimal Width
We investigate the generalization and optimization properties of shallow
neural-network classifiers trained by gradient descent in the interpolating
regime. Specifically, in a realizable scenario where model weights can achieve
arbitrarily small training error $\epsilon$ and their distance from
initialization is $g(\epsilon)$, we demonstrate that gradient descent with $n$
training data achieves training error $O(g(\epsilon)^2/T)$ and generalization
error $O(g(\epsilon)^2/n)$ at iteration $T$, provided there are at least
$\mathrm{poly}(g(\epsilon))$ hidden neurons. We then show that our realizable setting
encompasses a special case where data are separable by the model's neural
tangent kernel. For this setting and logistic-loss minimization, we prove that
the training loss decays at a rate of $\tilde{O}(1/T)$ given a polylogarithmic
number of neurons. Moreover, with a polylogarithmic number of neurons and
$T \approx n$ iterations, we bound the test loss by $\tilde{O}(1/n)$. Our
results differ from existing generalization bounds obtained via the
algorithmic-stability framework, which require polynomial width and yield
suboptimal generalization rates. Central to our analysis is the use of a new
self-bounded weak-convexity property, which leads to a generalized local
quasi-convexity property for sufficiently parameterized neural-network
classifiers. Ultimately, despite the objective's non-convexity, this leads to
convergence and generalization-gap bounds that resemble those found in the
convex setting of linear logistic regression.
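The setting above can be illustrated with a small experiment. The following
sketch is not the paper's code: the data, width, step size and iteration count
are arbitrary choices. It trains a shallow ReLU network with a fixed second
layer by gradient descent on the logistic loss and reports the two quantities
the bounds relate, the training error and the distance of the hidden weights
from their initialization.

import torch

torch.manual_seed(0)
n, d, m, lr, T = 200, 10, 512, 0.5, 2000    # sample size, input dim, width, step size, iterations (arbitrary)

X = torch.randn(n, d) / d**0.5
y = torch.sign(X @ torch.randn(d))          # a simple realizable (linearly separable) labelling

W0 = torch.randn(m, d) / d**0.5             # initialization
W = W0.clone().requires_grad_(True)         # trained hidden-layer weights
a = (torch.randint(0, 2, (m,)) * 2 - 1).float()   # fixed +/-1 output weights

def f(X, W):
    # shallow network: f(x) = (1/sqrt(m)) * sum_j a_j relu(w_j . x)
    return torch.relu(X @ W.T) @ a / m**0.5

for t in range(T):
    loss = torch.nn.functional.softplus(-y * f(X, W)).mean()   # logistic loss log(1 + exp(-y f))
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad                    # plain gradient-descent step
        W.grad.zero_()

with torch.no_grad():
    train_err = (torch.sign(f(X, W)) != y).float().mean().item()
    dist = (W - W0).norm().item()           # distance from initialization, the g(eps) of the abstract
print(f"train error {train_err:.3f}, final loss {loss.item():.4f}, ||W - W0|| = {dist:.2f}")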
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Recent works have cast some light on the mystery of why deep nets fit any
data and generalize despite being very overparametrized. This paper analyzes
training and generalization for a simple 2-layer ReLU net with random
initialization, and provides the following improvements over recent works:
(i) Using a tighter characterization of training speed than recent papers, an
explanation for why training a neural net with random labels leads to slower
training, as originally observed in [Zhang et al. ICLR'17].
(ii) Generalization bound independent of network size, using a data-dependent
complexity measure. Our measure distinguishes clearly between random labels and
true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent
papers require the sample complexity to increase (slowly) with the network
size, while our sample complexity is completely independent of the network size.
(iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets
trained via gradient descent.
The key idea is to track dynamics of training and generalization via
properties of a related kernel.
Comment: In ICML 2019
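The "related kernel" is the infinite-width neural tangent kernel of the
two-layer ReLU network. As a rough illustration only (synthetic unit-norm data
and placeholder labels, not the paper's experiments), the sketch below forms
the closed-form Gram matrix H_infinity and evaluates sqrt(2 y^T (H_infinity)^{-1} y / n),
which, up to lower-order terms, has the form of the data-dependent complexity
measure mentioned in item (ii).

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs
y = np.sign(X[:, 0])                               # placeholder +/-1 labels

G = np.clip(X @ X.T, -1.0, 1.0)                    # pairwise inner products x_i . x_j
H_inf = G * (np.pi - np.arccos(G)) / (2 * np.pi)   # ReLU NTK Gram matrix in closed form

complexity = np.sqrt(2.0 * y @ np.linalg.solve(H_inf, y) / n)
print("data-dependent complexity:", complexity)

Replacing y with random labels typically makes y^T (H_infinity)^{-1} y much
larger, which is the separation between true and random labels that the
experiments in the paper report.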
A survey on modern trainable activation functions
In the neural network literature, there is strong interest in identifying and
defining activation functions that can improve neural network performance. In
recent years the scientific community has shown renewed interest in
investigating activation functions that can be trained during the learning
process, usually referred to as "trainable", "learnable" or "adaptable"
activation functions. They appear to lead to better network performance.
Diverse and heterogeneous models of trainable activation functions have been
proposed in the literature. In this paper, we present a survey of these models.
Starting from a discussion of the use of the term "activation function" in the
literature, we propose a taxonomy of trainable activation functions, highlight
common and distinctive properties of recent and past models, and discuss the
main advantages and limitations of this type of approach. We show that many of
the proposed approaches are equivalent to adding neuron layers which use fixed
(non-trainable) activation functions together with a simple local rule that
constrains the corresponding weight layers.
Comment: Published in the "Neural Networks" journal (Elsevier)
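As a concrete, purely illustrative example (not taken from the survey) of what
such a trainable activation looks like in practice, the sketch below defines a
parametric ReLU-style unit whose positive- and negative-side slopes are learned
per channel along with the rest of the network; the shapes and initial values
are arbitrary.

import torch
from torch import nn

class TrainableActivation(nn.Module):
    """Applies slope `a` for x > 0 and slope `b` for x < 0, with a and b learned per channel."""
    def __init__(self, num_channels: int):
        super().__init__()
        self.pos_slope = nn.Parameter(torch.ones(num_channels))          # starts as a plain ReLU
        self.neg_slope = nn.Parameter(torch.full((num_channels,), 0.1))  # starts like a leaky ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pos_slope * torch.relu(x) - self.neg_slope * torch.relu(-x)

# The learnable slopes receive gradients like any other weight in the model.
model = nn.Sequential(nn.Linear(16, 32), TrainableActivation(32), nn.Linear(32, 1))
model(torch.randn(8, 16)).sum().backward()

Written this way, the unit is a combination of two fixed ReLU activations
weighted by extra trainable scalars, in line with the survey's observation that
many trainable activation functions reduce to added neuron (sub)layers with
fixed activations and constrained weights.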
Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such adaptivity. Our analysis of training is non-quantitative in terms of running time, but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.
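One way to observe the implicit bias described above numerically is to track a
normalized margin during training. The sketch below uses synthetic data and
arbitrary hyperparameters (it is not the paper's experiment): it trains a wide
two-layer ReLU network, with both layers trained, by gradient descent on the
logistic loss and prints min_i y_i f(x_i) / ||theta||^2, the margin
normalization appropriate for this 2-homogeneous model.

import torch

torch.manual_seed(0)
n, d, m, lr, steps = 64, 2, 1000, 0.2, 5000      # arbitrary illustrative values
X = torch.randn(n, d)
y = torch.sign(X[:, 0] * X[:, 1])                # labels from a simple low-dimensional rule

W = (torch.randn(m, d) * 0.1).requires_grad_()   # hidden-layer weights
a = (torch.randn(m) * 0.1).requires_grad_()      # output-layer weights (both layers are trained)

def f(X):
    return torch.relu(X @ W.T) @ a

for t in range(steps):
    loss = torch.nn.functional.softplus(-y * f(X)).mean()   # logistic loss
    loss.backward()
    with torch.no_grad():
        for p in (W, a):
            p -= lr * p.grad
            p.grad.zero_()
    if t % 1000 == 0:
        with torch.no_grad():
            margin = (y * f(X)).min() / (W.norm() ** 2 + a.norm() ** 2)
            print(f"step {t}: loss {loss.item():.4f}, normalized margin {margin.item():.4f}")

As training drives the loss toward zero, the parameters grow while the
normalized margin stabilizes; its limiting value is what the max-margin
characterization in function space describes for the infinite-width gradient flow.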