Group Invariance, Stability to Deformations, and Complexity of Deep Convolutional Representations
The success of deep convolutional architectures is often attributed in part
to their ability to learn multiscale and invariant representations of natural
signals. However, a precise study of these properties and how they affect
learning guarantees is still missing. In this paper, we consider deep
convolutional representations of signals; we study their invariance to
translations and to more general groups of transformations, their stability to
the action of diffeomorphisms, and their ability to preserve signal
information. This analysis is carried out by introducing a multilayer kernel based
on convolutional kernel networks and by studying the geometry induced by the
kernel mapping. We then characterize the corresponding reproducing kernel
Hilbert space (RKHS), showing that it contains a large class of convolutional
neural networks with homogeneous activation functions. This analysis allows us
to separate data representation from learning, and to provide a canonical
measure of model complexity, the RKHS norm, which controls both stability and
generalization of any learned model. In addition to models in the constructed
RKHS, our stability analysis also applies to convolutional networks with
generic activations such as rectified linear units, and we discuss its
relationship with recent generalization bounds based on spectral norms.
On the Inductive Bias of Neural Tangent Kernels
State-of-the-art neural networks are heavily over-parameterized, making the
optimization algorithm a crucial ingredient for learning predictive models with
good generalization properties. A recent line of work has shown that in a
certain over-parameterized regime, the learning dynamics of gradient descent
are governed by a certain kernel obtained at initialization, called the neural
tangent kernel. We study the inductive bias of learning in such a regime by
analyzing this kernel and the corresponding function space (RKHS). In
particular, we study smoothness, approximation, and stability properties of
functions with finite norm, including stability to image deformations in the
case of convolutional networks, and compare to other known kernels for similar
architectures.
Comment: NeurIPS 201
Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
Empirical studies show that gradient-based methods can learn deep neural
networks (DNNs) with very good generalization performance in the
over-parameterization regime, where DNNs can easily fit a random labeling of
the training data. Very recently, a line of work explains in theory that with
over-parameterization and proper random initialization, gradient-based methods
can find the global minima of the training loss for DNNs. However, existing
generalization error bounds are unable to explain the good generalization
performance of over-parameterized DNNs. The major limitation of most existing
generalization bounds is that they are based on uniform convergence and are
independent of the training algorithm. In this work, we derive an
algorithm-dependent generalization error bound for deep ReLU networks, and show
that under certain assumptions on the data distribution, gradient descent (GD)
with proper random initialization is able to train a sufficiently
over-parameterized DNN to achieve arbitrarily small generalization error. Our
work sheds light on explaining the good generalization performance of
over-parameterized deep neural networks.
Comment: 27 pages. This version simplifies the proof and improves the
presentation in Version 3. In AAAI 202
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
Recent works have cast some light on the mystery of why deep nets fit any
data and generalize despite being very overparametrized. This paper analyzes
training and generalization for a simple 2-layer ReLU net with random
initialization, and provides the following improvements over recent works:
(i) Using a tighter characterization of training speed than recent papers, an
explanation for why training a neural net with random labels leads to slower
training, as originally observed in [Zhang et al. ICLR'17].
(ii) Generalization bound independent of network size, using a data-dependent
complexity measure. Our measure distinguishes clearly between random labels and
true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent
papers require sample complexity to increase (slowly) with the size, while our
sample complexity is completely independent of the network size.
(iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets
trained via gradient descent.
The key idea is to track dynamics of training and generalization via
properties of a related kernel.
Comment: In ICML 201
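As a rough illustration of the "related kernel" idea, the sketch below builds the Gram matrix usually associated with an infinitely wide 2-layer ReLU net on unit-norm inputs, H[i, j] = x_i.x_j (pi - arccos(x_i.x_j)) / (2 pi), and evaluates a label-dependent quantity of the form sqrt(2 y^T H^{-1} y / n). The abstract does not state either formula; both are assumptions taken to match the paper's setting, and the function names and the small ridge term are hypothetical choices made only to keep the example stable.

```python
import numpy as np

def relu_gram_matrix(X):
    """Gram matrix H[i, j] = g * (pi - arccos(g)) / (2*pi), with g = x_i . x_j.

    Rows of X are rescaled to unit norm first (assumed setting).
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = np.clip(X @ X.T, -1.0, 1.0)  # clip guards arccos against round-off
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

def complexity_measure(X, y, ridge=1e-8):
    """sqrt(2 * y^T H^{-1} y / n); `ridge` is a hypothetical stabilizer."""
    n = len(y)
    H = relu_gram_matrix(X) + ridge * np.eye(n)
    return float(np.sqrt(2.0 * (y @ np.linalg.solve(H, y)) / n))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 10))
    y = np.sign(X[:, 0])  # labels that depend on the inputs
    print(complexity_measure(X, y))
```

Because the quantity depends only on the inputs and labels, not on the number of hidden units, it can be computed on the same inputs with true labels and with shuffled labels and the two values compared.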