6 research outputs found
An Improved Analysis of Training Over-parameterized Deep Neural Networks
A recent line of research has shown that gradient-based algorithms with
random initialization can converge to the global minima of the training loss
for over-parameterized (i.e., sufficiently wide) deep neural networks. However,
the condition on the width of the neural network required to ensure global
convergence is very stringent, typically a high-degree polynomial in the
training sample size. In this paper, we provide an
improved analysis of the global convergence of (stochastic) gradient descent
for training deep neural networks, which only requires a milder
over-parameterization condition than previous work in terms of the training
sample size and other problem-dependent parameters. The main technical
contributions of our analysis include (a) a tighter gradient lower bound that
leads to a faster convergence of the algorithm, and (b) a sharper
characterization of the trajectory length of the algorithm. When specialized to
two-layer (i.e., one-hidden-layer) neural networks, our result also yields a
milder over-parameterization condition than the best-known result in prior
work. Comment: 30 pages, 1 figure, 1 table.
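As a rough illustration of the setting this line of work analyzes (a minimal sketch with placeholder sizes, not the paper's code or constants), the following PyTorch snippet trains a wide, fully connected ReLU network, with width m much larger than the sample size n, on the squared loss using full-batch gradient descent from random initialization. The names n, d, m, L, and eta are illustrative assumptions.

# Minimal sketch of over-parameterized training: width m >> sample size n.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, m, L, eta = 50, 10, 4096, 3, 1e-3   # illustrative placeholders

X = torch.randn(n, d)
y = torch.randn(n, 1)

# Depth-L fully connected ReLU network.
layers = [nn.Linear(d, m), nn.ReLU()]
for _ in range(L - 1):
    layers += [nn.Linear(m, m), nn.ReLU()]
layers += [nn.Linear(m, 1)]
net = nn.Sequential(*layers)

# Full-batch gradient descent on the squared training loss.
opt = torch.optim.SGD(net.parameters(), lr=eta)
for step in range(1000):
    opt.zero_grad()
    loss = 0.5 * (net(X) - y).pow(2).sum()
    loss.backward()
    opt.step()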
On the Global Convergence of Training Deep Linear ResNets
We study the convergence of gradient descent (GD) and stochastic gradient
descent (SGD) for training L-hidden-layer linear residual networks (ResNets).
We prove that for training deep residual networks with certain linear
transformations at input and output layers, which are fixed throughout
training, both GD and SGD with zero initialization on all hidden weights can
converge to the global minimum of the training loss. Moreover, when
specializing to appropriate Gaussian random linear transformations, GD and SGD
provably optimize wide enough deep linear ResNets. Compared with the global
convergence result of GD for training standard deep linear networks (Du & Hu
2019), our condition on the neural network width is sharper by a factor of
κ, where κ denotes the condition number of the covariance
matrix of the training data. We further propose modified identity input and
output transformations, and show that a network whose width depends only on the
input and output dimensions is sufficient to guarantee the global convergence
of GD/SGD. Comment: 26 pages, 1 figure. In ICLR 2020.
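To make the architecture concrete, here is a minimal sketch (my own reading under stated assumptions, not the authors' code): a deep linear ResNet whose hidden layers are residual maps h -> h + W_l h, with Gaussian input and output transformations A and B drawn once and kept fixed, all hidden weights W_l initialized to zero, and full-batch GD on the squared loss. The dimensions, depth, and learning rate are arbitrary placeholders.

# Minimal sketch of a deep linear ResNet with fixed input/output maps.
import torch

torch.manual_seed(0)
n, d_in, d_out, width, L, eta = 100, 20, 5, 64, 8, 1e-2

X = torch.randn(n, d_in)
Y = torch.randn(n, d_out)

A = torch.randn(width, d_in) / width ** 0.5    # fixed input transformation
B = torch.randn(d_out, width) / width ** 0.5   # fixed output transformation
W = [torch.zeros(width, width, requires_grad=True) for _ in range(L)]  # zero init

def forward(x):
    h = x @ A.t()
    for Wl in W:
        h = h + h @ Wl.t()                     # linear residual block: (I + W_l) h
    return h @ B.t()

opt = torch.optim.SGD(W, lr=eta)               # only the hidden weights are trained
for step in range(500):
    opt.zero_grad()
    loss = 0.5 * (forward(X) - Y).pow(2).sum() / n
    loss.backward()
    opt.step()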
Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks
The skip connections used in residual networks have become a standard
architectural choice in deep learning due to the increased training stability
and generalization performance they provide, although there has been
limited theoretical understanding of this improvement. In this work, we
analyze overparameterized deep residual networks trained by gradient descent
following random initialization, and demonstrate that (i) the class of networks
learned by gradient descent constitutes a small subset of the entire neural
network function class, and (ii) this subclass of networks is sufficiently
large to guarantee small training error. By showing (i), we are able to
demonstrate that deep residual networks trained with gradient descent have a
small generalization gap between training and test error, and together with
(ii), this guarantees that the test error will be small. Our optimization and
generalization guarantees require overparameterization that is only logarithmic
in the depth of the network, while all known generalization bounds for deep
non-residual networks have overparameterization requirements that are at least
polynomial in the depth. This provides an explanation for why residual networks
are preferable to non-residual ones. Comment: 37 pages. In NeurIPS 2019.
A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks
Deep neural networks' remarkable ability to correctly fit training data when
optimized by gradient-based algorithms is yet to be fully understood. Recent
theoretical results explain the convergence for ReLU networks that are wider
than those used in practice by orders of magnitude. In this work, we take a
step towards closing the gap between theory and practice by significantly
improving the known theoretical bounds on both the network width and the
convergence time. We show that convergence to a global minimum is guaranteed
for networks whose width is quadratic in the sample size and linear in their
depth, in time logarithmic in both. Our analysis and convergence bounds are derived
via the construction of a surrogate network with fixed activation patterns that
can be transformed at any time to an equivalent ReLU network of a reasonable
size. This construction can be viewed as a novel technique to accelerate
training, while its tight finite-width equivalence to the Neural Tangent Kernel
(NTK) suggests it can be utilized to study generalization as well.
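One way to picture a surrogate network with fixed activation patterns (a minimal sketch based on my own reading, not the authors' construction) is a one-hidden-layer ReLU network whose 0/1 gates are recorded on the training inputs at initialization and then frozen, so training optimizes a network whose activation pattern never changes. All sizes and names below are illustrative assumptions.

# Minimal sketch of a fixed-activation-pattern surrogate network.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, m, eta = 32, 10, 512, 1e-3

X = torch.randn(n, d)
y = torch.randn(n, 1)

W1 = nn.Parameter(torch.randn(m, d) / d ** 0.5)
W2 = nn.Parameter(torch.randn(1, m) / m ** 0.5)

with torch.no_grad():
    mask = (X @ W1.t() > 0).float()   # ReLU gates at initialization, then frozen

def surrogate(x):
    # The frozen mask is tied to the training inputs, so this surrogate is
    # only defined on X itself; the fixed gates replace the data-dependent ReLU.
    return (mask * (x @ W1.t())) @ W2.t()

opt = torch.optim.SGD([W1, W2], lr=eta)
for step in range(1000):
    opt.zero_grad()
    loss = 0.5 * (surrogate(X) - y).pow(2).sum()
    loss.backward()
    opt.step()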
Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian
Modern neural network architectures often generalize well despite containing
many more parameters than the size of the training dataset. This paper explores
the generalization capabilities of neural networks trained via gradient
descent. We develop a data-dependent optimization and generalization theory
which leverages the low-rank structure of the Jacobian matrix associated with
the network. Our results help demystify why training and generalization are
easier on clean, structured datasets and harder on noisy, unstructured
datasets, as well as how the network size affects the evolution of the training and
test errors during training. Specifically, we use a control knob to split the
Jacobian spectrum into "information" and "nuisance" spaces associated with the
large and small singular values. We show that over the information space
learning is fast and one can quickly train a model with zero training loss that
can also generalize well. Over the nuisance space training is slower and early
stopping can help with generalization at the expense of some bias. We also show
that the overall generalization capability of the network is controlled by how
well the label vector is aligned with the information space. A key feature of
our results is that even constant width neural nets can provably generalize for
sufficiently nice datasets. We conduct various numerical experiments on deep
networks that corroborate our theoretical findings and demonstrate that: (i)
the Jacobian of typical neural networks exhibits low-rank structure, with a few
large singular values and many small ones, leading to a low-dimensional
information space, (ii) over the information space learning is fast and most of
the label vector falls on this space, and (iii) label noise falls on the
nuisance space and impedes optimization and generalization.
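The spirit of these observations can be checked in a few lines of PyTorch. The sketch below (an illustrative toy setup, not the paper's experimental code) computes the Jacobian of a small network's outputs with respect to its parameters, inspects its singular values, and measures how much of the label vector lies in the top singular subspace, i.e., the "information" space; the cut-off k and all sizes are assumptions.

# Minimal sketch: Jacobian spectrum and alignment of labels with the top subspace.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, m = 40, 8, 128
X = torch.randn(n, d)
y = torch.randn(n)

net = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))
out = net(X).squeeze(-1)                      # outputs on the training set, shape (n,)

# Build the (n x num_params) Jacobian one row at a time via autograd.
rows = []
for i in range(n):
    grads = torch.autograd.grad(out[i], list(net.parameters()), retain_graph=True)
    rows.append(torch.cat([g.reshape(-1) for g in grads]))
J = torch.stack(rows)

U, S, _ = torch.linalg.svd(J, full_matrices=False)
k = 10                                        # illustrative "information"/"nuisance" cut-off
info_fraction = (U[:, :k].t() @ y).pow(2).sum() / y.pow(2).sum()
print("largest singular values:", S[:5])
print("fraction of label energy in the top-%d left-singular subspace: %.3f"
      % (k, info_fraction.item()))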
Optimization for deep learning: theory and algorithms
When and why can a neural network be successfully trained? This article
provides an overview of optimization algorithms and theory for training neural
networks. First, we discuss the issue of gradient explosion/vanishing and the
more general issue of undesirable spectrum, and then discuss practical
solutions including careful initialization and normalization methods. Second,
we review generic optimization methods used in training neural networks, such
as SGD, adaptive gradient methods and distributed methods, and theoretical
results for these algorithms. Third, we review existing research on the global
issues of neural network training, including results on bad local minima, mode
connectivity, the lottery ticket hypothesis, and infinite-width analysis. Comment: 38 pages of main body; 5 pages of appendix; 12 pages of references.