An Improved Analysis of Training Over-parameterized Deep Neural Networks
A recent line of research has shown that gradient-based algorithms with
random initialization can converge to the global minima of the training loss
for over-parameterized (i.e., sufficiently wide) deep neural networks. However,
the condition on the width of the network needed to ensure global convergence
is very stringent, typically a high-degree polynomial in the training sample
size. In this paper, we provide an
improved analysis of the global convergence of (stochastic) gradient descent
for training deep neural networks, which only requires a milder
over-parameterization condition than previous work in terms of the training
sample size and other problem-dependent parameters. The main technical
contributions of our analysis include (a) a tighter gradient lower bound that
leads to a faster convergence of the algorithm, and (b) a sharper
characterization of the trajectory length of the algorithm. When specialized to
two-layer (i.e., one-hidden-layer) neural networks, our result also yields a
milder over-parameterization condition than the best-known result in prior work.
Comment: 30 pages, 1 figure, 1 table
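As a rough empirical illustration of the over-parameterized regime that this line of work analyzes (not of the paper's proof technique), the following NumPy sketch trains a wide two-layer ReLU network with NTK-style scaling by full-batch gradient descent on a small synthetic data set; with the width much larger than the sample size, the printed training loss should decay towards zero. All sizes, the step size, and the choice to train only the first layer are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr, steps = 15, 20, 2000, 0.2, 500   # samples, input dim, width, step size, iterations

X = rng.standard_normal((n, d)) / np.sqrt(d)  # non-degenerate inputs, roughly unit norm
y = rng.standard_normal(n)                    # arbitrary real-valued targets

# NTK-style parameterization: f(x) = (1/sqrt(m)) * a^T relu(W x); only W is trained
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

for t in range(steps + 1):
    act = (X @ W.T > 0).astype(float)                      # ReLU activation pattern, (n, m)
    pred = np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)
    resid = pred - y
    if t % 100 == 0:
        print(t, "training loss:", 0.5 * np.sum(resid ** 2))
    # gradient of 0.5 * ||pred - y||^2 with respect to W
    grad_W = ((resid[:, None] * act) * a[None, :] / np.sqrt(m)).T @ X
    W -= lr * grad_W
```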
Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima
We consider the problem of learning a one-hidden-layer neural network with a
non-overlapping convolutional layer and ReLU activation, in which both the
convolutional weights and the output weights are parameters to be learned. When
the labels are the outputs of a teacher network of the same architecture with
fixed weights, we prove that with Gaussian input there is a spurious local
minimizer. Surprisingly, in the presence of this spurious local minimizer,
gradient descent with weight normalization from randomly initialized weights
can still be proven to recover the true parameters with constant probability,
which can be boosted to probability 1 with multiple restarts. We
also show that with constant probability, the same procedure could also
converge to the spurious local minimum, showing that the local minimum plays a
non-trivial role in the dynamics of gradient descent. Furthermore, a
quantitative analysis shows that the gradient descent dynamics has two phases:
it starts off slowly, but converges much faster after several iterations.
Comment: Accepted by ICML 2018
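The following toy sketch illustrates gradient descent under the weight-normalization reparameterization w = g * v / ||v|| for a single ReLU neuron with Gaussian inputs and teacher labels; it is not the paper's non-overlapping CNN setting, and the dimensions and step size are arbitrary. The only non-obvious step is the chain rule through (g, v). As the abstract suggests for its setting, convergence from a random start is typical but not guaranteed for every initialization.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lr, steps = 5, 512, 0.5, 300

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)              # planted teacher direction

X = rng.standard_normal((n, d))               # Gaussian inputs
y = np.maximum(X @ w_star, 0.0)               # labels from a ReLU teacher neuron

# weight normalization: w = g * v / ||v||
v = rng.standard_normal(d)
g = 1.0

for t in range(steps + 1):
    v_norm = np.linalg.norm(v)
    w = g * v / v_norm
    pred = np.maximum(X @ w, 0.0)
    resid = pred - y
    if t % 50 == 0:
        print(t, "loss:", 0.5 * np.mean(resid ** 2))
    act = (X @ w > 0).astype(float)
    grad_w = (resid * act) @ X / n            # dL/dw for L = mean 0.5*(pred - y)^2
    # chain rule through the reparameterization (g, v) -> w
    grad_g = grad_w @ v / v_norm
    grad_v = (g / v_norm) * (grad_w - (grad_w @ v / v_norm ** 2) * v)
    g -= lr * grad_g
    v -= lr * grad_v
```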
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
At initialization, artificial neural networks (ANNs) are equivalent to
Gaussian processes in the infinite-width limit, thus connecting them to kernel
methods. We prove that the evolution of an ANN during training can also be
described by a kernel: during gradient descent on the parameters of an ANN, the
network function (which maps input vectors to output vectors)
follows the kernel gradient of the functional cost (which is convex, in
contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel
(NTK). This kernel is central to describing the generalization features of ANNs.
While the NTK is random at initialization and varies during training, in the
infinite-width limit it converges to an explicit limiting kernel and it stays
constant during training. This makes it possible to study the training of ANNs
in function space instead of parameter space. Convergence of the training can
then be related to the positive-definiteness of the limiting NTK. We prove the
positive-definiteness of the limiting NTK when the data is supported on the
sphere and the non-linearity is non-polynomial. We then focus on the setting of
least-squares regression and show that in the infinite-width limit, the network
function follows a linear differential equation during training. The
convergence is fastest along the largest kernel principal components of the
input data with respect to the NTK, hence suggesting a theoretical motivation
for early stopping. Finally we study the NTK numerically, observe its behavior
for wide networks, and compare it to the infinite-width limit.
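A minimal sketch of the empirical NTK for a two-layer ReLU network (only the input weights are trained, and the data, width, and step size are arbitrary assumptions): the Gram matrix of the network's parameter Jacobian at initialization is used to predict the decay of the training residuals, which for large width should roughly match the residuals produced by running actual gradient descent.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m, lr, steps = 10, 10, 5000, 0.5, 300

X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

W = rng.standard_normal((m, d))               # trained
a = rng.choice([-1.0, 1.0], size=m)           # fixed output layer

def f(W):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def ntk_gram(W):
    # empirical NTK: Gram matrix of the Jacobian of f with respect to W
    act = (X @ W.T > 0).astype(float)                          # (n, m)
    J = ((act * a[None, :]) / np.sqrt(m))[:, :, None] * X[:, None, :]
    return J.reshape(n, -1) @ J.reshape(n, -1).T

K0 = ntk_gram(W)
r0 = f(W) - y                                 # initial residuals

for t in range(steps):                        # full-batch gradient descent on 0.5*||f - y||^2
    act = (X @ W.T > 0).astype(float)
    resid = f(W) - y
    grad_W = ((resid[:, None] * act) * a[None, :] / np.sqrt(m)).T @ X
    W -= lr * grad_W

# linearized (NTK) prediction of the residual dynamics: r_{t+1} = (I - lr * K0) r_t
r_pred = np.linalg.matrix_power(np.eye(n) - lr * K0, steps) @ r0
print("actual        ||r_T||:", np.linalg.norm(f(W) - y))
print("NTK-predicted ||r_T||:", np.linalg.norm(r_pred))
```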
Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds
We study the complexity of training neural network models with one hidden
nonlinear activation layer and an output weighted sum layer. We analyze
Gradient Descent applied to learning a bounded target function on
real-valued inputs. We give an agnostic learning guarantee for GD: starting
from a randomly initialized network, it converges in mean squared loss to the
minimum error (in 2-norm) of the best approximation of the target function by a
bounded-degree polynomial. Moreover, the size of the network and the number of
iterations needed are both explicitly bounded in terms of the degree, the input
dimension, and the target accuracy. In particular, this applies to training networks of
unbiased sigmoids and ReLUs. We also rigorously explain the empirical finding
that gradient descent discovers lower frequency Fourier components before
higher frequency components.
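The low-frequency-first finding can be illustrated with a generic sketch (not the paper's construction): a two-layer ReLU network is trained by full-batch gradient descent on a one-dimensional target mixing a low and a high frequency, and the Fourier sine coefficients of the network output are printed; the k=1 coefficient should approach its target value well before the k=8 coefficient does. All sizes and the step size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lr, steps = 256, 512, 0.02, 5000

x = np.linspace(-np.pi, np.pi, n)
y = np.sin(x) + np.sin(8 * x)                  # low- plus high-frequency target

W = rng.standard_normal(m)                     # input weights (1-D inputs)
b = rng.uniform(-np.pi, np.pi, m)              # biases spread over the interval
a = rng.standard_normal(m) / np.sqrt(m)        # output weights

def sine_coeff(signal, k):
    # approximate Fourier sine coefficient of frequency k on [-pi, pi]
    return 2.0 / n * np.sum(signal * np.sin(k * x))

for t in range(steps + 1):
    pre = np.outer(x, W) + b                   # pre-activations, (n, m)
    H = np.maximum(pre, 0.0)
    pred = H @ a
    resid = pred - y
    if t % 1000 == 0:
        print(t, "k=1 coeff:", round(sine_coeff(pred, 1), 3),
              "  k=8 coeff:", round(sine_coeff(pred, 8), 3))
    grad_a = H.T @ resid / n
    grad_pre = (resid[:, None] * (pre > 0)) * a[None, :] / n
    a -= lr * grad_a
    W -= lr * (x @ grad_pre)
    b -= lr * grad_pre.sum(axis=0)
```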
We complement this result with nearly matching lower bounds in the
Statistical Query model. GD fits well in the SQ framework since each training
step is determined by an expectation over the input distribution. We show that
any SQ algorithm that achieves significant improvement over a constant function
with queries of tolerance some inverse polynomial in the input dimensionality
must use a number of queries that grows as a high-degree polynomial in the input
dimensionality, even when the target functions are restricted to a set of
bounded-degree polynomials and the input distribution is uniform over the unit
sphere; for this class the information-theoretic lower bound is much smaller.
Our approach for both parts is based on spherical harmonics. We view gradient
descent as an operator on the space of functions, and study its dynamics. An
essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of
this operator in the case of the mean squared loss.
Comment: Revised version now includes matching lower bounds
Width Provably Matters in Optimization for Deep Linear Neural Networks
We prove that for a deep fully-connected linear neural network, if the width of
every hidden layer is sufficiently large (depending polynomially on the depth,
on the rank and condition number of the input data, and on the output
dimension), then gradient descent with Gaussian random initialization converges
to a global minimum at a linear rate; the number of iterations needed to find
an ε-suboptimal solution scales only logarithmically in 1/ε. Our polynomial
upper bound on the total running time for wide deep linear networks and the
lower bound for narrow deep linear neural networks [Shamir, 2018] together
demonstrate that wide layers are necessary for optimizing deep models.
Comment: In ICML 2019
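A small sketch of the setting (not of the paper's particular width requirement): a three-layer linear network with wide hidden layers is trained by full-batch gradient descent from Gaussian initialization, and the printed squared loss should decrease roughly geometrically. The widths, step size, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_in, d_out, width, lr, steps = 50, 10, 3, 200, 0.05, 500

X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, d_in)) @ X     # targets from a linear teacher

# 3-layer linear network with Gaussian initialization (scaled by 1/sqrt(fan-in))
W1 = rng.standard_normal((width, d_in)) / np.sqrt(d_in)
W2 = rng.standard_normal((width, width)) / np.sqrt(width)
W3 = rng.standard_normal((d_out, width)) / np.sqrt(width)

for t in range(steps + 1):
    H1 = W1 @ X
    H2 = W2 @ H1
    R = W3 @ H2 - Y                            # residual matrix, (d_out, n)
    if t % 100 == 0:
        print(t, "loss:", 0.5 * np.sum(R ** 2) / n)
    g3 = R @ H2.T / n                          # gradients of 0.5*||W3 W2 W1 X - Y||_F^2 / n
    g2 = W3.T @ R @ H1.T / n
    g1 = W2.T @ W3.T @ R @ X.T / n
    W1 -= lr * g1
    W2 -= lr * g2
    W3 -= lr * g3
```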
A Convergence Theory for Deep Learning via Over-Parameterization
Deep neural networks (DNNs) have demonstrated dominating performance in many
fields; since AlexNet, networks used in practice are going wider and deeper. On
the theoretical side, a long line of works has been focusing on training neural
networks with one hidden layer. The theory of multi-layer networks remains
largely unsettled.
In this work, we prove why stochastic gradient descent (SGD) can find global
minima on the training objective of DNNs in polynomial time. We only make two
assumptions: the inputs are non-degenerate and the network is over-parameterized.
The latter means the network width is sufficiently large: polynomial in the
number of layers and in the number of samples.
Our key technique is to derive that, in a sufficiently large neighborhood of
the random initialization, the optimization landscape is almost-convex and
semi-smooth even with ReLU activations. This implies an equivalence between
over-parameterized neural networks and neural tangent kernel (NTK) in the
finite (and polynomial) width setting.
As concrete examples, starting from randomly initialized weights, we prove
that SGD can attain 100% training accuracy in classification tasks, or minimize
regression loss at a linear convergence rate, with running time polynomial in
the number of samples and the number of layers. Our theory applies to the
widely used but non-smooth ReLU activation, and to smooth, possibly non-convex
loss functions. In terms of network architectures, our theory applies at least
to fully-connected neural networks, convolutional neural networks (CNN), and
residual neural networks (ResNet).
Comment: V2 adds citation and V3/V4/V5 polish writing
Theory III: Dynamics and Generalization in Deep Networks
The key to generalization is controlling the complexity of the network.
However, there is no obvious control of complexity -- such as an explicit
regularization term -- in the training of deep networks for classification. We
will show that a classical -- though hidden -- form of norm control is present
in deep networks trained with gradient descent techniques on exponential-type
losses. In particular, gradient descent induces a dynamics of the normalized
weights which converges, as training time goes to infinity, to an equilibrium
corresponding to a minimum norm (or maximum margin) solution. For sufficiently
large but finite training times, the dynamics converges to one of several
margin maximizers, with the margin monotonically increasing towards a limit
stationary point of the flow. In the usual case of stochastic gradient descent,
most of the stationary points are likely to be convex minima corresponding to a
constrained minimizer -- the network with normalized weights -- which
corresponds to vanishing regularization. The solution has zero generalization
gap, for fixed architecture, asymptotically as the number of training examples
goes to infinity. Our approach
extends some of the original results of Srebro from linear networks to deep
networks and provides a new perspective on the implicit bias of gradient
descent. We believe that the elusive complexity control we describe is
responsible for the puzzling empirical finding of good predictive performance
by deep networks, despite overparametrization.
Comment: 47 pages, 11 figures. This replaces previous versions of Theory III
that appeared on arXiv [arXiv:1806.11379, arXiv:1801.00173] or on the CBMM
site. v5: Changes throughout the paper to the presentation and tightening some
of the statements
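A minimal sketch of the margin dynamics on an exponential-type loss, here for a linear model so that the maximum-margin behavior is easy to see: gradient descent on the average logistic loss over linearly separable data, printing the minimum margin of the normalized weights w/||w||, which should keep increasing slowly as training continues. The data and step size are arbitrary assumptions, and this is not the deep-network setting of the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lr = 100, 5, 0.5

w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)                        # linearly separable labels

w = np.zeros(d)
for t in range(1, 50001):
    margins = y * (X @ w)
    # gradient of the average logistic loss  log(1 + exp(-y * w.x))
    s = 1.0 / (1.0 + np.exp(np.clip(margins, -30, 30)))   # clipped for numerical stability
    w -= lr * (-(y * s) @ X / n)
    if t in (100, 1000, 10000, 50000):
        w_hat = w / np.linalg.norm(w)          # normalized weights
        print(t, "min margin of w/||w||:", np.min(y * (X @ w_hat)))
```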
Theoretical insights into the optimization landscape of over-parameterized shallow neural networks
In this paper we study the problem of learning a shallow artificial neural
network that best fits a training data set. We study this problem in the
over-parameterized regime where the number of observations is smaller than the
number of parameters in the model. We show that with quadratic activations the
optimization landscape of training such shallow neural networks has certain
favorable characteristics that allow globally optimal models to be found
efficiently using a variety of local search heuristics. This result holds for
an arbitrary training data of input/output pairs. For differentiable activation
functions we also show that gradient descent, when suitably initialized,
converges at a linear rate to a globally optimal model. This result focuses on
a realizable model where the inputs are chosen i.i.d. from a Gaussian
distribution and the labels are generated according to planted weight
coefficients.
Comment: Section 3 on numerical experiments is added. Theorems 2.1 and 2.2 are
improved to apply to almost all input data (not just Gaussian inputs). The
related work section is expanded. The paper is accepted for publication in
IEEE Transactions on Information Theory (2018).
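An illustrative sketch of the quadratic-activation model (not the paper's exact formulation or initialization): a shallow network f(x) = sum_r v_r (w_r . x)^2 with fixed +/-1 output weights and trainable hidden weights is fitted by gradient descent to arbitrary labels in the over-parameterized regime, and the printed training loss should decrease towards zero. All sizes and the step size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, k, lr, steps = 30, 10, 20, 0.02, 3000   # n < d*(d+1)/2, so a fit exists generically

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)                    # arbitrary real-valued labels

W = rng.standard_normal((k, d)) * 0.2         # trainable hidden weights
v = np.tile([1.0, -1.0], k // 2)              # fixed output weights with both signs

for t in range(steps + 1):
    Z = X @ W.T                               # pre-activations, (n, k)
    pred = (Z ** 2) @ v                       # quadratic activation
    resid = pred - y
    if t % 500 == 0:
        print(t, "loss:", 0.5 * np.mean(resid ** 2))
    # d pred_i / d w_r = 2 * v_r * z_ir * x_i
    grad_W = (2.0 * (resid[:, None] * Z) * v[None, :]).T @ X / n
    W -= lr * grad_W
```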
Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
We study the problem of training deep neural networks with Rectified Linear
Unit (ReLU) activation function using gradient descent and stochastic gradient
descent. In particular, we study the binary classification problem and show
that for a broad family of loss functions, with proper random weight
initialization, both gradient descent and stochastic gradient descent can find
the global minima of the training loss for an over-parameterized deep ReLU
network, under mild assumptions on the training data. The key idea of our proof
is that Gaussian random initialization followed by (stochastic) gradient
descent produces a sequence of iterates that stay inside a small perturbation
region centered around the initial weights, in which the empirical loss
function of deep ReLU networks enjoys nice local curvature properties that
ensure the global convergence of (stochastic) gradient descent. Our theoretical
results shed light on understanding the optimization for deep learning, and
pave the way for studying the optimization dynamics of training modern deep
neural networks.
Comment: 54 pages. This version relaxes the assumptions on the loss functions
and data distribution, and improves the dependency on the problem-specific
parameters in the main theorems.
Convergence of a Relaxed Variable Splitting Method for Learning Sparse Neural Networks via ℓ1, ℓ0, and Transformed-ℓ1 Penalties
Sparsification of neural networks is one of the effective complexity
reduction methods to improve efficiency and generalizability. We consider the
problem of learning a one hidden layer convolutional neural network with ReLU
activation function via gradient descent under sparsity promoting penalties. It
is known that when the input data is Gaussian distributed, no-overlap networks
(without penalties) in regression problems with ground truth can be learned in
polynomial time with high probability. We propose a relaxed variable splitting
method integrating thresholding and gradient descent to handle the
non-smoothness of the penalized loss function. Sparsity in the network weights
is realized during the optimization (training) process. We prove that under the
ℓ1, ℓ0, and transformed-ℓ1 penalties, no-overlap networks can be learned
with high probability, and the iterative weights converge to a global limit
which is a transformation of the true weight under a novel thresholding
operation. Numerical experiments confirm the theoretical findings and compare
the accuracy-sparsity trade-off among the penalties.
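A rough sketch of the relaxed-variable-splitting pattern (alternating a soft-thresholding update of the split variable with a gradient step on the smooth part), applied to an ℓ1 penalty on a one-neuron ReLU regression with a sparse planted weight vector. The penalty weight lam, coupling beta, and learning rate are hypothetical choices, and the paper's exact scheme and additional penalties may differ.

```python
import numpy as np

def soft_threshold(z, tau):
    # proximal operator of the l1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(6)
d, n, lam, beta, lr, steps = 20, 1024, 0.05, 1.0, 0.2, 2000

w_star = np.zeros(d)
w_star[:4] = rng.standard_normal(4)           # sparse planted weights
X = rng.standard_normal((n, d))               # Gaussian inputs
y = np.maximum(X @ w_star, 0.0)               # ReLU teacher labels

w = 0.1 * rng.standard_normal(d)
for t in range(steps):
    # u-step: threshold a copy of the weights (handles the non-smooth penalty)
    u = soft_threshold(w, lam / beta)
    # w-step: gradient step on  0.5*mean((relu(Xw) - y)^2) + (beta/2)*||w - u||^2
    pred = np.maximum(X @ w, 0.0)
    grad = ((pred - y) * (X @ w > 0)) @ X / n + beta * (w - u)
    w -= lr * grad

print("nonzero entries of u:", np.count_nonzero(u),
      "   true support size:", np.count_nonzero(w_star))
```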