Small nonlinearities in activation functions create bad local minima in neural networks
We investigate the loss surface of neural networks. We prove that even for
one-hidden-layer networks with "slightest" nonlinearity, the empirical risks
have spurious local minima in most cases. Our results thus indicate that in
general "no spurious local minima" is a property limited to deep linear
networks, and insights obtained from linear networks may not be robust.
Specifically, for ReLU(-like) networks we constructively prove that for almost
all practical datasets there exist infinitely many local minima. We also
present a counterexample for more general activations (sigmoid, tanh, arctan,
ReLU, etc.), for which there exists a bad local minimum. Our results make the
least restrictive assumptions relative to existing results on spurious local
optima in neural networks. We complete our discussion by presenting a
comprehensive characterization of global optimality for deep linear networks,
which unifies other results on this topic. Comment: 33 pages, appeared at ICLR 2019.
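To make the notion of a spurious local minimum concrete, here is a minimal numerical probe, not the paper's construction: it evaluates the empirical risk of a small one-hidden-layer ReLU network and checks whether small random perturbations around an arbitrary candidate point reduce the risk. The dataset, width, and candidate point are illustrative assumptions.

```python
# Illustrative only: a numerical probe for (apparent) local minimality of a
# one-hidden-layer ReLU network's empirical risk. This is NOT the paper's
# construction; dataset, width, and the candidate point are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 samples, 2 features
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]   # arbitrary regression targets

def risk(params, width=3):
    W = params[: width * 2].reshape(width, 2)   # hidden-layer weights
    v = params[width * 2 :]                     # output weights
    hidden = np.maximum(X @ W.T, 0.0)           # ReLU activations
    return np.mean((hidden @ v - y) ** 2)       # empirical squared loss

theta = rng.normal(size=3 * 2 + 3)              # some candidate parameter point
base = risk(theta)

# Sample small random perturbations; if none decreases the risk, theta behaves
# like a local minimum at this resolution (a heuristic check, not a proof).
radius = 1e-3
decreases = sum(
    risk(theta + radius * rng.normal(size=theta.size)) < base - 1e-12
    for _ in range(2000)
)
print(f"risk at theta = {base:.4f}, descent perturbations found: {decreases}")
```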
Depth creates no more spurious local minima
We show that for any convex differentiable loss, a deep linear network has no
spurious local minima as long as it is true for the two layer case. This
reduction greatly simplifies the study on the existence of spurious local
minima in deep linear networks. When applied to the quadratic loss, our result
immediately implies the powerful result in [Kawaguchi 2016]. Further, with the
work in [Zhou and Liang 2018], we can remove all the assumptions in [Kawaguchi
2016]. This property holds for more general "multi-tower" linear networks too.
Our proof builds on [Laurent and von Brecht 2018] and develops a new
perturbation argument to show that any spurious local minimum must have full
rank, a structural property which can be useful more generally
Deep Neural Networks
Deep Neural Networks (DNNs) are universal function approximators providing
state-of-the-art solutions on a wide range of applications. Common perceptual
tasks such as speech recognition, image classification, and object tracking are
now commonly tackled via DNNs. Some fundamental problems remain: (1) the lack
of a mathematical framework providing an explicit and interpretable
input-output formula for any topology, (2) quantification of DNN stability
regarding adversarial examples (i.e. modified inputs fooling DNN predictions
whilst undetectable to humans), (3) absence of generalization guarantees and
controllable behaviors for ambiguous patterns, and (4) how to leverage unlabeled
data to apply DNNs to domains where expert labeling is scarce, as in the medical
field.
Answering those points would provide theoretical perspectives for further
developments based on a common ground. Furthermore, DNNs are now deployed in
tremendous societal applications, pushing the need to fill this theoretical gap
to ensure control, reliability, and interpretability. Comment: Technical Report.
Traversing the noise of dynamic mini-batch sub-sampled loss functions: A visual guide
Mini-batch sub-sampling in neural network training is unavoidable, due to
growing data demands, memory-limited computational resources such as graphics
processing units (GPUs), and the dynamics of on-line learning. In this study we
specifically distinguish between static mini-batch sub-sampled loss functions,
where mini-batches are intermittently fixed during training, resulting in
smooth but biased loss functions; and the dynamic sub-sampling equivalent,
where new mini-batches are sampled at every loss evaluation, trading bias for
variance in the form of sampling-induced discontinuities. These discontinuities
render automated optimization strategies such as minimization line searches
ineffective, since critical points may not exist and function minimizers find
spurious, discontinuity-induced minima.
This paper suggests recasting the optimization problem to find stochastic
non-negative associated gradient projection points (SNN-GPPs). We demonstrate
that the SNN-GPP optimality criterion is less susceptible to sub-sampling
induced discontinuities than critical points or minimizers. We conduct a visual
investigation, comparing local minimum and SNN-GPP optimality criteria in the
loss functions of a simple neural network training problem for a variety of
popular activation functions. Since SNN-GPPs better approximate the location of
true optima, particularly when using smooth activation functions with high
curvature characteristics, we postulate that line searches locating SNN-GPPs
can contribute significantly to automating neural network training. Comment: 43 pages, 22 Figures, to be submitted to a journal.
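The static/dynamic distinction and the sign-change flavour of the SNN-GPP criterion can be sketched in a few lines. The following is an illustrative example with a simple least-squares model, not the authors' code; the model, batch size, and search direction are assumptions.

```python
# Sketch (not the authors' code): static vs. dynamic mini-batch sub-sampled
# directional derivatives along a fixed search direction for least squares,
# plus a crude scan for a sign change in the dynamically sampled derivative,
# which is the flavour of the SNN-GPP criterion described above.
import numpy as np

rng = np.random.default_rng(1)
N, d, batch = 1000, 5, 32
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

w0 = np.zeros(d)                      # starting point
direction = w_true - w0               # a descent direction (assumed known here)
direction /= np.linalg.norm(direction)

def grad(w, idx):
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

static_idx = rng.choice(N, batch, replace=False)   # fixed mini-batch

def dir_deriv(alpha, dynamic):
    idx = rng.choice(N, batch, replace=False) if dynamic else static_idx
    return grad(w0 + alpha * direction, idx) @ direction

# The static curve is smooth but biased toward its fixed batch; the dynamic
# curve is noisy, and the step size where its sign flips from negative to
# positive approximates an SNN-GPP along this direction.
for alpha in np.linspace(0.0, 3.0, 13):
    print(f"alpha={alpha:4.2f}  static={dir_deriv(alpha, False):+8.3f}  "
          f"dynamic={dir_deriv(alpha, True):+8.3f}")
```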
A Note on Connectivity of Sublevel Sets in Deep Learning
It is shown that for deep neural networks, a single wide layer of width N+1
(N being the number of training samples) suffices to prove the connectivity
of sublevel sets of the training loss function. In the two-layer setting, the
same property may not hold even if one has just one neuron less (i.e. width N
can lead to disconnected sublevel sets).
Efficiently testing local optimality and escaping saddles for ReLU networks
We provide a theoretical algorithm for checking local optimality and escaping
saddles at nondifferentiable points of empirical risks of two-layer ReLU
networks. Our algorithm receives any parameter value and returns: local
minimum, second-order stationary point, or a strict descent direction. The
presence of M data points on the nondifferentiability of the ReLU divides the
parameter space into at most 2^M regions, which makes analysis difficult. By
exploiting polyhedral geometry, we reduce the total computation down to one
convex quadratic program (QP) for each hidden node, O(M) (in)equality tests,
and one (or a few) nonconvex QP. For the last QP, we show that our specific
problem can be solved efficiently, in spite of nonconvexity. In the benign
case, we solve one equality constrained QP, and we prove that projected
gradient descent solves it exponentially fast. In the bad case, we have to
solve a few more inequality constrained QPs, but we prove that the time
complexity is exponential only in the number of inequality constraints. Our
experiments show that either benign case or bad case with very few inequality
constraints occurs, implying that our algorithm is efficient in most cases. Comment: 23 pages, appeared at ICLR 2019.
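As a much simpler illustration of why nondifferentiable points need special care (this is not the paper's QP-based algorithm), the snippet below places toy data exactly on the ReLU kink and compares one-sided directional slopes of the empirical risk; a gradient-only check would miss the asymmetry.

```python
# Illustration only (not the paper's QP-based test): when data points lie
# exactly on a ReLU kink, the empirical risk has different one-sided
# directional derivatives, so plain gradient-based optimality checks break down.
import numpy as np

X = np.array([[1.0], [2.0], [-1.0]])   # toy 1-d inputs
y = np.array([1.0, 2.0, 0.0])

def risk(w, v):
    return np.mean((v * np.maximum(X[:, 0] * w, 0.0) - y) ** 2)

w0, v0 = 0.0, 1.0                      # every x_i * w0 = 0: all points on the kink
eps = 1e-6
right = (risk(w0 + eps, v0) - risk(w0, v0)) / eps   # one-sided slope, +direction
left = (risk(w0 - eps, v0) - risk(w0, v0)) / eps    # one-sided slope, -direction
print(f"one-sided slopes in w: +dir {right:.3f}, -dir {left:.3f}")
# Both one-sided slopes must be checked (an "(in)equality test" in spirit)
# before declaring the point locally optimal along this direction.
```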
Understanding Global Loss Landscape of One-hidden-layer ReLU Networks, Part 1: Theory
For one-hidden-layer ReLU networks, we prove that all differentiable local
minima are global inside differentiable regions. We give the locations and
losses of differentiable local minima, and show that these local minima can be
isolated points or continuous hyperplanes, depending on an interplay between
data, activation pattern of hidden neurons and network size. Furthermore, we
give necessary and sufficient conditions for the existence of saddle points as
well as non-differentiable local minima, and their locations if they exist
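The role of the activation pattern can be made concrete with a small sketch (illustrative only, not taken from the paper): the 0/1 pattern of hidden-neuron activations on the training data picks out a parameter region in which the ReLU risk is differentiable.

```python
# Illustrative sketch: the activation pattern of hidden neurons on the data.
# Within a parameter region where this pattern is constant, the ReLU risk is
# differentiable, which is the setting of the "differentiable local minima"
# discussed above. Sizes below are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))        # 6 training points, 3 features
W = rng.normal(size=(4, 3))        # 4 hidden neurons (first-layer weights)

pattern = (X @ W.T > 0).astype(int)   # pattern[i, j] = 1 iff neuron j is active on x_i
print(pattern)
# Any parameter perturbation small enough to keep this 0/1 matrix unchanged
# stays inside one differentiable region of the empirical risk.
```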
On Convergence and Stability of GANs
We propose studying GAN training dynamics as regret minimization, which is in
contrast to the popular view that there is consistent minimization of a
divergence between real and generated distributions. We analyze the convergence
of GAN training from this new point of view to understand why mode collapse
happens. We hypothesize the existence of undesirable local equilibria in this
non-convex game to be responsible for mode collapse. We observe that these
local equilibria often exhibit sharp gradients of the discriminator function
around some real data points. We demonstrate that these degenerate local
equilibria can be avoided with a gradient penalty scheme called DRAGAN. We show
that DRAGAN enables faster training, achieves improved stability with fewer
mode collapses, and leads to generator networks with better modeling
performance across a variety of architectures and objective functions.
Comment: Analysis of convergence and mode collapse by studying GAN training
process as regret minimization. Some new results.
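For reference, a DRAGAN-style gradient penalty can be sketched in a few lines of PyTorch; the discriminator architecture, noise scale, and penalty weight below are placeholder assumptions rather than the paper's exact settings.

```python
# Minimal PyTorch sketch of a DRAGAN-style gradient penalty: penalize the
# discriminator's input-gradient norm at noisy perturbations of real data.
# The discriminator, noise scale, and lambda below are placeholder choices.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def dragan_penalty(disc, real, noise_std=0.5, lam=10.0):
    # Perturb real samples and push the discriminator's gradient norm
    # toward 1 at those perturbed points.
    perturbed = real + noise_std * torch.randn_like(real)
    perturbed.requires_grad_(True)
    scores = disc(perturbed)
    grads, = torch.autograd.grad(
        outputs=scores.sum(), inputs=perturbed, create_graph=True
    )
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

real_batch = torch.randn(16, 2)              # stand-in for real data
penalty = dragan_penalty(disc, real_batch)   # add this to the discriminator loss
print(penalty.item())
```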
Neural Networks with Complex-Valued Weights Have No Spurious Local Minima
We study the benefits of complex-valued weights for neural networks. We prove
that shallow complex neural networks with quadratic activations have no
spurious local minima. In contrast, shallow real neural networks with quadratic
activations have infinitely many spurious local minima under the same
conditions. In addition, we provide specific examples to demonstrate that
complex-valued weights turn poor local minima into saddle points. The
activation function CReLU is also discussed to illustrate the superiority of
analytic activations in complex-valued neural networks
Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity
We study finite sample expressivity, i.e., memorization power of ReLU
networks. Recent results require N hidden nodes to memorize/interpolate
arbitrary N data points. In contrast, by exploiting depth, we show that
3-layer ReLU networks with Ω(√N) hidden nodes can perfectly memorize most
datasets with N points. We also prove that width Θ(√N) is necessary and
sufficient for memorizing N data points, proving tight bounds on memorization
capacity. The sufficiency result can be extended to deeper networks; we show
that an L-layer network with W parameters in the hidden layers can memorize N
data points if W = Ω(N). Combined with a recent upper bound on VC dimension,
our construction is nearly tight for any fixed L. Subsequently, we analyze
memorization capacity of residual networks under a general position
assumption; we prove results that substantially reduce the known requirement
of N hidden nodes. Finally, we study the dynamics of stochastic gradient
descent (SGD), and show that when initialized near a memorizing global
minimum of the empirical risk, SGD quickly finds a nearby point with much
smaller empirical risk. Comment: 28 pages, 2 figures. NeurIPS 2019
Camera-ready version.
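As a rough empirical companion to the width bounds above (an illustration only, not the paper's construction), one can try to fit N random labels with a 3-layer ReLU network whose hidden widths scale like the square root of N; reaching zero loss is not guaranteed by this snippet.

```python
# Empirical illustration only (not the paper's construction): fit N randomly
# labelled points with a 3-layer ReLU network whose hidden widths scale like
# sqrt(N). This merely probes the regime the quoted bounds describe.
import torch
import torch.nn as nn

torch.manual_seed(0)
N, d = 256, 10
width = int(4 * N ** 0.5)               # ~ 4*sqrt(N) hidden units per layer
X = torch.randn(N, d)
y = torch.randn(N, 1)                   # arbitrary real labels to memorize

net = nn.Sequential(
    nn.Linear(d, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(X), y)
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.2e}")
```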