437 research outputs found

    Small nonlinearities in activation functions create bad local minima in neural networks

    Full text link
    We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with the "slightest" nonlinearity, the empirical risks have spurious local minima in most cases. Our results thus indicate that, in general, "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust. Specifically, for ReLU(-like) networks we constructively prove that for almost all practical datasets there exist infinitely many local minima. We also present a counterexample for more general activations (sigmoid, tanh, arctan, ReLU, etc.), for which there exists a bad local minimum. Our results make the least restrictive assumptions relative to existing results on spurious local optima in neural networks. We complete our discussion by presenting a comprehensive characterization of global optimality for deep linear networks, which unifies other results on this topic. Comment: 33 pages, appeared at ICLR 2019.
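
    A hedged, runnable sketch (not the paper's constructive proof) of how the claim above can be probed empirically: train a tiny one-hidden-layer ReLU network from several random initializations on a toy dataset and compare the final empirical risks; clearly different values are consistent with the existence of spurious local minima. The dataset, width, and optimizer settings below are illustrative assumptions.

        # Empirical probe for spurious local minima (all data and hyperparameters
        # are illustrative assumptions, not taken from the paper).
        import torch

        torch.manual_seed(0)
        X = torch.randn(20, 2)                    # 20 toy samples, 2 features
        y = torch.sin(X[:, :1]) + 0.5 * X[:, 1:]  # arbitrary smooth target

        def train_once(seed, width=3, steps=5000, lr=1e-2):
            torch.manual_seed(seed)
            model = torch.nn.Sequential(
                torch.nn.Linear(2, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
            )
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            for _ in range(steps):
                opt.zero_grad()
                loss = torch.mean((model(X) - y) ** 2)  # empirical risk (MSE)
                loss.backward()
                opt.step()
            return torch.mean((model(X) - y) ** 2).item()

        # Markedly different final risks across seeds suggest distinct local minima.
        print([round(train_once(s), 4) for s in range(5)])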

    Depth creates no more spurious local minima

    Full text link
    We show that for any convex differentiable loss, a deep linear network has no spurious local minima as long as it is true for the two-layer case. This reduction greatly simplifies the study of the existence of spurious local minima in deep linear networks. When applied to the quadratic loss, our result immediately implies the powerful result in [Kawaguchi 2016]. Further, with the work in [Zhou and Liang 2018], we can remove all the assumptions in [Kawaguchi 2016]. This property holds for more general "multi-tower" linear networks too. Our proof builds on [Laurent and von Brecht 2018] and develops a new perturbation argument to show that any spurious local minimum must have full rank, a structural property which can be useful more generally.
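
    As a minimal numerical companion to the reduction above (toy sizes and data are assumptions of this sketch): a deep linear network's empirical loss depends on its weights only through the end-to-end product of the weight matrices, so any two-layer factorization of the same product attains exactly the same loss.

        # Illustrative check: the loss of a deep linear network depends only on the
        # end-to-end product of its weight matrices (toy sizes assumed).
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((10, 4))   # 10 samples, 4 features
        Y = rng.standard_normal((10, 3))   # 3 outputs

        W1 = rng.standard_normal((4, 5))
        W2 = rng.standard_normal((5, 6))
        W3 = rng.standard_normal((6, 3))

        def mse(pred, target):
            return np.mean((pred - target) ** 2)

        deep_loss = mse(X @ W1 @ W2 @ W3, Y)   # three-layer linear network
        end_to_end = W1 @ W2 @ W3              # the only quantity the loss sees
        # Any two-layer network whose weight product equals `end_to_end`
        # (e.g. the pair (end_to_end, identity)) attains the same loss.
        print(np.isclose(deep_loss, mse(X @ end_to_end, Y)))  # True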

    Deep Neural Networks

    Full text link
    Deep Neural Networks (DNNs) are universal function approximators providing state-of-the-art solutions on a wide range of applications. Common perceptual tasks such as speech recognition, image classification, and object tracking are now commonly tackled via DNNs. Some fundamental problems remain: (1) the lack of a mathematical framework providing an explicit and interpretable input-output formula for any topology, (2) quantification of DNN stability regarding adversarial examples (i.e. modified inputs fooling DNN predictions whilst undetectable to humans), (3) absence of generalization guarantees and controllable behaviors for ambiguous patterns, (4) the inability to leverage unlabeled data to apply DNNs to domains where expert labeling is scarce, as in the medical field. Answering those points would provide theoretical perspectives for further developments based on a common ground. Furthermore, DNNs are now deployed in tremendous societal applications, pushing the need to fill this theoretical gap to ensure control, reliability, and interpretability. Comment: Technical Report.

    Traversing the noise of dynamic mini-batch sub-sampled loss functions: A visual guide

    Full text link
    Mini-batch sub-sampling in neural network training is unavoidable, due to growing data demands, memory-limited computational resources such as graphical processing units (GPUs), and the dynamics of on-line learning. In this study we specifically distinguish between static mini-batch sub-sampled loss functions, where mini-batches are intermittently fixed during training, resulting in smooth but biased loss functions; and the dynamic sub-sampling equivalent, where new mini-batches are sampled at every loss evaluation, trading bias for variance in sampling-induced discontinuities. These discontinuities render automated optimization strategies such as minimization line searches ineffective, since critical points may not exist and function minimizers find spurious, discontinuity-induced minima. This paper suggests recasting the optimization problem to find stochastic non-negative associated gradient projection points (SNN-GPPs). We demonstrate that the SNN-GPP optimality criterion is less susceptible to sub-sampling-induced discontinuities than critical points or minimizers. We conduct a visual investigation, comparing local minimum and SNN-GPP optimality criteria in the loss functions of a simple neural network training problem for a variety of popular activation functions. Since SNN-GPPs better approximate the location of true optima, particularly when using smooth activation functions with high curvature characteristics, we postulate that line searches locating SNN-GPPs can contribute significantly to automating neural network training. Comment: 43 pages, 22 figures, to be submitted to a journal.
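
    The static/dynamic distinction can be made concrete with a toy one-parameter least-squares problem (an assumption of this sketch, not the paper's setup): evaluating the mini-batch loss along a line in parameter space with a fixed mini-batch yields a smooth but biased curve, while resampling the mini-batch at every evaluation introduces the sampling-induced discontinuities described above.

        # Static vs. dynamic mini-batch sub-sampled loss along a parameter line
        # (toy 1-D least-squares problem; sizes are illustrative assumptions).
        import numpy as np

        rng = np.random.default_rng(0)
        N = 1000
        x = rng.standard_normal(N)
        y = 2.0 * x + 0.5 * rng.standard_normal(N)      # true slope = 2

        def batch_loss(w, idx):
            return np.mean((w * x[idx] - y[idx]) ** 2)  # mini-batch MSE at parameter w

        ws = np.linspace(0.0, 4.0, 200)
        fixed_idx = rng.choice(N, size=32, replace=False)

        static_curve = [batch_loss(w, fixed_idx) for w in ws]                  # smooth, biased
        dynamic_curve = [batch_loss(w, rng.choice(N, size=32, replace=False))  # discontinuous
                         for w in ws]

        # The dynamic curve shows much larger jumps between neighbouring evaluations.
        print(np.max(np.abs(np.diff(static_curve))), np.max(np.abs(np.diff(dynamic_curve))))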

    A Note on Connectivity of Sublevel Sets in Deep Learning

    Full text link
    It is shown that for deep neural networks, a single wide layer of width N+1 (N being the number of training samples) suffices to prove the connectivity of sublevel sets of the training loss function. In the two-layer setting, the same property may not hold even if one has just one neuron less (i.e. width N can lead to disconnected sublevel sets).

    Efficiently testing local optimality and escaping saddles for ReLU networks

    Full text link
    We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of M data points on the nondifferentiability of the ReLU divides the parameter space into at most 2^M regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, O(M) (in)equality tests, and one (or a few) nonconvex QP. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality-constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality-constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either the benign case or a bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases. Comment: 23 pages, appeared at ICLR 2019.
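
    A small sketch of the bookkeeping behind the 2^M regions mentioned above (toy data and one engineered boundary point are assumptions): for a given parameter value, the data points lying exactly on a hidden ReLU's kink, i.e. with w_j·x_i + b_j = 0, are the ones that split the neighbourhood of that parameter into the regions the algorithm has to examine for each hidden node.

        # Find, per hidden ReLU node, the data points sitting exactly on its kink
        # (toy sizes; one point is placed on a kink on purpose for illustration).
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((8, 2))   # 8 data points, 2 features
        W = rng.standard_normal((3, 2))   # 3 hidden ReLU nodes
        b = rng.standard_normal(3)
        b[0] = -W[0] @ X[0]               # force data point 0 onto node 0's kink

        pre = X @ W.T + b                 # pre-activations, shape (8, 3)
        on_kink = np.isclose(pre, 0.0)    # nondifferentiable boundary points per node

        for j in range(W.shape[0]):
            m = int(on_kink[:, j].sum())
            print(f"node {j}: {m} boundary point(s) -> up to 2**{m} sign patterns to check")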

    Understanding Global Loss Landscape of One-hidden-layer ReLU Networks, Part 1: Theory

    Full text link
    For one-hidden-layer ReLU networks, we prove that all differentiable local minima are global inside differentiable regions. We give the locations and losses of differentiable local minima, and show that these local minima can be isolated points or continuous hyperplanes, depending on an interplay between the data, the activation pattern of hidden neurons, and the network size. Furthermore, we give necessary and sufficient conditions for the existence of saddle points as well as non-differentiable local minima, and their locations if they exist.
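
    The "activation pattern of hidden neurons" referred to above can be made concrete with a short sketch (toy data and sizes assumed): each differentiable region of the loss corresponds to a fixed binary pattern recording which hidden ReLUs are active on which data points, and the result describes the local minima inside such a region.

        # Binary activation pattern that fixes one differentiable region of a
        # one-hidden-layer ReLU network's empirical risk (toy data assumed).
        import numpy as np

        rng = np.random.default_rng(1)
        X = rng.standard_normal((6, 3))           # 6 data points, 3 features
        W = rng.standard_normal((4, 3))           # 4 hidden neurons
        b = rng.standard_normal(4)

        pattern = (X @ W.T + b > 0).astype(int)   # shape (6, 4): 1 = neuron active on point
        print(pattern)
        # As long as the parameters move without flipping any entry of `pattern`,
        # the loss stays inside a single differentiable region.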

    On Convergence and Stability of GANs

    Full text link
    We propose studying GAN training dynamics as regret minimization, which is in contrast to the popular view that there is consistent minimization of a divergence between real and generated distributions. We analyze the convergence of GAN training from this new point of view to understand why mode collapse happens. We hypothesize that the existence of undesirable local equilibria in this non-convex game is responsible for mode collapse. We observe that these local equilibria often exhibit sharp gradients of the discriminator function around some real data points. We demonstrate that these degenerate local equilibria can be avoided with a gradient penalty scheme called DRAGAN. We show that DRAGAN enables faster training, achieves improved stability with fewer mode collapses, and leads to generator networks with better modeling performance across a variety of architectures and objective functions. Comment: Analysis of convergence and mode collapse by studying the GAN training process as regret minimization. Some new results.
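
    A hedged PyTorch sketch of a DRAGAN-style gradient penalty as commonly described (the noise scale and penalty weight here are illustrative assumptions, not necessarily the paper's settings): the discriminator's gradient norm is penalized at points perturbed around real samples, discouraging the sharp discriminator gradients associated with the degenerate local equilibria mentioned above.

        # DRAGAN-style gradient penalty sketch; `discriminator` maps a batch of
        # inputs to scores. Noise scale and weight are assumptions for illustration.
        import torch

        def dragan_penalty(discriminator, real_x, noise_scale=0.5, weight=10.0):
            # Perturb real samples within a neighbourhood proportional to their std.
            noise = noise_scale * real_x.std() * torch.rand_like(real_x)
            x_hat = (real_x + noise).requires_grad_(True)
            scores = discriminator(x_hat)
            grads, = torch.autograd.grad(
                outputs=scores.sum(), inputs=x_hat, create_graph=True
            )
            grad_norm = grads.flatten(1).norm(2, dim=1)      # per-sample gradient norm
            return weight * ((grad_norm - 1.0) ** 2).mean()

        # Hypothetical usage inside a discriminator update:
        # d_loss = loss_real + loss_fake + dragan_penalty(D, real_batch)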

    Neural Networks with Complex-Valued Weights Have No Spurious Local Minima

    Full text link
    We study the benefits of complex-valued weights for neural networks. We prove that shallow complex neural networks with quadratic activations have no spurious local minima. In contrast, shallow real neural networks with quadratic activations have infinitely many spurious local minima under the same conditions. In addition, we provide specific examples to demonstrate that complex-valued weights turn poor local minima into saddle points. The activation function CReLU is also discussed to illustrate the superiority of analytic activations in complex-valued neural networks.
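
    For concreteness, a tiny NumPy sketch (shapes and data are assumptions) of the kind of model studied above: a shallow network with complex-valued weights and the entrywise quadratic activation z -> z^2, evaluated on real inputs.

        # Forward pass of a shallow complex-valued network with quadratic activation
        # (toy shapes; purely illustrative).
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((5, 3))                                      # 5 real inputs
        W = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))   # complex hidden weights
        a = rng.standard_normal(4) + 1j * rng.standard_normal(4)             # complex output weights

        hidden = (X @ W.T) ** 2      # quadratic activation, applied entrywise
        output = hidden @ a          # complex-valued network output
        print(output.shape, output.dtype)   # (5,) complex128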

    Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

    Full text link
    We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require N hidden nodes to memorize/interpolate arbitrary N data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with Ω(√N) hidden nodes can perfectly memorize most datasets with N points. We also prove that width Θ(√N) is necessary and sufficient for memorizing N data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an L-layer network with W parameters in the hidden layers can memorize N data points if W = Ω(N). Combined with a recent upper bound O(WL log W) on VC dimension, our construction is nearly tight for any fixed L. Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of N hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk. Comment: 28 pages, 2 figures. NeurIPS 2019 Camera-ready version.
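
    An empirical companion to the memorization claim above (not the paper's explicit construction): the sketch below fits N randomly labeled points with a 3-layer ReLU network whose hidden widths scale like √N. The width constant, dataset, and optimizer settings are assumptions chosen only for illustration.

        # Empirical memorization check with hidden widths ~ sqrt(N)
        # (hyperparameters are illustrative assumptions).
        import math
        import torch

        torch.manual_seed(0)
        N, d = 64, 8
        X = torch.randn(N, d)
        y = torch.randn(N, 1)                      # arbitrary real targets

        width = 4 * int(math.isqrt(N))             # ~ c * sqrt(N) hidden nodes per layer
        model = torch.nn.Sequential(
            torch.nn.Linear(d, width), torch.nn.ReLU(),
            torch.nn.Linear(width, width), torch.nn.ReLU(),
            torch.nn.Linear(width, 1),
        )
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        for _ in range(3000):
            opt.zero_grad()
            loss = torch.mean((model(X) - y) ** 2)  # drive the empirical risk toward zero
            loss.backward()
            opt.step()
        print(f"final training MSE: {torch.mean((model(X) - y) ** 2).item():.2e}")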