437 research outputs found

    Small nonlinearities in activation functions create bad local minima in neural networks

    Full text link
    We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with the "slightest" nonlinearity, the empirical risks have spurious local minima in most cases. Our results thus indicate that, in general, "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust. Specifically, for ReLU(-like) networks we constructively prove that for almost all practical datasets there exist infinitely many local minima. We also present a counterexample for more general activations (sigmoid, tanh, arctan, ReLU, etc.), for which there exists a bad local minimum. Our results make the least restrictive assumptions relative to existing results on spurious local optima in neural networks. We complete our discussion by presenting a comprehensive characterization of global optimality for deep linear networks, which unifies other results on this topic. Comment: 33 pages, appeared at ICLR 2019.
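
    A hedged, runnable sketch (not the paper's constructive proof) of how the claim above can be probed empirically: train a tiny one-hidden-layer ReLU network from several random initializations on a toy dataset and compare the final empirical risks; clearly different values are consistent with the existence of spurious local minima. The dataset, width, and optimizer settings below are illustrative assumptions.

        # Empirical probe for spurious local minima (all data and hyperparameters
        # are illustrative assumptions, not taken from the paper).
        import torch

        torch.manual_seed(0)
        X = torch.randn(20, 2)                    # 20 toy samples, 2 features
        y = torch.sin(X[:, :1]) + 0.5 * X[:, 1:]  # arbitrary smooth target

        def train_once(seed, width=3, steps=5000, lr=1e-2):
            torch.manual_seed(seed)
            model = torch.nn.Sequential(
                torch.nn.Linear(2, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
            )
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            for _ in range(steps):
                opt.zero_grad()
                loss = torch.mean((model(X) - y) ** 2)  # empirical risk (MSE)
                loss.backward()
                opt.step()
            return torch.mean((model(X) - y) ** 2).item()

        # Markedly different final risks across seeds suggest distinct local minima.
        print([round(train_once(s), 4) for s in range(5)])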

    Depth creates no more spurious local minima

    Full text link
    We show that for any convex differentiable loss, a deep linear network has no spurious local minima as long as it is true for the two-layer case. This reduction greatly simplifies the study of the existence of spurious local minima in deep linear networks. When applied to the quadratic loss, our result immediately implies the powerful result in [Kawaguchi 2016]. Further, with the work in [Zhou and Liang 2018], we can remove all the assumptions in [Kawaguchi 2016]. This property holds for more general "multi-tower" linear networks too. Our proof builds on [Laurent and von Brecht 2018] and develops a new perturbation argument to show that any spurious local minimum must have full rank, a structural property which can be useful more generally.
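
    As a minimal numerical companion to the reduction above (toy sizes and data are assumptions of this sketch): a deep linear network's empirical loss depends on its weights only through the end-to-end product of the weight matrices, so any two-layer factorization of the same product attains exactly the same loss.

        # Illustrative check: the loss of a deep linear network depends only on the
        # end-to-end product of its weight matrices (toy sizes assumed).
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((10, 4))   # 10 samples, 4 features
        Y = rng.standard_normal((10, 3))   # 3 outputs

        W1 = rng.standard_normal((4, 5))
        W2 = rng.standard_normal((5, 6))
        W3 = rng.standard_normal((6, 3))

        def mse(pred, target):
            return np.mean((pred - target) ** 2)

        deep_loss = mse(X @ W1 @ W2 @ W3, Y)   # three-layer linear network
        end_to_end = W1 @ W2 @ W3              # the only quantity the loss sees
        # Any two-layer network whose weight product equals `end_to_end`
        # (e.g. the pair (end_to_end, identity)) attains the same loss.
        print(np.isclose(deep_loss, mse(X @ end_to_end, Y)))  # True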

    Deep Neural Networks

    Full text link
    Deep Neural Networks (DNNs) are universal function approximators providing state-of-the-art solutions on a wide range of applications. Common perceptual tasks such as speech recognition, image classification, and object tracking are now commonly tackled via DNNs. Some fundamental problems remain: (1) the lack of a mathematical framework providing an explicit and interpretable input-output formula for any topology, (2) quantification of DNN stability regarding adversarial examples (i.e. modified inputs fooling DNN predictions whilst undetectable to humans), (3) absence of generalization guarantees and controllable behaviors for ambiguous patterns, (4) the inability to leverage unlabeled data to apply DNNs to domains where expert labeling is scarce, as in the medical field. Answering those points would provide theoretical perspectives for further developments based on a common ground. Furthermore, DNNs are now deployed in tremendous societal applications, pushing the need to fill this theoretical gap to ensure control, reliability, and interpretability. Comment: Technical Report.

    Traversing the noise of dynamic mini-batch sub-sampled loss functions: A visual guide

    Full text link
    Mini-batch sub-sampling in neural network training is unavoidable, due to growing data demands, memory-limited computational resources such as graphical processing units (GPUs), and the dynamics of on-line learning. In this study we specifically distinguish between static mini-batch sub-sampled loss functions, where mini-batches are intermittently fixed during training, resulting in smooth but biased loss functions; and the dynamic sub-sampling equivalent, where new mini-batches are sampled at every loss evaluation, trading bias for variance in sampling-induced discontinuities. These discontinuities render automated optimization strategies such as minimization line searches ineffective, since critical points may not exist and function minimizers find spurious, discontinuity-induced minima. This paper suggests recasting the optimization problem to find stochastic non-negative associated gradient projection points (SNN-GPPs). We demonstrate that the SNN-GPP optimality criterion is less susceptible to sub-sampling-induced discontinuities than critical points or minimizers. We conduct a visual investigation, comparing local minimum and SNN-GPP optimality criteria in the loss functions of a simple neural network training problem for a variety of popular activation functions. Since SNN-GPPs better approximate the location of true optima, particularly when using smooth activation functions with high curvature characteristics, we postulate that line searches locating SNN-GPPs can contribute significantly to automating neural network training. Comment: 43 pages, 22 figures, to be submitted to a journal.
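
    The static/dynamic distinction can be made concrete with a toy one-parameter least-squares problem (an assumption of this sketch, not the paper's setup): evaluating the mini-batch loss along a line in parameter space with a fixed mini-batch yields a smooth but biased curve, while resampling the mini-batch at every evaluation introduces the sampling-induced discontinuities described above.

        # Static vs. dynamic mini-batch sub-sampled loss along a parameter line
        # (toy 1-D least-squares problem; sizes are illustrative assumptions).
        import numpy as np

        rng = np.random.default_rng(0)
        N = 1000
        x = rng.standard_normal(N)
        y = 2.0 * x + 0.5 * rng.standard_normal(N)      # true slope = 2

        def batch_loss(w, idx):
            return np.mean((w * x[idx] - y[idx]) ** 2)  # mini-batch MSE at parameter w

        ws = np.linspace(0.0, 4.0, 200)
        fixed_idx = rng.choice(N, size=32, replace=False)

        static_curve = [batch_loss(w, fixed_idx) for w in ws]                  # smooth, biased
        dynamic_curve = [batch_loss(w, rng.choice(N, size=32, replace=False))  # discontinuous
                         for w in ws]

        # The dynamic curve shows much larger jumps between neighbouring evaluations.
        print(np.max(np.abs(np.diff(static_curve))), np.max(np.abs(np.diff(dynamic_curve))))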

    A Note on Connectivity of Sublevel Sets in Deep Learning

    Full text link
    It is shown that for deep neural networks, a single wide layer of width N+1 (N being the number of training samples) suffices to prove the connectivity of sublevel sets of the training loss function. In the two-layer setting, the same property may not hold even if one has just one neuron less (i.e. width N can lead to disconnected sublevel sets).

    Efficiently testing local optimality and escaping saddles for ReLU networks

    Full text link
    We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of M data points on the nondifferentiability of the ReLU divides the parameter space into at most 2^M regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, O(M) (in)equality tests, and one (or a few) nonconvex QP. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality-constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality-constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either the benign case or a bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases. Comment: 23 pages, appeared at ICLR 2019.
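
    A small sketch of the bookkeeping behind the 2^M regions mentioned above (toy data and one engineered boundary point are assumptions): for a given parameter value, the data points lying exactly on a hidden ReLU's kink, i.e. with w_j·x_i + b_j = 0, are the ones that split the neighbourhood of that parameter into the regions the algorithm has to examine for each hidden node.

        # Find, per hidden ReLU node, the data points sitting exactly on its kink
        # (toy sizes; one point is placed on a kink on purpose for illustration).
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((8, 2))   # 8 data points, 2 features
        W = rng.standard_normal((3, 2))   # 3 hidden ReLU nodes
        b = rng.standard_normal(3)
        b[0] = -W[0] @ X[0]               # force data point 0 onto node 0's kink

        pre = X @ W.T + b                 # pre-activations, shape (8, 3)
        on_kink = np.isclose(pre, 0.0)    # nondifferentiable boundary points per node

        for j in range(W.shape[0]):
            m = int(on_kink[:, j].sum())
            print(f"node {j}: {m} boundary point(s) -> up to 2**{m} sign patterns to check")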

    Understanding Global Loss Landscape of One-hidden-layer ReLU Networks, Part 1: Theory

    Full text link
    For one-hidden-layer ReLU networks, we prove that all differentiable local minima are global inside differentiable regions. We give the locations and losses of differentiable local minima, and show that these local minima can be isolated points or continuous hyperplanes, depending on an interplay between the data, the activation pattern of hidden neurons, and the network size. Furthermore, we give necessary and sufficient conditions for the existence of saddle points as well as non-differentiable local minima, and their locations if they exist.
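
    The "activation pattern of hidden neurons" referred to above can be made concrete with a short sketch (toy data and sizes assumed): each differentiable region of the loss corresponds to a fixed binary pattern recording which hidden ReLUs are active on which data points, and the result describes the local minima inside such a region.

        # Binary activation pattern that fixes one differentiable region of a
        # one-hidden-layer ReLU network's empirical risk (toy data assumed).
        import numpy as np

        rng = np.random.default_rng(1)
        X = rng.standard_normal((6, 3))           # 6 data points, 3 features
        W = rng.standard_normal((4, 3))           # 4 hidden neurons
        b = rng.standard_normal(4)

        pattern = (X @ W.T + b > 0).astype(int)   # shape (6, 4): 1 = neuron active on point
        print(pattern)
        # As long as the parameters move without flipping any entry of `pattern`,
        # the loss stays inside a single differentiable region.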

    On Convergence and Stability of GANs

    Full text link
    We propose studying GAN training dynamics as regret minimization, which is in contrast to the popular view that there is consistent minimization of a divergence between real and generated distributions. We analyze the convergence of GAN training from this new point of view to understand why mode collapse happens. We hypothesize that the existence of undesirable local equilibria in this non-convex game is responsible for mode collapse. We observe that these local equilibria often exhibit sharp gradients of the discriminator function around some real data points. We demonstrate that these degenerate local equilibria can be avoided with a gradient penalty scheme called DRAGAN. We show that DRAGAN enables faster training, achieves improved stability with fewer mode collapses, and leads to generator networks with better modeling performance across a variety of architectures and objective functions. Comment: Analysis of convergence and mode collapse by studying the GAN training process as regret minimization. Some new results.
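
    A hedged PyTorch sketch of a DRAGAN-style gradient penalty as commonly described (the noise scale and penalty weight here are illustrative assumptions, not necessarily the paper's settings): the discriminator's gradient norm is penalized at points perturbed around real samples, discouraging the sharp discriminator gradients associated with the degenerate local equilibria mentioned above.

        # DRAGAN-style gradient penalty sketch; `discriminator` maps a batch of
        # inputs to scores. Noise scale and weight are assumptions for illustration.
        import torch

        def dragan_penalty(discriminator, real_x, noise_scale=0.5, weight=10.0):
            # Perturb real samples within a neighbourhood proportional to their std.
            noise = noise_scale * real_x.std() * torch.rand_like(real_x)
            x_hat = (real_x + noise).requires_grad_(True)
            scores = discriminator(x_hat)
            grads, = torch.autograd.grad(
                outputs=scores.sum(), inputs=x_hat, create_graph=True
            )
            grad_norm = grads.flatten(1).norm(2, dim=1)      # per-sample gradient norm
            return weight * ((grad_norm - 1.0) ** 2).mean()

        # Hypothetical usage inside a discriminator update:
        # d_loss = loss_real + loss_fake + dragan_penalty(D, real_batch)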

    Neural Networks with Complex-Valued Weights Have No Spurious Local Minima

    Full text link
    We study the benefits of complex-valued weights for neural networks. We prove that shallow complex neural networks with quadratic activations have no spurious local minima. In contrast, shallow real neural networks with quadratic activations have infinitely many spurious local minima under the same conditions. In addition, we provide specific examples to demonstrate that complex-valued weights turn poor local minima into saddle points. The activation function CReLU is also discussed to illustrate the superiority of analytic activations in complex-valued neural networks.
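
    For concreteness, a tiny NumPy sketch (shapes and data are assumptions) of the kind of model studied above: a shallow network with complex-valued weights and the entrywise quadratic activation z -> z^2, evaluated on real inputs.

        # Forward pass of a shallow complex-valued network with quadratic activation
        # (toy shapes; purely illustrative).
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((5, 3))                                      # 5 real inputs
        W = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))   # complex hidden weights
        a = rng.standard_normal(4) + 1j * rng.standard_normal(4)             # complex output weights

        hidden = (X @ W.T) ** 2      # quadratic activation, applied entrywise
        output = hidden @ a          # complex-valued network output
        print(output.shape, output.dtype)   # (5,) complex128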

    Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

    Full text link
    We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require N hidden nodes to memorize/interpolate arbitrary N data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with Ω(√N) hidden nodes can perfectly memorize most datasets with N points. We also prove that width Θ(√N) is necessary and sufficient for memorizing N data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an L-layer network with W parameters in the hidden layers can memorize N data points if W = Ω(N). Combined with a recent upper bound O(WL log W) on VC dimension, our construction is nearly tight for any fixed L. Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of N hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk. Comment: 28 pages, 2 figures. NeurIPS 2019 Camera-ready version.
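
    An empirical companion to the memorization claim above (not the paper's explicit construction): the sketch below fits N randomly labeled points with a 3-layer ReLU network whose hidden widths scale like √N. The width constant, dataset, and optimizer settings are assumptions chosen only for illustration.

        # Empirical memorization check with hidden widths ~ sqrt(N)
        # (hyperparameters are illustrative assumptions).
        import math
        import torch

        torch.manual_seed(0)
        N, d = 64, 8
        X = torch.randn(N, d)
        y = torch.randn(N, 1)                      # arbitrary real targets

        width = 4 * int(math.isqrt(N))             # ~ c * sqrt(N) hidden nodes per layer
        model = torch.nn.Sequential(
            torch.nn.Linear(d, width), torch.nn.ReLU(),
            torch.nn.Linear(width, width), torch.nn.ReLU(),
            torch.nn.Linear(width, 1),
        )
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        for _ in range(3000):
            opt.zero_grad()
            loss = torch.mean((model(X) - y) ** 2)  # drive the empirical risk toward zero
            loss.backward()
            opt.step()
        print(f"final training MSE: {torch.mean((model(X) - y) ** 2).item():.2e}")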