
    An Improved Analysis of Training Over-parameterized Deep Neural Networks

    A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the condition on the width of the neural network required to ensure global convergence is very stringent: it is often a high-degree polynomial in the training sample size $n$ (e.g., $O(n^{24})$). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which requires only a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis are (a) a tighter gradient lower bound that leads to faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. By specializing our result to two-layer (i.e., one-hidden-layer) neural networks, we also obtain a milder over-parameterization condition than the best-known result in prior work. Comment: 30 pages, 1 figure, 1 table.
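
    The two ingredients highlighted above, a gradient lower bound and a trajectory-length bound, fit a standard Polyak-Lojasiewicz-style template. The following is a schematic sketch with generic constants $\mu$ (gradient lower bound), $\ell$ (smoothness), and step size $\eta$, not the paper's actual statement:

```latex
% Schematic: gradient lower bound + descent inequality => linear rate
\|\nabla L(W_k)\|^2 \ge \mu\, L(W_k), \qquad
L(W_{k+1}) \le L(W_k) - \tfrac{\eta}{2}\,\|\nabla L(W_k)\|^2
\;\;\Longrightarrow\;\;
L(W_k) \le \Big(1 - \tfrac{\eta\mu}{2}\Big)^{k} L(W_0).

% Schematic: the same decay bounds the trajectory length
% (assuming \ell-smoothness and zero optimal loss, so that \|\nabla L\|^2 \le 2\ell L):
\sum_{k \ge 0} \|W_{k+1} - W_k\| \;=\; \eta \sum_{k \ge 0} \|\nabla L(W_k)\|
\;\le\; \eta\,\sqrt{2\ell\, L(W_0)}\, \sum_{k \ge 0} \Big(1 - \tfrac{\eta\mu}{2}\Big)^{k/2} \;<\; \infty.
```

    In this template, a tighter $\mu$ directly improves the rate, and a shorter trajectory keeps the iterates inside the region where the lower bound holds, which is where the width requirement enters.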

    Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

    We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned. When the labels are the outputs of a teacher network of the same architecture with fixed weights $(\mathbf{w}^*, \mathbf{a}^*)$, we prove that with Gaussian input $\mathbf{Z}$, there is a spurious local minimizer. Surprisingly, in the presence of the spurious local minimizer, gradient descent with weight normalization from randomly initialized weights can still be proven to recover the true parameters with constant probability, which can be boosted to probability $1$ with multiple restarts. We also show that with constant probability, the same procedure could converge to the spurious local minimum, showing that the local minimum plays a non-trivial role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the gradient descent dynamics has two phases: it starts off slow, but converges much faster after several iterations. Comment: Accepted by ICML 2018.
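
    As a concrete illustration of this teacher-student setup, the sketch below trains the model $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$ with plain gradient descent on fresh Gaussian batches. It omits the paper's weight-normalization step, and all initialization scales and hyperparameters are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, p = 5, 8                      # number of non-overlapping patches, patch dimension
relu = lambda x: np.maximum(x, 0.0)

# Teacher parameters (w*, a*): labels come from the same architecture.
w_star = rng.normal(size=p); w_star /= np.linalg.norm(w_star)
a_star = rng.normal(size=k)

def forward(Z, w, a):
    # Z: (batch, k, p) Gaussian patches; output: sum_j a_j * relu(w . Z_j)
    return relu(Z @ w) @ a

# Randomly initialized student, trained by plain GD on fresh Gaussian batches.
w = rng.normal(size=p); w /= np.linalg.norm(w)   # arbitrary init scale
a = 0.5 * rng.normal(size=k)
lr, batch = 0.01, 512
for t in range(5000):
    Z = rng.normal(size=(batch, k, p))
    y = forward(Z, w_star, a_star)
    pre = Z @ w                              # (batch, k)
    r = relu(pre) @ a - y                    # residuals
    grad_a = relu(pre).T @ r / batch
    # df/dw = sum_j a_j * 1[pre_j > 0] * Z_j
    dfdw = np.einsum('bk,bkp->bp', (pre > 0) * a, Z)
    grad_w = dfdw.T @ r / batch
    a -= lr * grad_a
    w -= lr * grad_w

print("cosine(w, w*):", w @ w_star / np.linalg.norm(w))
```

    Depending on the random seed, such a run can recover the teacher direction (cosine close to 1) or end up elsewhere, loosely mirroring the constant-probability statements above.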

    Neural Tangent Kernel: Convergence and Generalization in Neural Networks

    At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function $f_\theta$ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describing the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function $f_\theta$ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, suggesting a theoretical motivation for early stopping. Finally, we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.
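
    A quick way to see the finite-width NTK concretely is to form the Gram matrix of parameter gradients of a one-hidden-layer ReLU network at initialization. The sketch below is a toy under its own assumptions (a specific parameterization and scaling), not the paper's general setting; it only illustrates that two independent initializations give nearly the same kernel once the width is large:

```python
import numpy as np

def empirical_ntk(X, m, seed):
    """Empirical NTK of f(x) = (1/sqrt(m)) * a . relu(W x) at a random initialization."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, X.shape[1]))
    a = rng.choice([-1.0, 1.0], size=m)
    pre = X @ W.T                          # (n, m)
    act = np.maximum(pre, 0.0)
    Ja = act / np.sqrt(m)                  # d f(x_i) / d a_r
    Jw = (pre > 0) * a / np.sqrt(m)        # d f(x_i) / d W_r, with the x_i factor split off
    # K(i, j) = <grad_theta f(x_i), grad_theta f(x_j)>
    return Ja @ Ja.T + (X @ X.T) * (Jw @ Jw.T)

X = np.random.default_rng(42).normal(size=(8, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # data on the unit sphere
for m in (10, 100, 10000):
    K1, K2 = empirical_ntk(X, m, seed=1), empirical_ntk(X, m, seed=2)
    print(m, np.linalg.norm(K1 - K2) / np.linalg.norm(K1))   # relative gap shrinks with width
```

    In the infinite-width limit this random kernel converges to the deterministic limiting NTK described above.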

    Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds

    We study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted sum layer. We analyze gradient descent applied to learning a bounded target function on $n$ real-valued inputs. We give an agnostic learning guarantee for GD: starting from a randomly initialized network, it converges in mean squared loss to the minimum error (in $2$-norm) of the best approximation of the target function using a polynomial of degree at most $k$. Moreover, for any $k$, the size of the network and the number of iterations needed are both bounded by $n^{O(k)}\log(1/\epsilon)$. In particular, this applies to training networks of unbiased sigmoids and ReLUs. We also rigorously explain the empirical finding that gradient descent discovers lower-frequency Fourier components before higher-frequency components. We complement this result with nearly matching lower bounds in the Statistical Query model. GD fits well in the SQ framework since each training step is determined by an expectation over the input distribution. We show that any SQ algorithm that achieves significant improvement over a constant function, with queries of tolerance some inverse polynomial in the input dimensionality $n$, must use $n^{\Omega(k)}$ queries even when the target functions are restricted to a set of $n^{O(k)}$ degree-$k$ polynomials and the input distribution is uniform over the unit sphere; for this class the information-theoretic lower bound is only $\Theta(k \log n)$. Our approach for both parts is based on spherical harmonics. We view gradient descent as an operator on the space of functions and study its dynamics. An essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of this operator in the case of the mean squared loss. Comment: Revised version now includes matching lower bounds.
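
    For reference, the Funk-Hecke theorem invoked above says that spherical harmonics are eigenfunctions of any integral operator whose kernel depends only on the inner product; up to normalization conventions it reads:

```latex
% Funk-Hecke: for a spherical harmonic Y_k of degree k on S^{n-1},
\int_{S^{n-1}} f(\langle x, y \rangle)\, Y_k(y)\, d\sigma(y) \;=\; \lambda_k\, Y_k(x),
\qquad
\lambda_k \;=\; c_n \int_{-1}^{1} f(t)\,
  \frac{C_k^{(n-2)/2}(t)}{C_k^{(n-2)/2}(1)}\, \big(1 - t^2\big)^{\frac{n-3}{2}}\, dt,
```

    where $C_k^{\nu}$ is a Gegenbauer polynomial and $c_n$ is a dimension-dependent constant. This is what allows the gradient-descent operator to be analyzed degree by degree.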

    Width Provably Matters in Optimization for Deep Linear Neural Networks

    We prove that for an $L$-layer fully-connected linear neural network, if the width of every hidden layer is $\tilde\Omega(L \cdot r \cdot d_{\mathrm{out}} \cdot \kappa^3)$, where $r$ and $\kappa$ are the rank and the condition number of the input data, and $d_{\mathrm{out}}$ is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an $\epsilon$-suboptimal solution is $O(\kappa \log(\frac{1}{\epsilon}))$. Our polynomial upper bound on the total running time for wide deep linear networks and the $\exp(\Omega(L))$ lower bound for narrow deep linear neural networks [Shamir, 2018] together demonstrate that wide layers are necessary for optimizing deep models. Comment: In ICML 2019.
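
    The claim is easy to probe numerically. The sketch below runs gradient descent on a small, wide deep linear network with a hand-rolled gradient; it is a toy stand-in for the paper's setting, and the initialization scale, step size, and dimensions are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_in, d_out, width, n = 3, 5, 3, 50, 20        # wide hidden layers
X = rng.normal(size=(d_in, n))
Y = rng.normal(size=(d_out, d_in)) @ X             # realizable linear targets

dims = [d_in] + [width] * (L - 1) + [d_out]
Ws = [rng.normal(size=(dims[i + 1], dims[i])) / np.sqrt(dims[i]) for i in range(L)]

lr = 1e-3
for t in range(5001):
    acts = [X]
    for W in Ws:
        acts.append(W @ acts[-1])                  # acts[i] = W_i ... W_1 X
    E = acts[-1] - Y                               # residual of W_L ... W_1 X - Y
    if t % 1000 == 0:
        print(t, 0.5 * np.sum(E ** 2))             # loss should decay roughly geometrically
    # Gradient w.r.t. W_i is (W_L ... W_{i+1})^T E (W_{i-1} ... W_1 X)^T
    back = E
    grads = [None] * L
    for i in range(L - 1, -1, -1):
        grads[i] = back @ acts[i].T
        back = Ws[i].T @ back
    for i in range(L):
        Ws[i] -= lr * grads[i]
```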

    A Convergence Theory for Deep Learning via Over-Parameterization

    Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice have been growing wider and deeper. On the theoretical side, a long line of works has focused on training neural networks with one hidden layer; the theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find \textit{global minima} on the training objective of DNNs in \textit{polynomial time}. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: \textit{polynomial} in $L$, the number of layers, and in $n$, the number of samples. Our key technique is to show that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss at a linear convergence rate, with running time polynomial in $n$ and $L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss function. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNNs), and residual neural networks (ResNets). Comment: V2 adds citation and V3/V4/V5 polish writing.
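
    The headline claim, that SGD fits the training set exactly once the width is large enough, can be illustrated on a one-hidden-layer toy. This is only a sketch of the phenomenon, not the paper's multi-layer construction, and the width, step size, and step count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 20, 4096                     # samples, input dim, (large) width
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=n)         # arbitrary (even random) labels

W = rng.normal(size=(m, d))                 # hidden weights, trained
a = rng.choice([-1.0, 1.0], size=m)         # output weights, kept fixed here

def f(X, W):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

lr, batch = 1.0, 20
for t in range(5000):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    pre = Xb @ W.T
    # derivative of the logistic loss log(1 + exp(-y f)) with respect to f
    g = -yb / (1.0 + np.exp(yb * (np.maximum(pre, 0.0) @ a) / np.sqrt(m)))
    S = (pre > 0) * a / np.sqrt(m)
    W -= lr * ((S * g[:, None]).T @ Xb) / batch

print("train accuracy:", np.mean(np.sign(f(X, W)) == y))   # should approach 1.0
```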

    Theory III: Dynamics and Generalization in Deep Networks

    The key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity -- such as an explicit regularization term -- in the training of deep networks for classification. We show that a classical form of norm control -- though somewhat hidden -- is present in deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converges, for $t \to \infty$, to an equilibrium corresponding to a minimum-norm (or maximum-margin) solution. For sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics converges to one of several margin maximizers, with the margin monotonically increasing towards a limit stationary point of the flow. In the usual case of stochastic gradient descent, most of the stationary points are likely to be convex minima corresponding to a constrained minimizer -- the network with normalized weights -- which corresponds to vanishing regularization. The solution has zero generalization gap, for fixed architecture, asymptotically for $N \to \infty$, where $N$ is the number of training examples. Our approach extends some of the original results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. We believe that the elusive complexity control we describe is responsible for the puzzling empirical finding of good predictive performance by deep networks, despite overparametrization. Comment: 47 pages, 11 figures. This replaces previous versions of Theory III that appeared on arXiv [arXiv:1806.11379, arXiv:1801.00173] or on the CBMM site. v5: Changes throughout the paper to the presentation and tightening of some of the statements.
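
    Schematically, and under assumptions spelled out in the paper (exponential-type loss, suitable homogeneity of the network), the equilibria of the normalized-weight dynamics described above correspond to solutions of a maximum-margin problem:

```latex
% Schematic restatement: writing W = \rho\,\tilde{W} with \|\tilde{W}\| = 1,
% the normalized weights approach stationary points of
\max_{\|\tilde{W}\| = 1} \;\; \min_{i} \; y_i\, f(\tilde{W};\, x_i),
% which, for homogeneous f, is equivalent to a minimum-norm problem
% under margin constraints:
\min_{W} \;\|W\| \quad \text{s.t.} \quad y_i\, f(W;\, x_i) \ge 1 \;\; \forall i .
```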

    Theoretical insights into the optimization landscape of over-parameterized shallow neural networks

    In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime, where the number of observations is fewer than the number of parameters in the model. We show that with quadratic activations, the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions, we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This result focuses on a realizable model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted weight coefficients. Comment: Section 3 on numerical experiments is added. Theorems 2.1 and 2.2 are improved to apply to almost all input data (not just Gaussian inputs). Related work section is expanded. The paper has been accepted for publication in IEEE Transactions on Information Theory (2018).
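
    A toy consistent with the quadratic-activation result: an over-parameterized student $f(x) = \sum_r (w_r^T x)^2$ trained by gradient descent on a planted data set typically drives the training loss toward zero. The dimensions, initialization, and step size below are arbitrary illustration choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 10, 40, 60                         # m*d parameters >> n observations
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W_true = rng.normal(size=(3, d))
W_true /= np.linalg.norm(W_true, axis=1, keepdims=True)   # planted 3-neuron teacher
y = ((X @ W_true.T) ** 2).sum(axis=1)        # quadratic activations, unit output weights

W = 0.5 * rng.normal(size=(m, d)) / np.sqrt(d)            # small random student init
lr = 0.05
for t in range(5001):
    pre = X @ W.T                            # (n, m)
    r = (pre ** 2).sum(axis=1) - y           # residuals
    if t % 1000 == 0:
        print(t, 0.5 * np.mean(r ** 2))      # training loss, expected to head toward 0
    W -= lr * 2.0 * (pre * r[:, None]).T @ X / n
```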

    Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

    We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under a mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization of deep learning, and pave the way for studying the optimization dynamics of training modern deep neural networks. Comment: 54 pages. This version relaxes the assumptions on the loss functions and data distribution, and improves the dependency on the problem-specific parameters in the main theorem.
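
    The "small perturbation region" intuition is easy to visualize on a toy: train one-hidden-layer ReLU networks of increasing width on the same data and measure how far the hidden weights travel from their Gaussian initialization. This is an illustrative sketch (squared loss, full-batch GD, arbitrary hyperparameters), not the paper's deep-network setting:

```python
import numpy as np

def relative_drift(m, steps=2000, lr=1.0, seed=0):
    """Relative movement ||W_t - W_0||_F / ||W_0||_F after training a width-m ReLU net."""
    rng = np.random.default_rng(seed)
    n, d = 50, 10
    X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.choice([-1.0, 1.0], size=n)
    W0 = rng.normal(size=(m, d)); W = W0.copy()
    a = rng.choice([-1.0, 1.0], size=m)              # output layer fixed
    for _ in range(steps):
        pre = X @ W.T
        r = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y    # residuals, squared loss
        S = (pre > 0) * a / np.sqrt(m)
        W -= lr * ((S * r[:, None]).T @ X) / n           # full-batch gradient step
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

for m in (100, 1000, 10000):
    print(m, relative_drift(m))    # the relative drift shrinks as the width grows
```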

    Convergence of a Relaxed Variable Splitting Method for Learning Sparse Neural Networks via $\ell_1$, $\ell_0$, and transformed-$\ell_1$ Penalties

    Sparsification of neural networks is one of the effective complexity reduction methods to improve efficiency and generalizability. We consider the problem of learning a one-hidden-layer convolutional neural network with ReLU activation function via gradient descent under sparsity-promoting penalties. It is known that when the input data is Gaussian distributed, no-overlap networks (without penalties) in regression problems with ground truth can be learned in polynomial time with high probability. We propose a relaxed variable splitting method integrating thresholding and gradient descent to overcome the non-smoothness of the loss function. The sparsity of the network weights is realized during the optimization (training) process. We prove that under $\ell_1$, $\ell_0$, and transformed-$\ell_1$ penalties, no-overlap networks can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weight under a novel thresholding operation. Numerical experiments confirm the theoretical findings and compare the accuracy and sparsity trade-off among the penalties.
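
    The general shape of a relaxed variable splitting iteration (alternating a thresholding step on an auxiliary variable with a gradient step on the smooth part) can be sketched on a simple sparse regression surrogate. This is a generic illustration with made-up parameters, not the paper's exact update rules or its CNN loss:

```python
import numpy as np

def soft_threshold(x, tau):
    # proximal operator of the l1 penalty
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

rng = np.random.default_rng(0)
n, d = 200, 50
w_true = np.zeros(d); w_true[:5] = rng.normal(size=5)     # sparse ground truth
X = rng.normal(size=(n, d)); y = X @ w_true

# Objective being minimized: (1/2n)||Xw - y||^2 + lam*||u||_1 + (beta/2)||w - u||^2
lam, beta, lr = 0.1, 1.0, 0.1
w = rng.normal(size=d)
for t in range(1000):
    u = soft_threshold(w, lam / beta)                     # u-step: thresholding
    grad = X.T @ (X @ w - y) / n + beta * (w - u)         # w-step: gradient of smooth part
    w -= lr * grad

print("nonzeros in u:", np.count_nonzero(u),
      "relative fit error:", np.linalg.norm(X @ u - y) / np.linalg.norm(y))
```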