
    Training (Overparametrized) Neural Networks in Near-Linear Time

    The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size $n$), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an $O(mn^2)$-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width $m$. We show how to speed up the algorithm of [CGH+19], achieving an $\tilde{O}(mn)$-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension ($mn$) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of $M$, allowing us to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra -- which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) -- can be carried over to the realm of deep learning as well.
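    The sketch-and-precondition step described above can be illustrated in a few lines of NumPy/SciPy. This is a minimal sketch under my own assumptions, not the paper's implementation: a dense Gaussian sketch stands in for the Fast-JL transform, the Jacobian is a random placeholder rather than the network's actual Jacobian, and the function name is mine.

        import numpy as np
        from scipy.linalg import solve_triangular
        from scipy.sparse.linalg import LinearOperator, cg

        def preconditioned_gram_solve(J, r, sketch_cols=None):
            # Approximately solve (J J^T) x = r for an n x m Jacobian J with m >> n.
            # A Gaussian sketch of J's columns stands in for a Fast-JL transform;
            # the QR factor of the sketched matrix preconditions conjugate gradient.
            n, m = J.shape
            s = sketch_cols or 4 * n
            Phi = np.random.standard_normal((m, s)) / np.sqrt(s)   # sketching matrix (JL surrogate)
            _, R = np.linalg.qr((J @ Phi).T)                       # R^T R approximates J J^T
            gram = LinearOperator((n, n), matvec=lambda v: J @ (J.T @ v))
            def apply_precond(v):
                # apply (R^T R)^{-1} via two triangular solves
                return solve_triangular(R, solve_triangular(R, v, trans='T'))
            M = LinearOperator((n, n), matvec=apply_precond)
            x, info = cg(gram, r, M=M)
            return x, info

        rng = np.random.default_rng(0)
        J = rng.standard_normal((200, 5000))                       # toy wide Jacobian (n=200, m=5000)
        r = rng.standard_normal(200)
        x, info = preconditioned_gram_solve(J, r)
        print(info, np.linalg.norm(J @ (J.T @ x) - r))             # info == 0 signals CG convergence

    The point of the preconditioner is that CG then needs only a few cheap matrix-vector products with the (implicit) Gram matrix instead of ever forming or inverting it.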

    Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, and Gradient Flow Dynamics

    Understanding the learning dynamics and inductive bias of neural networks (NNs) is hindered by the opacity of the relationship between NN parameters and the function represented. Partially, this is due to symmetries inherent within the NN parameterization, allowing multiple different parameter settings to yield an identical output function, which leads to both an unclear relationship and redundant degrees of freedom. The NN parameterization is invariant under two symmetries: permutation of the neurons and a continuous family of transformations of the scale of weight and bias parameters. We propose taking a quotient with respect to the second symmetry group and reparametrizing ReLU NNs as continuous piecewise linear splines. Using this spline lens, we study learning dynamics in shallow univariate ReLU NNs, finding unexpected insights and explanations for several perplexing phenomena. We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum. We also show that standard weight initializations yield very flat initial functions, and that this flatness, together with overparametrization and the initial weight scale, is responsible for the strength and type of implicit regularization, consistent with previous work. Our implicit regularization results are complementary to recent work, showing that initialization scale critically controls implicit regularization via a kernel-based argument. Overall, removing the weight scale symmetry enables us to prove these results more simply, prove new results, and gain new insights, while offering a far more transparent and intuitive picture. Looking forward, our quotiented spline-based approach will extend naturally to the multivariate and deep settings, and alongside the kernel-based view, we believe it will play a foundational role in efforts to understand neural networks. Videos of learning dynamics using a spline-based visualization are available at http://shorturl.at/tFWZ2
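    To make the quotient concrete, here is a small NumPy sketch (my own construction, not the authors' code) that maps the parameters of a shallow univariate ReLU network to scale-invariant spline data: a breakpoint, a kink (slope change), and an orientation per neuron.

        import numpy as np

        def relu_net(x, w, b, a, c=0.0):
            # Shallow univariate ReLU net: f(x) = c + sum_i a_i * relu(w_i * x + b_i).
            return c + np.maximum(np.outer(x, w) + b, 0.0) @ a

        def spline_data(w, b, a):
            # Quotient out the positive scaling symmetry (w, b, a) -> (alpha*w, alpha*b, a/alpha).
            # Each neuron with w_i != 0 reduces to a breakpoint beta_i = -b_i / w_i, a kink
            # (slope change) a_i * |w_i|, and an orientation sign(w_i), all scale-invariant.
            mask = w != 0
            beta = -b[mask] / w[mask]
            kink = a[mask] * np.abs(w[mask])
            orient = np.sign(w[mask])
            order = np.argsort(beta)
            return beta[order], kink[order], orient[order]

        rng = np.random.default_rng(0)
        w, b, a = (rng.standard_normal(8) for _ in range(3))
        alpha = rng.uniform(0.5, 3.0, size=8)                  # per-neuron positive rescaling
        x = np.linspace(-3, 3, 50)
        same_fn = np.allclose(relu_net(x, w, b, a),
                              relu_net(x, alpha * w, alpha * b, a / alpha))
        same_spline = all(np.allclose(u, v) for u, v in zip(spline_data(w, b, a),
                                                            spline_data(alpha * w, alpha * b, a / alpha)))
        print(same_fn, same_spline)    # both True: the spline data is the scale-invariant description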

    Surprises in High-Dimensional Ridgeless Least Squares Interpolation

    Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$-norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover -- in a precise quantitative way -- several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
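    A toy NumPy experiment (my own setup: isotropic $\Sigma$, arbitrary sizes, and a misspecified fit on the first $p$ features, not the paper's exact models) shows the minimum $\ell_2$-norm interpolator and the double-descent shape of its test risk as $p/n$ crosses 1:

        import numpy as np

        rng = np.random.default_rng(0)
        n, p_max, sigma = 200, 800, 0.5
        beta_true = rng.standard_normal(p_max) / np.sqrt(p_max)   # signal spread over all p_max features
        Z_train = rng.standard_normal((n, p_max))                 # isotropic features (Sigma = I)
        Z_test = rng.standard_normal((2000, p_max))
        y = Z_train @ beta_true + sigma * rng.standard_normal(n)
        y_test = Z_test @ beta_true

        for p in (20, 100, 180, 200, 220, 400, 800):              # fit using only the first p features
            X, X_test = Z_train[:, :p], Z_test[:, :p]
            beta_hat = np.linalg.pinv(X) @ y                      # min-norm least squares; interpolates once p >= n
            risk = np.mean((X_test @ beta_hat - y_test) ** 2)
            print(f"p/n = {p / n:4.1f}   test risk = {risk:8.3f}")
        # the test risk peaks near p/n = 1 and decreases again for p/n > 1 (double descent)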

    Theory of Deep Learning III: explaining the non-overfitting puzzle

    THIS MEMO IS REPLACED BY CBMM MEMO 90. A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave, near zero stable minima of the empirical error, as a gradient system in a quadratic potential with a degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient. Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks, which, as suggested in (1), can be the key to understanding generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, asymptotically converging to the minimum norm solution. This implies that there is usually an optimal early stopping time that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution, which guarantees good classification error for “low noise” datasets. The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
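    The claim that gradient descent converges to the minimum norm solution, with the iteration count acting as an implicit regularizer, is easy to see numerically in the linear case. A small NumPy sketch under my own toy sizes and step size (not the memo's experiments):

        import numpy as np

        rng = np.random.default_rng(1)
        n, p = 50, 200                                    # overparametrized: p > n
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)

        beta_min_norm = np.linalg.pinv(X) @ y             # minimum-norm interpolating solution
        beta = np.zeros(p)                                # gradient descent from the origin
        lr = 1.0 / np.linalg.norm(X, 2) ** 2
        for t in range(1, 20001):
            beta -= lr * X.T @ (X @ beta - y)             # gradient of 0.5 * ||X beta - y||^2
            if t in (10, 100, 1000, 20000):
                print(f"iter {t:6d}  train err {np.linalg.norm(X @ beta - y):.2e}  "
                      f"dist to min-norm {np.linalg.norm(beta - beta_min_norm):.2e}  "
                      f"||beta|| {np.linalg.norm(beta):.2f}")
        # the iterate stays in the row space of X, so it converges to the min-norm solution;
        # its norm grows monotonically with iterations, which is why stopping early acts as
        # a regularizer on the regression loss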

    Flatter, faster: scaling momentum for optimal speedup of SGD

    Commonly used optimization algorithms often show a trade-off between good generalization and fast training times. For instance, stochastic gradient descent (SGD) tends to have good generalization; however, adaptive gradient methods have superior training times. Momentum can help accelerate training with SGD, but so far there has been no principled way to select the momentum hyperparameter. Here we study training dynamics arising from the interplay between SGD with label noise and momentum in the training of overparametrized neural networks. We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization. To analytically derive this result we develop an architecture-independent framework, where the main assumption is the existence of a degenerate manifold of global minimizers, as is natural in overparametrized models. Training dynamics display the emergence of two characteristic timescales that are well-separated for generic values of the hyperparameters. The maximum acceleration of training is reached when these two timescales meet, which in turn determines the scaling limit we propose. We confirm our scaling rule for synthetic regression problems (matrix sensing and the teacher-student paradigm) and classification on realistic datasets (ResNet-18 on CIFAR10, 6-layer MLP on FashionMNIST), suggesting the robustness of our scaling rule to variations in architectures and datasets.
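    The proposed scaling rule is simple to state in code. Below is a hedged helper: the constant c and the heavy-ball parameterization are my assumptions about how one would wire the rule into an optimizer, not the paper's implementation.

        import numpy as np

        def momentum_from_lr(lr, c=1.0):
            # Scaling rule from the abstract: choose 1 - beta proportional to lr**(2/3).
            # The constant c is problem dependent (a placeholder here, not specified above);
            # beta is clipped so it remains a valid momentum coefficient in [0, 1).
            beta = 1.0 - c * lr ** (2.0 / 3.0)
            return float(np.clip(beta, 0.0, 1.0 - 1e-6))

        def heavy_ball_step(theta, velocity, grad, lr, beta):
            # One SGD-with-momentum (heavy-ball) step: v <- beta*v - lr*grad, theta <- theta + v.
            velocity = beta * velocity - lr * grad
            return theta + velocity, velocity

        for lr in (1e-1, 1e-2, 1e-3, 1e-4):
            print(f"lr = {lr:.0e}  ->  beta = {momentum_from_lr(lr):.4f}")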

    Theory IIIb: Generalization in Deep Networks

    The general features of the optimization problem for the case of overparametrized nonlinear networks have been clear for a while: SGD selects global minima over local minima with high probability. In the overparametrized case, the key question is not optimization of the empirical risk but optimization with a generalization guarantee. In fact, a main puzzle of deep neural networks (DNNs) revolves around the apparent absence of “overfitting”, defined as follows: the expected error does not get worse when increasing the number of neurons or of iterations of gradient descent. This is superficially surprising because of the large capacity demonstrated by DNNs to fit randomly labeled data and the absence of explicit regularization. Several recent efforts, including our previous versions of this technical report, strongly suggest that good test performance of deep networks depends on constraining the norm of their weights. Here we prove that:
    • the loss functions of deep ReLU networks under square loss and logistic loss on a compact domain are invex functions;
    • for such loss functions any equilibrium point is a global minimum;
    • convergence is fast and the minima are close to the origin;
    • the global minima have in general degenerate Hessians, for which there is no direct control of the norm apart from initialization close to the origin;
    • a simple variation of gradient descent techniques, called norm-minimizing (NM) gradient descent, guarantees minimum norm minimizers under both the square loss and the exponential loss, independently of initial conditions.
    A convenient norm for a deep network is the product of the Frobenius norms of the weight matrices. Control of the norm by NM ensures generalization for regression (because of the associated control of the Rademacher complexity). Margin bounds ensure control of the classification error by maximization of the margin of $\tilde{f}$ -- the classifier with normalized Frobenius norms -- obtained by the minimization of an exponential-type loss by NM iterations. This replaces previous versions of Theory IIIa and Theory IIIb, updating several vague or incorrect statements. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
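    For illustration only, here is a NumPy sketch of the norm being discussed (the product of per-layer Frobenius norms) together with a crude proxy for the norm-minimizing idea: a plain gradient step followed by a mild multiplicative shrinkage of the weights. This is my simplified reading for a tiny two-layer ReLU regression, not the memo's NM procedure.

        import numpy as np

        def product_frobenius_norm(weights):
            # The "convenient norm": product of the Frobenius norms of the weight matrices.
            return float(np.prod([np.linalg.norm(W) for W in weights]))

        rng = np.random.default_rng(0)
        X = rng.standard_normal((64, 5))
        y = np.sin(X[:, 0])
        W1 = rng.standard_normal((5, 128)) * 0.1
        W2 = rng.standard_normal((128, 1)) * 0.1
        lr, shrink = 0.05, 1e-3
        for step in range(2000):
            H = np.maximum(X @ W1, 0.0)                    # hidden ReLU activations
            err = (H @ W2).ravel() - y                     # residual
            gW2 = H.T @ err[:, None] / len(y)
            gW1 = X.T @ ((err[:, None] @ W2.T) * (H > 0)) / len(y)
            W1 = (1 - lr * shrink) * (W1 - lr * gW1)       # gradient step + shrinkage toward small norm
            W2 = (1 - lr * shrink) * (W2 - lr * gW2)
        final_mse = np.mean(((np.maximum(X @ W1, 0.0) @ W2).ravel() - y) ** 2)
        print(f"final mse {final_mse:.4f}   product Frobenius norm {product_frobenius_norm([W1, W2]):.3f}")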

    Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

    Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima -- at least those discovered by gradient-based optimization -- turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piecewise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.
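    To make the phenomenon concrete, here is a small Python sketch of how one would probe a two-segment low-loss path between two solutions. The scaffolding (loss function, midpoint bend, parameter lists) is mine; the paper instead constructs the path from dropout and noise stability.

        import numpy as np

        def interpolate(theta_a, theta_b, t):
            # Point at fraction t along the straight segment from theta_a to theta_b.
            return [(1 - t) * wa + t * wb for wa, wb in zip(theta_a, theta_b)]

        def loss_along_two_segment_path(loss_fn, theta_1, theta_2, bend, num=25):
            # Evaluate loss_fn along the piecewise-linear path theta_1 -> bend -> theta_2.
            ts = np.linspace(0.0, 1.0, num)
            first = [loss_fn(interpolate(theta_1, bend, t)) for t in ts]
            second = [loss_fn(interpolate(bend, theta_2, t)) for t in ts]
            return np.array(first + second)

        def quad_loss(theta):
            # Stand-in "loss" on a single weight vector; with real nets this would be the
            # training loss of the network whose parameters are the list theta.
            return float(np.sum(theta[0] ** 2))

        rng = np.random.default_rng(0)
        theta_1, theta_2 = [rng.standard_normal(10)], [rng.standard_normal(10)]
        bend = [0.5 * (theta_1[0] + theta_2[0])]           # with real nets, chosen or optimized for low loss
        print(loss_along_two_segment_path(quad_loss, theta_1, theta_2, bend).max())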