36 research outputs found

    Training (Overparametrized) Neural Networks in Near-Linear Time

    The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort for developing faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size $n$), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an $O(mn^2)$-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width $m$. We show how to speed up the algorithm of [CGH+19], achieving an $\tilde{O}(mn)$-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension ($mn$) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of $M$, allowing us to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra -- which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) -- can be carried over to the realm of deep learning as well.
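
    The sketch-and-precondition recipe described in this abstract can be illustrated in a few lines: sketch the tall matrix, QR-factorize the sketch to obtain a preconditioner, and run conjugate gradient on the preconditioned normal equations. The code below is a minimal NumPy illustration, not the authors' implementation; it uses a dense Gaussian sketch in place of the Fast-JL transform, and the function name, the sketch size 4*d, and the tolerance are illustrative choices.

    import numpy as np

    def sketch_precondition_solve(A, b, sketch_rows=None, tol=1e-10, max_iter=200, seed=0):
        """Solve min_x ||Ax - b||_2 by sketch-and-precondition plus conjugate gradient.

        A dense Gaussian sketch stands in for the Fast-JL transform of the paper
        (an assumption made for brevity)."""
        n, d = A.shape
        s = sketch_rows or 4 * d
        rng = np.random.default_rng(seed)
        S = rng.standard_normal((s, n)) / np.sqrt(s)   # random sketch, s x n
        _, R = np.linalg.qr(S @ A)                     # S A = Q R; R preconditions A
        # Preconditioned normal equations: (A R^{-1})^T (A R^{-1}) y = (A R^{-1})^T b
        matvec = lambda y: np.linalg.solve(R.T, A.T @ (A @ np.linalg.solve(R, y)))
        rhs = np.linalg.solve(R.T, A.T @ b)
        y = np.zeros(d)
        r = rhs - matvec(y)                            # plain conjugate gradient loop
        p, rs = r.copy(), r @ r
        for _ in range(max_iter):
            Ap = matvec(p)
            alpha = rs / (p @ Ap)
            y += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return np.linalg.solve(R, y)                   # x = R^{-1} y

    Because the sketched matrix S A has only O(d) rows, the QR step is cheap, and the preconditioned system is well conditioned, so conjugate gradient needs only a few iterations to reach a good approximate solution.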

    SCORE: approximating curvature information under self-concordant regularization

    In this paper, we propose the SCORE (self-concordant regularization) framework for unconstrained minimization problems, which incorporates second-order information in the Newton-decrement framework for convex optimization. We propose the generalized Gauss-Newton with Self-Concordant Regularization (GGN-SCORE) algorithm, which updates the minimization variables each time it receives a new input batch. The proposed algorithm exploits the structure of the second-order information in the Hessian matrix, thereby reducing computational overhead. GGN-SCORE demonstrates how we may speed up convergence while also improving model generalization for problems that involve regularized minimization under the SCORE framework. Numerical experiments show the efficiency of our method and its fast convergence, which compare favorably against baseline first-order and quasi-Newton methods. Additional experiments involving non-convex (overparameterized) neural network training problems show similar convergence behaviour, thereby highlighting the promise of the proposed algorithm for non-convex optimization.
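
    As a rough illustration of how a generalized Gauss-Newton step can exploit the batch-sized structure of the curvature matrix, the sketch below applies the Woodbury identity so that only an n x n system is factorized (n = batch size) rather than a p x p one (p = number of parameters). This is a generic damped GGN step, not the SCORE regularizer or the GGN-SCORE update itself; the function name, arguments, and damping value lam are hypothetical.

    import numpy as np

    def ggn_step(J, grad_out, Q, lam=1e-3):
        """One damped generalized Gauss-Newton step.

        J        : (n, p) Jacobian of the model outputs w.r.t. the parameters
        grad_out : (n,)   gradient of the loss w.r.t. the model outputs
        Q        : (n, n) Hessian of the loss w.r.t. the model outputs
        lam      : damping strength (an illustrative choice, not from the paper)

        Returns delta solving (lam*I_p + J^T Q J) delta = -J^T grad_out,
        rewritten with the Woodbury identity so that only an (n, n) matrix
        is factorized -- cheap whenever the batch size n is much smaller than p.
        """
        g = J.T @ grad_out                         # parameter-space GGN gradient, (p,)
        inner = lam * np.linalg.inv(Q) + J @ J.T   # small (n, n) system
        correction = J.T @ np.linalg.solve(inner, J @ g)
        return -(g - correction) / lam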

    Rethinking Gauss-Newton for learning over-parameterized models

    This work studies the global convergence and generalization properties of the Gauss-Newton (GN) method when optimizing one-hidden-layer networks in the over-parameterized regime. We first establish a global convergence result for GN in the continuous-time limit, exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of the GN method. We find that, while GN is consistently faster than GD in finding a global optimum, the performance of the learned model on a test dataset is heavily influenced by both the learning rate and the variance of the randomly initialized network's weights. Specifically, we find that initializing with a smaller variance results in better generalization, a behavior also observed for GD. However, in contrast to GD, where larger learning rates lead to the best generalization, we find that GN achieves improved generalization when using smaller learning rates, albeit at the cost of slower convergence. This study emphasizes the significance of the learning rate in balancing the optimization speed of GN with the generalization ability of the learned solution.
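
    To make the setting concrete, here is a minimal NumPy sketch of a damped Gauss-Newton update for the hidden weights of a one-hidden-layer tanh network on a squared-loss regression task. In the over-parameterized regime the linear system is solved in the n-dimensional sample space rather than the parameter space; the learning rate lr, damping lam, and the choice of tanh activation are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def gauss_newton_update(W, a, X, y, lr=1.0, lam=1e-6):
        """One damped Gauss-Newton step on the hidden weights W of
        f(x) = a . tanh(W x), fitted with squared loss on (X, y).

        W : (m, d) hidden weights,  a : (m,) output weights,
        X : (n, d) inputs,          y : (n,) regression targets.
        Over-parameterization (m*d >> n) lets us solve an n x n system."""
        n, d = X.shape
        m = W.shape[0]
        pre = X @ W.T                         # (n, m) pre-activations
        residual = np.tanh(pre) @ a - y       # (n,) f(X) - y
        # Jacobian of f w.r.t. vec(W): J[i, (j,k)] = a_j * tanh'(pre_ij) * X[i,k]
        D = (1.0 - np.tanh(pre) ** 2) * a     # (n, m)
        J = (D[:, :, None] * X[:, None, :]).reshape(n, m * d)
        # Damped GN step via the push-through identity:
        # (J^T J + lam I)^{-1} J^T r = J^T (J J^T + lam I)^{-1} r
        alpha = np.linalg.solve(J @ J.T + lam * np.eye(n), residual)
        return W - lr * (J.T @ alpha).reshape(m, d)

    Iterating this update with a small learning rate can be viewed as a discretization of the continuous-time GN flow studied in the paper.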

    A Theoretical Framework for Target Propagation

    The success of deep learning, a brain-inspired form of AI, has sparked interest in understanding how the brain could similarly learn across multiple layers of neurons. However, the majority of biologically-plausible learning algorithms have not yet reached the performance of backpropagation (BP), nor are they built on strong theoretical foundations. Here, we analyze target propagation (TP), a popular but not yet fully understood alternative to BP, from the standpoint of mathematical optimization. Our theory shows that TP is closely related to Gauss-Newton optimization and thus substantially differs from BP. Furthermore, our analysis reveals a fundamental limitation of difference target propagation (DTP), a well-known variant of TP, in the realistic scenario of non-invertible neural networks. We provide a first solution to this problem through a novel reconstruction loss that improves feedback weight training, while simultaneously introducing architectural flexibility by allowing for direct feedback connections from the output to each hidden layer. Our theory is corroborated by experimental results that show significant improvements in performance and in the alignment of forward weight updates with loss gradients, compared to DTP.
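
    For readers unfamiliar with TP, the core of the standard difference target propagation recursion (the variant the paper analyzes, not its improved direct-feedback version) can be sketched as follows. Here feedback[l] denotes a learned approximate inverse that maps layer-l activity back to layer l-1, and beta is an output target step size; both are placeholders rather than quantities taken from the paper.

    def dtp_targets(activations, y, feedback, beta=0.1):
        """Layer-wise targets via standard difference target propagation (DTP).

        activations : list [h_0, ..., h_L] from the forward pass (h_0 = input)
        y           : desired output (a squared-error output loss is assumed)
        feedback    : feedback[l] maps layer-l activity back to layer l-1,
                      used for l = 2, ..., L
        beta        : output target step size (illustrative value)
        """
        L = len(activations) - 1
        targets = [None] * (L + 1)
        # Output target: nudge the output a small step toward the label
        targets[L] = activations[L] - beta * (activations[L] - y)
        # Propagate targets downward; the difference term cancels the
        # reconstruction error of the imperfect inverse feedback[l]
        for l in range(L, 1, -1):
            g = feedback[l]
            targets[l - 1] = g(targets[l]) + (activations[l - 1] - g(activations[l]))
        return targets

    Each layer's forward weights are then trained on the local loss between its activity and its target, while the feedback mappings are trained with a reconstruction loss; part of the paper's contribution is a better choice of that reconstruction loss.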