36 research outputs found

    Training (Overparametrized) Neural Networks in Near-Linear Time

    The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort for developing faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size $n$), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [ZMG19, CGH+19], yielding an $O(mn^2)$-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width $m$. We show how to speed up the algorithm of [CGH+19], achieving an $\tilde{O}(mn)$-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension ($mn$) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of $M$, allowing us to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra -- which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) -- can be carried over to the realm of deep learning as well.
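
    The sketch-and-precondition recipe described in this abstract can be illustrated in a few lines: sketch the tall matrix, QR-factorize the sketch to obtain a preconditioner, and run conjugate gradient on the preconditioned normal equations. The code below is a minimal NumPy illustration, not the authors' implementation; it uses a dense Gaussian sketch in place of the Fast-JL transform, and the function name, the sketch size 4*d, and the tolerance are illustrative choices.

    import numpy as np

    def sketch_precondition_solve(A, b, sketch_rows=None, tol=1e-10, max_iter=200, seed=0):
        """Solve min_x ||Ax - b||_2 by sketch-and-precondition plus conjugate gradient.

        A dense Gaussian sketch stands in for the Fast-JL transform of the paper
        (an assumption made for brevity)."""
        n, d = A.shape
        s = sketch_rows or 4 * d
        rng = np.random.default_rng(seed)
        S = rng.standard_normal((s, n)) / np.sqrt(s)   # random sketch, s x n
        _, R = np.linalg.qr(S @ A)                     # S A = Q R; R preconditions A
        # Preconditioned normal equations: (A R^{-1})^T (A R^{-1}) y = (A R^{-1})^T b
        matvec = lambda y: np.linalg.solve(R.T, A.T @ (A @ np.linalg.solve(R, y)))
        rhs = np.linalg.solve(R.T, A.T @ b)
        y = np.zeros(d)
        r = rhs - matvec(y)                            # plain conjugate gradient loop
        p, rs = r.copy(), r @ r
        for _ in range(max_iter):
            Ap = matvec(p)
            alpha = rs / (p @ Ap)
            y += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return np.linalg.solve(R, y)                   # x = R^{-1} y

    Because the sketched matrix S A has only O(d) rows, the QR step is cheap, and the preconditioned system is well conditioned, so conjugate gradient needs only a few iterations to reach a good approximate solution.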

    SCORE: approximating curvature information under self-concordant regularization

    In this paper, we propose the SCORE (self-concordant regularization) framework for unconstrained minimization problems, which incorporates second-order information in the Newton-decrement framework for convex optimization. We propose the generalized Gauss-Newton with Self-Concordant Regularization (GGN-SCORE) algorithm, which updates the minimization variables each time it receives a new input batch. The proposed algorithm exploits the structure of the second-order information in the Hessian matrix, thereby reducing computational overhead. GGN-SCORE demonstrates how we may speed up convergence while also improving model generalization for problems that involve regularized minimization under the SCORE framework. Numerical experiments show the efficiency of our method and its fast convergence, which compare favorably against baseline first-order and quasi-Newton methods. Additional experiments involving non-convex (overparameterized) neural network training problems show similar convergence behaviour, thereby highlighting the promise of the proposed algorithm for non-convex optimization.
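
    As a rough illustration of how a generalized Gauss-Newton step can exploit the batch-sized structure of the curvature matrix, the sketch below applies the Woodbury identity so that only an n x n system is factorized (n = batch size) rather than a p x p one (p = number of parameters). This is a generic damped GGN step, not the SCORE regularizer or the GGN-SCORE update itself; the function name, arguments, and damping value lam are hypothetical.

    import numpy as np

    def ggn_step(J, grad_out, Q, lam=1e-3):
        """One damped generalized Gauss-Newton step.

        J        : (n, p) Jacobian of the model outputs w.r.t. the parameters
        grad_out : (n,)   gradient of the loss w.r.t. the model outputs
        Q        : (n, n) Hessian of the loss w.r.t. the model outputs
        lam      : damping strength (an illustrative choice, not from the paper)

        Returns delta solving (lam*I_p + J^T Q J) delta = -J^T grad_out,
        rewritten with the Woodbury identity so that only an (n, n) matrix
        is factorized -- cheap whenever the batch size n is much smaller than p.
        """
        g = J.T @ grad_out                         # parameter-space GGN gradient, (p,)
        inner = lam * np.linalg.inv(Q) + J @ J.T   # small (n, n) system
        correction = J.T @ np.linalg.solve(inner, J @ g)
        return -(g - correction) / lam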

    Rethinking Gauss-Newton for learning over-parameterized models

    This work studies the global convergence and generalization properties of the Gauss-Newton (GN) method when optimizing one-hidden-layer networks in the over-parameterized regime. We first establish a global convergence result for GN in the continuous-time limit, exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of the GN method. We find that, while GN is consistently faster than GD in finding a global optimum, the performance of the learned model on a test dataset is heavily influenced by both the learning rate and the variance of the randomly initialized network's weights. Specifically, we find that initializing with a smaller variance results in better generalization, a behavior also observed for GD. However, in contrast to GD, where larger learning rates lead to the best generalization, we find that GN achieves improved generalization when using smaller learning rates, albeit at the cost of slower convergence. This study emphasizes the significance of the learning rate in balancing the optimization speed of GN with the generalization ability of the learned solution.
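
    To make the setting concrete, here is a minimal NumPy sketch of a damped Gauss-Newton update for the hidden weights of a one-hidden-layer tanh network on a squared-loss regression task. In the over-parameterized regime the linear system is solved in the n-dimensional sample space rather than the parameter space; the learning rate lr, damping lam, and the choice of tanh activation are illustrative assumptions, not details taken from the paper.

    import numpy as np

    def gauss_newton_update(W, a, X, y, lr=1.0, lam=1e-6):
        """One damped Gauss-Newton step on the hidden weights W of
        f(x) = a . tanh(W x), fitted with squared loss on (X, y).

        W : (m, d) hidden weights,  a : (m,) output weights,
        X : (n, d) inputs,          y : (n,) regression targets.
        Over-parameterization (m*d >> n) lets us solve an n x n system."""
        n, d = X.shape
        m = W.shape[0]
        pre = X @ W.T                         # (n, m) pre-activations
        residual = np.tanh(pre) @ a - y       # (n,) f(X) - y
        # Jacobian of f w.r.t. vec(W): J[i, (j,k)] = a_j * tanh'(pre_ij) * X[i,k]
        D = (1.0 - np.tanh(pre) ** 2) * a     # (n, m)
        J = (D[:, :, None] * X[:, None, :]).reshape(n, m * d)
        # Damped GN step via the push-through identity:
        # (J^T J + lam I)^{-1} J^T r = J^T (J J^T + lam I)^{-1} r
        alpha = np.linalg.solve(J @ J.T + lam * np.eye(n), residual)
        return W - lr * (J.T @ alpha).reshape(m, d)

    Iterating this update with a small learning rate can be viewed as a discretization of the continuous-time GN flow studied in the paper.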

    A Theoretical Framework for Target Propagation

    The success of deep learning, a brain-inspired form of AI, has sparked interest in understanding how the brain could similarly learn across multiple layers of neurons. However, the majority of biologically-plausible learning algorithms have not yet reached the performance of backpropagation (BP), nor are they built on strong theoretical foundations. Here, we analyze target propagation (TP), a popular but not yet fully understood alternative to BP, from the standpoint of mathematical optimization. Our theory shows that TP is closely related to Gauss-Newton optimization and thus substantially differs from BP. Furthermore, our analysis reveals a fundamental limitation of difference target propagation (DTP), a well-known variant of TP, in the realistic scenario of non-invertible neural networks. We provide a first solution to this problem through a novel reconstruction loss that improves feedback weight training, while simultaneously introducing architectural flexibility by allowing for direct feedback connections from the output to each hidden layer. Our theory is corroborated by experimental results that show significant improvements in performance and in the alignment of forward weight updates with loss gradients, compared to DTP.
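
    For readers unfamiliar with TP, the core of the standard difference target propagation recursion (the variant the paper analyzes, not its improved direct-feedback version) can be sketched as follows. Here feedback[l] denotes a learned approximate inverse that maps layer-l activity back to layer l-1, and beta is an output target step size; both are placeholders rather than quantities taken from the paper.

    def dtp_targets(activations, y, feedback, beta=0.1):
        """Layer-wise targets via standard difference target propagation (DTP).

        activations : list [h_0, ..., h_L] from the forward pass (h_0 = input)
        y           : desired output (a squared-error output loss is assumed)
        feedback    : feedback[l] maps layer-l activity back to layer l-1,
                      used for l = 2, ..., L
        beta        : output target step size (illustrative value)
        """
        L = len(activations) - 1
        targets = [None] * (L + 1)
        # Output target: nudge the output a small step toward the label
        targets[L] = activations[L] - beta * (activations[L] - y)
        # Propagate targets downward; the difference term cancels the
        # reconstruction error of the imperfect inverse feedback[l]
        for l in range(L, 1, -1):
            g = feedback[l]
            targets[l - 1] = g(targets[l]) + (activations[l - 1] - g(activations[l]))
        return targets

    Each layer's forward weights are then trained on the local loss between its activity and its target, while the feedback mappings are trained with a reconstruction loss; part of the paper's contribution is a better choice of that reconstruction loss.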