Training (Overparametrized) Neural Networks in Near-Linear Time
The slow convergence rate and pathological curvature issues of first-order
gradient methods for training deep neural networks initiated an ongoing effort
for developing faster second-order optimization algorithms beyond SGD, without
compromising the generalization error. Despite their remarkable convergence
rate (independent of the training batch size $n$), second-order algorithms
incur a daunting slowdown in the cost per iteration (inverting the Hessian
matrix of the loss function), which renders them impractical. Very recently,
this computational overhead was mitigated by the works of [ZMG19, CGH+19],
yielding an $O(mn^2)$-time second-order algorithm for training two-layer
overparametrized neural networks of polynomial width $m$.
We show how to speed up the algorithm of [CGH+19], achieving an
$\widetilde{O}(mn)$-time backpropagation algorithm for training (mildly
overparametrized) ReLU networks, which is near-linear in the dimension ($mn$)
of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to
reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and
then use a Fast-JL type dimension reduction to precondition the underlying Gram
matrix in time independent of $m$, allowing us to find a sufficiently good
approximate solution via first-order conjugate gradient. Our result provides a
proof-of-concept that advanced machinery from randomized linear algebra, which
led to recent breakthroughs in convex optimization (ERM, LPs, regression), can
be carried over to the realm of deep learning as well.
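
To make the sketch-and-precondition idea concrete, here is a minimal Python
sketch under simplifying assumptions: all names are illustrative, a plain
Gaussian sketch stands in for the Fast-JL transform, and the simpler tall
least-squares setting replaces the paper's fat-Jacobian Gram-matrix setup.
QR-factoring a sketched copy of the matrix yields a triangular preconditioner,
after which conjugate gradient on the preconditioned normal equations
converges in few iterations.

import numpy as np
from scipy.linalg import qr, solve_triangular

def sketch_precond_cg(J, b, sketch_size=None, tol=1e-10, max_iter=200):
    """Solve min_x ||Jx - b||_2 by sketch-and-precondition (illustrative).

    A Gaussian sketch S stands in for the Fast-JL transform: QR of S @ J
    gives a triangular R such that J @ inv(R) is well conditioned, so CG
    on the preconditioned normal equations converges quickly.
    """
    n, d = J.shape
    s = sketch_size or 4 * d                       # sketch size ~ O(d) rows
    S = np.random.randn(s, n) / np.sqrt(s)         # Gaussian sketch matrix
    _, R = qr(S @ J, mode='economic')              # preconditioner from sketch

    def A(y):                                      # y -> (JR^-1)^T (JR^-1) y
        v = solve_triangular(R, y)                 # v = R^{-1} y
        return solve_triangular(R, J.T @ (J @ v), trans='T')

    rhs = solve_triangular(R, J.T @ b, trans='T')  # R^{-T} J^T b
    y = np.zeros(d)
    r = rhs - A(y)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):                      # standard CG iterations
        Ap = A(p)
        alpha = rs / (p @ Ap)
        y += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return solve_triangular(R, y)                  # x = R^{-1} y

A genuine Fast-JL transform can be applied in near-linear time; the dense
Gaussian sketch above just keeps the code short. The QR of the small sketched
matrix is cheap, and the preconditioned system is well conditioned, so CG needs
only a few passes over J, which is the mechanism the abstract describes.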
SCORE: approximating curvature information under self-concordant regularization
In this paper, we propose the SCORE (self-concordant regularization)
framework for unconstrained minimization problems, which incorporates
second-order information in the Newton decrement framework for convex
optimization. We propose the generalized Gauss-Newton with Self-Concordant
Regularization (GGN-SCORE) algorithm that updates the minimization variables
each time it receives a new input batch. The proposed algorithm exploits the
structure of the second-order information in the Hessian matrix, thereby
reducing computational overhead. GGN-SCORE demonstrates how we may speed up
convergence while also improving model generalization for problems that involve
regularized minimization under the SCORE framework. Numerical experiments show
the efficiency of our method and its fast convergence, which compare favorably
against baseline first-order and quasi-Newton methods. Additional experiments
involving non-convex (overparameterized) neural network training problems show
similar convergence behaviour thereby highlighting the promise of the proposed
algorithm for non-convex optimization.
Comment: 21 pages, 12 figures
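
As a rough illustration of the structure exploitation described above (a
hedged sketch, not the exact GGN-SCORE update, whose step size in the paper is
derived from self-concordance of the regularizer; all names here are
hypothetical), a generalized Gauss-Newton curvature matrix J^T Q J + D can be
inverted via the Woodbury identity so that only a small, batch-sized system is
solved:

import numpy as np

def ggn_step(w, J, Q, grad_out, reg_grad, reg_hess_diag, alpha=1.0):
    """One generalized Gauss-Newton step with a diagonal regularizer.

    Curvature model: H = J^T Q J + D, where J is the (batch x params)
    Jacobian of the model outputs, Q is the small output-space Hessian of
    the loss, and D = diag(reg_hess_diag) is the regularizer's Hessian.
    The Woodbury identity reduces the d x d solve to an n x n solve.
    """
    g = J.T @ grad_out + reg_grad                  # full gradient, shape (d,)
    Dinv = 1.0 / reg_hess_diag                     # D^{-1} as a vector
    JD = J * Dinv                                  # J D^{-1}, shape (n, d)
    K = np.linalg.inv(Q) + JD @ J.T                # n x n core matrix
    step = Dinv * g - JD.T @ np.linalg.solve(K, JD @ g)
    return w - alpha * step                        # fixed alpha stands in for
                                                   # the self-concordant step

For an n-sample batch and d parameters with n much smaller than d, the d x d
solve collapses to an n x n one, which is the kind of computational saving the
abstract refers to.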
Rethinking Gauss-Newton for learning over-parameterized models
This work studies the global convergence and generalization properties of
the Gauss-Newton (GN) method when optimizing one-hidden-layer networks in the
over-parameterized regime. We first establish a global convergence result for
GN in the continuous-time limit exhibiting a faster convergence rate compared
to GD due to improved conditioning. We then perform an empirical study on a
synthetic regression task to investigate the implicit bias of the GN method. We
find that, while GN is consistently faster than GD in finding a global optimum,
the performance of the learned model on a test dataset is heavily influenced by
both the learning rate and the variance of the randomly initialized network's
weights. Specifically, we find that initializing with a smaller variance
results in a better generalization, a behavior also observed for GD. However,
in contrast to GD where larger learning rates lead to the best generalization,
we find that GN achieves an improved generalization when using smaller learning
rates, albeit at the cost of slower convergence. This study emphasizes the
significance of the learning rate in balancing the optimization speed of GN
with the generalization ability of the learned solution.
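
As a concrete, discrete-time counterpart to the continuous-time dynamics the
paper analyzes, a minimal damped Gauss-Newton step for a one-hidden-layer ReLU
network on a regression task could be sketched as follows (function and
variable names are illustrative, not the paper's code):

import numpy as np

def gn_step(W, a, X, y, lam=1e-6, lr=1.0):
    """One damped Gauss-Newton step for f(x) = a^T relu(W x).

    Only the hidden weights W are updated here. The GN direction solves
    (J^T J + lam I) d = J^T r, with J the Jacobian of the residuals
    r = f(X) - y with respect to vec(W).
    """
    H = X @ W.T                                    # pre-activations, (n, m)
    act = np.maximum(H, 0.0)                       # ReLU activations
    r = act @ a - y                                # residuals, (n,)
    # d f_i / d W_{jk} = a_j * 1[H_ij > 0] * X_ik
    mask = (H > 0).astype(X.dtype) * a             # (n, m)
    J = (mask[:, :, None] * X[:, None, :]).reshape(len(y), -1)  # (n, m*d)
    d = np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), J.T @ r)
    return W - lr * d.reshape(W.shape)

The damping term lam * I keeps the normal equations solvable when J^T J is
singular, which is typical in the over-parameterized regime where J has more
columns than rows.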
A Theoretical Framework for Target Propagation
The success of deep learning, a brain-inspired form of AI, has sparked
interest in understanding how the brain could similarly learn across multiple
layers of neurons. However, the majority of biologically-plausible learning
algorithms have not yet reached the performance of backpropagation (BP), nor
are they built on strong theoretical foundations. Here, we analyze target
propagation (TP), a popular but not yet fully understood alternative to BP,
from the standpoint of mathematical optimization. Our theory shows that TP is
closely related to Gauss-Newton optimization and thus substantially differs
from BP. Furthermore, our analysis reveals a fundamental limitation of
difference target propagation (DTP), a well-known variant of TP, in the
realistic scenario of non-invertible neural networks. We provide a first
solution to this problem through a novel reconstruction loss that improves
feedback weight training, while simultaneously introducing architectural
flexibility by allowing for direct feedback connections from the output to each
hidden layer. Our theory is corroborated by experimental results that show
significant improvements in performance and in the alignment of forward weight
updates with loss gradients, compared to DTP.
Comment: 13 pages and 4 figures in main manuscript; 41 pages and 8 figures in
supplementary material
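
For reference, the DTP target computation that the analysis revolves around
(the standard rule from the DTP literature, sketched here with hypothetical
names) propagates a target down through the layers while correcting for the
imperfect feedback inverse:

import numpy as np

def dtp_targets(activations, top_target, feedback):
    """Compute layer-wise targets with difference target propagation.

    Given forward activations h_1..h_L, an output target t_L, and learned
    approximate-inverse feedback functions g_l, DTP sets
        t_l = g_l(t_{l+1}) + (h_l - g_l(h_{l+1})),
    so the correction term cancels the reconstruction error of an
    imperfect inverse.
    """
    targets = [None] * len(activations)
    targets[-1] = top_target
    for l in range(len(activations) - 2, -1, -1):
        g = feedback[l]                            # feedback map into layer l
        targets[l] = (g(targets[l + 1])
                      + activations[l] - g(activations[l + 1]))
    return targets

When each feedback map is an exact inverse of the corresponding forward layer,
the correction term vanishes and plain target propagation is recovered; the
theory in the paper quantifies what happens when it is not.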