On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length
Stochastic Gradient Descent (SGD) based training of neural networks with a
large learning rate or a small batch size typically ends in well-generalizing,
flat regions of the weight space, as indicated by small eigenvalues of the
Hessian of the training loss. However, the curvature along the SGD trajectory
is poorly understood. An empirical investigation shows that initially SGD
visits increasingly sharp regions, reaching a maximum sharpness determined by
both the learning rate and the batch size of SGD. When studying the SGD
dynamics in relation to the sharpest directions in this initial phase, we find
that the SGD step is large compared to the curvature and commonly fails to
minimize the loss along the sharpest directions. Furthermore, using a reduced
learning rate along these directions can improve training speed while leading
to both sharper and better generalizing solutions compared to vanilla SGD. In
summary, our analysis of the dynamics of SGD in the subspace of the sharpest directions shows that these directions influence the regions that SGD steers toward (where a larger learning rate or a smaller batch size results in wider regions being visited), the overall training speed, and the generalization ability of the final model.
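
To make the intervention concrete, the following is a minimal PyTorch sketch (our own illustration, not the authors' code) of damping SGD along the sharpest direction: the dominant Hessian eigenvector is estimated by power iteration over Hessian-vector products, and the gradient component along it is scaled by a factor gamma < 1 before the update. The function names, the single-eigenvector simplification (the paper considers a subspace of several sharpest directions), and the default hyperparameters are assumptions.

import torch

def hvp(loss, params, vec):
    # Hessian-vector product via double backprop; create_graph=True
    # retains the loss graph so the product can be formed repeatedly.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])

def top_eigvec(loss, params, iters=20):
    # Power iteration for the dominant Hessian eigenvector.
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=params[0].device)
    v = v / v.norm()
    for _ in range(iters):
        hv = hvp(loss, params, v)
        v = hv / hv.norm()
    return v

def damped_sgd_step(loss, params, lr=0.1, gamma=0.1):
    # Plain SGD, except the gradient component along the sharpest
    # direction is scaled by gamma < 1, i.e. a reduced learning
    # rate along that direction only.
    v = top_eigvec(loss, params)
    g = torch.cat([gr.reshape(-1)
                   for gr in torch.autograd.grad(loss, params)])
    g = g - (1.0 - gamma) * (g @ v) * v
    with torch.no_grad():
        offset = 0
        for p in params:
            k = p.numel()
            p -= lr * g[offset:offset + k].view_as(p)
            offset += k

In practice the eigenvector would be recomputed only every few steps, since each power iteration costs an extra pair of backward passes.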
A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization
Loss landscape analysis is extremely useful for a deeper understanding of the
generalization ability of deep neural network models. In this work, we propose
a layerwise loss landscape analysis in which the loss surface at every layer is studied independently, together with how each layer's surface correlates with the overall loss surface. We study the layerwise loss landscape by examining the eigenspectra of
the Hessian at each layer. In particular, our results show that the layerwise
Hessian geometry is largely similar to that of the entire Hessian. We also report an interesting phenomenon in which the Hessian eigenspectra of the middle layers of the deep neural network are observed to be most similar to the overall Hessian eigenspectrum. We also show that the maximum eigenvalue and the trace of the
Hessian (both full network and layerwise) reduce as training of the network
progresses. We leverage these observations to propose a new regularizer
based on the trace of the layerwise Hessian. Penalizing the trace of the
Hessian at every layer indirectly forces Stochastic Gradient Descent to
converge to flatter minima, which are shown to have better generalization
performance. In particular, we show that such a layerwise regularizer can be
leveraged to penalize the middlemost layers alone, which yields promising
results. Our empirical studies on well-known deep nets across datasets support the claims of this work.
Comment: Accepted at AAAI 202
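
A common way to implement such a layerwise penalty is Hutchinson's estimator, tr(H) = E[z^T H z] for random z with +/-1 entries, which keeps the trace differentiable so it can be added to the training loss. The sketch below is our own illustration under that assumption; the estimator choice, the single-sample default, and the coefficient beta in the usage comment are not taken from the paper.

import torch

def layer_hessian_trace(loss, layer_params, n_samples=1):
    # Hutchinson estimate of tr(H) for one layer's parameters:
    # tr(H) ~ mean of z^T H z over Rademacher vectors z.
    # create_graph=True keeps the estimate differentiable so it
    # can be penalized directly.
    grads = torch.autograd.grad(loss, layer_params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    est = flat.new_zeros(())
    for _ in range(n_samples):
        z = torch.randint_like(flat, 2) * 2 - 1  # random +/-1 entries
        hz = torch.autograd.grad(flat @ z, layer_params,
                                 create_graph=True)
        est = est + torch.cat([h.reshape(-1) for h in hz]) @ z
    return est / n_samples

# Hypothetical usage, penalizing a single middle layer as the
# abstract suggests; model, criterion, x, y, and beta are placeholders.
#   loss = criterion(model(x), y)
#   penalty = layer_hessian_trace(loss, list(model.layer3.parameters()))
#   (loss + beta * penalty).backward()

Because the estimate is built from gradients created with create_graph=True, backpropagating through the penalty involves third-order derivatives of the loss; restricting the penalty to the middle layers, as the abstract proposes, keeps that overhead modest.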