A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization
Loss landscape analysis is extremely useful for a deeper understanding of the
generalization ability of deep neural network models. In this work, we propose
a layerwise loss landscape analysis where the loss surface at every layer is
studied independently, along with how each correlates with the overall loss
surface. We study the layerwise loss landscape by examining the eigenspectra of
the Hessian at each layer. In particular, our results show that the layerwise
Hessian geometry is largely similar to the entire Hessian. We also report an
interesting phenomenon where the Hessian eigenspectra of the middle layers of
the deep neural network are observed to be most similar to the overall Hessian
eigenspectrum. We also show that the maximum eigenvalue and the trace of the
Hessian (both full-network and layerwise) decrease as training of the network
progresses. We leverage these observations to propose a new regularizer
based on the trace of the layerwise Hessian. Penalizing the trace of the
Hessian at every layer indirectly forces Stochastic Gradient Descent to
converge to flatter minima, which are shown to have better generalization
performance. In particular, we show that such a layerwise regularizer can be
leveraged to penalize the middlemost layers alone, which yields promising
results. Our empirical studies on well-known deep nets across datasets support
the claims of this work.
Comment: Accepted at AAAI 202
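The regularizer above penalizes the trace of the layerwise Hessian, which in practice is estimated rather than computed exactly. The following is a minimal, hypothetical sketch of Hutchinson's trace estimator, a standard way to estimate tr(H) from Hessian-vector products alone; it is not the paper's implementation, and in a real network the Hessian-vector products for each layer would come from automatic differentiation rather than an explicit matrix.

```python
import random

random.seed(0)

def hutchinson_trace(hvp, dim, n_samples=4000):
    """Estimate tr(H) using only Hessian-vector products:
    E[v^T H v] = tr(H) when v has i.i.d. Rademacher (+/-1) entries."""
    total = 0.0
    for _ in range(n_samples):
        v = [random.choice((-1.0, 1.0)) for _ in range(dim)]
        hv = hvp(v)
        total += sum(vi * hvi for vi, hvi in zip(v, hv))
    return total / n_samples

# Toy symmetric "layerwise" Hessian for a 4-parameter layer
# (illustrative; a real layerwise Hessian is never materialized).
H = [
    [3.0, 0.5, 0.0, 0.2],
    [0.5, 2.0, 0.1, 0.0],
    [0.0, 0.1, 1.5, 0.3],
    [0.2, 0.0, 0.3, 1.0],
]

def hvp(v):
    """Hessian-vector product H @ v for the toy matrix above."""
    return [sum(hij * vj for hij, vj in zip(row, v)) for row in H]

trace_estimate = hutchinson_trace(hvp, dim=4)
exact_trace = sum(H[i][i] for i in range(4))  # 7.5
```

An estimate of this kind, computed per layer, can be added to the training loss as a penalty term, so that SGD is steered toward regions where the layerwise Hessian trace (and hence curvature) is small.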
RetroKD : Leveraging Past States for Regularizing Targets in Teacher-Student Learning
Several recent works show that higher-accuracy models may not be better teachers for every student, and refer to this problem as the student-teacher "knowledge gap". Further, they propose techniques which, as we discuss in this paper, are constrained by certain preconditions: (1) access to the teacher model/architecture, (2) retraining the teacher model, or (3) models in addition to the teacher model. Since these conditions may not hold in many settings, the applicability of such approaches is limited. In this work, we propose RetroKD, which smooths the logits of a student network by combining the student's past-state logits with those from the teacher. By doing so, we hypothesize that the resulting target will no longer be as hard as the teacher target, nor as easy as the past student target. Such regularization on learning the parameters alleviates the requirements imposed by other methods. Our extensive set of experiments against baselines on the CIFAR 10, CIFAR 100, and TinyImageNet datasets, along with a theoretical study, further supports our claim. We also performed crucial ablation studies, such as hyperparameter sensitivity, a generalization study showing the flatness of the loss landscape, and feature similarity with the teacher network.
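The core idea, blending the student's past-state logits with the teacher's to form a softened target, can be sketched as follows. This is a hypothetical illustration: the blending function `retro_target` and the weight `alpha` are assumptions for the sake of the example, and the exact combination rule and schedule used by RetroKD are given in the paper, not here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def retro_target(teacher_logits, past_student_logits, alpha=0.5):
    """Hypothetical linear blend of teacher logits with the student's
    own past-state logits; alpha is an illustrative mixing weight."""
    return [alpha * t + (1 - alpha) * p
            for t, p in zip(teacher_logits, past_student_logits)]

teacher = [4.0, 1.0, 0.5]       # sharp, confident teacher logits
past_student = [1.5, 1.0, 0.8]  # softer logits from a past student state
blended = retro_target(teacher, past_student, alpha=0.5)
target_probs = softmax(blended)
```

With these toy values, the blended target is less peaked than the teacher's distribution but more peaked than the past student's, matching the hypothesis that the target is neither as hard as the teacher's nor as easy as the past student's.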