Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width
We systematically analyze optimization dynamics in deep neural networks
(DNNs) trained with stochastic gradient descent (SGD) and study the effect of
learning rate $\eta$, depth $d$, and width $w$ of the neural network. By
analyzing the maximum eigenvalue of the Hessian of the loss,
which is a measure of sharpness of the loss landscape, we find that the
dynamics can show four distinct regimes: (i) an early time transient regime,
(ii) an intermediate saturation regime, (iii) a progressive sharpening regime,
and (iv) a late time "edge of stability" regime. The early and intermediate
regimes (i) and (ii) exhibit a rich phase diagram depending on $\eta$, $d$, and $w$. We identify several critical values of $\eta$, which
separate qualitatively distinct phenomena in the early time dynamics of
training loss and sharpness. Notably, we discover the opening up of a
"sharpness reduction" phase, where sharpness decreases at early times, as $d$ and $1/w$ are increased.
Comment: Accepted at NeurIPS 2023 (camera-ready version): Additional results added for cross-entropy loss and the effect on the network output at initialization; 10+32 pages, 8+35 figures.
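The central quantity in this abstract is the sharpness, the maximum eigenvalue of the loss Hessian. As a minimal sketch of how such a quantity is typically estimated, the following power iteration on Hessian-vector products is written in PyTorch; the function name and the arguments (model, loss_fn, x, y) are illustrative assumptions, not the paper's code.

    import torch

    def sharpness(model, loss_fn, x, y, iters=20):
        """Estimate the top Hessian eigenvalue of the loss via power iteration."""
        params = [p for p in model.parameters() if p.requires_grad]
        loss = loss_fn(model(x), y)
        # Keep the graph so the gradient can be differentiated again.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Random unit start vector, shaped like the parameter list.
        v = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        lam = 0.0
        for _ in range(iters):
            # Hessian-vector product: differentiate g.v w.r.t. the parameters.
            gv = sum((g * u).sum() for g, u in zip(grads, v))
            hv = torch.autograd.grad(gv, params, retain_graph=True)
            # Rayleigh quotient v.Hv with v of unit norm.
            lam = sum((h * u).sum() for h, u in zip(hv, v)).item()
            norm = torch.sqrt(sum((h * h).sum() for h in hv))
            v = [h / norm for h in hv]
        return lam

Power iteration converges to the eigenvalue of largest magnitude, which for the loss landscapes studied here is the sharpness the abstract tracks over training.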
A Rainbow in Deep Network Black Boxes
We introduce rainbow networks as a probabilistic model of trained deep neural
networks. The model cascades random feature maps whose weight distributions are
learned. It assumes that dependencies between weights at different layers are
reduced to rotations which align the input activations. Neuron weights within a
layer are independent after this alignment. Their activations define kernels
which become deterministic in the infinite-width limit. This is verified
numerically for ResNets trained on the ImageNet dataset. We also show that the
learned weight distributions have low-rank covariances. Rainbow networks thus
alternate between linear dimension reductions and non-linear high-dimensional
embeddings with white random features. Gaussian rainbow networks are defined
with Gaussian weight distributions. These models are validated numerically on
CIFAR-10 image classification with wavelet scattering networks.
We further show that during training, SGD updates the weight covariances while
mostly preserving the Gaussian initialization.
Comment: 56 pages, 10 figures.
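As a concrete reading of the Gaussian rainbow model, here is a minimal NumPy sketch of a single layer: a white Gaussian random feature map colored by a learned low-rank covariance factor, so that the layer composes a linear dimension reduction with a non-linear high-dimensional embedding. The factor cov_sqrt is assumed to be given (in the paper it would be learned); all names and shapes are illustrative, not the paper's code.

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_rainbow_layer(x, cov_sqrt, width):
        """x: (batch, d_in) activations; cov_sqrt: (rank, d_in) low-rank factor
        of the weight covariance, so rows of W are ~ N(0, cov_sqrt.T @ cov_sqrt)."""
        # White random features: fresh i.i.d. Gaussians for each realization.
        g = rng.standard_normal((width, cov_sqrt.shape[0]))
        # Coloring by the learned covariance. Note x @ w.T = (x @ cov_sqrt.T) @ g.T:
        # a linear dimension reduction (d_in -> rank) followed by a white random
        # high-dimensional embedding (rank -> width).
        w = g @ cov_sqrt / np.sqrt(width)
        # Non-linear embedding (ReLU); returns (batch, width) activations.
        return np.maximum(x @ w.T, 0.0)

    # Toy usage: a batch of 8 inputs of dimension 32, rank-4 covariance factor.
    x = rng.standard_normal((8, 32))
    cov_sqrt = rng.standard_normal((4, 32)) / 4.0
    h = gaussian_rainbow_layer(x, cov_sqrt, width=256)

In the infinite-width limit, the kernel defined by these activations becomes deterministic, which is the property the abstract verifies numerically.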