Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width
We systematically analyze optimization dynamics in deep neural networks
(DNNs) trained with stochastic gradient descent (SGD) and study the effect of
learning rate $\eta$, depth $d$, and width $w$ of the neural network. By
analyzing the maximum eigenvalue of the Hessian of the loss,
which is a measure of sharpness of the loss landscape, we find that the
dynamics can show four distinct regimes: (i) an early time transient regime,
(ii) an intermediate saturation regime, (iii) a progressive sharpening regime,
and (iv) a late time "edge of stability" regime. The early and intermediate
regimes (i) and (ii) exhibit a rich phase diagram depending on $\eta$, $d$, and $w$. We identify several critical values of $\eta$, which
separate qualitatively distinct phenomena in the early time dynamics of
training loss and sharpness. Notably, we discover the opening up of a
"sharpness reduction" phase, where sharpness decreases at early times, as $d$ and $1/w$ are increased.
Comment: Accepted at NeurIPS 2023 (camera-ready version): Additional results added for cross-entropy loss and the effect on the network output at initialization; 10+32 pages, 8+35 figures.
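The central quantity in this abstract is the sharpness, the maximum eigenvalue of the loss Hessian. As a minimal sketch of how such a quantity is typically estimated, the following power iteration on Hessian-vector products is written in PyTorch; the function name and the arguments (model, loss_fn, x, y) are illustrative assumptions, not the paper's code.

    import torch

    def sharpness(model, loss_fn, x, y, iters=20):
        """Estimate the top Hessian eigenvalue of the loss via power iteration."""
        params = [p for p in model.parameters() if p.requires_grad]
        loss = loss_fn(model(x), y)
        # Keep the graph so the gradient can be differentiated again.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Random unit start vector, shaped like the parameter list.
        v = [torch.randn_like(p) for p in params]
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        lam = 0.0
        for _ in range(iters):
            # Hessian-vector product: differentiate g.v w.r.t. the parameters.
            gv = sum((g * u).sum() for g, u in zip(grads, v))
            hv = torch.autograd.grad(gv, params, retain_graph=True)
            # Rayleigh quotient v.Hv with v of unit norm.
            lam = sum((h * u).sum() for h, u in zip(hv, v)).item()
            norm = torch.sqrt(sum((h * h).sum() for h in hv))
            v = [h / norm for h in hv]
        return lam

Power iteration converges to the eigenvalue of largest magnitude, which for the loss landscapes studied here is the sharpness the abstract tracks over training.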
A Rainbow in Deep Network Black Boxes
We introduce rainbow networks as a probabilistic model of trained deep neural
networks. The model cascades random feature maps whose weight distributions are
learned. It assumes that dependencies between weights at different layers are
reduced to rotations which align the input activations. Neuron weights within a
layer are independent after this alignment. Their activations define kernels
which become deterministic in the infinite-width limit. This is verified
numerically for ResNets trained on the ImageNet dataset. We also show that the
learned weight distributions have low-rank covariances. Rainbow networks thus
alternate between linear dimension reductions and non-linear high-dimensional
embeddings with white random features. Gaussian rainbow networks are defined
with Gaussian weight distributions. These models are validated numerically on
CIFAR-10 image classification with wavelet scattering networks.
We further show that during training, SGD updates the weight covariances while
mostly preserving the Gaussian initialization.
Comment: 56 pages, 10 figures.
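As a concrete reading of the Gaussian rainbow model, here is a minimal NumPy sketch of a single layer: a white Gaussian random feature map colored by a learned low-rank covariance factor, so that the layer composes a linear dimension reduction with a non-linear high-dimensional embedding. The factor cov_sqrt is assumed to be given (in the paper it would be learned); all names and shapes are illustrative, not the paper's code.

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_rainbow_layer(x, cov_sqrt, width):
        """x: (batch, d_in) activations; cov_sqrt: (rank, d_in) low-rank factor
        of the weight covariance, so rows of W are ~ N(0, cov_sqrt.T @ cov_sqrt)."""
        # White random features: fresh i.i.d. Gaussians for each realization.
        g = rng.standard_normal((width, cov_sqrt.shape[0]))
        # Coloring by the learned covariance. Note x @ w.T = (x @ cov_sqrt.T) @ g.T:
        # a linear dimension reduction (d_in -> rank) followed by a white random
        # high-dimensional embedding (rank -> width).
        w = g @ cov_sqrt / np.sqrt(width)
        # Non-linear embedding (ReLU); returns (batch, width) activations.
        return np.maximum(x @ w.T, 0.0)

    # Toy usage: a batch of 8 inputs of dimension 32, rank-4 covariance factor.
    x = rng.standard_normal((8, 32))
    cov_sqrt = rng.standard_normal((4, 32)) / 4.0
    h = gaussian_rainbow_layer(x, cov_sqrt, width=256)

In the infinite-width limit, the kernel defined by these activations becomes deterministic, which is the property the abstract verifies numerically.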