    Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width

    We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) and study the effect of the learning rate $\eta$, depth $d$, and width $w$ of the neural network. By analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, which is a measure of the sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time "edge of stability" regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on $\eta \equiv c/\lambda^H_0$, $d$, and $w$. We identify several critical values of $c$ which separate qualitatively distinct phenomena in the early time dynamics of training loss and sharpness. Notably, we discover the opening up of a "sharpness reduction" phase, where sharpness decreases at early times, as $d$ and $1/w$ are increased. Comment: Accepted at NeurIPS 2023 (camera-ready version); additional results added for cross-entropy loss and effect on network output at initialization; 10+32 pages, 8+35 figures.
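
As a rough illustration of how the sharpness $\lambda^H_t$ can be measured in practice, the sketch below estimates the maximum Hessian eigenvalue with power iteration on Hessian-vector products in PyTorch. This is a standard measurement technique, not necessarily the paper's own code; the function name and iteration count are assumptions.

```python
import torch

def max_hessian_eigenvalue(loss, params, iters=50):
    """Estimate sharpness (max eigenvalue of the loss Hessian) by power
    iteration on Hessian-vector products. Illustrative sketch only."""
    # First-order gradients, kept differentiable so we can take a second grad.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit starting vector shaped like the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        # Hessian-vector product: Hv = d/dtheta (grad . v).
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v converges to the top eigenvalue.
        eig = sum((h * x).sum() for h, x in zip(hv, v))
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    return eig.item()
```

With $\lambda^H_0$ estimated at initialization this way, the abstract's scaled learning rate is simply `eta = c / lambda0` for a chosen constant $c$.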

    A Rainbow in Deep Network Black Boxes

    We introduce rainbow networks as a probabilistic model of trained deep neural networks. The model cascades random feature maps whose weight distributions are learned. It assumes that dependencies between weights at different layers reduce to rotations which align the input activations; neuron weights within a layer are independent after this alignment. Their activations define kernels which become deterministic in the infinite-width limit. This is verified numerically for ResNets trained on the ImageNet dataset. We also show that the learned weight distributions have low-rank covariances. Rainbow networks thus alternate between linear dimension reductions and non-linear high-dimensional embeddings with white random features. Gaussian rainbow networks are defined with Gaussian weight distributions. These models are validated numerically on image classification on the CIFAR-10 dataset with wavelet scattering networks. We further show that during training, SGD updates the weight covariances while mostly preserving the Gaussian initialization. Comment: 56 pages, 10 figures.
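
As a loose illustration of the alternation the abstract describes (a linear dimension reduction followed by a high-dimensional embedding with white random features), the NumPy sketch below builds one Gaussian rainbow-style layer from a low-rank covariance square root. The ReLU non-linearity, the normalization, and all names here are assumptions for illustration, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_rainbow_layer(x, cov_sqrt, width):
    """One Gaussian-rainbow-style layer (illustrative sketch).

    Weights are white Gaussian features composed with a learned covariance
    square root, so the layer factors into a linear dimension reduction
    (cov_sqrt) followed by a non-linear white random-feature embedding.
    """
    d_in, rank = cov_sqrt.shape          # low-rank covariance C = S @ S.T
    g = rng.standard_normal((width, rank)) / np.sqrt(rank)  # white features
    w = g @ cov_sqrt.T                   # Gaussian weights with covariance ~C
    return np.maximum(w @ x, 0.0)        # ReLU non-linearity (assumed)
```

In this toy version, the low-rank factor `cov_sqrt` plays the role of the learned weight covariance, while `g` supplies the deterministic-kernel behavior in the wide-`width` limit.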