Feature-Learning Networks Are Consistent Across Widths At Realistic Scales
We study the effect of width on the dynamics of feature-learning neural
networks across a variety of architectures and datasets. Early in training,
wide neural networks trained on online data not only have identical loss
curves but also agree in their pointwise test predictions throughout training.
For simple tasks such as CIFAR-5m, this holds throughout training for networks of
realistic widths. We also show that structural properties of the models,
including internal representations, preactivation distributions, edge of
stability phenomena, and large learning rate effects are consistent across
large widths. This motivates the hypothesis that phenomena seen in realistic
models can be captured by infinite-width, feature-learning limits. For harder
tasks (such as ImageNet and language modeling), and later training times,
finite-width deviations grow systematically. Two distinct effects cause these
deviations across widths. First, the network output has
initialization-dependent variance scaling inversely with width, which can be
removed by ensembling networks. We observe, however, that ensembles of narrower
networks perform worse than a single wide network. We call this the bias of
narrower width. We conclude with a spectral perspective on the origin of this
finite-width bias.
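The variance effect described above can be sketched with random-feature ridge regression as a linearized stand-in for a finite-width network (the task, widths, and seed counts below are illustrative assumptions, not the paper's setup): the seed-to-seed variance of the learned prediction shrinks roughly like 1/width, which is the component that ensembling removes.

```python
import numpy as np

# Toy 1-D regression task; task, widths, and seeds are illustrative.
X_train = np.linspace(-1, 1, 40)[:, None]
y_train = np.sin(3 * X_train[:, 0])
x_test = np.array([[0.37]])

def rf_predict(width, seed, ridge=1e-3):
    """Random-feature ridge regression, a linearized stand-in for a
    finite-width network; returns the prediction at x_test."""
    r = np.random.default_rng(seed)
    W = r.normal(size=(1, width))
    b = r.uniform(0, 2 * np.pi, size=width)
    phi = lambda X: np.cos(X @ W + b) / np.sqrt(width)
    F = phi(X_train)
    coef = np.linalg.solve(F.T @ F + ridge * np.eye(width), F.T @ y_train)
    return float(phi(x_test) @ coef)

# Initialization-dependent variance of the prediction at a test point,
# estimated over seeds, shrinks roughly like 1/width. Averaging over seeds
# (ensembling) removes this variance, but not the bias of narrower width.
var_narrow = np.var([rf_predict(20, s) for s in range(100)])
var_wide = np.var([rf_predict(200, s) for s in range(100)])
print(var_narrow, var_wide)
```

Here the 10x width increase should shrink the seed-to-seed variance by roughly an order of magnitude; the residual gap between an ensemble of narrow models and one wide model is the bias the abstract refers to.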
Understanding the Spectral Bias of Deep Learning through Kernel Learning
It has been shown empirically that neural networks trained with gradient descent learn simpler functions first. We consider several theoretical justifications of this phenomenon by relating gradient descent to kernel gradient descent through the neural tangent kernel (NTK) and subsequently considering the spectral decay of the NTK. We then consider a setting beyond the lazy regime in which we can approximately describe the discrete evolution of the NTK during neural network training. We use this result to discuss properties of the evolved NTK.
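The link between spectral decay and "simpler functions first" can be sketched with kernel gradient descent on function values: the residual along eigenvector v_i of the kernel decays as (1 - eta * lambda_i)^t, so large-eigenvalue (smooth) modes are fit first. The RBF kernel, grid, and step size below are illustrative stand-ins for an actual NTK, not the paper's construction.

```python
import numpy as np

# Kernel gradient descent on function values: f <- f - eta * K @ (f - y).
# An RBF kernel on a 1-D grid stands in for the NTK; the spectral-bias
# mechanism is the same for any positive semi-definite kernel.
X = np.linspace(-1, 1, 60)
K = np.exp(-((X[:, None] - X[None, :]) ** 2) / 0.1)
lam, V = np.linalg.eigh(K)
lam, V = lam[::-1], V[:, ::-1]          # eigenvalues sorted descending

# Target with a smooth and a high-frequency component.
y = np.sin(2 * np.pi * X) + 0.3 * np.sin(8 * np.pi * X)
eta = 0.9 / lam[0]                       # stable step size
f = np.zeros_like(y)
for _ in range(50):
    f -= eta * K @ (f - y)

resid = V.T @ (f - y)                    # residual in the kernel eigenbasis
decay = (1 - eta * lam) ** 50            # predicted per-mode residual fraction
print(decay[0], decay[-1], abs(resid[0]))
```

After 50 steps the top (large-eigenvalue) modes are essentially fit, while small-eigenvalue modes retain nearly all of their initial residual, mirroring the spectral bias the abstract analyzes.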