
    Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

    The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family, and this capacity can be described in terms of formal languages and automata. Our results suggest saturation is a new characterization of an inductive bias implicit in GD that is of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some focus locally on a small number of positions, while other heads compute global averages, allowing counting. We believe understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.
    Comment: To appear at EMNLP 202
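
    A minimal NumPy sketch (mine, not the paper's code) of the saturation effect the abstract describes: scaling attention logits by a growing factor, standing in for parameter norm growth, drives softmax attention toward a discrete limit that averages uniformly over the maximizing positions. The scores and scale factors below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

# Illustrative attention logits for a single query (assumed values).
scores = np.array([2.0, 3.0, 3.0, 1.0])

# As the scale c grows (a stand-in for parameter norm growth), the attention
# weights approach [0, 0.5, 0.5, 0]: a uniform average over the argmax
# positions, i.e. the "saturated" discrete attention the paper analyzes.
for c in [1.0, 10.0, 100.0]:
    print(f"c = {c:6.1f} ->", np.round(softmax(c * scores), 4))
```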

    Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

    Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such adaptivity. Our analysis of training is non-quantitative in terms of running time, but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.
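
    A toy finite-width sketch (my own, under simplifying assumptions, not the authors' code) of the setting described above: a two-layer ReLU network trained with full-batch gradient descent on the logistic loss over a synthetic 2D dataset, tracking the margin normalized by the squared parameter norm (the natural scaling for this 2-homogeneous model). The paper's results concern the infinite-width gradient-flow limit, so this is only an illustration of the trend.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 64, 2, 500                     # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                     # labels from a hidden 1-D structure

W1 = rng.standard_normal((d, m)) / np.sqrt(d)   # first-layer weights
w2 = rng.standard_normal(m) / np.sqrt(m)        # output weights
lr = 0.1

for step in range(20001):
    H = np.maximum(X @ W1, 0.0)          # hidden ReLU activations
    f = H @ w2                           # network outputs
    g = -y / (1.0 + np.exp(y * f)) / n   # gradient of logistic loss w.r.t. f
    grad_w2 = H.T @ g
    grad_W1 = X.T @ (np.outer(g, w2) * (H > 0))
    if step % 5000 == 0:
        norm_sq = (W1 ** 2).sum() + (w2 ** 2).sum()
        print(step, "normalized margin:", f"{(y * f).min() / norm_sq:.4f}")
    W1 -= lr * grad_W1
    w2 -= lr * grad_w2
```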

    Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm

    We study the function space characterization of the inductive bias resulting from controlling the $\ell_2$ norm of the weights in linear convolutional networks. We view this in terms of an induced regularizer in the function space given by the minimum norm of weights required to realize a linear function. For two-layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network have a single channel, the induced regularizer for any $K$ is a norm given by a semidefinite program (SDP) that is independent of the number of output channels $C$. We further validate these results through a binary classification task on MNIST. (b) In contrast, for networks with multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions, and thus the inductive bias does depend on $C$. Further, for sufficiently large $C$, the induced regularizers for $K=1$ and $K=D$ are the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients -- both of which promote sparse structures.
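
    To make the two regularizers named in (b) concrete, here is a small NumPy sketch (my own reading, under assumed conventions: a map over D positions with Fourier coefficients taken via the unitary DFT, which may differ from the paper's exact definition) that evaluates the nuclear norm and the $\ell_{2,1}$ group norm of the Fourier-domain representation of a matrix-valued linear map x -> M x.

```python
import numpy as np

# Hypothetical sketch of the two regularizers named in the abstract, under my
# own assumptions about conventions (unitary DFT, Fourier coefficients of the
# matrix-valued map taken as F M F^H); the paper's exact definition may differ.

D = 8
rng = np.random.default_rng(0)
M = rng.standard_normal((D, D))            # matrix of the linear map x -> M x

F = np.fft.fft(np.eye(D), norm="ortho")    # unitary DFT matrix
M_hat = F @ M @ F.conj().T                 # Fourier-domain representation of M

nuclear = np.linalg.norm(M_hat, ord="nuc")         # nuclear norm (K=1 case per the abstract)
group_l21 = np.linalg.norm(M_hat, axis=0).sum()    # l_{2,1} group norm (K=D case per the abstract)

print("nuclear norm of Fourier coefficients :", round(float(nuclear), 3))
print("l_{2,1} group norm of Fourier coeffs :", round(float(group_l21), 3))
```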