Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent
The capacity of neural networks like the widely adopted transformer is known
to be very high. Evidence is emerging that they learn successfully due to
inductive bias in the training routine, typically a variant of gradient descent
(GD). To better understand this bias, we study the tendency for transformer
parameters to grow in magnitude ($\ell_2$ norm) during training, and its
implications for the emergent representations within self attention layers.
Empirically, we document norm growth in the training of transformer language
models, including T5 during its pretraining. As the parameters grow in
magnitude, we prove that the network approximates a discretized network with
saturated activation functions. Such "saturated" networks are known to have a
reduced capacity compared to the full network family that can be described in
terms of formal languages and automata. Our results suggest saturation is a new
characterization of an inductive bias implicit in GD of particular interest for
NLP. We leverage the emergent discrete structure in a saturated transformer to
analyze the role of different attention heads, finding that some focus locally
on a small number of positions, while other heads compute global averages,
allowing counting. We believe understanding the interplay between these two
capabilities may shed further light on the structure of computation within
large transformers.
Comment: To appear at EMNLP 202
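The saturation described in this abstract can be illustrated with a minimal sketch, assuming a single attention head whose scores are multiplied by a growing scalar `c` as a stand-in for growing parameter norm. The function names (`scaled_attention`, `saturated_attention`) and the example scores are hypothetical and not taken from the paper. As `c` grows, the softmax weights approach a uniform distribution over the highest-scoring positions: a unique maximum gives a head that focuses on one position, while tied scores give an average over the tied positions.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_attention(scores, c):
    # Attention weights when every score is multiplied by c
    # (assumption: c models the effect of growing parameter norm).
    return softmax([c * s for s in scores])


def saturated_attention(scores):
    # Limiting case as c grows without bound: uniform weight
    # over the positions that attain the maximum score.
    m = max(scores)
    winners = [1.0 if s == m else 0.0 for s in scores]
    k = sum(winners)
    return [w / k for w in winners]


scores = [2.0, 0.5, 2.0, -1.0]  # hypothetical attention scores
for c in (1, 10, 100):
    print(c, [round(p, 4) for p in scaled_attention(scores, c)])
print("saturated", saturated_attention(scores))
```

If all scores are equal, `saturated_attention` returns a uniform distribution over every position, which corresponds to the global averaging behaviour the abstract associates with counting.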
Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such adaptivity. Our analysis of training is non-quantitative in terms of running time, but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.
Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm
We study the function space characterization of the inductive bias resulting
from controlling the $\ell_2$ norm of the weights in linear convolutional
networks. We view this in terms of an induced regularizer in the function space
given by the minimum norm of weights required to realize a linear function. For
two-layer linear convolutional networks with $C$ output channels and kernel
size $K$, we show the following: (a) If the inputs to the network have a single
channel, the induced regularizer for any $K$ is a norm given by a semidefinite
program (SDP) that is independent of the number of output channels $C$. We
further validate these results through a binary classification task on MNIST.
(b) In contrast, for networks with multi-channel inputs, multiple output
channels can be necessary to merely realize all matrix-valued linear functions
and thus the inductive bias does depend on $C$. Further, for sufficiently large
$C$, the induced regularizers for $K=1$ and $K=D$ are the nuclear norm and the
$\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients --
both of which promote sparse structures.