7 research outputs found
Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent
The capacity of neural networks like the widely adopted transformer is known
to be very high. Evidence is emerging that they learn successfully due to
inductive bias in the training routine, typically a variant of gradient descent
(GD). To better understand this bias, we study the tendency for transformer
parameters to grow in magnitude ( norm) during training, and its
implications for the emergent representations within self attention layers.
Empirically, we document norm growth in the training of transformer language
models, including T5 during its pretraining. As the parameters grow in
magnitude, we prove that the network approximates a discretized network with
saturated activation functions. Such "saturated" networks are known to have a
reduced capacity compared to the full network family that can be described in
terms of formal languages and automata. Our results suggest saturation is a new
characterization of an inductive bias implicit in GD of particular interest for
NLP. We leverage the emergent discrete structure in a saturated transformer to
analyze the role of different attention heads, finding that some focus locally
on a small number of positions, while other heads compute global averages,
allowing counting. We believe understanding the interplay between these two
capabilities may shed further light on the structure of computation within
large transformers.Comment: To appear at EMNLP 202
Recommended from our members
Formal and Empirical Studies of Counting Behaviour in ReLU RNNs
In recent years, the discussion about systematicity of neural network learning has gained renewed interest, in particular the formal analysis of neural network behaviour. In this paper, we investigate the capability of single-cell ReLU RNN models to demonstrate precise counting behaviour. Formally, we start by characterising the semi-Dyck-1 language and semi-Dyck-1 counter machine that can be implemented by a single Rectified Linear Unit (ReLU) cell. We define three Counter Indicator Conditions (CICs) on the weights of a ReLU cell and show that fulfilling these conditions is equivalent to accepting the semi-Dyck-1 language, i.e. to perform exact counting. Empirically, we study the ability of single-cell ReLU RNNs to learn to count by training and testing them on different datasets of Dyck-1 and semi-Dyck-1 strings. While networks that satisfy the CICs count exactly and thus correctly even on very long strings, the trained networks exhibit a wide range of results and never satisfy the CICs exactly. We investigate the effect of deviating from the CICs and find that configurations that fulfil the CICs are not at a minimum of the loss function in the most common setups. This is consistent with observations in previous research indicating that training ReLU networks for counting tasks often leads to poor results. We finally discuss implications of these results and possible avenues for improving network behaviour