
    Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

    The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude ($\ell_2$ norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family, and this capacity can be described in terms of formal languages and automata. Our results suggest saturation is a new characterization of an inductive bias implicit in GD that is of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some focus locally on a small number of positions, while other heads compute global averages, allowing counting. We believe understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.
    Comment: To appear at EMNLP 202
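
    A minimal NumPy sketch (mine, not the paper's code) of the saturation effect the abstract describes: scaling attention logits by a growing factor, standing in for parameter norm growth, drives softmax attention toward a discrete limit that averages uniformly over the maximizing positions. The scores and scale factors below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

# Illustrative attention logits for a single query (assumed values).
scores = np.array([2.0, 3.0, 3.0, 1.0])

# As the scale c grows (a stand-in for parameter norm growth), the attention
# weights approach [0, 0.5, 0.5, 0]: a uniform average over the argmax
# positions, i.e. the "saturated" discrete attention the paper analyzes.
for c in [1.0, 10.0, 100.0]:
    print(f"c = {c:6.1f} ->", np.round(softmax(c * scores), 4))
```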

    Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

    Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such adaptivity. Our analysis of training is non-quantitative in terms of running time, but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.
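
    A toy finite-width sketch (my own, under simplifying assumptions, not the authors' code) of the setting described above: a two-layer ReLU network trained with full-batch gradient descent on the logistic loss over a synthetic 2D dataset, tracking the margin normalized by the squared parameter norm (the natural scaling for this 2-homogeneous model). The paper's results concern the infinite-width gradient-flow limit, so this is only an illustration of the trend.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 64, 2, 500                     # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                     # labels from a hidden 1-D structure

W1 = rng.standard_normal((d, m)) / np.sqrt(d)   # first-layer weights
w2 = rng.standard_normal(m) / np.sqrt(m)        # output weights
lr = 0.1

for step in range(20001):
    H = np.maximum(X @ W1, 0.0)          # hidden ReLU activations
    f = H @ w2                           # network outputs
    g = -y / (1.0 + np.exp(y * f)) / n   # gradient of logistic loss w.r.t. f
    grad_w2 = H.T @ g
    grad_W1 = X.T @ (np.outer(g, w2) * (H > 0))
    if step % 5000 == 0:
        norm_sq = (W1 ** 2).sum() + (w2 ** 2).sum()
        print(step, "normalized margin:", f"{(y * f).min() / norm_sq:.4f}")
    W1 -= lr * grad_W1
    w2 -= lr * grad_w2
```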

    Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm

    We study the function space characterization of the inductive bias resulting from controlling the $\ell_2$ norm of the weights in linear convolutional networks. We view this in terms of an induced regularizer in the function space given by the minimum norm of weights required to realize a linear function. For two-layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network have a single channel, the induced regularizer for any $K$ is a norm given by a semidefinite program (SDP) that is independent of the number of output channels $C$. We further validate these results through a binary classification task on MNIST. (b) In contrast, for networks with multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions, and thus the inductive bias does depend on $C$. Further, for sufficiently large $C$, the induced regularizers for $K=1$ and $K=D$ are the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients -- both of which promote sparse structures.
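
    To make the two regularizers named in (b) concrete, here is a small NumPy sketch (my own reading, under assumed conventions: a map over D positions with Fourier coefficients taken via the unitary DFT, which may differ from the paper's exact definition) that evaluates the nuclear norm and the $\ell_{2,1}$ group norm of the Fourier-domain representation of a matrix-valued linear map x -> M x.

```python
import numpy as np

# Hypothetical sketch of the two regularizers named in the abstract, under my
# own assumptions about conventions (unitary DFT, Fourier coefficients of the
# matrix-valued map taken as F M F^H); the paper's exact definition may differ.

D = 8
rng = np.random.default_rng(0)
M = rng.standard_normal((D, D))            # matrix of the linear map x -> M x

F = np.fft.fft(np.eye(D), norm="ortho")    # unitary DFT matrix
M_hat = F @ M @ F.conj().T                 # Fourier-domain representation of M

nuclear = np.linalg.norm(M_hat, ord="nuc")         # nuclear norm (K=1 case per the abstract)
group_l21 = np.linalg.norm(M_hat, axis=0).sum()    # l_{2,1} group norm (K=D case per the abstract)

print("nuclear norm of Fourier coefficients :", round(float(nuclear), 3))
print("l_{2,1} group norm of Fourier coeffs :", round(float(group_l21), 3))
```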