Despite the widespread success of Transformers on NLP tasks, recent work has
found that, compared to recurrent models, they struggle to model several formal
languages. This raises the question of why Transformers perform well
in practice and whether they have any properties that enable them to generalize
better than recurrent models. In this work, we conduct an extensive empirical
study on Boolean functions to demonstrate the following: (i) Random
Transformers are relatively more biased towards functions of low sensitivity.
(ii) When trained on Boolean functions, both Transformers and LSTMs prioritize
learning functions of low sensitivity, with Transformers ultimately converging
to functions of lower sensitivity. (iii) On sparse Boolean functions which have
low sensitivity, we find that Transformers generalize near perfectly even in
the presence of noisy labels whereas LSTMs overfit and achieve poor
generalization accuracy. Overall, our results provide strong quantifiable
evidence that suggests differences in the inductive biases of Transformers and
recurrent models, which may help explain Transformers' effective generalization
performance despite relatively limited expressiveness.
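
To make the notion of sensitivity concrete, the sketch below computes the average sensitivity of a Boolean function by brute force. It assumes the standard definition (the number of input bits whose flip changes the output, averaged over all inputs), which may differ in detail from the measure used in the study. The function names and the choice of n = 6 are illustrative and not taken from the paper.

```python
from itertools import product


def sensitivity_at(f, x):
    """Count the coordinates of x whose flip changes the value of f."""
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(flipped) != f(x):
            count += 1
    return count


def average_sensitivity(f, n):
    """Average the pointwise sensitivity over all 2**n inputs (exhaustive)."""
    total = sum(sensitivity_at(f, x) for x in product((0, 1), repeat=n))
    return total / 2 ** n


def parity(x):
    # Depends on every bit: flipping any single bit changes the output.
    return sum(x) % 2


def sparse_parity(x):
    # Hypothetical sparse function: depends only on the first two bits.
    return (x[0] + x[1]) % 2


n = 6  # illustrative input length, small enough to enumerate
avg_parity = average_sensitivity(parity, n)
avg_sparse = average_sensitivity(sparse_parity, n)
print(avg_parity, avg_sparse)
```

Under this definition, full parity on n bits attains the maximum average sensitivity of n, while a function that depends on only k of the n bits has average sensitivity at most k. This is the sense in which sparse Boolean functions have low sensitivity.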