Language models (LMs) have proven to be powerful tools for psycholinguistic
research, but most prior work has focused on purely behavioural measures (e.g.,
surprisal comparisons). At the same time, research in model interpretability
has begun to illuminate the abstract causal mechanisms shaping LM behaviour. To
help bring these strands of research closer together, we introduce CausalGym.
We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of
interpretability methods to causally affect model behaviour. To illustrate how
CausalGym can be used, we study the pythia models (14M--6.9B) and assess the
causal efficacy of a wide range of interpretability methods, including linear
probing and distributed alignment search (DAS). We find that DAS outperforms
the other methods, and so we use it to study the learning trajectory of two
difficult linguistic phenomena in pythia-1b: negative polarity item licensing
and filler--gap dependencies. Our analysis shows that the mechanism
implementing both of these tasks is learned in discrete stages, not gradually.
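
For illustration, below is a minimal sketch of a DAS-style distributed interchange intervention of the kind the abstract refers to. It is based on general descriptions of distributed alignment search, not on the CausalGym implementation; the class name, dimensions, and use of an orthogonal parametrization are assumptions. In practice, the rotation would be trained so that the patched model produces the behaviour appropriate to the source input.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parametrizations


class DistributedInterchange(nn.Module):
    """Swap a learned k-dimensional subspace of a base hidden state with the
    corresponding subspace of a source hidden state (a DAS-style intervention).
    Hypothetical sketch; names and shapes are illustrative assumptions."""

    def __init__(self, hidden_size: int, k: int):
        super().__init__()
        self.k = k
        self.rotation = nn.Linear(hidden_size, hidden_size, bias=False)
        # Constrain the weight to be orthogonal so the map is an invertible rotation.
        parametrizations.orthogonal(self.rotation)

    def forward(self, h_base: torch.Tensor, h_source: torch.Tensor) -> torch.Tensor:
        r_base = self.rotation(h_base)      # rotate base state into the learned basis
        r_source = self.rotation(h_source)  # rotate source state into the learned basis
        # Take the first k coordinates from the source, the rest from the base.
        mixed = torch.cat([r_source[..., :self.k], r_base[..., self.k:]], dim=-1)
        # Undo the rotation: for orthogonal W with y = x W^T, we have x = y W.
        return mixed @ self.rotation.weight


if __name__ == "__main__":
    intervention = DistributedInterchange(hidden_size=512, k=8)
    h_base, h_source = torch.randn(2, 512), torch.randn(2, 512)
    patched = intervention(h_base, h_source)  # would be written back into the model
    print(patched.shape)  # torch.Size([2, 512])
```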