Decomposing model activations into interpretable components is a key open
problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a
popular method for decomposing the internal activations of trained transformers
into sparse, interpretable features, and have been applied to MLP layers and
the residual stream. In this work we train SAEs on attention layer outputs and
show that here, too, SAEs find a sparse, interpretable decomposition. We
demonstrate this on transformers from several model families with up to 2B
parameters.
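
To make the setup concrete, the sketch below shows a minimal sparse autoencoder of the kind trained on attention layer outputs: a ReLU encoder, a linear decoder, and an L1 sparsity penalty on the feature activations. This is a standard SAE recipe rather than our exact training configuration; the class name, dimensions, and L1 coefficient are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionOutputSAE(nn.Module):
    """Minimal SAE over attention layer outputs (the per-position attention
    output vector, before it is added into the residual stream).
    Hypothetical sketch: names and hyperparameters are illustrative."""

    def __init__(self, d_model: int, d_hidden: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        # Encode: non-negative, sparse feature activations.
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the attention output as a sparse sum of feature directions.
        x_hat = acts @ self.W_dec + self.b_dec
        # Loss = reconstruction error + L1 penalty encouraging sparse activations.
        loss = (x_hat - x).pow(2).mean() + self.l1_coeff * acts.abs().sum(-1).mean()
        return x_hat, acts, loss

# Illustrative usage on GPT-2 Small sized activations: [batch, position, d_model].
sae = AttentionOutputSAE(d_model=768, d_hidden=768 * 16)
x = torch.randn(4, 128, 768)
x_hat, acts, loss = sae(x)
```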
We perform a qualitative study of the features computed by attention layers,
and find multiple families: long-range context features, short-range context features, and
induction features. We qualitatively study the role of every head in GPT-2
Small, and estimate that at least 90% of the heads are polysemantic, i.e. they
have multiple unrelated roles.
Further, we show that SAEs are a useful tool that enables
researchers to explain model behavior in greater detail than prior work. For
example, we explore the mystery of why models have so many seemingly redundant
induction heads, use SAEs to motivate the hypothesis that some are long-prefix
whereas others are short-prefix, and confirm this with more rigorous analysis.
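
As a rough illustration of the kind of check this involves (a sketch under assumed details, not our exact methodology), one can measure how strongly each head attends from the second occurrence of a token back to the token that followed its first occurrence, comparing the case where the context preceding the second occurrence repeats the original context (long prefix match) against fresh random context (only a single token matches). The TransformerLens calls below are standard; the sequence lengths and the final ranking step are arbitrary choices.

```python
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)
torch.manual_seed(0)
model = HookedTransformer.from_pretrained("gpt2")

BATCH, PREFIX_LEN, FILLER_LEN = 32, 8, 16
V = model.cfg.d_vocab

def induction_attention(long_prefix: bool) -> torch.Tensor:
    """Mean attention from the second occurrence of token A to the token B that
    followed A's first occurrence, returned per (layer, head)."""
    ctx1 = torch.randint(0, V, (BATCH, PREFIX_LEN))
    a = torch.randint(0, V, (BATCH, 1))
    b = torch.randint(0, V, (BATCH, 1))
    filler = torch.randint(0, V, (BATCH, FILLER_LEN))
    # The long-prefix case repeats the context that preceded the first A; the
    # short-prefix case uses fresh random tokens, so only A itself matches.
    ctx2 = ctx1 if long_prefix else torch.randint(0, V, (BATCH, PREFIX_LEN))
    bos = torch.full((BATCH, 1), model.tokenizer.bos_token_id, dtype=torch.long)
    tokens = torch.cat([bos, ctx1, a, b, filler, ctx2, a], dim=1).to(model.cfg.device)
    _, cache = model.run_with_cache(tokens)
    q_pos = tokens.shape[1] - 1  # position of the second A
    k_pos = PREFIX_LEN + 2       # position of B, right after the first A
    # cache["pattern", layer] has shape [batch, head, query_pos, key_pos]
    return torch.stack([cache["pattern", layer][:, :, q_pos, k_pos].mean(0)
                        for layer in range(model.cfg.n_layers)])

long_scores = induction_attention(long_prefix=True)    # [n_layers, n_heads]
short_scores = induction_attention(long_prefix=False)
# Heads that attend strongly only when the long prefix matches are candidate
# long-prefix induction heads; heads that score highly in both cases behave as
# ordinary (short-prefix) induction heads.
print((long_scores - short_scores).flatten().topk(5))
```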
We use our SAEs to analyze the computation performed by the Indirect Object
Identification circuit (Wang et al.), validating that the SAEs find causally
meaningful intermediate variables, and deepening our understanding of the
semantics of the circuit. We open-source the trained SAEs and a tool for
exploring arbitrary prompts through the lens of Attention Output SAEs.