Transformer networks have seen great success in natural language processing
and machine vision, where task objectives such as next word prediction and
image classification benefit from nuanced context sensitivity across
high-dimensional inputs. However, there is an ongoing debate about how and when
transformers can acquire highly structured behavior and achieve systematic
generalization. Here, we explore how well a causal transformer can perform a
set of algorithmic tasks, including copying, sorting, and hierarchical
compositions of these operations. We demonstrate strong generalization to
sequences longer than those used in training by replacing the transformer's
standard positional encoding with labels arbitrarily
paired with items in the sequence. We search for the layer and head
configuration sufficient to solve these tasks, then probe for signs of
systematic processing in latent representations and attention patterns. We show
that two-layer transformers learn reliable solutions to multi-level problems,
develop signs of task decomposition, and encode input items in a way that
encourages the exploitation of shared computation across related tasks. These
results provide key insights into how attention layers support structured
computation both within a task and across multiple tasks.
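
The abstract does not specify how the label-based positional scheme is implemented. Below is a minimal sketch of one plausible reading: each item is paired with a random integer label drawn from a range larger than any training length, and a learned embedding of that label replaces the usual positional encoding. The module and parameter names (`LabelPositionEncoding`, `max_label`) are hypothetical, and sorting the sampled labels to preserve relative order is an assumption not stated in the abstract.

```python
import torch
import torch.nn as nn

class LabelPositionEncoding(nn.Module):
    """Hypothetical label-based position scheme: each item is paired with a
    random integer label drawn from a range larger than any training-time
    sequence length, so longer test sequences never produce out-of-range
    position values."""

    def __init__(self, d_model: int, max_label: int = 256):
        super().__init__()
        self.max_label = max_label
        self.label_emb = nn.Embedding(max_label, d_model)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model)
        batch, seq_len, _ = token_emb.shape
        # Sample distinct labels per sequence; sorting preserves relative
        # order (an assumption -- the abstract only says labels are
        # "arbitrarily paired with items").
        labels = torch.stack([
            torch.randperm(self.max_label, device=token_emb.device)[:seq_len]
                 .sort().values
            for _ in range(batch)
        ])
        return token_emb + self.label_emb(labels)
```

Under this reading, the module would simply stand in for the sinusoidal or learned positional encoding at the input of a causal transformer; because every label value in the range can appear during training regardless of sequence length, evaluation on longer sequences (up to `max_label` items) involves no unseen position values.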