Neural networks often pack many unrelated concepts into a single neuron - a
puzzling phenomenon known as 'polysemanticity', which makes interpretability
much more challenging. This paper provides a toy model where polysemanticity
can be fully understood, arising as a result of models storing additional
sparse features in "superposition." We demonstrate the existence of a phase
change, a surprising connection to the geometry of uniform polytopes, and
evidence of a link to adversarial examples. We also discuss potential
implications for mechanistic interpretability.

Also available at https://transformer-circuits.pub/2022/toy_model/index.htm
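The idea of a toy model storing more sparse features than it has dimensions can be made concrete with a small experiment. The sketch below is illustrative only, assuming a setup along the lines the abstract describes: synthetic sparse features are compressed through a narrow bottleneck and reconstructed with a ReLU readout; the names and hyperparameters (`n_features`, `n_hidden`, `sparsity`, the importance schedule) are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a superposition toy model (illustrative assumptions only).
import torch

n_features, n_hidden = 20, 5                    # more features than hidden dims
sparsity = 0.95                                 # probability a feature is inactive
importance = 0.9 ** torch.arange(n_features)    # decaying per-feature importance

# Learnable down-projection W and readout bias b.
W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Sample sparse features: each is 0 with probability `sparsity`,
    # otherwise uniform in [0, 1].
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity).float()

    # Compress to the bottleneck and reconstruct: x_hat = ReLU(W^T W x + b).
    h = x @ W.T
    x_hat = torch.relu(h @ W + b)

    # Importance-weighted reconstruction loss.
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Large off-diagonal entries of W^T W indicate features that share
# bottleneck directions, i.e. features stored in superposition.
print((W.T @ W).round(decimals=2))
```

With dense inputs (sparsity near 0), a model like this tends to keep only the few most important features; as the features become sparser, it becomes worthwhile to overlap many features in the same hidden directions, which is the kind of transition the abstract refers to as a phase change.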