Recent large language models have been trained on vast datasets, but also
often on repeated data, either intentionally for the purpose of upweighting
higher quality data, or unintentionally because data deduplication is not
perfect and the model is exposed to repeated data at the sentence, paragraph,
or document level. Some works have reported substantial negative performance
effects of this repeated data. In this paper we attempt to study repeated data
systematically and to understand its effects mechanistically. To do this, we
train a family of models where most of the data is unique but a small fraction
of it is repeated many times. We find a strong double descent phenomenon, in
which repeated data can lead test loss to increase midway through training. A
predictable range of repetition frequency leads to surprisingly severe
degradation in performance. For instance, performance of an 800M parameter
model can be degraded to that of a 2x smaller model (400M params) by repeating
0.1% of the data 100 times, despite the other 90% of the training tokens
remaining unique. We suspect there is a range in the middle where the data can
be memorized and doing so consumes a large fraction of the model's capacity,
and this may be where the peak of degradation occurs. Finally, we connect these
observations to recent mechanistic interpretability work - attempting to
reverse engineer the detailed computations performed by the model - by showing
that data repetition disproportionately damages copying and internal structures
associated with generalization, such as induction heads, providing a possible
mechanism for the shift from generalization to memorization. Taken together,
these results provide a hypothesis for why repeating a relatively small
fraction of data in large language models could lead to disproportionately
large harms to performance.Comment: 23 pages, 22 figure