Large language models (LLMs) can memorize many pretraining sequences verbatim.
This paper studies whether we can locate a small set of neurons in LLMs responsible
for memorizing a given sequence. While the concept of localization is often
mentioned in prior work, methods for localization have never been
systematically and directly evaluated; we address this with two benchmarking
approaches. In our INJ Benchmark, we actively inject a piece of new information
into a small subset of LLM weights and measure whether localization methods can
identify these "ground truth" weights. In the DEL Benchmark, we study
localization of pretrained data that LLMs have already memorized; while this
setting lacks ground truth, we can still evaluate localization by measuring
whether dropping out located neurons erases a memorized sequence from the
model. We evaluate five localization methods on our two benchmarks, which yield
similar rankings of the methods. All methods exhibit promising localization
ability, especially the pruning-based ones, though the neurons they identify are
not necessarily specific to a single memorized sequence.
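
To make the DEL-style evaluation concrete, the following is a minimal sketch of the idea, not the implementation used in this work. It assumes a toy two-layer ReLU network in NumPy, and it takes "dropping out" a neuron to mean zeroing its hidden activation. The function name nll_after_ablation, the weights, and the neuron indices are all hypothetical and chosen only for illustration. The sketch compares the negative log-likelihood of a "memorized" target before and after the located neurons are dropped.

```python
import numpy as np


def nll_after_ablation(x, w_in, w_out, target, located=()):
    """Negative log-likelihood of `target` after zeroing the `located` hidden neurons.

    Hypothetical helper for illustration: a toy two-layer ReLU network
    stands in for the LLM, and "dropping out" a neuron is assumed to mean
    setting its hidden activation to zero.
    """
    hidden = np.maximum(x @ w_in, 0.0)               # ReLU hidden activations
    mask = np.ones(hidden.shape[-1])
    mask[np.asarray(list(located), dtype=int)] = 0.0  # drop the located neurons
    logits = (hidden * mask) @ w_out
    logits = logits - logits.max()                    # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target])


# Toy setup (all values are assumptions): hidden neuron 0 carries most of
# the evidence for class 0, which plays the role of the memorized token.
x = np.array([1.0, 1.0])
w_in = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 1.0]])
w_out = np.array([[5.0, 0.0],
                  [0.0, 0.5],
                  [0.5, 0.0]])
target = 0

baseline = nll_after_ablation(x, w_in, w_out, target)
drop_located = nll_after_ablation(x, w_in, w_out, target, located=[0])
drop_other = nll_after_ablation(x, w_in, w_out, target, located=[1])

print(f"baseline NLL:        {baseline:.4f}")
print(f"drop neuron 0 (hit): {drop_located:.4f}")
print(f"drop neuron 1 (off): {drop_other:.4f}")
```

In this toy case, dropping the neuron that carries the target raises the loss on the target sharply, while dropping an unrelated neuron does not. A successful localization method is expected to produce the first pattern on the memorized sequence while leaving other sequences largely unaffected.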