In this work, we carry out a data archaeology to infer books that are known
to ChatGPT and GPT-4 using a name cloze membership inference query. We find
that OpenAI models have memorized a wide collection of copyrighted materials,
and that the degree of memorization is tied to the frequency with which
passages of those books appear on the web. The ability of these models to
memorize an unknown set of books complicates assessments of measurement
validity for cultural analytics by contaminating test data; we show that models
perform much better on memorized books than on non-memorized books for
downstream tasks. We argue that this supports a case for open models whose
training data is known.

Comment: EMNLP 2023 camera-ready (16 pages, 4 figures)
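The name cloze query mentioned above can be sketched in a few lines: a single proper name is masked in a book passage, the model is asked to fill it in, and exact recovery of the held-out name is taken as evidence of memorization. This is a minimal illustration with hypothetical helper names, not the paper's actual implementation.

```python
def make_name_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Build a cloze prompt by masking the target proper name in the passage."""
    return passage.replace(name, mask)

def cloze_accuracy(predictions: list[str], gold_names: list[str]) -> float:
    """Fraction of cloze items where the model's guess exactly matches the
    held-out name; higher accuracy suggests the passage was memorized."""
    correct = sum(p == g for p, g in zip(predictions, gold_names))
    return correct / len(gold_names)

# Example: mask the name, then (in the real pipeline) send the prompt to the
# model and compare its single-name answer to the gold name.
passage = '"You shall not pass!" cried Gandalf.'
prompt = make_name_cloze(passage, "Gandalf")
```

In practice the prompt would be sent to ChatGPT or GPT-4 with instructions to output only the masked name, and accuracy would be aggregated per book to estimate its degree of memorization.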