Prior work has shown that language models (LMs) often memorize
parts of training instances and reproduce them during natural language
generation (NLG). However, it is unclear to what extent LMs "reuse" a training
corpus. For instance, models can generate paraphrased sentences that are
contextually similar to training samples. In this work, we therefore study
three types of plagiarism (i.e., verbatim, paraphrase, and idea) in
GPT-2-generated texts relative to GPT-2's training data, and further analyze
the plagiarism patterns of LMs fine-tuned on domain-specific corpora, which are
extensively used in practice. Our results suggest that (1) all three types of
plagiarism are widespread in LMs, going beyond memorization, (2) both the size
and the decoding method of an LM are strongly associated with the degree of
plagiarism it exhibits, and (3) fine-tuned LMs' plagiarism patterns vary based on their corpus
similarity and homogeneity. Given that a majority of LMs' training data is
scraped from the Web without informing content owners, their reiteration of
words, phrases, and even core ideas from training sets into generated texts has
ethical implications. These patterns are likely to worsen as both the size
of LMs and their training data increase, raising concerns about
indiscriminately pursuing larger models with larger training corpora.
Plagiarized content can also contain individuals' personal and sensitive
information. These findings overall cast doubt on the practicality of current
LMs in mission-critical writing tasks and urge more discussions around the
observed phenomena. Data and source code are available at
https://github.com/Brit7777/LM-plagiarism.

Comment: Accepted to WWW'2