Search CORE

25 research outputs found

Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success

Author: Ippolito Daphne
Zhang Yiming
Publication venue
Publication date: 13/07/2023
Field of study

The generations of large language models are commonly controlled through prompting techniques, where a user's query to the model is prefixed with a prompt that aims to guide the model's behaviour on the query. The prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. They have even been treated as commodities to be bought and sold. However, there has been anecdotal evidence showing that the prompts can be extracted by a user even when they are kept secret. In this paper, we present a framework for systematically measuring the success of prompt extraction attacks. In experiments with multiple sources of prompts and multiple underlying language models, we find that simple text-based attacks can in fact reveal prompts with high probability

arXiv.org e-Print Archive

Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text

Author: Callison-Burch Chris
Dugan Liam
Ippolito Daphne
Kirubarajan Arun
Shi Sherry
Publication venue
Publication date: 24/12/2022
Field of study

As text generated by large language models proliferates, it becomes vital to understand how humans engage with such text, and whether or not they are able to detect when the text they are reading did not originate with a human writer. Prior work on human detection of generated text focuses on the case where an entire passage is either human-written or machine-generated. In this paper, we study a more realistic setting where text begins as human-written and transitions to being generated by state-of-the-art neural language models. We show that, while annotators often struggle at this task, there is substantial variance in annotator skill and that given proper incentives, annotators can improve at this task over time. Furthermore, we conduct a detailed comparison study and analyze how a variety of variables (model size, decoding strategy, fine-tuning, prompt genre, etc.) affect human detection performance. Finally, we collect error annotations from our participants and use them to show that certain textual genres influence models to make different types of errors and that certain sentence-level features correlate highly with annotator selection. We release the RoFT dataset: a collection of over 21,000 human annotations paired with error classifications to encourage future work in human detection and evaluation of generated text.Comment: AAAI 2023 Long Paper. Code is available at https://github.com/liamdugan/human-detectio

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System

Author: Carlini Nicholas
Ippolito Daphne
Lee Katherine
Nasr Milad
Yu Yun William
Publication venue
Publication date: 09/09/2023
Field of study

Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-

k

or nucleus sampling). Our ability to discover which decoding strategy was used has implications for detecting generated text. Additionally, the process of discovering the decoding strategy can reveal biases caused by selecting decoding settings which severely truncate a model's predicted distributions. We perform our attack on several families of open-source language models, as well as on production systems (e.g., ChatGPT).Comment: 6 pages, 4 figures, 3 tables. Also, 5 page appendix. Accepted to INLG 202

arXiv.org e-Print Archive

Counterfactual Memorization in Neural Language Models

Author: Carlini Nicholas
Ippolito Daphne
Jagielski Matthew
Lee Katherine
Tramèr Florian
Zhang Chiyuan
Publication venue
Publication date: 13/10/2023
Field of study

Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data. Understanding this memorization is important in real world applications and also from a learning-theoretical perspective. An open question in previous studies of language model memorization is how to filter out "common" memorization. In fact, most memorization criteria strongly correlate with the number of occurrences in the training set, capturing memorized familiar phrases, public knowledge, templated texts, or other repeated data. We formulate a notion of counterfactual memorization which characterizes how a model's predictions change if a particular document is omitted during training. We identify and study counterfactually-memorized training examples in standard text datasets. We estimate the influence of each memorized training example on the validation set and on generated texts, showing how this can provide direct evidence of the source of memorization at test time.Comment: NeurIPS 2023; 42 pages, 33 figure

arXiv.org e-Print Archive

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

Author: Carlini Nicholas
Choquette-Choo Christopher A.
Ippolito Daphne
Jagielski Matthew
Lee Katherine
Nasr Milad
Tramèr Florian
Zhang Chiyuan
Publication venue
Publication date: 11/09/2023
Field of study

Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. Many prior works -- and some recently deployed defenses -- focus on "verbatim memorization", defined as a model generation that exactly matches a substring from the training set. We argue that verbatim memorization definitions are too restrictive and fail to capture more subtle forms of memorization. Specifically, we design and implement an efficient defense that perfectly prevents all verbatim memorization. And yet, we demonstrate that this "perfect" filter does not prevent the leakage of training data. Indeed, it is easily circumvented by plausible and minimally modified "style-transfer" prompts -- and in some cases even the non-modified original prompts -- to extract memorized information. We conclude by discussing potential alternative definitions and why defining memorization is a difficult yet crucial open question for neural language models

arXiv.org e-Print Archive