25 research outputs found
Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success
The generations of large language models are commonly controlled through
prompting techniques, where a user's query to the model is prefixed with a
prompt that aims to guide the model's behaviour on the query. The prompts used
by companies to guide their models are often treated as secrets, to be hidden
from the user making the query. They have even been treated as commodities to
be bought and sold. However, there has been anecdotal evidence showing that the
prompts can be extracted by a user even when they are kept secret. In this
paper, we present a framework for systematically measuring the success of
prompt extraction attacks. In experiments with multiple sources of prompts and
multiple underlying language models, we find that simple text-based attacks can
in fact reveal prompts with high probability
Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text
As text generated by large language models proliferates, it becomes vital to
understand how humans engage with such text, and whether or not they are able
to detect when the text they are reading did not originate with a human writer.
Prior work on human detection of generated text focuses on the case where an
entire passage is either human-written or machine-generated. In this paper, we
study a more realistic setting where text begins as human-written and
transitions to being generated by state-of-the-art neural language models. We
show that, while annotators often struggle at this task, there is substantial
variance in annotator skill and that given proper incentives, annotators can
improve at this task over time. Furthermore, we conduct a detailed comparison
study and analyze how a variety of variables (model size, decoding strategy,
fine-tuning, prompt genre, etc.) affect human detection performance. Finally,
we collect error annotations from our participants and use them to show that
certain textual genres influence models to make different types of errors and
that certain sentence-level features correlate highly with annotator selection.
We release the RoFT dataset: a collection of over 21,000 human annotations
paired with error classifications to encourage future work in human detection
and evaluation of generated text.Comment: AAAI 2023 Long Paper. Code is available at
https://github.com/liamdugan/human-detectio
Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System
Neural language models are increasingly deployed into APIs and websites that
allow a user to pass in a prompt and receive generated text. Many of these
systems do not reveal generation parameters. In this paper, we present methods
to reverse-engineer the decoding method used to generate text (i.e., top- or
nucleus sampling). Our ability to discover which decoding strategy was used has
implications for detecting generated text. Additionally, the process of
discovering the decoding strategy can reveal biases caused by selecting
decoding settings which severely truncate a model's predicted distributions. We
perform our attack on several families of open-source language models, as well
as on production systems (e.g., ChatGPT).Comment: 6 pages, 4 figures, 3 tables. Also, 5 page appendix. Accepted to INLG
202
Counterfactual Memorization in Neural Language Models
Modern neural language models that are widely used in various NLP tasks risk
memorizing sensitive information from their training data. Understanding this
memorization is important in real world applications and also from a
learning-theoretical perspective. An open question in previous studies of
language model memorization is how to filter out "common" memorization. In
fact, most memorization criteria strongly correlate with the number of
occurrences in the training set, capturing memorized familiar phrases, public
knowledge, templated texts, or other repeated data. We formulate a notion of
counterfactual memorization which characterizes how a model's predictions
change if a particular document is omitted during training. We identify and
study counterfactually-memorized training examples in standard text datasets.
We estimate the influence of each memorized training example on the validation
set and on generated texts, showing how this can provide direct evidence of the
source of memorization at test time.Comment: NeurIPS 2023; 42 pages, 33 figure
Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy
Studying data memorization in neural language models helps us understand the
risks (e.g., to privacy or copyright) associated with models regurgitating
training data and aids in the development of countermeasures. Many prior works
-- and some recently deployed defenses -- focus on "verbatim memorization",
defined as a model generation that exactly matches a substring from the
training set. We argue that verbatim memorization definitions are too
restrictive and fail to capture more subtle forms of memorization.
Specifically, we design and implement an efficient defense that perfectly
prevents all verbatim memorization. And yet, we demonstrate that this "perfect"
filter does not prevent the leakage of training data. Indeed, it is easily
circumvented by plausible and minimally modified "style-transfer" prompts --
and in some cases even the non-modified original prompts -- to extract
memorized information. We conclude by discussing potential alternative
definitions and why defining memorization is a difficult yet crucial open
question for neural language models