How You Prompt Matters! Even Task-Oriented Constraints in Instructions Affect LLM-Generated Text Detection
Against the misuse (e.g., plagiarism or spreading misinformation) of Large
Language Models (LLMs), many recent works have presented LLM-generated-text
detectors with promising detection performance. We spotlight a situation where users instruct LLMs to generate texts (e.g., essay writing); there are various ways to write such an instruction (e.g., which task-oriented constraints to include).
In this paper, we discover that even a task-oriented constraint in an instruction can cause current detectors to perform inconsistently on the generated texts. Specifically, we focus on student essay writing as a realistic domain and manually create a task-oriented constraint for each factor of essay quality identified by Ke and Ng (2019). Our experiments show that the variance in a current detector's performance across texts generated under different task-oriented constraints is up to 20 times larger than the variance caused by regenerating texts multiple times or paraphrasing the instruction. Our finding calls for further research on developing detectors that are robust to such distribution shifts caused by a task-oriented constraint in the instruction.
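To make the reported comparison concrete, below is a minimal Python sketch of measuring detection-performance variance across task-oriented constraints versus across repeated generations with the same instruction. The helpers (`generate_essays`, `detect_f1`) and the example constraints are hypothetical stand-ins, not the paper's released code; here they return dummy values purely so the sketch runs end to end.

```python
import random
import statistics

def generate_essays(instruction, n=20):
    # Stand-in for prompting an LLM n times with the given instruction.
    return [f"essay for: {instruction}" for _ in range(n)]

def detect_f1(essays):
    # Stand-in for evaluating an LLM-generated-text detector (F1-score).
    return random.uniform(0.5, 1.0)

base = "Write an essay on the impact of social media."
# One task-oriented constraint per essay-quality factor (illustrative wording).
constraints = [
    "Organize the essay clearly and coherently.",
    "Support every claim with concrete evidence.",
    "Maintain a formal, persuasive tone.",
]

# Variance of detection performance across different constraints...
f1_by_constraint = [detect_f1(generate_essays(f"{base} {c}")) for c in constraints]
# ...versus the baseline variance from regenerating with the same instruction.
f1_by_rerun = [detect_f1(generate_essays(base)) for _ in range(len(constraints))]

print("variance across constraints:", statistics.variance(f1_by_constraint))
print("variance across regenerations:", statistics.variance(f1_by_rerun))
```

With real generation and detection plugged in, the paper reports the first variance can be up to 20 times the second.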
OUTFOX: LLM-generated Essay Detection through In-context Learning with Adversarially Generated Examples
Large Language Models (LLMs) have achieved human-level fluency in text
generation, making it difficult to distinguish between human-written and
LLM-generated texts. This poses a growing risk of misuse of LLMs and demands
the development of detectors to identify LLM-generated texts. However, the accuracy of existing detectors degrades when LLM-generated texts are simply paraphrased. Furthermore, the effectiveness of these detectors in real-life
situations, such as when students use LLMs for writing homework assignments
(e.g., essays) and quickly learn how to evade these detectors, has not been
explored. In this paper, we propose OUTFOX, a novel framework that improves the robustness of LLM-generated-text detectors by letting the detector and the attacker consider each other's output, and we apply it to the domain of student essays. In our framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses those adversarially generated essays as examples for in-context learning and learns to detect essays from a strong attacker. Our experiments show that the proposed detector, learning in-context from the attacker, improves detection performance on the attacked dataset by up to +41.3 points in F1-score, while the proposed attacker can drastically degrade the detector's performance by up to -57.0 points in F1-score compared to the paraphrasing method.
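The following is a minimal sketch of the adversarial in-context loop the abstract describes, assuming a generic `llm_complete(prompt) -> str` completion helper. The prompt wording, helper names, and round count are illustrative assumptions, not the OUTFOX codebase; `llm_complete` returns a canned string so the sketch runs as-is.

```python
def llm_complete(prompt: str) -> str:
    # Stand-in for an actual LLM call; swap in your model of choice.
    return "stub output"

def detect(essay: str, adversarial_examples: list[str]) -> str:
    # Detector: labels an essay, conditioning in-context on essays the
    # attacker previously generated (shown with the 'LLM-generated' label).
    shots = "\n\n".join(
        f"Essay: {e}\nLabel: LLM-generated" for e in adversarial_examples
    )
    prompt = (f"{shots}\n\nEssay: {essay}\n"
              "Label (Human-written or LLM-generated):")
    return llm_complete(prompt).strip()

def attack(topic: str, labeled_attempts: list[tuple[str, str]]) -> str:
    # Attacker: writes a new essay, conditioning in-context on its previous
    # essays together with the detector's prediction labels for them.
    shots = "\n\n".join(
        f"Essay: {e}\nDetector said: {label}" for e, label in labeled_attempts
    )
    prompt = (f"{shots}\n\nWrite an essay on '{topic}' that the detector "
              "will label Human-written:")
    return llm_complete(prompt)

# Each round, each side consumes the other's latest output.
topic = "the role of homework in education"
attempts: list[tuple[str, str]] = []
adversarial_pool: list[str] = []
for _ in range(3):
    essay = attack(topic, attempts)
    label = detect(essay, adversarial_pool)
    attempts.append((essay, label))    # attacker sees the detector's labels
    adversarial_pool.append(essay)     # detector sees the attacker's essays
```

The key design point is the symmetry: both sides improve purely through in-context examples drawn from the opponent's most recent behavior, with no fine-tuning of either model.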