SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Hassani, Hamed; Pappas, George J.; Robey, Alexander; Wong, Eric

Authors: Hamed Hassani, George J. Pappas, Alexander Robey, Eric Wong
Publication date: 29 November 2023

Abstract

Despite efforts to align large language models (LLMs) with human values, widely used LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs. Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation. Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm
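
The perturb-then-aggregate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The abstract does not specify the perturbation type, the number of copies, or the aggregation rule, so the sketch assumes random character substitution, a simple majority vote, and two hypothetical callables supplied by the caller: `llm`, which maps a prompt to a response, and `is_jailbroken`, which judges whether a response is objectionable. The names `perturb`, `smoothed_response`, `n_copies`, and `rate` are likewise illustrative.

```python
import random
import string
from typing import Callable

# Assumed replacement alphabet: printable ASCII without control whitespace.
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "


def perturb(prompt: str, rate: float, rng: random.Random) -> str:
    """Replace a fraction `rate` of the characters with random ones (assumed scheme)."""
    chars = list(prompt)
    k = int(len(chars) * rate)  # number of positions to overwrite
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)


def smoothed_response(
    prompt: str,
    llm: Callable[[str], str],             # hypothetical: prompt -> response
    is_jailbroken: Callable[[str], bool],  # hypothetical: response -> verdict
    n_copies: int = 10,
    rate: float = 0.1,
    seed: int = 0,
) -> str:
    """Query the model on perturbed copies and return a response that agrees
    with the majority verdict (assumed aggregation rule)."""
    rng = random.Random(seed)
    responses = [llm(perturb(prompt, rate, rng)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    majority = sum(flags) > n_copies / 2  # ties count as "not jailbroken"
    consistent = [r for r, f in zip(responses, flags) if f == majority]
    return rng.choice(consistent)
```

If an adversarial prompt only works when its characters are intact, most perturbed copies fail to trigger the attack, so the majority verdict is a refusal even though the unperturbed prompt would have succeeded. The cost is `n_copies` model queries per input instead of one.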

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2310.03684

Last time updated on 10/05/2024