Safety lies at the core of the development of Large Language Models (LLMs).
There is ample work on aligning LLMs with human ethics and preferences,
including data filtering in pretraining, supervised fine-tuning, reinforcement
learning from human feedback, and red teaming, etc. In this study, we discover
that chat in cipher can bypass the safety alignment techniques of LLMs, which
are mainly conducted in natural languages. We propose a novel framework
CipherChat to systematically examine the generalizability of safety alignment
to non-natural languages -- ciphers. CipherChat enables humans to chat with
LLMs through cipher prompts topped with system role descriptions and few-shot
enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs,
including ChatGPT and GPT-4 for different representative human ciphers across
11 safety domains in both English and Chinese. Experimental results show that
certain ciphers succeed almost 100% of the time to bypass the safety alignment
of GPT-4 in several safety domains, demonstrating the necessity of developing
safety alignment for non-natural languages. Notably, we identify that LLMs seem
to have a ''secret cipher'', and propose a novel SelfCipher that uses only role
play and several demonstrations in natural language to evoke this capability.
SelfCipher surprisingly outperforms existing human ciphers in almost all cases.
Our code and data will be released at https://github.com/RobustNLP/CipherChat.Comment: 13 pages, 4 figures, 9 table