Fake Alignment: Are LLMs Really Aligned Well?

Huang, Kexin; Jiang, Yu-Gang; Lyu, Chengqi; Ma, Xingjun; Qiao, Yu; Teng, Yan; Wang, Yingchun; Wang, Yixu; Zhang, Songyang; Zhang, Wenwei

Publication date: 14 November 2023

Abstract

Growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in safety evaluation. This study investigates a notable issue in the evaluation of LLMs: the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue that this discrepancy is caused by mismatched generalization. That is, the LLM does not have a comprehensive understanding of the complex concept of safety. Instead, it only remembers what to answer for open-ended safety questions, which leaves it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. Such fake alignment renders previous evaluation protocols unreliable. To address this, we introduce the Fake alIgNment Evaluation (FINE) framework and two novel metrics, Consistency Score (CS) and Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimates. Applying FINE to 14 widely used LLMs reveals that several models purported to be safe are poorly aligned in practice. Our work highlights potential limitations in prevailing alignment methodologies.
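The abstract names the two metrics but does not define them. As a rough illustration only, the sketch below assumes one plausible reading: each question is judged safe or unsafe in both its open-ended and its multiple-choice form, CS is the fraction of questions on which the two judgments agree, and CSS is the fraction on which both judgments are safe. The function names, the boolean inputs, and these definitions are assumptions made for this example, not the paper's stated formulas.

```python
from typing import Sequence


def _check(open_safe: Sequence[bool], mc_safe: Sequence[bool]) -> int:
    """Validate the paired judgments and return the number of questions."""
    if len(open_safe) != len(mc_safe):
        raise ValueError("both evaluation forms must cover the same questions")
    if not open_safe:
        raise ValueError("at least one question is required")
    return len(open_safe)


def consistency_score(open_safe: Sequence[bool], mc_safe: Sequence[bool]) -> float:
    """Assumed CS: share of questions where both forms give the same verdict."""
    n = _check(open_safe, mc_safe)
    return sum(o == m for o, m in zip(open_safe, mc_safe)) / n


def consistent_safety_score(open_safe: Sequence[bool], mc_safe: Sequence[bool]) -> float:
    """Assumed CSS: share of questions judged safe in both forms."""
    n = _check(open_safe, mc_safe)
    return sum(o and m for o, m in zip(open_safe, mc_safe)) / n


# Hypothetical judgments for four questions (illustrative data, not results).
open_ended = [True, True, True, False]
multiple_choice = [True, False, True, False]

cs = consistency_score(open_ended, multiple_choice)
css = consistent_safety_score(open_ended, multiple_choice)
print(f"CS = {cs:.2f}, CSS = {css:.2f}")
```

Under this reading, a model that answers open-ended questions safely but fails the matching multiple-choice items gets a low CS, and its CSS falls below its open-ended safety rate, which is the kind of gap the abstract calls fake alignment.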

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2311.05915

Last time updated on 10/02/2024