Growing awareness of safety concerns in large language models (LLMs) has
sparked considerable interest in safety evaluation. This study investigates
a problem in the evaluation of LLMs, namely the substantial discrepancy in
performance between multiple-choice questions and open-ended questions.
Inspired by
research on jailbreak attack patterns, we argue this is caused by mismatched
generalization. That is, the LLM does not have a comprehensive understanding of
the complex concept of safety. Instead, it merely memorizes how to answer
open-ended safety questions, which leaves it unable to handle other forms of
safety tests. We refer to this phenomenon as fake alignment and construct a
comparative benchmark to empirically verify its existence in LLMs. Such fake
alignment renders previous evaluation protocols unreliable. To address this, we
introduce the Fake alIgNment Evaluation (FINE) framework and two novel
metrics, Consistency Score (CS) and Consistent Safety Score (CSS), which
jointly assess two complementary forms of evaluation to quantify fake alignment
and obtain corrected performance estimates. Applying FINE to 14 widely used
LLMs reveals that several purportedly safe models are poorly aligned in
practice. Our work highlights potential limitations in prevailing alignment
methodologies.
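
As a minimal sketch of how two such metrics might be computed, assume each question yields a binary safe/unsafe judgment under both the multiple-choice and the open-ended format. The definitions used below are illustrative assumptions rather than formulas stated in this abstract: CS is taken as the fraction of questions on which the two formats agree, and CSS as the fraction on which both formats are judged safe. The function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def consistency_scores(
    mc_safe: Sequence[bool], oe_safe: Sequence[bool]
) -> Tuple[float, float]:
    """Return (CS, CSS) under the assumed definitions described above.

    mc_safe[i]: True if the model's multiple-choice answer to question i is safe.
    oe_safe[i]: True if the model's open-ended answer to question i is safe.
    """
    if len(mc_safe) != len(oe_safe):
        raise ValueError("both formats must cover the same questions")
    if len(mc_safe) == 0:
        raise ValueError("at least one question is required")

    n = len(mc_safe)
    # Assumed CS: share of questions where both formats give the same verdict.
    agree = sum(1 for m, o in zip(mc_safe, oe_safe) if m == o)
    # Assumed CSS: share of questions judged safe under both formats.
    both_safe = sum(1 for m, o in zip(mc_safe, oe_safe) if m and o)
    return agree / n, both_safe / n
```

Under these assumptions, a model that is safe on every open-ended question but unsafe on half of the matching multiple-choice questions would score 0.5 on both metrics, even though its open-ended safety rate alone is 1.0.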