56 research outputs found
Flames: Benchmarking Value Alignment of Chinese Large Language Models
The widespread adoption of large language models (LLMs) across various
regions underscores the urgent need to evaluate their alignment with human
values. Current benchmarks, however, fall short of effectively uncovering
safety vulnerabilities in LLMs. Despite numerous models achieving high scores
and 'topping the chart' in these evaluations, there is still a significant gap
in LLMs' deeper alignment with human values and achieving genuine harmlessness.
To this end, this paper proposes the first highly adversarial benchmark named
Flames, consisting of 2,251 manually crafted prompts, ~18.7K model responses
with fine-grained annotations, and a specified scorer. Our framework
encompasses both common harmlessness principles, such as fairness, safety,
legality, and data protection, and a unique morality dimension that integrates
specific Chinese values such as harmony. Based on the framework, we carefully
design adversarial prompts that incorporate complex scenarios and jailbreaking
methods, mostly with implicit malice. By prompting mainstream LLMs with such
adversarially constructed prompts, we obtain model responses, which are then
rigorously annotated for evaluation. Our findings indicate that all the
evaluated LLMs demonstrate relatively poor performance on Flames, particularly
in the safety and fairness dimensions. Claude emerges as the best-performing
model overall, but with its harmless rate being only 63.08% while GPT-4 only
scores 39.04%. The complexity of Flames has far exceeded existing benchmarks,
setting a new challenge for contemporary LLMs and highlighting the need for
further alignment of LLMs. To efficiently evaluate new models on the benchmark,
we develop a specified scorer capable of scoring LLMs across multiple
dimensions, achieving an accuracy of 77.4%. The Flames Benchmark is publicly
available on https://github.com/AIFlames/Flames
- …