Large Language Models are not Fair Evaluators

Cao, Yunbo; Chen, Liang; Li, Lei; Lin, Binghuai; Liu, Qi; Liu, Tianyu; Sui, Zhifang; Wang, Peiyi; Zhu, Dawei

Authors: Yunbo Cao, Liang Chen, Lei Li, Binghuai Lin, Qi Liu, Tianyu Liu, Zhifang Sui, Peiyi Wang, Dawei Zhu
Publication date: 29 May 2023

Abstract

We uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna could beat ChatGPT on 66 out of 80 tested queries. To address this issue, we propose two simple yet effective calibration strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple detailed pieces of evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score. Extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. To facilitate future research on more robust large language model comparison, we integrate the techniques in the paper into an easy-to-use toolkit, FairEval (https://github.com/i-Eval/FairEval), along with the human annotations.

Comment: work in progress
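
The Balanced Position Calibration idea described in the abstract can be illustrated with a minimal sketch. This is not the FairEval implementation: the `Judge` signature, the function names, and the toy judge are all hypothetical, and plain averaging over the two orders is an assumption about how results "across various orders" are aggregated. The toy judge stands in for an LLM call and simply favors whichever response appears first.

```python
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption, not the FairEval API):
# (query, first_response, second_response) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def balanced_position_scores(
    judge: Judge, query: str, resp_a: str, resp_b: str
) -> Tuple[float, float]:
    """Score the pair in both presentation orders and average per response."""
    a_first, b_second = judge(query, resp_a, resp_b)  # A shown first
    b_first, a_second = judge(query, resp_b, resp_a)  # B shown first
    score_a = (a_first + a_second) / 2
    score_b = (b_first + b_second) / 2
    return score_a, score_b


def verdict(score_a: float, score_b: float) -> str:
    """Turn a pair of scores into a win/tie/lose label."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def toy_position_biased_judge(
    query: str, first: str, second: str
) -> Tuple[float, float]:
    """Stand-in for an LLM evaluator: both responses get the same base
    score, plus a fixed bonus for the one shown first."""
    base, first_position_bonus = 8.0, 1.0
    return base + first_position_bonus, base


if __name__ == "__main__":
    q, a, b = "example query", "response A", "response B"
    # A single order lets position decide the winner.
    print(verdict(*toy_position_biased_judge(q, a, b)))
    print(verdict(*toy_position_biased_judge(q, b, a)))
    # Averaging over both orders cancels the fixed position bonus.
    print(verdict(*balanced_position_scores(toy_position_biased_judge, q, a, b)))
```

With a real evaluator the bias is unlikely to be a fixed additive bonus, so averaging over both orders would reduce it rather than cancel it exactly.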

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2305.17926

Last time updated on 02/06/2023