We uncover a systematic bias in the evaluation paradigm that adopts large
language models~(LLMs), e.g., GPT-4, as referees to score the quality of
responses generated by candidate models. We find that the quality ranking of
candidate responses can be easily hacked by simply altering their order of
appearance in the context. This manipulation allows us to skew the evaluation
result, making one model appear considerably superior to the other, e.g.,
Vicuna could beat ChatGPT on 66 of the 80 tested queries. To address this issue,
we propose two simple yet effective calibration strategies: 1) Multiple
Evidence Calibration, which requires the evaluator model to generate multiple
detailed pieces of evidence before assigning ratings; 2) Balanced Position
Calibration, which aggregates results across various orders to determine the
final score. Extensive experiments demonstrate that our approach successfully
mitigates evaluation bias, resulting in closer alignment with human judgments.
To facilitate future research on more robust large language model comparison,
we integrate the techniques in this paper into an easy-to-use toolkit,
\emph{FairEval}, along with the human
annotations.\footnote{\url{https://github.com/i-Eval/FairEval}}

Comment: work in progress
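
The idea behind Balanced Position Calibration can be illustrated with a minimal sketch. It assumes a pairwise judge that returns one score per response slot, and it assumes that a simple mean over the two presentation orders is an acceptable aggregation. The names \texttt{Judge}, \texttt{balanced\_position\_scores}, and \texttt{biased\_toy\_judge} are hypothetical and are not taken from the FairEval toolkit; the toy judge merely stands in for an LLM evaluator with a first-position preference.

```python
from statistics import mean
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption for this sketch):
# (query, response_in_slot_1, response_in_slot_2) -> (score_slot_1, score_slot_2)
Judge = Callable[[str, str, str], Tuple[float, float]]


def balanced_position_scores(
    judge: Judge, query: str, resp_a: str, resp_b: str
) -> Tuple[float, float]:
    """Score two responses in both orders and average per response."""
    # Order 1: A is shown first, B second.
    a_first, b_second = judge(query, resp_a, resp_b)
    # Order 2: B is shown first, A second.
    b_first, a_second = judge(query, resp_b, resp_a)
    # Averaging over both orders cancels a purely additive position bonus.
    return mean([a_first, a_second]), mean([b_first, b_second])


def biased_toy_judge(query: str, first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in for an LLM judge: the score is the response length,
    plus a fixed bonus for whichever response appears first."""
    return len(first) + 2.0, float(len(second))


if __name__ == "__main__":
    query = "Explain position bias."
    resp_a, resp_b = "aaaa", "bbbb"  # equal-quality responses under the toy judge
    print(biased_toy_judge(query, resp_a, resp_b))  # single order favors slot 1
    print(balanced_position_scores(biased_toy_judge, query, resp_a, resp_b))
```

With a single presentation order, the toy judge prefers whichever response is shown first; after averaging across both orders, the two equal-quality responses receive equal scores. A real evaluator's bias need not be additive, so averaging should be expected to reduce the effect rather than remove it.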