Collaborative Evaluation: Exploring the Synergy of Large Language Models
  and Humans for Open-ended Generation Evaluation

Bi, Wei; Cui, Leyang; Kong, Lingpeng; Li, Qintong

Authors: Wei Bi, Leyang Cui, Lingpeng Kong, Qintong Li
Publication date: 30 October 2023

Abstract

Humans are widely involved in the evaluation of open-ended natural language generation (NLG) tasks that demand creativity, as automatic metrics often exhibit weak correlations with human judgments. Large language models (LLMs) have recently emerged as a scalable and cost-effective alternative to human evaluations. However, both humans and LLMs have limitations, i.e., inherent subjectivity and unreliable judgments, particularly for open-ended tasks that require adaptable metrics tailored to diverse task requirements. To explore the synergy between humans and LLM-based evaluators and to address the challenge of inconsistent evaluation criteria in open-ended NLG tasks, we propose CoEval, a Collaborative Evaluation pipeline that involves designing a checklist of task-specific criteria and evaluating texts in detail. In both steps, an LLM generates the initial ideation and humans then scrutinize it. We conducted a series of experiments to investigate the mutual effects between LLMs and humans in CoEval. Results show that, by utilizing LLMs, CoEval effectively evaluates lengthy texts, saving significant time and reducing human evaluation outliers. Human scrutiny still plays a role, revising around 20% of LLM evaluation scores for ultimate reliability.

Comment: We release our resources at https://github.com/qtli/CoEval
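The two-stage flow described in the abstract (an LLM drafts, a human scrutinizes) can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not reflect the code in their repository. All names here (`co_evaluate`, `CriterionScore`, `revision_rate`, and the four callables) are hypothetical, and the LLM and human steps are passed in as plain functions so that no particular model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CriterionScore:
    """Score for one checklist criterion (hypothetical structure)."""
    criterion: str
    llm_score: int    # score drafted by the LLM
    final_score: int  # score after human scrutiny

    @property
    def revised(self) -> bool:
        # True when the human changed the LLM's draft score.
        return self.final_score != self.llm_score


def co_evaluate(
    text: str,
    propose_criteria: Callable[[], List[str]],
    review_criteria: Callable[[List[str]], List[str]],
    llm_score: Callable[[str, str], int],
    human_review: Callable[[str, str, int], Optional[int]],
) -> List[CriterionScore]:
    """Run an LLM-drafts, human-scrutinizes evaluation over one text.

    Assumption: human_review returns None to accept the LLM score,
    or an int to override it.
    """
    # Stage 1: the LLM drafts a checklist and a human edits it.
    checklist = review_criteria(propose_criteria())

    # Stage 2: the LLM scores each criterion and a human may revise.
    results = []
    for criterion in checklist:
        draft = llm_score(text, criterion)
        override = human_review(text, criterion, draft)
        final = draft if override is None else override
        results.append(CriterionScore(criterion, draft, final))
    return results


def revision_rate(results: List[CriterionScore]) -> float:
    """Fraction of LLM scores that the human revised."""
    if not results:
        return 0.0
    return sum(r.revised for r in results) / len(results)
```

In this sketch, `revision_rate` corresponds loosely to the kind of statistic the abstract reports (around 20% of LLM scores revised), though how the paper actually computes that figure is not stated here.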

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2310.19740

Last time updated on 18/01/2024