Humans are widely involved in the evaluation of open-ended natural language
generation tasks (NLG) that demand creativity, as automatic metrics often
exhibit weak correlations with human judgments. Large language models (LLMs)
recently have emerged as a scalable and cost-effective alternative to human
evaluations. However, both humans and LLMs have limitations, i.e., inherent
subjectivity and unreliable judgments, particularly for open-ended tasks that
require adaptable metrics tailored to diverse task requirements. To explore the
synergy between humans and LLM-based evaluators and address the challenges of
existing inconsistent evaluation criteria in open-ended NLG tasks, we propose a
Collaborative Evaluation pipeline CoEval, involving the design of a checklist
of task-specific criteria and the detailed evaluation of texts, in which LLM
generates initial ideation, and then humans engage in scrutiny. We conducted a
series of experiments to investigate the mutual effects between LLMs and humans
in CoEval. Results show that, by utilizing LLMs, CoEval effectively evaluates
lengthy texts, saving significant time and reducing human evaluation outliers.
Human scrutiny still plays a role, revising around 20% of LLM evaluation scores
for ultimate reliability.Comment: We release our resources at \url{https://github.com/qtli/CoEval