Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

Cieliebak, Mark; Deriu, Jan; Tuggener, Don; von Däniken, Pius

Authors: Mark Cieliebak
Jan Deriu
Don Tuggener
Pius von Däniken
Publication date: 6 June 2023

Abstract

A major challenge in the field of Text Generation is evaluation: human evaluations are cost-intensive, and automated metrics often display considerable disagreement with human judgments. In this paper, we propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics when used to generate preference rankings between system outputs. We show that existing automated metrics are generally over-confident in assigning significant differences between systems in this setting. However, our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics. We show that using this combination, we only require about 50% of the human annotations typically used in evaluations to arrive at robust and statistically significant results while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of our approach for three text generation tasks: dialogue systems, machine translation, and text summarization.

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2306.03866

Last time updated on 08/06/2023