Perspectives on Large Language Models for Relevance Judgment

Clarke, Charles; Demartini, Gianluca; Dietz, Laura; Faggioli, Guglielmo; Hagen, Matthias; Hauff, Claudia; Kando, Noriko; Kanoulas, Evangelos; Potthast, Martin; Stein, Benno; Wachsmuth, Henning

Authors: Charles Clarke
Gianluca Demartini
Laura Dietz
Guglielmo Faggioli
Matthias Hagen
Claudia Hauff
Noriko Kando
Evangelos Kanoulas
Martin Potthast
Benno Stein
Henning Wachsmuth
Publication date: 13 April 2023

Abstract

When asked, current large language models (LLMs) like ChatGPT claim that they can assist us with relevance judgments. Many researchers think this would not lead to credible IR research. In this perspective paper, we discuss possible ways for LLMs to assist human experts along with concerns and issues that arise. We devise a human-machine collaboration spectrum that allows categorizing different relevance judgment strategies, based on how much the human relies on the machine. For the extreme point of "fully automated assessment", we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing two opposing perspectives - for and against the use of LLMs for automatic relevance judgments - and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers. We hope to start a constructive discussion within the community to avoid a stalemate during review, where work is damned if it uses LLMs for evaluation and damned if it doesn't.

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2304.09161

Last time updated on 22/04/2023