Query performance prediction (QPP) aims to estimate the retrieval quality of
a search system for a query without human relevance judgments. Previous QPP
methods typically return a single scalar value and do not require the predicted
value to approximate a specific information retrieval (IR) evaluation measure,
which leads to two drawbacks: (i) a single scalar cannot accurately represent
different IR evaluation measures, especially when those measures do not
correlate highly with one another, and (ii) a single scalar limits the
interpretability of QPP methods, because a scalar alone is insufficient to
explain the QPP result.
To address these issues, we propose a QPP framework using automatically
generated relevance judgments (QPP-GenRE), which decomposes QPP into
independent subtasks of predicting the relevance of each item in a ranked list
to a given query. This design allows us to predict any IR evaluation measure
using the generated relevance judgments as pseudo-labels; it also allows us to
interpret the predicted measures and to identify, track, and rectify errors in
the generated judgments to improve QPP quality. We predict an
item's relevance by using open-source large language models (LLMs) to ensure
scientific reproducibility.
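
As an illustration of how per-item judgments translate into a predicted measure, the minimal sketch below computes reciprocal rank at a cutoff from LLM-generated relevance labels treated as pseudo-labels. It is not the paper's implementation; the `llm_judge` callable and `predict_rr_at_k` helper are hypothetical names introduced here for clarity.

```python
# Illustrative sketch: predict an IR measure (RR@k) from LLM-generated
# per-item relevance judgments used as pseudo-labels.
# `llm_judge` is a hypothetical callable returning 0/1 relevance for (query, doc).
from typing import Callable, List


def predict_rr_at_k(query: str,
                    ranked_docs: List[str],
                    llm_judge: Callable[[str, str], int],
                    k: int = 10) -> float:
    """Predict reciprocal rank@k using generated judgments as pseudo-labels."""
    for rank, doc in enumerate(ranked_docs[:k], start=1):
        if llm_judge(query, doc) > 0:   # first item the LLM judges relevant
            return 1.0 / rank
    return 0.0                          # no item in the top-k judged relevant
```

Because each per-item judgment is explicit, the prediction can be inspected item by item, which is what enables the error identification and rectification described above.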
We face two main challenges: (i) the excessive computational cost of judging an
entire corpus to predict a recall-oriented metric, and (ii) the limited
effectiveness of prompting open-source LLMs in a zero-/few-shot manner. To
address these challenges, we devise an approximation strategy for predicting
recall-oriented IR measures and propose fine-tuning open-source LLMs with
human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep
learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for
both lexical and neural rankers.
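
The abstract does not spell out the approximation strategy; one plausible reading, shown purely as a sketch and not as the paper's exact method, is to judge the ranked list only down to some depth n (with n much larger than the cutoff k) and to use the relevant items found there as a stand-in for all relevant items in the corpus. The depth parameter and the `llm_judge` callable are assumptions introduced for illustration.

```python
# Illustrative sketch: approximate a recall-oriented measure (recall@k)
# without judging the entire corpus, by judging only the top `judging_depth`
# items of the ranked list and treating the relevant items found there as a
# proxy for the corpus-wide set of relevant items.
from typing import Callable, List


def approx_recall_at_k(query: str,
                       ranked_docs: List[str],
                       llm_judge: Callable[[str, str], int],
                       k: int = 10,
                       judging_depth: int = 100) -> float:
    """Approximate recall@k by judging only the top `judging_depth` items."""
    judged = [llm_judge(query, doc) for doc in ranked_docs[:judging_depth]]
    total_relevant = sum(1 for label in judged if label > 0)  # proxy denominator
    if total_relevant == 0:
        return 0.0
    relevant_in_top_k = sum(1 for label in judged[:k] if label > 0)
    return relevant_in_top_k / total_relevant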