A retrieval model should not only interpolate the training data but also
extrapolate well to queries that differ from the training data.
While neural retrieval models have demonstrated impressive performance on
ad-hoc search benchmarks, we still know little about how they perform in terms
of interpolation and extrapolation. In this paper, we demonstrate the
importance of separately evaluating the two capabilities of neural retrieval
models. Firstly, we examine existing ad-hoc search benchmarks from the two
perspectives. We investigate the distribution of training and test data and
find a considerable overlap in query entities, query intent, and relevance
labels. This finding implies that evaluation on these test sets is biased
toward interpolation and cannot accurately reflect a model's extrapolation capacity.
Secondly, we propose a novel evaluation protocol that separately measures the
interpolation and extrapolation performance on existing benchmark datasets. It
resamples the training and test data based on query similarity and uses the
resampled data for training and evaluation.
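To make the protocol concrete, here is a minimal Python sketch of similarity-based resampling; it is an illustration under stated assumptions, not the paper's released implementation. It assumes queries are embedded with a sentence-transformers encoder and that a test query counts as extrapolation when its maximum cosine similarity to any training query falls below a threshold; the encoder name and threshold value are hypothetical choices.

```python
# Minimal sketch of similarity-based resampling (not the authors' exact code).
from sentence_transformers import SentenceTransformer

def split_by_similarity(train_queries, test_queries, threshold=0.5):
    """Partition test queries into interpolation/extrapolation subsets based
    on their maximum cosine similarity to any training query."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
    train_emb = model.encode(train_queries, normalize_embeddings=True)
    test_emb = model.encode(test_queries, normalize_embeddings=True)
    # With normalized embeddings, cosine similarity is a dot product.
    max_sim = (test_emb @ train_emb.T).max(axis=1)
    interpolation = [q for q, s in zip(test_queries, max_sim) if s >= threshold]
    extrapolation = [q for q, s in zip(test_queries, max_sim) if s < threshold]
    return interpolation, extrapolation

if __name__ == "__main__":
    train = ["who won the world cup 2018", "capital of france"]
    test = ["world cup 2018 winner", "average rainfall in peru"]
    interp, extrap = split_by_similarity(train, test)
    print("interpolation:", interp)
    print("extrapolation:", extrap)
```

The same thresholding idea can be applied in reverse to resample the training set, so that a model is trained only on queries dissimilar to the held-out test queries.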
Finally, we leverage the proposed protocol to comprehensively revisit a number
of widely adopted neural retrieval models. Results show that models perform differently
when moving from interpolation to extrapolation. For example,
representation-based retrieval models perform almost as well as
interaction-based retrieval models in terms of interpolation but not
extrapolation. Therefore, it is necessary to evaluate interpolation and
extrapolation performance separately, and the proposed resampling method
serves as a simple yet effective evaluation tool for future IR studies.