In the context of depth-k pooling for constructing web search test
collections, we compare two approaches to ordering pooled documents for
relevance assessors: the prioritisation strategy (PRI) used widely at NTCIR,
and the simple randomisation strategy (RND). To address research
questions regarding PRI and RND, we have constructed and released the WWW3E8
data set, which contains eight independent relevance labels for 32,375
topic-document pairs, i.e., a total of 259,000 labels. Four of the eight
relevance labels were obtained from PRI-based pools; the other four were
obtained from RND-based pools. Using WWW3E8, we compare PRI and RND in terms of
inter-assessor agreement, system ranking agreement, and robustness to new
systems that did not contribute to the pools. We also utilise an assessor
activity log we obtained as a byproduct of WWW3E8 to compare the two strategies
in terms of assessment efficiency.

Comment: 30 pages. This is a corrected version of an open-access TOIS paper
(https://dl.acm.org/doi/pdf/10.1145/3494833).