A Meta-Evaluation of C/W/L/A Metrics: System Ranking Similarity, System Ranking Consistency and Discriminative Power
Recently, Moffat et al. proposed an analytic framework, namely C/W/L/A, for
offline evaluation metrics. This framework allows information retrieval (IR)
researchers to design evaluation metrics through the flexible combination of
user browsing models and user gain aggregations. However, the statistical
stability of C/W/L/A metrics with different aggregations has not yet been
investigated. In this study, we investigate the statistical stability of
C/W/L/A metrics from three perspectives: (1) the system ranking similarity
among aggregations, (2) the system ranking consistency of aggregations, and (3)
the discriminative power of aggregations. More specifically, we combined
various aggregation functions with the browsing models of Precision, Discounted
Cumulative Gain (DCG), Rank-Biased Precision (RBP), INST, Average Precision
(AP) and Expected Reciprocal Rank (ERR), examining their performance in terms of
system ranking similarity, system ranking consistency and discriminative power
on two offline test collections. Our experimental results suggest that, in
terms of system ranking consistency and discriminative power, the expected
rate of gain (ERG) aggregation function performs strongly, while the maximum
relevance aggregation function usually performs poorly. The results also
suggest that Precision, DCG, RBP, INST and AP with their canonical
aggregations all perform well in terms of system ranking consistency and
discriminative power; for ERR, however, replacing its canonical aggregation
with ERG further strengthens discriminative power while yielding a system
ranking similar to that of the canonical version.
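For readers unfamiliar with the framework, the sketch below illustrates the general C/W/L idea under stated assumptions (it is not code from the paper): a browsing model supplies a continuation probability C(i) at each rank, the implied weight on rank i is proportional to the product of C(j) for j < i, and the ERG aggregation scores a ranking as the weighted sum of per-rank gains. The RBP browsing model (constant continuation probability p) is used as the example; the persistence value and function names are illustrative.

```python
from typing import Callable, Sequence


def erg_score(gains: Sequence[float], continuation: Callable[[int], float]) -> float:
    """ERG aggregation of per-rank gains under a C/W/L browsing model.

    continuation(i) is C(i) for 1-based rank i; weights are normalised over
    the finite ranking being scored, a common practical approximation.
    """
    view = 1.0                   # probability of reaching rank 1
    views = []
    for i in range(1, len(gains) + 1):
        views.append(view)
        view *= continuation(i)  # chance the user continues past rank i
    total = sum(views)
    return sum((v / total) * g for v, g in zip(views, gains))


# RBP browsing model: constant continuation probability p = 0.8 (illustrative value).
gains = [1.0, 0.0, 0.5, 1.0, 0.0]
print(erg_score(gains, lambda i: 0.8))
```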
An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric
Many evaluation metrics have been defined to evaluate the effectiveness of
ad-hoc retrieval and search result diversification systems. However, it is
often unclear which evaluation metric should be used to analyze the performance
of retrieval systems given a specific task. Axiomatic analysis is an
informative mechanism to understand the fundamentals of metrics and their
suitability for particular scenarios. In this paper, we define a
constraint-based axiomatic framework to study the suitability of existing
metrics in search result diversification scenarios. The analysis informed the
definition of Rank-Biased Utility (RBU) -- an adaptation of the well-known
Rank-Biased Precision metric -- that takes into account redundancy and the user
effort associated with the inspection of documents in the ranking. Our
experiments over standard diversity evaluation campaigns show that the proposed
metric captures quality criteria reflected by different metrics, making it
suitable in the absence of knowledge about particular features of the scenario
under study.
Comment: Original version: 10 pages. Preprint of full paper to appear at
SIGIR'18: The 41st International ACM SIGIR Conference on Research &
Development in Information Retrieval, July 8-12, 2018, Ann Arbor, MI, USA.
ACM, New York, NY, USA.
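As an illustration only (this is not the RBU definition from the paper), the toy score below combines the two ingredients the abstract highlights: redundancy, so a subtopic contributes less each time it is re-covered, and an RBP-style top-heavy discount standing in for the effort of inspecting documents deeper in the ranking. All names and parameter values are hypothetical.

```python
def toy_diversity_score(subtopics_per_doc, p=0.8, redundancy_decay=0.5):
    """subtopics_per_doc: one set of covered subtopic ids per ranked document."""
    seen = {}                                           # subtopic id -> times already covered
    score = 0.0
    for rank, subtopics in enumerate(subtopics_per_doc, start=1):
        gain = 0.0
        for s in subtopics:
            gain += redundancy_decay ** seen.get(s, 0)  # repeated coverage is worth less
            seen[s] = seen.get(s, 0) + 1
        score += (1 - p) * p ** (rank - 1) * gain       # RBP-style positional discount
    return score


print(toy_diversity_score([{"a", "b"}, {"a"}, {"c"}]))
```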
Relevance Assessments for Web Search Evaluation: Should We Randomise or Prioritise the Pooled Documents? (CORRECTED VERSION)
In the context of depth-k pooling for constructing web search test
collections, we compare two approaches to ordering pooled documents for
relevance assessors: the prioritisation strategy (PRI) used widely at NTCIR,
and the simple randomisation strategy (RND). In order to address research
questions regarding PRI and RND, we have constructed and released the WWW3E8
data set, which contains eight independent relevance labels for 32,375
topic-document pairs, i.e., a total of 259,000 labels. Four of the eight
relevance labels were obtained from PRI-based pools; the other four were
obtained from RND-based pools. Using WWW3E8, we compare PRI and RND in terms of
inter-assessor agreement, system ranking agreement, and robustness to new
systems that did not contribute to the pools. We also utilise an assessor
activity log we obtained as a byproduct of WWW3E8 to compare the two strategies
in terms of assessment efficiency.
Comment: 30 pages. This is a corrected version of an open-access TOIS paper
(https://dl.acm.org/doi/pdf/10.1145/3494833).
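To make the setting concrete, the sketch below shows depth-k pooling (the union of the top k documents from each run) together with one randomised and one prioritised ordering of the pooled documents. The prioritisation heuristic shown (more contributing runs first, then smaller summed rank) is only an assumption for illustration, not a claim about the exact NTCIR PRI ordering.

```python
import random
from collections import defaultdict


def depth_k_pool(runs, k):
    """runs: dict run_id -> ranked list of doc ids. Returns per-doc pool statistics."""
    votes, rank_sum = defaultdict(int), defaultdict(int)
    for ranking in runs.values():
        for rank, doc in enumerate(ranking[:k], start=1):
            votes[doc] += 1        # how many runs returned the doc in their top k
            rank_sum[doc] += rank  # sum of ranks at which it was returned
    return votes, rank_sum


def pri_order(votes, rank_sum):
    # Docs returned by more runs first; smaller summed rank breaks ties.
    return sorted(votes, key=lambda d: (-votes[d], rank_sum[d]))


def rnd_order(votes, seed=0):
    docs = list(votes)
    random.Random(seed).shuffle(docs)
    return docs


runs = {"run1": ["d1", "d2", "d3"], "run2": ["d2", "d1", "d4"]}
votes, rank_sum = depth_k_pool(runs, k=3)
print(pri_order(votes, rank_sum), rnd_order(votes))
```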
Statistical reform in information retrieval
IR revolves around evaluation. Therefore, IR researchers should employ sound evaluation practices. Nowadays many of us know that statistical significance testing is not enough, but not all of us know exactly what to do about it. This paper provides suggestions on how to report effect sizes and confidence intervals along with p-values, in the context of comparing IR systems using test collections. Hopefully, these practices will make IR papers more informative, and help researchers form more reliable conclusions that "add up." Finally, I pose a specific question for the IR community: should IR journal editors and SIGIR PC chairs require (rather than encourage) the reporting of effect sizes and confidence intervals?
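A minimal sketch of what such reporting might look like, assuming paired per-topic scores for two systems being compared on the same test collection; the standardised mean difference used as the effect size here is one common choice, not necessarily the exact estimator the paper recommends.

```python
import numpy as np
from scipy import stats


def compare_systems(scores_a, scores_b, alpha=0.05):
    """Report p-value, effect size, and a CI for the mean per-topic difference."""
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    d = a - b                                # per-topic score differences
    n = len(d)
    mean, sd = d.mean(), d.std(ddof=1)
    t_stat, p_value = stats.ttest_rel(a, b)  # paired t-test
    effect_size = mean / sd                  # standardised mean difference
    half_width = stats.t.ppf(1 - alpha / 2, n - 1) * sd / np.sqrt(n)
    ci = (mean - half_width, mean + half_width)
    return p_value, effect_size, ci


# e.g. nDCG@10 for two systems on the same five topics (made-up numbers)
print(compare_systems([0.42, 0.55, 0.31, 0.60, 0.48],
                      [0.38, 0.50, 0.35, 0.52, 0.45]))
```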