Recently, Moffat et al. proposed C/W/L/A, an analytic framework for offline
evaluation metrics. This framework allows information retrieval (IR)
researchers to design evaluation metrics through the flexible combination of
user browsing models and user gain aggregations. However, the statistical
stability of C/W/L/A metrics with different aggregations has not yet been
investigated. In this study, we investigate the statistical stability of
C/W/L/A metrics from the perspective of: (1) the system ranking similarity
among aggregations, (2) the system ranking consistency of aggregations and (3)
the discriminative power of aggregations. More specifically, we combined
various aggregation functions with the browsing models of Precision, Discounted
Cumulative Gain (DCG), Rank-Biased Precision (RBP), INST, Average Precision
(AP) and Expected Reciprocal Rank (ERR), examining their performance in terms of
system ranking similarity, system ranking consistency and discriminative power
on two offline test collections. Our experimental results suggest that, in
terms of system ranking consistency and discriminative power, the expected rate
of gain (ERG) aggregation function performs outstandingly, while the
maximum-relevance aggregation function usually performs poorly. The results
also suggest that Precision, DCG, RBP, INST and AP
with their canonical aggregations all perform favourably in system ranking
consistency and discriminative power; for ERR, however, replacing its canonical
aggregation with ERG further strengthens its discriminative power while
yielding a system ranking similar to that of the canonical version.
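For readers unfamiliar with the C/W/L/A formulation, the minimal Python sketch below illustrates how an aggregation function combines with a browsing model: the continuation probabilities C(i) induce the examination weights W(i) and stopping probabilities L(i), and the aggregation then turns per-rank gains into a single score. It assumes RBP's constant continuation probability, a truncated ranking with renormalised weights, and one plausible reading of the maximum-relevance aggregation (expected maximum gain seen before stopping); the function names, parameter values and gain vector are illustrative only and are not the paper's implementation.

```python
import numpy as np

def rbp_continuation(n, p=0.8):
    """RBP browsing model: constant continuation probability C(i) = p."""
    return np.full(n, p)

def weights_from_continuation(C):
    """W(i) proportional to prod_{j<i} C(j), renormalised over the finite ranking."""
    W = np.ones(len(C))
    W[1:] = np.cumprod(C[:-1])
    return W / W.sum()

def erg(gains, C):
    """Expected rate of gain aggregation: sum_i W(i) * g(i)."""
    W = weights_from_continuation(np.asarray(C, dtype=float))
    return float(np.dot(W, gains))

def max_relevance(gains, C):
    """Maximum-relevance aggregation (one plausible reading): expected maximum
    gain seen before stopping, with L(i) = W(i) * (1 - C(i)) as the stopping
    distribution, renormalised over the finite ranking."""
    C = np.asarray(C, dtype=float)
    W = weights_from_continuation(C)
    L = W * (1.0 - C)
    L = L / L.sum()
    running_max = np.maximum.accumulate(np.asarray(gains, dtype=float))
    return float(np.dot(L, running_max))

gains = np.array([0.0, 1.0, 0.5, 0.0, 1.0])   # toy graded gains for one ranking
C = rbp_continuation(len(gains), p=0.8)
print(erg(gains, C), max_relevance(gains, C))
```

Swapping rbp_continuation for another browsing model's C(i), or erg for another aggregation, gives the other metric variants compared in the study.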