6,242 research outputs found

Users' effectiveness and satisfaction for image retrieval

Author: Al-Maskari A.
Clough P.
Sanderson M.
Publication date: 01/01/2006

This paper presents results from an initial user study exploring the relationship between system effectiveness, as quantified by traditional measures such as precision and recall, and users’ effectiveness and satisfaction with the results. The search tasks are recall-based and involve finding images. It was concluded that no direct relationship between system effectiveness and users’ performance could be proven (as shown by previous research). People learn to adapt to a system regardless of its effectiveness. This study recommends a combination of attributes (e.g. system effectiveness, user performance and satisfaction) as a more effective way to evaluate interactive retrieval systems. The results also reveal that users are more concerned with the accuracy than with the coverage of the search results.
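The two system-side measures named in this abstract can be sketched in a few lines. This is a minimal illustration of set-based precision and recall, not the study's own evaluation code; the function name and the image IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against a set of relevant items."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: share of retrieved items that are relevant ("accuracy" of the results).
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant items that were retrieved ("coverage" of the results).
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical image IDs, for illustration only.
retrieved = ["img1", "img2", "img3", "img4"]
relevant = {"img2", "img4", "img7", "img9", "img10"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In the abstract's terms, precision corresponds to the accuracy users cared about most, and recall to coverage.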

CiteSeerX

RMIT Research Repository

White Rose Research Online

Evaluating epistemic uncertainty under incomplete assessments

Author: Barry
Blair
Harter
Hull
Ian Ruthven
Ingwersen
Järvelin
Leif Azzopardi
Mark Baillie
Popper
Ruthven
Salton
Saracevic
Savoy
Schamber
Soboroff
Swanson
Van Rijsbergen
Voorhees
Wallis
Publication venue: 'Elsevier BV'
Publication date: 01/01/2007

This study proposes an extended methodology for laboratory-based Information Retrieval evaluation under incomplete relevance assessments. The methodology aims to identify potential uncertainty during system comparison that may result from incompleteness. Adopting it is advantageous because detecting epistemic uncertainty (the amount of knowledge, or ignorance, we have about the estimate of a system's performance) during the evaluation process can guide researchers when evaluating new systems over existing and future test collections. Across a series of experiments we demonstrate how this methodology can lead towards a finer-grained analysis of systems. In particular, we show through experimentation how the current practice in Information Retrieval evaluation of using a measurement depth larger than the pooling depth increases uncertainty during system comparison.
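One simple way to see why a measurement depth larger than the pooling depth adds uncertainty is to bound a metric by treating unjudged documents first as non-relevant, then as relevant. The sketch below does this for precision at depth k. It is an illustration only and is not the methodology proposed in the paper; the function name, document IDs and judgments are hypothetical.

```python
def precision_bounds(ranking, judgments, k):
    """Return (lower, upper) bounds on precision@k under incomplete judgments.

    judgments maps document ID -> 1 (relevant) or 0 (non-relevant);
    documents missing from the mapping were never assessed.
    """
    top = ranking[:k]
    relevant = sum(1 for doc in top if judgments.get(doc) == 1)
    unjudged = sum(1 for doc in top if doc not in judgments)
    lower = relevant / k               # assume every unjudged document is non-relevant
    upper = (relevant + unjudged) / k  # assume every unjudged document is relevant
    return lower, upper


# Hypothetical run: only the top 3 documents were pooled and judged.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
judgments = {"d1": 1, "d2": 0, "d3": 1}

at_pool_depth = precision_bounds(ranking, judgments, k=3)
beyond_pool_depth = precision_bounds(ranking, judgments, k=6)
print(at_pool_depth, beyond_pool_depth)
```

At the pooling depth the two bounds coincide. Beyond it, the interval widens with every unjudged document that enters the measured prefix.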

CiteSeerX

Crossref

University of Strathclyde Institutional Repository

Enlighten

Human assessments of document similarity

Author: Belkin
Belz
Cavnar
Damashek
Flesch
Fox
Furnas
Gardenfors
Haenggi
Harman
Hjørland
Johnson-Laird
Järvelin
Landauer
Lee
Lin
Lund
Miller
Morris
Resnik
Salton
Saracevic
Skupin
Voorhees
Westerman
Publication venue: 'Wiley'
Publication date: 01/01/2010

Two studies are reported that examined the reliability of human assessments of document similarity and the association between human ratings and the results of n-gram automatic text analysis (ATA). Human interassessor reliability (IAR) was moderate to poor. However, correlations between average human ratings and n-gram solutions were strong. The average correlation between ATA and individual human solutions was greater than IAR. N-gram length influenced the strength of association, but the optimum string length depended on the nature of the text (technical vs. nontechnical). We conclude that the methodology applied in previous studies may have led to overoptimistic views on human reliability, but that an optimal n-gram solution can provide a good approximation of the average human assessment of document similarity, a result that has important implications for the future development of document visualization systems.
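To make the idea of n-gram text analysis concrete, the sketch below compares two texts by the cosine similarity of their character n-gram counts. The choice of character n-grams and of cosine similarity is an assumption made for illustration; the abstract does not specify the ATA procedure used in the studies, and the sample texts are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents; n is the n-gram length whose effect the abstract discusses.
doc_a = "Retrieval systems rank documents by estimated relevance."
doc_b = "Ranking documents by relevance is the core of retrieval."
for n in (2, 3, 5):
    score = cosine_similarity(char_ngrams(doc_a, n), char_ngrams(doc_b, n))
    print(n, round(score, 3))
```

Varying `n` here mirrors the abstract's point that n-gram length influences the strength of association with human ratings.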

Crossref

University of Gloucestershire Research Repository

Brunel University Research Archive

5 − 4 ≠ 4 − 3: On the Uneven Gaps between Different Levels of Graded User Satisfaction in Interactive Information Retrieval Evaluation

Author: Han Fangyuan
Liu Jiqun
Publication date: 03/01/2023

Similar to other ground truth measures, graded user satisfaction has frequently been employed as a continuous variable in information retrieval evaluation, based on the assumption that intervals between adjacent grades are quantitatively equal. To examine the validity of the equal-gap assumption and explore dynamic perceptual thresholds triggering grade changes in search evaluation, we investigate the extent to which users are sensitive to changes in search efforts and outcomes across different gaps of graded satisfaction. Experiments on four user study datasets (15,337 queries) indicate that 1) user satisfaction sensitivity, especially to offline evaluation metrics, changes significantly across gaps in the satisfaction scale; 2) the size and direction of changes in sensitivity vary across study settings, search types, and intentions, especially within the “3-5” scale subrange. This study speaks to the fundamentals of user-centered evaluation and advances the knowledge of heterogeneity in satisfaction sensitivity to search efforts and gains and of implicit changes in evaluation thresholds.

ScholarSpace at University of Hawai'i at Manoa

A Meta-Evaluation of C/W/L/A Metrics: System Ranking Similarity, System Ranking Consistency and Discriminative Power

Author: Chen Nuo
Sakai Tetsuya
Publication date: 06/07/2023

Recently, Moffat et al. proposed an analytic framework, namely C/W/L/A, for offline evaluation metrics. This framework allows information retrieval (IR) researchers to design evaluation metrics through the flexible combination of user browsing models and user gain aggregations. However, the statistical stability of C/W/L/A metrics with different aggregations has not yet been investigated. In this study, we investigate the statistical stability of C/W/L/A metrics from the perspective of: (1) the system ranking similarity among aggregations, (2) the system ranking consistency of aggregations and (3) the discriminative power of aggregations. More specifically, we combined various aggregation functions with the browsing models of Precision, Discounted Cumulative Gain (DCG), Rank-Biased Precision (RBP), INST, Average Precision (AP) and Expected Reciprocal Rank (ERR), examining their performance in terms of system ranking similarity, system ranking consistency and discriminative power on two offline test collections. Our experimental results suggest that, in terms of system ranking consistency and discriminative power, the aggregation function of expected rate of gain (ERG) performs outstandingly, while the aggregation function of maximum relevance usually performs insufficiently. The results also suggest that Precision, DCG, RBP, INST and AP with their canonical aggregations all perform favourably in system ranking consistency and discriminative power; for ERR, however, replacing its canonical aggregation with ERG can further strengthen the discriminative power while yielding a system ranking list similar to the canonical version.
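As a rough illustration of combining a browsing model with the expected-rate-of-gain (ERG) aggregation, the sketch below scores one ranking under two browsing models, Precision at k and RBP. The per-rank weights are written in their familiar closed forms rather than derived from a continuation function, which is a simplification of the framework; the function names and the gain vector are hypothetical.

```python
def rbp_weights(p, depth):
    """RBP browsing model: the user moves from one rank to the next with probability p."""
    return [(1 - p) * p ** i for i in range(depth)]


def precision_weights(k, depth):
    """Precision@k browsing model: equal attention on the first k ranks, none afterwards."""
    return [1 / k if i < k else 0.0 for i in range(depth)]


def expected_rate_of_gain(gains, weights):
    """ERG aggregation: gain at each rank weighted by the share of attention it receives."""
    return sum(w * g for w, g in zip(weights, gains))


# Hypothetical binary gains for a ranked list of three documents.
gains = [1, 0, 1]
rbp_score = expected_rate_of_gain(gains, rbp_weights(0.5, len(gains)))
precision_score = expected_rate_of_gain(gains, precision_weights(3, len(gains)))
print(rbp_score, precision_score)
```

Swapping the weight function while keeping the aggregation fixed, or the reverse, is the kind of combination the abstract describes.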

arXiv.org e-Print Archive

Joint Upper & Lower Bound Normalization for IR Evaluation

Author: Feng Dongji
Santu Shubhra Kanti Karmaker
Publication date: 20/09/2022

In this paper, we present a novel perspective on IR evaluation by proposing a new family of evaluation metrics in which existing popular metrics (e.g., nDCG, MAP) are customized by introducing a query-specific lower-bound (LB) normalization term. While the original metrics (nDCG, MAP, etc.) are normalized in terms of their upper bounds based on an ideal ranked list, a corresponding LB normalization for them has not yet been studied. Specifically, we introduce two variants of the proposed LB normalization, where the lower bound is estimated from a randomized ranking of the corresponding documents present in the evaluation set. We then conducted two case studies by instantiating the new framework for two popular IR evaluation metrics (with two variants, e.g., DCG_UL_V1,2 and MSP_UL_V1,2) and comparing them against the traditional metrics without the proposed LB normalization. Experiments on two different data-sets with eight Learning-to-Rank (LETOR) methods demonstrate the following properties of the new LB normalized metrics: 1) Statistically significant differences (between two methods) in terms of the original metric no longer remain statistically significant in terms of the Upper Lower (UL) Bound normalized version and vice-versa, especially for uninformative query-sets. 2) When compared against the original metric, our proposed UL normalized metrics demonstrate higher Discriminatory Power and better Consistency across different data-sets. These findings suggest that the IR community should consider UL normalization seriously when computing nDCG and MAP, and that a more in-depth study of UL normalization for general IR evaluation is warranted.
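The general idea of adding a lower bound to an upper-bound-normalized metric can be sketched as a min-max style rescaling of DCG. The form below, (DCG − LB) / (IDCG − LB), with LB taken as the expected DCG under a uniformly random permutation and with linear gains, is an assumption chosen for illustration. It is not claimed to match either of the paper's two variants, and all function names are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def expected_random_dcg(gains):
    """Expected DCG over all permutations: every rank receives the mean gain."""
    if not gains:
        return 0.0
    mean_gain = sum(gains) / len(gains)
    return mean_gain * sum(1 / math.log2(rank + 1) for rank in range(1, len(gains) + 1))


def ndcg(gains):
    """Conventional nDCG: normalized by the upper bound (ideal ranking) only."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def ul_ndcg(gains):
    """Assumed upper/lower-bound form: 1 for the ideal ranking, 0 at the random baseline."""
    upper = dcg(sorted(gains, reverse=True))
    lower = expected_random_dcg(gains)
    if upper - lower <= 0:
        return 0.0  # uninformative query: all documents carry the same gain
    return (dcg(gains) - lower) / (upper - lower)


# Hypothetical graded relevance labels, listed in ranked order.
worst_first = [0, 1, 2, 3]
print(ndcg(worst_first), ul_ndcg(worst_first))
```

Under this assumed form, a ranking that does worse than the random baseline receives a negative score, whereas conventional nDCG still assigns it a positive value.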

arXiv.org e-Print Archive

Overview of the TREC 2014 Federated Web Search Track

Author: Demeester Thomas
Hiemstra Djoerd
Nguyen Dong-Phuong
Trieschnigg Rudolf Berend
Zhou Ke
Publication date: 01/11/2014

The TREC Federated Web Search track facilitates research in topics related to federated web search by providing a large, realistic data collection sampled from a multitude of online search engines. The FedWeb 2013 challenges of Resource Selection and Results Merging are again included in FedWeb 2014, and we additionally introduced the task of vertical selection. Other new aspects are the required link between Resource Selection and Results Merging, and the importance of diversity in the merged results. After an overview of the new data collection and relevance judgments, the individual participants’ results for the tasks are introduced, analyzed, and compared.

University of Twente Research Information