Unbiased Comparative Evaluation of Ranking Functions
Eliciting relevance judgments for ranking evaluation is labor-intensive and
costly, motivating careful selection of which documents to judge. Unlike
traditional approaches that make this selection deterministically,
probabilistic sampling has shown intriguing promise since it enables the design
of estimators that are provably unbiased even when reusing data with missing
judgments. In this paper, we first unify and extend these sampling approaches
by viewing the evaluation problem as a Monte Carlo estimation task that applies
to a large number of common IR metrics. Drawing on the theoretical clarity that
this view offers, we tackle three practical evaluation scenarios: comparing two
systems, comparing systems against a baseline, and ranking systems. For
each scenario, we derive an estimator and a variance-optimizing sampling
distribution while retaining the strengths of sampling-based evaluation,
including unbiasedness, reusability despite missing data, and ease of use in
practice. In addition to the theoretical contribution, we empirically evaluate
our methods against previously used sampling heuristics and find that they
generally cut the number of required relevance judgments at least in half.
Comment: Under review; 10 pages
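The abstract does not spell out its estimators, but the Monte Carlo view it describes is commonly realized with a Horvitz-Thompson-style correction. The following minimal sketch is our own illustration, not the paper's method: the `rank_weight` discount and input format are hypothetical stand-ins, while the paper's actual contribution lies in the variance-optimizing sampling distributions. The key idea shown here is that a document judged with inclusion probability p contributes its gain divided by p, so the estimate stays unbiased despite missing judgments.

```python
def ht_metric_estimate(ranking, incl_prob, judgments,
                       rank_weight=lambda r: 1.0 / r):
    """Horvitz-Thompson estimate of a rank-weighted additive metric.

    ranking:     doc ids in ranked order
    incl_prob:   dict doc -> probability the doc was selected for judging
    judgments:   dict doc -> relevance grade; docs absent here were
                 simply never sampled (not assumed nonrelevant)
    rank_weight: per-rank discount defining the metric (hypothetical
                 1/rank default; swap in a DCG- or RBP-style discount)
    """
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in judgments:
            # Dividing by the inclusion probability makes the expectation
            # over random judgment samples equal the true metric value.
            total += rank_weight(rank) * judgments[doc] / incl_prob[doc]
    return total
```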
Evaluating epistemic uncertainty under incomplete assessments
This study proposes an extended methodology for laboratory-based Information Retrieval evaluation under incomplete relevance assessments. The methodology aims to identify potential uncertainty during system comparison that may result from incompleteness. Adopting it is advantageous because the detection of epistemic uncertainty - the amount of knowledge (or ignorance) we have about the estimate of a system's performance - during the evaluation process can guide and direct researchers when evaluating new systems over existing and future test collections. Across a series of experiments we demonstrate how this methodology can lead towards a finer-grained analysis of systems. In particular, we show through experimentation how the current practice in Information Retrieval evaluation of using a measurement depth larger than the pooling depth increases uncertainty during system comparison.
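The abstract leaves the uncertainty analysis abstract. One simple way to picture the idea (our illustration under simplifying assumptions, not the study's actual protocol) is to bound a metric by the best and worst cases the unjudged documents permit:

```python
def precision_at_k_bounds(ranking, judgments, k=10):
    """Bounds on precision@k when some top-k documents are unjudged.

    Unjudged documents count as nonrelevant for the lower bound and
    as relevant for the upper bound; the width of the interval is a
    crude measure of epistemic uncertainty about the true score.
    """
    top_k = ranking[:k]
    relevant = sum(1 for d in top_k if judgments.get(d) == 1)
    unjudged = sum(1 for d in top_k if d not in judgments)
    return relevant / k, (relevant + unjudged) / k
```

Under this reading, two systems can be compared with confidence only when their intervals do not overlap, and measuring deeper than the pooling depth widens the intervals because more of the measured ranks are unjudged.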
Anticipating Information Needs Based on Check-in Activity
In this work we address the development of a smart personal assistant that is
capable of anticipating a user's information needs based on a novel type of
context: the person's activity inferred from her check-in records on a
location-based social network. Our main contribution is a method that
translates a check-in activity into an information need, which is in turn
addressed with an appropriate information card. This task is challenging
because of the large number of possible activities and related information
needs, which need to be addressed in a mobile dashboard that is limited in
size. Our approach considers each possible activity that might follow after the
last (and already finished) activity, and selects the top information cards
such that they maximize the likelihood of satisfying the user's information
needs for all possible future scenarios. The proposed models also incorporate
knowledge about the temporal dynamics of information needs. Using a combination
of historical check-in data and manual assessments collected via crowdsourcing,
we show experimentally the effectiveness of our approach.
Comment: Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM '17), 2017
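The selection objective described above can be pictured as an expected-coverage problem. The sketch below is an illustrative simplification (the names and the independence assumption are ours, and the paper's models additionally capture temporal dynamics): each candidate card is scored by the probability mass of the possible next activities whose needs it addresses, and the top-scoring cards fill the size-limited dashboard.

```python
def select_cards(next_activity_probs, card_covers, k=3):
    """Choose k information cards for a size-limited mobile dashboard.

    next_activity_probs: dict activity -> P(activity follows the last,
                         already finished one)
    card_covers:         dict card -> set of activities whose
                         information needs the card addresses
    Cards are ranked by the expected probability of satisfying the
    user's information need across all possible future scenarios.
    """
    score = {card: sum(next_activity_probs.get(a, 0.0) for a in acts)
             for card, acts in card_covers.items()}
    return sorted(score, key=score.get, reverse=True)[:k]
```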
A retrieval evaluation methodology for incomplete relevance assessments
In this paper we propose an extended methodology for laboratory-based Information Retrieval evaluation under incomplete relevance assessments. This new protocol aims to identify potential uncertainty during system comparison that may result from incompleteness. We demonstrate how this methodology can lead towards a finer-grained analysis of systems. This is advantageous because the detection of uncertainty during the evaluation process can guide and direct researchers when evaluating new systems over existing and future test collections.
A collaborative approach to IR evaluation
In this thesis we investigate two main problems: 1) inferring consensus from disparate inputs to improve the quality of crowd-contributed data; and 2) developing a reliable crowd-aided IR evaluation framework.
With regard to the first contribution: while many statistical label aggregation methods have been proposed, little comparative benchmarking has occurred in the community, making it difficult to determine the state of the art in consensus or to quantify novelty and progress, and leaving modern systems to adopt simple control strategies. To aid the progress of statistical consensus methods and make the state of the art accessible, we develop SQUARE, an open-source shared-task benchmarking framework that includes benchmark datasets, defined tasks, standard metrics, and reference implementations with empirical results for several popular methods. Through the development of SQUARE we propose a crowd simulation model that emulates real crowd environments, enabling rapid and reliable experimentation with collaborative methods under different crowd contributions. We apply the findings of the benchmark to develop reliable crowd-contributed test collections for IR evaluation.
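For reference, the "simple control strategies" such a benchmark compares against are often no more than majority voting. A minimal sketch, with a hypothetical input format:

```python
from collections import Counter

def majority_vote(worker_labels):
    """Baseline consensus: per-item majority vote over crowd labels.

    worker_labels: dict item -> list of labels from different workers.
    Statistical methods of the kind benchmarked in SQUARE replace
    this with models that also estimate per-worker reliability.
    """
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in worker_labels.items()}
```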
As our second contribution, we describe a collaborative model for distributing relevance judging tasks between trusted assessors and crowd judges. Building on prior work's hypothesis that assessors disagree most on borderline documents, we train a logistic regression model to predict assessor disagreement and prioritize judging tasks by expected disagreement. Judgments are generated from different crowd models and intelligently aggregated. Given a priority queue, a judging budget, and a ratio of expert to crowd judging costs, critical judging tasks are assigned to trusted assessors, with the crowd supplying the remaining judgments. Results on two TREC datasets show that a significant judging burden can be confidently shifted to the crowd, achieving high rank correlation, often at lower cost than exclusive use of trusted assessors.
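A minimal sketch of the budgeted routing step, under our own simplifying assumptions (fixed per-judgment costs and a disagreement score already predicted for each task; the thesis's crowd models and aggregation of crowd judgments are omitted):

```python
def assign_judging_tasks(tasks, disagreement, budget,
                         expert_cost=1.0, crowd_cost=0.1):
    """Split judging tasks between trusted assessors and the crowd.

    tasks:        (topic, doc) pairs needing relevance judgments
    disagreement: dict task -> predicted probability of assessor
                  disagreement (e.g. from a logistic regression)
    Tasks are ordered by predicted disagreement; the most contentious
    go to experts until the budget forces the rest onto the crowd.
    """
    queue = sorted(tasks, key=lambda t: disagreement[t], reverse=True)
    # Every task costs at least a crowd judgment; spend the surplus
    # on upgrading the most contentious tasks to expert judgments.
    surplus = budget - crowd_cost * len(queue)
    n_expert = max(0, int(surplus // (expert_cost - crowd_cost)))
    n_expert = min(n_expert, len(queue))
    return queue[:n_expert], queue[n_expert:]  # (expert, crowd) tasks
```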
A Meta-Evaluation of C/W/L/A Metrics: System Ranking Similarity, System Ranking Consistency and Discriminative Power
Recently, Moffat et al. proposed an analytic framework, namely C/W/L/A, for
offline evaluation metrics. This framework allows information retrieval (IR)
researchers to design evaluation metrics through the flexible combination of
user browsing models and user gain aggregations. However, the statistical
stability of C/W/L/A metrics under different aggregations has not yet been
investigated. In this study, we examine the statistical stability of
C/W/L/A metrics from three perspectives: (1) the system ranking similarity
among aggregations, (2) the system ranking consistency of aggregations, and (3)
the discriminative power of aggregations. More specifically, we combined
various aggregation functions with the browsing models of Precision, Discounted
Cumulative Gain (DCG), Rank-Biased Precision (RBP), INST, Average Precision
(AP), and Expected Reciprocal Rank (ERR), examining their performance in terms
of system ranking similarity, system ranking consistency, and discriminative
power on two offline test collections. Our experimental results suggest that,
in terms of system ranking consistency and discriminative power, the
aggregation function of expected rate of gain (ERG) performs outstandingly
well, while the aggregation function of maximum relevance usually performs
poorly. The results also suggest that Precision, DCG, RBP, INST, and AP with
their canonical aggregations all perform favourably in system ranking
consistency and discriminative power; for ERR, however, replacing its canonical
aggregation with ERG can further strengthen the discriminative power while
yielding a system ranking similar to that of the canonical version.
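As a concrete reading of the framework, the ERG aggregation scores a ranking by the gain at each rank weighted by how much of the user's attention the browsing model places there. The sketch below is our own illustration over a truncated list (the normalization is exact only in the infinite-list limit), showing that a constant continuation probability recovers Rank-Biased Precision:

```python
def cwl_erg_score(gains, continuation):
    """Expected rate of gain (ERG) for one ranked list under C/W/L.

    gains:        per-rank gains g_1..g_n
    continuation: C(i), the probability that a user at rank i
                  continues to rank i+1 (the browsing model)
    Rank i receives weight W(i) proportional to the probability
    V(i) = C(1) * ... * C(i-1) that the user reaches rank i.
    """
    view_prob, views = 1.0, []
    for i in range(1, len(gains) + 1):
        views.append(view_prob)
        view_prob *= continuation(i)
    return sum(v * g for v, g in zip(views, gains)) / sum(views)

# Constant C(i) = p yields Rank-Biased Precision with persistence p:
rbp_example = cwl_erg_score([1, 0, 1, 1, 0], lambda i: 0.8)
```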
Private Investment in R&D to Signal Ability to Perform Government Contracts
Official government statistics on the "mission-distribution" of U.S. R&D investment are based on the assumption that only the government sponsors military R&D. In this paper we advance and test the alternative hypothesis that a significant share of privately-financed industrial R&D is military in orientation. We argue that in addition to (and prior to) contracting with firms to perform military R&D, the government deliberately encourages firms to sponsor defense research at their own expense, to enable the government to identify the firms most capable of performing certain government contracts, particularly those for major weapons systems. To test the hypothesis of, and estimate the quantity of, private investment in 'signaling' R&D, we estimate variants of a model of company R&D expenditure on longitudinal, firm-level data, including detailed data on federal contracts. Our estimates imply that about 30 percent of U.S. private industrial R&D expenditure in 1984 was procurement- (largely defense-) related, and that almost half of the increase in private R&D between 1979 and 1984 was stimulated by the increase in federal demand.