Unbiased Comparative Evaluation of Ranking Functions
Eliciting relevance judgments for ranking evaluation is labor-intensive and
costly, motivating careful selection of which documents to judge. Unlike
traditional approaches that make this selection deterministically,
probabilistic sampling has shown intriguing promise since it enables the design
of estimators that are provably unbiased even when reusing data with missing
judgments. In this paper, we first unify and extend these sampling approaches
by viewing the evaluation problem as a Monte Carlo estimation task that applies
to a large number of common IR metrics. Drawing on the theoretical clarity that
this view offers, we tackle three practical evaluation scenarios: comparing two
systems, comparing systems against a baseline, and ranking systems. For
each scenario, we derive an estimator and a variance-optimizing sampling
distribution while retaining the strengths of sampling-based evaluation,
including unbiasedness, reusability despite missing data, and ease of use in
practice. In addition to the theoretical contribution, we empirically evaluate
our methods against previously used sampling heuristics and find that they
generally cut the number of required relevance judgments at least in half.
Comment: Under review; 10 pages
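The abstract does not spell out its estimators, but the Monte Carlo view it describes is commonly realized with a Horvitz-Thompson-style correction. The following minimal sketch is our own illustration, not the paper's method: the `rank_weight` discount and input format are hypothetical stand-ins, while the paper's actual contribution lies in the variance-optimizing sampling distributions. The key idea shown here is that a document judged with inclusion probability p contributes its gain divided by p, so the estimate stays unbiased despite missing judgments.

```python
def ht_metric_estimate(ranking, incl_prob, judgments,
                       rank_weight=lambda r: 1.0 / r):
    """Horvitz-Thompson estimate of a rank-weighted additive metric.

    ranking:     doc ids in ranked order
    incl_prob:   dict doc -> probability the doc was selected for judging
    judgments:   dict doc -> relevance grade; docs absent here were
                 simply never sampled (not assumed nonrelevant)
    rank_weight: per-rank discount defining the metric (hypothetical
                 1/rank default; swap in a DCG- or RBP-style discount)
    """
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in judgments:
            # Dividing by the inclusion probability makes the expectation
            # over random judgment samples equal the true metric value.
            total += rank_weight(rank) * judgments[doc] / incl_prob[doc]
    return total
```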
Evaluating epistemic uncertainty under incomplete assessments
This study proposes an extended methodology for laboratory-based Information Retrieval evaluation under incomplete relevance assessments. The methodology aims to identify potential uncertainty during system comparison that may result from incompleteness. Adopting it is advantageous because the detection of epistemic uncertainty - the amount of knowledge (or ignorance) we have about the estimate of a system's performance - during the evaluation process can guide and direct researchers when evaluating new systems over existing and future test collections. Across a series of experiments we demonstrate how this methodology can lead towards a finer-grained analysis of systems. In particular, we show through experimentation how the current practice in Information Retrieval evaluation of using a measurement depth larger than the pooling depth increases uncertainty during system comparison.
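The abstract leaves the uncertainty analysis abstract. One simple way to picture the idea (our illustration under simplifying assumptions, not the study's actual protocol) is to bound a metric by the best and worst cases the unjudged documents permit:

```python
def precision_at_k_bounds(ranking, judgments, k=10):
    """Bounds on precision@k when some top-k documents are unjudged.

    Unjudged documents count as nonrelevant for the lower bound and
    as relevant for the upper bound; the width of the interval is a
    crude measure of epistemic uncertainty about the true score.
    """
    top_k = ranking[:k]
    relevant = sum(1 for d in top_k if judgments.get(d) == 1)
    unjudged = sum(1 for d in top_k if d not in judgments)
    return relevant / k, (relevant + unjudged) / k
```

Under this reading, two systems can be compared with confidence only when their intervals do not overlap, and measuring deeper than the pooling depth widens the intervals because more of the measured ranks are unjudged.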
Anticipating Information Needs Based on Check-in Activity
In this work we address the development of a smart personal assistant that is
capable of anticipating a user's information needs based on a novel type of
context: the person's activity inferred from her check-in records on a
location-based social network. Our main contribution is a method that
translates a check-in activity into an information need, which is in turn
addressed with an appropriate information card. This task is challenging
because of the large number of possible activities and related information
needs, which need to be addressed in a mobile dashboard that is limited in
size. Our approach considers each possible activity that might follow after the
last (and already finished) activity, and selects the top information cards
such that they maximize the likelihood of satisfying the user's information
needs for all possible future scenarios. The proposed models also incorporate
knowledge about the temporal dynamics of information needs. Using a combination
of historical check-in data and manual assessments collected via crowdsourcing,
we show experimentally the effectiveness of our approach.
Comment: Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM '17), 2017
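The selection objective described above can be pictured as an expected-coverage problem. The sketch below is an illustrative simplification (the names and the independence assumption are ours, and the paper's models additionally capture temporal dynamics): each candidate card is scored by the probability mass of the possible next activities whose needs it addresses, and the top-scoring cards fill the size-limited dashboard.

```python
def select_cards(next_activity_probs, card_covers, k=3):
    """Choose k information cards for a size-limited mobile dashboard.

    next_activity_probs: dict activity -> P(activity follows the last,
                         already finished one)
    card_covers:         dict card -> set of activities whose
                         information needs the card addresses
    Cards are ranked by the expected probability of satisfying the
    user's information need across all possible future scenarios.
    """
    score = {card: sum(next_activity_probs.get(a, 0.0) for a in acts)
             for card, acts in card_covers.items()}
    return sorted(score, key=score.get, reverse=True)[:k]
```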
A retrieval evaluation methodology for incomplete relevance assessments
In this paper we propose an extended methodology for laboratory-based Information Retrieval evaluation under incomplete relevance assessments. This new protocol aims to identify potential uncertainty during system comparison that may result from incompleteness. We demonstrate how this methodology can lead towards a finer-grained analysis of systems. This is advantageous because the detection of uncertainty during the evaluation process can guide and direct researchers when evaluating new systems over existing and future test collections.
A collaborative approach to IR evaluation
In this thesis we investigate two main problems: 1) inferring consensus from disparate inputs to improve the quality of crowd-contributed data; and 2) developing a reliable crowd-aided IR evaluation framework.
With regard to the first contribution: while many statistical label aggregation methods have been proposed, little comparative benchmarking has occurred in the community, making it difficult to determine the state of the art in consensus or to quantify novelty and progress, and leaving modern systems to adopt simple control strategies. To aid the progress of statistical consensus methods and make the state of the art accessible, we develop SQUARE, an open-source shared-task benchmarking framework that includes benchmark datasets, defined tasks, standard metrics, and reference implementations with empirical results for several popular methods. Through the development of SQUARE we propose a crowd simulation model that emulates real crowd environments, enabling rapid and reliable experimentation with collaborative methods under different crowd contributions. We apply the findings of the benchmark to develop reliable crowd-contributed test collections for IR evaluation.
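For reference, the "simple control strategies" such a benchmark compares against are often no more than majority voting. A minimal sketch, with a hypothetical input format:

```python
from collections import Counter

def majority_vote(worker_labels):
    """Baseline consensus: per-item majority vote over crowd labels.

    worker_labels: dict item -> list of labels from different workers.
    Statistical methods of the kind benchmarked in SQUARE replace
    this with models that also estimate per-worker reliability.
    """
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in worker_labels.items()}
```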
As our second contribution, we describe a collaborative model for distributing relevance judging tasks between trusted assessors and crowd judges. Building on prior work's hypothesis that assessors disagree most on borderline documents, we train a logistic regression model to predict assessor disagreement and prioritize judging tasks by expected disagreement. Judgments are generated from different crowd models and intelligently aggregated. Given a priority queue, a judging budget, and a ratio of expert to crowd judging costs, critical judging tasks are assigned to trusted assessors, with the crowd supplying the remaining judgments. Results on two TREC datasets show that a significant judging burden can be confidently shifted to the crowd, achieving high rank correlation, often at lower cost than exclusive use of trusted assessors.
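A minimal sketch of the budgeted routing step, under our own simplifying assumptions (fixed per-judgment costs and a disagreement score already predicted for each task; the thesis's crowd models and aggregation of crowd judgments are omitted):

```python
def assign_judging_tasks(tasks, disagreement, budget,
                         expert_cost=1.0, crowd_cost=0.1):
    """Split judging tasks between trusted assessors and the crowd.

    tasks:        (topic, doc) pairs needing relevance judgments
    disagreement: dict task -> predicted probability of assessor
                  disagreement (e.g. from a logistic regression)
    Tasks are ordered by predicted disagreement; the most contentious
    go to experts until the budget forces the rest onto the crowd.
    """
    queue = sorted(tasks, key=lambda t: disagreement[t], reverse=True)
    # Every task costs at least a crowd judgment; spend the surplus
    # on upgrading the most contentious tasks to expert judgments.
    surplus = budget - crowd_cost * len(queue)
    n_expert = max(0, int(surplus // (expert_cost - crowd_cost)))
    n_expert = min(n_expert, len(queue))
    return queue[:n_expert], queue[n_expert:]  # (expert, crowd) tasks
```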
A Meta-Evaluation of C/W/L/A Metrics: System Ranking Similarity, System Ranking Consistency and Discriminative Power
Recently, Moffat et al. proposed an analytic framework, namely C/W/L/A, for
offline evaluation metrics. This framework allows information retrieval (IR)
researchers to design evaluation metrics through the flexible combination of
user browsing models and user gain aggregations. However, the statistical
stability of C/W/L/A metrics under different aggregations has not yet been
investigated. In this study, we examine the statistical stability of
C/W/L/A metrics from three perspectives: (1) the system ranking similarity
among aggregations, (2) the system ranking consistency of aggregations, and (3)
the discriminative power of aggregations. More specifically, we combined
various aggregation functions with the browsing models of Precision, Discounted
Cumulative Gain (DCG), Rank-Biased Precision (RBP), INST, Average Precision
(AP), and Expected Reciprocal Rank (ERR), examining their performance in terms
of system ranking similarity, system ranking consistency, and discriminative
power on two offline test collections. Our experimental results suggest that,
in terms of system ranking consistency and discriminative power, the
aggregation function of expected rate of gain (ERG) performs outstandingly
well, while the aggregation function of maximum relevance usually performs
poorly. The results also suggest that Precision, DCG, RBP, INST, and AP with
their canonical aggregations all perform favourably in system ranking
consistency and discriminative power; for ERR, however, replacing its canonical
aggregation with ERG can further strengthen the discriminative power while
yielding a system ranking similar to that of the canonical version.
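As a concrete reading of the framework, the ERG aggregation scores a ranking by the gain at each rank weighted by how much of the user's attention the browsing model places there. The sketch below is our own illustration over a truncated list (the normalization is exact only in the infinite-list limit), showing that a constant continuation probability recovers Rank-Biased Precision:

```python
def cwl_erg_score(gains, continuation):
    """Expected rate of gain (ERG) for one ranked list under C/W/L.

    gains:        per-rank gains g_1..g_n
    continuation: C(i), the probability that a user at rank i
                  continues to rank i+1 (the browsing model)
    Rank i receives weight W(i) proportional to the probability
    V(i) = C(1) * ... * C(i-1) that the user reaches rank i.
    """
    view_prob, views = 1.0, []
    for i in range(1, len(gains) + 1):
        views.append(view_prob)
        view_prob *= continuation(i)
    return sum(v * g for v, g in zip(views, gains)) / sum(views)

# Constant C(i) = p yields Rank-Biased Precision with persistence p:
rbp_example = cwl_erg_score([1, 0, 1, 1, 0], lambda i: 0.8)
```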
Private Investment in R&D to Signal Ability to Perform Government Contracts
Official government statistics on the "mission-distribution" of U.S. R&D investment are based on the assumption that only the government sponsors military R&D. In this paper we advance and test the alternative hypothesis that a significant share of privately-financed industrial R&D is military in orientation. We argue that in addition to (and prior to) contracting with firms to perform military R&D, the government deliberately encourages firms to sponsor defense research at their own expense, to enable the government to identify the firms most capable of performing certain government contracts, particularly those for major weapons systems. To test the hypothesis of, and estimate the quantity of, private investment in 'signaling' R&D, we estimate variants of a model of company R&D expenditure on longitudinal, firm-level data, including detailed data on federal contracts. Our estimates imply that about 30 percent of U.S. private industrial R&D expenditure in 1984 was procurement- (largely defense-) related, and that almost half of the increase in private R&D between 1979 and 1984 was stimulated by the increase in federal demand.