One-Shot Labeling for Automatic Relevance Estimation
Dealing with unjudged documents ("holes") in relevance assessments is a
perennial problem when evaluating search systems with offline experiments.
Holes can reduce the apparent effectiveness of retrieval systems during
evaluation and introduce biases in models trained with incomplete data. In this
work, we explore whether large language models can help us fill such holes to
improve offline evaluations. We examine an extreme, albeit common, evaluation
setting wherein only a single known relevant document per query is available
for evaluation. We then explore various approaches for predicting the relevance
of unjudged documents with respect to a query and the known relevant document,
including nearest neighbor, supervised, and prompting techniques. We find that
although the predictions of these One-Shot Labelers (1SL) frequently disagree
with human assessments, the labels they produce yield a far more reliable
ranking of systems than the single labels do alone. Specifically, the strongest
approaches can consistently reach system ranking correlations of over 0.86 with
the full rankings over a variety of measures. Meanwhile, the approach
substantially increases the reliability of t-tests due to filling holes in
relevance assessments, giving researchers more confidence in results they find
to be significant. Alongside this work, we release an easy-to-use software
package to enable the use of 1SL for evaluation of other ad-hoc collections or
systems.
Comment: SIGIR 2023
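To make the idea concrete, below is a minimal sketch of a nearest-neighbour-style one-shot labeler: each unjudged document is labeled by its embedding similarity to the single known relevant document. The encoder name, the cosine-similarity scoring, and the threshold are illustrative assumptions, not the paper's exact method.

```python
# A minimal 1SL sketch: fill assessment "holes" by comparing each
# unjudged document to the one known relevant document.
# Model name and threshold are assumed for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def one_shot_label(known_relevant: str, unjudged: list[str],
                   threshold: float = 0.6) -> list[int]:
    """Return a 0/1 relevance label for each unjudged document."""
    ref = model.encode(known_relevant, convert_to_tensor=True)
    docs = model.encode(unjudged, convert_to_tensor=True)
    sims = util.cos_sim(ref, docs)[0]  # cosine similarity to the reference
    return [int(s >= threshold) for s in sims]

labels = one_shot_label(
    "Aspirin reduces the risk of heart attack.",
    ["Aspirin lowers cardiovascular risk.", "The Eiffel Tower is in Paris."],
)
print(labels)  # e.g. [1, 0]
```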
An ant-colony based approach for real-time implicit collaborative information seeking
We propose an approach based on Swarm Intelligence, more specifically on Ant Colony Optimization (ACO), to improve search engines' performance and reduce information overload by exploiting collective user behavior. We designed and developed three different algorithms that employ an ACO-inspired strategy to provide implicit collaborative-seeking features in real time to search engines. The three algorithms (NaïveRank, RandomRank, and SessionRank) draw on different principles of ACO to exploit users' interactions and provide them with more relevant results. We designed an evaluation experiment employing two widely used standard datasets of query-click logs issued to two major Web search engines. The results demonstrate how each algorithm is suited to ranking the results of different types of queries, depending on user intent.
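As a rough illustration of the underlying ACO mechanics, the sketch below treats clicks as pheromone deposits on (query, document) pairs, applies evaporation, and re-ranks results by trail strength. The constants and update rules are assumptions for illustration, not the definitions of NaïveRank, RandomRank, or SessionRank.

```python
# Pheromone-based re-ranking sketch: clicks deposit pheromone,
# trails evaporate over time, stronger trails rank higher.
from collections import defaultdict

EVAPORATION = 0.1   # fraction of pheromone lost per step (assumed)
DEPOSIT = 1.0       # pheromone added per click (assumed)

pheromone: dict[tuple[str, str], float] = defaultdict(float)

def record_click(query: str, doc_id: str) -> None:
    pheromone[(query, doc_id)] += DEPOSIT

def evaporate() -> None:
    for key in pheromone:
        pheromone[key] *= (1.0 - EVAPORATION)

def rerank(query: str, results: list[str]) -> list[str]:
    # Results with more recent collective clicks rise to the top.
    return sorted(results, key=lambda d: pheromone[(query, d)], reverse=True)

record_click("python tutorial", "doc42")
evaporate()
print(rerank("python tutorial", ["doc7", "doc42"]))  # ['doc42', 'doc7']
```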
Statistical comparisons of non-deterministic IR systems using two dimensional variance
Retrieval systems with non-deterministic output are widely used in information retrieval. Common examples include sampling, approximation algorithms, and interactive user input. The effectiveness of such systems differs not just across topics, but also across instances of the system. The inherent variance presents a dilemma: what is the best way to measure the effectiveness of a non-deterministic IR system? Existing approaches to IR evaluation do not consider this problem, or its potential impact on statistical significance. In this paper, we explore how such variance can affect system comparisons, and propose an evaluation framework and methodologies capable of making such comparisons. Using distributed information retrieval as a case study, we show that the approaches provide a consistent and reliable methodology for comparing the effectiveness of a non-deterministic system with a deterministic or another non-deterministic system. In addition, we present a statistical best practice that can be used to safely show that a non-deterministic IR system has effectiveness equivalent to another IR system, and how to avoid the common pitfall of misusing a lack of significance as proof that two systems have equivalent effectiveness.
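To illustrate why non-determinism complicates comparison, the sketch below samples effectiveness per topic and per system instance, so scores form a topics-by-instances matrix rather than a single vector. It decomposes the two sources of variance and then runs a naive paired t-test after collapsing instances; the data are synthetic, and the paper's two-dimensional variance framework is not reproduced here.

```python
# Comparing a non-deterministic system against a deterministic baseline:
# effectiveness varies across topics AND across instances of the system.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_topics, n_instances = 50, 20

# Simulated AP scores: rows = topics, columns = independent runs
# of the non-deterministic system (values are synthetic).
nondet = np.clip(rng.normal(0.30, 0.05, (n_topics, n_instances)), 0, 1)
baseline = np.clip(rng.normal(0.28, 0.04, n_topics), 0, 1)

# Decompose variance: across topics vs. across instances within a topic.
var_topics = nondet.mean(axis=1).var(ddof=1)
var_instances = nondet.var(axis=1, ddof=1).mean()
print(f"topic variance={var_topics:.4f}, instance variance={var_instances:.4f}")

# Naive comparison: collapse instances to a per-topic mean, then pair.
# This discards the instance-level variance the paper accounts for.
t, p = stats.ttest_rel(nondet.mean(axis=1), baseline)
print(f"t={t:.2f}, p={p:.4f}")
```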
The Infinite Index: Information Retrieval on Generative Text-To-Image Models
Conditional generative models such as DALL-E and Stable Diffusion generate
images based on a user-defined text, the prompt. Finding and refining prompts
that produce a desired image has become the art of prompt engineering.
Generative models do not provide a built-in retrieval model for a user's
information need expressed through prompts. In light of an extensive literature
review, we reframe prompt engineering for generative models as interactive
text-based retrieval on a novel kind of "infinite index". We apply these
insights for the first time in a case study on image generation for game design
with an expert. Finally, we envision how active learning may help to guide the
retrieval of generated images.
Comment: Final version for CHIIR 2023
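A schematic of the retrieval loop the paper envisions might look like the following, where every query materialises fresh results and user feedback steers the next round. Here `generate` and `refine_prompt` are hypothetical placeholders, not functions from the paper or any real library.

```python
# Retrieval over an "infinite index": results are generated on demand
# rather than looked up, and feedback refines the next query.
def generate(prompt: str, k: int) -> list[str]:
    # Stand-in for a call to e.g. Stable Diffusion; returns image ids.
    return [f"{prompt}#{i}" for i in range(k)]

def refine_prompt(prompt: str, liked: list[str]) -> str:
    # Stand-in for relevance feedback on the liked images.
    return prompt + " (refined)" if liked else prompt

prompt = "isometric pixel-art castle for a game level"
for _ in range(3):
    results = generate(prompt, k=4)  # the "index" is sampled on demand
    liked = results[:1]              # pretend the user picked one image
    prompt = refine_prompt(prompt, liked)
print(prompt)
```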
On rank correlation and the distance between rankings
Rank correlation statistics are useful for determining whether there is a correspondence between two measurements, particularly when the measures themselves are of less interest than their relative ordering. Kendall's τ in particular has found use in Information Retrieval as a "meta-evaluation" measure: it has been used to compare evaluation measures, evaluate system rankings, and evaluate predicted performance. In the meta-evaluation domain, however, correlations between systems confound relationships between measurements, practically guaranteeing a positive and significant estimate of τ regardless of any actual correlation between the measurements. We introduce an alternative measure of distance between rankings that corrects this by explicitly accounting for correlations between systems over a sample of topics, and moreover has a probabilistic interpretation for use in a test of statistical significance. We validate our measure with theory, simulated data, and experiment.
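For reference, computing Kendall's τ between two system orderings takes a single call in SciPy. The scores below are synthetic and serve only to show the meta-evaluation setup the abstract critiques: τ is computed over the systems' scores under two measures, ignoring any topic-level correlation between the systems.

```python
# Kendall's tau as a meta-evaluation measure: correlate the ranking
# of the same systems under two effectiveness measures.
from scipy.stats import kendalltau

map_scores = [0.31, 0.28, 0.35, 0.22, 0.30]   # systems scored by MAP
ndcg_scores = [0.45, 0.41, 0.48, 0.33, 0.46]  # same systems by nDCG

tau, p = kendalltau(map_scores, ndcg_scores)
print(f"tau={tau:.3f}, p={p:.3f}")
```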
Payoffs and pitfalls in using knowledge bases for consumer health search
Consumer health search (CHS) is a challenging domain, with vocabulary mismatch and the need for considerable domain expertise hampering people's ability to formulate effective queries. We posit that using knowledge bases for query reformulation may help alleviate this problem. How to exploit knowledge bases for effective CHS is nontrivial, involving a swathe of key choices and design decisions (many of which are not explored in the literature). Here we rigorously and empirically evaluate the impact these different choices have on retrieval effectiveness. A state-of-the-art knowledge-base retrieval model (the Entity Query Feature Expansion model) was used to evaluate these choices, which include: which knowledge base to use (specialised vs. general purpose), how to construct the knowledge base, how to extract entities from queries and map them to entities in the knowledge base, what part of the knowledge base to use for query expansion, and whether to augment the knowledge-base search process with relevance feedback. While knowledge-base retrieval has been proposed as a solution for CHS, this paper delves into the finer details of doing this effectively, highlighting both payoffs and pitfalls. It aims to provide lessons to others in advancing the state of the art in CHS.
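A toy version of knowledge-base query expansion is sketched below, with a two-entry dictionary standing in for a real knowledge base such as UMLS or Wikipedia. The entity linking and expansion are deliberately naive and do not reproduce the Entity Query Feature Expansion model.

```python
# Naive KB-based query expansion: link query text to entities, then
# append related terms from the knowledge base. KB is a stand-in.
KB = {
    "heart attack": ["myocardial infarction", "coronary artery disease"],
    "high blood pressure": ["hypertension"],
}

def expand_query(query: str) -> str:
    expansions = []
    for entity, related in KB.items():
        if entity in query.lower():  # naive substring entity linking
            expansions.extend(related)
    return query + " " + " ".join(expansions) if expansions else query

print(expand_query("signs of a heart attack"))
# signs of a heart attack myocardial infarction coronary artery disease
```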