A Comparison of Methods for Evaluating Generative IR
Information retrieval systems increasingly incorporate generative components.
For example, in a retrieval augmented generation (RAG) system, a retrieval
component might provide a source of ground truth, while a generative component
summarizes and augments its responses. In other systems, a large language model
(LLM) might directly generate responses without consulting a retrieval
component. While there are multiple definitions of generative information
retrieval (Gen-IR) systems, in this paper we focus on those systems where the
system's response is not drawn from a fixed collection of documents or
passages. The response to a query may be entirely new text. Since traditional
IR evaluation methods break down under this model, we explore various methods
that extend traditional offline evaluation approaches to the Gen-IR context.
Offline IR evaluation traditionally employs paid human assessors, but
increasingly LLMs are replacing human assessment, producing labels of quality
similar or superior to crowdsourced labels. Given that Gen-IR systems do not
generate responses from a fixed set, we assume that methods for Gen-IR
evaluation must largely depend on LLM-generated labels. Along with methods
based on binary and graded relevance, we explore methods based on explicit
subtopics, pairwise preferences, and embeddings. We first validate these
methods against human assessments on several TREC Deep Learning Track tasks; we
then apply these methods to evaluate the output of several purely generative
systems. For each method we consider both its ability to act autonomously,
without the need for human labels or other input, and its ability to support
human auditing. To trust these methods, we must be assured that their results
align with human assessments. In order to do so, evaluation criteria must be
transparent, so that outcomes can be audited by human assessors.
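As a rough illustration of the label-based methods mentioned above, the sketch below shows how an LLM judge might assign graded relevance labels to generated responses and how those labels might be checked against human assessments. The ask_llm helper, the prompt wording, and the 0-3 scale are assumptions made for illustration, not the paper's exact protocol.

```python
# Hypothetical sketch of LLM-based graded relevance labelling for Gen-IR output.
# `ask_llm` is an assumed callable (prompt -> completion string); the prompt text
# and the 0-3 scale are illustrative, not the paper's exact protocol.
from typing import Callable

GRADE_PROMPT = """You are a relevance assessor.
Query: {query}
System response: {response}
On a scale of 0 (not relevant) to 3 (perfectly relevant),
how well does the response answer the query?
Answer with a single digit."""

def grade_response(query: str, response: str, ask_llm: Callable[[str], str]) -> int:
    """Return a graded relevance label (0-3) produced by the LLM judge."""
    raw = ask_llm(GRADE_PROMPT.format(query=query, response=response))
    digits = [c for c in raw if c.isdigit()]
    return min(max(int(digits[0]), 0), 3) if digits else 0

def agreement(llm_labels: list, human_labels: list) -> float:
    """Fraction of items where the LLM grade matches the human grade exactly,
    a simple proxy for validating LLM labels against human assessments."""
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    return matches / len(human_labels) if human_labels else 0.0
```

A subtopic-, preference-, or embedding-based method would swap the grading prompt for a subtopic checklist, a pairwise comparison, or a similarity score over response embeddings, but the validation step against human labels is analogous.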
Retrieving Supporting Evidence for Generative Question Answering
Current large language models (LLMs) can exhibit near-human levels of
performance on many natural language-based tasks, including open-domain
question answering. Unfortunately, at this time, they also convincingly
hallucinate incorrect answers, so that responses to questions must be verified
against external sources before they can be accepted at face value. In this
paper, we report two simple experiments to automatically validate generated
answers against a corpus. We base our experiments on questions and passages
from the MS MARCO (V1) test collection, and a retrieval pipeline consisting of
sparse retrieval, dense retrieval and neural rerankers. In the first
experiment, we validate the generated answer in its entirety. After presenting
a question to an LLM and receiving a generated answer, we query the corpus with
the combination of the question + generated answer. We then present the LLM
with the combination of the question + generated answer + retrieved answer,
prompting it to indicate if the generated answer can be supported by the
retrieved answer. In the second experiment, we consider the generated answer at
a more granular level, prompting the LLM to extract a list of factual
statements from the answer and verifying each statement separately. We query
the corpus with each factual statement and then present the LLM with the
statement and the corresponding retrieved evidence. The LLM is prompted to
indicate if the statement can be supported and make necessary edits using the
retrieved material. With an accuracy of over 80%, we find that an LLM is
capable of verifying its generated answer when a corpus of supporting material
is provided. However, manual assessment of a random sample of questions reveals
that some incorrect generated answers are missed by this verification process. While
this verification process can reduce hallucinations, it can not entirely
eliminate them.Comment: arXiv admin note: text overlap with arXiv:2306.1378
Human Preferences as Dueling Bandits
The dramatic improvements in core information retrieval tasks engendered by neural rankers create a need for novel evaluation methods. If every ranker returns highly relevant items in the top ranks, it becomes difficult to recognize meaningful differences between them and to build reusable test collections. Several recent papers explore pairwise preference judgments as an alternative to traditional graded relevance assessments. Rather than viewing items one at a time, assessors view items side-by-side and indicate the one that provides the better response to a query, allowing fine-grained distinctions. If we employ preference judgments to identify the probably best items for each query, we can measure rankers by their ability to place these items as high as possible. We frame the problem of finding best items as a dueling bandits problem. While many papers explore dueling bandits for online ranker evaluation via interleaving, they have not been considered as a framework for offline evaluation via human preference judgments. We review the literature for possible solutions. For human preference judgments, any usable algorithm must tolerate ties, since two items may appear nearly equal to assessors, and it must minimize the number of judgments required for any specific pair, since each such comparison requires an independent assessor. Since the theoretical guarantees provided by most algorithms depend on assumptions that are not satisfied by human preference judgments, we simulate selected algorithms on representative test cases to provide insight into their practical utility. Based on these simulations, one algorithm stands out for its potential. Our simulations suggest modifications to further improve its performance. Using the modified algorithm, we collect over 10,000 preference judgments for pools derived from submissions to the TREC 2021 Deep Learning Track, confirming its suitability. We test the idea of best-item evaluation and suggest ideas for further theoretical and practical progress.
We thank Mark Smucker, Gautam Kamath, and Ben Carterette for
their feedback. This research was supported by the Natural Sciences and Engineering Research Council of Canada through its Discovery Grants program.
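To make the dueling-bandits framing concrete, the toy simulation below selects a best item from noisy pairwise preferences that may end in ties. It is a deliberately simple sequential-elimination scheme, not the specific algorithm the paper settles on; the simulated_judge oracle and its hidden quality scores are invented for illustration.

```python
# Toy simulation of preference-based best-item selection with ties.
# This is a simple sequential-elimination scheme, not the dueling-bandits
# algorithm selected in the paper; `judge(a, b)` is an assumed oracle returning
# "a", "b", or "tie" for a single simulated assessor preference.
from typing import Callable, List

def find_best(items: List[str], judge: Callable[[str, str], str]) -> str:
    """Keep a current champion; a challenger replaces it only on a clear win."""
    champion = items[0]
    judgments = 0
    for challenger in items[1:]:
        outcome = judge(champion, challenger)   # one judgment per pair
        judgments += 1
        if outcome == "b":                      # challenger clearly preferred
            champion = challenger
        # on "a" or "tie" the current champion is retained
    print(f"used {judgments} preference judgments")
    return champion

# Example: simulated assessor with hidden quality scores and near-equal ties.
quality = {"d1": 0.9, "d2": 0.7, "d3": 0.85}
def simulated_judge(a: str, b: str) -> str:
    if abs(quality[a] - quality[b]) < 0.05:
        return "tie"
    return "a" if quality[a] > quality[b] else "b"

print(find_best(list(quality), simulated_judge))   # -> d1
```

Note that ties simply leave the incumbent in place and that each pair is judged once, reflecting the two practical requirements the abstract identifies for human preference judgments.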
Dynamic Foot Stimulation Attenuates Soleus Muscle Atrophy Induced by Hindlimb Unloading in Rats
Unloading-induced myofiber atrophy is a phenomenon that occurs in the aging population, bed-ridden patients and astronauts. The objective of this study was to determine whether or not dynamic foot stimulation (DFS) applied to the plantar surface of the rat foot can serve as a countermeasure to the soleus muscle atrophy normally observed in hindlimb unloaded (HU) rats. Thirty mature adult (6-month-old) male Wistar rats were randomly assigned to ambulatory control (AMB), hindlimb unloaded alone (HU), or hindlimb unloaded with the application of DFS (HU+DFS) groups. A dynamic pattern of pressure was applied to the right foot of each HU animal using a specially fabricated boot containing an inflatable air bladder connected to a solenoid air pump controlled by a laptop computer. The anti-atrophic effects of DFS were quantified morphometrically in frozen cross-sections of soleus muscle stained using the metachromatic-ATPase fiber typing technique. Application of DFS during HU significantly counteracted the atrophic response observed in the soleus by preventing approximately 85% of the reduction in type I myofiber cross-sectional area (CSA) observed during HU. However, DFS did not protect type II fibers of the soleus from HU-induced atrophy, or any fiber type in the soleus muscle of the contralateral control leg of the DFS-treated HU animals. These results illustrate that the application of DFS to the rat foot is an effective countermeasure to soleus muscle atrophy induced by HU.
Assessing and Verifying Task Utility in LLM-Powered Applications
The rapid development of Large Language Models (LLMs) has led to a surge in
applications that facilitate collaboration among multiple agents, assisting
humans in their daily tasks. However, a significant gap remains in assessing to
what extent LLM-powered applications genuinely enhance user experience and task
execution efficiency. This highlights the need to verify the utility of LLM-powered
applications, particularly by ensuring alignment between the application's
functionality and end-user needs. We introduce AgentEval, a novel framework
designed to simplify the utility verification process by automatically
proposing a set of criteria tailored to the unique purpose of any given
application. This allows for a comprehensive assessment, quantifying the
utility of an application against the suggested criteria. We present a
comprehensive analysis of the effectiveness and robustness of AgentEval on two
open-source datasets covering math problem solving and ALFWorld household-related
tasks. For reproducibility, we make the data, code, and all logs publicly
available at https://bit.ly/3w3yKcS.
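The criteria-then-quantify idea can be sketched roughly as below. The ask_llm helper, the prompt wording, the JSON reply format, and the 1-5 scale are assumptions made for illustration and do not reflect AgentEval's actual interface.

```python
# Hedged sketch of the two-step idea behind AgentEval-style assessment:
# (1) ask an LLM to propose task-specific evaluation criteria, then
# (2) ask it to score an application's output against each criterion.
# `ask_llm` is an assumed helper; prompts, JSON format, and the 1-5 scale are
# illustrative and not the framework's actual interface.
import json
from typing import Callable, Dict, List

def propose_criteria(task_description: str, ask_llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM to suggest criteria tailored to the task's purpose."""
    prompt = ("Propose 3-5 criteria for judging how well an application solves "
              "the following task. Reply as a JSON list of strings.\n"
              f"Task: {task_description}")
    return json.loads(ask_llm(prompt))

def quantify_utility(task_output: str, criteria: List[str],
                     ask_llm: Callable[[str], str]) -> Dict[str, int]:
    """Score the application's output on each proposed criterion."""
    scores = {}
    for criterion in criteria:
        prompt = (f"On a scale of 1-5, rate the output below on the criterion "
                  f"'{criterion}'. Reply with a single digit.\n"
                  f"Output: {task_output}")
        reply = ask_llm(prompt)
        digits = [c for c in reply if c.isdigit()]
        scores[criterion] = int(digits[0]) if digits else 0
    return scores
```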