3 research outputs found
How Am I Doing?: Evaluating Conversational Search Systems Offline
As conversational agents like Siri and Alexa gain in popularity and use, conversation is becoming an increasingly important mode of interaction for search. Conversational search shares some features with traditional search, but differs in some important respects: conversational search systems are less likely to return ranked lists of results (a SERP), more likely to involve iterated interactions, and more likely to feature longer, well-formed user queries in the form of natural language questions. Because of these differences, traditional methods for search evaluation (such as the Cranfield paradigm) do not translate easily to conversational search. In this work, we propose a framework for offline evaluation of conversational search, which includes a methodology for creating test collections with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction data to train the model. The framework is based on the idea of “subtopics”, often used to model novelty and diversity in search and recommendation, and the user model is similar to the geometric browsing model introduced by RBP and used in ERR. As far as we know, this is the first work to combine these ideas into a comprehensive framework for offline evaluation of conversational search.
Evaluating the Cranfield Paradigm for Conversational Search Systems
Due to the sequential and interactive nature of conversations, the application of traditional Information Retrieval (IR) methods like the Cranfield paradigm requires stronger assumptions. When building a test collection for ad hoc search, it is fair to assume that the relevance judgments provided by an annotator correlate well with the relevance judgments perceived by an actual user of the search engine. However, when building a test collection for conversational search, we do not know if it is fair to assume that the relevance judgments provided by an annotator correlate well with the relevance judgments perceived by an actual user of the conversational search system. In this paper, we perform a crowdsourcing study to evaluate the applicability of the Cranfield paradigm to conversational search systems. Our main aim is to understand the level of agreement, in terms of user satisfaction, between the users performing a search task in a conversational search system (i.e., directly assessing the system) and the users observing the search task being performed (i.e., indirectly assessing the system). The result of this study is paramount because it underpins and guides 1) the development of more realistic user models and simulators, and 2) the design of more reliable and robust evaluation measures for conversational search systems. Our results show that there is a fair agreement between direct and indirect assessments in terms of user satisfaction and that these two kinds of assessments share similar conversational patterns. Indeed, by collecting relevance assessments for each system utterance, we tested several conversational patterns that show a promising ability to predict user satisfaction.
From a User Model for Query Sessions to Session Rank Biased Precision (sRBP)
To satisfy their information needs, users usually carry out searches on retrieval systems by continuously trading off between the examination of search results retrieved by under-specified queries and the refinement of these queries through reformulation. In Information Retrieval (IR), a series of query reformulations is known as a query-session. Research in IR evaluation has traditionally been focused on the development of measures for the ad hoc task, for which a retrieval system aims to retrieve the best documents for a single query. Thus, most IR evaluation measures, with a few exceptions, are not suitable to evaluate retrieval scenarios that call for multiple refinements over a query-session. In this paper, by formally modeling a user’s expected behaviour over query-sessions, we derive a session-based evaluation measure, which results in a generalization of the evaluation measure Rank Biased Precision (RBP). We demonstrate the quality of this new session-based evaluation measure, named Session RBP (sRBP), by evaluating its user model against the observed user behaviour over the query-sessions of the 2014 TREC Session track.
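The standard RBP measure that sRBP generalizes is based on a geometric user model: after examining a result, the user continues to the next one with a fixed persistence probability p. The abstracts above do not give the sRBP formula itself, so the following is a minimal sketch of plain single-query RBP only, with an illustrative function name and a binary-relevance list as assumed input:

```python
def rbp(relevances, p=0.8):
    """Rank-Biased Precision for a single ranked list.

    Models a user who, after examining the result at rank i,
    examines the result at rank i+1 with probability p (a
    geometric browsing model). `relevances` holds the graded
    or binary relevance of each result, top rank first.
    """
    # Expected rate of gain: each rank i (0-based) is reached
    # with probability p**i; (1 - p) normalizes the score to [0, 1]
    # for binary relevance.
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))


# Example: with p = 0.5 and relevant results at ranks 1 and 2,
# the score is 0.5 * (1 + 0.5) = 0.75.
print(rbp([1, 1, 0], p=0.5))
```

A smaller p models an impatient user who rarely looks past the top ranks; a larger p models a persistent one. sRBP extends this browsing model across the queries of a session rather than a single ranked list.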