3 research outputs found

    How Am I Doing?: Evaluating Conversational Search Systems Offline

    As conversational agents like Siri and Alexa gain in popularity and use, conversation is becoming an increasingly important mode of interaction for search. Conversational search shares some features with traditional search, but differs in some important respects: conversational search systems are less likely to return ranked lists of results (a SERP), more likely to involve iterated interactions, and more likely to feature longer, well-formed user queries in the form of natural language questions. Because of these differences, traditional methods for search evaluation (such as the Cranfield paradigm) do not translate easily to conversational search. In this work, we propose a framework for offline evaluation of conversational search, which includes a methodology for creating test collections with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction data to train the model. The framework is based on the idea of “subtopics”, often used to model novelty and diversity in search and recommendation, and the user model is similar to the geometric browsing model introduced by RBP and used in ERR. As far as we know, this is the first work to combine these ideas into a comprehensive framework for offline evaluation of conversational search.
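
    The evaluation measure described in this abstract combines subtopic-based gains with a geometric continuation model in the spirit of RBP and ERR. As a rough illustration of that idea only, and not the paper's actual measure, the sketch below assumes each system turn is scored by the fraction of the topic's subtopics it covers and that the user continues to the next turn with a fixed persistence probability.

    # Minimal sketch of an RBP-style geometric browsing model applied to a
    # conversation: the user inspects system responses turn by turn and
    # continues to the next turn with fixed probability `persistence`.
    # The per-turn gains (fraction of subtopics covered) are an illustrative
    # assumption, not the measure proposed in the paper.
    def geometric_conversation_score(subtopic_coverage, persistence=0.8):
        """Expected gain under a geometric continuation model.

        subtopic_coverage: list of floats in [0, 1], one per system turn,
            e.g. the fraction of the topic's subtopics addressed by that turn.
        persistence: probability that the user continues to the next turn.
        """
        score = 0.0
        for i, gain in enumerate(subtopic_coverage):
            # Probability the user actually reaches turn i, times its gain.
            score += (persistence ** i) * gain
        # The (1 - persistence) factor normalises the score, as in RBP.
        return (1 - persistence) * score

    if __name__ == "__main__":
        # Three system turns covering 50%, 75%, and 100% of the subtopics.
        print(geometric_conversation_score([0.5, 0.75, 1.0], persistence=0.8))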

    Evaluating the Cranfield Paradigm for Conversational Search Systems

    Due to the sequential and interactive nature of conversations, the application of traditional Information Retrieval (IR) methods like the Cranfield paradigm requires stronger assumptions. When building a test collection for ad hoc search, it is fair to assume that the relevance judgments provided by an annotator correlate well with the relevance judgments perceived by an actual user of the search engine. However, when building a test collection for conversational search, we do not know whether it is fair to assume that the relevance judgments provided by an annotator correlate well with the relevance judgments perceived by an actual user of the conversational search system. In this paper, we perform a crowdsourcing study to evaluate the applicability of the Cranfield paradigm to conversational search systems. Our main aim is to understand the agreement, in terms of user satisfaction, between users performing a search task with a conversational search system (i.e., directly assessing the system) and users observing the search task being performed (i.e., indirectly assessing the system). The result of this study is paramount because it underpins and guides 1) the development of more realistic user models and simulators, and 2) the design of more reliable and robust evaluation measures for conversational search systems. Our results show that there is a fair agreement between direct and indirect assessments in terms of user satisfaction and that these two kinds of assessments share similar conversational patterns. Indeed, by collecting relevance assessments for each system utterance, we tested several conversational patterns that show a promising ability to predict user satisfaction.
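
    The abstract reports a fair agreement between direct and indirect assessments but does not state the agreement statistic in this summary. Purely as an illustration of how such agreement could be quantified, the sketch below assumes graded satisfaction labels on a 1-5 scale and Cohen's kappa with quadratic weights; both the toy data and the choice of statistic are assumptions, not necessarily those of the paper.

    # Hypothetical illustration of measuring agreement between direct
    # assessments (users performing the search task) and indirect assessments
    # (users observing it). Labels and statistic are illustrative assumptions.
    from sklearn.metrics import cohen_kappa_score

    # Toy satisfaction labels on a 1-5 scale, one pair per conversation.
    direct = [5, 4, 2, 3, 5, 1, 4, 3]
    indirect = [4, 4, 2, 2, 5, 2, 4, 3]

    # Quadratic weighting penalises large disagreements more than small ones.
    kappa = cohen_kappa_score(direct, indirect, weights="quadratic")
    print(f"weighted kappa: {kappa:.2f}")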

    From a User Model for Query Sessions to Session Rank Biased Precision (sRBP)

    To satisfy their information needs, users usually carry out searches on retrieval systems by continuously trading off between the examination of search results retrieved by under-specified queries and the refinement of these queries through reformulation. In Information Retrieval (IR), a series of query reformulations is known as a query-session. Research in IR evaluation has traditionally focused on the development of measures for the ad hoc task, for which a retrieval system aims to retrieve the best documents for a single query. Thus, most IR evaluation measures, with a few exceptions, are not suitable to evaluate retrieval scenarios that call for multiple refinements over a query-session. In this paper, by formally modeling a user’s expected behaviour over query-sessions, we derive a session-based evaluation measure, which results in a generalization of the evaluation measure Rank Biased Precision (RBP). We demonstrate the quality of this new session-based evaluation measure, named Session RBP (sRBP), by evaluating its user model against the observed user behaviour over the query-sessions of the 2014 TREC Session Track.
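
    Since sRBP is presented as a generalization of RBP, the sketch below first implements standard RBP and then an illustrative two-parameter session extension, with one persistence probability for examining the current ranking and another for issuing a further reformulation. The session variant is only a guess at what such a generalization could look like under those assumptions, not the exact sRBP formulation derived in the paper.

    # Standard Rank Biased Precision (RBP) plus a hedged session-level
    # extension. RBP follows the usual definition
    #   RBP = (1 - p) * sum_i rel_i * p^(i-1);
    # the session variant below is an illustrative assumption, not
    # necessarily the sRBP measure derived in the paper.
    def rbp(relevances, p=0.8):
        """RBP for one ranked list of relevance grades (rank 1 first)."""
        return (1 - p) * sum(rel * p ** i for i, rel in enumerate(relevances))

    def session_rbp_sketch(session, p=0.8, q=0.7):
        """Illustrative session-level score.

        session: list of ranked lists of relevance grades, one per query
            reformulation, in the order the queries were issued.
        p: probability of examining the next result in the current ranking.
        q: probability of issuing another reformulation afterwards.
        """
        total = 0.0
        for j, relevances in enumerate(session):
            # Discount later reformulations by the chance the user reaches them.
            total += (q ** j) * rbp(relevances, p)
        return (1 - q) * total

    if __name__ == "__main__":
        # A session with two reformulations and binary relevance grades.
        print(session_rbp_sketch([[1, 0, 1], [0, 1, 1, 1]], p=0.8, q=0.7))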