
    Validating simulated interaction for retrieval evaluation

    A searcher’s interaction with a retrieval system consists of actions such as query formulation, search result list interaction and document interaction. The simulation of searcher interaction has recently gained momentum in the analysis and evaluation of interactive information retrieval (IIR). However, a key issue that has not yet been adequately addressed is the validity of such IIR simulations and whether they reliably predict the performance obtained by a searcher across the session. The aim of this paper is to determine the validity of the common interaction model (CIM) typically used for simulating multi-query sessions. We focus on search result interactions, i.e., inspecting snippets, examining documents and deciding when to stop examining the results of a single query, or when to stop the whole session. To this end, we run a series of simulations grounded in real-world behavioral data to show how accurate and responsive the model is to the various experimental conditions under which the data were produced. We then validate on a second real-world data set derived under similar experimental conditions. We seek to predict cumulated gain across the session. We find that the interaction model with a query-level stopping strategy based on consecutive non-relevant snippets leads to the highest prediction accuracy and the lowest deviation from ground truth, around 9 to 15% depending on the experimental conditions. To our knowledge, the present study is the first validation effort of the CIM that shows that the model’s acceptance and use are justified within IIR evaluations. We also identify and discuss ways to further improve the CIM and its behavioral parameters for more accurate simulations.
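
    A minimal sketch of the kind of stopping rule described above: a simulated searcher scans snippets in rank order, abandons a query after a fixed number of consecutive non-relevant snippets, and gain is cumulated across the session. The function names and the patience parameter are illustrative assumptions, not taken from the paper.

        # Illustrative simulation of a query-level stopping rule based on
        # consecutive non-relevant snippets (hypothetical helper names).
        from typing import List

        def simulate_query(snippet_relevance: List[int], patience: int = 3) -> int:
            """Scan a ranked list, stop after `patience` consecutive non-relevant
            snippets, and return the gain accrued from relevant documents seen."""
            gain = 0
            consecutive_nonrelevant = 0
            for rel in snippet_relevance:
                if rel > 0:
                    gain += rel                    # examine the document, accrue gain
                    consecutive_nonrelevant = 0
                else:
                    consecutive_nonrelevant += 1
                    if consecutive_nonrelevant >= patience:
                        break                      # query-level stopping point
            return gain

        def simulate_session(queries: List[List[int]], patience: int = 3) -> int:
            """Cumulated gain across a multi-query session."""
            return sum(simulate_query(q, patience) for q in queries)

        # Example: both queries stop before reaching their last relevant result.
        print(simulate_session([[1, 0, 1, 0, 0, 0, 1], [0, 0, 0, 1]]))  # -> 2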

    Hierarchical Average Precision Training for Pertinent Image Retrieval

    Image Retrieval is commonly evaluated with Average Precision (AP) or Recall@k. Yet those metrics are limited to binary labels and do not take into account the severity of errors. This paper introduces a new hierarchical AP training method for pertinent image retrieval (HAPPIER). HAPPIER is based on a new H-AP metric, which leverages a concept hierarchy to refine AP by integrating the importance of errors and better evaluating rankings. To train deep models with H-AP, we carefully study the problem's structure and design a smooth lower bound surrogate combined with a clustering loss that ensures consistent ordering. Extensive experiments on 6 datasets show that HAPPIER significantly outperforms state-of-the-art methods for hierarchical retrieval, while being on par with the latest approaches when evaluating fine-grained ranking performance. Finally, we show that HAPPIER leads to better organization of the embedding space and prevents the most severe failure cases of non-hierarchical methods. Our code is publicly available at: https://github.com/elias-ramzi/HAPPIER.
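
    To make the contrast concrete, the following sketch compares binary AP with a graded, hierarchy-aware variant in which an item's gain grows with how much of the label hierarchy it shares with the query. The weighting scheme is an illustrative assumption, not the paper's exact H-AP definition.

        # Binary AP vs. an illustrative hierarchy-aware AP variant.
        from typing import List, Sequence

        def average_precision(binary_rels: List[int]) -> float:
            """Standard AP over a ranked list of 0/1 relevance labels."""
            hits, precision_sum = 0, 0.0
            for rank, rel in enumerate(binary_rels, start=1):
                if rel:
                    hits += 1
                    precision_sum += hits / rank
            return precision_sum / hits if hits else 0.0

        def graded_relevance(query_path: Sequence[str], doc_path: Sequence[str]) -> float:
            """Gain as the fraction of hierarchy levels shared with the query label."""
            shared = 0
            for q, d in zip(query_path, doc_path):
                if q != d:
                    break
                shared += 1
            return shared / len(query_path)

        def hierarchical_ap(query_path, ranked_doc_paths) -> float:
            """AP-style score in which each retrieved item contributes a graded gain."""
            gains = [graded_relevance(query_path, p) for p in ranked_doc_paths]
            total = sum(gains)
            if total == 0.0:
                return 0.0
            score, cum_gain = 0.0, 0.0
            for rank, g in enumerate(gains, start=1):
                cum_gain += g
                score += g * (cum_gain / rank)    # graded analogue of precision@rank
            return score / total

        # A near-miss (same genus, different species) is penalised less than an
        # unrelated result; binary labels would treat both as total failures.
        query = ("vehicle", "car", "sedan")
        ranking = [("vehicle", "car", "coupe"), ("animal", "dog", "terrier")]
        print(average_precision([0, 0]))          # -> 0.0
        print(hierarchical_ap(query, ranking))    # -> ~0.67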

    A flexible framework for offline effectiveness metrics

    The use of offline effectiveness metrics is one of the cornerstones of evaluation in information retrieval. Static resources that include test collections and sets of topics, the corresponding relevance judgments connecting them, and metrics that map document rankings from a retrieval system to numeric scores have been used for multiple decades as an important way of comparing systems. The basis behind this experimental structure is that the metric score for a system can serve as a surrogate measurement for user satisfaction. Here we introduce a user behavior framework that extends the C/W/L family. The essence of the new framework - which we call C/W/L/A - is that the user actions undertaken while reading the ranking can be considered separately from the benefit each user will have derived by the time they exit the ranking. This split structure allows the great majority of current effectiveness metrics to be systematically categorized, and thus their relative properties and relationships to be better understood, while at the same time permitting a wide range of novel combinations to be considered. We then carry out experiments using relevance judgments, document rankings, and user satisfaction data from two distinct sources, comparing the patterns of metric scores generated, and showing that those metrics vary quite markedly in terms of their ability to predict user satisfaction.
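
    For context, the existing C/W/L view can be written down compactly: a continuation probability C(i) determines how deep a user browses, the induced weights W(i) say how much attention each rank receives, and the metric is the expected gain under those weights. The sketch below uses RBP's constant-persistence continuation as a worked example; the aggregation ("A") component proposed in the paper is not modeled here.

        # C/W/L-style expected gain for a ranking, with an RBP-like continuation.
        from typing import Callable, List

        def weights_from_continuation(C: Callable[[int], float], depth: int) -> List[float]:
            """W(i) is proportional to the probability of reaching rank i (1-based)."""
            reach = [1.0]
            for i in range(1, depth):
                reach.append(reach[-1] * C(i))
            total = sum(reach)
            return [r / total for r in reach]

        def expected_gain(rels: List[float], C: Callable[[int], float]) -> float:
            """Expected rate of gain: sum over ranks of W(i) * relevance(i)."""
            W = weights_from_continuation(C, len(rels))
            return sum(w * r for w, r in zip(W, rels))

        # RBP with persistence 0.8: the user moves to the next rank with constant
        # probability, so the rank weights decay geometrically.
        print(expected_gain([1, 0, 1, 0, 0], lambda rank: 0.8))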

    Can Portable EEG Headsets be Used to Determine if Students are Learning?

    This study examined EEGs recorded from a single-channel, portable EEG headset (NeuroSky MindWave) during the study period of a paired-associate word paradigm which used Swahili words and their English meanings. It was hypothesized that there would be a significant difference in gamma, theta, and beta band powers between words that students later recalled correctly and words they did not recall correctly on a subsequent test. There were 35 participants, all students who volunteered at the University of Memphis (20 females and 15 males; 31 were right-handed and 4 were left-handed). A paired-samples t-test suggested that there was a higher mean z-score for brainwave activity during the study period in the high gamma range (41-49.75 Hz) when participants did not recall words correctly on the test, which is the opposite of what previous research has found regarding encoding. Based on the results of this study, the MindWave seems to capture muscle activity and/or saccadic behavior, as suggested by higher gamma maximums on average in the study period for word pairs which resulted in failed recall. Exploratory results may lend insight into future work using portable EEG devices. This study's main objective was to determine if portable EEG devices could be used to determine when students learn new information. Further testing, especially using other portable EEG devices, is necessary to answer this question.
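
    A minimal sketch of the paired comparison described above, using SciPy: per-participant mean high-gamma z-scores during study, split by whether the word pair was later recalled. The arrays are synthetic placeholders, not the study's data.

        # Paired-samples t-test on synthetic per-participant band-power z-scores.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        n_participants = 35
        # Placeholder data: not-recalled items are given a slightly higher mean,
        # mirroring the direction of the reported effect.
        recalled = rng.normal(loc=0.0, scale=1.0, size=n_participants)
        not_recalled = rng.normal(loc=0.3, scale=1.0, size=n_participants)

        t_stat, p_value = stats.ttest_rel(not_recalled, recalled)
        print(f"paired t = {t_stat:.2f}, p = {p_value:.3f}")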

    DIR 2011: Dutch-Belgian Information Retrieval Workshop, Amsterdam


    Item Response Models of Probability Judgments: Application to a Geopolitical Forecasting Tournament

    In this article, we develop and study methods for evaluating forecasters and forecasting questions in dynamic environments. These methods, based on item response models, are useful in situations where items vary in difficulty and we wish to evaluate forecasters based on the difficulty of the items that they forecasted correctly. They are also useful when we need to compare forecasters who make predictions at different points in time or for different items. We first extend traditional models to handle subjective probabilities, and we then apply a specific model to geopolitical forecasts. We evaluate the model's ability to accommodate the data, compare the model's estimates of forecaster ability with estimates based on scoring rules, and externally validate the model's item estimates. We also highlight some shortcomings of the traditional models and discuss some further extensions. The analyses illustrate the models' potential for widespread use in forecasting and subjective probability evaluation.
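
    To fix ideas, the sketch below shows an item-response-style link between forecaster ability and question difficulty: the expected probability assigned to the correct outcome rises with ability and falls with difficulty. This 2PL-style form is an illustrative assumption, not the specific model fitted in the article.

        # Illustrative item-response-style model for probability judgments.
        import math

        def expected_forecast(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
            """Expected probability assigned to the correct outcome by a forecaster
            with ability `theta` on an item with the given difficulty and
            discrimination."""
            return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

        # The same forecaster is expected to do better on an easy question than
        # on a hard one.
        print(expected_forecast(theta=1.0, difficulty=-0.5))  # ~0.82
        print(expected_forecast(theta=1.0, difficulty=2.0))   # ~0.27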