158 research outputs found

    Categorical, Ratio, and Professorial Data: The Case for Reciprocal Rank

    Search engine results pages are usually abstracted as binary relevance vectors and hence are categorical data, meaning that only a limited set of operations is permitted, most notably tabulation of occurrence frequencies, with determination of medians and averages not possible. To compare retrieval systems it is thus usual to make use of a categorical-to-numeric effectiveness mapping. A previous paper has argued that any desired categorical-to-numeric mapping may be used, provided only that there is an argued connection between each category of SERP and the score that is assigned to that category by the mapping. Further, once that plausible connection has been established, then the mapped values can be treated as real-valued observations on a ratio scale, allowing the computation of averages. This article is written in support of that point of view, and to respond to ongoing claims that SERP scores may only be averaged if very restrictive conditions are imposed on the effectiveness mapping.
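
    As a concrete illustration of the kind of mapping being defended, the sketch below (in Python, and not taken from the paper) treats each SERP as a binary relevance vector, applies the reciprocal-rank mapping to obtain a numeric score per topic, and then averages those scores.

    def reciprocal_rank(serp):
        """Map a binary relevance vector to a numeric score: 1/r for the
        first relevant item at rank r (1-based), or 0.0 if none is relevant."""
        for rank, rel in enumerate(serp, start=1):
            if rel:
                return 1.0 / rank
        return 0.0

    serps = [
        [0, 1, 0, 0],   # first relevant document at rank 2 -> 0.5
        [1, 0, 0, 0],   # relevant document at rank 1 -> 1.0
        [0, 0, 0, 0],   # no relevant document -> 0.0
    ]

    scores = [reciprocal_rank(s) for s in serps]
    print(scores)                     # [0.5, 1.0, 0.0]
    print(sum(scores) / len(scores))  # mean reciprocal rank = 0.5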

    Arithmetic coding revisited

    Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available.
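
    The didactic sketch below shows only the interval-narrowing idea behind arithmetic coding, using floating point for clarity; the coder described in the article instead uses low-precision integer arithmetic with shift/add renormalisation, none of which is reproduced here.

    def encode(symbols, probs):
        """Narrow [low, high) once per symbol; any number inside the final
        interval identifies the message, given the same model and length."""
        cum, ranges = 0.0, {}
        for s, p in probs.items():      # cumulative probability ranges
            ranges[s] = (cum, cum + p)
            cum += p

        low, high = 0.0, 1.0
        for s in symbols:
            span = high - low
            s_low, s_high = ranges[s]
            high = low + span * s_high
            low = low + span * s_low
        return (low + high) / 2         # any value in [low, high) would do

    model = {"a": 0.6, "b": 0.3, "c": 0.1}   # fixed toy model
    print(encode("aab", model))              # ~0.27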

    Offline recommender system evaluation: Challenges and new directions

    Offline evaluation is an essential complement to online experiments in the selection, improvement, tuning, and deployment of recommender systems. Offline methodologies for recommender system evaluation evolved from experimental practice in Machine Learning (ML) and Information Retrieval (IR). However, evaluating recommendations involves particularities that pose challenges to the assumptions upon which the ML and IR methodologies were developed. We recap and reflect on the development and current status of recommender system evaluation, providing an updated perspective. With a focus on offline evaluation, we review the adaptation of IR principles, procedures and metrics, and the implications of those techniques when applied to recommender systems. At the same time, we identify the singularities of recommendation that require different responses, or involve specific new needs. In addition, we provide an overview of important choices in the configuration of experiments that require particular care and understanding; discuss broader perspectives of evaluation such as recommendation value beyond accuracy; and survey open challenges such as experimental biases, and the cyclic dimension of recommendation. This work was partially supported by the Spanish Government (project PID2019-108965GB-I00) and by the Australian Research Council (project DP190101113).
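
    A minimal sketch of the kind of IR-style offline evaluation discussed above, using hypothetical data rather than anything from the paper: ranked recommendations are scored against held-out user interactions with precision@k, and the scores are averaged over users.

    def precision_at_k(recommended, held_out, k=5):
        """Fraction of the top-k recommended items found in the user's
        held-out interactions."""
        hits = sum(1 for item in recommended[:k] if item in held_out)
        return hits / k

    # hypothetical per-user recommendations and held-out test interactions
    recommendations = {"u1": ["i3", "i7", "i1", "i9", "i4"],
                       "u2": ["i2", "i5", "i8", "i6", "i0"]}
    test_interactions = {"u1": {"i7", "i4"}, "u2": {"i9"}}

    scores = [precision_at_k(recommendations[u], test_interactions[u])
              for u in recommendations]
    print(sum(scores) / len(scores))   # mean precision@5 over users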

    cwl_eval: An evaluation tool for information retrieval

    We present a tool (“cwl_eval”) which unifies many metrics typically used to evaluate information retrieval systems using test collections. In the C/W/L framework metrics are specified via a single function which can be used to derive a number of related measurements: Expected Utility per item, Expected Total Utility, Expected Cost per item, Expected Total Cost, and Expected Depth. The C/W/L framework brings together several independent approaches for measuring the quality of a ranked list, and provides a coherent user model-based framework for developing measures based on utility (gain) and cost. Here we outline the C/W/L measurement framework; describe the cwl_eval architecture; and provide examples of how to use it. We provide implementations of a number of recent metrics, including Time Biased Gain, U-Measure, Bejewelled Measure, and the Information Foraging Based Measure, as well as previous metrics such as Precision, Average Precision, Discounted Cumulative Gain, Rank-Biased Precision, and INST. By providing state-of-the-art and traditional metrics within the same framework, we promote a standardised approach to evaluating search effectiveness.
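
    The sketch below illustrates the C/W/L idea rather than cwl_eval's own implementation: a metric is specified through a continuation probability C(i), from which per-rank weights, the expected depth, and the expected utility per item all follow. Sums are truncated at the judged ranking length, and a constant continuation ("patience") probability p recovers Rank-Biased Precision.

    def cwl(gains, C):
        """Given per-rank gains and a continuation function C(i) (1-based),
        return expected utility per item, expected total utility, expected depth."""
        n = len(gains)
        V = [1.0]                        # V(i): probability of viewing rank i
        for i in range(1, n):
            V.append(V[-1] * C(i))
        depth = sum(V)                   # expected depth (truncated at n)
        W = [v / depth for v in V]       # per-rank weights, summing to 1
        eu = sum(w * g for w, g in zip(W, gains))    # expected utility per item
        etu = sum(v * g for v, g in zip(V, gains))   # expected total utility
        return eu, etu, depth

    p = 0.8                              # RBP-style constant patience
    print(cwl([1, 0, 1, 0, 0, 1], C=lambda i: p))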

    An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric

    Many evaluation metrics have been defined to evaluate the effectiveness of ad-hoc retrieval and search result diversification systems. However, it is often unclear which evaluation metric should be used to analyze the performance of retrieval systems given a specific task. Axiomatic analysis is an informative mechanism to understand the fundamentals of metrics and their suitability for particular scenarios. In this paper, we define a constraint-based axiomatic framework to study the suitability of existing metrics in search result diversification scenarios. The analysis informed the definition of Rank-Biased Utility (RBU) -- an adaptation of the well-known Rank-Biased Precision metric -- that takes into account redundancy and the user effort associated with the inspection of documents in the ranking. Our experiments over standard diversity evaluation campaigns show that the proposed metric captures quality criteria reflected by different metrics, being suitable in the absence of knowledge about particular features of the scenario under study. Comment: Original version: 10 pages. Preprint of full paper to appear at SIGIR'18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, July 8-12, 2018, Ann Arbor, MI, USA. ACM, New York, NY, USA.
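
    Because the exact RBU formulation is given in the paper and not reproduced here, the sketch below shows only the base metric it adapts, Rank-Biased Precision; RBU additionally discounts redundant subtopic coverage in the gain and charges a per-document inspection effort.

    def rbp(rels, p=0.8):
        """RBP = (1 - p) * sum over ranks k of p^(k-1) * rel_k."""
        return (1 - p) * sum(rel * p ** k for k, rel in enumerate(rels))

    print(rbp([1, 0, 1, 1, 0], p=0.8))   # 0.2 * (1 + 0.64 + 0.512) -> 0.4304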

    Bootstrapping Generalization of Process Models Discovered From Event Data

    Process mining studies ways to derive value from process executions recorded in event logs of IT systems, with process discovery being the task of inferring a process model for an event log emitted by some unknown system. One quality criterion for discovered process models is generalization. Generalization seeks to quantify how well the discovered model describes future executions of the system, and is perhaps the least understood quality criterion in process mining. The lack of understanding is primarily a consequence of generalization seeking to measure properties over the entire future behavior of the system, when the only available sample of behavior is that provided by the event log itself. In this paper, we draw inspiration from computational statistics, and employ a bootstrap approach to estimate properties of a population based on a sample. Specifically, we define an estimator of the model's generalization based on the event log it was discovered from, and then use bootstrapping to measure the generalization of the model with respect to the system, and its statistical significance. Experiments demonstrate the feasibility of the approach in industrial settings. Comment: 8 pages.
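
    A generic bootstrap sketch of the resampling step described above; the paper's actual generalization estimator is not reproduced, so the estimator below is a hypothetical placeholder (the fraction of distinct traces in a resampled log).

    import random

    def bootstrap(log, estimator, resamples=1000, seed=0):
        """Bootstrap distribution of `estimator` over logs resampled with replacement."""
        rng = random.Random(seed)
        stats = []
        for _ in range(resamples):
            sample = [rng.choice(log) for _ in range(len(log))]
            stats.append(estimator(sample))
        return stats

    # hypothetical event log: each trace is a tuple of activity labels
    log = [("a", "b", "c"), ("a", "c"), ("a", "b", "b", "c"), ("a", "b", "c")]

    # hypothetical stand-in estimator: fraction of distinct traces in the sample
    estimator = lambda sample: len(set(sample)) / len(sample)

    stats = sorted(bootstrap(log, estimator))
    print(stats[25], stats[975])   # approximate 95% bootstrap interval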

    An Entropic Relevance Measure for Stochastic Conformance Checking in Process Mining

    Given an event log as a collection of recorded real-world process traces, process mining aims to automatically construct a process model that is both simple and provides a useful explanation of the traces. Conformance checking techniques are then employed to characterize and quantify commonalities and discrepancies between the log's traces and the candidate models. Recent approaches to conformance checking acknowledge that the elements being compared are inherently stochastic - for example, some traces occur frequently and others infrequently - and seek to incorporate this knowledge in their analyses. Here we present an entropic relevance measure for stochastic conformance checking, computed as the average number of bits required to compress each of the log's traces, based on the structure and information about relative likelihoods provided by the model. The measure penalizes traces from the event log not captured by the model and traces described by the model but absent in the event log, thus addressing both precision and recall quality criteria at the same time. We further show that entropic relevance is computable in time linear in the size of the log, and provide evaluation outcomes that demonstrate the feasibility of using the new approach in industrial settings. Comment: 8 pages. Postprint version of the ICPM 2020 paper.
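
    A simplified sketch of the bits-per-trace idea, not the paper's algorithm: the stochastic model is flattened to a trace-probability table, relevance is the frequency-weighted average of -log2 P(trace | model), and a fixed hypothetical penalty stands in for the paper's more careful treatment of traces the model cannot replay.

    import math
    from collections import Counter

    def entropic_relevance(log, model_probs, miss_cost=32.0):
        """Average number of bits needed to encode each trace in the log
        under the given trace-probability table."""
        counts = Counter(log)
        total = sum(counts.values())
        bits = 0.0
        for trace, freq in counts.items():
            p = model_probs.get(trace, 0.0)
            cost = -math.log2(p) if p > 0 else miss_cost   # penalise unreplayable traces
            bits += freq * cost
        return bits / total

    # hypothetical log and model
    log = [("a", "b", "c")] * 6 + [("a", "c")] * 3 + [("a", "b", "b", "c")]
    model_probs = {("a", "b", "c"): 0.7, ("a", "c"): 0.3}
    print(entropic_relevance(log, model_probs))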