Categorical, Ratio, and Professorial Data: The Case for Reciprocal Rank
Search engine results pages are usually abstracted as binary relevance
vectors and hence are categorical data, meaning that only a limited set of
operations is permitted, most notably tabulation of occurrence frequencies,
with determination of medians and averages not possible. To compare retrieval
systems it is thus usual to make use of a categorical-to-numeric effectiveness
mapping. A previous paper has argued that any desired categorical-to-numeric
mapping may be used, provided only that there is an argued connection between
each category of SERP and the score that is assigned to that category by the
mapping. Further, once that plausible connection has been established, then the
mapped values can be treated as real-valued observations on a ratio scale,
allowing the computation of averages. This article is written in support of
that point of view, and to respond to ongoing claims that SERP scores may only
be averaged if very restrictive conditions are imposed on the effectiveness
mapping.
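As a concrete illustration of the mapping the abstract defends, the following minimal Python sketch (function names and example SERPs are illustrative only) maps each binary relevance vector to its reciprocal rank and then averages the mapped values across topics.

```python
# Minimal sketch: reciprocal rank as a categorical-to-numeric mapping.
# Each SERP is abstracted as a binary relevance vector; the mapping assigns
# the score 1/r, where r is the rank of the first relevant document, and 0
# if no relevant document appears.  The mapped values are then averaged.

def reciprocal_rank(serp):
    """Map one binary relevance vector to a numeric score."""
    for rank, rel in enumerate(serp, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(serps):
    """Average the mapped scores over a set of SERPs (topics)."""
    return sum(reciprocal_rank(s) for s in serps) / len(serps)

if __name__ == "__main__":
    serps = [
        [0, 1, 0, 0],  # first relevant document at rank 2 -> 0.5
        [1, 0, 0, 0],  # first relevant document at rank 1 -> 1.0
        [0, 0, 0, 0],  # no relevant document              -> 0.0
    ]
    print(mean_reciprocal_rank(serps))  # 0.5
```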
Arithmetic coding revisited
Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed,
low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available
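The modular structure described above, separating probability estimation from coding, can be sketched as follows; the sketch charges each symbol its ideal code length of -log2(p) bits rather than implementing an actual arithmetic coder, and all class and function names are illustrative rather than taken from the paper's source code.

```python
# Sketch of a modular adaptive compression pipeline: an adaptive frequency
# model supplies symbol probabilities, and a separate "coder" component
# charges -log2(p) bits per symbol (the ideal code length an arithmetic
# coder approaches).  Not the Witten/Neal/Cleary coder or its revision.
import math

class AdaptiveModel:
    """Zero-order adaptive frequency model over a fixed alphabet."""
    def __init__(self, alphabet):
        self.freq = {s: 1 for s in alphabet}   # start every symbol at count 1
        self.total = len(alphabet)

    def probability(self, symbol):
        return self.freq[symbol] / self.total

    def update(self, symbol):
        self.freq[symbol] += 1
        self.total += 1

class IdealCoder:
    """Stands in for the arithmetic coder: accumulates -log2(p) bits."""
    def __init__(self):
        self.bits = 0.0

    def encode(self, probability):
        self.bits -= math.log2(probability)

def compress_cost(text):
    # In a real system the decoder would need the alphabet agreed in advance;
    # here it is taken from the message purely for illustration.
    model = AdaptiveModel(set(text))
    coder = IdealCoder()
    for symbol in text:
        coder.encode(model.probability(symbol))  # code the symbol, then...
        model.update(symbol)                     # ...adapt the model
    return coder.bits

if __name__ == "__main__":
    message = "abracadabra abracadabra"
    print(f"{compress_cost(message):.1f} bits vs {8 * len(message)} raw bits")
```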
Offline recommender system evaluation: Challenges and new directions
Offline evaluation is an essential complement to online experiments in the selection,
improvement, tuning, and deployment of recommender systems. Offline
methodologies for recommender system evaluation evolved from experimental
practice in Machine Learning (ML) and Information Retrieval (IR). However,
evaluating recommendations involves particularities that pose challenges to the
assumptions upon which the ML and IR methodologies were developed. We
recap and reflect on the development and current status of recommender system
evaluation, providing an updated perspective. With a focus on offline evaluation,
we review the adaptation of IR principles, procedures and metrics, and the
implications of those techniques when applied to recommender systems. At the
same time, we identify the singularities of recommendation that require different
responses, or involve specific new needs. In addition, we provide an overview
of important choices in the configuration of experiments that require particular
care and understanding; discuss broader perspectives of evaluation such as
recommendation value beyond accuracy; and survey open challenges such as
experimental biases, and the cyclic dimension of recommendation.
This work was partially supported by the Spanish Government (project
PID2019-108965GB-I00) and by the Australian Research Council (project
DP190101113).
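One common offline protocol adapted from IR practice can be sketched as follows; the hold-out layout, scoring function, and hit-rate@k metric are illustrative choices, not a procedure prescribed by the article.

```python
# Minimal sketch of an offline, IR-style evaluation loop for a recommender:
# hold out one interaction per user, rank the unseen candidate items with
# some scoring function, and report hit-rate@k.  The data layout and the
# popularity baseline are placeholders, not any specific system.
from typing import Callable, Dict, List

def hit_rate_at_k(
    train: Dict[str, List[str]],         # user -> observed items
    held_out: Dict[str, str],            # user -> single withheld item
    catalog: List[str],                  # all recommendable items
    score: Callable[[str, str], float],  # score(user, item) -> relevance
    k: int = 10,
) -> float:
    hits = 0
    for user, target in held_out.items():
        seen = set(train.get(user, []))
        candidates = [i for i in catalog if i not in seen]  # exclude seen items
        ranked = sorted(candidates, key=lambda i: score(user, i), reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(held_out)

if __name__ == "__main__":
    train = {"u1": ["a", "b"], "u2": ["c"]}
    held_out = {"u1": "c", "u2": "a"}
    catalog = ["a", "b", "c", "d"]
    popularity = {"a": 3, "b": 1, "c": 2, "d": 0}
    # A popularity baseline stands in for a real recommender model.
    print(hit_rate_at_k(train, held_out, catalog,
                        lambda u, i: popularity[i], k=2))
```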
cwl_eval: An evaluation tool for information retrieval
We present a tool ("cwl_eval") which unifies many metrics typically used to evaluate information retrieval systems using test collections. In the C/W/L framework metrics are specified via a single function which can be used to derive a number of related measurements: Expected Utility per item, Expected Total Utility, Expected Cost per item, Expected Total Cost, and Expected Depth. The C/W/L framework brings together several independent approaches for measuring the quality of a ranked list, and provides a coherent user model-based framework for developing measures based on utility (gain) and cost.

Here we outline the C/W/L measurement framework; describe the cwl_eval architecture; and provide examples of how to use it. We provide implementations of a number of recent metrics, including Time Biased Gain, U-Measure, Bejewelled Measure, and the Information Foraging Based Measure, as well as previous metrics such as Precision, Average Precision, Discounted Cumulative Gain, Rank-Biased Precision, and INST. By providing state-of-the-art and traditional metrics within the same framework, we promote a standardised approach to evaluating search effectiveness.
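The shape of the framework can be sketched as follows, assuming the usual C/W/L relationships in which a continuation probability C(i) induces the weight function W(i) and stopping distribution L(i); the truncation at the ranking depth and the constant-continuation example (which corresponds to Rank-Biased Precision) are simplifications for illustration, and this is not the cwl_eval implementation itself.

```python
# Sketch of C/W/L-style measurements derived from a continuation function
# C(i), truncated at the length of the supplied gain vector.  Uses the usual
# formulation: V(i) = prod_{j<i} C(j), W(i) = V(i) / sum_k V(k), and
# L(i) = V(i) * (1 - C(i)).  Illustrative only; not the cwl_eval code.
from itertools import accumulate

def cwl_measurements(gains, continuation, cost_per_item=1.0):
    n = len(gains)
    V = [1.0]                                  # V(1) = 1 by definition
    for rank in range(1, n):
        V.append(V[-1] * continuation(rank))   # V(i+1) = V(i) * C(i)
    total_v = sum(V)
    W = [v / total_v for v in V]               # attention weights, sum to 1
    L = [V[i] * (1.0 - continuation(i + 1)) for i in range(n)]  # stop at rank i+1

    cum_gain = list(accumulate(gains))
    past_end = V[-1] * continuation(n)         # probability of viewing beyond rank n

    eu_per_item = sum(w * g for w, g in zip(W, gains))
    e_total_utility = (sum(l * c for l, c in zip(L, cum_gain))
                       + past_end * cum_gain[-1])
    e_depth = total_v                          # sum_i V(i), truncated at n
    return {
        "expected utility per item": eu_per_item,
        "expected total utility": e_total_utility,
        "expected cost per item": cost_per_item,       # constant cost assumed
        "expected total cost": cost_per_item * e_depth,
        "expected depth": e_depth,
    }

if __name__ == "__main__":
    # A constant continuation probability corresponds to Rank-Biased Precision:
    # with deeper rankings, expected utility per item approaches
    # (1 - p) * sum_i p^(i-1) * gain_i.
    p, gains = 0.8, [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
    print(cwl_measurements(gains, lambda rank: p))
```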
An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric
Many evaluation metrics have been defined to evaluate the effectiveness of
ad-hoc retrieval and search result diversification systems. However, it is
often unclear which evaluation metric should be used to analyze the performance
of retrieval systems given a specific task. Axiomatic analysis is an
informative mechanism to understand the fundamentals of metrics and their
suitability for particular scenarios. In this paper, we define a
constraint-based axiomatic framework to study the suitability of existing
metrics in search result diversification scenarios. The analysis informed the
definition of Rank-Biased Utility (RBU) -- an adaptation of the well-known
Rank-Biased Precision metric -- that takes into account redundancy and the user
effort associated with the inspection of documents in the ranking. Our
experiments over standard diversity evaluation campaigns show that the proposed
metric captures quality criteria reflected by different metrics, being suitable
in the absence of knowledge about particular features of the scenario under
study.
Comment: Original version: 10 pages. Preprint of full paper to appear at
SIGIR'18: The 41st International ACM SIGIR Conference on Research &
Development in Information Retrieval, July 8-12, 2018, Ann Arbor, MI, USA.
ACM, New York, NY, USA.
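For orientation only, the sketch below shows the general shape of such a metric: a Rank-Biased Precision style geometric discount applied to a redundancy-aware gain, with a per-document effort term subtracted. The discount, gain, and effort values are placeholders; the exact RBU definition is the one given in the paper.

```python
# Illustrative sketch (not the paper's exact RBU definition): an RBP-style
# geometric discount over a redundancy-aware gain, minus a per-document
# inspection effort.  The subtopic sets, p, and effort constant are made up.

def redundancy_aware_score(ranking, p=0.8, effort=0.01):
    """ranking: list of sets, each holding the subtopics a document covers."""
    seen = set()
    score = 0.0
    for i, subtopics in enumerate(ranking, start=1):
        novel = subtopics - seen               # only new subtopics add gain
        gain = len(novel) / max(len(subtopics), 1)
        score += (p ** (i - 1)) * ((1 - p) * gain - effort)
        seen |= subtopics
    return score

if __name__ == "__main__":
    diverse = [{"a"}, {"b"}, {"c"}]            # each document adds a new subtopic
    redundant = [{"a"}, {"a"}, {"a"}]          # later documents are redundant
    print(redundancy_aware_score(diverse), redundancy_aware_score(redundant))
```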
Bootstrapping Generalization of Process Models Discovered From Event Data
Process mining studies ways to derive value from process executions recorded
in event logs of IT-systems, with process discovery the task of inferring a
process model for an event log emitted by some unknown system. One quality
criterion for discovered process models is generalization. Generalization seeks
to quantify how well the discovered model describes future executions of the
system, and is perhaps the least understood quality criterion in process
mining. The lack of understanding is primarily a consequence of generalization
seeking to measure properties over the entire future behavior of the system,
when the only available sample of behavior is that provided by the event log
itself. In this paper, we draw inspiration from computational statistics, and
employ a bootstrap approach to estimate properties of a population based on a
sample. Specifically, we define an estimator of the model's generalization
based on the event log it was discovered from, and then use bootstrapping to
measure the generalization of the model with respect to the system, and its
statistical significance. Experiments demonstrate the feasibility of the
approach in industrial settings.
Comment: 8 pages.
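The bootstrap step can be sketched independently of process mining: resample the event log (a multiset of traces) with replacement, recompute the statistic of interest on each resample, and read off a confidence interval; the distinct-trace statistic below is a placeholder for the paper's generalization estimator.

```python
# Generic bootstrap sketch: estimate the sampling distribution of a statistic
# computed on an event log (a multiset of traces) by resampling the log with
# replacement.  The statistic used here (distinct-trace ratio) is a stand-in
# for the paper's generalization estimator.
import random

def bootstrap(log, statistic, resamples=1000, seed=0):
    rng = random.Random(seed)
    values = []
    for _ in range(resamples):
        resample = [rng.choice(log) for _ in log]   # same size, with replacement
        values.append(statistic(resample))
    values.sort()
    lo, hi = values[int(0.025 * resamples)], values[int(0.975 * resamples) - 1]
    return sum(values) / resamples, (lo, hi)        # point estimate and 95% CI

def distinct_trace_ratio(log):
    return len({tuple(t) for t in log}) / len(log)

if __name__ == "__main__":
    log = [["a", "b", "c"]] * 6 + [["a", "c", "b"]] * 3 + [["a", "b", "b", "c"]]
    print(bootstrap(log, distinct_trace_ratio))
```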
An Entropic Relevance Measure for Stochastic Conformance Checking in Process Mining
Given an event log as a collection of recorded real-world process traces,
process mining aims to automatically construct a process model that is both
simple and provides a useful explanation of the traces. Conformance checking
techniques are then employed to characterize and quantify commonalities and
discrepancies between the log's traces and the candidate models. Recent
approaches to conformance checking acknowledge that the elements being compared
are inherently stochastic - for example, some traces occur frequently and
others infrequently - and seek to incorporate this knowledge in their analyses.
Here we present an entropic relevance measure for stochastic conformance
checking, computed as the average number of bits required to compress each of
the log's traces, based on the structure and information about relative
likelihoods provided by the model. The measure penalizes traces from the event
log not captured by the model and traces described by the model but absent in
the event log, thus addressing both precision and recall quality criteria at
the same time. We further show that entropic relevance is computable in time
linear in the size of the log, and provide evaluation outcomes that demonstrate
the feasibility of using the new approach in industrial settings.
Comment: 8 pages. Postprint version of the ICPM 2020 paper.
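As a rough illustration of the bits-per-trace idea, the sketch below averages the cost of describing each trace in a log under a model that assigns trace probabilities, charging traces the model does not capture via a simple fallback; that fallback is a placeholder, since the paper defines its own treatment of uncovered traces.

```python
# Rough illustration of a bits-per-trace score: the average number of bits
# needed to describe each trace in the log under a model that assigns trace
# probabilities.  The uniform fallback cost for traces the model does not
# capture is a placeholder, not the paper's exact formulation.
import math
from collections import Counter

def average_bits_per_trace(log, model_prob, fallback_bits):
    """log: list of traces (tuples of activity labels);
    model_prob: dict mapping captured traces to their model probability."""
    counts = Counter(log)
    total = sum(counts.values())
    bits = 0.0
    for trace, freq in counts.items():
        p = model_prob.get(trace, 0.0)
        cost = -math.log2(p) if p > 0.0 else fallback_bits(trace)
        bits += freq * cost
    return bits / total        # average bits per trace in the log

if __name__ == "__main__":
    log = [("a", "b", "c")] * 7 + [("a", "c", "b")] * 2 + [("a", "d")]
    model = {("a", "b", "c"): 0.7, ("a", "c", "b"): 0.3}
    # Charge uncaptured traces a crude per-event cost over a 4-symbol alphabet.
    print(average_bits_per_trace(log, model, lambda t: 2.0 * len(t)))
```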