An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric
Many evaluation metrics have been defined to evaluate the effectiveness of
ad-hoc retrieval and search result diversification systems. However, it is
often unclear which evaluation metric should be used to analyze the performance
of retrieval systems given a specific task. Axiomatic analysis is an
informative mechanism to understand the fundamentals of metrics and their
suitability for particular scenarios. In this paper, we define a
constraint-based axiomatic framework to study the suitability of existing
metrics in search result diversification scenarios. The analysis informed the
definition of Rank-Biased Utility (RBU) -- an adaptation of the well-known
Rank-Biased Precision metric -- that takes into account redundancy and the user
effort associated with the inspection of documents in the ranking. Our
experiments over standard diversity evaluation campaigns show that the proposed
metric captures quality criteria reflected by different metrics, being suitable
in the absence of knowledge about particular features of the scenario under
study.
Comment: Original version: 10 pages. Preprint of full paper to appear at
SIGIR'18: The 41st International ACM SIGIR Conference on Research &
Development in Information Retrieval, July 8-12, 2018, Ann Arbor, MI, USA.
ACM, New York, NY, US
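As background for the abstract above, the Rank-Biased Precision metric that RBU adapts can be sketched as below. This is the commonly cited RBP definition, not the RBU formula, which the abstract does not state; the function name and the default persistence value are illustrative assumptions.

```python
def rank_biased_precision(relevances, persistence=0.8):
    """Commonly cited RBP form: (1 - p) * sum over ranks k of r_k * p^(k-1).

    relevances:  relevance gains in [0, 1], ordered from rank 1 downwards.
    persistence: assumed probability p that the user moves on to the
                 next document (0.8 is an illustrative default only).
    """
    # enumerate() starts at 0, which matches the exponent k - 1 for rank k.
    return (1 - persistence) * sum(
        rel * persistence ** position
        for position, rel in enumerate(relevances)
    )
```

A lower persistence concentrates the score on the top ranks, so the same relevant document contributes less the further down the ranking it appears.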
Answer Summarization for Technical Queries: Benchmark and New Approach
Prior studies have demonstrated that approaches to generate an answer summary
for a given technical query in Software Question and Answer (SQA) sites are
desired. We find that existing approaches are assessed solely through user
studies. There is a need for a benchmark with ground truth summaries to
complement assessment through user studies. Unfortunately, such a benchmark is
non-existent for answer summarization for technical queries from SQA sites. To
fill the gap, we manually construct a high-quality benchmark to enable
automatic evaluation of answer summarization for technical queries for SQA
sites. Using the benchmark, we comprehensively evaluate the performance of
existing approaches and find that there is still considerable room for improvement.
Motivated by the results, we propose a new approach TechSumBot with three key
modules: 1) Usefulness Ranking module, 2) Centrality Estimation module, and 3)
Redundancy Removal module. We evaluate TechSumBot in both automatic (i.e.,
using our benchmark) and manual (i.e., via a user study) manners. The results
from both evaluations consistently demonstrate that TechSumBot outperforms the
best performing baseline approaches from both SE and NLP domains by a large
margin, i.e., 10.83%-14.90%, 32.75%-36.59%, and 12.61%-17.54%, in terms of
ROUGE-1, ROUGE-2, and ROUGE-L on automatic evaluation, and 5.79%-9.23% and
17.03%-17.68%, in terms of average usefulness and diversity score on human
evaluation. This highlights that the automatic evaluation of our benchmark can
uncover findings similar to the ones found through user studies. More
importantly, automatic evaluation has a much lower cost, especially when it is
used to assess a new approach. We also conducted an ablation
study, which demonstrates that each module in TechSumBot contributes to
boosting its overall performance.
Comment: Accepted by ASE 202
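To illustrate the kind of automatic score reported above, a minimal sketch of ROUGE-1 recall (clipped unigram overlap against a reference summary) follows. It assumes whitespace tokenisation and lowercasing, and it is not the evaluation code used for the benchmark; the function name is hypothetical.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also occur in the candidate."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not ref_counts:
        return 0.0
    # Counter intersection keeps the minimum count per token, so a word
    # repeated in the candidate is credited at most as often as it
    # appears in the reference (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    return overlap / sum(ref_counts.values())
```

ROUGE-2 follows the same pattern with bigrams in place of unigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.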
Choosing effective methods for design diversity - How to progress from intuition to science
Design diversity is a popular defence against design faults in safety critical systems. Design diversity is at times pursued by simply isolating the development teams of the different versions, but it is presumably better to "force" diversity, by appropriate prescriptions to the teams. There are many ways of forcing diversity. Yet, managers who have to choose a cost-effective combination of these have little guidance except their own intuition. We argue the need for more scientifically based recommendations, and outline the problems with producing them. We focus on what we think is the standard basis for most recommendations: the belief that, in order to produce failure diversity among versions, project decisions should aim at causing "diversity" among the faults in the versions. We attempt to clarify what these beliefs mean, in which cases they may be justified and how they can be checked or disproved experimentally
Information-theoretic measures of music listening behaviour
We present an information-theoretic approach to the measurement of users’ music
listening behaviour and selection of music features. Existing ethnographic studies of
music use have guided the design of music retrieval systems; however, they are
typically qualitative and exploratory in nature. We introduce the SPUD dataset, comprising 10,000
hand-made playlists, with user and audio stream metadata. With this, we illustrate the use of
entropy for analysing music listening behaviour, e.g. identifying when a user changed music
retrieval system. We then develop an approach to identifying music features that reflect users’
criteria for playlist curation, rejecting features that are independent of user behaviour. The
dataset and the code used to produce it are made available. The techniques described support a
quantitative yet user-centred approach to the evaluation of music features and retrieval systems,
without assuming objective ground truth labels
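The use of entropy mentioned in the abstract above can be illustrated with a minimal sketch that computes the Shannon entropy of a sequence of listening events. The choice of artist names as the observed feature and the function name are assumptions for illustration; this is not the code released with the SPUD dataset.

```python
import math
from collections import Counter


def shannon_entropy(events):
    """Shannon entropy, in bits, of the empirical distribution of events."""
    counts = Counter(events)
    total = sum(counts.values())
    # Each distinct event contributes -p * log2(p), where p is its
    # observed relative frequency; an empty sequence yields 0.
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


# Hypothetical listening history: one artist label per played track.
history = ["artist_a", "artist_a", "artist_b", "artist_b"]
bits = shannon_entropy(history)
```

Computed over a sliding window of a user's history, a sustained shift in this value is one way a change in listening behaviour could show up.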
HitFraud: A Broad Learning Approach for Collective Fraud Detection in Heterogeneous Information Networks
On electronic game platforms, different payment transactions have different
levels of risk. Risk is generally higher for digital goods in e-commerce.
However, it differs based on product and its popularity, the offer type
(packaged game, virtual currency to a game or subscription service), storefront
and geography. Existing fraud policies and models make decisions independently
for each transaction based on transaction attributes, payment velocities, user
characteristics, and other relevant information. However, suspicious
transactions may still evade detection and hence we propose a broad learning
approach leveraging a graph based perspective to uncover relationships among
suspicious transactions, i.e., inter-transaction dependency. Our focus is to
detect suspicious transactions by capturing common fraudulent behaviors that
would not be considered suspicious when being considered in isolation. In this
paper, we present HitFraud that leverages heterogeneous information networks
for collective fraud detection by exploring correlated and fast evolving
fraudulent behaviors. First, a heterogeneous information network is designed to
link entities of interest in the transaction database via different semantics.
Then, graph based features are efficiently discovered from the network
exploiting the concept of meta-paths, and decisions on frauds are made
collectively on test instances. Experiments on real-world payment transaction
data from Electronic Arts demonstrate that the prediction performance is
effectively boosted by HitFraud with fast convergence where the computation of
meta-path based features is largely optimized. Notably, recall can be improved
up to 7.93% and F-score 4.62% compared to baselines.
Comment: ICDM 201
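The general idea of a meta-path based feature can be sketched as follows, assuming a single hypothetical meta-path of the form transaction, shared entity, transaction. The entity type, the feature, and all function names are illustrative assumptions and do not reproduce the HitFraud feature set or its collective inference procedure.

```python
from collections import defaultdict


def metapath_neighbors(txn_to_entity):
    """Link transactions that share an entity (txn -> entity -> txn)."""
    entity_to_txns = defaultdict(set)
    for txn, entity in txn_to_entity.items():
        entity_to_txns[entity].add(txn)
    # A transaction's neighbours are the other transactions that touch
    # the same entity; the transaction itself is excluded.
    return {
        txn: entity_to_txns[entity] - {txn}
        for txn, entity in txn_to_entity.items()
    }


def fraud_neighbor_ratio(txn, neighbors, labels):
    """Share of labelled meta-path neighbours that are marked as fraud (1)."""
    linked = [labels[n] for n in neighbors[txn] if n in labels]
    return sum(linked) / len(linked) if linked else 0.0
```

A feature like this lets a transaction that looks unremarkable in isolation inherit suspicion from the transactions it is linked to, which is the intuition behind making decisions collectively.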