76 research outputs found
Monitoring Electoral Violence through Social Media: A Machine Learning Approach
No abstract available
Pytrec_eval: An Extremely Fast Python Interface to trec_eval
We introduce pytrec_eval, a Python interface to the tree_eval information
retrieval evaluation toolkit. pytrec_eval exposes the reference implementations
of trec_eval within Python as a native extension. We show that pytrec_eval is
around one order of magnitude faster than invoking trec_eval as a sub process
from within Python. Compared to a native Python implementation of NDCG,
pytrec_eval is twice as fast for practically-sized rankings. Finally, we
demonstrate its effectiveness in an application where pytrec_eval is combined
with Pyndri and the OpenAI Gym where query expansion is learned using
Q-learning.Comment: SIGIR '18. The 41st International ACM SIGIR Conference on Research &
Development in Information Retrieva
Using Word Embeddings in Twitter Election Classification
Word embeddings and convolutional neural networks (CNN)
have attracted extensive attention in various classification
tasks for Twitter, e.g. sentiment classification. However,
the effect of the configuration used to train and generate
the word embeddings on the classification performance has
not been studied in the existing literature. In this paper,
using a Twitter election classification task that aims to detect
election-related tweets, we investigate the impact of
the background dataset used to train the embedding models,
the context window size and the dimensionality of word
embeddings on the classification performance. By comparing
the classification results of two word embedding models,
which are trained using different background corpora
(e.g. Wikipedia articles and Twitter microposts), we show
that the background data type should align with the Twitter
classification dataset to achieve a better performance. Moreover,
by evaluating the results of word embeddings models
trained using various context window sizes and dimensionalities,
we found that large context window and dimension
sizes are preferable to improve the performance. Our experimental
results also show that using word embeddings and
CNN leads to statistically significant improvements over various
baselines such as random, SVM with TF-IDF and SVM
with word embeddings
Optimizing Ranking Models in an Online Setting
Online Learning to Rank (OLTR) methods optimize ranking models by directly
interacting with users, which allows them to be very efficient and responsive.
All OLTR methods introduced during the past decade have extended on the
original OLTR method: Dueling Bandit Gradient Descent (DBGD). Recently, a
fundamentally different approach was introduced with the Pairwise
Differentiable Gradient Descent (PDGD) algorithm. To date the only comparisons
of the two approaches are limited to simulations with cascading click models
and low levels of noise. The main outcome so far is that PDGD converges at
higher levels of performance and learns considerably faster than DBGD-based
methods. However, the PDGD algorithm assumes cascading user behavior,
potentially giving it an unfair advantage. Furthermore, the robustness of both
methods to high levels of noise has not been investigated. Therefore, it is
unclear whether the reported advantages of PDGD over DBGD generalize to
different experimental conditions. In this paper, we investigate whether the
previous conclusions about the PDGD and DBGD comparison generalize from ideal
to worst-case circumstances. We do so in two ways. First, we compare the
theoretical properties of PDGD and DBGD, by taking a critical look at
previously proven properties in the context of ranking. Second, we estimate an
upper and lower bound on the performance of methods by simulating both ideal
user behavior and extremely difficult behavior, i.e., almost-random
non-cascading user models. Our findings show that the theoretical bounds of
DBGD do not apply to any common ranking model and, furthermore, that the
performance of DBGD is substantially worse than PDGD in both ideal and
worst-case circumstances. These results reproduce previously published findings
about the relative performance of PDGD vs. DBGD and generalize them to
extremely noisy and non-cascading circumstances.Comment: European Conference on Information Retrieval (ECIR) 201
Building High-Quality Datasets for Information Retrieval Evaluation at a Reduced Cost
[Abstract] Information Retrieval is not any more exclusively about document ranking. Continuously new tasks are proposed on this and sibling fields. With this proliferation of tasks, it becomes crucial to have a cheap way of constructing test collections to evaluate the new developments. Building test collections is time and resource consuming: it requires time to obtain the documents, to define the user needs and it requires the assessors to judge a lot of documents. To reduce the latest, pooling strategies aim to decrease the assessment effort by presenting to the assessors a sample of documents in the corpus with the maximum number of relevant documents in it. In this paper, we propose the preliminary design of different techniques to easily and cheapily build high-quality test collections without the need of having participants systems.Ministerio de Ciencia, Innovación y Universidades; RTI2018-093336-B-C22Xunta de Galicia; GPC ED431B 2019/03Xunta de Galicia; ED431G/0
Active Sampling for Large-scale Information Retrieval Evaluation
Evaluation is crucial in Information Retrieval. The development of models,
tools and methods has significantly benefited from the availability of reusable
test collections formed through a standardized and thoroughly tested
methodology, known as the Cranfield paradigm. Constructing these collections
requires obtaining relevance judgments for a pool of documents, retrieved by
systems participating in an evaluation task; thus involves immense human labor.
To alleviate this effort different methods for constructing collections have
been proposed in the literature, falling under two broad categories: (a)
sampling, and (b) active selection of documents. The former devises a smart
sampling strategy by choosing only a subset of documents to be assessed and
inferring evaluation measure on the basis of the obtained sample; the sampling
distribution is being fixed at the beginning of the process. The latter
recognizes that systems contributing documents to be judged vary in quality,
and actively selects documents from good systems. The quality of systems is
measured every time a new document is being judged. In this paper we seek to
solve the problem of large-scale retrieval evaluation combining the two
approaches. We devise an active sampling method that avoids the bias of the
active selection methods towards good systems, and at the same time reduces the
variance of the current sampling approaches by placing a distribution over
systems, which varies as judgments become available. We validate the proposed
method using TREC data and demonstrate the advantages of this new method
compared to past approaches
Unbiased Comparative Evaluation of Ranking Functions
Eliciting relevance judgments for ranking evaluation is labor-intensive and
costly, motivating careful selection of which documents to judge. Unlike
traditional approaches that make this selection deterministically,
probabilistic sampling has shown intriguing promise since it enables the design
of estimators that are provably unbiased even when reusing data with missing
judgments. In this paper, we first unify and extend these sampling approaches
by viewing the evaluation problem as a Monte Carlo estimation task that applies
to a large number of common IR metrics. Drawing on the theoretical clarity that
this view offers, we tackle three practical evaluation scenarios: comparing two
systems, comparing systems against a baseline, and ranking systems. For
each scenario, we derive an estimator and a variance-optimizing sampling
distribution while retaining the strengths of sampling-based evaluation,
including unbiasedness, reusability despite missing data, and ease of use in
practice. In addition to the theoretical contribution, we empirically evaluate
our methods against previously used sampling heuristics and find that they
generally cut the number of required relevance judgments at least in half.Comment: Under review; 10 page
Differentiable Unbiased Online Learning to Rank
Online Learning to Rank (OLTR) methods optimize rankers based on user
interactions. State-of-the-art OLTR methods are built specifically for linear
models. Their approaches do not extend well to non-linear models such as neural
networks. We introduce an entirely novel approach to OLTR that constructs a
weighted differentiable pairwise loss after each interaction: Pairwise
Differentiable Gradient Descent (PDGD). PDGD breaks away from the traditional
approach that relies on interleaving or multileaving and extensive sampling of
models to estimate gradients. Instead, its gradient is based on inferring
preferences between document pairs from user clicks and can optimize any
differentiable model. We prove that the gradient of PDGD is unbiased w.r.t.
user document pair preferences. Our experiments on the largest publicly
available Learning to Rank (LTR) datasets show considerable and significant
improvements under all levels of interaction noise. PDGD outperforms existing
OLTR methods both in terms of learning speed as well as final convergence.
Furthermore, unlike previous OLTR methods, PDGD also allows for non-linear
models to be optimized effectively. Our results show that using a neural
network leads to even better performance at convergence than a linear model. In
summary, PDGD is an efficient and unbiased OLTR approach that provides a better
user experience than previously possible.Comment: Conference on Information and Knowledge Management 201
- …