User Fairness in Recommender Systems
Recent works in recommender systems have focused on diversity in recommendations as an important aspect of recommendation quality. In this work we argue that post-processing algorithms aimed solely at improving diversity among recommendations lead to discrimination among the users. We introduce the notion of user fairness, which has been overlooked in the literature so far, and propose measures to quantify it. Our experiments on two diversification algorithms show that an increase in aggregate diversity results in increased disparity among the users.
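The measures themselves are not spelled out in the abstract; as one hedged illustration, disparity among users could be tracked with an inequality index such as the Gini coefficient over per-user utilities. The metric choice and the numbers below are assumptions, not the paper's definitions.

```python
# A minimal sketch (assumed metric, not necessarily the paper's): disparity
# among users measured as the Gini coefficient of per-user recommendation
# utility before and after a diversification post-processor.

def gini(values):
    """Gini coefficient of non-negative per-user scores (0 = equality)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))  # rank-weighted sum
    return (2 * weighted) / (n * total) - (n + 1) / n

# Hypothetical per-user utilities: diversification raises aggregate diversity
# but spreads utility unevenly across users.
before = [0.82, 0.79, 0.85, 0.80, 0.78]
after = [0.90, 0.40, 0.95, 0.35, 0.88]

print(f"disparity before: {gini(before):.3f}")
print(f"disparity after:  {gini(after):.3f}")
```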
Boilerplate Removal using a Neural Sequence Labeling Model
The extraction of main content from web pages is an important task for
numerous applications, ranging from usability aspects, like reader views for
news articles in web browsers, to information retrieval or natural language
processing. Existing approaches fall short because they rely on large numbers of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack generalization power. We propose a neural sequence labeling
model that does not rely on any hand-crafted features but takes only the HTML
tags and words that appear in a web page as input. This allows us to present a
browser extension which highlights the content of arbitrary web pages directly
within the browser using our model. In addition, we create a new, more current
dataset to show that our model is able to adapt to changes in the structure of
web pages and outperform the state-of-the-art model.
Comment: WWW20 Demo paper
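As a rough sketch of the model class described above (not BoilerNet's exact architecture), a bidirectional LSTM can label a page's sequence of text nodes as content or boilerplate from hashed word and ancestor-tag features; the vocabulary size, feature hashing, and dimensions below are assumptions.

```python
# Rough sketch of the model class (not BoilerNet's exact architecture): a
# bidirectional LSTM labels each text node of a page as content (1) or
# boilerplate (0) from hashed word and ancestor-tag counts.
import torch
import torch.nn as nn

VOCAB = 2048  # hashed word/tag vocabulary size (assumption)

def node_features(words, tags):
    """Bag of hashed words and ancestor HTML tags for one text node."""
    x = torch.zeros(VOCAB)
    for tok in words + tags:
        x[hash(tok) % VOCAB] += 1.0
    return x

class SequenceLabeler(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(VOCAB, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 1)          # per-node content logit

    def forward(self, nodes):                        # (batch, seq, VOCAB)
        h, _ = self.lstm(nodes)
        return self.out(h).squeeze(-1)               # (batch, seq) logits

# One toy page: three text nodes with their words and ancestor tags.
page = torch.stack([
    node_features(["breaking", "news", "story"], ["div", "p"]),
    node_features(["subscribe", "now"], ["footer", "a"]),
    node_features(["the", "full", "article", "text"], ["article", "p"]),
])
print(torch.sigmoid(SequenceLabeler()(page.unsqueeze(0))))  # untrained probs
```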
Efficient and Explainable Neural Ranking
The recent availability of increasingly powerful hardware has caused a shift from traditional information retrieval (IR) approaches based on term matching, which remained the state of the art for several decades, to large pre-trained neural language models. These neural rankers achieve substantial improvements in performance, as their complexity and extensive pre-training give them the ability to understand natural language in a way that classical term-matching approaches cannot. As a result, neural rankers go beyond term matching by performing relevance estimation based on the semantics of queries and documents.
However, these improvements in performance do not come without sacrifice. In this thesis, we focus on two fundamental challenges of neural ranking models, specifically ones based on large language models: On the one hand, due to their complexity, the models are inefficient; they require considerable amounts of computational power, which often comes in the form of specialized hardware, such as GPUs or TPUs. Consequently, the carbon footprint of neural IR systems is an increasingly important concern. This effect is amplified when low latency is required, as in, for example, web search. On the other hand, neural models are known for being inherently unexplainable; in other words, it is often not comprehensible to humans why a neural model produced a specific output. In general, explainability is deemed important in order to identify undesired behavior, such as bias.
We tackle the efficiency challenge of neural rankers by proposing Fast-Forward indexes, which are simple vector forward indexes that heavily utilize pre-computation techniques. Our approach substantially reduces the computational load during query processing, enabling efficient ranking solely on CPUs without requiring hardware acceleration. Furthermore, we introduce BERT-DMN to show that the training efficiency of neural rankers can be improved by training only parts of the model.
In order to improve the explainability of neural ranking, we propose the Select-and-Rank paradigm to make ranking models explainable by design: First, a query-dependent subset of the input document is extracted to serve as an explanation; second, the ranking model makes its decision based only on the extracted subset, rather than the complete document. We show that our models exhibit performance similar to models that are not explainable by design and conduct a user study to determine the faithfulness of the explanations.
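To make the two-step idea concrete, here is a minimal sketch of the Select-and-Rank control flow; the lexical-overlap selector and toy scorer below are stand-ins for the learned components used in the thesis.

```python
# Hedged sketch of Select-and-Rank: the ranker only ever sees a
# query-dependent extract of the document, and that extract doubles as the
# explanation. The lexical-overlap selector/scorer are illustrative stand-ins.

def select(query, document, k=2):
    """Extract the k sentences sharing the most terms with the query."""
    q_terms = set(query.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return sorted(sentences,
                  key=lambda s: len(q_terms & set(s.lower().split())),
                  reverse=True)[:k]

def rank_score(query, selected):
    """Relevance score computed from the selected subset only."""
    q_terms = set(query.lower().split())
    return sum(len(q_terms & set(s.lower().split())) for s in selected)

doc = ("Neural rankers estimate relevance from semantics. "
       "The weather was mild in June. "
       "Pre-trained language models understand queries and documents.")
explanation = select("neural relevance estimation", doc)
print("explanation:", explanation)           # what the user is shown
print("score:", rank_score("neural relevance estimation", explanation))
```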
Finally, we introduce BoilerNet, a web content extraction technique that allows the removal of boilerplate from web pages, leaving only the main content in plain text. Our method requires no feature engineering and can be used to aid in the process of creating new document corpora from the web.
Data Augmentation for Sample Efficient and Robust Document Ranking
Contextual ranking models have delivered impressive performance improvements
over classical models in the document ranking task. However, these highly
over-parameterized models tend to be data-hungry and require large amounts of
data even for fine-tuning. In this paper, we propose data-augmentation methods
for effective and robust ranking performance. One of the key benefits of using
data augmentation is in achieving sample efficiency or learning effectively
when we have only a small amount of training data. We propose supervised and
unsupervised data augmentation schemes by creating training data using parts of
the relevant documents in the query-document pairs. We then adapt a family of
contrastive losses for the document ranking task that can exploit the augmented
data to learn an effective ranking model. Our extensive experiments on subsets
of the MS MARCO and TREC-DL test sets show that data augmentation, along with
the ranking-adapted contrastive losses, results in performance improvements
under most dataset sizes. Apart from sample efficiency, we conclusively show
that data augmentation results in robust models when transferred to
out-of-domain benchmarks. Our performance improvements on in-domain and, more prominently, on out-of-domain benchmarks show that augmentation regularizes the ranking model and improves its robustness and generalization capability.
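The core augmentation step, creating extra positives from parts of the relevant documents, can be sketched as follows; the sliding-window segmentation, window size, and stride are illustrative assumptions rather than the paper's exact scheme.

```python
# Hedged sketch of the supervised augmentation idea: slice each relevant
# document into parts and treat every part as an extra positive for its
# query. Window and stride are illustrative, not the paper's exact scheme.

def augment(query, relevant_doc, window=8, stride=4):
    """Yield (query, passage, label=1) training pairs from document slices."""
    tokens = relevant_doc.split()
    for start in range(0, max(1, len(tokens) - window + 1), stride):
        yield (query, " ".join(tokens[start:start + window]), 1)

doc = ("contextual rankers improve document ranking quality but remain "
       "data-hungry and need large amounts of labeled training examples "
       "even when they are only being fine-tuned")
for q, passage, label in augment("neural document ranking", doc):
    print(label, "|", passage)
```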
Efficient Neural Ranking using Forward Indexes and Lightweight Encoders
Dual-encoder-based dense retrieval models have become the standard in IR.
They employ large Transformer-based language models, which are notoriously
inefficient in terms of resources and latency. We propose Fast-Forward indexes
-- vector forward indexes which exploit the semantic matching capabilities of
dual-encoder models for efficient and effective re-ranking. Our framework
enables re-ranking at very high retrieval depths and combines the merits of
both lexical and semantic matching via score interpolation. Furthermore, in
order to mitigate the limitations of dual-encoders, we tackle two main
challenges: Firstly, we improve computational efficiency by either
pre-computing representations, avoiding unnecessary computations altogether, or
reducing the complexity of encoders. This allows us to considerably improve
ranking efficiency and latency. Secondly, we optimize the memory footprint and
maintenance cost of indexes; we propose two complementary techniques to reduce
the index size and show that, by dynamically dropping irrelevant document
tokens, the index maintenance efficiency can be improved substantially. We
perform an evaluation to show the effectiveness and efficiency of Fast-Forward indexes -- our method has low latency and achieves competitive results without the need for hardware acceleration, such as GPUs.
Comment: Accepted at ACM TOIS. arXiv admin note: text overlap with arXiv:2110.0605
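The essence of the approach can be sketched in a few lines: document representations live in a pre-computed forward index, so query processing reduces to one query encoding, index look-ups, and a lexical/semantic score interpolation. The toy vectors, identifiers, and interpolation weight alpha below are assumptions for illustration.

```python
# Sketch of Fast-Forward-style re-ranking: document embeddings are
# pre-computed in a forward index, so query time costs one query encoding,
# index look-ups, and score interpolation. Vectors, ids, and alpha are toy
# assumptions.
import numpy as np

forward_index = {                      # doc id -> pre-computed embedding
    "d1": np.array([0.9, 0.1, 0.3]),
    "d2": np.array([0.2, 0.8, 0.5]),
}
bm25 = {"d1": 12.4, "d2": 10.1}        # scores from first-stage retrieval

def rerank(query_vec, candidates, alpha=0.5):
    """Interpolate normalized lexical scores with semantic dot products."""
    max_bm25 = max(bm25[d] for d in candidates)
    scores = {d: alpha * bm25[d] / max_bm25
                 + (1 - alpha) * float(forward_index[d] @ query_vec)
              for d in candidates}     # no GPU needed: look-up + dot product
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

query_vec = np.array([0.3, 0.7, 0.4])  # the only encoding done at query time
print(rerank(query_vec, ["d1", "d2"]))
```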
Fair Near Neighbor Search: Independent Range Sampling in High Dimensions
Similarity search is a fundamental algorithmic primitive, widely used in many
computer science disciplines. There are several variants of the similarity
search problem, and one of the most relevant is the $r$-near neighbor ($r$-NN) problem: given a radius $r$ and a set of points $S$, construct a data structure that, for any given query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance $r$ from the query should have the same probability of being returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for $r$-NN where all points in $S$ that are near $q$ have the same probability of being selected and returned by the query. Specifically, we
first propose a black-box approach that, given any LSH scheme, constructs a
data structure for uniformly sampling points in the neighborhood of a query.
Then, we develop a data structure for fair similarity search under inner
product that requires nearly-linear space and exploits locality sensitive
filters. The paper concludes with an experimental evaluation that highlights
(un)fairness in a recommendation setting on real-world datasets and discusses
the inherent unfairness introduced by solving other variants of the problem.
Comment: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS), pages 191-204, June 2020
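A simplified rendering of the black-box construction (the paper's data structures are considerably more refined): sample a candidate from the query's LSH buckets in proportion to its collision count, then cancel that bias by rejection, so every $r$-near neighbor is returned with equal probability. The grid hash family in the demo is an assumption.

```python
# Hedged sketch of the black-box idea: pick a candidate with probability
# proportional to its collision count m(p), accept with probability 1/m(p),
# so each r-near neighbor of the query is equally likely to be returned.
import random

def fair_near_neighbor(tables, hashes, q, points, r, dist, max_tries=1000):
    """tables[i] maps hash value -> list of point ids; hashes[i] is the
    hash function of table i (any LSH family can be plugged in)."""
    buckets = [t.get(h(q), []) for t, h in zip(tables, hashes)]
    pool = [p for b in buckets for p in b]  # multiset union over all tables
    if not pool:
        return None
    for _ in range(max_tries):
        p = random.choice(pool)             # P(p) proportional to m(p)
        if dist(points[p], q) > r:
            continue                        # LSH false positive: reject
        m = sum(b.count(p) for b in buckets)
        if random.random() < 1.0 / m:       # undo the m(p) bias -> uniform
            return p
    return None

# Toy demo: 1-dimensional points with randomly shifted grid hashes.
points = [0.1, 0.3, 0.35, 0.9, 2.5]
shifts = [random.random() for _ in range(4)]
hashes = [lambda x, s=s: int((x + s) / 0.5) for s in shifts]
tables = []
for h in hashes:
    t = {}
    for i, x in enumerate(points):
        t.setdefault(h(x), []).append(i)
    tables.append(t)
print(fair_near_neighbor(tables, hashes, q=0.3, points=points, r=0.2,
                         dist=lambda a, b: abs(a - b)))
```

The expected number of attempts grows with the fraction of far points among the query's collisions, which is exactly the quantity LSH is designed to keep small.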
Supervised Contrastive Learning Approach for Contextual Ranking
Contextual ranking models have delivered impressive performance improvements
over classical models in the document ranking task. However, these highly
over-parameterized models tend to be data-hungry and require large amounts of
data even for fine-tuning. This paper proposes a simple yet effective method to
improve ranking performance on smaller datasets using supervised contrastive
learning for the document ranking problem. We perform data augmentation by
creating training data using parts of the relevant documents in the
query-document pairs. We then use a supervised contrastive learning objective
to learn an effective ranking model from the augmented dataset. Our experiments
on subsets of the TREC-DL dataset show that, although data augmentation increases the training data size, it does not necessarily improve performance under existing pointwise or pairwise training objectives. However, our proposed supervised contrastive loss objective leads to performance improvements over the standard non-augmented setting, showcasing the utility of data augmentation using contrastive losses. Finally, we demonstrate the real benefit of supervised contrastive learning objectives through marked improvements on smaller ranking datasets relating to news (Robust04), finance (FiQA), and scientific fact checking (SciFact).
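A minimal sketch of such a supervised contrastive objective for ranking is given below; the in-batch negative setup, temperature, and tensor shapes are illustrative assumptions, not the paper's exact training configuration.

```python
# Minimal sketch of a supervised contrastive loss for ranking (assumed setup,
# not the paper's exact configuration): passages that belong to the same
# query are positives for each other; all other in-batch passages are
# negatives.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(q_emb, p_emb, labels, tau=0.1):
    """q_emb, p_emb: (B, d) query/passage embeddings; labels[i] identifies
    the query of passage i, so augmented passages can share one label."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    sim = q @ p.T / tau                                   # (B, B) logits
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)   # row log-softmax
    # Average log-likelihood over each anchor's positives (SupCon-style).
    return (-(log_prob * pos).sum(1) / pos.sum(1)).mean()

labels = torch.tensor([0, 0, 1, 2])  # two augmented passages for query 0
loss = supervised_contrastive_loss(torch.randn(4, 8), torch.randn(4, 8), labels)
print(loss.item())
```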
Multiple-bias modelling for analysis of observational data
Conventional analytic results do not reflect any source of uncertainty other than random error, and as a result readers must rely on informal judgments regarding the effect of possible biases. When standard errors are small, these judgments often fail to capture sources of uncertainty and their interactions adequately. Multiple-bias models provide alternatives that allow one to systematically integrate major sources of uncertainty, and thus to provide better input to research planning and policy analysis. Typically, the bias parameters in the model are not identified by the analysis data, and so the results depend completely on priors for those parameters. A Bayesian analysis is then natural, but several alternatives based on sensitivity analysis have appeared in the risk assessment and epidemiologic literature. Under some circumstances these methods approximate a Bayesian analysis and can be modified to do so even better. These points are illustrated with a pooled analysis of case-control studies of residential magnetic field exposure and childhood leukaemia, which highlights the diminishing value of conventional studies conducted after the early 1990s. It is argued that multiple-bias modelling should become part of the core training of anyone who will be entrusted with the analysis of observational data, and should become standard procedure when random error is not the only important source of uncertainty (as in meta-analysis and pooled analysis).
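As a concrete illustration of the kind of analysis advocated here, the following sketch performs a simple Monte Carlo (probabilistic) bias analysis for a single unmeasured confounder: bias parameters are drawn from priors, the observed odds ratio is corrected, and random error is re-introduced. The priors, the observed estimate, and the bias-factor inputs are invented for illustration and are not taken from the pooled leukaemia analysis.

```python
# Hedged sketch of Monte Carlo (probabilistic) bias analysis for an
# unmeasured confounder, in the spirit of multiple-bias modelling. All
# numbers are illustrative assumptions.
import random, math, statistics

random.seed(0)
log_or_obs, se = math.log(1.7), 0.15  # observed OR and its standard error

corrected = []
for _ in range(10_000):
    rr_cd = math.exp(random.gauss(math.log(2.0), 0.3))  # confounder-disease RR
    p1 = random.uniform(0.2, 0.6)   # confounder prevalence among exposed
    p0 = random.uniform(0.1, 0.4)   # ... among unexposed
    bias = (p1 * (rr_cd - 1) + 1) / (p0 * (rr_cd - 1) + 1)
    # Correct for confounding, then re-introduce random error.
    corrected.append(math.exp(random.gauss(log_or_obs - math.log(bias), se)))

corrected.sort()
print("median OR:", round(statistics.median(corrected), 2))
print("95% interval:", round(corrected[249], 2), "-", round(corrected[9749], 2))
```

The resulting interval is typically wider and shifted relative to the conventional one, which is exactly the point of the paper: when bias parameters are not identified by the data, the priors placed on them drive the conclusions.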