User Fairness in Recommender Systems
Recent works in recommender systems have focused on diversity in recommendations as an important aspect of recommendation quality. In this work we argue that post-processing algorithms aimed solely at improving diversity among recommendations lead to discrimination among the users. We introduce the notion of user fairness, which has been overlooked in the literature so far, and propose measures to quantify it. Our experiments on two diversification algorithms show that an increase in aggregate diversity results in increased disparity among the users.
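The measures themselves are not spelled out in the abstract; as one hedged illustration, disparity among users could be tracked with an inequality index such as the Gini coefficient over per-user utilities. The metric choice and the numbers below are assumptions, not the paper's definitions.

```python
# A minimal sketch (assumed metric, not necessarily the paper's): disparity
# among users measured as the Gini coefficient of per-user recommendation
# utility before and after a diversification post-processor.

def gini(values):
    """Gini coefficient of non-negative per-user scores (0 = equality)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))  # rank-weighted sum
    return (2 * weighted) / (n * total) - (n + 1) / n

# Hypothetical per-user utilities: diversification raises aggregate diversity
# but spreads utility unevenly across users.
before = [0.82, 0.79, 0.85, 0.80, 0.78]
after = [0.90, 0.40, 0.95, 0.35, 0.88]

print(f"disparity before: {gini(before):.3f}")
print(f"disparity after:  {gini(after):.3f}")
```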
Boilerplate Removal using a Neural Sequence Labeling Model
The extraction of main content from web pages is an important task for
numerous applications, ranging from usability aspects, like reader views for
news articles in web browsers, to information retrieval or natural language
processing. Existing approaches fall short because they rely on large numbers of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack generalization power. We propose a neural sequence labeling
model that does not rely on any hand-crafted features but takes only the HTML
tags and words that appear in a web page as input. This allows us to present a
browser extension which highlights the content of arbitrary web pages directly
within the browser using our model. In addition, we create a new, more current
dataset to show that our model is able to adapt to changes in the structure of
web pages and outperform the state-of-the-art model.
Comment: WWW20 Demo paper
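As a rough sketch of the model class described above (not BoilerNet's exact architecture), a bidirectional LSTM can label a page's sequence of text nodes as content or boilerplate from hashed word and ancestor-tag features; the vocabulary size, feature hashing, and dimensions below are assumptions.

```python
# Rough sketch of the model class (not BoilerNet's exact architecture): a
# bidirectional LSTM labels each text node of a page as content (1) or
# boilerplate (0) from hashed word and ancestor-tag counts.
import torch
import torch.nn as nn

VOCAB = 2048  # hashed word/tag vocabulary size (assumption)

def node_features(words, tags):
    """Bag of hashed words and ancestor HTML tags for one text node."""
    x = torch.zeros(VOCAB)
    for tok in words + tags:
        x[hash(tok) % VOCAB] += 1.0
    return x

class SequenceLabeler(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(VOCAB, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 1)          # per-node content logit

    def forward(self, nodes):                        # (batch, seq, VOCAB)
        h, _ = self.lstm(nodes)
        return self.out(h).squeeze(-1)               # (batch, seq) logits

# One toy page: three text nodes with their words and ancestor tags.
page = torch.stack([
    node_features(["breaking", "news", "story"], ["div", "p"]),
    node_features(["subscribe", "now"], ["footer", "a"]),
    node_features(["the", "full", "article", "text"], ["article", "p"]),
])
print(torch.sigmoid(SequenceLabeler()(page.unsqueeze(0))))  # untrained probs
```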
Efficient and Explainable Neural Ranking
The recent availability of increasingly powerful hardware has caused a shift from traditional information retrieval (IR) approaches based on term matching, which remained the state of the art for several decades, to large pre-trained neural language models. These neural rankers achieve substantial improvements in performance, as their complexity and extensive pre-training give them the ability to understand natural language in a way that classical term-matching approaches cannot. As a result, neural rankers go beyond term matching by performing relevance estimation based on the semantics of queries and documents.
However, these improvements in performance do not come without sacrifice. In this thesis, we focus on two fundamental challenges of neural ranking models, specifically ones based on large language models: On the one hand, due to their complexity, the models are inefficient; they require considerable amounts of computational power, which often comes in the form of specialized hardware, such as GPUs or TPUs. Consequently, the carbon footprint of neural IR systems is an increasingly important concern. This effect is amplified when low latency is required, as in, for example, web search. On the other hand, neural models are known for being inherently unexplainable; in other words, it is often not comprehensible to humans why a neural model produced a specific output. In general, explainability is deemed important in order to identify undesired behavior, such as bias.
We tackle the efficiency challenge of neural rankers by proposing Fast-Forward indexes, which are simple vector forward indexes that heavily utilize pre-computation techniques. Our approach substantially reduces the computational load during query processing, enabling efficient ranking solely on CPUs without requiring hardware acceleration. Furthermore, we introduce BERT-DMN to show that the training efficiency of neural rankers can be improved by training only parts of the model.
In order to improve the explainability of neural ranking, we propose the Select-and-Rank paradigm to make ranking models explainable by design: First, a query-dependent subset of the input document is extracted to serve as an explanation; second, the ranking model makes its decision based only on the extracted subset, rather than the complete document. We show that our models exhibit performance similar to models that are not explainable by design and conduct a user study to determine the faithfulness of the explanations.
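To make the two-step idea concrete, here is a minimal sketch of the Select-and-Rank control flow; the lexical-overlap selector and toy scorer below are stand-ins for the learned components used in the thesis.

```python
# Hedged sketch of Select-and-Rank: the ranker only ever sees a
# query-dependent extract of the document, and that extract doubles as the
# explanation. The lexical-overlap selector/scorer are illustrative stand-ins.

def select(query, document, k=2):
    """Extract the k sentences sharing the most terms with the query."""
    q_terms = set(query.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return sorted(sentences,
                  key=lambda s: len(q_terms & set(s.lower().split())),
                  reverse=True)[:k]

def rank_score(query, selected):
    """Relevance score computed from the selected subset only."""
    q_terms = set(query.lower().split())
    return sum(len(q_terms & set(s.lower().split())) for s in selected)

doc = ("Neural rankers estimate relevance from semantics. "
       "The weather was mild in June. "
       "Pre-trained language models understand queries and documents.")
explanation = select("neural relevance estimation", doc)
print("explanation:", explanation)           # what the user is shown
print("score:", rank_score("neural relevance estimation", explanation))
```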
Finally, we introduce BoilerNet, a web content extraction technique that allows the removal of boilerplate from web pages, leaving only the main content in plain text. Our method requires no feature engineering and can be used to aid in the process of creating new document corpora from the web.
Data Augmentation for Sample Efficient and Robust Document Ranking
Contextual ranking models have delivered impressive performance improvements
over classical models in the document ranking task. However, these highly
over-parameterized models tend to be data-hungry and require large amounts of
data even for fine-tuning. In this paper, we propose data-augmentation methods
for effective and robust ranking performance. One of the key benefits of using
data augmentation is in achieving sample efficiency or learning effectively
when we have only a small amount of training data. We propose supervised and
unsupervised data augmentation schemes by creating training data using parts of
the relevant documents in the query-document pairs. We then adapt a family of
contrastive losses for the document ranking task that can exploit the augmented
data to learn an effective ranking model. Our extensive experiments on subsets
of the MS MARCO and TREC-DL test sets show that data augmentation, along with
the ranking-adapted contrastive losses, results in performance improvements
under most dataset sizes. Apart from sample efficiency, we conclusively show
that data augmentation results in robust models when transferred to
out-of-domain benchmarks. Our performance improvements on in-domain and, more prominently, on out-of-domain benchmarks show that augmentation regularizes the ranking model and improves its robustness and generalization capability.
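The core augmentation step, creating extra positives from parts of the relevant documents, can be sketched as follows; the sliding-window segmentation, window size, and stride are illustrative assumptions rather than the paper's exact scheme.

```python
# Hedged sketch of the supervised augmentation idea: slice each relevant
# document into parts and treat every part as an extra positive for its
# query. Window and stride are illustrative, not the paper's exact scheme.

def augment(query, relevant_doc, window=8, stride=4):
    """Yield (query, passage, label=1) training pairs from document slices."""
    tokens = relevant_doc.split()
    for start in range(0, max(1, len(tokens) - window + 1), stride):
        yield (query, " ".join(tokens[start:start + window]), 1)

doc = ("contextual rankers improve document ranking quality but remain "
       "data-hungry and need large amounts of labeled training examples "
       "even when they are only being fine-tuned")
for q, passage, label in augment("neural document ranking", doc):
    print(label, "|", passage)
```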
Efficient Neural Ranking using Forward Indexes and Lightweight Encoders
Dual-encoder-based dense retrieval models have become the standard in IR.
They employ large Transformer-based language models, which are notoriously
inefficient in terms of resources and latency. We propose Fast-Forward indexes
-- vector forward indexes which exploit the semantic matching capabilities of
dual-encoder models for efficient and effective re-ranking. Our framework
enables re-ranking at very high retrieval depths and combines the merits of
both lexical and semantic matching via score interpolation. Furthermore, in
order to mitigate the limitations of dual-encoders, we tackle two main
challenges: Firstly, we improve computational efficiency by either
pre-computing representations, avoiding unnecessary computations altogether, or
reducing the complexity of encoders. This allows us to considerably improve
ranking efficiency and latency. Secondly, we optimize the memory footprint and
maintenance cost of indexes; we propose two complementary techniques to reduce
the index size and show that, by dynamically dropping irrelevant document
tokens, the index maintenance efficiency can be improved substantially. We
perform an evaluation to show the effectiveness and efficiency of Fast-Forward indexes -- our method has low latency and achieves competitive results without the need for hardware acceleration, such as GPUs.
Comment: Accepted at ACM TOIS. arXiv admin note: text overlap with arXiv:2110.0605
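The essence of the approach can be sketched in a few lines: document representations live in a pre-computed forward index, so query processing reduces to one query encoding, index look-ups, and a lexical/semantic score interpolation. The toy vectors, identifiers, and interpolation weight alpha below are assumptions for illustration.

```python
# Sketch of Fast-Forward-style re-ranking: document embeddings are
# pre-computed in a forward index, so query time costs one query encoding,
# index look-ups, and score interpolation. Vectors, ids, and alpha are toy
# assumptions.
import numpy as np

forward_index = {                      # doc id -> pre-computed embedding
    "d1": np.array([0.9, 0.1, 0.3]),
    "d2": np.array([0.2, 0.8, 0.5]),
}
bm25 = {"d1": 12.4, "d2": 10.1}        # scores from first-stage retrieval

def rerank(query_vec, candidates, alpha=0.5):
    """Interpolate normalized lexical scores with semantic dot products."""
    max_bm25 = max(bm25[d] for d in candidates)
    scores = {d: alpha * bm25[d] / max_bm25
                 + (1 - alpha) * float(forward_index[d] @ query_vec)
              for d in candidates}     # no GPU needed: look-up + dot product
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

query_vec = np.array([0.3, 0.7, 0.4])  # the only encoding done at query time
print(rerank(query_vec, ["d1", "d2"]))
```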
Fair Near Neighbor Search: Independent Range Sampling in High Dimensions
Similarity search is a fundamental algorithmic primitive, widely used in many
computer science disciplines. There are several variants of the similarity
search problem, and one of the most relevant is the $r$-near neighbor ($r$-NN) problem: given a radius $r$ and a set of points $S$, construct a data structure that, for any given query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance $r$ from the query should have the same probability of being returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for $r$-NN where all points in $S$ that are near $q$ have the same probability of being selected and returned by the query. Specifically, we
first propose a black-box approach that, given any LSH scheme, constructs a
data structure for uniformly sampling points in the neighborhood of a query.
Then, we develop a data structure for fair similarity search under inner
product that requires nearly-linear space and exploits locality sensitive
filters. The paper concludes with an experimental evaluation that highlights
(un)fairness in a recommendation setting on real-world datasets and discusses
the inherent unfairness introduced by solving other variants of the problem.
Comment: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS), pages 191-204, June 2020
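A simplified rendering of the black-box construction (the paper's data structures are considerably more refined): sample a candidate from the query's LSH buckets in proportion to its collision count, then cancel that bias by rejection, so every $r$-near neighbor is returned with equal probability. The grid hash family in the demo is an assumption.

```python
# Hedged sketch of the black-box idea: pick a candidate with probability
# proportional to its collision count m(p), accept with probability 1/m(p),
# so each r-near neighbor of the query is equally likely to be returned.
import random

def fair_near_neighbor(tables, hashes, q, points, r, dist, max_tries=1000):
    """tables[i] maps hash value -> list of point ids; hashes[i] is the
    hash function of table i (any LSH family can be plugged in)."""
    buckets = [t.get(h(q), []) for t, h in zip(tables, hashes)]
    pool = [p for b in buckets for p in b]  # multiset union over all tables
    if not pool:
        return None
    for _ in range(max_tries):
        p = random.choice(pool)             # P(p) proportional to m(p)
        if dist(points[p], q) > r:
            continue                        # LSH false positive: reject
        m = sum(b.count(p) for b in buckets)
        if random.random() < 1.0 / m:       # undo the m(p) bias -> uniform
            return p
    return None

# Toy demo: 1-dimensional points with randomly shifted grid hashes.
points = [0.1, 0.3, 0.35, 0.9, 2.5]
shifts = [random.random() for _ in range(4)]
hashes = [lambda x, s=s: int((x + s) / 0.5) for s in shifts]
tables = []
for h in hashes:
    t = {}
    for i, x in enumerate(points):
        t.setdefault(h(x), []).append(i)
    tables.append(t)
print(fair_near_neighbor(tables, hashes, q=0.3, points=points, r=0.2,
                         dist=lambda a, b: abs(a - b)))
```

The expected number of attempts grows with the fraction of far points among the query's collisions, which is exactly the quantity LSH is designed to keep small.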
Supervised Contrastive Learning Approach for Contextual Ranking
Contextual ranking models have delivered impressive performance improvements
over classical models in the document ranking task. However, these highly
over-parameterized models tend to be data-hungry and require large amounts of
data even for fine-tuning. This paper proposes a simple yet effective method to
improve ranking performance on smaller datasets using supervised contrastive
learning for the document ranking problem. We perform data augmentation by
creating training data using parts of the relevant documents in the
query-document pairs. We then use a supervised contrastive learning objective
to learn an effective ranking model from the augmented dataset. Our experiments
on subsets of the TREC-DL dataset show that, although data augmentation increases the training data size, it does not necessarily improve performance under existing pointwise or pairwise training objectives. However, our proposed supervised contrastive loss objective leads to performance improvements over the standard non-augmented setting, showcasing the utility of data augmentation using contrastive losses. Finally, we demonstrate the real benefit of supervised contrastive learning objectives through marked improvements on smaller ranking datasets relating to news (Robust04), finance (FiQA), and scientific fact checking (SciFact).
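A minimal sketch of such a supervised contrastive objective for ranking is given below; the in-batch negative setup, temperature, and tensor shapes are illustrative assumptions, not the paper's exact training configuration.

```python
# Minimal sketch of a supervised contrastive loss for ranking (assumed setup,
# not the paper's exact configuration): passages that belong to the same
# query are positives for each other; all other in-batch passages are
# negatives.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(q_emb, p_emb, labels, tau=0.1):
    """q_emb, p_emb: (B, d) query/passage embeddings; labels[i] identifies
    the query of passage i, so augmented passages can share one label."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    sim = q @ p.T / tau                                   # (B, B) logits
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)   # row log-softmax
    # Average log-likelihood over each anchor's positives (SupCon-style).
    return (-(log_prob * pos).sum(1) / pos.sum(1)).mean()

labels = torch.tensor([0, 0, 1, 2])  # two augmented passages for query 0
loss = supervised_contrastive_loss(torch.randn(4, 8), torch.randn(4, 8), labels)
print(loss.item())
```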
Multiple-bias modelling for analysis of observational data
Conventional analytic results do not reflect any source of uncertainty other than random error, and as a result readers must rely on informal judgments regarding the effect of possible biases. When standard errors are small, these judgments often fail to capture sources of uncertainty and their interactions adequately. Multiple-bias models provide alternatives that allow one to systematically integrate major sources of uncertainty, and thus to provide better input to research planning and policy analysis. Typically, the bias parameters in the model are not identified by the analysis data, and so the results depend completely on priors for those parameters. A Bayesian analysis is then natural, but several alternatives based on sensitivity analysis have appeared in the risk assessment and epidemiologic literature. Under some circumstances these methods approximate a Bayesian analysis and can be modified to do so even better. These points are illustrated with a pooled analysis of case-control studies of residential magnetic field exposure and childhood leukaemia, which highlights the diminishing value of conventional studies conducted after the early 1990s. It is argued that multiple-bias modelling should become part of the core training of anyone who will be entrusted with the analysis of observational data, and should become standard procedure when random error is not the only important source of uncertainty (as in meta-analysis and pooled analysis).
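As a concrete illustration of the kind of analysis advocated here, the following sketch performs a simple Monte Carlo (probabilistic) bias analysis for a single unmeasured confounder: bias parameters are drawn from priors, the observed odds ratio is corrected, and random error is re-introduced. The priors, the observed estimate, and the bias-factor inputs are invented for illustration and are not taken from the pooled leukaemia analysis.

```python
# Hedged sketch of Monte Carlo (probabilistic) bias analysis for an
# unmeasured confounder, in the spirit of multiple-bias modelling. All
# numbers are illustrative assumptions.
import random, math, statistics

random.seed(0)
log_or_obs, se = math.log(1.7), 0.15  # observed OR and its standard error

corrected = []
for _ in range(10_000):
    rr_cd = math.exp(random.gauss(math.log(2.0), 0.3))  # confounder-disease RR
    p1 = random.uniform(0.2, 0.6)   # confounder prevalence among exposed
    p0 = random.uniform(0.1, 0.4)   # ... among unexposed
    bias = (p1 * (rr_cd - 1) + 1) / (p0 * (rr_cd - 1) + 1)
    # Correct for confounding, then re-introduce random error.
    corrected.append(math.exp(random.gauss(log_or_obs - math.log(bias), se)))

corrected.sort()
print("median OR:", round(statistics.median(corrected), 2))
print("95% interval:", round(corrected[249], 2), "-", round(corrected[9749], 2))
```

The resulting interval is typically wider and shifted relative to the conventional one, which is exactly the point of the paper: when bias parameters are not identified by the data, the priors placed on them drive the conclusions.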