186 research outputs found
Learning recommender systems from biased user interactions
Recommender systems have been widely deployed to help users quickly find what they need from a collection of items. Predominant recommendation methods rely on supervised learning models to predict user ratings on items or the probabilities of users interacting with items. In addition, reinforcement learning models are crucial in improving long-term user engagement within recommender systems. In practice, both of these recommendation methods are commonly trained on logged user interactions and are, therefore, subject to bias present in logged user interactions. This thesis concerns complex forms of bias in real-world user behaviors and aims to mitigate the effect of bias on reinforcement learning-based recommendation methods. The first part of the thesis consists of two research chapters, each dedicated to tackling a specific form of bias: dynamic selection bias and multifactorial bias. To mitigate the effect of dynamic selection bias and multifactorial bias, we propose a bias propensity estimation method for each. By incorporating the results from the bias propensity estimation methods, the widely used inverse propensity scoring-based debiasing method can be extended to correct for the corresponding bias. The second part of the thesis consists of two chapters that concern the effect of bias on reinforcement learning-based recommendation methods. Its first chapter focuses on mitigating the effect of bias on simulators, which enables the learning and evaluation of reinforcement learning-based recommendation methods. Its second chapter further explores different state encoders for reinforcement learning-based recommendation methods when learning and evaluating with the proposed debiased simulator.
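The core idea of inverse propensity scoring, which the thesis extends, can be illustrated with a minimal sketch (the function name and numbers below are illustrative, not the thesis's implementation): each logged interaction's loss is reweighted by the inverse of its estimated probability of being observed, so that over-represented interactions count for less.

```python
import numpy as np

def ips_loss(errors, propensities):
    """Average per-interaction loss, reweighted by the inverse of the
    estimated probability that each interaction was logged."""
    return np.mean(errors / propensities)

# Toy example: two interactions with equal raw error, the first twice
# as likely to have been logged as the second.
errors = np.array([0.4, 0.4])
propensities = np.array([0.8, 0.4])
print(ips_loss(errors, propensities))  # 0.75
```

In expectation over the logging process, this reweighting recovers the loss that would have been observed without selection bias, provided the propensities are estimated correctly, which is exactly where the thesis's propensity estimation methods come in.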
Autobidders with Budget and ROI Constraints: Efficiency, Regret, and Pacing Dynamics
We study a game between autobidding algorithms that compete in an online
advertising platform. Each autobidder is tasked with maximizing its
advertiser's total value over multiple rounds of a repeated auction, subject to
budget and/or return-on-investment constraints. We propose a gradient-based
learning algorithm that is guaranteed to satisfy all constraints and achieves
vanishing individual regret. Our algorithm uses only bandit feedback and can be
used with the first- or second-price auction, as well as with any
"intermediate" auction format. Our main result is that when these autobidders
play against each other, the resulting expected liquid welfare over all rounds
is at least half of the expected optimal liquid welfare achieved by any
allocation. This holds whether or not the bidding dynamics converge to an
equilibrium, and regardless of the correlation structure between advertiser
valuations.
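The budget-pacing component of such gradient-based bidding can be sketched with a generic dual-gradient scheme (a standard textbook construction consistent in spirit with, but not identical to, the paper's bandit-feedback algorithm; all names and numbers are illustrative): a multiplier on the budget constraint shades bids, and is updated by a gradient step on realized spend.

```python
# Generic dual-gradient budget pacing sketch (illustrative only).
# A dual multiplier mu shades the bid below the advertiser's value and
# is adjusted toward matching a per-round spend target.
def pace_bids(values, prices, budget, eta=0.1):
    mu = 0.0                        # dual multiplier on the budget constraint
    spend = 0.0
    bids = []
    target = budget / len(values)   # per-round spend target
    for v, p in zip(values, prices):
        bid = v / (1.0 + mu)        # higher mu -> more conservative bids
        cost = p if bid >= p else 0.0   # second-price style: pay p on a win
        spend += cost
        # Overspending raises mu (shrinks future bids); underspending lowers it.
        mu = max(0.0, mu + eta * (cost - target))
        bids.append(bid)
    return bids, spend

bids, spend = pace_bids([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], budget=1.5)
print(spend)  # 1.5
```

Projecting the multiplier back to zero when it would go negative keeps the constraint handling one-sided: the multiplier only binds when the budget is actually at risk.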
The Public Performance Of Sanctions In Insolvency Cases: The Dark, Humiliating, And Ridiculous Side Of The Law Of Debt In The Italian Experience. A Historical Overview Of Shaming Practices
This study provides a diachronic comparative overview of how the law of debt has been applied by certain institutions in Italy. Specifically, it offers historical and comparative insights into the public performance of sanctions for insolvency through shaming and customary practices in Roman Imperial Law, in the Middle Ages, and in later periods.
The first part of the essay focuses on the Roman bonorum cessio culo nudo super lapidem and on the medieval customary institution called pietra della vergogna (stone of shame), which originates from the Roman model.
The second part of the essay analyzes the social function of the zecca and the pittima Veneziana during the Republic of Venice, and of the practice of lu soldate a castighe (no translation is possible).
The author uses a functionalist approach to apply some arguments and concepts from the current context to this historical analysis of ancient institutions that we would now consider ridiculous.
The article shows that the customary norms that play a crucial regulatory role in online interactions today also operated in the public square of the past. One of these tools is shaming. As in contemporary online settings, shaming practices in the historic public square were used to enforce the rules of civility in a given community. Such practices can be seen as virtuous when they are deployed to pursue positive change in forces entrenched in the culture, and thus to address social wrongs considered beyond the reach of the law, or to address human rights abuses.
A Survey on Causal Reinforcement Learning
While Reinforcement Learning (RL) achieves tremendous success in sequential
decision-making problems across many domains, it still faces key challenges of
data inefficiency and lack of interpretability. Interestingly, many researchers
have recently leveraged insights from the causality literature, producing a
flourishing body of work that combines the merits of causality with RL to
address these challenges. It is therefore both necessary and valuable to
collate these Causal Reinforcement Learning (CRL) works, offer a review of CRL
methods, and investigate the potential contributions of causality to RL.
In particular, we divide existing CRL approaches into two categories according
to whether their causality-based information is given in advance or not. We
further analyze each category in terms of the formalization of different
models, including the Markov Decision Process (MDP), the Partially Observable
Markov Decision Process (POMDP), Multi-Armed Bandits (MAB), and the Dynamic
Treatment Regime (DTR). Moreover, we summarize the evaluation metrics and
open-source resources, and we discuss emerging applications along with
promising prospects for the future development of CRL.
Comment: 29 pages, 20 figures
Off-Policy Evaluation for Large Action Spaces via Policy Convolution
Developing accurate off-policy estimators is crucial for both evaluating and
optimizing new policies. The main challenge in off-policy estimation is the
distribution shift between the logging policy that generates data and the
target policy that we aim to evaluate. Typically, techniques for correcting
distribution shift involve some form of importance sampling. This approach
results in unbiased value estimation but often comes with the trade-off of high
variance, even in the simpler case of one-step contextual bandits. Furthermore,
importance sampling relies on the common support assumption, which becomes
impractical when the action space is large. To address these challenges, we
introduce the Policy Convolution (PC) family of estimators. These methods
leverage latent structure within actions -- made available through action
embeddings -- to strategically convolve the logging and target policies. This
convolution introduces a unique bias-variance trade-off, which can be
controlled by adjusting the amount of convolution. Our experiments on synthetic
and benchmark datasets demonstrate remarkable mean squared error (MSE)
improvements when using PC, especially when either the action space or policy
mismatch becomes large, with gains of up to 5 to 6 orders of magnitude over
existing estimators.
Comment: Under review. 36 pages, 31 figures
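The importance-sampling baseline that PC improves on can be sketched for the one-step contextual bandit case with a single context (the arrays below are illustrative, not from the paper): each logged reward is reweighted by the ratio of target to logging action probabilities.

```python
import numpy as np

# Minimal vanilla IPS sketch for a one-step bandit with one context.
def ips_value(actions, rewards, pi_target, pi_logging):
    """Importance-sampling estimate of the target policy's value
    from logged (action, reward) pairs."""
    w = pi_target[actions] / pi_logging[actions]  # importance weights
    return np.mean(w * rewards)

actions = np.array([0, 1])          # logged actions
rewards = np.array([1.0, 0.0])      # logged rewards
pi_logging = np.array([0.5, 0.5])   # uniform logging policy
pi_target = np.array([0.9, 0.1])    # target policy prefers action 0
print(ips_value(actions, rewards, pi_target, pi_logging))  # 0.9
```

The estimate is unbiased (here it equals the target's true value of 0.9), but the weights blow up when the target puts mass where the logging policy rarely acts, which is precisely the large-action-space variance problem motivating the paper.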
Off-Policy Evaluation of Ranking Policies under Diverse User Behavior
Ranking interfaces are everywhere in online platforms. There is thus an
ever-growing interest in their Off-Policy Evaluation (OPE), aiming at an
accurate performance evaluation of ranking policies using logged data. A
de-facto approach for OPE is Inverse Propensity Scoring (IPS), which provides
an unbiased and consistent value estimate. However, it becomes extremely
inaccurate in the ranking setup due to its high variance under large action
spaces. To deal with this problem, previous studies assume either independent
or cascade user behavior, resulting in some ranking versions of IPS. While
these estimators are somewhat effective in reducing the variance, all existing
estimators apply a single universal assumption to every user, causing excessive
bias and variance. Therefore, this work explores a far more general formulation
where user behavior is diverse and can vary depending on the user context. We
show that the resulting estimator, which we call Adaptive IPS (AIPS), can be
unbiased under any complex user behavior. Moreover, AIPS achieves the minimum
variance among all unbiased estimators based on IPS. We further develop a
procedure to identify the appropriate user behavior model to minimize the mean
squared error (MSE) of AIPS in a data-driven fashion. Extensive experiments
demonstrate that the empirical accuracy improvement can be significant,
enabling effective OPE of ranking systems even under diverse user behavior.
Comment: KDD 2023 Research track
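The independence assumption mentioned above yields an item-position style IPS variant in which each slot of the ranking is reweighted by its own marginal propensity; a minimal sketch (shapes and numbers are illustrative, not the paper's implementation):

```python
import numpy as np

# Independence-assumption IPS for rankings: reweight each slot's reward
# by the ratio of target to logging marginal propensities at that slot.
def iips_value(slot_rewards, p_target, p_logging):
    """All arguments have shape (n_logs, ranking_length)."""
    weights = p_target / p_logging             # per-slot importance weights
    return np.mean(np.sum(weights * slot_rewards, axis=1))

slot_rewards = np.array([[1.0, 0.0]])  # one logged ranking of length 2
p_logging = np.array([[0.25, 0.5]])    # marginal propensities under logging
p_target = np.array([[0.5, 0.5]])      # marginals under the target policy
print(iips_value(slot_rewards, p_target, p_logging))  # 2.0
```

Because the weights are per-slot marginals rather than joint probabilities over whole rankings, the variance is far lower than vanilla IPS; the cost is bias whenever real user behavior is not slot-independent, which is the mismatch AIPS addresses by adapting the behavior model to the user context.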
Active and Passive Causal Inference Learning
This paper serves as a starting point for machine learning researchers,
engineers and students who are interested in but not yet familiar with causal
inference. We start by laying out an important set of assumptions that are
collectively needed for causal identification, such as exchangeability,
positivity, consistency and the absence of interference. From these
assumptions, we build out a set of important causal inference techniques,
which we categorize into two buckets: active and passive approaches.
We describe and discuss randomized controlled trials and bandit-based
approaches from the active category. We then describe classical approaches,
such as matching and inverse probability weighting, in the passive category,
followed by more recent deep learning-based algorithms. We close by noting
some aspects of causal inference that this paper does not cover, such as
collider bias, and we expect the paper to provide readers with a diverse set
of starting points for further reading and research in causal inference and
discovery.
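Inverse probability weighting, one of the classical passive approaches named above, can be sketched in a few lines (the data and propensities below are illustrative): treated and control outcomes are each reweighted by the inverse of their probability of receiving the treatment they actually got.

```python
import numpy as np

# Minimal sketch of inverse probability weighting (IPW) for the
# average treatment effect (ATE); illustrative data only.
def ipw_ate(t, y, e):
    """t: binary treatment indicator, y: outcome, e: propensity P(T=1|X)."""
    treated = np.mean(t * y / e)              # reweighted treated outcomes
    control = np.mean((1 - t) * y / (1 - e))  # reweighted control outcomes
    return treated - control

t = np.array([1, 0])
y = np.array([2.0, 1.0])
e = np.array([0.5, 0.5])
print(ipw_ate(t, y, e))  # 1.0
```

The positivity assumption from the text is visible directly in the code: if any propensity e is 0 or 1, one of the two weighted terms divides by zero, which is why identification requires every unit to have a nonzero chance of either treatment.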
Improved sequential decision-making with structural priors: Enhanced treatment personalization with historical data
Personalizing treatments for patients involves a period in which different treatments from a set of available options are tried until an optimal treatment is found for the particular patient's characteristics. To minimize suffering and other costs, it is critical to keep this search short. When treatments have primarily short-term effects, the search can be performed with multi-armed bandit algorithms (MABs). However, these typically require long exploration periods to guarantee optimality. With historical data, it is possible to recover a structure that incorporates prior knowledge of the types of patients that can be encountered and the conditional reward models for those patient types. Such structural priors can be used to shorten the treatment exploration period for enhanced applicability in the real world. This thesis presents work on designing MAB algorithms that find optimal treatments quickly by incorporating a structural prior over patient types in the form of a latent variable model. Theoretical guarantees for the algorithms, including a lower and a matching upper bound, and an empirical study are provided, showing that incorporating latent structural priors is beneficial.
Another line of work in this thesis is the design of simulators for evaluating treatment policies and comparing algorithms. A new simulator for benchmarking estimators of causal effects, the Alzheimer's Disease Causal estimation Benchmark (ADCB), is presented. ADCB combines data-driven simulation with subject-matter knowledge for high realism and causal verifiability. The design of the simulator is discussed, and to demonstrate its utility, the results of a usage scenario for evaluating estimators of causal effects are outlined.
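The intuition that a prior shortens exploration can be shown with a toy sketch: plain UCB warm-started with historical pseudo-counts (this is a generic construction for illustration, not the thesis's latent-variable algorithm; all names and numbers are made up).

```python
import math

# Toy sketch: UCB warm-started with prior pseudo-counts from historical
# data, so a clearly inferior arm need not be explored at length.
def ucb_with_prior(pull, n_rounds, prior_means, prior_counts, c=2.0):
    counts = list(prior_counts)                        # pseudo-observations
    sums = [m * n for m, n in zip(prior_means, prior_counts)]
    for _ in range(n_rounds):
        total = sum(counts)
        ucb = [sums[a] / counts[a] + math.sqrt(c * math.log(total) / counts[a])
               for a in range(len(counts))]
        arm = max(range(len(counts)), key=lambda a: ucb[a])
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return [s / n for s, n in zip(sums, counts)]

# Deterministic toy environment: arm 0 always pays 1, arm 1 always pays 0.
estimates = ucb_with_prior(lambda a: 1.0 if a == 0 else 0.0,
                           n_rounds=20,
                           prior_means=[0.9, 0.1],
                           prior_counts=[10, 10])
```

With confident priors, the inferior arm's confidence bound never overtakes the good arm's, so the costly exploration phase that uninformed UCB would spend on it is skipped; that is the effect the thesis formalizes with matching lower and upper bounds.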
Modeling Recommender Ecosystems: Research Challenges at the Intersection of Mechanism Design, Reinforcement Learning and Generative Models
Modern recommender systems lie at the heart of complex ecosystems that couple
the behavior of users, content providers, advertisers, and other actors.
Despite this, the focus of the majority of recommender research -- and most
practical recommenders of any import -- is on the local, myopic optimization of
the recommendations made to individual users. This comes at a significant cost
to the long-term utility that recommenders could generate for their users. We
argue that explicitly modeling the incentives and behaviors of all actors in
the system -- and the interactions among them induced by the recommender's
policy -- is strictly necessary if one is to maximize the value the system
brings to these actors and improve overall ecosystem "health". Doing so
requires: optimization over long horizons using techniques such as
reinforcement learning; making inevitable tradeoffs in the utility that can be
generated for different actors using the methods of social choice; reducing
information asymmetry, while accounting for incentives and strategic behavior,
using the tools of mechanism design; better modeling of both user and
item-provider behaviors by incorporating notions from behavioral economics and
psychology; and exploiting recent advances in generative and foundation models
to make these mechanisms interpretable and actionable. We propose a conceptual
framework that encompasses these elements, and articulate a number of research
challenges that emerge at the intersection of these different disciplines.
The Catalog Problem: Deep Learning Methods for Transforming Sets into Sequences of Clusters
The titular Catalog Problem refers to predicting a varying number of ordered clusters from sets of any cardinality. This task arises in many diverse areas, ranging from medical triage, through multi-channel signal analysis for petroleum exploration, to product catalog structure prediction. This thesis focuses on the latter, which exemplifies a number of challenges inherent to ordered clustering. These include learning variable cluster constraints, exhibiting relational reasoning, and managing combinatorial complexity, all of which present unique challenges for neural networks, combining elements of set representation, neural clustering, and permutation learning.
In order to approach the Catalog Problem, a curated dataset of over ten thousand real-world product catalogs consisting of more than one million product offers is provided. Additionally, a library for generating simpler, synthetic catalog structures is presented. These and other datasets form the foundation of the included work, allowing for a quantitative comparison of the proposed methods' ability to address the underlying challenge. In particular, synthetic datasets enable the assessment of the models' capacity to learn higher-order compositional and structural rules.
Two novel neural methods are proposed to tackle the Catalog Problem: a set encoding module designed to enhance the network's ability to condition the prediction on the entirety of the input set, and a larger architecture for inferring an input-dependent number of diverse, ordered partitional clusters with an added cardinality prediction module. Both result in improved performance on the presented datasets, with the latter being the only neural method fulfilling all requirements inherent to addressing the Catalog Problem.