Self-Supervised Reinforcement Learning for Recommender Systems
In session-based or sequential recommendation, it is important to consider
factors such as long-term user engagement and multiple types of user-item
interactions (e.g., clicks and purchases). Current state-of-the-art
supervised approaches fail to model these factors appropriately. Casting the sequential
recommendation task as a reinforcement learning (RL) problem is a promising
direction. A major component of RL approaches is to train the agent through
interactions with the environment. However, it is often problematic to train a
recommender in an on-line fashion due to the requirement to expose users to
irrelevant recommendations. As a result, learning the policy from logged
implicit feedback is of vital importance, which is challenging due to the pure
off-policy setting and lack of negative rewards (feedback). In this paper, we
propose self-supervised reinforcement learning for sequential recommendation
tasks. Our approach augments standard recommendation models with two output
layers: one for self-supervised learning and the other for RL. The RL part acts
as a regularizer that drives the supervised layer to focus on specific
rewards (e.g., recommending items that may lead to purchases rather than
clicks), while the self-supervised layer with cross-entropy loss provides
strong gradient signals for parameter updates. Based on this approach, we
propose two frameworks, namely Self-Supervised Q-learning (SQN) and
Self-Supervised Actor-Critic (SAC). We integrate the proposed frameworks with
four state-of-the-art recommendation models. Experimental results on two
real-world datasets demonstrate the effectiveness of our approach. Comment: SIGIR202
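The joint objective described above — a cross-entropy head plus a Q-learning head acting as a regularizer — can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code; the function names and toy inputs are assumptions made for the sketch, and a real implementation would use a target network and batched tensors.

```python
import math

def log_softmax(logits, idx):
    # Numerically stable log-softmax of logits[idx].
    m = max(logits)
    logz = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - logz

def sqn_loss(logits, q_values, q_next, item, reward, gamma=0.9):
    # Self-supervised head: cross-entropy on the logged next item,
    # which provides the strong gradient signal.
    ce = -log_softmax(logits, item)
    # RL head: one-step Q-learning on the logged reward (e.g. a higher
    # reward for purchases than for clicks), acting as a regularizer.
    td_target = reward + gamma * max(q_next)
    td = (q_values[item] - td_target) ** 2
    return ce + td
```

With uniform logits over two items and a zero TD error, the loss reduces to the cross-entropy term log 2, showing how the RL head only adds pressure when its value estimates disagree with the reward signal.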
Learning Deep Features in Instrumental Variable Regression
Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables from observational data by using an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. We propose a novel method, deep feature instrumental variable regression (DFIV), to address the case where relations between instruments, treatments, and outcomes may be nonlinear. In this case, deep neural nets are trained to define informative nonlinear features on the instruments and treatments. We propose an alternating training regime for these features to ensure good end-to-end performance when composing stages 1 and 2, thus obtaining highly flexible feature maps in a computationally efficient manner. DFIV outperforms recent state-of-the-art methods on challenging IV benchmarks, including settings involving high dimensional image data. DFIV also exhibits competitive performance in off-policy policy evaluation for reinforcement learning, which can be understood as an IV regression task.
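The classical two-stage procedure that DFIV generalizes can be sketched with plain least squares. The names and toy data below are illustrative assumptions; DFIV itself replaces both linear stages with learned deep features.

```python
def ols(x, y):
    # Simple least-squares fit y ≈ a + b * x; returns (a, b).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def two_stage_iv(z, t, y):
    # Stage 1: regress treatment t on instrument z.
    a1, b1 = ols(z, t)
    t_hat = [a1 + b1 * zi for zi in z]
    # Stage 2: regress outcome y on the *predicted* treatment, which
    # discards the part of t driven by the hidden confounder.
    return ols(t_hat, y)
```

On a toy dataset where a hidden confounder shifts both treatment and outcome but is uncorrelated with the instrument, the second-stage slope recovers the true causal effect, while regressing the outcome on the raw treatment would not.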
Causal Reinforcement Learning: A Survey
Reinforcement learning is an essential paradigm for solving sequential
decision problems under uncertainty. Despite many remarkable achievements in
recent decades, applying reinforcement learning methods in the real world
remains challenging. One of the main obstacles is that reinforcement learning
agents lack a fundamental understanding of the world and must therefore learn
from scratch through numerous trial-and-error interactions. They may also face
challenges in providing explanations for their decisions and generalizing the
acquired knowledge. Causality, however, offers a notable advantage as it can
formalize knowledge in a systematic manner and leverage invariance for
effective knowledge transfer. This has led to the emergence of causal
reinforcement learning, a subfield of reinforcement learning that seeks to
enhance existing algorithms by incorporating causal relationships into the
learning process. In this survey, we comprehensively review the literature on
causal reinforcement learning. We first introduce the basic concepts of
causality and reinforcement learning, and then explain how causality can
address core challenges in non-causal reinforcement learning. We categorize and
systematically review existing causal reinforcement learning approaches based
on their target problems and methodologies. Finally, we outline open issues and
future directions in this emerging field. Comment: 48 pages, 10 figures
New debiasing strategies in collaborative filtering recommender systems: modeling user conformity, multiple biases, and causality.
Recommender systems are widely used to personalize the user experience in a diverse set of online applications ranging from e-commerce and education to social media and online entertainment. These state-of-the-art AI systems can suffer from several biases that may occur at different stages of the recommendation life-cycle. For instance, using biased data to train recommendation models may lead to several issues, such as the discrepancy between online and offline evaluation, decreased recommendation performance, and a degraded user experience. Bias can occur during the data collection stage, where the data inherits user-item interaction biases such as selection and exposure bias. Bias can also occur in the training stage, where popular items tend to be recommended much more frequently given that they received more interactions to start with. The closed feedback loop of online recommender systems further amplifies the latter biases. In this dissertation, we study bias in the context of collaborative filtering recommender systems and propose a new Popularity Correction Matrix Factorization (PCMF) that aims to improve recommender system performance as well as decrease popularity bias and increase the diversity of items in the recommendation lists. PCMF mitigates popularity bias by disentangling relevance and conformity and by learning a user-personalized bias vector to capture the users' individual conformity levels along a full spectrum of conformity bias. One shortcoming of the proposed PCMF debiasing approach is its assumption that the recommender system is affected only by popularity bias. However, in the real world, different types of bias occur simultaneously and interact with one another. We therefore relax the latter assumption and propose a multi-pronged approach that can account for two biases simultaneously, namely popularity and exposure bias.
Our experimental results show that accounting for multiple biases does improve performance, yielding more accurate and less biased recommendations. Finally, we propose a novel two-stage debiasing approach inspired by the proximal causal inference framework. Unlike the existing causal IPS approach, which corrects for observed confounders, our proposed approach corrects for both observed and potential unobserved confounders. The approach relies on a pair of negative control variables to adjust for the bias in the potential ratings. Our proposed approach outperforms state-of-the-art causal approaches, showing that accounting for unobserved confounders can improve the recommender system's performance.
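The relevance/conformity decomposition described for PCMF can be pictured with a minimal scoring function. All names here (`pcmf_score`, the toy vectors) are hypothetical, following only what the abstract states: a standard matrix-factorization dot product plus a user-personalized conformity weight on item popularity.

```python
def pcmf_score(user_vec, item_vec, user_conformity, item_popularity):
    # Relevance: the usual matrix-factorization inner product.
    relevance = sum(u * v for u, v in zip(user_vec, item_vec))
    # Conformity: each user carries a personal weight on item popularity,
    # placing them on a full spectrum of conformity bias.
    return relevance + user_conformity * item_popularity
```

A user whose conformity weight is zero is scored on relevance alone, so that user's ranking ignores popularity entirely; a high-conformity user's list is pulled toward popular items.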
Spatio-temporal Incentives Optimization for Ride-hailing Services with Offline Deep Reinforcement Learning
A fundamental question in any peer-to-peer ride-sharing system is how to meet
the requests of passengers, both effectively and efficiently, so as to balance
supply and demand in real time. On the passenger side, traditional approaches
focus on pricing strategies that adjust the distribution of demand by changing
the probability that users place ride requests. However, previous methods do
not take into account the impact of strategy changes on future supply and
demand: drivers are repositioned to different destinations in response to
passengers' calls, which affects their income for a period of time in the
future. Motivated by this observation, we attempt to optimize the distribution
of demand by learning long-term spatio-temporal values as a guideline for the
pricing strategy. In this study, we propose an offline deep reinforcement
learning based method focusing on the demand side to improve the utilization
of transportation resources and customer satisfaction. We adopt a
spatio-temporal learning method to learn the value of different times and
locations, then incentivize the ride requests of passengers to adjust the
distribution of demand and balance supply and demand in the system. In
particular, we model the problem as a Markov Decision Process (MDP).
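The long-term spatio-temporal values described above can be pictured as a table indexed by (time slot, zone) and refined with one-step temporal-difference backups. The function name, table keys, and numbers below are illustrative assumptions, not the paper's implementation.

```python
def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.95):
    # One TD(0) backup on a (time slot, zone) value table: the value of
    # serving a request here and now bootstraps from the value of the
    # slot and zone where the driver ends up next.
    target = reward + gamma * values[next_state]
    values[state] += alpha * (target - values[state])
    return values[state]
```

Replaying logged trips through such updates (offline, with no live experimentation on users) yields values that rank when and where a request is worth incentivizing.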
Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning
In Model-based Reinforcement Learning (MBRL), model learning is critical
since an inaccurate model can bias policy learning via generating misleading
samples. However, learning an accurate model can be difficult since the policy
is continually updated and the induced distribution over visited states used
for model learning shifts accordingly. Prior methods alleviate this issue by
quantifying the uncertainty of model-generated samples. However, these methods
only quantify the uncertainty passively after the samples have been generated,
rather than foreseeing it before model trajectories fall into highly uncertain
regions. The resulting low-quality samples can induce unstable learning
targets and hinder the optimization of the policy. Moreover, while the model
is trained to minimize one-step prediction errors, it is generally used to
predict multiple steps ahead, leading to a mismatch between the objectives of
model learning and model usage. To this end, we propose \emph{Plan To Predict}
(P2P), an MBRL framework that treats the model rollout process as a sequential
decision-making problem by reversing the usual roles: the model is considered
the decision maker and the current policy the dynamics. In this way,
the model can quickly adapt to the current policy and foresee the multi-step
future uncertainty when generating trajectories. Theoretically, we show that
the performance of P2P can be guaranteed by approximately optimizing a lower
bound of the true environment return. Empirical results demonstrate that P2P
achieves state-of-the-art performance on several challenging benchmark tasks. Comment: Accepted by NeurIPS202
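The objective mismatch the abstract points at — a one-step training loss versus multi-step usage — is easy to demonstrate: a model with a small per-step bias accumulates trajectory error over a rollout. Everything below is a toy illustration of that mismatch, not P2P itself.

```python
def rollout_error(model, true_dyn, policy, s0, horizon):
    # Roll the learned model and the true dynamics forward under the
    # same policy and accumulate the gap between the two trajectories.
    s_model, s_true, err = s0, s0, 0.0
    for _ in range(horizon):
        a = policy(s_model)           # actions are chosen on model states
        s_model = model(s_model, a)
        s_true = true_dyn(s_true, a)
        err += abs(s_model - s_true)
    return err
```

In the toy below, a model whose one-step error is only 0.1 accumulates 0.6 of total trajectory error over three steps, which is why foreseeing multi-step uncertainty during rollout generation matters.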