Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework
Exploration is essential for reinforcement learning (RL). To face the
challenges of exploration, we consider a reward-free RL framework that
completely separates exploration from exploitation and brings new challenges
for exploration algorithms. In the exploration phase, the agent learns an
exploratory policy by interacting with a reward-free environment and collects a
dataset of transitions by executing the policy. In the planning phase, the
agent computes a good policy for any reward function based on the dataset
without further interacting with the environment. This framework is suitable
for the meta RL setting where there are many reward functions of interest. In
the exploration phase, we propose to maximize the Rényi entropy over the
state-action space and justify this objective theoretically. Rényi entropy
succeeds as an objective because it encourages the agent to visit
hard-to-reach state-actions. We further derive a policy gradient
formulation for this objective and design a practical exploration algorithm
that can deal with complex environments. In the planning phase, we solve for
good policies given arbitrary reward functions using a batch RL algorithm.
Empirically, we show that our exploration algorithm is effective and sample
efficient, and results in superior policies for arbitrary reward functions in
the planning phase.
Comment: Accepted by AAAI-2
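As a concrete illustration of the entropy objective described above, the sketch below computes the Rényi entropy of a discrete visitation distribution; it is a minimal stand-alone example (not the paper's algorithm), showing how smaller orders of the entropy weight rare, hard-to-reach outcomes more heavily:

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha).

    As alpha -> 1 it converges to the Shannon entropy; smaller alpha
    weights rare (hard-to-reach) states more heavily, which matches the
    intuition for using it as an exploration objective.
    """
    if abs(alpha - 1.0) < 1e-12:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)

# A skewed state-visitation distribution: one dominant state, three rare ones.
p = [0.85, 0.05, 0.05, 0.05]
print(renyi_entropy(p, 0.5))   # small alpha: rare states contribute more
print(renyi_entropy(p, 1.0))   # Shannon entropy
print(renyi_entropy(p, 2.0))   # collision entropy
```

For any non-uniform distribution the entropy is strictly decreasing in alpha, so maximizing a small-alpha Rényi entropy pushes probability mass toward states that are rarely visited.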
Towards Generalizable Reinforcement Learning for Trade Execution
Optimized trade execution aims to sell (or buy) a given amount of assets
within a given time at the lowest possible trading cost. Recently, reinforcement
learning (RL) has been applied to optimized trade execution to learn smarter
policies from market data. However, we find that many existing RL methods
exhibit considerable overfitting which prevents them from real deployment. In
this paper, we provide an extensive study on the overfitting problem in
optimized trade execution. First, we model the optimized trade execution as
offline RL with dynamic context (ORDC), where the context represents market
variables that cannot be influenced by the trading policy and are collected in
an offline manner. Under this framework, we derive the generalization bound and
find that the overfitting issue is caused by large context space and limited
context samples in the offline setting. Accordingly, we propose to learn
compact representations for context to address the overfitting problem, either
by leveraging prior knowledge or in an end-to-end manner. To evaluate our
algorithms, we also implement a carefully designed simulator based on
historical limit order book (LOB) data to provide a high-fidelity benchmark for
different algorithms. Our experiments on the high-fidelity simulator
demonstrate that our algorithms can effectively alleviate overfitting and
achieve better performance.
Comment: Accepted by IJCAI-2
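To make the trading-cost objective above concrete, the following toy sketch compares selling in one lot versus an even (TWAP-style) split under a hypothetical linear price-impact model; the model and its parameters are illustrative assumptions, not the paper's LOB simulator:

```python
def execution_cost(schedule, p0=100.0, impact=0.01):
    """Toy linear-impact cost model (illustrative only): each sold lot
    pushes the price down by `impact` per share, and cost is measured as
    shortfall against the arrival price p0."""
    price, cost = p0, 0.0
    for qty in schedule:
        price -= impact * qty          # permanent linear impact
        cost += qty * (p0 - price)     # shortfall vs. arrival price
    return cost

total = 1000
lump = execution_cost([total])             # sell everything at once
twap = execution_cost([total // 10] * 10)  # even split over 10 steps
print(lump, twap)
```

Even in this crude model, splitting the order reduces total cost, which is why execution policies trade off impact against time rather than dumping the full quantity at once.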
Learning List-wise Representation in Reinforcement Learning for Ads Allocation with Multiple Auxiliary Tasks
With the recent prevalence of reinforcement learning (RL), there has been
tremendous interest in utilizing RL for ads allocation in recommendation
platforms (e.g., e-commerce and news feed sites). For better performance,
recent RL-based ads allocation agents make decisions based on representations
of list-wise item arrangements. This results in a high-dimensional state-action
space, which makes it difficult to learn an efficient and generalizable
list-wise representation. To address this problem, we propose a novel algorithm
to learn a better representation by leveraging task-specific signals on the
Meituan food delivery platform. Specifically, we propose three different types of
auxiliary tasks that are based on reconstruction, prediction, and contrastive
learning respectively. We conduct extensive offline experiments on the
effectiveness of these auxiliary tasks and test our method on a real-world food
delivery platform. The experimental results show that our method can learn
better list-wise representations and achieve higher revenue for the platform.
Comment: arXiv admin note: text overlap with arXiv:2109.04353, arXiv:2204.0037
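Of the three auxiliary task types mentioned above, the contrastive one can be sketched with a standard InfoNCE-style loss for a single anchor; the similarity scores and temperature here are hypothetical inputs, and this is an illustration of the general contrastive objective rather than the paper's exact loss:

```python
import math

def info_nce(sim_pos, sims_neg, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor: pull the positive
    (e.g., another view of the same list-wise state) close and push
    negatives away. Computed with a max-shifted log-sum-exp for stability."""
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[0] - log_z)

# A well-separated positive yields a much smaller loss than a weak one.
print(info_nce(0.9, [0.1, 0.1]))
print(info_nce(0.2, [0.1, 0.1]))
```

Minimizing this auxiliary loss alongside the RL objective pressures the encoder to produce list-wise representations that distinguish related arrangements from unrelated ones.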
Policy Search by Target Distribution Learning for Continuous Control
It is known that existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to deterministic, leading to an unstable training process. We show that such instability can happen even in a very simple environment. To address this issue, we propose a new method, called target distribution learning (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective in constraining the KL divergence between updated policies, and hence leads to more stable policy improvements over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.
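The instability near-deterministic policies cause can be seen directly in the closed-form KL divergence between two Gaussian policies, the kind of quantity TDL-style updates keep bounded; this sketch is a generic illustration of that effect, not the paper's method:

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

# The same small mean shift (0.1) is a tiny update for a stochastic policy
# but a huge one for a near-deterministic policy (sigma = 0.01).
print(kl_gaussian(0.0, 1.0, 0.1, 1.0))    # stochastic: tiny KL
print(kl_gaussian(0.0, 0.01, 0.1, 0.01))  # near-deterministic: huge KL
```

Because the KL blows up as sigma shrinks, unconstrained gradient steps on a near-deterministic policy correspond to enormous distribution changes, which is exactly the failure mode the abstract describes.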
RePreM: Representation Pre-training with Masked Model for Reinforcement Learning
Inspired by the recent success of sequence modeling in RL and the use of masked language models for pre-training, we propose a masked model for pre-training in RL, RePreM (Representation Pre-training with Masked Model), which trains an encoder combined with transformer blocks to predict the masked states or actions in a trajectory. RePreM is simple but effective compared to existing representation pre-training methods in RL. By relying on sequence modeling, it avoids algorithmic sophistication (such as data augmentation or estimating multiple models) and generates a representation that captures long-term dynamics well. Empirically, we demonstrate the effectiveness of RePreM in various tasks, including dynamics prediction, transfer learning, and sample-efficient RL with both value-based and actor-critic methods. Moreover, we show that RePreM scales well with dataset size, dataset quality, and the scale of the encoder, which indicates its potential towards big RL models.
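The masked-prediction setup described above can be sketched as BERT-style masking applied to a trajectory of states and actions; the mask ratio, mask token, and function shape here are illustrative assumptions, not RePreM's exact scheme:

```python
import random

def mask_trajectory(traj, mask_ratio=0.15, mask_token="<MASK>", seed=0):
    """Randomly replace a fraction of trajectory entries (states/actions)
    with a mask token, returning the masked sequence and a dict of
    position -> original entry that the encoder is trained to predict."""
    rng = random.Random(seed)
    masked, targets = list(traj), {}
    for i in range(len(traj)):
        if rng.random() < mask_ratio:
            targets[i] = traj[i]
            masked[i] = mask_token
    return masked, targets

traj = ["s0", "a0", "s1", "a1", "s2", "a2", "s3", "a3"]
masked, targets = mask_trajectory(traj)
print(masked)
print(targets)
```

Training the encoder to recover `targets` from `masked` forces it to model how states and actions relate across time, which is where the long-term dynamics signal comes from.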
A novel sparse model-based algorithm to cluster categorical data for improved health screening and public health promotion
Screening for interpersonal violence is critical to mitigate the consequences of violence and improve women's health. Current guidelines recommend that health care providers screen all women for experiences of violence. Despite these recommendations, studies have noted a large variation in provider-reported interpersonal violence screening rates, ranging from 10% to 90%. Given this disparity, identifying variables correlated with providers' screening practices is an important contribution. A previously collected survey of healthcare providers was utilized for this analysis and consisted of the providers' socio-demographics, attitudes and beliefs, practice environment characteristics, as well as self-reported screening practices. The objective of the study was to stratify healthcare providers into relatively homogeneous clusters based on mixed types of categorical nominal and ordinal variables and to correlate the identified clusters with the violence screening rates. This paper proposes a sparse categorical Factor Mixture Model (sc-FMM) to cluster a large number of categorical variables, in which an (Formula presented.) norm was used for variable selection. An Expectation-Maximization framework integrated with Gauss-Hermite approximation was developed for model estimation. Simulation studies show significantly better performance of sc-FMM than competing methods. sc-FMM was applied to identify clusters/subgroups of healthcare providers, and the identified clusters were further correlated with interpersonal violence screening rates. The findings reveal how providers' screening rates for interpersonal violence are associated with multi-source impacting factors, which informs the formation of policy and intervention development to promote the uptake of routine screening for interpersonal violence in women.
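The Gauss-Hermite approximation used in the estimation framework above replaces a Gaussian integral with a weighted sum over a few nodes; the minimal sketch below applies the standard 3-point rule to a Gaussian expectation (a generic illustration of the technique, not the sc-FMM E-step itself):

```python
import math

# Standard 3-point Gauss-Hermite rule for the weight e^{-x^2}:
# exact for polynomials up to degree 5.
GH_NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
GH_WEIGHTS = [math.sqrt(math.pi) / 6,
              2 * math.sqrt(math.pi) / 3,
              math.sqrt(math.pi) / 6]

def gauss_hermite_expectation(f):
    """Approximate E[f(Z)] for Z ~ N(0, 1) by quadrature: substitute
    z = sqrt(2) * x and normalize by sqrt(pi)."""
    total = sum(w * f(math.sqrt(2) * x) for x, w in zip(GH_NODES, GH_WEIGHTS))
    return total / math.sqrt(math.pi)

print(gauss_hermite_expectation(lambda z: z * z))  # E[Z^2] = 1 (exact here)
```

In EM for latent-variable models like sc-FMM, integrals over the latent factor have no closed form, and this kind of quadrature turns each one into a short, cheap weighted sum.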