
    Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework

    Exploration is essential for reinforcement learning (RL). To address the challenges of exploration, we consider a reward-free RL framework that completely separates exploration from exploitation and brings new challenges for exploration algorithms. In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment and collects a dataset of transitions by executing the policy. In the planning phase, the agent computes a good policy for any given reward function based on the dataset, without further interaction with the environment. This framework is suitable for the meta-RL setting where there are many reward functions of interest. In the exploration phase, we propose to maximize the Rényi entropy over the state-action space and justify this objective theoretically. Rényi entropy succeeds as an objective because it encourages the agent to visit hard-to-reach state-actions. We further derive a policy gradient formulation for this objective and design a practical exploration algorithm that can handle complex environments. In the planning phase, we solve for good policies given arbitrary reward functions using a batch RL algorithm. Empirically, we show that our exploration algorithm is effective and sample-efficient, and yields superior policies for arbitrary reward functions in the planning phase. Comment: Accepted by AAAI-2
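
    As a rough illustration of the exploration objective described above, the following minimal numpy sketch computes the Rényi entropy of a toy state-action visitation distribution; the value of alpha and the visitation counts are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def renyi_entropy(d, alpha=0.5):
    """Renyi entropy H_alpha(d) = log(sum_i d_i**alpha) / (1 - alpha).

    For alpha < 1 the objective rewards probability mass on rarely
    visited state-actions more strongly than Shannon entropy does,
    matching the intuition of encouraging hard-to-reach state-actions.
    """
    d = np.asarray(d, dtype=float)
    d = d / d.sum()              # normalize to a proper distribution
    nz = d[d > 0]                # zero-probability entries contribute nothing
    return np.log(np.sum(nz ** alpha)) / (1.0 - alpha)

# Toy state-action visitation counts (hypothetical numbers).
visits = np.array([50.0, 30.0, 15.0, 4.0, 1.0])
print(renyi_entropy(visits, alpha=0.5))    # Renyi entropy of the visitation distribution
print(renyi_entropy(visits, alpha=0.999))  # close to the Shannon entropy
```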

    Towards Generalizable Reinforcement Learning for Trade Execution

    Optimized trade execution aims to sell (or buy) a given amount of assets within a given time at the lowest possible trading cost. Recently, reinforcement learning (RL) has been applied to optimized trade execution to learn smarter policies from market data. However, we find that many existing RL methods exhibit considerable overfitting, which prevents them from real deployment. In this paper, we provide an extensive study of the overfitting problem in optimized trade execution. First, we model optimized trade execution as offline RL with dynamic context (ORDC), where the context represents market variables that cannot be influenced by the trading policy and are collected in an offline manner. Under this framework, we derive a generalization bound and find that the overfitting issue is caused by the large context space and the limited context samples in the offline setting. Accordingly, we propose to learn compact representations of the context to address the overfitting problem, either by leveraging prior knowledge or in an end-to-end manner. To evaluate our algorithms, we also implement a carefully designed simulator based on historical limit order book (LOB) data to provide a high-fidelity benchmark for different algorithms. Our experiments on the high-fidelity simulator demonstrate that our algorithms can effectively alleviate overfitting and achieve better performance. Comment: Accepted by IJCAI-2
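
    As a rough sketch of the ORDC framing, the snippet below separates agent-controlled state from exogenous market context in each logged transition and uses a fixed random projection as a stand-in for the paper's learned compact context representation; all names, shapes, and the projection itself are illustrative assumptions.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class ORDCTransition:
    """One logged transition in an offline-RL-with-dynamic-context setup.

    `context` holds market variables (e.g. LOB features) that evolve on
    their own and are only observed; `state` holds the agent's private
    variables (e.g. remaining inventory, remaining time), which are the
    only quantities affected by `action`.
    """
    state: np.ndarray
    context: np.ndarray
    action: float          # e.g. fraction of the remaining inventory to trade now
    reward: float          # e.g. negative trading cost incurred at this step
    next_state: np.ndarray

def compress_context(context: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Map a high-dimensional context vector to a compact code.

    A fixed random projection stands in for the learned encoders
    (prior-knowledge-based or end-to-end) proposed in the paper.
    """
    return projection @ context

rng = np.random.default_rng(0)
projection = rng.normal(size=(8, 128)) / np.sqrt(128)   # 128-dim context -> 8-dim code
code = compress_context(rng.normal(size=128), projection)
```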

    Learning List-wise Representation in Reinforcement Learning for Ads Allocation with Multiple Auxiliary Tasks

    With the recent prevalence of reinforcement learning (RL), there has been tremendous interest in utilizing RL for ads allocation on recommendation platforms (e.g., e-commerce and news feed sites). For better performance, recent RL-based ads allocation agents make decisions based on representations of the list-wise item arrangement. This results in a high-dimensional state-action space, which makes it difficult to learn an efficient and generalizable list-wise representation. To address this problem, we propose a novel algorithm that learns a better representation by leveraging task-specific signals on the Meituan food delivery platform. Specifically, we propose three types of auxiliary tasks, based on reconstruction, prediction, and contrastive learning, respectively. We conduct extensive offline experiments on the effectiveness of these auxiliary tasks and test our method on the real-world food delivery platform. The experimental results show that our method can learn better list-wise representations and achieve higher revenue for the platform. Comment: arXiv admin note: text overlap with arXiv:2109.04353, arXiv:2204.0037
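
    The PyTorch sketch below illustrates how the three auxiliary objectives named in the abstract (reconstruction, prediction, and contrastive learning) could be combined on top of a list-wise encoder; the module interfaces, the noise-based augmentation, and the equal loss weighting are assumptions made for the example, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def auxiliary_losses(encoder, decoder, predictor, item_lists, targets):
    """Combine reconstruction, prediction, and contrastive auxiliary losses.

    item_lists: (batch, list_len, feat) raw list-wise item features
    targets:    (batch, target_dim) task-specific signals to predict
    encoder(item_lists) is assumed to return a (batch, dim) representation,
    and decoder(z) a tensor with the same shape as item_lists.
    """
    z = encoder(item_lists)

    # 1) Reconstruction: decode the item list back from its representation.
    loss_recon = F.mse_loss(decoder(z), item_lists)

    # 2) Prediction: regress task-specific signals from the representation.
    loss_pred = F.mse_loss(predictor(z), targets)

    # 3) Contrastive: representations of two noisy views of the same list
    #    should agree (a simple InfoNCE-style term).
    z2 = encoder(item_lists + 0.01 * torch.randn_like(item_lists))
    logits = F.normalize(z, dim=-1) @ F.normalize(z2, dim=-1).t() / 0.1
    labels = torch.arange(z.size(0), device=z.device)
    loss_contrast = F.cross_entropy(logits, labels)

    return loss_recon + loss_pred + loss_contrast
```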

    Policy Search by Target Distribution Learning for Continuous Control

    It is known that existing policy gradient methods (such as vanilla policy gradient, PPO, and A2C) may suffer from overly large gradients when the current policy is close to deterministic, leading to an unstable training process. We show that such instability can happen even in a very simple environment. To address this issue, we propose a new method, called target distribution learning (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective at constraining the KL divergence between successive policies, and hence leads to more stable policy improvement over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms on most continuous control tasks in the MuJoCo environment while being more stable in training.
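
    A crude numpy sketch of the alternation described above follows: first propose a target distribution, then move the policy toward it rather than taking a raw gradient step. The target-proposal rule shown here (shifting the mean toward advantageous actions) is a placeholder assumption, not the paper's rule.

```python
import numpy as np

def tdl_step(mu, sigma, actions, advantages, shift=0.05, lr=0.5):
    """One target-distribution-learning-style update for a 1-D Gaussian policy."""
    # Step 1: propose a target distribution by nudging the mean toward
    # actions with positive advantage (placeholder proposal rule).
    weights = np.maximum(advantages, 0.0)
    if weights.sum() > 0:
        target_mu = (1 - shift) * mu + shift * np.average(actions, weights=weights)
    else:
        target_mu = mu
    target_sigma = sigma

    # Step 2: move the policy toward the target; because the step is
    # bounded by the proposal, successive policies stay close in KL even
    # when sigma is small (the instability case discussed in the abstract).
    new_mu = mu + lr * (target_mu - mu)
    new_sigma = sigma + lr * (target_sigma - sigma)
    return new_mu, new_sigma

mu, sigma = 0.0, 0.1
actions = np.array([0.05, -0.02, 0.12, 0.08])
advantages = np.array([1.0, -0.5, 2.0, 0.3])
print(tdl_step(mu, sigma, actions, advantages))
```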

    RePreM: Representation Pre-training with Masked Model for Reinforcement Learning

    Inspired by the recent success of sequence modeling in RL and the use of masked language models for pre-training, we propose a masked model for pre-training in RL, RePreM (Representation Pre-training with Masked Model), which trains an encoder combined with transformer blocks to predict the masked states or actions in a trajectory. RePreM is simple but effective compared to existing representation pre-training methods in RL. By relying on sequence modeling, it avoids algorithmic sophistication (such as data augmentation or estimating multiple models) and produces a representation that captures long-term dynamics well. Empirically, we demonstrate the effectiveness of RePreM on various tasks, including dynamics prediction, transfer learning, and sample-efficient RL with both value-based and actor-critic methods. Moreover, we show that RePreM scales well with dataset size, dataset quality, and the scale of the encoder, which indicates its potential for big RL models.
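
    The following PyTorch sketch shows the general masked-trajectory idea described above: embed a state-action sequence, mask a fraction of the tokens, and train a transformer encoder to reconstruct them. The dimensions, masking rate, tokenization, and MSE objective are assumptions for illustration, not the published RePreM configuration.

```python
import torch
import torch.nn as nn

class MaskedTrajectoryModel(nn.Module):
    """Minimal masked-model pre-training sketch for RL trajectories."""

    def __init__(self, state_dim, action_dim, d_model=64, mask_ratio=0.15):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, state_dim + action_dim)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.mask_ratio = mask_ratio

    def forward(self, traj):
        # traj: (batch, T, state_dim + action_dim), each token being a
        # concatenated (state, action) pair from the trajectory.
        tokens = self.embed(traj)
        mask = torch.rand(traj.shape[:2], device=traj.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        pred = self.head(self.encoder(tokens))
        # Reconstruction loss on the masked positions only.
        return ((pred - traj) ** 2)[mask].mean()

model = MaskedTrajectoryModel(state_dim=17, action_dim=6)
loss = model(torch.randn(4, 32, 23))   # batch of 4 trajectories, 32 steps each
```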

    A novel sparse model-based algorithm to cluster categorical data for improved health screening and public health promotion

    Screening for interpersonal violence is critical to mitigate the consequences of violence and improve women's health. Current guidelines recommend that health care providers screen all women for experiences of violence. Despite these recommendations, studies have noted a large variation in provider-reported interpersonal violence screening rates, ranging from 10% to 90%. Given this disparity in screening rates, identifying variables correlated with providers' screening practices is an important contribution. A previously collected survey of healthcare providers was utilized for this analysis; it covered the providers' socio-demographics, attitudes and beliefs, practice environment characteristics, and self-reported screening practices. The objective of the study was to stratify healthcare providers into relatively homogeneous clusters based on mixed types of categorical nominal and ordinal variables and to correlate the identified clusters with violence screening rates. This paper proposes a sparse categorical Factor Mixture Model (sc-FMM) to cluster a large number of categorical variables, in which a sparsity-inducing norm is used for variable selection. An Expectation-Maximization framework integrated with a Gauss-Hermite approximation was developed for model estimation. Simulation studies show significantly better performance of sc-FMM than competing methods. sc-FMM was applied to identify clusters/subgroups of healthcare providers, and the identified clusters were further correlated with interpersonal violence screening rates. The findings reveal how providers' screening rates for interpersonal violence are associated with factors from multiple sources, which informs policy formation and intervention development to promote the uptake of routine screening for interpersonal violence in women.
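
    Since the abstract names a Gauss-Hermite approximation inside the EM framework, the numpy sketch below shows just that building block: approximating an expectation over a standard normal latent factor with Gauss-Hermite quadrature. It does not reproduce the sc-FMM model itself; the node count and the test function are illustrative.

```python
import numpy as np

def gauss_hermite_expectation(g, n_nodes=20):
    """Approximate E[g(Z)] for Z ~ N(0, 1) via Gauss-Hermite quadrature.

    Uses E[g(Z)] = (1 / sqrt(pi)) * sum_i w_i * g(sqrt(2) * x_i), where
    (x_i, w_i) are the Gauss-Hermite nodes and weights. In an EM E-step
    this kind of rule replaces an intractable integral over the latent factor.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    return np.sum(weights * g(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

# Sanity check: E[Z^2] = 1 for a standard normal latent factor.
print(gauss_hermite_expectation(lambda z: z ** 2))  # ~1.0
```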