
    Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework

    Exploration is essential for reinforcement learning (RL). To address the challenges of exploration, we consider a reward-free RL framework that completely separates exploration from exploitation and brings new challenges for exploration algorithms. In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment and collects a dataset of transitions by executing the policy. In the planning phase, the agent computes a good policy for any given reward function based on the dataset, without further interaction with the environment. This framework is suitable for the meta-RL setting where there are many reward functions of interest. In the exploration phase, we propose to maximize the Rényi entropy over the state-action space and justify this objective theoretically. Rényi entropy succeeds as an objective because it encourages the agent to visit hard-to-reach state-actions. We further derive a policy gradient formulation for this objective and design a practical exploration algorithm that can handle complex environments. In the planning phase, we solve for good policies given arbitrary reward functions using a batch RL algorithm. Empirically, we show that our exploration algorithm is effective and sample-efficient, and yields superior policies for arbitrary reward functions in the planning phase. Comment: Accepted by AAAI-2
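
    As a rough illustration of the exploration objective described above, the following minimal numpy sketch computes the Rényi entropy of a toy state-action visitation distribution; the value of alpha and the visitation counts are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def renyi_entropy(d, alpha=0.5):
    """Renyi entropy H_alpha(d) = log(sum_i d_i**alpha) / (1 - alpha).

    For alpha < 1 the objective rewards probability mass on rarely
    visited state-actions more strongly than Shannon entropy does,
    matching the intuition of encouraging hard-to-reach state-actions.
    """
    d = np.asarray(d, dtype=float)
    d = d / d.sum()              # normalize to a proper distribution
    nz = d[d > 0]                # zero-probability entries contribute nothing
    return np.log(np.sum(nz ** alpha)) / (1.0 - alpha)

# Toy state-action visitation counts (hypothetical numbers).
visits = np.array([50.0, 30.0, 15.0, 4.0, 1.0])
print(renyi_entropy(visits, alpha=0.5))    # Renyi entropy of the visitation distribution
print(renyi_entropy(visits, alpha=0.999))  # close to the Shannon entropy
```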

    Towards Generalizable Reinforcement Learning for Trade Execution

    Optimized trade execution aims to sell (or buy) a given amount of assets within a given time at the lowest possible trading cost. Recently, reinforcement learning (RL) has been applied to optimized trade execution to learn smarter policies from market data. However, we find that many existing RL methods exhibit considerable overfitting, which prevents them from real deployment. In this paper, we provide an extensive study of the overfitting problem in optimized trade execution. First, we model optimized trade execution as offline RL with dynamic context (ORDC), where the context represents market variables that cannot be influenced by the trading policy and are collected in an offline manner. Under this framework, we derive a generalization bound and find that the overfitting issue is caused by the large context space and the limited context samples in the offline setting. Accordingly, we propose to learn compact representations of the context to address the overfitting problem, either by leveraging prior knowledge or in an end-to-end manner. To evaluate our algorithms, we also implement a carefully designed simulator based on historical limit order book (LOB) data to provide a high-fidelity benchmark for different algorithms. Our experiments on the high-fidelity simulator demonstrate that our algorithms can effectively alleviate overfitting and achieve better performance. Comment: Accepted by IJCAI-2
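
    As a rough sketch of the ORDC framing, the snippet below separates agent-controlled state from exogenous market context in each logged transition and uses a fixed random projection as a stand-in for the paper's learned compact context representation; all names, shapes, and the projection itself are illustrative assumptions.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class ORDCTransition:
    """One logged transition in an offline-RL-with-dynamic-context setup.

    `context` holds market variables (e.g. LOB features) that evolve on
    their own and are only observed; `state` holds the agent's private
    variables (e.g. remaining inventory, remaining time), which are the
    only quantities affected by `action`.
    """
    state: np.ndarray
    context: np.ndarray
    action: float          # e.g. fraction of the remaining inventory to trade now
    reward: float          # e.g. negative trading cost incurred at this step
    next_state: np.ndarray

def compress_context(context: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Map a high-dimensional context vector to a compact code.

    A fixed random projection stands in for the learned encoders
    (prior-knowledge-based or end-to-end) proposed in the paper.
    """
    return projection @ context

rng = np.random.default_rng(0)
projection = rng.normal(size=(8, 128)) / np.sqrt(128)   # 128-dim context -> 8-dim code
code = compress_context(rng.normal(size=128), projection)
```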

    Learning List-wise Representation in Reinforcement Learning for Ads Allocation with Multiple Auxiliary Tasks

    With the recent prevalence of reinforcement learning (RL), there has been tremendous interest in utilizing RL for ads allocation on recommendation platforms (e.g., e-commerce and news feed sites). For better performance, recent RL-based ads allocation agents make decisions based on representations of the list-wise item arrangement. This results in a high-dimensional state-action space, which makes it difficult to learn an efficient and generalizable list-wise representation. To address this problem, we propose a novel algorithm that learns a better representation by leveraging task-specific signals on the Meituan food delivery platform. Specifically, we propose three types of auxiliary tasks, based on reconstruction, prediction, and contrastive learning, respectively. We conduct extensive offline experiments on the effectiveness of these auxiliary tasks and test our method on the real-world food delivery platform. The experimental results show that our method can learn better list-wise representations and achieve higher revenue for the platform. Comment: arXiv admin note: text overlap with arXiv:2109.04353, arXiv:2204.0037
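
    The PyTorch sketch below illustrates how the three auxiliary objectives named in the abstract (reconstruction, prediction, and contrastive learning) could be combined on top of a list-wise encoder; the module interfaces, the noise-based augmentation, and the equal loss weighting are assumptions made for the example, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def auxiliary_losses(encoder, decoder, predictor, item_lists, targets):
    """Combine reconstruction, prediction, and contrastive auxiliary losses.

    item_lists: (batch, list_len, feat) raw list-wise item features
    targets:    (batch, target_dim) task-specific signals to predict
    encoder(item_lists) is assumed to return a (batch, dim) representation,
    and decoder(z) a tensor with the same shape as item_lists.
    """
    z = encoder(item_lists)

    # 1) Reconstruction: decode the item list back from its representation.
    loss_recon = F.mse_loss(decoder(z), item_lists)

    # 2) Prediction: regress task-specific signals from the representation.
    loss_pred = F.mse_loss(predictor(z), targets)

    # 3) Contrastive: representations of two noisy views of the same list
    #    should agree (a simple InfoNCE-style term).
    z2 = encoder(item_lists + 0.01 * torch.randn_like(item_lists))
    logits = F.normalize(z, dim=-1) @ F.normalize(z2, dim=-1).t() / 0.1
    labels = torch.arange(z.size(0), device=z.device)
    loss_contrast = F.cross_entropy(logits, labels)

    return loss_recon + loss_pred + loss_contrast
```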

    Policy Search by Target Distribution Learning for Continuous Control

    It is known that existing policy gradient methods (such as vanilla policy gradient, PPO, and A2C) may suffer from overly large gradients when the current policy is close to deterministic, leading to an unstable training process. We show that such instability can happen even in a very simple environment. To address this issue, we propose a new method, called target distribution learning (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective at constraining the KL divergence between successive policies, and hence leads to more stable policy improvement over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms on most continuous control tasks in the MuJoCo environment while being more stable in training.
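
    A crude numpy sketch of the alternation described above follows: first propose a target distribution, then move the policy toward it rather than taking a raw gradient step. The target-proposal rule shown here (shifting the mean toward advantageous actions) is a placeholder assumption, not the paper's rule.

```python
import numpy as np

def tdl_step(mu, sigma, actions, advantages, shift=0.05, lr=0.5):
    """One target-distribution-learning-style update for a 1-D Gaussian policy."""
    # Step 1: propose a target distribution by nudging the mean toward
    # actions with positive advantage (placeholder proposal rule).
    weights = np.maximum(advantages, 0.0)
    if weights.sum() > 0:
        target_mu = (1 - shift) * mu + shift * np.average(actions, weights=weights)
    else:
        target_mu = mu
    target_sigma = sigma

    # Step 2: move the policy toward the target; because the step is
    # bounded by the proposal, successive policies stay close in KL even
    # when sigma is small (the instability case discussed in the abstract).
    new_mu = mu + lr * (target_mu - mu)
    new_sigma = sigma + lr * (target_sigma - sigma)
    return new_mu, new_sigma

mu, sigma = 0.0, 0.1
actions = np.array([0.05, -0.02, 0.12, 0.08])
advantages = np.array([1.0, -0.5, 2.0, 0.3])
print(tdl_step(mu, sigma, actions, advantages))
```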

    RePreM: Representation Pre-training with Masked Model for Reinforcement Learning

    Inspired by the recent success of sequence modeling in RL and the use of masked language models for pre-training, we propose a masked model for pre-training in RL, RePreM (Representation Pre-training with Masked Model), which trains an encoder combined with transformer blocks to predict the masked states or actions in a trajectory. RePreM is simple but effective compared to existing representation pre-training methods in RL. By relying on sequence modeling, it avoids algorithmic sophistication (such as data augmentation or estimating multiple models) and produces a representation that captures long-term dynamics well. Empirically, we demonstrate the effectiveness of RePreM on various tasks, including dynamics prediction, transfer learning, and sample-efficient RL with both value-based and actor-critic methods. Moreover, we show that RePreM scales well with dataset size, dataset quality, and the scale of the encoder, which indicates its potential for big RL models.
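
    The following PyTorch sketch shows the general masked-trajectory idea described above: embed a state-action sequence, mask a fraction of the tokens, and train a transformer encoder to reconstruct them. The dimensions, masking rate, tokenization, and MSE objective are assumptions for illustration, not the published RePreM configuration.

```python
import torch
import torch.nn as nn

class MaskedTrajectoryModel(nn.Module):
    """Minimal masked-model pre-training sketch for RL trajectories."""

    def __init__(self, state_dim, action_dim, d_model=64, mask_ratio=0.15):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, state_dim + action_dim)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.mask_ratio = mask_ratio

    def forward(self, traj):
        # traj: (batch, T, state_dim + action_dim), each token being a
        # concatenated (state, action) pair from the trajectory.
        tokens = self.embed(traj)
        mask = torch.rand(traj.shape[:2], device=traj.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        pred = self.head(self.encoder(tokens))
        # Reconstruction loss on the masked positions only.
        return ((pred - traj) ** 2)[mask].mean()

model = MaskedTrajectoryModel(state_dim=17, action_dim=6)
loss = model(torch.randn(4, 32, 23))   # batch of 4 trajectories, 32 steps each
```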

    A novel sparse model-based algorithm to cluster categorical data for improved health screening and public health promotion

    Screening for interpersonal violence is critical to mitigate the consequences of violence and improve women's health. Current guidelines recommend that health care providers screen all women for experiences of violence. Despite these recommendations, studies have noted a large variation in provider-reported interpersonal violence screening rates, ranging from 10% to 90%. Given this disparity in screening rates, identifying variables correlated with providers' screening practices is an important contribution. A previously collected survey of healthcare providers was utilized for this analysis; it covered the providers' socio-demographics, attitudes and beliefs, practice environment characteristics, and self-reported screening practices. The objective of the study was to stratify healthcare providers into relatively homogeneous clusters based on mixed types of categorical nominal and ordinal variables and to correlate the identified clusters with violence screening rates. This paper proposes a sparse categorical Factor Mixture Model (sc-FMM) to cluster a large number of categorical variables, in which a sparsity-inducing norm is used for variable selection. An Expectation-Maximization framework integrated with a Gauss-Hermite approximation was developed for model estimation. Simulation studies show significantly better performance of sc-FMM than competing methods. sc-FMM was applied to identify clusters/subgroups of healthcare providers, and the identified clusters were further correlated with interpersonal violence screening rates. The findings reveal how providers' screening rates for interpersonal violence are associated with factors from multiple sources, which informs policy formation and intervention development to promote the uptake of routine screening for interpersonal violence in women.
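
    Since the abstract names a Gauss-Hermite approximation inside the EM framework, the numpy sketch below shows just that building block: approximating an expectation over a standard normal latent factor with Gauss-Hermite quadrature. It does not reproduce the sc-FMM model itself; the node count and the test function are illustrative.

```python
import numpy as np

def gauss_hermite_expectation(g, n_nodes=20):
    """Approximate E[g(Z)] for Z ~ N(0, 1) via Gauss-Hermite quadrature.

    Uses E[g(Z)] = (1 / sqrt(pi)) * sum_i w_i * g(sqrt(2) * x_i), where
    (x_i, w_i) are the Gauss-Hermite nodes and weights. In an EM E-step
    this kind of rule replaces an intractable integral over the latent factor.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    return np.sum(weights * g(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

# Sanity check: E[Z^2] = 1 for a standard normal latent factor.
print(gauss_hermite_expectation(lambda z: z ** 2))  # ~1.0
```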