6,525 research outputs found

Scalable Safe Policy Improvement via Monte Carlo Tree Search

Authors: A. Castellini, A. Farinelli, E. Zorzi, F. Bianchi, M. T. J. Spaan, T. D. Simao
Publication venue: PMLR
Publication date: 01/01/2023

Algorithms for safely improving policies are important for deploying reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes safe policy improvement online using a strategy based on Monte Carlo Tree Search. We theoretically prove that the policy generated by MCTS-SPIBB converges, as the number of simulations grows, to the optimal safely improved policy generated by Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a popular algorithm based on policy iteration. Moreover, our empirical analysis on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.
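
The abstract does not spell out the constraint that SPIBB enforces. As background, here is a minimal sketch of the per-state greedy step commonly described for SPIBB: actions observed fewer than a threshold number of times keep their baseline probability, and the remaining probability mass goes to the best well-observed action. The function name `spibb_greedy_step`, its parameters, and the threshold name `n_wedge` are assumptions made for illustration; this is not the authors' MCTS-SPIBB implementation, and no tree search is shown.

```python
def spibb_greedy_step(q_values, baseline_probs, counts, n_wedge):
    """Return an improved action distribution for a single state.

    q_values:       estimated action values in this state (hypothetical input)
    baseline_probs: baseline policy probabilities in this state
    counts:         how often each action was observed in this state
    n_wedge:        count threshold below which an action is "bootstrapped"
    """
    # Rarely observed actions are bootstrapped: copy the baseline probability.
    bootstrapped = [n < n_wedge for n in counts]
    policy = [p if b else 0.0 for p, b in zip(baseline_probs, bootstrapped)]

    # Well-observed actions may be reweighted freely.
    free = [a for a, b in enumerate(bootstrapped) if not b]
    if free:
        # Give all remaining mass to the best well-observed action.
        best = max(free, key=lambda a: q_values[a])
        policy[best] = sum(baseline_probs[a] for a in free)
    return policy


if __name__ == "__main__":
    # Action 1 has the highest value estimate but too few observations,
    # so it keeps its baseline probability of 0.25.
    print(spibb_greedy_step([1.0, 5.0, 3.0], [0.5, 0.25, 0.25], [20, 2, 15], 10))
```

Applying this step only in the states the agent actually visits is one way to picture the "online and locally" computation the abstract mentions.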

Catalogo dei prodotti della ricerca

Behavior Prior Representation learning for Offline Reinforcement Learning

Authors: Remi Tachet des Combes, Riashat Islam, Romain Laroche, Xin Li, Chen Liu, Jie Yu, Hongyu Zang
Publication date: 02/11/2022

Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on pre-training state representations, followed by policy training. In this work, we introduce a simple yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf offline RL algorithm. Theoretically, we prove that BPR preserves performance guarantees when integrated into algorithms that either have policy improvement guarantees (conservative algorithms) or produce lower bounds on the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art offline RL algorithms leads to significant improvements across several offline control benchmarks.
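
The two-stage recipe in the abstract (behavior cloning to learn a representation, then policy training on the frozen representation) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the encoder is a linear map, the function names `pretrain_encoder`, `fit_q_on_frozen_features`, and `act` are hypothetical, and stage 2 uses a simple reward regression as a stand-in for "any off-the-shelf offline RL algorithm".

```python
import math
import random


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def pretrain_encoder(states, actions, n_actions, dim, epochs=50, lr=0.1, seed=0):
    """Stage 1: learn a linear encoder W by cloning the dataset's actions."""
    rng = random.Random(seed)
    state_dim = len(states[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(state_dim)] for _ in range(dim)]
    head = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_actions)]
    for _ in range(epochs):
        for s, a in zip(states, actions):
            z = matvec(W, s)
            probs = softmax(matvec(head, z))
            # Gradient of the cross-entropy loss with respect to the logits.
            g = [p - (1.0 if k == a else 0.0) for k, p in enumerate(probs)]
            # Backpropagate to the features before the head is updated.
            gz = [sum(g[k] * head[k][j] for k in range(n_actions)) for j in range(dim)]
            for k in range(n_actions):
                for j in range(dim):
                    head[k][j] -= lr * g[k] * z[j]
            for j in range(dim):
                for i in range(state_dim):
                    W[j][i] -= lr * gz[j] * s[i]
    return W, head  # the cloning head is only a training aid


def fit_q_on_frozen_features(W, states, actions, rewards, n_actions, epochs=50, lr=0.5):
    """Stage 2 stand-in: regress rewards on frozen features; W is never updated."""
    dim = len(W)
    Q = [[0.0] * dim for _ in range(n_actions)]
    feats = [matvec(W, s) for s in states]
    for _ in range(epochs):
        for z, a, r in zip(feats, actions, rewards):
            err = sum(q * v for q, v in zip(Q[a], z)) - r
            step = lr / (1.0 + sum(v * v for v in z))  # normalized step for stability
            for j in range(dim):
                Q[a][j] -= step * err * z[j]
    return Q


def act(W, Q, s):
    """Greedy action on top of the frozen representation."""
    vals = matvec(Q, matvec(W, s))
    return max(range(len(vals)), key=vals.__getitem__)


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0]] * 4
    actions = [0, 1] * 4
    rewards = [1.0, 0.0] * 4
    W, _ = pretrain_encoder(states, actions, n_actions=2, dim=3)
    Q = fit_q_on_frozen_features(W, states, actions, rewards, n_actions=2)
    print(act(W, Q, [1.0, 0.0]))
```

The point of the split is that the second stage receives `W` as a fixed input, so any offline RL method that accepts precomputed features could replace the stand-in regression.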

arXiv.org e-Print Archive