
    Corruption-Robust Offline Reinforcement Learning with General Function Approximation

    We investigate the problem of corruption robustness in offline reinforcement learning (RL) with general function approximation, where an adversary can corrupt each sample in the offline dataset, and the corruption level $\zeta \geq 0$ quantifies the cumulative corruption amount over $n$ episodes and $H$ steps. Our goal is to find a policy that is robust to such corruption and minimizes the suboptimality gap with respect to the optimal policy for the uncorrupted Markov decision processes (MDPs). Drawing inspiration from the uncertainty-weighting technique from the robust online RL setting \citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight iteration procedure to efficiently compute on batched samples and propose a corruption-robust algorithm for offline RL. Notably, under the assumption of single policy coverage and the knowledge of $\zeta$, our proposed algorithm achieves a suboptimality bound that is worsened by an additive factor of $\mathcal O\big(\zeta \cdot (\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H))^{1/2} (C(\hat{\mathcal F},\mu))^{-1/2} n^{-1}\big)$ due to the corruption. Here $\text{CC}(\lambda,\hat{\mathcal F},\mathcal Z_n^H)$ is the coverage coefficient that depends on the regularization parameter $\lambda$, the confidence set $\hat{\mathcal F}$, and the dataset $\mathcal Z_n^H$, and $C(\hat{\mathcal F},\mu)$ is a coefficient that depends on $\hat{\mathcal F}$ and the underlying data distribution $\mu$. When specialized to linear MDPs, the corruption-dependent error term reduces to $\mathcal O(\zeta d n^{-1})$ with $d$ being the dimension of the feature map, which matches the existing lower bound for corrupted linear MDPs. This suggests that our analysis is tight in terms of the corruption-dependent term.
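    To make the uncertainty-weighting idea concrete, here is a minimal, hypothetical sketch (not the paper's algorithm): iteratively re-weighted ridge regression on linear features, where a sample's weight shrinks once its elliptical-norm uncertainty exceeds a threshold. The function name, the threshold `alpha`, and the fixed iteration count are illustrative assumptions.

```python
# Illustrative sketch only: iteratively re-weighted ridge regression where
# high-uncertainty (poorly covered, potentially corrupted) samples get
# smaller weights. Not the paper's exact procedure.
import numpy as np

def uncertainty_weighted_ridge(Phi, y, lam=1.0, alpha=1.0, n_iters=5):
    """Phi: (n, d) feature matrix, y: (n,) regression targets.
    alpha: threshold controlling how aggressively uncertain samples are down-weighted."""
    n, d = Phi.shape
    w = np.ones(n)  # start with uniform weights
    for _ in range(n_iters):
        # Weighted regularized covariance and its inverse.
        Lambda = lam * np.eye(d) + (Phi * w[:, None]).T @ Phi
        Lambda_inv = np.linalg.inv(Lambda)
        # Per-sample uncertainty: elliptical norm ||phi_i||_{Lambda^{-1}}.
        unc = np.sqrt(np.einsum('ij,jk,ik->i', Phi, Lambda_inv, Phi))
        # Down-weight samples whose uncertainty exceeds the threshold alpha.
        w = np.minimum(1.0, alpha / np.maximum(unc, 1e-12))
    theta = np.linalg.solve(lam * np.eye(d) + (Phi * w[:, None]).T @ Phi,
                            (Phi * w[:, None]).T @ y)
    return theta, w
```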

    Trustworthy Reinforcement Learning Against Intrinsic Vulnerabilities: Robustness, Safety, and Generalizability

    A trustworthy reinforcement learning algorithm should be competent in solving challenging real-world problems, including robustly handling uncertainties, satisfying safety constraints to avoid catastrophic failures, and generalizing to unseen scenarios during deployment. This study surveys these main perspectives of trustworthy reinforcement learning with respect to its intrinsic vulnerabilities in robustness, safety, and generalizability. For each perspective, we give rigorous formulations, categorize the corresponding methodologies, and discuss benchmarks. Moreover, we provide an outlook section to spur promising future directions, with a brief discussion of extrinsic vulnerabilities arising from human feedback. We hope this survey brings the separate threads of study together in a unified framework and promotes the trustworthiness of reinforcement learning.
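    As one concrete instance of the kind of rigorous formulation the survey categorizes, the safety perspective is commonly cast as a constrained MDP. The display below states that standard formulation in generic notation (not taken from the survey itself), with reward $r$, per-step cost $c$, budget $d$, and discount $\gamma$.

```latex
% Standard constrained-MDP (CMDP) formulation often used for safe RL
% (illustrative; generic notation, not the survey's own).
\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\Big] \le d .
```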

    Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning

    Despite the recent success of reinforcement learning in various domains, these approaches remain, for the most part, deterringly sensitive to hyperparameters and often hinge on substantial engineering feats for their success. We consider the case of off-policy generative adversarial imitation learning and perform an in-depth qualitative and quantitative review of the method. We show that forcing the learned reward function to be locally Lipschitz-continuous is a sine qua non condition for the method to perform well. We then study the effects of this necessary condition and provide several theoretical results involving the local Lipschitzness of the state-value function. We complement these guarantees with empirical evidence attesting to the strong positive effect that consistent satisfaction of the Lipschitzness constraint on the reward has on imitation performance. Finally, we tackle a generic pessimistic reward-preconditioning add-on that spawns a large class of reward shaping methods and makes the base method it is plugged into provably more robust, as shown in several additional theoretical guarantees. We then discuss these through a fine-grained lens and share our insights. Crucially, the guarantees derived and reported in this work hold for any reward satisfying the Lipschitzness condition; nothing is specific to imitation. As such, they may be of independent interest.
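    As an illustration of how such a Lipschitzness constraint is typically encouraged in practice, the sketch below applies a gradient penalty on interpolated state-action pairs. This is a generic, hedged example rather than the authors' exact regularizer; `reward_net`, `k`, and `coef` are assumed names and parameters.

```python
# Hedged sketch: a gradient penalty (WGAN-GP style) is one common way to
# encourage local Lipschitzness of a learned reward/discriminator.
import torch

def gradient_penalty(reward_net, expert_sa, agent_sa, k=1.0, coef=10.0):
    """expert_sa, agent_sa: (batch, dim) tensors of concatenated state-actions."""
    eps = torch.rand(expert_sa.size(0), 1, device=expert_sa.device)
    # Evaluate the penalty at points between the expert and agent supports.
    interp = eps * expert_sa + (1.0 - eps) * agent_sa
    interp.requires_grad_(True)
    out = reward_net(interp).sum()
    grad, = torch.autograd.grad(out, interp, create_graph=True)
    grad_norm = grad.norm(2, dim=1)
    # Penalize gradients whose norm exceeds the target Lipschitz constant k.
    return coef * torch.clamp(grad_norm - k, min=0.0).pow(2).mean()
```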

    Human-level Atari 200x faster

    The task of building general agents that perform well over a wide range of tasks has been an important goal in reinforcement learning since its inception. The problem has been the subject of a large body of research, with performance frequently measured by observing scores over the wide range of environments contained in the Atari 57 benchmark. Agent57 was the first agent to surpass the human benchmark on all 57 games, but this came at the cost of poor data efficiency, requiring nearly 80 billion frames of experience. Taking Agent57 as a starting point, we employ a diverse set of strategies to achieve a 200-fold reduction in the experience needed to outperform the human baseline. We investigate a range of instabilities and bottlenecks we encountered while reducing the data regime, and propose effective solutions to build a more robust and efficient agent. We also demonstrate competitive performance with high-performing methods such as Muesli and MuZero. The four key components of our approach are (1) an approximate trust region method which enables stable bootstrapping from the online network, (2) a normalisation scheme for the loss and priorities which improves robustness when learning a set of value functions with a wide range of scales, (3) an improved architecture employing techniques from NFNets in order to leverage deeper networks without the need for normalization layers, and (4) a policy distillation method which serves to smooth out the instantaneous greedy policy over time.
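    As a hedged illustration of component (1), the snippet below shows one simple way an approximate trust region for bootstrapping could be realized: clipping the online network's bootstrap value to a band around the target network's value. The band width `delta` and the function itself are assumptions for illustration, not the paper's exact mechanism.

```python
# Illustrative sketch only: keep bootstrap values from the online network
# within a trust region around the target network's estimate.
import torch

def trust_region_bootstrap(q_online_next, q_target_next, delta=1.0):
    """q_online_next, q_target_next: (batch,) bootstrap value estimates."""
    low = q_target_next - delta
    high = q_target_next + delta
    clipped = torch.minimum(torch.maximum(q_online_next, low), high)
    # Gradients should not flow through the bootstrap target.
    return clipped.detach()
```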