Corruption-Robust Offline Reinforcement Learning with General Function Approximation
We investigate the problem of corruption robustness in offline reinforcement
learning (RL) with general function approximation, where an adversary can
corrupt each sample in the offline dataset, and the corruption level
$\zeta \ge 0$ quantifies the cumulative corruption amount over $n$ episodes and
$H$ steps. Our goal is to find a policy that is robust to such corruption and
minimizes the suboptimality gap with respect to the optimal policy for the
uncorrupted Markov decision processes (MDPs). Drawing inspiration from the
uncertainty-weighting technique from the robust online RL setting
\citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight
iteration procedure to efficiently compute on batched samples and propose a
corruption-robust algorithm for offline RL. Notably, under the assumption of
single policy coverage and the knowledge of $\zeta$, our proposed algorithm
achieves a suboptimality bound that is worsened by an additive factor of
$\mathcal{O}(\zeta \cdot (\mathrm{CC}(\lambda, \hat{\mathcal{F}}, \mathcal{Z}))^{1/2} \cdot (C(\hat{\mathcal{F}}, \mu))^{-1} \cdot n^{-1})$
due to the corruption.
Here $\mathrm{CC}(\lambda, \hat{\mathcal{F}}, \mathcal{Z})$ is the coverage
coefficient that depends on the regularization parameter $\lambda$, the
confidence set $\hat{\mathcal{F}}$, and the dataset $\mathcal{Z}$, and
$C(\hat{\mathcal{F}}, \mu)$ is a coefficient that depends on $\hat{\mathcal{F}}$
and the underlying data distribution $\mu$. When specialized to linear MDPs,
the corruption-dependent error term reduces to $\mathcal{O}(\zeta \cdot d \cdot n^{-1})$
with $d$ being the dimension of the feature map, which matches the existing
lower bound for corrupted linear MDPs. This suggests that our analysis is tight
in terms of the corruption-dependent term.
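To make the uncertainty-weighting idea concrete, below is a minimal sketch of a weight-iteration procedure for the linear-MDP specialization, where each sample's weight grows with its elliptical-norm uncertainty and the weighted covariance in turn depends on the weights, so the two are iterated to an approximate fixed point. The function name, the weight floor of 1, the threshold `alpha`, and the stopping tolerance are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def uncertainty_weight_iteration(phi, lam=1.0, alpha=1.0, n_iters=20, tol=1e-6):
    """Iteratively compute per-sample uncertainty weights on a batch.

    A sketch of uncertainty weighting with linear features: sample i's
    weight tracks its uncertainty ||phi_i||_{Sigma_w^{-1}}, and Sigma_w
    itself depends on the weights, so we iterate to a fixed point.

    phi   : (n, d) array of feature vectors for the batched samples.
    lam   : ridge regularization parameter (lambda in the abstract).
    alpha : uncertainty threshold controlling how aggressively
            high-uncertainty (possibly corrupted) samples are downweighted.
    """
    n, d = phi.shape
    w = np.ones(n)  # start from uniform weights
    for _ in range(n_iters):
        # Weighted regularized covariance: lam*I + sum_i phi_i phi_i^T / w_i
        sigma = lam * np.eye(d) + (phi / w[:, None]).T @ phi
        sigma_inv = np.linalg.inv(sigma)
        # Per-sample elliptical-norm uncertainty under the current weights
        unc = np.sqrt(np.einsum("nd,de,ne->n", phi, sigma_inv, phi))
        w_new = np.maximum(1.0, unc / alpha)  # floor the weights at 1
        if np.max(np.abs(w_new - w)) < tol:   # approximate fixed point
            w = w_new
            break
        w = w_new
    return w
```

In a downstream weighted regression, sample `i` would be discounted by `1/w[i]`, so samples an adversary is best positioned to corrupt (those with high uncertainty) contribute less to the fitted value function.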
Trustworthy Reinforcement Learning Against Intrinsic Vulnerabilities: Robustness, Safety, and Generalizability
A trustworthy reinforcement learning algorithm should be competent in solving
challenging real-world problems, including robustly handling uncertainties,
satisfying safety constraints to avoid catastrophic failures, and
generalizing to unseen scenarios during deployment. This study provides an
overview of these main perspectives of trustworthy reinforcement learning,
considering its intrinsic vulnerabilities in robustness, safety, and
generalizability. In particular, we give rigorous formulations, categorize
corresponding methodologies, and discuss benchmarks for each perspective.
Moreover, we provide an outlook section to spur promising future directions
with a brief discussion on extrinsic vulnerabilities considering human
feedback. We hope this survey can bring separate threads of study together in
a unified framework and promote the trustworthiness of reinforcement learning.
Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning
Despite the recent success of reinforcement learning in various domains,
these approaches remain, for the most part, deterringly sensitive to
hyper-parameters and are often riddled with essential engineering feats
allowing their success. We consider the case of off-policy generative
adversarial imitation learning, and perform an in-depth review, qualitative and
quantitative, of the method. We show that forcing the learned reward function
to be locally Lipschitz-continuous is a sine qua non condition for the method to
perform well. We then study the effects of this necessary condition and provide
several theoretical results involving the local Lipschitzness of the
state-value function. We complement these guarantees with empirical evidence
attesting to the strong positive effect that the consistent satisfaction of the
Lipschitzness constraint on the reward has on imitation performance. Finally,
we tackle a generic pessimistic reward preconditioning add-on that spawns a
large class of reward shaping methods and makes the base method it is plugged
into provably more robust, as shown in several additional theoretical guarantees. We
then discuss these through a fine-grained lens and share our insights.
Crucially, the guarantees derived and reported in this work are valid for any
reward satisfying the Lipschitzness condition; nothing is specific to
imitation. As such, they may be of independent interest.
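As a concrete illustration of enforcing local Lipschitzness on a learned reward in practice, the sketch below applies a gradient penalty around observed inputs, a standard regularizer in the spirit of WGAN-GP. The `reward_net` interface, the target constant `k`, and the choice of penalizing at the data points themselves are assumptions made for illustration, not the paper's exact recipe.

```python
import torch

def lipschitz_penalty(reward_net, states, actions, k=1.0):
    """Gradient penalty encouraging the learned reward to be locally
    k-Lipschitz around the observed (state, action) pairs.

    Penalizes the squared excess of the input-gradient norm over the
    target constant k (one-sided penalty). All names here are
    illustrative assumptions, not the paper's exact formulation.
    """
    states = states.detach().requires_grad_(True)
    actions = actions.detach().requires_grad_(True)
    rewards = reward_net(states, actions)
    grads = torch.autograd.grad(
        outputs=rewards.sum(), inputs=[states, actions], create_graph=True
    )
    # Per-sample norm of the reward's gradient w.r.t. its inputs
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return ((grad_norm - k).clamp(min=0) ** 2).mean()

# Typical usage inside the reward/discriminator update (lambda_gp is a
# hypothetical penalty coefficient):
# loss = adversarial_loss + lambda_gp * lipschitz_penalty(reward_net, s, a)
```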
Human-level Atari 200x faster
The task of building general agents that perform well over a wide range of
tasks has been an important goal in reinforcement learning since its inception.
The problem has been the subject of a large body of work, with
performance frequently measured by observing scores over the wide range of
environments contained in the Atari 57 benchmark. Agent57 was the first agent
to surpass the human benchmark on all 57 games, but this came at the cost of
poor data-efficiency, requiring nearly 80 billion frames of experience to
achieve. Taking Agent57 as a starting point, we employ a diverse set of
strategies to achieve a 200-fold reduction in the experience needed to outperform
the human baseline. We investigate a range of instabilities and bottlenecks we
encountered while reducing the data regime, and propose effective solutions to
build a more robust and efficient agent. We also demonstrate competitive
performance with high-performing methods such as Muesli and MuZero. The four
key components to our approach are (1) an approximate trust region method which
enables stable bootstrapping from the online network, (2) a normalisation
scheme for the loss and priorities which improves robustness when learning a
set of value functions with a wide range of scales, (3) an improved
architecture employing techniques from NFNets in order to leverage deeper
networks without the need for normalization layers, and (4) a policy
distillation method which serves to smooth out the instantaneous greedy policy
over time.
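To illustrate component (1), here is a minimal sketch of one plausible form of approximate trust region for bootstrapping from the online network: the online bootstrap value is clipped into a band around the slower-moving target network's estimate. The band width `epsilon` and the simple clipping rule are illustrative assumptions rather than the agent's exact mechanism.

```python
import torch

def trust_region_td_target(q_online_next, q_target_next, rewards, gamma,
                           epsilon=1.0):
    """One-step TD target that bootstraps from the online network,
    constrained to a trust region around the target network's estimate.

    A sketch under stated assumptions: the online network provides fresher
    value estimates, but clipping them into [target - eps, target + eps]
    damps the instability of pure online bootstrapping.
    """
    bootstrap = torch.clamp(
        q_online_next,
        min=q_target_next - epsilon,
        max=q_target_next + epsilon,
    )
    # Stop gradients through the bootstrap value, as in standard TD targets
    return rewards + gamma * bootstrap.detach()
```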