Robust Lagrangian and Adversarial Policy Gradient for Robust Constrained Markov Decision Processes
The robust constrained Markov decision process (RCMDP) is a recent
task-modelling framework for reinforcement learning that incorporates
behavioural constraints and that provides robustness to errors in the
transition dynamics model through the use of an uncertainty set. Simulating
RCMDPs requires computing the worst-case dynamics based on value estimates for
each state, an approach which has previously been used in the Robust
Constrained Policy Gradient (RCPG). Highlighting potential downsides of RCPG
such as not robustifying the full constrained objective and the lack of
incremental learning, this paper introduces two algorithms, called RCPG with
Robust Lagrangian and Adversarial RCPG. RCPG with Robust Lagrangian modifies
RCPG by taking the worst-case dynamics based on the Lagrangian rather than
either the value or the constraint. Adversarial RCPG also formulates the
worst-case dynamics based on the Lagrangian but learns this directly and
incrementally as an adversarial policy through gradient descent rather than
indirectly and abruptly through constrained optimisation on a sorted value
list. A theoretical analysis first derives the Lagrangian policy gradient for
the policy optimisation of both proposed algorithms and then the adversarial
policy gradient to learn the adversary for Adversarial RCPG. Empirical
experiments injecting perturbations in inventory management and safe navigation
tasks demonstrate the competitive performance of both algorithms compared to
traditional RCPG variants as well as non-robust and non-constrained ablations.
In particular, Adversarial RCPG ranks among the top two performing algorithms
on all tests.
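As a rough, simplified illustration of the distinction the abstract draws, the sketch below selects the worst-case transition model from a finite uncertainty set by minimising a Lagrangian estimate rather than the value or the constraint alone. The function and argument names, the finite uncertainty set, and the Lagrangian form L = V_r - lambda * (V_c - d) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def worst_case_model(value_estimates, cost_estimates, lagrange_multiplier, cost_budget):
    """Hypothetical helper: pick the index of the worst-case transition model
    from a finite uncertainty set, given one value and one cost estimate per
    candidate model.

    - A value-only robust choice (RCPG-style) would minimise value_estimates alone.
    - The Robust Lagrangian variant sketched here minimises
      L = V_r - lambda * (V_c - d), i.e. the full constrained objective, so the
      adversary can trade reward against constraint violation.
    """
    values = np.asarray(value_estimates, dtype=float)
    costs = np.asarray(cost_estimates, dtype=float)
    lagrangian = values - lagrange_multiplier * (costs - cost_budget)
    return int(np.argmin(lagrangian))  # adversary minimises the agent's Lagrangian

# Toy usage: three candidate dynamics models. A value-only choice would pick
# model 0 (lowest value); the Lagrangian choice picks model 1, whose constraint
# violation hurts the agent more.
print(worst_case_model(value_estimates=[0.8, 1.0, 0.9],
                       cost_estimates=[0.1, 0.6, 0.2],
                       lagrange_multiplier=2.0,
                       cost_budget=0.3))
```

Adversarial RCPG replaces this discrete selection with an adversary learned incrementally by gradient descent, which is what the adversarial policy gradient in the paper's analysis supports.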
CCIL: Continuity-based Data Augmentation for Corrective Imitation Learning
We present a new technique to enhance the robustness of imitation learning
methods by generating corrective data to account for compounding errors and
disturbances. While existing methods rely on interactive expert labeling,
additional offline datasets, or domain-specific invariances, our approach
requires minimal additional assumptions beyond access to expert data. The key
insight is to leverage local continuity in the environment dynamics to generate
corrective labels. Our method first constructs a dynamics model from the expert
demonstration, encouraging local Lipschitz continuity in the learned model. In
locally continuous regions, this model allows us to generate corrective labels
within the neighborhood of the demonstrations but beyond the actual set of
states and actions in the dataset. Training on this augmented data enhances the
agent's ability to recover from perturbations and deal with compounding errors.
We demonstrate the effectiveness of our generated labels through experiments in
a variety of robotics domains in simulation that have distinct forms of
continuity and discontinuity, including classic control problems, drone flying,
navigation with high-dimensional sensor observations, legged locomotion, and
tabletop manipulation.
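The sketch below gives one simplified reading of the label-generation step: given a learned, locally Lipschitz dynamics model f(s, a) -> s', search for an action that steers a perturbed state back onto the demonstration. The gradient-based inversion, the names, and the toy model are illustrative assumptions rather than the paper's exact construction.

```python
import torch

def corrective_label(dynamics_model, perturbed_state, expert_next_state,
                     action_init, steps=100, lr=0.1):
    """Hypothetical sketch of CCIL-style data augmentation: optimise an action
    so that the learned dynamics model maps a state near the demonstration
    back to the expert's next state. The pair (perturbed_state, action) then
    serves as an augmented training example that teaches recovery behaviour.
    """
    action = action_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([action], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        predicted_next = dynamics_model(perturbed_state, action)
        loss = torch.nn.functional.mse_loss(predicted_next, expert_next_state)
        loss.backward()
        opt.step()
    return action.detach()

# Toy usage with a trivial differentiable "model": s' = s + a.
toy_model = lambda s, a: s + a
s_perturbed = torch.tensor([0.1, -0.2])
s_expert_next = torch.tensor([0.0, 0.0])
print(corrective_label(toy_model, s_perturbed, s_expert_next, torch.zeros(2)))
```

Local Lipschitz continuity matters here because it bounds how far the model's predictions can drift in the neighbourhood of the data, which is what makes labels generated slightly off the demonstration trustworthy.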
Regret-Based Optimization for Robust Reinforcement Learning
Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable
to small adversarial noise in observations. Such adversarial noise can have
disastrous consequences in safety-critical environments. For instance, a
self-driving car receiving adversarially perturbed sensory observations about
nearby signs (e.g., a stop sign physically altered to be perceived as a speed
limit sign) or objects (e.g., cars altered to be recognized as trees) can be
fatal. Existing approaches for making RL algorithms robust to an
observation-perturbing adversary have focused on reactive approaches that
iteratively improve against adversarial examples generated at each iteration.
While such approaches have been shown to provide improvements over regular RL
methods, they are reactive and can fare significantly worse if certain
categories of adversarial examples are not generated during training. To that
end, we pursue a more proactive approach that relies on directly optimizing a
well-studied robustness measure, regret, instead of expected value. We provide
a principled approach that minimizes the maximum regret over a "neighborhood"
of the received observation. Our regret criterion can be used to
modify existing value- and policy-based Deep RL methods. We demonstrate that
our approaches provide a significant improvement in performance across a wide
variety of benchmarks against leading approaches for robust Deep RL.
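To make the regret criterion concrete, the sketch below evaluates it on a finite sampled neighbourhood with tabular Q-values; both simplifications are assumptions for illustration, whereas the paper integrates the criterion into deep value- and policy-based methods.

```python
import numpy as np

def min_max_regret_action(q_values_by_obs):
    """Hypothetical sketch: given Q-value rows for observations sampled from a
    neighbourhood of the received observation (shape [num_neighbours, num_actions]),
    choose the action whose worst-case regret over the neighbourhood is smallest.
    """
    q = np.asarray(q_values_by_obs, dtype=float)
    regret = q.max(axis=1, keepdims=True) - q   # regret of each action for each neighbour
    worst_case = regret.max(axis=0)             # maximum regret over the neighbourhood
    return int(worst_case.argmin()), worst_case

# Toy usage: two perturbed observations, three actions. Action 2 is never the
# best for either observation, but its worst-case regret is smallest.
action, regrets = min_max_regret_action([[1.0, 0.5, 0.9],
                                         [0.2, 0.8, 0.7]])
print(action, regrets)
```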
GriddlyJS: A Web IDE for Reinforcement Learning
Progress in reinforcement learning (RL) research is often driven by the design of new, challenging environments, a costly undertaking requiring skills orthogonal to those of a typical machine learning researcher. The complexity of environment development has only increased with the rise of procedural-content generation (PCG) as the prevailing paradigm for producing varied environments capable of testing the robustness and generalization of RL agents. Moreover, existing environments often require complex build processes, making reproducing results difficult. To address these issues, we introduce GriddlyJS, a web-based Integrated Development Environment (IDE) based on the Griddly engine. GriddlyJS allows researchers to visually design and debug arbitrary, complex PCG grid-world environments using a convenient graphical interface, as well as visualize, evaluate, and record the performance of trained agent models. By connecting the RL workflow to the advanced functionality enabled by modern web standards, GriddlyJS allows publishing interactive agent-environment demos that reproduce experimental results directly to the web. To demonstrate the versatility of GriddlyJS, we use it to quickly develop a complex compositional puzzle-solving environment alongside arbitrary human-designed environment configurations and their solutions for use in automatic curriculum learning and offline RL. The GriddlyJS IDE is open source and freely available at https://griddly.ai
Feasible Adversarial Robust Reinforcement Learning for Underspecified Environments
Robust reinforcement learning (RL) considers the problem of learning policies
that perform well in the worst case among a set of possible environment
parameter values. In real-world environments, choosing the set of possible
values for robust RL can be a difficult task. When that set is specified too
narrowly, the agent will be left vulnerable to reasonable parameter values
unaccounted for. When specified too broadly, the agent will be too cautious. In
this paper, we propose Feasible Adversarial Robust RL (FARR), a novel problem
formulation and objective for automatically determining the set of environment
parameter values over which to be robust. FARR implicitly defines the set of
feasible parameter values as those on which an agent could achieve a benchmark
reward given enough training resources. By formulating this problem as a
two-player zero-sum game, optimizing the FARR objective jointly produces an
adversarial distribution over parameter values with feasible support and a
policy robust over this feasible parameter set. We demonstrate that approximate
Nash equilibria for this objective can be found using a variation of the PSRO
algorithm. Furthermore, we show that an optimal agent trained with FARR is more
robust to feasible adversarial parameter selection than with existing minimax,
domain-randomization, and regret objectives in a parameterized gridworld and
three MuJoCo control environments.
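The following sketch shows one simplified reading of the FARR idea on a finite set of environment-parameter values: a value is feasible if some policy can reach a benchmark reward on it, and the adversary then picks the feasible value on which the current policy performs worst. The finite candidate set and the `best_achievable_reward` oracle are illustrative assumptions; the paper formulates the objective as a two-player zero-sum game solved with a PSRO variant.

```python
import numpy as np

def farr_worst_case(candidate_params, best_achievable_reward, policy_reward,
                    benchmark_reward):
    """Hypothetical sketch of feasible adversarial parameter selection:
    restrict the adversary to parameter values on which a benchmark reward is
    achievable, then return the feasible value where the current policy is
    weakest."""
    best = np.asarray(best_achievable_reward, dtype=float)
    cur = np.asarray(policy_reward, dtype=float)
    feasible = best >= benchmark_reward
    if not feasible.any():
        return None  # no parameter value admits the benchmark reward
    idx = np.where(feasible)[0]
    return candidate_params[idx[np.argmin(cur[idx])]]

# Toy usage: three parameter values, the last one infeasible (no policy can
# reach the benchmark there), so the adversary must ignore it.
print(farr_worst_case(candidate_params=[0.0, 0.5, 2.0],
                      best_achievable_reward=[1.0, 0.9, 0.1],
                      policy_reward=[0.9, 0.4, 0.0],
                      benchmark_reward=0.5))
```

Restricting the adversary to the feasible set is what prevents the over-caution that an overly broad uncertainty set would otherwise induce.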
Observational Robustness and Invariances in Reinforcement Learning via Lexicographic Objectives
Policy robustness in Reinforcement Learning (RL) may not be desirable at any
price; the alterations that robustness requirements impose on otherwise optimal
policies should be explainable and quantifiable. Policy gradient algorithms
that have strong convergence guarantees are usually modified to obtain robust
policies in ways that do not preserve algorithm guarantees, which defeats the
purpose of formal robustness requirements. In this work we study a notion of
robustness in partially observable MDPs where state observations are perturbed
by a noise-induced stochastic kernel. We characterise the set of policies that
are maximally robust by analysing how the policies are altered by this kernel.
We then establish a connection between such robust policies and certain
properties of the noise kernel, as well as with structural properties of the
underlying MDPs, constructing sufficient conditions for policy robustness. We
use these notions to propose a robustness-inducing scheme, applicable to any
policy gradient algorithm, to formally trade off the reward achieved by a
policy with its robustness level through lexicographic optimisation, which
preserves convergence properties of the original algorithm. We test the
proposed approach through numerical experiments on safety-critical RL
environments, and show how the proposed method helps achieve high robustness
when state errors are introduced in the policy roll-out.
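One way to picture a lexicographic trade-off between reward and robustness is the gradient-combination sketch below: the robustness gradient is only followed in directions that do not conflict with the reward gradient. This is an assumed, simplified scheme, not the paper's construction, which is designed to preserve the convergence guarantees of the underlying policy gradient algorithm.

```python
import numpy as np

def lexicographic_update(grad_reward, grad_robust, tolerance=0.0):
    """Illustrative sketch of a lexicographic gradient step: the reward
    objective keeps priority, so any component of the robustness gradient that
    conflicts with it (beyond a tolerance) is projected out before the two are
    combined."""
    g_r = np.asarray(grad_reward, dtype=float)
    g_s = np.asarray(grad_robust, dtype=float)
    conflict = g_s @ g_r
    if conflict < -tolerance and g_r @ g_r > 0:
        g_s = g_s - (conflict / (g_r @ g_r)) * g_r  # remove the conflicting component
    return g_r + g_s

# Toy usage: the robustness gradient partially opposes the reward gradient;
# only its non-conflicting component survives.
print(lexicographic_update([1.0, 0.0], [-0.5, 1.0]))
```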
- …