Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning
Many real-world domains require safe decision making in uncertain
environments. In this work, we introduce a deep reinforcement learning
framework for approaching this important problem. We consider a distribution
over transition models, and apply a risk-averse perspective towards model
uncertainty through the use of coherent distortion risk measures. We provide
robustness guarantees for this framework by showing it is equivalent to a
specific class of distributionally robust safe reinforcement learning problems.
Unlike existing approaches to robustness in deep reinforcement learning,
however, our formulation does not involve minimax optimization. This leads to
an efficient, model-free implementation of our approach that only requires
standard data collection from a single training environment. In experiments on
continuous control tasks with safety constraints, we demonstrate that our
framework produces robust performance and safety at deployment time across a
range of perturbed test environments.
Comment: 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
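As a rough illustration of the risk-averse perspective on model uncertainty described above, the sketch below applies one coherent distortion risk measure, CVaR, to value estimates computed under a collection of sampled transition models. The choice of CVaR, the alpha level, and the array of per-model values are assumptions made for illustration only; the framework in the paper admits any coherent distortion risk measure.

```python
# Minimal sketch (not the authors' code): a coherent distortion risk measure
# applied to value estimates from sampled transition models. CVaR is one
# concrete choice of distortion function used here for illustration.
import numpy as np

def distortion_weights(n: int, alpha: float = 0.25) -> np.ndarray:
    """Weights induced by the CVaR distortion g(u) = min(u / alpha, 1).

    Sorting values from worst to best, the i-th order statistic receives
    weight g((i+1)/n) - g(i/n), concentrating mass on the worst alpha-fraction.
    """
    u = np.arange(n + 1) / n
    g = np.minimum(u / alpha, 1.0)
    return np.diff(g)

def risk_averse_value(model_values: np.ndarray, alpha: float = 0.25) -> float:
    """Risk-averse aggregate of per-model value estimates (lower = worse outcome)."""
    sorted_vals = np.sort(model_values)                  # worst model outcomes first
    w = distortion_weights(len(model_values), alpha)
    return float(np.dot(w, sorted_vals))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    values = rng.normal(loc=10.0, scale=2.0, size=32)    # hypothetical per-model targets
    print("mean value:       ", values.mean())
    print("risk-averse value:", risk_averse_value(values, alpha=0.25))
```

Because the weight is concentrated on the lowest-value models, the aggregate target is pessimistic with respect to model uncertainty without requiring a minimax optimization over an explicit uncertainty set, consistent with the abstract's description.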
Reliable deep reinforcement learning: stable training and robust deployment
Deep reinforcement learning (RL) represents a data-driven framework for sequential decision making that has demonstrated the ability to solve challenging control tasks. This data-driven, learning-based approach offers the potential to improve operations in complex systems, but only if it can be trusted to produce reliable performance both during training and upon deployment. These requirements have hindered the adoption of deep RL in many real-world applications. In order to overcome the limitations of existing methods, this dissertation introduces reliable deep RL algorithms that deliver (i) stable training from limited data and (ii) robust, safe deployment in the presence of uncertainty.
The first part of the dissertation addresses the interactive nature of deep RL, where learning requires data collection from the environment. This interactive process can be expensive, time-consuming, and dangerous in many real-world settings, which motivates the need for reliable and efficient learning. We develop deep RL algorithms that guarantee stable performance throughout training, while also directly considering data efficiency in their design. These algorithms are supported by novel policy improvement lower bounds that account for finite-sample estimation error and sample reuse.
The second part of the dissertation focuses on the uncertainty present in real-world applications, which can impact the performance and safety of learned control policies. In order to reliably deploy deep RL in the presence of uncertainty, we introduce frameworks that incorporate safety constraints and provide robustness to general disturbances in the environment. Importantly, these frameworks make limited assumptions on the training process, and can be implemented in settings that require real-world interaction for training. This motivates deep RL algorithms that deliver robust, safe performance at deployment time, while only using standard data collection from a single training environment.
Overall, this dissertation contributes new techniques to overcome key limitations of deep RL for real-world decision making and control. Experiments across a variety of continuous control tasks demonstrate the effectiveness of our algorithms.
Adversarial Imitation Learning from Visual Observations using Latent Information
We focus on the problem of imitation learning from visual observations, where
the learning agent has access to videos of experts as its sole learning source.
The challenges of this framework include the absence of expert actions and the
partial observability of the environment, as the ground-truth states can only
be inferred from pixels. To tackle this problem, we first conduct a theoretical
analysis of imitation learning in partially observable environments. We
establish upper bounds on the suboptimality of the learning agent with respect
to the divergence between the expert and the agent latent state-transition
distributions. Motivated by this analysis, we introduce an algorithm called
Latent Adversarial Imitation from Observations, which combines off-policy
adversarial imitation techniques with a learned latent representation of the
agent's state from sequences of observations. In experiments on
high-dimensional continuous robotic tasks, we show that our model-free approach
in latent space matches state-of-the-art performance. Additionally, we show how
our method can be used to improve the efficiency of reinforcement learning from
pixels by leveraging expert videos. To ensure reproducibility, we provide free
access to our code.
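To make the adversarial-imitation ingredient concrete, here is a minimal sketch of a discriminator defined over latent state transitions, assuming an encoder that maps observation sequences to latent states has been trained separately. The network sizes, the binary cross-entropy loss, and the log-D reward are illustrative choices, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): a discriminator over latent
# state transitions for adversarial imitation from observations. It assumes an
# encoder from observation sequences to latent states is learned separately.
import torch
import torch.nn as nn

class LatentTransitionDiscriminator(nn.Module):
    """Classifies latent transitions (z_t, z_{t+1}) as expert vs. agent."""

    def __init__(self, latent_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, z: torch.Tensor, z_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, z_next], dim=-1))  # logits

def discriminator_loss(disc, expert_z, expert_z_next, agent_z, agent_z_next):
    """Binary cross-entropy: expert transitions labeled 1, agent transitions 0."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(expert_z, expert_z_next)
    agent_logits = disc(agent_z, agent_z_next)
    return (bce(expert_logits, torch.ones_like(expert_logits))
            + bce(agent_logits, torch.zeros_like(agent_logits)))

def imitation_reward(disc, z, z_next):
    """Reward for the agent: high when its latent transitions look expert-like."""
    with torch.no_grad():
        return torch.sigmoid(disc(z, z_next)).clamp(1e-6, 1 - 1e-6).log()
```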
Generalized proximal policy optimization with sample reuse
In real-world decision making tasks, it is critical for data-driven reinforcement learning methods to be both stable and sample efficient. On-policy methods typically generate reliable policy improvement throughout training, while off-policy methods make more efficient use of data through sample reuse. In this work, we combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms. We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in Proximal Policy Optimization. This motivates an off-policy version of the popular algorithm that we call Generalized Proximal Policy Optimization with Sample Reuse. We demonstrate both theoretically and empirically that our algorithm delivers improved performance by effectively balancing the competing goals of stability and sample efficiency.
Published version.
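A minimal sketch of the kind of clipped surrogate objective with sample reuse that this abstract describes: each sample stores the log-probability under the behavior policy that generated it, so the importance ratio is taken with respect to that policy even when the batch mixes data from several recent policies. The fixed clipping range and the function signature below are simplifying assumptions rather than the paper's exact generalized objective.

```python
# Minimal sketch (not the authors' code) of a PPO-style clipped surrogate that
# reuses samples collected by several recent policies.
import torch

def geppo_surrogate(new_logp: torch.Tensor,
                    behavior_logp: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy objective over a batch drawn from multiple recent policies.

    new_logp:      log pi_theta(a|s) under the policy being optimized
    behavior_logp: log prob under whichever past policy collected each sample
    advantages:    advantage estimates for each sample
    """
    ratio = torch.exp(new_logp - behavior_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```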
Optimal Transport Perturbations for Safe Reinforcement Learning with Robustness Guarantees
Robustness and safety are critical for the trustworthy deployment of deep
reinforcement learning. Real-world decision making applications require
algorithms that can guarantee robust performance and safety in the presence of
general environment disturbances, while making limited assumptions on the data
collection process during training. In order to accomplish this goal, we
introduce a safe reinforcement learning framework that incorporates robustness
through the use of an optimal transport cost uncertainty set. We provide an
efficient implementation based on applying Optimal Transport Perturbations to
construct worst-case virtual state transitions, which does not impact data
collection during training and does not require detailed simulator access. In
experiments on continuous control tasks with safety constraints, our approach
demonstrates robust performance while significantly improving safety at
deployment time compared to standard safe reinforcement learning.
Comment: Transactions on Machine Learning Research (TMLR), 202
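A rough sketch of the perturbation idea: observed next states are nudged toward lower-value "virtual" next states while paying a penalty for the transport cost of moving away from the data. The quadratic cost, the fixed penalty weight lam, and the gradient-based inner loop are illustrative assumptions, not the paper's exact formulation of the optimal transport cost uncertainty set.

```python
# Minimal sketch (not the authors' implementation): constructing worst-case
# "virtual" next states by perturbing observed next states under a penalized
# transport cost. value_fn and lam are assumed names for illustration.
import torch

def optimal_transport_perturbation(next_states: torch.Tensor,
                                   value_fn,             # V(s') -> tensor of values
                                   lam: float = 10.0,    # transport-cost penalty weight
                                   steps: int = 5,
                                   lr: float = 0.01) -> torch.Tensor:
    """Gradient-based search for value-minimizing perturbed next states."""
    delta = torch.zeros_like(next_states, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        perturbed = next_states + delta
        # Decrease the value of the virtual next state while paying a quadratic
        # cost for moving away from the observed next state.
        loss = value_fn(perturbed).mean() + lam * (delta ** 2).sum(dim=-1).mean()
        loss.backward()
        opt.step()
    return (next_states + delta).detach()
```

Because the perturbation is applied only to the learning targets, data collection itself is unchanged, consistent with the abstract's claim that the approach does not impact data collection during training.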
Opportunities and challenges from using animal videos in reinforcement learning for navigation
N00014-19-1-2571 - Department of Defense/ONR
Accepted manuscript.
Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach
In order for reinforcement learning techniques to be useful in real-world
decision making processes, they must be able to produce robust performance from
limited data. Deep policy optimization methods have achieved impressive results
on complex tasks, but their real-world adoption remains limited because they
often require significant amounts of data to succeed. When combined with small
sample sizes, these methods can result in unstable learning due to their
reliance on high-dimensional sample-based estimates. In this work, we develop
techniques to control the uncertainty introduced by these estimates. We
leverage these techniques to propose a deep policy optimization approach
designed to produce stable performance even when data is scarce. The resulting
algorithm, Uncertainty-Aware Trust Region Policy Optimization, generates robust
policy updates that adapt to the level of uncertainty present throughout the
learning process.
Comment: To appear in Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21).
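As a toy illustration of an uncertainty-adaptive trust region, the sketch below shrinks a KL radius as the standard error of the sample-based advantage estimates grows, so updates become more conservative when data is scarce or noisy. The specific scaling rule and function names are hypothetical stand-ins for the paper's uncertainty-aware update.

```python
# Minimal sketch (not the authors' algorithm): adapting a trust-region radius to
# the uncertainty of sample-based estimates. The scaling rule below is a
# hypothetical illustration of shrinking the allowed update under high noise.
import numpy as np

def adaptive_trust_region(advantages: np.ndarray,
                          base_delta: float = 0.01,
                          scale: float = 1.0) -> float:
    """Shrink the KL trust-region radius as estimation uncertainty grows."""
    n = len(advantages)
    standard_error = advantages.std(ddof=1) / np.sqrt(n)  # uncertainty of the mean estimate
    return base_delta / (1.0 + scale * standard_error)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    small_batch = rng.normal(size=64)      # few, noisy samples -> smaller radius
    large_batch = rng.normal(size=4096)    # many samples -> radius near base_delta
    print(adaptive_trust_region(small_batch))
    print(adaptive_trust_region(large_batch))
```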