Safe Reinforcement Learning
This dissertation proposes and presents solutions to two new problems that fall within the broad scope of reinforcement learning (RL) research. The first problem, high confidence off-policy evaluation (HCOPE), requires an algorithm to use historical data from one or more behavior policies to compute a high confidence lower bound on the performance of an evaluation policy. This allows us to, for the first time, provide the user of any RL algorithm with confidence that a newly proposed policy (which has never actually been used) will perform well.
The second problem is to construct what we call a safe reinforcement learning algorithm---an algorithm that searches for new and improved policies, while ensuring that the probability that a bad policy is proposed is low. Importantly, the user of the RL algorithm may tune the meaning of bad (in terms of a desired performance baseline) and how low the probability of a bad policy being deployed should be, in order to capture the level of risk that is acceptable for the application at hand.
We show empirically that our solutions to these two critical problems require surprisingly little data, making them practical for real problems. While our methods allow us to, for the first time, produce convincing statistical guarantees about the performance of a policy without requiring its execution, the primary contribution of this dissertation is not the methods that we propose. The primary contribution of this dissertation is a compelling argument that these two problems, HCOPE and safe reinforcement learning, which at first may seem out of reach, are actually tractable. We hope that this will inspire researchers to propose their own methods that improve upon our own, and that the development of increasingly data-efficient safe reinforcement learning algorithms will catalyze the widespread adoption of reinforcement learning algorithms for suitable real-world problems.
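To make the HCOPE setting concrete, the following minimal sketch computes a high-confidence lower bound on an evaluation policy's performance from behavior-policy data, using importance-weighted returns and a generic one-sided Hoeffding bound. The toy logging scenario is invented and the dissertation develops tighter estimators and concentration inequalities, so this is only an illustration of the problem statement, not the proposed method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 2000, 0.05  # number of logged trajectories, allowed failure probability

# Toy one-step setting: the behavior policy logs actions and returns in [0, 1].
actions = rng.integers(0, 2, size=n)               # actions drawn from the behavior policy
behavior_probs = np.where(actions == 0, 0.7, 0.3)  # pi_b(a | s)
eval_probs = np.where(actions == 0, 0.4, 0.6)      # pi_e(a | s), the policy to be evaluated
returns = rng.uniform(0, 1, size=n) * (0.3 + 0.7 * actions)  # synthetic logged returns

# Importance-weighted returns: unbiased estimates of the evaluation policy's performance,
# computed without ever executing the evaluation policy.
iw_returns = (eval_probs / behavior_probs) * returns

# The weight is at most 0.6 / 0.3 = 2 and returns lie in [0, 1], so each estimate lies in [0, b].
b = 2.0
mean = iw_returns.mean()
# One-sided Hoeffding lower bound: holds with probability at least 1 - delta.
lower_bound = mean - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

print(f"importance-weighted estimate: {mean:.3f}")
print(f"{(1 - delta) * 100:.0f}% confidence lower bound:  {lower_bound:.3f}")
```

The gap between the estimate and the bound shrinks as the amount of historical data grows, which is why data efficiency is the central practical concern for methods of this kind.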
Enhancing Exploration and Safety in Deep Reinforcement Learning
A Deep Reinforcement Learning (DRL) agent tries to learn a policy maximizing a long-term objective by trial and error in large state spaces. However, this learning paradigm requires a non-trivial amount of interaction with the environment to achieve good performance. Moreover, critical applications, such as robotics, typically involve safety criteria to consider while designing novel DRL solutions. Hence, devising safe learning approaches with efficient exploration is crucial to avoid getting stuck in local optima, failing to learn properly, or causing damage to the surrounding environment. This thesis focuses on developing Deep Reinforcement Learning algorithms to foster efficient exploration and safer behaviors in simulated and real domains of interest, ranging from robotics to multi-agent systems. To this end, we rely both on standard benchmarks, such as SafetyGym, and on robotic tasks widely adopted in the literature (e.g., manipulation, navigation). This variety of problems is crucial to assess the statistical significance of our empirical studies and the generalization skills of our approaches. We initially benchmark the sample efficiency versus performance trade-off between value-based and policy-gradient algorithms. This part highlights the benefits of using non-standard simulation environments (i.e., Unity), which also facilitates the development of further optimizations for DRL. We also discuss the limitations of standard evaluation metrics (e.g., return) in characterizing the actual behaviors of a policy, proposing the use of Formal Verification (FV) as a practical methodology to evaluate behaviors over desired specifications. The second part introduces Evolutionary Algorithms (EAs) as a gradient-free complementary optimization strategy. Specifically, we combine population-based and gradient-based DRL to diversify exploration and improve performance in both single-agent and multi-agent applications. For the latter, we discuss how prior Multi-Agent (Deep) Reinforcement Learning (MARL) approaches hinder exploration, proposing an architecture that favors cooperation without affecting exploration.
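As an illustration of the population-based plus gradient-based combination described above, here is a minimal sketch in which a toy objective stands in for an episode return, simple truncation selection stands in for the evolutionary component, and the gradient learner's parameters are periodically injected into the population. All names and hyperparameters are invented; the thesis's actual architecture is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(params):
    """Toy stand-in for rolling out a policy in an environment: higher is better."""
    return -np.sum((params - 3.0) ** 2)

def gradient_step(params, lr=0.1):
    """Toy stand-in for one gradient-based DRL update on the same objective."""
    grad = -2.0 * (params - 3.0)
    return params + lr * grad

pop_size, dim, sigma = 16, 5, 0.5
population = [rng.normal(size=dim) for _ in range(pop_size)]
grad_learner = rng.normal(size=dim)

for generation in range(50):
    grad_learner = gradient_step(grad_learner)

    # Mutate every member, inject the gradient-trained parameters, and keep the
    # fittest pop_size candidates (truncation selection).
    candidates = population + [p + sigma * rng.normal(size=dim) for p in population]
    candidates.append(grad_learner.copy())
    candidates.sort(key=episode_return, reverse=True)
    population = candidates[:pop_size]

print("best return in population:", episode_return(population[0]))
```

The mutation noise keeps exploration diverse while the injected gradient learner prevents the population from losing the progress made by the gradient-based optimizer.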
Clipped-Objective Policy Gradients for Pessimistic Policy Optimization
To facilitate efficient learning, policy gradient approaches to deep
reinforcement learning (RL) are typically paired with variance reduction
measures and strategies for making large but safe policy changes based on a
batch of experiences. Natural policy gradient methods, including Trust Region
Policy Optimization (TRPO), seek to produce monotonic improvement through
bounded changes in policy outputs. Proximal Policy Optimization (PPO) is a
commonly used, first-order algorithm that instead uses loss clipping to take
multiple safe optimization steps per batch of data, replacing the bound on the
single step of TRPO with regularization on multiple steps. In this work, we
find that the performance of PPO, when applied to continuous action spaces, may
be consistently improved through a simple change in objective. In place of the
importance sampling objective of PPO, we recommend a basic policy gradient,
clipped in an equivalent fashion. While both objectives produce
biased gradient estimates with respect to the RL objective, they also both
display significantly reduced variance compared to the unbiased off-policy
policy gradient. Additionally, we show that (1) the clipped-objective policy
gradient (COPG) objective is on average "pessimistic" compared to the PPO
objective, and (2) this pessimism promotes enhanced exploration. As a result, we
empirically observe that COPG produces improved learning compared to PPO in
single-task, constrained, and multi-task learning, without adding significant
computational cost or complexity. Compared to TRPO, the COPG approach is seen
to offer comparable or superior performance, while retaining the simplicity of
a first-order method.
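The precise COPG objective is defined in the paper; purely to illustrate the contrast the abstract draws, the sketch below places PPO's clipped importance-sampling surrogate next to one plausible reading of "a basic policy gradient, clipped in an equivalent fashion", in which the new log-probability is clipped to the band corresponding to a ratio in [1 - eps, 1 + eps]. All tensors are invented.

```python
import math
import torch

eps = 0.2  # clipping parameter, as in PPO

# Illustrative per-sample quantities from a batch of logged transitions.
logp_old = torch.tensor([-1.2, -0.8, -2.0, -0.5])                      # log pi_old(a|s), fixed
logp_new = torch.tensor([-1.0, -1.1, -1.5, -0.4], requires_grad=True)  # log pi_theta(a|s)
adv = torch.tensor([0.7, -0.3, 1.2, -0.9])                             # advantage estimates

# PPO: clip the probability ratio and take the pessimistic minimum.
ratio = torch.exp(logp_new - logp_old)
ppo_obj = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

# Assumed COPG form: keep the plain policy-gradient term logp * adv, but clip
# logp_new to the band that corresponds to a ratio in [1 - eps, 1 + eps].
logp_clipped = torch.clamp(logp_new,
                           logp_old + math.log(1 - eps),
                           logp_old + math.log(1 + eps))
copg_obj = torch.min(logp_new * adv, logp_clipped * adv).mean()

print("PPO surrogate:", ppo_obj.item())
print("COPG surrogate:", copg_obj.item())
copg_obj.backward()  # where clipping binds and the clipped term is selected, no gradient flows
print("d(objective)/d(logp_new):", logp_new.grad)
```

Both surrogates stop pushing a sample once its policy change exceeds the clip range; the difference is whether the unclipped term is the importance-sampled return or the plain log-probability-weighted advantage.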
Detecting Adversarial Directions in Deep Reinforcement Learning to Make Robust Decisions
Learning in MDPs with highly complex state representations is currently
possible due to multiple advancements in reinforcement learning algorithm
design. However, this increase in complexity, together with the growth in the
dimensionality of the observations, came at the cost of volatility that can be
exploited via adversarial attacks (i.e., moving along worst-case directions in
the observation space). To address this policy instability problem, we propose
a novel method to detect the presence of these non-robust directions
via local quadratic approximation of the deep neural policy loss. Our method
provides a theoretical basis for the fundamental cut-off between safe
observations and adversarial observations. Furthermore, our technique is
computationally efficient, and does not depend on the methods used to produce
the worst-case directions. We conduct extensive experiments in the Arcade
Learning Environment with several different adversarial attack techniques. Most
significantly, we demonstrate the effectiveness of our approach even in the
setting where non-robust directions are explicitly optimized to circumvent our
proposed method.
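The following is a minimal finite-difference sketch of the general idea of using a local quadratic approximation to separate benign from non-robust perturbation directions. The quadratic toy loss, the perturbation radius, and the threshold are illustrative stand-ins rather than the paper's actual detection criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, radius = 10, 0.5

# Toy stand-in for the deep neural policy loss as a function of the observation:
# a quadratic with one deliberately high-curvature (non-robust) axis.
W = 0.2 * rng.normal(size=(dim, dim))
e1 = np.eye(dim)[0]
H = W @ W.T + 40.0 * np.outer(e1, e1)

def policy_loss(obs):
    return 0.5 * obs @ H @ obs

def quadratic_profile(obs, direction, h=1e-3):
    """Finite-difference slope and curvature of the loss along a unit direction."""
    d = direction / np.linalg.norm(direction)
    f0, fp, fm = policy_loss(obs), policy_loss(obs + h * d), policy_loss(obs - h * d)
    return (fp - fm) / (2 * h), (fp - 2 * f0 + fm) / h ** 2

def predicted_increase(obs, direction):
    """Worst-case loss increase over a +/- radius step, per the local quadratic model."""
    slope, curvature = quadratic_profile(obs, direction)
    return abs(slope) * radius + 0.5 * curvature * radius ** 2

obs = np.zeros(dim)  # illustrative observation
for name, d in [("benign axis", np.eye(dim)[1]), ("non-robust axis", e1)]:
    inc = predicted_increase(obs, d)
    print(f"{name}: predicted loss increase {inc:.2f} -> flagged: {inc > 1.0}")
```

Because only loss evaluations along the candidate direction are needed, this style of test does not depend on how the worst-case direction was produced, which matches the computational-efficiency claim in the abstract.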
Understanding Model-Based Reinforcement Learning and its Application in Safe Reinforcement Learning
Model-based reinforcement learning algorithms have been shown to achieve successful results on various continuous control benchmarks, but the understanding of model-based methods is limited. We try to interpret how model-based methods work through novel experiments on state-of-the-art algorithms, with an emphasis on the model learning part. We evaluate the role of model learning in policy optimization and propose methods to learn a more accurate model. With a better understanding of model-based reinforcement learning, we then apply model-based methods to solve safe reinforcement learning (RL) problems with near-zero violation of hard constraints throughout training. Drawing an analogy with how humans and animals learn to perform safe actions, we break down the safe RL problem into three stages. First, we train agents in a constraint-free environment to learn a performant policy for reaching high rewards, and simultaneously learn a model of the dynamics. Second, we use model-based methods to plan safe actions and train a safeguarding policy from these actions through imitation. Finally, we propose a factored framework to train an overall policy that mixes the performant policy and the safeguarding policy. This three-stage curriculum ensures near-zero violation of safety constraints at all times. As an advantage of model-based methods, the sample complexity required in the second and third stages is significantly lower than that of model-free methods, which can enable online safe learning. We demonstrate the effectiveness of our methods on various continuous control problems and analyze the advantages over state-of-the-art approaches.
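To illustrate the mixing stage described above, the sketch below rolls a dynamics model forward over a short horizon to predict whether the performant policy's action would violate a constraint and, if so, falls back to the safeguarding policy. Every component here is a hand-written stand-in for the learned counterparts in the abstract, and the mixing rule is a simplification of the factored framework.

```python
import numpy as np

# Toy 1-D setting: the state is a position, and the constraint is |position| <= 1.5.
LIMIT = 1.5

def dynamics_model(state, action):
    """Stand-in for the learned dynamics model: one-step next-state prediction."""
    return state + 0.1 * action

def performant_policy(state):
    """Stand-in for the reward-seeking policy learned in stage one."""
    return 3.0  # always pushes right, ignoring the constraint

def safeguarding_policy(state):
    """Stand-in for the imitation-trained safeguarding policy from stage two."""
    return -2.0 * state  # pushes back toward the safe region

def violates(state):
    return abs(state) > LIMIT

def predicted_violation(state, action, horizon=5):
    """Roll the model forward, following the performant policy after the first step."""
    s = dynamics_model(state, action)
    for _ in range(horizon - 1):
        if violates(s):
            return True
        s = dynamics_model(s, performant_policy(s))
    return violates(s)

def mixed_policy(state):
    """Stage three: fall back to the safeguarding policy when the model predicts trouble."""
    a = performant_policy(state)
    return safeguarding_policy(state) if predicted_violation(state, a) else a

state = 0.0
for t in range(20):
    state = dynamics_model(state, mixed_policy(state))  # pretend the model is the real env
print("final state:", state, "| constraint satisfied:", not violates(state))
```

The safeguard only intervenes when the short-horizon rollout predicts a violation, which is the mechanism that keeps constraint violations near zero without abandoning the performant policy elsewhere.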