Soft-Robust Actor-Critic Policy-Gradient
Robust Reinforcement Learning aims to derive optimal behavior that accounts
for model uncertainty in dynamical systems. However, previous studies have
shown that, by considering the worst-case scenario, robust policies can be
overly conservative. Our soft-robust framework is an attempt to overcome this
issue. In this paper, we present a novel Soft-Robust Actor-Critic algorithm
(SR-AC). It learns an optimal policy with respect to a distribution over an
uncertainty set and stays robust to model uncertainty but avoids the
conservativeness of robust strategies. We show the convergence of SR-AC and
test the efficiency of our approach on different domains by comparing it
against regular learning methods and their robust formulations.
Comment: UAI 201
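To make the distinction concrete, the soft-robust idea can be contrasted with the worst-case criterion in a few lines of code. The sketch below is only an illustration of the two objectives on a toy set of candidate models; the per-model values, the weights of the distribution over the uncertainty set, and all variable names are assumptions, not the SR-AC algorithm itself.

    import numpy as np

    # Hypothetical values of a fixed policy under three candidate transition
    # models drawn from an uncertainty set (illustrative numbers only).
    values_per_model = np.array([4.0, 9.0, 10.0])

    # Weights of a distribution over the uncertainty set (must sum to 1).
    model_weights = np.array([0.2, 0.5, 0.3])

    # Robust criterion: judge the policy by its worst-case model.
    robust_value = values_per_model.min()

    # Soft-robust criterion: judge the policy by its weighted average value,
    # which is less conservative than the worst case but still accounts for
    # model uncertainty through the distribution.
    soft_robust_value = float(values_per_model @ model_weights)

    print(f"robust (worst-case) value: {robust_value}")
    print(f"soft-robust (average over models) value: {soft_robust_value:.2f}")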
A Bayesian Approach to Robust Reinforcement Learning
Robust Markov Decision Processes (RMDPs) are intended to ensure robustness with
respect to changing or adversarial system behavior. In this framework,
transitions are modeled as arbitrary elements of a known and properly
structured uncertainty set and a robust optimal policy can be derived under the
worst-case scenario. In this study, we address the issue of learning in RMDPs
using a Bayesian approach. We introduce the Uncertainty Robust Bellman Equation
(URBE) which encourages safe exploration for adapting the uncertainty set to
new observations while preserving robustness. We propose a URBE-based
algorithm, DQN-URBE, that scales this method to higher dimensional domains. Our
experiments show that the derived URBE-based strategy leads to a better
trade-off between less conservative solutions and robustness in the presence of
model misspecification. In addition, we show that the DQN-URBE algorithm can
adapt significantly faster to changing dynamics online compared to existing
robust techniques with fixed uncertainty sets.
Comment: Accepted to UAI 201
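A rough way to picture an uncertainty-aware Bellman target is a value update whose target carries a bonus that shrinks as data accumulates. The sketch below is a minimal stand-in under that assumption: it uses a count-based bonus rather than the paper's propagated posterior variance, and every constant and table is illustrative.

    import numpy as np

    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))        # value estimates
    visits = np.ones((n_states, n_actions))    # visit counts as a crude uncertainty proxy
    gamma, alpha, beta = 0.9, 0.1, 1.0         # discount, step size, bonus weight

    def uncertainty_aware_update(s, a, r, s_next):
        # Q-learning step whose target is inflated by an uncertainty bonus.
        # The bonus shrinks as (s, a) is visited more often, loosely mimicking how
        # a Bayesian variance estimate contracts with data; the actual URBE
        # propagates variance through a Bellman-style recursion rather than counts.
        visits[s, a] += 1
        bonus = beta / np.sqrt(visits[s, a])
        target = r + gamma * Q[s_next].max() + bonus
        Q[s, a] += alpha * (target - Q[s, a])

    # One illustrative transition with a made-up reward and successor state.
    uncertainty_aware_update(s=0, a=1, r=0.5, s_next=3)
    print(Q[0])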
Challenges of Real-World Reinforcement Learning
Reinforcement learning (RL) has proven its worth in a series of artificial
domains, and is beginning to show some successes in real-world scenarios.
However, many of the research advances in RL are often hard to leverage in
real-world systems due to a series of assumptions that are rarely satisfied in
practice. We present a set of nine unique challenges that must be addressed to
productionize RL for real-world problems. For each of these challenges, we
specify the exact meaning of the challenge, present some approaches from the
literature, and specify some metrics for evaluating that challenge. An approach
that addresses all nine challenges would be applicable to a large number of
real-world problems. We also present an example domain, modified to exhibit
these challenges, as a testbed for practical RL research.
Learning Robust Options by Conditional Value at Risk Optimization
Options are generally learned by using an inaccurate environment model (or
simulator), which contains uncertain model parameters. While there are several
methods to learn options that are robust against the uncertainty of model
parameters, these methods only consider either the worst case or the average
(ordinary) case for learning options. Considering only one of these cases often
produces options that perform poorly in the case that was not considered. In this
paper, we propose a conditional value at risk (CVaR)-based method to learn
options that work well in both the average and worst cases. We extend the
CVaR-based policy gradient method proposed by Chow and Ghavamzadeh (2014) to
deal with robust Markov decision processes and then apply the extended method
to learning robust options. We conduct experiments to evaluate our method in
multi-joint robot control tasks (HopperIceBlock, Half-Cheetah, and Walker2D).
Experimental results show that our method produces options that 1) give better
worst-case performance than the options learned only to minimize the
average-case loss, and 2) give better average-case performance than the options
learned only to minimize the worst-case loss.
Comment: NeurIPS 2019. Video demo: https://drive.google.com/open?id=1xXgSeEa_nNG397ZkIayk3CwYPy_BPy8X Source codes: https://github.com/TakuyaHiraoka/Learning-Robust-Options-by-Conditional-Value-at-Risk-Optimizatio
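Because the method is built around the conditional value at risk, a short reminder of how CVaR is estimated from sampled losses may help. The snippet below implements only the standard empirical CVaR estimator (the quantity a CVaR policy gradient such as Chow and Ghavamzadeh's optimizes), not the option-learning procedure; the sample losses are made up.

    import numpy as np

    def empirical_cvar(losses, alpha):
        # Empirical conditional value at risk: the expected loss in the worst
        # alpha-fraction of outcomes, via the Rockafellar-Uryasev form
        # CVaR_alpha(L) = VaR + E[(L - VaR)_+] / alpha.
        losses = np.asarray(losses, dtype=float)
        var = np.quantile(losses, 1.0 - alpha)        # value at risk (VaR)
        return var + np.mean(np.maximum(losses - var, 0.0)) / alpha

    # Hypothetical per-episode losses of an option under perturbed model parameters.
    sample_losses = [1.0, 2.0, 0.5, 8.0, 1.5, 7.5, 0.8, 3.0]
    print("mean loss:", np.mean(sample_losses))       # average-case criterion
    print("CVaR(0.25):", empirical_cvar(sample_losses, alpha=0.25))  # tail criterion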
Challenges of Applying Deep Reinforcement Learning in Dynamic Dispatching
Dynamic dispatching aims to smartly allocate the right resources to the right
place at the right time, and it is one of the core problems in operations
optimization for the mining industry. Theoretically, deep
reinforcement learning (RL) should be a natural fit to solve this problem.
However, the industry still relies on heuristics or even human intuition, which
often yield short-sighted and sub-optimal solutions. In this paper, we review the
main challenges in using deep RL to address the dynamic dispatching problem in
the mining industry.
Bayesian Robust Optimization for Imitation Learning
One of the main challenges in imitation learning is determining what action
an agent should take when outside the state distribution of the demonstrations.
Inverse reinforcement learning (IRL) can enable generalization to new states by
learning a parameterized reward function, but these approaches still face
uncertainty over the true reward function and corresponding optimal policy.
Existing safe imitation learning approaches based on IRL deal with this
uncertainty using a maxmin framework that optimizes a policy under the
assumption of an adversarial reward function, whereas risk-neutral IRL
approaches optimize a policy for either the mean or the MAP reward function. While
completely ignoring risk can lead to overly aggressive and unsafe policies,
optimizing in a fully adversarial sense is also problematic as it can lead to
overly conservative policies that perform poorly in practice. To provide a
bridge between these two extremes, we propose Bayesian Robust Optimization for
Imitation Learning (BROIL). BROIL leverages Bayesian reward function inference
and a user-specific risk tolerance to efficiently optimize a robust policy that
balances expected return and conditional value at risk. Our empirical results
show that BROIL provides a natural way to interpolate between return-maximizing
and risk-minimizing behaviors and outperforms existing risk-sensitive and
risk-neutral inverse reinforcement learning algorithms. Code is available at
https://github.com/dsbrown1331/broil.
Comment: In proceedings NeurIPS 202
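The trade-off BROIL describes, between expected return and conditional value at risk under a posterior over reward functions, can be sketched with a toy scoring rule. The snippet below is a hypothetical illustration of that blend: the posterior samples, the risk level alpha, and the weight lam are assumptions, and the paper's actual optimization procedure is not reproduced here.

    import numpy as np

    def cvar(losses, alpha):
        # Empirical CVaR: expected loss in the worst alpha-fraction of outcomes.
        losses = np.asarray(losses, dtype=float)
        var = np.quantile(losses, 1.0 - alpha)
        return var + np.mean(np.maximum(losses - var, 0.0)) / alpha

    # Hypothetical posterior samples of each candidate policy's return under the
    # inferred reward (rows: policies, columns: posterior samples).
    posterior_returns = np.array([
        [10.0, 9.5, 1.0, 10.5],   # high mean return but heavy downside risk
        [6.0, 6.5, 5.5, 6.2],     # lower mean return, small downside risk
    ])

    alpha, lam = 0.25, 0.6        # risk level and risk-tolerance trade-off weight

    for i, returns in enumerate(posterior_returns):
        risk = cvar(-returns, alpha)                       # CVaR of the return shortfall
        score = (1.0 - lam) * returns.mean() - lam * risk  # blend of mean return and risk
        print(f"policy {i}: mean={returns.mean():.2f}, tail risk={risk:.2f}, score={score:.2f}")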
Distributional Robustness and Regularization in Reinforcement Learning
Distributionally Robust Optimization (DRO) has made it possible to prove the
equivalence between robustness and regularization in classification and
regression, thus providing an analytical reason why regularization generalizes
well in statistical learning. Although DRO's extension to sequential
decision-making overcomes model uncertainty through the robust Markov Decision
Process (MDP) setting, the resulting formulation is hard to solve, especially on
large domains. On the other hand, existing regularization methods in
reinforcement learning only address the uncertainty due to stochasticity. Our
study aims to facilitate robust reinforcement
learning by establishing a dual relation between robust MDPs and
regularization. We introduce Wasserstein distributionally robust MDPs and prove
that they hold out-of-sample performance guarantees. Then, we introduce a new
regularizer for empirical value functions and show that it lower bounds the
Wasserstein distributionally robust value function. We extend the result to
linear value function approximation for large state spaces. Our approach
provides an alternative formulation of robustness with guaranteed finite-sample
performance. Moreover, it suggests using regularization as a practical tool for
dealing with model uncertainty in reinforcement learning methods.
Comment: Accepted at the "Theoretical Foundations of Reinforcement Learning" Workshop - ICML 202
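The claim that a radius-scaled regularizer lower-bounds a distributionally robust value can be pictured with a toy numerical check. The sketch below is only an illustration of the shape of that statement under invented numbers and a crude sensitivity bound; it is not the Wasserstein construction or the regularizer from the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    # Per-state values of a fixed policy under the empirical (nominal) model; the
    # numbers, the radius epsilon, and the sensitivity bound are all assumptions.
    values_empirical = np.array([3.0, 5.0, 4.0])
    epsilon = 0.1                                   # radius of the perturbation set
    sensitivity = 1.0                               # assumed bound on how fast values change

    # Regularized value: empirical value minus a radius-scaled penalty, mirroring
    # the claim that such a regularizer lower-bounds the robust value.
    regularized = values_empirical - epsilon * sensitivity

    # Brute-force stand-in for the robust value: worst value over random
    # perturbations whose magnitude is bounded by epsilon * sensitivity.
    perturbations = rng.uniform(-epsilon * sensitivity, epsilon * sensitivity, size=(1000, 3))
    robust = (values_empirical + perturbations).min(axis=0)

    print("regularized lower bound:", regularized)
    print("worst case over sampled perturbations:", robust)
    print("lower bound holds:", bool(np.all(regularized <= robust)))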
Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification
Many real-world physical control systems are required to satisfy constraints
upon deployment. Furthermore, real-world systems are often subject to effects
such as non-stationarity, wear-and-tear, uncalibrated sensors and so on. Such
effects perturb the system dynamics and can cause a policy trained
successfully in one domain to perform poorly when deployed to a perturbed
version of the same domain. This can affect a policy's ability to maximize
future rewards as well as the extent to which it satisfies constraints. We
refer to this as constrained model misspecification. We present an algorithm
that mitigates this form of misspecification, and showcase its performance in
multiple simulated MuJoCo tasks from the Real World Reinforcement Learning
(RWRL) suite.
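One way to read the constrained-robust setting is that both the return objective and the constraint cost should be evaluated pessimistically over a set of perturbed dynamics. The sketch below encodes that selection rule for a toy table of per-model returns and costs; the numbers, the cost budget, and the enumeration over policies are illustrative assumptions rather than the paper's algorithm.

    import numpy as np

    # Per-policy, per-perturbed-model expected returns and constraint costs
    # (rows: candidate policies, columns: perturbed versions of the dynamics).
    returns = np.array([
        [12.0, 11.0, 3.0],
        [8.0, 8.5, 7.5],
    ])
    costs = np.array([
        [0.4, 0.6, 1.2],
        [0.5, 0.6, 0.7],
    ])
    cost_budget = 0.8

    # Robust feasibility: the constraint must hold under the worst-case model.
    feasible = costs.max(axis=1) <= cost_budget

    # Robust objective: among feasible policies, prefer the best worst-case return.
    worst_case_return = returns.min(axis=1)
    best = np.argmax(np.where(feasible, worst_case_return, -np.inf))
    print("feasible:", feasible, "chosen policy:", int(best))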
Robust Reinforcement Learning for Continuous Control with Model Misspecification
We provide a framework for incorporating robustness -- to perturbations in
the transition dynamics which we refer to as model misspecification -- into
continuous control Reinforcement Learning (RL) algorithms. We specifically
focus on incorporating robustness into a state-of-the-art continuous control RL
algorithm called Maximum a-posteriori Policy Optimization (MPO). We achieve
this by learning a policy that optimizes for a worst case expected return
objective and derive a corresponding robust entropy-regularized Bellman
contraction operator. In addition, we introduce a less conservative,
soft-robust, entropy-regularized objective with a corresponding Bellman
operator. We show that both robust and soft-robust policies outperform their
non-robust counterparts in nine MuJoCo domains with environment perturbations.
In addition, we show improved robust performance on a high-dimensional,
simulated, dexterous robotic hand. Finally, we present multiple investigative
experiments that provide a deeper insight into the robustness framework. This
includes an adaptation to another continuous control RL algorithm as well as
learning the uncertainty set from offline data. Performance videos can be found
online at https://sites.google.com/view/robust-rl
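The worst-case expected return objective is typically expressed through a robust Bellman backup that minimizes over the uncertainty set of transition models before maximizing over actions. The sketch below performs such a backup on a toy tabular MDP with two hand-made candidate kernels; it omits the entropy regularization and the MPO machinery, and all quantities are invented for illustration.

    import numpy as np

    n_states, n_actions = 3, 2
    gamma = 0.9

    # Two candidate transition kernels P[k][s, a, s'] forming a toy uncertainty set.
    P = np.array([
        # kernel 0
        [[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
         [[0.0, 0.9, 0.1], [0.2, 0.2, 0.6]],
         [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]]],
        # kernel 1 (a perturbed version of kernel 0)
        [[[0.7, 0.3, 0.0], [0.1, 0.6, 0.3]],
         [[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
         [[0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]],
    ])
    R = np.array([[0.0, 1.0], [0.5, 0.2], [1.0, 0.0]])   # reward per state-action

    def robust_backup(V):
        # One robust Bellman backup: min over the kernel set, then max over actions.
        next_values = np.einsum("ksat,t->ksa", P, V)     # expected next value per kernel
        worst = next_values.min(axis=0)                  # pessimistic over kernels
        return (R + gamma * worst).max(axis=1)           # greedy over actions

    V = np.zeros(n_states)
    for _ in range(100):                                 # iterate toward the fixed point
        V = robust_backup(V)
    print(V)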
Evaluating the progress of Deep Reinforcement Learning in the real world: aligning domain-agnostic and domain-specific research
Deep Reinforcement Learning (DRL) is considered a potential framework to
improve many real-world autonomous systems; it has attracted the attention of
multiple and diverse fields. Nevertheless, successful deployment in the real
world is a test that most DRL models still need to pass. In this work we
focus on this issue by reviewing and evaluating the research efforts from both
domain-agnostic and domain-specific communities. On the one hand, we offer a
comprehensive summary of DRL challenges and of the different proposals to
mitigate them; this helps identify five gaps in domain-agnostic research.
On the other hand, from the domain-specific perspective, we discuss different
success stories and argue why other models might fail to be deployed. Finally,
we discuss ways to move forward that account for both perspectives.