Policy Gradients with Variance Related Risk Criteria
Managing risk in dynamic decision problems is of cardinal importance in many
fields such as finance and process control. The most common approach to
defining risk is through various variance related criteria such as the Sharpe
Ratio or the standard deviation adjusted reward. It is known that optimizing
many of the variance related risk criteria is NP-hard. In this paper we devise
a framework for local policy gradient style algorithms for reinforcement
learning for variance related criteria. Our starting point is a new formula for
the variance of the cost-to-go in episodic tasks. Using this formula we develop
policy gradient algorithms for criteria that involve both the expected cost and
the variance of the cost. We prove the convergence of these algorithms to local
minima and demonstrate their applicability in a portfolio planning problem.
Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
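As a sketch of the kind of objective this entry targets (a generic mean-variance policy gradient, not the paper's exact algorithm): both terms of E[R] - lambda*Var[R] admit score-function gradient estimates from sampled episode returns, using Var[R] = E[R^2] - E[R]^2. The one-step Gaussian-policy task and the penalty lambda below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(action):
    # Hypothetical one-step task: higher actions pay more on average,
    # but their returns are much noisier.
    return action + rng.normal(0.0, 1.0 + 2.0 * abs(action))

theta, sigma, lam, lr = 0.0, 0.5, 0.1, 0.01
for _ in range(2000):
    actions = theta + sigma * rng.normal(size=64)
    returns = np.array([episode_return(a) for a in actions])
    score = (actions - theta) / sigma**2        # d log pi / d theta for a Gaussian policy
    mean_r = returns.mean()
    # grad E[R]   ~ E[R * score]
    # grad Var[R] ~ E[R^2 * score] - 2 E[R] E[R * score]
    g_mean = np.mean(returns * score)
    g_var = np.mean(returns**2 * score) - 2.0 * mean_r * g_mean
    theta += lr * (g_mean - lam * g_var)        # ascend E[R] - lam * Var[R]
print(theta)
```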
Practical Risk Measures in Reinforcement Learning
Practical application of Reinforcement Learning (RL) often involves risk
considerations. We study a generalized approximation scheme for risk measures,
based on Monte-Carlo simulations, where the risk measures need not necessarily
be \emph{coherent}. We demonstrate that, even in simple problems, measures such
as the variance of the reward-to-go do not capture the risk in a satisfactory
manner. In addition, we show how a risk measure can be derived from the model's
realizations. We propose a neural architecture for estimating the risk and
suggest the risk critic architecture that can be used to optimize a policy under
general risk measures. We conclude our work with experiments that demonstrate
the efficacy of our approach.
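A minimal sketch of the Monte-Carlo premise, under our own illustrative assumptions about the return distribution: sample many reward-to-go realizations and apply any risk functional to them, coherent or not.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_returns(n=10_000):
    # Hypothetical reward-to-go distribution with a rare catastrophic branch.
    r = rng.normal(1.0, 0.3, size=n)
    crash = rng.random(n) < 0.02
    return np.where(crash, -10.0, r)

def variance(r):
    return float(np.var(r))

def value_at_risk(r, alpha=0.05):            # loss at the alpha-quantile
    return float(-np.quantile(r, alpha))

def cvar(r, alpha=0.05):                     # expected loss in the worst tail
    q = np.quantile(r, alpha)
    return float(-r[r <= q].mean())

returns = sample_returns()
for name, fn in [("variance", variance), ("VaR(5%)", value_at_risk),
                 ("CVaR(5%)", cvar)]:
    print(name, fn(returns))
```

Any functional of the sampled returns can be plugged in the same way, which is what keeps the scheme agnostic to coherence.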
Risk Averse Robust Adversarial Reinforcement Learning
Deep reinforcement learning has recently made significant progress in solving
computer games and robotic control tasks. A known problem, though, is that
policies overfit to the training environment and may not avoid rare,
catastrophic events such as automotive accidents. A classical technique for
improving the robustness of reinforcement learning algorithms is to train on a
set of randomized environments, but this approach only guards against common
situations. Recently, robust adversarial reinforcement learning (RARL) was
developed, which allows efficient applications of random and systematic
perturbations by a trained adversary. A limitation of RARL is that only the
expected control objective is optimized; there is no explicit modeling or
optimization of risk. Thus the agents do not consider the probability of
catastrophic events (i.e., those inducing abnormally large negative reward),
except through their effect on the expected objective. In this paper we
introduce risk-averse robust adversarial reinforcement learning (RARARL), using
a risk-averse protagonist and a risk-seeking adversary. We test our approach on
a self-driving vehicle controller. We use an ensemble of policy networks to
model risk as the variance of value functions. We show through experiments that
a risk-averse agent is better equipped to handle a risk-seeking adversary, and
experiences substantially fewer crashes compared to agents trained without an
adversary.
Comment: ICRA 2019.
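A small sketch of how ensemble disagreement can serve as the risk signal described here (our reading, with a hypothetical penalty weight lam): the risk-averse protagonist subtracts the ensemble variance from the mean value, while the risk-seeking adversary adds it.

```python
import numpy as np

rng = np.random.default_rng(2)
n_ensemble, n_actions = 5, 4
# Hypothetical Q-value estimates from 5 independently trained networks
# for one state; disagreement across the ensemble is the risk proxy.
q_ensemble = rng.normal(0.0, 1.0, size=(n_ensemble, n_actions))

mean_q = q_ensemble.mean(axis=0)
var_q = q_ensemble.var(axis=0)

lam = 1.0
risk_averse_action = int(np.argmax(mean_q - lam * var_q))   # protagonist
risk_seeking_action = int(np.argmax(mean_q + lam * var_q))  # adversary
print(risk_averse_action, risk_seeking_action)
```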
Variance Adjusted Actor Critic Algorithms
We present an actor-critic framework for MDPs where the objective is the
variance-adjusted expected return. Our critic uses linear function
approximation, and we extend the concept of compatible features to the
variance-adjusted setting. We present an episodic actor-critic algorithm and
show that it converges almost surely to a locally optimal point of the
objective function.
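For concreteness, the variance-adjusted objective and its score-function gradient can be written out explicitly; this is standard mean-variance algebra with penalty weight mu, our reconstruction rather than the paper's derivation (the critic's job is to estimate the expectations that appear in it):

```latex
J(\theta) = \mathbb{E}[R] - \mu\,\mathrm{Var}[R],
\qquad \mathrm{Var}[R] = \mathbb{E}[R^2] - \mathbb{E}[R]^2,
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}\!\left[ \left( R - \mu R^2 + 2\mu\,\mathbb{E}[R]\,R \right)
    \nabla_\theta \log \pi_\theta(a \mid s) \right].
```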
Adaptive Symmetric Reward Noising for Reinforcement Learning
Recent reinforcement learning algorithms, though achieving impressive results
in various fields, suffer from brittle training effects such as regression in
results and high sensitivity to initialization and parameters. We claim that
some of this brittleness stems from variance differences, i.e., when different
environment areas (states and/or actions) have different reward variance. This
causes two problems: first, the "Boring Areas Trap" in algorithms such as
Q-learning, where moving between areas depends on the current area's variance,
and getting out of a boring area is hard due to its low variance; second, the
"Manipulative Consultant" problem, where the value-estimation functions used in
DQN and Actor-Critic algorithms influence the agent to prefer boring areas,
regardless of the mean reward, as they maximize estimation precision rather
than reward. This sheds new light on how exploration contributes to training,
as it helps with both challenges. Cognitive experiments in humans showed that
noised reward signals may paradoxically improve performance. We explain this
using the two mentioned problems, claiming that both humans and algorithms may
share similar challenges. Inspired by this result, we propose Adaptive
Symmetric Reward Noising (ASRN), which adds Gaussian noise to rewards according
to their states' estimated variance, thus avoiding the two problems while not
affecting the environment's mean reward behavior. We conduct our experiments in
a Multi-Armed Bandit problem with variance differences. We demonstrate that a
Q-learning algorithm exhibits the brittleness effect in this problem, and that
the ASRN scheme can dramatically improve the results. We show that ASRN helps a
DQN algorithm's training process reach better results in an end-to-end
autonomous driving task using the AirSim driving simulator.
Comment: 9 pages, 7 figures, conference.
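A minimal sketch of the noising idea as stated (the target-variance rule and the decay rate are our illustrative assumptions): keep a running per-state estimate of reward variance and add zero-mean Gaussian noise to rewards from low-variance areas, so variances equalize while the mean reward in each state is unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)

class RewardNoiser:
    """Equalize reward variance across states with zero-mean Gaussian noise."""

    def __init__(self, n_states, decay=0.99):
        self.mean = np.zeros(n_states)
        self.var = np.zeros(n_states)
        self.decay = decay

    def update(self, s, r):
        # Exponential moving estimates of the per-state reward mean/variance.
        self.mean[s] = self.decay * self.mean[s] + (1 - self.decay) * r
        self.var[s] = self.decay * self.var[s] + (1 - self.decay) * (r - self.mean[s]) ** 2

    def noised(self, s, r):
        self.update(s, r)
        target = self.var.max()                   # the noisiest area sets the bar
        gap = max(target - self.var[s], 0.0)
        return r + rng.normal(0.0, np.sqrt(gap))  # symmetric: mean unchanged
```

The learner is then trained on noised(s, r) in place of the raw reward r.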
Multi-objective Model-based Policy Search for Data-efficient Learning with Sparse Rewards
The most data-efficient algorithms for reinforcement learning in robotics are
model-based policy search algorithms, which alternate between learning a
dynamical model of the robot and optimizing a policy to maximize the expected
return given the model and its uncertainties. However, the current algorithms
lack an effective exploration strategy to deal with sparse or misleading reward
scenarios: if they do not experience any state with a positive reward during
the initial random exploration, they are very unlikely to solve the problem. Here,
we propose a novel model-based policy search algorithm, Multi-DEX, that
leverages a learned dynamical model to efficiently explore the task space and
solve tasks with sparse rewards in a few episodes. To achieve this, we frame
the policy search problem as a multi-objective, model-based policy optimization
problem with three objectives: (1) generate maximally novel state trajectories,
(2) maximize the expected return and (3) keep the system in state-space regions
for which the model is as accurate as possible. We then optimize these
objectives using a Pareto-based multi-objective optimization algorithm. The
experiments show that Multi-DEX is able to solve sparse reward scenarios (with
a simulated robotic arm) in much lower interaction time than VIME, TRPO,
GEP-PG, CMA-ES and Black-DROPS.
Comment: Conference on Robot Learning (CoRL), 2018; code at
https://github.com/resibots/kaushik_2018_multi-dex ; video at
https://youtu.be/9ZLwUxAAq6
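The Pareto-based selection step can be illustrated with a small non-dominated filter over the three objectives the abstract names (novelty, expected return, model accuracy), all maximized; the candidate scores below are placeholders.

```python
import numpy as np

def pareto_front(scores):
    """Return indices of non-dominated rows; every column is maximized."""
    keep = []
    for i in range(scores.shape[0]):
        others = np.delete(scores, i, axis=0)
        # Row i is dominated if some other row is >= everywhere and > somewhere.
        dominated = np.any(np.all(others >= scores[i], axis=1) &
                           np.any(others > scores[i], axis=1))
        if not dominated:
            keep.append(i)
    return keep

rng = np.random.default_rng(4)
# Columns: (novelty, expected return, model accuracy) for 8 candidate policies.
scores = rng.random((8, 3))
print(pareto_front(scores))
```

Policies on the returned front are the candidates the search keeps; choosing among them, rather than scalarizing, preserves the multi-objective trade-off.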
Policy Evaluation with Variance Related Risk Criteria in Markov Decision Processes
In this paper we extend temporal difference policy evaluation algorithms to
performance criteria that include the variance of the cumulative reward. Such
criteria are useful for risk management, and are important in domains such as
finance and process control. We propose both TD(0) and LSTD(lambda) variants
with linear function approximation, prove their convergence, and demonstrate
their utility in a 4-dimensional continuous state space problem.
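One standard way to realize such criteria, and a plausible reading of this entry (shown here in tabular form on a hypothetical chain, whereas the paper uses linear function approximation): run TD(0) on both the value V and the second moment M of the reward-to-go, with the target for M being r^2 + 2 r V(s') + M(s'), and recover Var = M - V^2.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5                      # non-terminal states 0..4; the episode ends at state 5
V = np.zeros(n + 1)        # V[n] = M[n] = 0 for the terminal state
M = np.zeros(n + 1)
alpha = 0.05

def step(s):
    r = rng.normal(1.0, 0.5 * (s + 1))   # noisier rewards deeper in the chain
    return r, s + 1                      # deterministic right-moving chain

for _ in range(20_000):
    s = 0
    while s < n:
        r, s2 = step(s)
        V[s] += alpha * (r + V[s2] - V[s])
        # Second-moment TD target: r^2 + 2*r*V(s') + M(s')
        M[s] += alpha * (r**2 + 2 * r * V[s2] + M[s2] - M[s])
        s = s2

print(np.round(M[:n] - V[:n] ** 2, 2))   # estimated variance of the reward-to-go
```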
Decoupled Data Based Approach for Learning to Control Nonlinear Dynamical Systems
This paper addresses the problem of learning the optimal control policy for a
nonlinear stochastic dynamical system with continuous state space, continuous
action space and unknown dynamics. This class of problems is typically
addressed in the stochastic adaptive control and reinforcement learning
literature using model-based and model-free approaches, respectively. Both
methods rely on solving a dynamic programming problem, either directly or
indirectly, to find the optimal closed-loop control policy. The inherent
`curse of dimensionality' associated with dynamic programming also makes these
approaches computationally difficult.
This paper proposes a novel decoupled data-based control (D2C) algorithm that
addresses this problem using a decoupled `open loop - closed loop' approach.
First, an open-loop deterministic trajectory optimization problem is solved
using a black-box simulation model of the dynamical system. Then, a closed-loop
control is developed around this open-loop trajectory by linearizing the
dynamics about the nominal trajectory. By virtue of linearization, a linear
quadratic regulator based algorithm can be used for this closed-loop control.
We show that the performance of the D2C algorithm is approximately optimal.
Moreover, simulation results suggest a significant reduction in training time
compared to other state-of-the-art algorithms.
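A compact sketch of the decoupled recipe on a generic black-box simulator f(x, u), under our own assumptions (the dynamics, cost, step sizes and the finite-difference optimizer are all illustrative, not the paper's implementation): first optimize an open-loop control sequence, then linearize around the nominal trajectory and wrap a time-varying LQR feedback around it.

```python
import numpy as np

def f(x, u):
    # Hypothetical black-box dynamics: a damped double integrator.
    A = np.array([[1.0, 0.1], [0.0, 0.95]])
    B = np.array([[0.0], [0.1]])
    return A @ x + B @ u

def cost(xs, us):
    return sum(x @ x + 0.1 * (u @ u) for x, u in zip(xs, us)) + xs[-1] @ xs[-1]

def rollout(x0, us):
    xs = [x0]
    for u in us:
        xs.append(f(xs[-1], u))
    return xs

T, x0, eps, lr = 30, np.array([2.0, 0.0]), 1e-4, 0.05
us = [np.zeros(1) for _ in range(T)]

# Stage 1: open-loop trajectory optimization by finite-difference descent.
for _ in range(300):
    base = cost(rollout(x0, us), us)
    grad = []
    for t in range(T):
        up = [u.copy() for u in us]
        up[t][0] += eps
        grad.append((cost(rollout(x0, up), up) - base) / eps)
    us = [u - lr * g for u, g in zip(us, grad)]

# Stage 2: linearize about the nominal trajectory (finite differences) and
# compute time-varying LQR gains by a backward Riccati recursion.
xs = rollout(x0, us)
Q, R, P = np.eye(2), 0.1 * np.eye(1), np.eye(2)
Ks = [None] * T
for t in reversed(range(T)):
    x, u = xs[t], us[t]
    A = np.column_stack([(f(x + eps * e, u) - f(x, u)) / eps for e in np.eye(2)])
    B = np.column_stack([(f(x, u + eps * e) - f(x, u)) / eps for e in np.eye(1)])
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ A - A.T @ P @ B @ K
    Ks[t] = K

# Closed loop: u_t = u_nom_t - K_t @ (x_t - x_nom_t)
```

The feedback term is what rejects stochastic deviations from the nominal plan at execution time.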
Emergent Complexity via Multi-Agent Competition
Reinforcement learning algorithms can train agents that solve problems in
complex, interesting environments. Normally, the complexity of the trained
agent is closely related to the complexity of the environment. This suggests
that a highly capable agent requires a complex environment for training. In
this paper, we point out that a competitive multi-agent environment trained
with self-play can produce behaviors that are far more complex than the
environment itself. We also point out that such environments come with a
natural curriculum, because for any skill level, an environment full of agents
of this level will have the right level of difficulty. This work introduces
several competitive multi-agent environments where agents compete in a 3D world
with simulated physics. The trained agents learn a wide variety of complex and
interesting skills, even though the environments themselves are relatively
simple. The skills include behaviors such as running, blocking, ducking,
tackling, fooling opponents, kicking, and defending using both arms and legs. A
highlight of the learned behaviors can be found here: https://goo.gl/eR7fbX
Comment: Published as a conference paper at ICLR 2018.
Reinforcement Learning
Reinforcement learning (RL) is a general framework for adaptive control,
which has proven to be efficient in many domains, e.g., board games, video
games or autonomous vehicles. In such problems, an agent faces a sequential
decision-making problem where, at every time step, it observes its state,
performs an action, receives a reward and moves to a new state. An RL agent
learns by trial and error a good policy (or controller) based on observations
and numeric reward feedback on the previously performed action. In this
chapter, we present the basic framework of RL and recall the two main families
of approaches that have been developed to learn a good policy. The first one,
which is value-based, consists of estimating the value of an optimal policy, a
value from which a policy can be recovered, while the other, called policy
search, works directly in a policy space. Actor-critic methods can be seen as a
policy search technique in which the learned policy value guides the policy
improvement. Besides, we give an overview of some extensions of the standard RL
framework, notably when risk-averse behavior needs to be taken into account or
when rewards are not available or not known.
Comment: Chapter in "A Guided Tour of Artificial Intelligence Research", Springer.
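To make the value-based family concrete, here is a minimal tabular Q-learning sketch on a toy chain (an illustrative environment of our own, not from the chapter): a value of the optimal policy is estimated, and a policy is then recovered greedily from it.

```python
import numpy as np

rng = np.random.default_rng(6)
n_states, n_actions = 6, 2            # a toy chain; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0    # reward only at the right end
    return r, s2, s2 == n_states - 1

for _ in range(500):
    s, done, t = 0, False, 0
    while not done and t < 200:
        if rng.random() < eps:                # epsilon-greedy exploration
            a = int(rng.integers(n_actions))
        else:                                 # greedy with random tie-breaking
            a = int(np.argmax(Q[s] + rng.normal(0.0, 1e-9, n_actions)))
        r, s2, done = step(s, a)
        # Q-learning target: r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s, t = s2, t + 1

print(Q.argmax(axis=1))                       # the recovered greedy policy
```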