Deep Reinforcement Learning in Parameterized Action Space
Recent work has shown that deep neural networks are capable of approximating
both value functions and policies in reinforcement learning domains featuring
continuous state and action spaces. However, to the best of our knowledge no
previous work has succeeded at using deep neural networks in structured
(parameterized) continuous action spaces. To fill this gap, this paper focuses
on learning within the domain of simulated RoboCup soccer, which features a
small set of discrete action types, each of which is parameterized with
continuous variables. The best learned agent can score goals more reliably than
the 2012 RoboCup champion agent. As such, this paper represents a successful
extension of deep reinforcement learning to the class of parameterized action
space MDPs.
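A minimal sketch of how a policy over such a parameterized action space can be laid out: a shared network body, a softmax head over the discrete action types, and one head per type for its continuous parameters. The layer sizes, number of types, and parameter dimensions below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ParamActionPolicy(nn.Module):
    def __init__(self, state_dim=58, n_types=4, param_dims=(2, 1, 1, 2)):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.type_head = nn.Linear(128, n_types)      # logits over action types
        self.param_heads = nn.ModuleList(
            nn.Linear(128, d) for d in param_dims     # continuous params per type
        )

    def forward(self, state):
        h = self.body(state)
        return self.type_head(h), [head(h) for head in self.param_heads]

policy = ParamActionPolicy()
type_logits, params = policy(torch.randn(1, 58))
k = type_logits.argmax(dim=-1).item()                 # chosen action type
print(k, params[k])                                   # and its parameters
```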
Supervised Policy Update for Deep Reinforcement Learning
We propose a new sample-efficient methodology, called Supervised Policy
Update (SPU), for deep reinforcement learning. Starting with data generated by
the current policy, SPU formulates and solves a constrained optimization
problem in the non-parameterized proximal policy space. Using supervised
regression, it then converts the optimal non-parameterized policy to a
parameterized policy, from which it draws new samples. The methodology is
general in that it applies to both discrete and continuous action spaces, and
can handle a wide variety of proximity constraints for the non-parameterized
optimization problem. We show how the Natural Policy Gradient and Trust Region
Policy Optimization (NPG/TRPO) problems, and the Proximal Policy Optimization
(PPO) problem can be addressed by this methodology. The SPU implementation is
much simpler than TRPO. In terms of sample efficiency, our extensive
experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and
outperforms PPO in Atari video game tasks.
Comment: Accepted as a conference paper at ICLR 2019.
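The second stage of SPU, converting the non-parameterized solution into a parameterized policy, is plain supervised regression. The sketch below illustrates that step with a KL regression loss; the target distribution here is a stand-in, since the paper obtains it by solving a constrained proximal optimization problem.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(256, 8)
with torch.no_grad():
    # Stand-in target: SPU obtains this from a constrained optimization
    # problem solved in the non-parameterized proximal policy space.
    target = F.softmax(policy(states) + 0.1 * torch.randn(256, 4), dim=-1)

for _ in range(50):                                   # supervised regression
    log_probs = F.log_softmax(policy(states), dim=-1)
    loss = F.kl_div(log_probs, target, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```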
Efficient Entropy for Policy Gradient with Multidimensional Action Space
In recent years, deep reinforcement learning has been shown to be adept at
solving sequential decision processes with high-dimensional state spaces such
as in the Atari games. Many reinforcement learning problems, however, involve
high-dimensional discrete action spaces as well as high-dimensional state
spaces. This paper considers entropy bonus, which is used to encourage
exploration in policy gradient. In the case of high-dimensional action spaces,
calculating the entropy and its gradient requires enumerating all the actions
in the action space and running forward and backpropagation for each action,
which may be computationally infeasible. We develop several novel unbiased
estimators for the entropy bonus and its gradient. We apply these estimators to
several models for the parameterized policies, including Independent Sampling,
CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM.
Finally, we test our algorithms on two environments: a multi-hunter
multi-rabbit grid game and a multi-agent multi-arm bandit problem. The results
show that our entropy estimators substantially improve performance with
marginal additional computational cost.
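The basic idea behind sampled entropy estimation: since H(pi) = -E_{a~pi}[log pi(a)], the negative log-probability of a single sampled action is already an unbiased estimate of the entropy, with no enumeration of the joint action space. The sketch below shows this for an illustrative factored (independent-sampling-style) policy; the paper's gradient estimators additionally require a score-function term, omitted here.

```python
import torch

dims, n_choices = 5, 10                        # 10**5 joint actions
logits = torch.randn(dims, n_choices, requires_grad=True)
dist = torch.distributions.Categorical(logits=logits)

a = dist.sample()                              # one joint action (5 sub-actions)
log_prob = dist.log_prob(a).sum()              # log pi(a) of the factored policy
entropy_estimate = -log_prob                   # unbiased one-sample estimate

exact = dist.entropy().sum()                   # tractable only because the
print(entropy_estimate.item(), exact.item())   # policy factorizes per dimension
```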
Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space
In this paper we propose a hybrid architecture of actor-critic algorithms for
reinforcement learning in parameterized action space, which consists of
multiple parallel sub-actor networks to decompose the structured action space
into simpler action spaces along with a critic network to guide the training of
all sub-actor networks. While this paper is mainly focused on parameterized
action space, the proposed architecture, which we call hybrid actor-critic, can
be extended to more general action spaces that have a hierarchical structure.
We present an instance of the hybrid actor-critic architecture based on
proximal policy optimization (PPO), which we refer to as hybrid proximal policy
optimization (H-PPO). Our experiments test H-PPO on a collection of tasks with
parameterized action space, where H-PPO demonstrates superior performance over
previous methods of parameterized action reinforcement learning.
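A schematic of the hybrid actor-critic layout, as a hedged sketch rather than H-PPO's actual code: parallel sub-actors over the discrete and continuous components of the action share a network body, and a single critic value head guides both. Sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class HybridActorCritic(nn.Module):
    def __init__(self, state_dim=16, n_discrete=3, cont_dim=2):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        self.discrete_actor = nn.Linear(64, n_discrete)  # sub-actor: action type
        self.cont_actor = nn.Linear(64, cont_dim)        # sub-actor: parameters
        self.critic = nn.Linear(64, 1)                   # guides both sub-actors

    def forward(self, s):
        h = self.shared(s)
        return self.discrete_actor(h), self.cont_actor(h), self.critic(h)

logits, params, value = HybridActorCritic()(torch.randn(1, 16))
print(logits.shape, params.shape, value.shape)
```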
Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning
As a step towards developing zero-shot task generalization capabilities in
reinforcement learning (RL), we introduce a new RL problem where the agent
should learn to execute sequences of instructions after learning useful skills
that solve subtasks. In this problem, we consider two types of generalizations:
to previously unseen instructions and to longer sequences of instructions. For
generalization over unseen instructions, we propose a new objective which
encourages learning correspondences between similar subtasks by making
analogies. For generalization over sequential instructions, we present a
hierarchical architecture where a meta controller learns to use the acquired
skills for executing the instructions. To deal with delayed reward, we propose
a new neural architecture in the meta controller that learns when to update the
subtask, which makes learning more efficient. Experimental results on a
stochastic 3D domain show that the proposed ideas are crucial for
generalization to longer instructions as well as unseen instructions.
Comment: ICML 2017.
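As a rough illustration of the hierarchy (all components below are stubs, not the paper's learned modules): the meta controller selects a subtask and decides when to switch to a new one, while a low-level policy acts given the current subtask.

```python
import random

def meta_controller(observation, current_subtask):
    # Stub: switch subtasks occasionally; the paper *learns* when to update.
    if current_subtask is None or random.random() < 0.1:
        return random.choice(["visit A", "pick up B", "transform C"])
    return current_subtask

def subtask_policy(observation, subtask):
    return random.choice(["up", "down", "left", "right"])  # stub skill

subtask = None
for t in range(20):
    obs = t                               # placeholder observation
    subtask = meta_controller(obs, subtask)
    action = subtask_policy(obs, subtask)
    print(t, subtask, action)
```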
Guide Actor-Critic for Continuous Control
Actor-critic methods solve reinforcement learning problems by updating a
parameterized policy known as an actor in a direction that increases an
estimate of the expected return known as a critic. However, existing
actor-critic methods only use values or gradients of the critic to update the
policy parameter. In this paper, we propose a novel actor-critic method called
the guide actor-critic (GAC). GAC firstly learns a guide actor that locally
maximizes the critic and then it updates the policy parameter based on the
guide actor by supervised learning. Our main theoretical contributions are
twofold. First, we show that GAC updates the guide actor by performing
second-order optimization in the action space where the curvature matrix is
based on the Hessians of the critic. Second, we show that the deterministic
policy gradient method is a special case of GAC when the Hessians are ignored.
Through experiments, we show that our method is a promising reinforcement
learning method for continuous control.
Comment: ICLR 2018.
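The core update is easy to see on a toy quadratic critic: locally maximizing a second-order expansion of Q(s, a) around the current action gives the Newton-style step a* = a - H^{-1} g, with g and H the critic's gradient and Hessian in the action. The critic below is a made-up quadratic for illustration only.

```python
import numpy as np

A = np.array([[-2.0, 0.3], [0.3, -1.0]])     # negative-definite Hessian
b = np.array([1.0, -0.5])

def critic(a):                               # stand-in Q(s, a) for a fixed s
    return a @ A @ a / 2 + b @ a

a = np.zeros(2)                              # current action
g, H = A @ a + b, A                          # critic gradient and Hessian
guide_action = a - np.linalg.solve(H, g)     # Newton step toward higher Q
print(critic(a), critic(guide_action))       # the guide action scores higher

# Dropping H (a plain gradient step on the critic) recovers a deterministic
# policy-gradient-style update, the special case mentioned in the abstract.
```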
DSAC: Distributional Soft Actor Critic for Risk-Sensitive Reinforcement Learning
In this paper, we present a new reinforcement learning (RL) algorithm called
Distributional Soft Actor Critic (DSAC), which exploits the distributional
information of accumulated rewards to achieve better performance. Seamlessly
integrating SAC (which uses entropy to encourage exploration) with a principled
distributional view of the underlying objective, DSAC takes into consideration
the randomness in both action and rewards, and beats the state-of-the-art
baselines in several continuous control benchmarks. Moreover, with the
distributional information of rewards, we propose a unified framework for
risk-sensitive learning, one that goes beyond maximizing only expected
accumulated rewards. Under this framework we discuss three specific
risk-related metrics: percentile, mean-variance and distorted expectation. Our
extensive experiments demonstrate that with distribution modeling in RL, the
agent performs better for both risk-averse and risk-seeking control tasks.
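Given quantile estimates of the return distribution, the three risk metrics reduce to simple statistics over the quantiles. The sketch below uses synthetic quantile values and one common distortion function (CPW) as an assumed example; it is not DSAC's implementation.

```python
import numpy as np

taus = (np.arange(32) + 0.5) / 32                   # quantile midpoints
quantiles = np.sort(np.random.randn(32) * 5 + 10)   # synthetic return quantiles

percentile_10 = np.interp(0.10, taus, quantiles)    # pessimistic percentile
mean_variance = quantiles.mean() - 0.5 * quantiles.var()  # lambda = 0.5 (assumed)

def cpw(tau, eta=0.71):                             # one common distortion
    return tau**eta / (tau**eta + (1 - tau)**eta) ** (1 / eta)

weights = np.diff(cpw(np.linspace(0.0, 1.0, 33)))   # distorted quantile weights
distorted_expectation = (weights * quantiles).sum()
print(percentile_10, mean_variance, distorted_expectation)
```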
Parametrized Deep Q-Networks Learning: Reinforcement Learning with Discrete-Continuous Hybrid Action Space
Most existing deep reinforcement learning (DRL) frameworks consider either
discrete action space or continuous action space solely. Motivated by
applications in computer games, we consider the scenario with
discrete-continuous hybrid action space. To handle hybrid action space,
previous works either approximate the hybrid space by discretization, or relax
it into a continuous set. In this paper, we propose a parametrized deep
Q-network (P-DQN) framework for the hybrid action space without approximation
or relaxation. Our algorithm combines the spirits of both DQN (dealing with
discrete action space) and DDPG (dealing with continuous action space) by
seamlessly integrating them. Empirical results on a simulation example, scoring
a goal in simulated RoboCup soccer and the solo mode in game King of Glory
(KOG) validate the efficiency and effectiveness of our method.
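A minimal sketch of the P-DQN-style action selection implied by the abstract: a DDPG-like network proposes continuous parameters x_k for every discrete action k, a DQN-like network scores Q(s, k, x), and the agent picks the highest-scoring k. Layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

state_dim, n_actions, param_dim = 10, 3, 2
param_net = nn.Linear(state_dim, n_actions * param_dim)          # x_k(s) for all k
q_net = nn.Linear(state_dim + n_actions * param_dim, n_actions)  # Q(s, k, x)

s = torch.randn(1, state_dim)
x = param_net(s)                                 # continuous params for every k
q = q_net(torch.cat([s, x], dim=-1))             # one Q-value per discrete action
k = q.argmax(dim=-1).item()                      # best discrete action
x_k = x.view(n_actions, param_dim)[k]            # parameters of the chosen action
print(k, x_k)
```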
Randomized Value Functions via Multiplicative Normalizing Flows
Randomized value functions offer a promising approach towards the challenge
of efficient exploration in complex environments with high dimensional state
and action spaces. Unlike traditional point estimate methods, randomized value
functions maintain a posterior distribution over action-space values. This
prevents the agent's behavior policy from prematurely exploiting early
estimates and falling into local optima. In this work, we leverage recent
advances in variational Bayesian neural networks and combine these with
traditional Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG)
to achieve randomized value functions for high-dimensional domains. In
particular, we augment DQN and DDPG with multiplicative normalizing flows in
order to track a rich approximate posterior distribution over the parameters of
the value function. This allows the agent to perform approximate Thompson
sampling in a computationally efficient manner via stochastic gradient methods.
We demonstrate the benefits of our approach through an empirical comparison in
high-dimensional environments.
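Conceptually, approximate Thompson sampling with a randomized value function amounts to drawing one set of value-function parameters per episode and acting greedily under the draw. The Gaussian perturbation in this sketch is a deliberate simplification standing in for the multiplicative-normalizing-flow posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
q_mean = rng.normal(size=(n_states, n_actions))  # posterior mean of Q
q_std = 0.3 * np.ones((n_states, n_actions))     # posterior scale (stand-in)

for episode in range(3):
    q_sample = rng.normal(q_mean, q_std)         # one draw from the posterior
    greedy = q_sample.argmax(axis=1)             # act greedily all episode long
    print(episode, greedy)
```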
QUOTA: The Quantile Option Architecture for Reinforcement Learning
In this paper, we propose the Quantile Option Architecture (QUOTA) for
exploration based on recent advances in distributional reinforcement learning
(RL). In QUOTA, decision making is based on quantiles of a value distribution,
not only the mean. QUOTA provides a new dimension for exploration by making
use of both optimism and pessimism of a value distribution. We demonstrate the
performance advantage of QUOTA in both challenging video games and physical
robot simulators.
Comment: AAAI 2019.
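The idea in miniature: each option corresponds to acting greedily with respect to one quantile of the return distribution, so low quantiles yield pessimistic (safe) behavior and high quantiles optimistic (exploratory) behavior. The quantile table below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, n_quantiles = 4, 8
# theta[a, j]: j-th quantile of the return distribution for action a
theta = np.sort(rng.normal(size=(n_actions, n_quantiles)), axis=1)

mean_greedy = theta.mean(axis=1).argmax()        # standard distributional choice
pessimistic = theta[:, 0].argmax()               # option: lowest quantile
optimistic = theta[:, -1].argmax()               # option: highest quantile
print(mean_greedy, pessimistic, optimistic)
```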