Discrete Sequential Prediction of Continuous Actions for Deep RL
It has long been assumed that high dimensional continuous control problems
cannot be solved effectively by discretizing individual dimensions of the
action space due to the exponentially large number of bins over which policies
would have to be learned. In this paper, we draw inspiration from the recent
success of sequence-to-sequence models for structured prediction problems to
develop policies over discretized spaces. Central to this method is the
realization that complex functions over high dimensional spaces can be modeled
by neural networks that predict one dimension at a time. Specifically, we show
how Q-values and policies over continuous spaces can be modeled using a next
step prediction model over discretized dimensions. With this parameterization,
it is possible both to leverage the compositional structure of action spaces
during learning and to (approximately) compute maxima over action spaces.
On a simple example task we demonstrate empirically that our method can perform
global search, which effectively gets around the local optimization issues that
plague DDPG. We apply the technique to off-policy (Q-learning) methods and show
that our method can achieve state-of-the-art performance for off-policy methods
on several continuous control tasks.
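As an illustration of the sequential-argmax idea described above (a sketch only, not the paper's implementation; per_dim_q and bin_values are hypothetical stand-ins for a learned per-dimension Q model and the bin-to-value mapping):

    import numpy as np

    def greedy_action(per_dim_q, state, n_dims, bin_values):
        """Approximately maximize Q over a discretized action space by choosing
        one dimension at a time (sequential/autoregressive argmax).

        per_dim_q(state, partial_action, dim) -> array of per-bin Q estimates
        bin_values: array mapping bin index -> continuous action value
        """
        action = []
        for dim in range(n_dims):
            q = per_dim_q(state, action, dim)    # Q over bins for this dimension
            best_bin = int(np.argmax(q))         # greedy choice for dimension `dim`
            action.append(bin_values[best_bin])  # later dims condition on this choice
        return np.array(action)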
Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control
Policy gradient methods in reinforcement learning have become increasingly
prevalent for state-of-the-art performance in continuous control tasks. Novel
methods typically benchmark against a few key algorithms such as deep
deterministic policy gradients and trust region policy optimization. As such,
it is important to present and use consistent baseline experiments. However,
this can be difficult due to general variance in the algorithms,
hyper-parameter tuning, and environment stochasticity. We investigate and
discuss: the significance of hyper-parameters in policy gradients for
continuous control, general variance in the algorithms, and reproducibility of
reported results. We provide guidelines on reporting novel results as
comparisons against baseline methods such that future researchers can make
informed decisions when investigating novel methods.
Comment: Accepted to Reproducibility in Machine Learning Workshop, ICML'17.
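As a concrete illustration of the kind of reporting these guidelines point toward (an illustrative sketch, not code from the paper), results can be aggregated over several random seeds rather than quoted from a single run:

    import numpy as np

    def summarize_runs(returns_per_seed):
        """Aggregate final returns from independent runs (one per random seed).
        Reporting the mean and standard deviation across seeds, rather than a
        single run, is one simple way to account for algorithmic and
        environment variance."""
        r = np.asarray(returns_per_seed, dtype=float)
        return {"n_seeds": len(r), "mean": r.mean(), "std": r.std(ddof=1)}

    # e.g. summarize_runs([1203.4, 987.1, 1105.9, 1310.2, 1021.7])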
A Robotic Auto-Focus System based on Deep Reinforcement Learning
Considering its advantages in dealing with high-dimensional visual input and
learning control policies in discrete domains, the Deep Q Network (DQN) could
become an alternative to traditional auto-focus methods in the future. In this
paper, based on Deep Reinforcement Learning, we propose an end-to-end approach
that can learn auto-focus policies from visual input and finish at a clear spot
automatically. We demonstrate that our method - discretizing the action space
with coarse-to-fine steps and applying DQN - is not only a solution to auto-focus
but also a general approach to vision-based control problems. Separate
phases of training in virtual and real environments are applied to obtain an
effective model. Virtual experiments, which are carried out after the virtual
training phase, indicate that our method can achieve 100% accuracy on a
certain view across different focus ranges. Further training on real robots could
eliminate the deviation between the simulator and the real scenario, leading to
reliable performance in real applications.
Comment: To appear at ICARCV 2018.
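A minimal sketch of the discretized, coarse-to-fine action space and DQN-style action selection the abstract describes (the step sizes and the q_network interface are illustrative assumptions, not the paper's values):

    import numpy as np

    # Illustrative coarse-to-fine discretization of the focus action space
    # (step sizes are made up for this sketch).
    FOCUS_ACTIONS = [-100, -10, -1, 0, +1, +10, +100]   # focus motor steps

    def select_action(q_network, image_features, epsilon=0.1):
        """Epsilon-greedy selection over the discrete focus steps, as in a
        standard DQN; q_network is assumed to map features -> Q-values."""
        if np.random.rand() < epsilon:
            return np.random.randint(len(FOCUS_ACTIONS))
        q_values = q_network(image_features)             # one Q-value per focus step
        return int(np.argmax(q_values))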
Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning
Deep reinforcement learning algorithms can learn complex behavioral skills,
but real-world application of these methods requires a large amount of
experience to be collected by the agent. In practical settings, such as
robotics, this involves repeatedly attempting a task and resetting the
environment between attempts. However, not all tasks are easily or automatically
reversible. In practice, this learning process requires extensive human
intervention. In this work, we propose an autonomous method for safe and
efficient reinforcement learning that simultaneously learns a forward and reset
policy, with the reset policy resetting the environment for a subsequent
attempt. By learning a value function for the reset policy, we can
automatically determine when the forward policy is about to enter a
non-reversible state, providing for uncertainty-aware safety aborts. Our
experiments illustrate that proper use of the reset policy can greatly reduce
the number of manual resets required to learn a task, can reduce the number of
unsafe actions that lead to non-reversible states, and can automatically induce
a curriculum.
Comment: Videos of our experiments are available at:
https://sites.google.com/site/mlleavenotrace
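A minimal sketch of the abort rule the abstract describes, assuming a learned reset value function and an abort threshold (reset_value_fn and v_min are illustrative names, not the paper's):

    def safe_step(forward_policy, reset_policy, reset_value_fn, state, v_min):
        """Uncertainty-aware abort sketch: before committing to the forward
        policy's action, check the reset policy's value estimate at the current
        state. If it falls below the threshold v_min, the state is judged hard
        to reverse and the reset policy takes over instead."""
        if reset_value_fn(state) < v_min:
            return reset_policy(state), "reset"      # abort: hand control to the reset policy
        return forward_policy(state), "forward"      # safe to continue the attempt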
Trust-PCL: An Off-Policy Trust Region Method for Continuous Control
Trust region methods, such as TRPO, are often used to stabilize policy
optimization algorithms in reinforcement learning (RL). While current trust
region strategies are effective for continuous control, they typically require
a prohibitively large amount of on-policy interaction with the environment. To
address this problem, we propose an off-policy trust region method, Trust-PCL.
The algorithm follows from the observation that the optimal policy and state
values of a maximum reward objective with a relative-entropy regularizer
satisfy a set of multi-step pathwise consistencies along any path. Thus,
Trust-PCL is able to maintain optimization stability while exploiting
off-policy data to improve sample efficiency. When evaluated on a number of
continuous control tasks, Trust-PCL improves the solution quality and sample
efficiency of TRPO.
Comment: ICLR 2018.
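For intuition, a sketch of the kind of multi-step pathwise consistency the abstract refers to, written here in the entropy-regularized (PCL-style) form; Trust-PCL's relative-entropy variant adds a penalty against a prior policy, which this sketch omits:

    import numpy as np

    def path_consistency_error(values, rewards, log_pis, gamma, tau):
        """Consistency error over one sub-trajectory s_0, a_0, ..., s_d: for
        the entropy-regularized optimal policy, the discounted sum of
        (reward - tau * log pi) should equal V(s_0) - gamma**d * V(s_d), so
        this error should be zero.

        values:  length d+1 array, V(s_0) ... V(s_d)
        rewards: length d array
        log_pis: length d array, log pi(a_i | s_i)
        """
        rewards = np.asarray(rewards, dtype=float)
        log_pis = np.asarray(log_pis, dtype=float)
        d = len(rewards)
        discounts = gamma ** np.arange(d)
        soft_return = np.sum(discounts * (rewards - tau * log_pis))
        return -values[0] + gamma ** d * values[-1] + soft_return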
Temporal Difference Models: Model-Free Deep RL for Model-Based Control
Model-free reinforcement learning (RL) is a powerful, general tool for
learning complex behaviors. However, its sample complexity is often
impractically high for solving challenging real-world problems, even with
off-policy algorithms such as Q-learning. A limiting factor in classic
model-free RL is that the learning signal consists only of scalar rewards,
ignoring much of the rich information contained in state transition tuples.
Model-based RL uses this information, by training a predictive model, but often
does not achieve the same asymptotic performance as model-free RL due to model
bias. We introduce temporal difference models (TDMs), a family of
goal-conditioned value functions that can be trained with model-free learning
and used for model-based control. TDMs combine the benefits of model-free and
model-based RL: they leverage the rich information in state transitions to
learn very efficiently, while still attaining asymptotic performance that
exceeds that of direct model-based RL methods. Our experimental results show
that, on a range of continuous control tasks, TDMs provide a substantial
improvement in efficiency compared to state-of-the-art model-based and
model-free methods.
Comment: Appeared in ICLR 2018; typos corrected.
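A minimal sketch of a TDM-style, horizon-conditioned Bellman target, assuming a goal-distance reward; q_max_fn and distance_fn are illustrative stand-ins, not the paper's exact formulation:

    def tdm_target(next_state, goal, horizon, q_max_fn, distance_fn):
        """Target for a goal- and horizon-conditioned value Q(s, a, g, tau):
        when the remaining horizon reaches zero the target is the negative
        distance of the reached state to the goal; otherwise it bootstraps
        with the horizon decremented by one."""
        if horizon == 0:
            return -distance_fn(next_state, goal)        # terminal: how close did we get?
        return q_max_fn(next_state, goal, horizon - 1)   # bootstrap with tau - 1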
Hindsight Experience Replay
Dealing with sparse rewards is one of the biggest challenges in Reinforcement
Learning (RL). We present a novel technique called Hindsight Experience Replay,
which allows sample-efficient learning from rewards that are sparse and binary
and therefore avoids the need for complicated reward engineering. It can be
combined with an arbitrary off-policy RL algorithm and may be seen as a form of
implicit curriculum.
We demonstrate our approach on the task of manipulating objects with a
robotic arm. In particular, we run experiments on three different tasks:
pushing, sliding, and pick-and-place, in each case using only binary rewards
indicating whether or not the task is completed. Our ablation studies show that
Hindsight Experience Replay is a crucial ingredient which makes training
possible in these challenging environments. We show that our policies trained
on a physics simulation can be deployed on a physical robot and successfully
complete the task.
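A minimal sketch of hindsight relabeling using the simple "final" strategy (replay each transition with the goal actually achieved at the end of the episode, in addition to the original goal); reward_fn and achieved_goal_fn are illustrative stand-ins:

    def hindsight_relabel(episode, reward_fn, achieved_goal_fn):
        """episode: list of (state, action, next_state, goal) tuples.
        Returns extra transitions whose goal is replaced by the goal achieved
        at the end of the episode, so sparse binary rewards still provide a
        learning signal."""
        final_goal = achieved_goal_fn(episode[-1][2])    # goal achieved at episode end
        relabeled = []
        for state, action, next_state, goal in episode:
            relabeled.append((state, action, next_state, final_goal,
                              reward_fn(next_state, final_goal)))
        return relabeled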
Autoregressive Policies for Continuous Control Deep Reinforcement Learning
Reinforcement learning algorithms rely on exploration to discover new
behaviors, which is typically achieved by following a stochastic policy. In
continuous control tasks, policies with a Gaussian distribution have been
widely adopted. Gaussian exploration, however, does not produce the smooth
trajectories that generally correspond to safe and rewarding behaviors in
practical tasks. In addition, Gaussian policies do not explore the environment
effectively and become increasingly inefficient as the action rate increases.
This contributes to the low sample efficiency often observed in
learning continuous control tasks. We introduce a family of stationary
autoregressive (AR) stochastic processes to facilitate exploration in
continuous control domains. We show that the proposed processes possess two
desirable features: successive observations are temporally coherent, with a
continuously adjustable degree of coherence, and the stationary distribution of
the process is standard normal. We derive an autoregressive policy (ARP) that
implements such processes while maintaining the standard agent-environment
interface. We show how ARPs can easily be used with existing off-the-shelf
learning algorithms. Empirically, we demonstrate that using ARPs results in improved
exploration and sample efficiency in both simulated and real world domains,
and, furthermore, provides smooth exploration trajectories that enable safe
operation of robotic hardware.
Comment: Submitted to the 28th International Joint Conference on Artificial
Intelligence (IJCAI 2019). Video: https://youtu.be/NCpyXBNqNmw Code:
https://github.com/dkorenkevych/ar
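For intuition, the simplest member of such an AR family is an AR(1) process whose stationary distribution is standard normal for any |alpha| < 1; the sketch below is illustrative and does not reproduce the paper's more general construction:

    import numpy as np

    def ar1_noise(n_steps, alpha, rng=None):
        """Stationary AR(1) exploration noise: with
        x_{t+1} = alpha * x_t + sqrt(1 - alpha**2) * eps_t,  eps_t ~ N(0, 1),
        the stationary distribution is standard normal for any |alpha| < 1,
        while alpha controls how temporally smooth the samples are."""
        rng = np.random.default_rng() if rng is None else rng
        x = rng.standard_normal()                 # start from the stationary distribution
        out = np.empty(n_steps)
        for t in range(n_steps):
            x = alpha * x + np.sqrt(1.0 - alpha ** 2) * rng.standard_normal()
            out[t] = x
        return out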
Efficient Entropy for Policy Gradient with Multidimensional Action Space
In recent years, deep reinforcement learning has been shown to be adept at
solving sequential decision processes with high-dimensional state spaces such
as in the Atari games. Many reinforcement learning problems, however, involve
high-dimensional discrete action spaces as well as high-dimensional state
spaces. This paper considers the entropy bonus, which is used to encourage
exploration in policy gradient methods. In the case of high-dimensional action spaces,
calculating the entropy and its gradient requires enumerating all the actions
in the action space and running forward and backpropagation for each action,
which may be computationally infeasible. We develop several novel unbiased
estimators for the entropy bonus and its gradient. We apply these estimators to
several models for the parameterized policies, including Independent Sampling,
CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM.
Finally, we test our algorithms on two environments: a multi-hunter
multi-rabbit grid game and a multi-agent multi-arm bandit problem. The results
show that our entropy estimators substantially improve performance with
marginal additional computational cost.
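For intuition, the simplest sampling-based estimator of this kind uses H(pi) = -E[log pi(a)] under the policy, so averaging -log pi(a) over sampled actions is unbiased without enumerating the action space; the sketch below is illustrative and omits the paper's gradient estimators:

    import numpy as np

    def sampled_entropy_bonus(log_prob_fn, sample_fn, n_samples=1):
        """Unbiased Monte Carlo estimate of the policy entropy for a large
        multidimensional discrete action space.

        sample_fn()      -> an action sampled from the current policy
        log_prob_fn(a)   -> log pi(a), e.g. a sum of per-dimension log-probs
        """
        samples = [sample_fn() for _ in range(n_samples)]
        return -np.mean([log_prob_fn(a) for a in samples])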
Deep Reinforcement Learning Based Volt-VAR Optimization in Smart Distribution Systems
This paper develops a model-free volt-VAR optimization (VVO) algorithm via
multi-agent deep reinforcement learning (MADRL) in unbalanced distribution
systems. The method is novel in that it casts the VVO problem in unbalanced
distribution networks into a deep Q-network (DQN) framework, which avoids
directly solving an explicit optimization model under the time-varying
operating conditions of the systems. We consider the statuses/ratios of switchable
capacitors, voltage regulators, and smart inverters installed at distributed
generators as the action variables of the DQN agents. A carefully designed
reward function guides these agents as they interact with the distribution
system, simultaneously encouraging voltage regulation and power loss reduction.
The forward-backward sweep method for radial three-phase distribution systems
provides the DQN environment with accurate power flow results within a few
iterations. Finally, the proposed multi-objective MADRL
method achieves the dual goals of VVO. We test this algorithm on the
unbalanced IEEE 13-bus and 123-bus systems. Numerical simulations validate the
excellent performance of this method in voltage regulation and power loss
reduction.
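As an illustration of the multi-objective reward the abstract describes (the weights and functional form below are assumptions for this sketch, not the paper's reward):

    def vvo_reward(bus_voltages, power_loss, w_v=1.0, w_l=1.0, v_ref=1.0):
        """Illustrative reward for a volt-VAR agent: penalize squared voltage
        deviation from a per-unit reference and real power loss, combined with
        assumed weights w_v and w_l."""
        voltage_penalty = sum((v - v_ref) ** 2 for v in bus_voltages)
        return -(w_v * voltage_penalty + w_l * power_loss)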