Entropy-Augmented Entropy-Regularized Reinforcement Learning and a Continuous Path from Policy Gradient to Q-Learning
Augmenting the reward with entropy is known to soften the greedy argmax policy into a softmax policy. This entropy augmentation is reformulated, motivating the introduction of an additional entropy term, in the form of a KL-divergence, into the objective function to regularize the optimization process. The result is a policy that monotonically improves while interpolating from the current policy to the softmax greedy policy. This policy is used to build a continuously parameterized algorithm that optimizes the policy and Q-function simultaneously and whose extreme limits correspond to policy gradient and Q-learning, respectively. Experiments show that there can be a performance gain from using an intermediate algorithm.
Comment: 16 pages, 1 figure. Refined a few expressions and proofs; added source code.
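As a hedged illustration (the notation here is ours, not necessarily the paper's), the standard soft-policy result behind such an interpolation: maximizing an entropy- and KL-regularized objective over the policy,

    \pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{a \sim \pi}\big[Q(s,a)\big] - \alpha\,\mathrm{KL}\big(\pi \,\|\, \pi_{k}\big) + \tau\,\mathcal{H}(\pi),

has the closed-form solution

    \pi_{k+1}(a \mid s) \;\propto\; \pi_{k}(a \mid s)^{\frac{\alpha}{\alpha+\tau}} \exp\!\Big(\tfrac{Q(s,a)}{\alpha+\tau}\Big),

which interpolates between the softmax (Boltzmann) policy over Q when the KL weight \alpha is small and the current policy \pi_{k} when it dominates.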
Divide-and-Conquer Reinforcement Learning
Standard model-free deep reinforcement learning (RL) algorithms sample a new
initial state for each trial, allowing them to optimize policies that can
perform well even in highly stochastic environments. However, problems that
exhibit considerable initial state variation typically produce high-variance
gradient estimates for model-free RL, making direct policy or value function
optimization challenging. In this paper, we develop a novel algorithm that
instead partitions the initial state space into "slices", and optimizes an
ensemble of policies, each on a different slice. The ensemble is gradually
unified into a single policy that can succeed on the whole state space. This
approach, which we term divide-and-conquer RL, is able to solve complex tasks
where conventional deep RL methods are ineffective. Our results show that
divide-and-conquer RL greatly outperforms conventional policy gradient methods
on challenging grasping, manipulation, and locomotion tasks, and exceeds the
performance of a variety of prior methods. Videos of policies learned by our
algorithm can be viewed at http://bit.ly/dnc-rl
Comment: Presented at ICLR 2018. Videos and supporting materials are located at http://bit.ly/dnc-rl
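A rough, self-contained toy sketch of the divide-and-conquer recipe (the task, the two-slice split, and all names are illustrative assumptions, not the authors' code): local Gaussian policies are trained with REINFORCE on their own initial-state slices, and the ensemble is then distilled into a single central policy by supervised regression.

    import numpy as np

    rng = np.random.default_rng(0)

    def reward(x0, a):
        # Toy one-step task: the optimal action is sign(x0), so a single policy
        # trained across the whole initial-state range sees conflicting gradients.
        return -(a - np.sign(x0)) ** 2

    # Divide: partition the initial state space into two "slices".
    slices = [(-2.0, 0.0), (0.0, 2.0)]
    slice_means = np.zeros(len(slices))   # each local policy: Gaussian with a constant mean
    sigma = 0.3

    # Train each local policy on its own slice with vanilla REINFORCE.
    for _ in range(300):
        for i, (lo, hi) in enumerate(slices):
            x0 = rng.uniform(lo, hi, size=32)
            a = slice_means[i] + sigma * rng.standard_normal(32)
            r = reward(x0, a)
            grad = np.mean((r - r.mean()) * (a - slice_means[i]) / sigma ** 2)
            slice_means[i] += 0.05 * grad   # gradient ascent on expected reward

    # Conquer: distill the ensemble into one central policy a = w^T phi(x0)
    # by supervised regression on state/action pairs gathered from the slices.
    states, actions = [], []
    for i, (lo, hi) in enumerate(slices):
        x0 = rng.uniform(lo, hi, size=500)
        states.append(x0)
        actions.append(np.full_like(x0, slice_means[i]))
    x_all = np.concatenate(states)
    a_all = np.concatenate(actions)
    phi = np.stack([np.ones_like(x_all), np.tanh(5.0 * x_all)], axis=1)
    w, *_ = np.linalg.lstsq(phi, a_all, rcond=None)

    x_test = rng.uniform(-2.0, 2.0, size=1000)
    a_test = np.stack([np.ones_like(x_test), np.tanh(5.0 * x_test)], axis=1) @ w
    print("mean reward of the distilled central policy:", reward(x_test, a_test).mean())

The point of the toy is only structural: each local policy faces a low-variance subproblem, and the distillation step is what produces one policy that covers the whole initial-state space.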
Learning Deep Neural Network Policies with Continuous Memory States
Policy learning for partially observed control tasks requires policies that
can remember salient information from past observations. In this paper, we
present a method for learning policies with internal memory for
high-dimensional, continuous systems, such as robotic manipulators. Our
approach consists of augmenting the state and action space of the system with
continuous-valued memory states that the policy can read from and write to.
Learning general-purpose policies with this type of memory representation
directly is difficult, because the policy must automatically figure out the
most salient information to memorize at each time step. We show that, by
decomposing this policy search problem into a trajectory optimization phase and
a supervised learning phase through a method called guided policy search, we
can acquire policies with effective memorization and recall strategies.
Intuitively, the trajectory optimization phase chooses the values of the memory
states that will make it easier for the policy to produce the right action in
future states, while the supervised learning phase encourages the policy to use
memorization actions to produce those memory states. We evaluate our method on
tasks involving continuous control in manipulation and navigation settings, and
show that our method can learn complex policies that successfully complete a
range of tasks that require memory.
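A minimal sketch of the state/action augmentation described above (the wrapper and the dummy task are illustrative assumptions, not the authors' implementation):

    import numpy as np

    class MemoryAugmentedEnv:
        """Extends observations and actions with continuous memory states
        that the policy can read from and write to."""

        def __init__(self, env, memory_dim):
            self.env = env
            self.memory_dim = memory_dim
            self.memory = np.zeros(memory_dim)

        def reset(self):
            self.memory = np.zeros(self.memory_dim)
            obs = self.env.reset()
            return np.concatenate([obs, self.memory])      # policy reads the memory

        def step(self, action):
            # The last memory_dim entries of the action are "memorization actions":
            # they overwrite the memory instead of acting on the physical system.
            physical_action = action[:-self.memory_dim]
            self.memory = action[-self.memory_dim:]         # policy writes the memory
            obs, r, done = self.env.step(physical_action)
            return np.concatenate([obs, self.memory]), r, done

    class DummyEnv:
        """Trivial stand-in task with a 2-D observation and a 1-D action."""
        def reset(self):
            return np.zeros(2)
        def step(self, action):
            return np.zeros(2), float(-np.sum(action ** 2)), False

    env = MemoryAugmentedEnv(DummyEnv(), memory_dim=3)
    obs = env.reset()                                        # 2 physical dims + 3 memory dims
    obs, r, done = env.step(np.array([0.1, 0.5, -0.2, 0.3])) # 1 physical + 3 memory action dims

With this augmentation the learning problem stays a standard continuous-control problem; the division of labor between trajectory optimization (choosing useful memory values) and supervised learning (training the policy to write them) happens inside guided policy search.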
End-to-End Training of Deep Visuomotor Policies
Policy search methods can allow robots to learn control policies for a wide
range of tasks, but practical applications of policy search often require
hand-engineered components for perception, state estimation, and low-level
control. In this paper, we aim to answer the following question: does training
the perception and control systems jointly end-to-end provide better
performance than training each component separately? To this end, we develop a
method that can be used to learn policies that map raw image observations
directly to torques at the robot's motors. The policies are represented by deep
convolutional neural networks (CNNs) with 92,000 parameters, and are trained
using a partially observed guided policy search method, which transforms policy
search into supervised learning, with supervision provided by a simple
trajectory-centric reinforcement learning method. We evaluate our method on a
range of real-world manipulation tasks that require close coordination between
vision and control, such as screwing a cap onto a bottle, and present simulated
comparisons to a range of prior policy search methods.
Comment: updating with revisions for the JMLR final version
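An illustrative sketch of a convolutional policy of this kind, mapping a raw image plus the robot's joint state directly to motor torques (the architecture and layer sizes are our assumptions, not the paper's 92,000-parameter network, which notably uses a spatial feature representation rather than average pooling):

    import torch
    import torch.nn as nn

    class VisuomotorPolicy(nn.Module):
        """Maps a raw camera image and the robot's joint state to joint torques."""
        def __init__(self, num_joints=7):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 16, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4),                   # coarse spatial summary of features
            )
            self.head = nn.Sequential(
                nn.Linear(16 * 4 * 4 + 2 * num_joints, 64), nn.ReLU(),
                nn.Linear(64, num_joints),                 # one torque per joint
            )

        def forward(self, image, joint_state):
            features = self.conv(image).flatten(start_dim=1)
            return self.head(torch.cat([features, joint_state], dim=1))

    policy = VisuomotorPolicy()
    torques = policy(torch.zeros(1, 3, 64, 64), torch.zeros(1, 14))  # one RGB image + joint angles/velocities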
Guided Policy Search as Approximate Mirror Descent
Guided policy search algorithms can be used to optimize complex nonlinear
policies, such as deep neural networks, without directly computing policy
gradients in the high-dimensional parameter space. Instead, these methods use
supervised learning to train the policy to mimic a "teacher" algorithm, such as
a trajectory optimizer or a trajectory-centric reinforcement learning method.
Guided policy search methods provide asymptotic local convergence guarantees by
construction, but it is not clear how much the policy improves within a small,
finite number of iterations. We show that guided policy search algorithms can
be interpreted as an approximate variant of mirror descent, where the
projection onto the constraint manifold is not exact. We derive a new guided
policy search algorithm that is simpler and provides appealing improvement and
convergence guarantees in simplified convex and linear settings, and show that
in the more general nonlinear setting, the error in the projection step can be
bounded. We provide empirical results on several simulated robotic navigation
and manipulation tasks that show that our method is stable and achieves similar
or better performance when compared to prior guided policy search methods, with
a simpler formulation and fewer hyperparameters.
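A hedged sketch of the mirror-descent view (our notation): each iteration alternates a KL-constrained teacher update with a supervised projection back onto the policy class,

    p_{k+1} = \arg\min_{p}\; \mathbb{E}_{p}\Big[\sum_t c(s_t, a_t)\Big] \;\;\text{s.t.}\;\; \mathrm{KL}\big(p \,\|\, \pi_{\theta_k}\big) \le \epsilon,
    \qquad
    \theta_{k+1} = \arg\min_{\theta}\; \mathrm{KL}\big(p_{k+1} \,\|\, \pi_{\theta}\big),

where the first step plays the role of the mirror-descent step under a KL geometry and the second is the (inexact) projection onto the set of policies representable by \pi_{\theta}.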
A Primal-Dual Approach to Markovian Network Optimization
We formulate a novel class of stochastic network optimization problems, termed \emph{Markovian network optimization}, as a primal-dual pair whose solutions provide a dynamic, stochastic extension of the Wardrop equilibrium principle. We further generalize such network optimization to accommodate variable amounts of flow and multi-commodity flows with heterogeneous planning time windows, features that are well motivated in applications arising from game-theoretic settings. Finally, in order to solve the primal-dual pair, we design dynamic-programming-based numerical algorithms that outperform state-of-the-art commercial software (Gurobi) in extensive numerical experiments.
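For readers unfamiliar with the structure being invoked, a generic primal-dual pair of linear programs (not the paper's specific Markovian formulation) has the form

    \min_{x \ge 0} \; c^{\top} x \;\;\text{s.t.}\;\; A x \ge b
    \qquad\longleftrightarrow\qquad
    \max_{y \ge 0} \; b^{\top} y \;\;\text{s.t.}\;\; A^{\top} y \le c,

and it is the complementary-slackness coupling between optimal primal flows and optimal dual prices that yields equilibrium-style optimality conditions of the Wardrop kind.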
Boosting the Actor with Dual Critic
This paper proposes a new actor-critic-style algorithm called Dual Actor-Critic, or Dual-AC. It is derived in a principled way from the Lagrangian dual form of the Bellman optimality equation, which can be viewed as a two-player game between the actor and a critic-like function named the dual critic. Compared to its actor-critic relatives, Dual-AC has the desirable property that the actor and dual critic are updated cooperatively to optimize the same objective function, providing a more transparent way of learning a critic that is directly related to the objective function of the actor. We then provide a concrete algorithm that can effectively solve the minimax optimization problem, using multi-step bootstrapping, path regularization, and a stochastic dual ascent algorithm. We demonstrate that the proposed algorithm achieves state-of-the-art performance across several benchmarks.
Comment: 21 pages, 9 figures
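As a hedged sketch of the starting point (a standard linear-programming form of the Bellman optimality equation, in our notation):

    \min_{V} \; (1-\gamma)\, \mathbb{E}_{s \sim \mu}\big[V(s)\big]
    \quad\text{s.t.}\quad V(s) \ge r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V(s')\big] \;\;\forall (s,a),

whose Lagrangian

    L(V, \rho) = (1-\gamma)\, \mathbb{E}_{s \sim \mu}\big[V(s)\big] + \sum_{s,a} \rho(s,a)\Big(r(s,a) + \gamma\, \mathbb{E}_{s'}\big[V(s')\big] - V(s)\Big), \qquad \rho \ge 0,

is a saddle-point problem: the multipliers \rho(s,a) induce the actor's policy while V plays the role of the dual critic.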
Model-Based Reinforcement Learning via Meta-Policy Optimization
Model-based reinforcement learning approaches carry the promise of being data
efficient. However, due to challenges in learning dynamics models that
sufficiently match the real-world dynamics, they struggle to achieve the same
asymptotic performance as model-free methods. We propose Model-Based
Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong
reliance on accurate learned dynamics models. Using an ensemble of learned
dynamics models, MB-MPO meta-learns a policy that can quickly adapt to any model
in the ensemble with one policy gradient step. This steers the meta-policy
towards internalizing consistent dynamics predictions among the ensemble while
shifting the burden of behaving optimally w.r.t. the model discrepancies
towards the adaptation step. Our experiments show that MB-MPO is more robust to
model imperfections than previous model-based approaches. Finally, we
demonstrate that our approach is able to match the asymptotic performance of
model-free methods while requiring significantly less experience.
Comment: First 2 authors contributed equally. Accepted for the Conference on Robot Learning (CoRL)
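A hedged sketch of the meta-objective implied by this description (our notation; \hat{f}_1, \dots, \hat{f}_K are the learned dynamics models, J_{\hat{f}_k}(\theta) the expected return of \pi_{\theta} under model \hat{f}_k, and \alpha the inner step size):

    \max_{\theta} \; \frac{1}{K} \sum_{k=1}^{K} J_{\hat{f}_k}\big(\theta'_k\big),
    \qquad
    \theta'_k = \theta + \alpha\, \nabla_{\theta} J_{\hat{f}_k}(\theta),

so the meta-policy is trained to perform well after a single policy-gradient adaptation step with respect to each model in the ensemble.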
Neural Sequence Model Training via α-divergence Minimization
We propose a new neural sequence model training method in which the objective function is defined by an α-divergence. We demonstrate that the objective function generalizes the maximum-likelihood (ML)-based and reinforcement learning (RL)-based objective functions as special cases, each recovered in a particular limit of α. We also show that the gradient of the objective function can be considered a mixture of ML- and RL-based objective gradients. The experimental results of a machine translation task show that minimizing the objective function at suitable values of α outperforms the limiting case corresponding to ML-based methods.
Comment: 2017 ICML Workshop on Learning to Generate Natural Language (LGNL 2017)
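For reference, one common parameterization of the α-divergence (conventions differ between papers, so the authors' exact form may vary) is

    D_{\alpha}(p \,\|\, q) = \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx\right),

which recovers \mathrm{KL}(q \,\|\, p) as \alpha \to 0 and \mathrm{KL}(p \,\|\, q) as \alpha \to 1, which is why it can bridge ML-style and RL-style training objectives.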
Stochastic Bregman Parallel Direction Method of Multipliers for Distributed Optimization
Bregman parallel direction method of multipliers (BPDMM) efficiently solves
distributed optimization over a network, which arises in a wide spectrum of
collaborative multi-agent learning applications. In this paper, we generalize
BPDMM to stochastic BPDMM, where each iteration only solves local optimization
on a randomly selected subset of nodes rather than all the nodes in the
network. Such generalization reduces the need for computational resources and
allows application to larger-scale networks. We establish both the global
convergence and the iteration complexity of stochastic BPDMM. We
demonstrate our results via numerical examples.
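To illustrate only the randomized-activation pattern (this is a simple gossip-style averaging sketch, not BPDMM itself): at each iteration a randomly selected pair of neighboring nodes does a local update while the rest of the network stays idle.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy distributed-averaging problem on a ring of n nodes: every node should
    # end up with the average of all initial values, but each iteration only a
    # randomly selected pair of neighboring nodes is activated.
    n = 10
    x = rng.normal(size=n)
    target = x.mean()

    for _ in range(2000):
        i = rng.integers(n)                  # random node ...
        j = (i + 1) % n                      # ... and one of its ring neighbors
        x[i] = x[j] = 0.5 * (x[i] + x[j])    # local update involving only the active pair

    print("max deviation from the true average:", np.max(np.abs(x - target)))

The stochastic variant described in the abstract follows the same pattern: per-iteration work is restricted to a random subset of nodes, trading per-iteration progress for lower computational and communication cost.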