
    Entropy-Augmented Entropy-Regularized Reinforcement Learning and a Continuous Path from Policy Gradient to Q-Learning

    Entropy augmented to the reward is known to soften the greedy argmax policy into a softmax policy. This entropy augmentation is reformulated, which motivates introducing an additional entropy term into the objective function, in the form of a KL-divergence, to regularize the optimization process. The result is a policy that monotonically improves while interpolating from the current policy to the softmax greedy policy. This policy is used to build a continuously parameterized algorithm that optimizes the policy and the Q-function simultaneously and whose extreme limits correspond to policy gradient and Q-learning, respectively. Experiments show that there can be a performance gain from using an intermediate algorithm. Comment: 16 pages, 1 figure. Refined a few expressions and proofs; added source code.
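
    A minimal sketch of the two ingredients this abstract describes, under assumed notation (the names tau and eta and the geometric-mean interpolation are illustrative choices, not the paper's exact update): entropy added to the reward turns the argmax over Q-values into a softmax policy, and a KL-style regularizer yields a policy lying between the current policy and that softmax greedy policy.

        import numpy as np

        def softmax_policy(q_values, tau=1.0):
            """Entropy-regularized ("soft") greedy policy over a discrete action set."""
            z = (q_values - q_values.max()) / tau      # subtract max for numerical stability
            p = np.exp(z)
            return p / p.sum()

        def interpolated_policy(current_policy, q_values, tau=1.0, eta=0.5):
            """Geometric interpolation between the current policy and the softmax greedy
            policy: eta=0 keeps the current policy, eta=1 gives the softmax policy."""
            target = softmax_policy(q_values, tau)
            mixed = current_policy ** (1.0 - eta) * target ** eta
            return mixed / mixed.sum()

        q = np.array([1.0, 2.0, 0.5])
        pi = np.array([0.5, 0.3, 0.2])
        print(softmax_policy(q, tau=0.5))
        print(interpolated_policy(pi, q, tau=0.5, eta=0.3))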

    Divide-and-Conquer Reinforcement Learning

    Standard model-free deep reinforcement learning (RL) algorithms sample a new initial state for each trial, allowing them to optimize policies that can perform well even in highly stochastic environments. However, problems that exhibit considerable initial state variation typically produce high-variance gradient estimates for model-free RL, making direct policy or value function optimization challenging. In this paper, we develop a novel algorithm that instead partitions the initial state space into "slices", and optimizes an ensemble of policies, each on a different slice. The ensemble is gradually unified into a single policy that can succeed on the whole state space. This approach, which we term divide-and-conquer RL, is able to solve complex tasks where conventional deep RL methods are ineffective. Our results show that divide-and-conquer RL greatly outperforms conventional policy gradient methods on challenging grasping, manipulation, and locomotion tasks, and exceeds the performance of a variety of prior methods. Videos of policies learned by our algorithm can be viewed at http://bit.ly/dnc-rl. Comment: Presented at ICLR 2018. Videos and supporting materials are located at http://bit.ly/dnc-rl.
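
    The overall recipe in the abstract (partition initial states into slices, train a policy per slice, then distill the ensemble into one policy) can be sketched on a toy problem. Everything below is invented for illustration; in particular, least-squares regression stands in for both the per-slice RL and the distillation step.

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy task: from initial state x0, the "optimal" action is a = -x0 (drive the state to 0).
        def optimal_action(x0):
            return -x0

        # Partition the initial-state space into slices.
        slices = [(-2.0, -0.5), (-0.5, 0.5), (0.5, 2.0)]

        # "Train" one linear policy a = w*x0 + b per slice.
        local_policies = []
        for lo, hi in slices:
            x0 = rng.uniform(lo, hi, size=200)
            X = np.stack([x0, np.ones_like(x0)], axis=1)
            w, b = np.linalg.lstsq(X, optimal_action(x0), rcond=None)[0]
            local_policies.append((w, b, lo, hi))

        # Unify: distill the ensemble into a single global policy by supervised regression.
        x_all = rng.uniform(-2.0, 2.0, size=2000)
        a_all = np.empty_like(x_all)
        for w, b, lo, hi in local_policies:
            mask = (x_all >= lo) & (x_all < hi)
            a_all[mask] = w * x_all[mask] + b
        X = np.stack([x_all, np.ones_like(x_all)], axis=1)
        w_g, b_g = np.linalg.lstsq(X, a_all, rcond=None)[0]
        print("distilled global policy: a = %.3f * x0 + %.3f" % (w_g, b_g))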

    Learning Deep Neural Network Policies with Continuous Memory States

    Policy learning for partially observed control tasks requires policies that can remember salient information from past observations. In this paper, we present a method for learning policies with internal memory for high-dimensional, continuous systems, such as robotic manipulators. Our approach consists of augmenting the state and action space of the system with continuous-valued memory states that the policy can read from and write to. Learning general-purpose policies with this type of memory representation directly is difficult, because the policy must automatically figure out the most salient information to memorize at each time step. We show that, by decomposing this policy search problem into a trajectory optimization phase and a supervised learning phase through a method called guided policy search, we can acquire policies with effective memorization and recall strategies. Intuitively, the trajectory optimization phase chooses the values of the memory states that will make it easier for the policy to produce the right action in future states, while the supervised learning phase encourages the policy to use memorization actions to produce those memory states. We evaluate our method on tasks involving continuous control in manipulation and navigation settings, and show that our method can learn complex policies that successfully complete a range of tasks that require memory.
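
    The state/action augmentation described in this abstract can be made concrete with a small sketch. The linear policy, the dimensions, and the hand-set weights below are all assumptions chosen for illustration; the point is only that reads from and writes to continuous memory states let information flow across time steps.

        import numpy as np

        MEM_DIM = 2

        def policy(obs, memory, params):
            """Linear stand-in for the neural-network policy: it maps the augmented
            state [obs, memory] to an augmented action [physical action, memory write]."""
            x = np.concatenate([obs, memory])
            out = params @ x
            return out[:-MEM_DIM], out[-MEM_DIM:]   # (physical action, next memory state)

        obs_dim, act_dim = 3, 1
        params = np.zeros((act_dim + MEM_DIM, obs_dim + MEM_DIM))
        params[1, 0] = 1.0        # memory write: copy obs[0] into memory slot 0
        params[0, obs_dim] = 1.0  # physical action: output whatever memory slot 0 holds

        memory = np.zeros(MEM_DIM)
        observations = [np.array([0.7, 0.0, 0.0]), np.zeros(3), np.zeros(3)]
        for t, obs in enumerate(observations):
            action, memory = policy(obs, memory, params)
            print(f"t={t} action={action}")   # the value seen at t=0 drives the action at t=1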

    End-to-End Training of Deep Visuomotor Policies

    Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a partially observed guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods. Comment: Updating with revisions for the JMLR final version.

    Guided Policy Search as Approximate Mirror Descent

    Guided policy search algorithms can be used to optimize complex nonlinear policies, such as deep neural networks, without directly computing policy gradients in the high-dimensional parameter space. Instead, these methods use supervised learning to train the policy to mimic a "teacher" algorithm, such as a trajectory optimizer or a trajectory-centric reinforcement learning method. Guided policy search methods provide asymptotic local convergence guarantees by construction, but it is not clear how much the policy improves within a small, finite number of iterations. We show that guided policy search algorithms can be interpreted as an approximate variant of mirror descent, where the projection onto the constraint manifold is not exact. We derive a new guided policy search algorithm that is simpler and provides appealing improvement and convergence guarantees in simplified convex and linear settings, and show that in the more general nonlinear setting, the error in the projection step can be bounded. We provide empirical results on several simulated robotic navigation and manipulation tasks that show that our method is stable and achieves similar or better performance when compared to prior guided policy search methods, with a simpler formulation and fewer hyperparameters.
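
    The mirror-descent reading of guided policy search alternates a "teacher" improvement step with a projection back onto the policy class. The toy problem below is an invented illustration of that two-step structure (gradient-based improvement with a proximity penalty, followed by supervised regression), not the paper's algorithm.

        import numpy as np

        rng = np.random.default_rng(1)
        states = rng.uniform(-1.0, 1.0, size=100)

        def cost(s, a):
            return (a - np.sin(3.0 * s)) ** 2          # toy per-state cost, optimum a* = sin(3s)

        def policy_actions(theta, s):
            return theta[0] * s + theta[1]             # constrained policy class: linear in s

        theta = np.zeros(2)
        step, prox = 0.5, 0.5
        for it in range(20):
            a_pi = policy_actions(theta, states)
            # 1) "teacher" step: improve each action toward lower cost, damped by a
            #    proximity penalty that keeps it near the current policy's action
            #    (a crude stand-in for a KL trust region).
            grad = 2.0 * (a_pi - np.sin(3.0 * states))
            a_teacher = a_pi - step * grad / (1.0 + prox)
            # 2) projection step: supervised regression of the constrained policy
            #    onto the teacher's actions.
            X = np.stack([states, np.ones_like(states)], axis=1)
            theta = np.linalg.lstsq(X, a_teacher, rcond=None)[0]

        print("average cost after projection steps:", cost(states, policy_actions(theta, states)).mean())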

    A Primal-Dual Approach to Markovian Network Optimization

    We formulate a novel class of stochastic network optimization problems, termed "Markovian network optimization", as a primal-dual pair whose solutions provide a dynamic, stochastic extension of the Wardrop equilibrium principle. We further generalize such network optimization to accommodate variable amounts of flow and multi-commodity flows with heterogeneous planning time windows, features that are well motivated in applications arising from game-theoretic settings. Finally, in order to solve the primal-dual pair, we design dynamic-programming-based numerical algorithms that outperform state-of-the-art commercial software (Gurobi) in extensive numerical experiments.
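
    As a small illustration of the dynamic-programming building block mentioned at the end of this abstract, the sketch below runs backward induction over a toy time-expanded network with time-varying arc costs. The network, horizon, and costs are invented; this is not the paper's formulation, only the kind of cost-to-go recursion such solvers rely on.

        import numpy as np

        T = 4
        nodes = ["A", "B", "C"]
        # cost[t][(u, v)] = cost of traversing arc (u, v) when leaving u at time t (invented numbers)
        cost = {t: {("A", "B"): 1.0 + 0.1 * t, ("A", "C"): 2.5,
                    ("B", "C"): 1.0, ("B", "B"): 0.2, ("C", "C"): 0.0} for t in range(T)}

        # Backward induction on the time-expanded network: V[u] is the minimal
        # cost-to-go from node u to the destination C.
        V = {n: (0.0 if n == "C" else np.inf) for n in nodes}
        for t in reversed(range(T)):
            V = {u: min([c + V[v] for (a, v), c in cost[t].items() if a == u] or [np.inf])
                 for u in nodes}

        print("minimal cost-to-go at time 0:", V)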

    Boosting the Actor with Dual Critic

    This paper proposes a new actor-critic-style algorithm called Dual Actor-Critic, or Dual-AC. It is derived in a principled way from the Lagrangian dual form of the Bellman optimality equation, which can be viewed as a two-player game between the actor and a critic-like function that we name the dual critic. Compared to its actor-critic relatives, Dual-AC has the desired property that the actor and the dual critic are updated cooperatively to optimize the same objective function, providing a more transparent way to learn a critic that is directly related to the objective function of the actor. We then provide a concrete algorithm that can effectively solve the minimax optimization problem, using multi-step bootstrapping, path regularization, and a stochastic dual ascent algorithm. We demonstrate that the proposed algorithm achieves state-of-the-art performance across several benchmarks. Comment: 21 pages, 9 figures.

    Model-Based Reinforcement Learning via Meta-Policy Optimization

    Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamics models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally with respect to the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience. Comment: First two authors contributed equally. Accepted for the Conference on Robot Learning (CoRL).
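
    The meta-learning loop described here, stripped to one dimension, looks roughly like the sketch below: each "model" in the ensemble defines its own objective, the inner loop takes a single adaptation step per model, and the outer loop updates the meta-parameter so that post-adaptation performance is good on every model. The quadratic objectives and the first-order meta-gradient are simplifying assumptions, not the authors' setup.

        import numpy as np

        model_optima = np.array([-1.0, 0.0, 0.5, 2.0])   # each "learned model" induces its own objective

        def grad_J(theta, c):
            return -2.0 * (theta - c)                    # gradient of J_k(theta) = -(theta - c_k)^2

        alpha, beta = 0.1, 0.05                          # inner (adaptation) and outer (meta) step sizes
        theta = 5.0                                      # meta-policy parameter

        for it in range(200):
            meta_grad = 0.0
            for c in model_optima:
                theta_adapted = theta + alpha * grad_J(theta, c)   # one policy-gradient adaptation step
                meta_grad += grad_J(theta_adapted, c)              # first-order meta-gradient term
            theta += beta * meta_grad / len(model_optima)

        print("meta-policy parameter after training:", theta)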

    Neural Sequence Model Training via α-divergence Minimization

    We propose a new neural sequence model training method in which the objective function is defined by the α-divergence. We demonstrate that the objective function generalizes the maximum-likelihood (ML)-based and reinforcement learning (RL)-based objective functions as special cases (i.e., ML corresponds to α → 0 and RL to α → 1). We also show that the gradient of the objective function can be considered a mixture of ML- and RL-based objective gradients. The experimental results of a machine translation task show that minimizing the objective function with α > 0 outperforms α → 0, which corresponds to ML-based methods. Comment: 2017 ICML Workshop on Learning to Generate Natural Language (LGNL 2017).
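
    The claim that the gradient behaves like a mixture of ML- and RL-based gradients can be illustrated with a toy single-token "sequence" model. The explicit convex combination below is a simplified reading of that claim (the vocabulary, reward, and weighting are invented), not the paper's α-divergence estimator.

        import numpy as np

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        logits = np.zeros(4)                       # model parameters: one logit per token
        target = 2                                 # reference ("ground-truth") token
        reward = np.array([0.0, 0.2, 1.0, 0.4])    # toy sequence-level reward per token

        def mixed_gradient(logits, alpha):
            p = softmax(logits)
            grad_ml = np.eye(len(p))[target] - p                # d log p(target) / d logits
            grad_rl = p * (reward - np.dot(p, reward))          # exact REINFORCE gradient of E_p[reward]
            return (1.0 - alpha) * grad_ml + alpha * grad_rl    # alpha=0 -> ML, alpha=1 -> RL

        for alpha in (0.0, 0.5, 1.0):
            print(alpha, mixed_gradient(logits, alpha))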

    Stochastic Bregman Parallel Direction Method of Multipliers for Distributed Optimization

    The Bregman parallel direction method of multipliers (BPDMM) efficiently solves distributed optimization over a network, which arises in a wide spectrum of collaborative multi-agent learning applications. In this paper, we generalize BPDMM to stochastic BPDMM, where each iteration only solves local optimization on a randomly selected subset of nodes rather than on all the nodes in the network. This generalization reduces the need for computational resources and allows applications to larger-scale networks. We establish both the global convergence and the O(1/T) iteration complexity of stochastic BPDMM. We demonstrate our results via numerical examples.
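
    The defining feature of the stochastic variant, activating only a random subset of nodes per iteration, can be seen in the toy consensus example below. The averaging update and the ring network are placeholders, not the Bregman PDMM update itself.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 8
        neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}   # ring network
        x = rng.normal(size=n)                                          # each node's local variable

        for it in range(500):
            active = rng.choice(n, size=3, replace=False)   # only a random subset of nodes updates
            for i in active:
                # local step: average the node's value with its neighbors' current values
                x[i] = 0.5 * x[i] + 0.5 * np.mean([x[j] for j in neighbors[i]])

        print("disagreement across nodes after stochastic updates:", x.max() - x.min())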