65 research outputs found

    A unified view of entropy-regularized Markov decision processes

    Full text link
    We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the entropy-regularized policy gradient methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally, we illustrate empirically the effects of using various regularization techniques on learning performance in a simple reinforcement learning setup

    Equivalence Between Wasserstein and Value-Aware Loss for Model-based Reinforcement Learning

    Full text link
    Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. In this context, an important question is the loss function used for model learning as varying the loss function can have a remarkable impact on effectiveness of planning. Recently Farahmand et al. (2017) proposed a value-aware model learning (VAML) objective that captures the structure of value function during model learning. Using tools from Asadi et al. (2018), we show that minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric. This equivalence improves our understanding of value-aware models, and also creates a theoretical foundation for applications of Wasserstein in model-based reinforcement~learning.Comment: Accepted at the FAIM workshop "Prediction and Generative Modeling in Reinforcement Learning", Stockholm, Sweden, 201

    Cold-Start Reinforcement Learning with Softmax Policy Gradient

    Full text link
    Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity of maximum-likelihood approaches. We apply this new cold-start reinforcement learning method in training sequence generation models for structured output prediction problems. Empirical evidence validates this method on automatic summarization and image captioning tasks.Comment: Conference on Neural Information Processing Systems 2017. Main paper and supplementary materia

    Lagrangian Duality in Reinforcement Learning

    Full text link
    Although duality is used extensively in certain fields, such as supervised learning in machine learning, it has been much less explored in others, such as reinforcement learning (RL). In this paper, we show how duality is involved in a variety of RL work, from that which spearheaded the field, such as Richard Bellman's value iteration, to that which was done within just the past few years yet has already had significant impact, such as TRPO, A3C, and GAIL. We show that duality is not uncommon in reinforcement learning, especially when value iteration, or dynamic programming, is used or when first or second order approximations are made to transform initially intractable problems into tractable convex programs.Comment: 8 pages, 0 figures; fixed typo in abstrac

    On Connections between Constrained Optimization and Reinforcement Learning

    Full text link
    Dynamic Programming (DP) provides standard algorithms to solve Markov Decision Processes. However, these algorithms generally do not optimize a scalar objective function. In this paper, we draw connections between DP and (constrained) convex optimization. Specifically, we show clear links in the algorithmic structure between three DP schemes and optimization algorithms. We link Conservative Policy Iteration to Frank-Wolfe, Mirror-Descent Modified Policy Iteration to Mirror Descent, and Politex (Policy Iteration Using Expert Prediction) to Dual Averaging. These abstract DP schemes are representative of a number of (deep) Reinforcement Learning (RL) algorithms. By highlighting these connections (most of which have been noticed earlier, but in a scattered way), we would like to encourage further studies linking RL and convex optimization, that could lead to the design of new, more efficient, and better understood RL algorithms.Comment: Optimization Foundations of Reinforcement Learning Workshop at NeurIPS 201

    Hierarchical Reinforcement Learning for Concurrent Discovery of Compound and Composable Policies

    Full text link
    A common strategy to deal with the expensive reinforcement learning (RL) of complex tasks is to decompose them into a collection of subtasks that are usually simpler to learn as well as reusable for new problems. However, when a robot learns the policies for these subtasks, common approaches treat every policy learning process separately. Therefore, all these individual (composable) policies need to be learned before tackling the learning process of the complex task through policies composition. Moreover, such composition of individual policies is usually performed sequentially, which is not suitable for tasks that require to perform the subtasks concurrently. In this paper, we propose to combine a set of composable Gaussian policies corresponding to these subtasks using a set of activation vectors, resulting in a complex Gaussian policy that is a function of the means and covariances matrices of the composable policies. Moreover, we propose an algorithm for learning both compound and composable policies within the same learning process by exploiting the off-policy data generated from the compound policy. The algorithm is built on a maximum entropy RL approach to favor exploration during the learning process. The results of the experiments show that the experience collected with the compound policy permits not only to solve the complex task but also to obtain useful composable policies that successfully perform in their corresponding subtasks.Comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019

    Reparameterized Variational Divergence Minimization for Stable Imitation

    Full text link
    While recent state-of-the-art results for adversarial imitation-learning algorithms are encouraging, recent works exploring the imitation learning from observation (ILO) setting, where trajectories \textit{only} contain expert observations, have not been met with the same success. Inspired by recent investigations of ff-divergence manipulation for the standard imitation learning setting(Ke et al., 2019; Ghasemipour et al., 2019), we here examine the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms. We unfortunately find that ff-divergence minimization through reinforcement learning is susceptible to numerical instabilities. We contribute a reparameterization trick for adversarial imitation learning to alleviate the optimization challenges of the promising ff-divergence minimization framework. Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks

    PAC-Bayes Control: Learning Policies that Provably Generalize to Novel Environments

    Full text link
    Our goal is to learn control policies for robots that provably generalize well to novel environments given a dataset of example environments. The key technical idea behind our approach is to leverage tools from generalization theory in machine learning by exploiting a precise analogy (which we present in the form of a reduction) between generalization of control policies to novel environments and generalization of hypotheses in the supervised learning setting. In particular, we utilize the Probably Approximately Correct (PAC)-Bayes framework, which allows us to obtain upper bounds that hold with high probability on the expected cost of (stochastic) control policies across novel environments. We propose policy learning algorithms that explicitly seek to minimize this upper bound. The corresponding optimization problem can be solved using convex optimization (Relative Entropy Programming in particular) in the setting where we are optimizing over a finite policy space. In the more general setting of continuously parameterized policies (e.g., neural network policies), we minimize this upper bound using stochastic gradient descent. We present simulated results of our approach applied to learning (1) reactive obstacle avoidance policies and (2) neural network-based grasping policies. We also present hardware results for the Parrot Swing drone navigating through different obstacle environments. Our examples demonstrate the potential of our approach to provide strong generalization guarantees for robotic systems with continuous state and action spaces, complicated (e.g., nonlinear) dynamics, rich sensory inputs (e.g., depth images), and neural network-based policies.Comment: Extended version of paper presented at the 2018 Conference on Robot Learning (CoRL

    Path Consistency Learning in Tsallis Entropy Regularized MDPs

    Full text link
    We study the sparse entropy-regularized reinforcement learning (ERL) problem in which the entropy term is a special form of the Tsallis entropy. The optimal policy of this formulation is sparse, i.e.,~at each state, it has non-zero probability for only a small number of actions. This addresses the main drawback of the standard Shannon entropy-regularized RL (soft ERL) formulation, in which the optimal policy is softmax, and thus, may assign a non-negligible probability mass to non-optimal actions. This problem is aggravated as the number of actions is increased. In this paper, we follow the work of Nachum et al. (2017) in the soft ERL setting, and propose a class of novel path consistency learning (PCL) algorithms, called {\em sparse PCL}, for the sparse ERL problem that can work with both on-policy and off-policy data. We first derive a {\em sparse consistency} equation that specifies a relationship between the optimal value function and policy of the sparse ERL along any system trajectory. Crucially, a weak form of the converse is also true, and we quantify the sub-optimality of a policy which satisfies sparse consistency, and show that as we increase the number of actions, this sub-optimality is better than that of the soft ERL optimal policy. We then use this result to derive the sparse PCL algorithms. We empirically compare sparse PCL with its soft counterpart, and show its advantage, especially in problems with a large number of actions

    Unifying Value Iteration, Advantage Learning, and Dynamic Policy Programming

    Full text link
    Approximate dynamic programming algorithms, such as approximate value iteration, have been successfully applied to many complex reinforcement learning tasks, and a better approximate dynamic programming algorithm is expected to further extend the applicability of reinforcement learning to various tasks. In this paper we propose a new, robust dynamic programming algorithm that unifies value iteration, advantage learning, and dynamic policy programming. We call it generalized value iteration (GVI) and its approximated version, approximate GVI (AGVI). We show AGVI's performance guarantee, which includes performance guarantees for existing algorithms, as special cases. We discuss theoretical weaknesses of existing algorithms, and explain the advantages of AGVI. Numerical experiments in a simple environment support theoretical arguments, and suggest that AGVI is a promising alternative to previous algorithms
    • …
    corecore