A unified view of entropy-regularized Markov decision processes
We propose a general framework for entropy-regularized average-reward
reinforcement learning in Markov decision processes (MDPs). Our approach is
based on extending the linear-programming formulation of policy optimization in
MDPs to accommodate convex regularization functions. Our key result is showing
that using the conditional entropy of the joint state-action distributions as
regularization yields a dual optimization problem closely resembling the
Bellman optimality equations. This result enables us to formalize a number of
state-of-the-art entropy-regularized reinforcement learning algorithms as
approximate variants of Mirror Descent or Dual Averaging, and thus to argue
about the convergence properties of these methods. In particular, we show that
the exact version of the TRPO algorithm of Schulman et al. (2015) actually
converges to the optimal policy, while the entropy-regularized policy gradient
methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally,
we illustrate empirically the effects of using various regularization
techniques on learning performance in a simple reinforcement learning setup.
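As a hedged illustration of the Mirror Descent view described in this abstract, the Python fragment below performs the exact multiplicative (KL-proximal) update in which pi_{k+1}(a|s) is proportional to pi_k(a|s) exp(eta Q_k(s,a)) on a small tabular MDP; the array shapes and the helper policy_eval are our own assumptions for the sketch, not the paper's code.

import numpy as np

def policy_eval(P, r, pi, gamma=0.99, iters=500):
    # Evaluate Q^pi by iterating the Bellman expectation backup.
    # P: transition tensor of shape (S, A, S); r: rewards of shape (S, A).
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = np.sum(pi * Q, axis=1)      # V(s) = sum_a pi(a|s) Q(s,a)
        Q = r + gamma * (P @ V)         # Bellman backup
    return Q

def mirror_descent_step(pi, Q, eta=1.0):
    # Exact KL-proximal (Mirror Descent) policy update:
    # pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q_k(s,a)).
    logits = np.log(pi + 1e-12) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)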
Equivalence Between Wasserstein and Value-Aware Loss for Model-based Reinforcement Learning
Learning a generative model is a key component of model-based reinforcement
learning. Though learning a good model in the tabular setting is a simple task,
learning a useful model in the approximate setting is challenging. In this
context, an important question is which loss function to use for model learning, since the choice of loss can have a remarkable impact on the effectiveness of planning. Recently, Farahmand et al. (2017) proposed a value-aware model learning (VAML) objective that captures the structure of the value function during model learning. Using tools from Asadi et al. (2018), we show that minimizing
the VAML objective is in fact equivalent to minimizing the Wasserstein metric.
This equivalence improves our understanding of value-aware models, and also
creates a theoretical foundation for applications of Wasserstein in model-based
reinforcement learning.
Comment: Accepted at the FAIM workshop "Prediction and Generative Modeling in Reinforcement Learning", Stockholm, Sweden, 2018
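For orientation (our notation, not quoted from the paper), the VAML loss of Farahmand et al. (2017) and the Kantorovich-Rubinstein dual form of the Wasserstein metric can be written as

\ell_{\mathrm{VAML}}(\hat{P}, P)(s,a) = \sup_{V \in \mathcal{F}} \Big( \int \big( P(ds' \mid s,a) - \hat{P}(ds' \mid s,a) \big) V(s') \Big)^2,
\qquad
W_1(P, \hat{P}) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \int f \, d(P - \hat{P}),

so the claimed equivalence amounts to the supremum over the value-function class \mathcal{F} coinciding, up to the square, with the supremum over 1-Lipschitz functions when \mathcal{F} is the 1-Lipschitz ball.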
Cold-Start Reinforcement Learning with Softmax Policy Gradient
Policy-gradient approaches to reinforcement learning have two common and
undesirable overhead procedures, namely warm-start training and sample variance
reduction. In this paper, we describe a reinforcement learning method based on
a softmax value function that requires neither of these procedures. Our method
combines the advantages of policy-gradient methods with the efficiency and
simplicity of maximum-likelihood approaches. We apply this new cold-start
reinforcement learning method in training sequence generation models for
structured output prediction problems. Empirical evidence validates this method
on automatic summarization and image captioning tasks.
Comment: Conference on Neural Information Processing Systems 2017. Main paper and supplementary material
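As a loose sketch only (this is not the paper's objective, and every name below is hypothetical), one way to read "combining policy-gradient advantages with maximum-likelihood simplicity" is to replace the raw returns of REINFORCE with normalized softmax weights over sampled output sequences, which removes the need for a separate variance-reduction baseline:

import numpy as np

def softmax_sequence_weights(log_probs, rewards, tau=1.0):
    # Hypothetical sketch, not the paper's exact method: score each sampled
    # output sequence by reward plus model log-probability, and turn the
    # scores into softmax weights that replace the raw returns of REINFORCE.
    # log_probs, rewards: arrays of shape (num_samples,).
    scores = (rewards + log_probs) / tau
    scores = scores - scores.max()        # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()        # bounded, normalized weights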
Lagrangian Duality in Reinforcement Learning
Although duality is used extensively in certain fields, such as supervised
learning in machine learning, it has been much less explored in others, such as
reinforcement learning (RL). In this paper, we show how duality is involved in
a variety of RL work, from the work that spearheaded the field, such as Richard Bellman's value iteration, to work done within just the past few years that has already had significant impact, such as TRPO, A3C, and GAIL. We
show that duality is not uncommon in reinforcement learning, especially when
value iteration, or dynamic programming, is used or when first- or second-order approximations are made to transform initially intractable problems into tractable convex programs.
Comment: 8 pages, 0 figures; fixed typo in abstract
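As one concrete instance of the duality surveyed here (standard textbook material, stated in our notation), the value-function linear program for a discounted MDP and its dual over occupancy measures are

\min_{V} \sum_{s} \mu_0(s) V(s) \quad \text{s.t.} \quad V(s) \ge r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V(s') \;\; \forall s,a,

\max_{d \ge 0} \sum_{s,a} d(s,a)\, r(s,a) \quad \text{s.t.} \quad \sum_{a} d(s',a) = \mu_0(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, d(s,a) \;\; \forall s',

where the optimal d(s,a) is the discounted state-action occupancy measure of an optimal policy.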
On Connections between Constrained Optimization and Reinforcement Learning
Dynamic Programming (DP) provides standard algorithms to solve Markov
Decision Processes. However, these algorithms generally do not optimize a
scalar objective function. In this paper, we draw connections between DP and
(constrained) convex optimization. Specifically, we show clear links in the
algorithmic structure between three DP schemes and optimization algorithms. We
link Conservative Policy Iteration to Frank-Wolfe, Mirror-Descent Modified
Policy Iteration to Mirror Descent, and Politex (Policy Iteration Using Expert
Prediction) to Dual Averaging. These abstract DP schemes are representative of
a number of (deep) Reinforcement Learning (RL) algorithms. By highlighting
these connections (most of which have been noticed earlier, but in a scattered
way), we would like to encourage further studies linking RL and convex
optimization, that could lead to the design of new, more efficient, and better
understood RL algorithms.
Comment: Optimization Foundations of Reinforcement Learning Workshop at NeurIPS 2019
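In our notation, the three correspondences above can be summarized by the following policy updates, where G(\pi_k) is the greedy policy with respect to Q_{\pi_k} and \alpha, \eta are step sizes:

\text{CPI / Frank-Wolfe:} \quad \pi_{k+1} = (1 - \alpha)\, \pi_k + \alpha\, G(\pi_k),

\text{MD-MPI / Mirror Descent:} \quad \pi_{k+1}(a \mid s) \propto \pi_k(a \mid s) \exp\!\big(\eta\, Q_{\pi_k}(s,a)\big),

\text{Politex / Dual Averaging:} \quad \pi_{k+1}(a \mid s) \propto \exp\!\Big(\eta \sum_{i=0}^{k} Q_{\pi_i}(s,a)\Big).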
Hierarchical Reinforcement Learning for Concurrent Discovery of Compound and Composable Policies
A common strategy for dealing with the expense of reinforcement learning (RL) on complex tasks is to decompose them into a collection of subtasks that are usually simpler to learn and reusable for new problems. However, when a robot learns the policies for these subtasks, common approaches treat each policy learning process separately. Therefore, all of these individual (composable) policies must be learned before the complex task can be tackled through policy composition. Moreover, such composition of individual policies is usually performed sequentially, which is not suitable for tasks that require performing the subtasks concurrently. In this paper, we
propose to combine a set of composable Gaussian policies corresponding to these
subtasks using a set of activation vectors, resulting in a complex Gaussian
policy that is a function of the means and covariance matrices of the
composable policies. Moreover, we propose an algorithm for learning both
compound and composable policies within the same learning process by exploiting
the off-policy data generated from the compound policy. The algorithm is built
on a maximum entropy RL approach to favor exploration during the learning
process. The results of the experiments show that the experience collected with the compound policy makes it possible not only to solve the complex task but also to obtain useful composable policies that perform successfully in their corresponding subtasks.
Comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019)
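To make "a function of the means and covariance matrices" concrete, here is a hedged sketch of one common way to fuse K Gaussian policies with activation weights, namely a precision-weighted product of Gaussians; the paper's exact composition rule may differ, and all names below are ours.

import numpy as np

def compose_gaussians(mus, Sigmas, w):
    # Hypothetical sketch (not necessarily the paper's rule): combine K
    # Gaussian policies into one compound Gaussian via an activation-weighted
    # product of Gaussians.
    # mus: list of K mean vectors; Sigmas: list of K covariance matrices;
    # w: K non-negative activation weights.
    precisions = [wk * np.linalg.inv(S) for wk, S in zip(w, Sigmas)]
    Lam = sum(precisions)                                      # compound precision
    Sigma = np.linalg.inv(Lam)                                 # compound covariance
    mu = Sigma @ sum(P @ m for P, m in zip(precisions, mus))   # compound mean
    return mu, Sigma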
Reparameterized Variational Divergence Minimization for Stable Imitation
While recent state-of-the-art results for adversarial imitation-learning algorithms are encouraging, recent works exploring the imitation learning from observation (ILO) setting, where trajectories contain only expert observations, have not been met with the same success. Inspired by recent investigations of $f$-divergence manipulation for the standard imitation learning setting (Ke et al., 2019; Ghasemipour et al., 2019), we here examine the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms. We unfortunately find that $f$-divergence minimization through reinforcement learning is susceptible to numerical instabilities. We contribute a reparameterization trick for adversarial imitation learning to alleviate the optimization challenges of the promising $f$-divergence minimization framework. Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
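For reference (our notation), the variational lower bound that adversarial imitation methods typically optimize for an $f$-divergence between the expert distribution P_E and the imitator distribution P_\pi is

D_f(P_E \,\|\, P_\pi) \;\ge\; \sup_{T} \; \mathbb{E}_{x \sim P_E}[\,T(x)\,] - \mathbb{E}_{x \sim P_\pi}[\,f^{*}(T(x))\,],

where f^{*} is the convex conjugate of f and T plays the role of the discriminator; the choice of f is exactly the degree of freedom varied here for the ILO setting.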
PAC-Bayes Control: Learning Policies that Provably Generalize to Novel Environments
Our goal is to learn control policies for robots that provably generalize
well to novel environments given a dataset of example environments. The key
technical idea behind our approach is to leverage tools from generalization
theory in machine learning by exploiting a precise analogy (which we present in
the form of a reduction) between generalization of control policies to novel
environments and generalization of hypotheses in the supervised learning
setting. In particular, we utilize the Probably Approximately Correct
(PAC)-Bayes framework, which allows us to obtain upper bounds that hold with
high probability on the expected cost of (stochastic) control policies across
novel environments. We propose policy learning algorithms that explicitly seek
to minimize this upper bound. The corresponding optimization problem can be
solved using convex optimization (Relative Entropy Programming in particular)
in the setting where we are optimizing over a finite policy space. In the more
general setting of continuously parameterized policies (e.g., neural network
policies), we minimize this upper bound using stochastic gradient descent. We
present simulated results of our approach applied to learning (1) reactive
obstacle avoidance policies and (2) neural network-based grasping policies. We
also present hardware results for the Parrot Swing drone navigating through
different obstacle environments. Our examples demonstrate the potential of our
approach to provide strong generalization guarantees for robotic systems with
continuous state and action spaces, complicated (e.g., nonlinear) dynamics,
rich sensory inputs (e.g., depth images), and neural network-based policies.
Comment: Extended version of paper presented at the 2018 Conference on Robot Learning (CoRL)
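For reference, one standard PAC-Bayes bound of the kind this abstract refers to (not necessarily the exact form optimized in the paper) states that, with probability at least 1 - \delta over the draw of N training environments and for costs in [0, 1],

\mathbb{E}_{E}\, \mathbb{E}_{\pi \sim P}\big[ C(\pi; E) \big] \;\le\; \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\pi \sim P}\big[ C(\pi; E_i) \big] + \sqrt{ \frac{ \mathrm{KL}(P \,\|\, P_0) + \ln \frac{2\sqrt{N}}{\delta} }{ 2N } },

where P_0 is a prior over policies fixed before seeing the training environments and P is the learned (stochastic) policy distribution whose KL term the learner trades off against empirical cost.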
Path Consistency Learning in Tsallis Entropy Regularized MDPs
We study the sparse entropy-regularized reinforcement learning (ERL) problem
in which the entropy term is a special form of the Tsallis entropy. The optimal
policy of this formulation is sparse, i.e., at each state, it has non-zero
probability for only a small number of actions. This addresses the main
drawback of the standard Shannon entropy-regularized RL (soft ERL) formulation,
in which the optimal policy is softmax, and thus, may assign a non-negligible
probability mass to non-optimal actions. This problem is aggravated as the
number of actions is increased. In this paper, we follow the work of Nachum et
al. (2017) in the soft ERL setting, and propose a class of novel path
consistency learning (PCL) algorithms, called sparse PCL, for the sparse
ERL problem that can work with both on-policy and off-policy data. We first
derive a sparse consistency equation that specifies a relationship between the optimal value function and policy of the sparse ERL problem along any
system trajectory. Crucially, a weak form of the converse is also true, and we
quantify the sub-optimality of a policy which satisfies sparse consistency, and
show that as we increase the number of actions, this sub-optimality is better
than that of the soft ERL optimal policy. We then use this result to derive the
sparse PCL algorithms. We empirically compare sparse PCL with its soft
counterpart, and show its advantage, especially in problems with a large number
of actions.
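For context (our notation), the Shannon-entropy path consistency of Nachum et al. (2017) that the sparse version generalizes states that, along any trajectory s_0, a_0, ..., s_d and with temperature \tau,

V^{*}(s_0) - \gamma^{d} V^{*}(s_d) \;=\; \sum_{t=0}^{d-1} \gamma^{t} \big( r(s_t, a_t) - \tau \log \pi^{*}(a_t \mid s_t) \big);

the sparse consistency equation derived in the paper plays the same role, with the \log term replaced by the corresponding quantity induced by the Tsallis entropy.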
Unifying Value Iteration, Advantage Learning, and Dynamic Policy Programming
Approximate dynamic programming algorithms, such as approximate value
iteration, have been successfully applied to many complex reinforcement
learning tasks, and a better approximate dynamic programming algorithm is
expected to further extend the applicability of reinforcement learning to
various tasks. In this paper we propose a new, robust dynamic programming
algorithm that unifies value iteration, advantage learning, and dynamic policy
programming. We call it generalized value iteration (GVI) and its approximated
version, approximate GVI (AGVI). We establish a performance guarantee for AGVI that includes the performance guarantees of existing algorithms as special cases. We
discuss theoretical weaknesses of existing algorithms, and explain the
advantages of AGVI. Numerical experiments in a simple environment support the theoretical arguments and suggest that AGVI is a promising alternative to previous algorithms.
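For orientation (our notation, following the standard advantage-learning operator of Baird and Bellemare et al. (2016), not the paper's exact definition), value iteration and advantage learning can both be written as instances of a gap-increasing backup

(\mathcal{T}_{\alpha} Q)(s,a) \;=\; r(s,a) + \gamma\, \mathbb{E}_{s'}\big[ \max_{b} Q(s',b) \big] - \alpha \big( \max_{b} Q(s,b) - Q(s,a) \big),

which reduces to value iteration for \alpha = 0 and to advantage learning for \alpha \in (0,1); a GVI-style operator additionally softens the maximum (e.g., with a log-sum-exp), so that dynamic policy programming can be recovered as well.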
- …