A unified view of entropy-regularized Markov decision processes
We propose a general framework for entropy-regularized average-reward
reinforcement learning in Markov decision processes (MDPs). Our approach is
based on extending the linear-programming formulation of policy optimization in
MDPs to accommodate convex regularization functions. Our key result is showing
that using the conditional entropy of the joint state-action distributions as
regularization yields a dual optimization problem closely resembling the
Bellman optimality equations. This result enables us to formalize a number of
state-of-the-art entropy-regularized reinforcement learning algorithms as
approximate variants of Mirror Descent or Dual Averaging, and thus to argue
about the convergence properties of these methods. In particular, we show that
the exact version of the TRPO algorithm of Schulman et al. (2015) actually
converges to the optimal policy, while the entropy-regularized policy gradient
methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally,
we illustrate empirically the effects of using various regularization
techniques on learning performance in a simple reinforcement learning setup.
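As a hedged illustration of the Mirror Descent view described in this abstract, the Python fragment below performs the exact multiplicative (KL-proximal) update in which pi_{k+1}(a|s) is proportional to pi_k(a|s) exp(eta Q_k(s,a)) on a small tabular MDP; the array shapes and the helper policy_eval are our own assumptions for the sketch, not the paper's code.

import numpy as np

def policy_eval(P, r, pi, gamma=0.99, iters=500):
    # Evaluate Q^pi by iterating the Bellman expectation backup.
    # P: transition tensor of shape (S, A, S); r: rewards of shape (S, A).
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = np.sum(pi * Q, axis=1)      # V(s) = sum_a pi(a|s) Q(s,a)
        Q = r + gamma * (P @ V)         # Bellman backup
    return Q

def mirror_descent_step(pi, Q, eta=1.0):
    # Exact KL-proximal (Mirror Descent) policy update:
    # pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q_k(s,a)).
    logits = np.log(pi + 1e-12) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)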
Equivalence Between Wasserstein and Value-Aware Loss for Model-based Reinforcement Learning
Learning a generative model is a key component of model-based reinforcement
learning. Though learning a good model in the tabular setting is a simple task,
learning a useful model in the approximate setting is challenging. In this
context, an important question is which loss function to use for model learning, since the choice of loss can have a remarkable impact on the effectiveness of planning. Recently, Farahmand et al. (2017) proposed a value-aware model learning (VAML) objective that captures the structure of the value function during model learning. Using tools from Asadi et al. (2018), we show that minimizing
the VAML objective is in fact equivalent to minimizing the Wasserstein metric.
This equivalence improves our understanding of value-aware models, and also
creates a theoretical foundation for applications of Wasserstein in model-based
reinforcement learning.
Comment: Accepted at the FAIM workshop "Prediction and Generative Modeling in Reinforcement Learning", Stockholm, Sweden, 2018
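For orientation (our notation, not quoted from the paper), the VAML loss of Farahmand et al. (2017) and the Kantorovich-Rubinstein dual form of the Wasserstein metric can be written as

\ell_{\mathrm{VAML}}(\hat{P}, P)(s,a) = \sup_{V \in \mathcal{F}} \Big( \int \big( P(ds' \mid s,a) - \hat{P}(ds' \mid s,a) \big) V(s') \Big)^2,
\qquad
W_1(P, \hat{P}) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \int f \, d(P - \hat{P}),

so the claimed equivalence amounts to the supremum over the value-function class \mathcal{F} coinciding, up to the square, with the supremum over 1-Lipschitz functions when \mathcal{F} is the 1-Lipschitz ball.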
Cold-Start Reinforcement Learning with Softmax Policy Gradient
Policy-gradient approaches to reinforcement learning have two common and
undesirable overhead procedures, namely warm-start training and sample variance
reduction. In this paper, we describe a reinforcement learning method based on
a softmax value function that requires neither of these procedures. Our method
combines the advantages of policy-gradient methods with the efficiency and
simplicity of maximum-likelihood approaches. We apply this new cold-start
reinforcement learning method in training sequence generation models for
structured output prediction problems. Empirical evidence validates this method
on automatic summarization and image captioning tasks.
Comment: Conference on Neural Information Processing Systems 2017. Main paper and supplementary material
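As a loose sketch only (this is not the paper's objective, and every name below is hypothetical), one way to read "combining policy-gradient advantages with maximum-likelihood simplicity" is to replace the raw returns of REINFORCE with normalized softmax weights over sampled output sequences, which removes the need for a separate variance-reduction baseline:

import numpy as np

def softmax_sequence_weights(log_probs, rewards, tau=1.0):
    # Hypothetical sketch, not the paper's exact method: score each sampled
    # output sequence by reward plus model log-probability, and turn the
    # scores into softmax weights that replace the raw returns of REINFORCE.
    # log_probs, rewards: arrays of shape (num_samples,).
    scores = (rewards + log_probs) / tau
    scores = scores - scores.max()        # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()        # bounded, normalized weights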
Lagrangian Duality in Reinforcement Learning
Although duality is used extensively in certain fields, such as supervised
learning in machine learning, it has been much less explored in others, such as
reinforcement learning (RL). In this paper, we show how duality is involved in
a variety of RL work, from the work that spearheaded the field, such as Richard Bellman's value iteration, to work done within just the past few years that has already had significant impact, such as TRPO, A3C, and GAIL. We
show that duality is not uncommon in reinforcement learning, especially when
value iteration, or dynamic programming, is used or when first- or second-order approximations are made to transform initially intractable problems into tractable convex programs.
Comment: 8 pages, 0 figures; fixed typo in abstract
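As one concrete instance of the duality surveyed here (standard textbook material, stated in our notation), the value-function linear program for a discounted MDP and its dual over occupancy measures are

\min_{V} \sum_{s} \mu_0(s) V(s) \quad \text{s.t.} \quad V(s) \ge r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V(s') \;\; \forall s,a,

\max_{d \ge 0} \sum_{s,a} d(s,a)\, r(s,a) \quad \text{s.t.} \quad \sum_{a} d(s',a) = \mu_0(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, d(s,a) \;\; \forall s',

where the optimal d(s,a) is the discounted state-action occupancy measure of an optimal policy.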
On Connections between Constrained Optimization and Reinforcement Learning
Dynamic Programming (DP) provides standard algorithms to solve Markov
Decision Processes. However, these algorithms generally do not optimize a
scalar objective function. In this paper, we draw connections between DP and
(constrained) convex optimization. Specifically, we show clear links in the
algorithmic structure between three DP schemes and optimization algorithms. We
link Conservative Policy Iteration to Frank-Wolfe, Mirror-Descent Modified
Policy Iteration to Mirror Descent, and Politex (Policy Iteration Using Expert
Prediction) to Dual Averaging. These abstract DP schemes are representative of
a number of (deep) Reinforcement Learning (RL) algorithms. By highlighting
these connections (most of which have been noticed earlier, but in a scattered
way), we would like to encourage further studies linking RL and convex
optimization, that could lead to the design of new, more efficient, and better
understood RL algorithms.
Comment: Optimization Foundations of Reinforcement Learning Workshop at NeurIPS 2019
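In our notation, the three correspondences above can be summarized by the following policy updates, where G(\pi_k) is the greedy policy with respect to Q_{\pi_k} and \alpha, \eta are step sizes:

\text{CPI / Frank-Wolfe:} \quad \pi_{k+1} = (1 - \alpha)\, \pi_k + \alpha\, G(\pi_k),

\text{MD-MPI / Mirror Descent:} \quad \pi_{k+1}(a \mid s) \propto \pi_k(a \mid s) \exp\!\big(\eta\, Q_{\pi_k}(s,a)\big),

\text{Politex / Dual Averaging:} \quad \pi_{k+1}(a \mid s) \propto \exp\!\Big(\eta \sum_{i=0}^{k} Q_{\pi_i}(s,a)\Big).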
Hierarchical Reinforcement Learning for Concurrent Discovery of Compound and Composable Policies
A common strategy for dealing with the expense of reinforcement learning (RL) on complex tasks is to decompose them into a collection of subtasks that are usually simpler to learn and reusable for new problems. However, when a robot learns the policies for these subtasks, common approaches treat each policy learning process separately. Therefore, all of these individual (composable) policies must be learned before the complex task can be tackled through policy composition. Moreover, such composition of individual policies is usually performed sequentially, which is not suitable for tasks that require performing the subtasks concurrently. In this paper, we
propose to combine a set of composable Gaussian policies corresponding to these
subtasks using a set of activation vectors, resulting in a complex Gaussian
policy that is a function of the means and covariance matrices of the
composable policies. Moreover, we propose an algorithm for learning both
compound and composable policies within the same learning process by exploiting
the off-policy data generated from the compound policy. The algorithm is built
on a maximum entropy RL approach to favor exploration during the learning
process. The results of the experiments show that the experience collected with the compound policy makes it possible not only to solve the complex task but also to obtain useful composable policies that perform successfully in their corresponding subtasks.
Comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019)
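To make "a function of the means and covariance matrices" concrete, here is a hedged sketch of one common way to fuse K Gaussian policies with activation weights, namely a precision-weighted product of Gaussians; the paper's exact composition rule may differ, and all names below are ours.

import numpy as np

def compose_gaussians(mus, Sigmas, w):
    # Hypothetical sketch (not necessarily the paper's rule): combine K
    # Gaussian policies into one compound Gaussian via an activation-weighted
    # product of Gaussians.
    # mus: list of K mean vectors; Sigmas: list of K covariance matrices;
    # w: K non-negative activation weights.
    precisions = [wk * np.linalg.inv(S) for wk, S in zip(w, Sigmas)]
    Lam = sum(precisions)                                      # compound precision
    Sigma = np.linalg.inv(Lam)                                 # compound covariance
    mu = Sigma @ sum(P @ m for P, m in zip(precisions, mus))   # compound mean
    return mu, Sigma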
Reparameterized Variational Divergence Minimization for Stable Imitation
While recent state-of-the-art results for adversarial imitation-learning algorithms are encouraging, recent works exploring the imitation learning from observation (ILO) setting, where trajectories contain only expert observations, have not been met with the same success. Inspired by recent investigations of $f$-divergence manipulation for the standard imitation learning setting (Ke et al., 2019; Ghasemipour et al., 2019), we here examine the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms. We unfortunately find that $f$-divergence minimization through reinforcement learning is susceptible to numerical instabilities. We contribute a reparameterization trick for adversarial imitation learning to alleviate the optimization challenges of the promising $f$-divergence minimization framework. Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
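For reference (our notation), the variational lower bound that adversarial imitation methods typically optimize for an $f$-divergence between the expert distribution P_E and the imitator distribution P_\pi is

D_f(P_E \,\|\, P_\pi) \;\ge\; \sup_{T} \; \mathbb{E}_{x \sim P_E}[\,T(x)\,] - \mathbb{E}_{x \sim P_\pi}[\,f^{*}(T(x))\,],

where f^{*} is the convex conjugate of f and T plays the role of the discriminator; the choice of f is exactly the degree of freedom varied here for the ILO setting.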
PAC-Bayes Control: Learning Policies that Provably Generalize to Novel Environments
Our goal is to learn control policies for robots that provably generalize
well to novel environments given a dataset of example environments. The key
technical idea behind our approach is to leverage tools from generalization
theory in machine learning by exploiting a precise analogy (which we present in
the form of a reduction) between generalization of control policies to novel
environments and generalization of hypotheses in the supervised learning
setting. In particular, we utilize the Probably Approximately Correct
(PAC)-Bayes framework, which allows us to obtain upper bounds that hold with
high probability on the expected cost of (stochastic) control policies across
novel environments. We propose policy learning algorithms that explicitly seek
to minimize this upper bound. The corresponding optimization problem can be
solved using convex optimization (Relative Entropy Programming in particular)
in the setting where we are optimizing over a finite policy space. In the more
general setting of continuously parameterized policies (e.g., neural network
policies), we minimize this upper bound using stochastic gradient descent. We
present simulated results of our approach applied to learning (1) reactive
obstacle avoidance policies and (2) neural network-based grasping policies. We
also present hardware results for the Parrot Swing drone navigating through
different obstacle environments. Our examples demonstrate the potential of our
approach to provide strong generalization guarantees for robotic systems with
continuous state and action spaces, complicated (e.g., nonlinear) dynamics,
rich sensory inputs (e.g., depth images), and neural network-based policies.
Comment: Extended version of paper presented at the 2018 Conference on Robot Learning (CoRL)
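For reference, one standard PAC-Bayes bound of the kind this abstract refers to (not necessarily the exact form optimized in the paper) states that, with probability at least 1 - \delta over the draw of N training environments and for costs in [0, 1],

\mathbb{E}_{E}\, \mathbb{E}_{\pi \sim P}\big[ C(\pi; E) \big] \;\le\; \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\pi \sim P}\big[ C(\pi; E_i) \big] + \sqrt{ \frac{ \mathrm{KL}(P \,\|\, P_0) + \ln \frac{2\sqrt{N}}{\delta} }{ 2N } },

where P_0 is a prior over policies fixed before seeing the training environments and P is the learned (stochastic) policy distribution whose KL term the learner trades off against empirical cost.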
Path Consistency Learning in Tsallis Entropy Regularized MDPs
We study the sparse entropy-regularized reinforcement learning (ERL) problem
in which the entropy term is a special form of the Tsallis entropy. The optimal
policy of this formulation is sparse, i.e., at each state, it has non-zero
probability for only a small number of actions. This addresses the main
drawback of the standard Shannon entropy-regularized RL (soft ERL) formulation,
in which the optimal policy is softmax, and thus, may assign a non-negligible
probability mass to non-optimal actions. This problem is aggravated as the
number of actions is increased. In this paper, we follow the work of Nachum et
al. (2017) in the soft ERL setting, and propose a class of novel path
consistency learning (PCL) algorithms, called sparse PCL, for the sparse
ERL problem that can work with both on-policy and off-policy data. We first
derive a sparse consistency equation that specifies a relationship between the optimal value function and policy of the sparse ERL problem along any
system trajectory. Crucially, a weak form of the converse is also true, and we
quantify the sub-optimality of a policy which satisfies sparse consistency, and
show that as we increase the number of actions, this sub-optimality is better
than that of the soft ERL optimal policy. We then use this result to derive the
sparse PCL algorithms. We empirically compare sparse PCL with its soft
counterpart, and show its advantage, especially in problems with a large number
of actions.
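For context (our notation), the Shannon-entropy path consistency of Nachum et al. (2017) that the sparse version generalizes states that, along any trajectory s_0, a_0, ..., s_d and with temperature \tau,

V^{*}(s_0) - \gamma^{d} V^{*}(s_d) \;=\; \sum_{t=0}^{d-1} \gamma^{t} \big( r(s_t, a_t) - \tau \log \pi^{*}(a_t \mid s_t) \big);

the sparse consistency equation derived in the paper plays the same role, with the \log term replaced by the corresponding quantity induced by the Tsallis entropy.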
Unifying Value Iteration, Advantage Learning, and Dynamic Policy Programming
Approximate dynamic programming algorithms, such as approximate value
iteration, have been successfully applied to many complex reinforcement
learning tasks, and a better approximate dynamic programming algorithm is
expected to further extend the applicability of reinforcement learning to
various tasks. In this paper we propose a new, robust dynamic programming
algorithm that unifies value iteration, advantage learning, and dynamic policy
programming. We call it generalized value iteration (GVI) and its approximated
version, approximate GVI (AGVI). We establish a performance guarantee for AGVI that includes the performance guarantees of existing algorithms as special cases. We
discuss theoretical weaknesses of existing algorithms, and explain the
advantages of AGVI. Numerical experiments in a simple environment support the theoretical arguments and suggest that AGVI is a promising alternative to previous algorithms.
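For orientation (our notation, following the standard advantage-learning operator of Baird and Bellemare et al. (2016), not the paper's exact definition), value iteration and advantage learning can both be written as instances of a gap-increasing backup

(\mathcal{T}_{\alpha} Q)(s,a) \;=\; r(s,a) + \gamma\, \mathbb{E}_{s'}\big[ \max_{b} Q(s',b) \big] - \alpha \big( \max_{b} Q(s,b) - Q(s,a) \big),

which reduces to value iteration for \alpha = 0 and to advantage learning for \alpha \in (0,1); a GVI-style operator additionally softens the maximum (e.g., with a log-sum-exp), so that dynamic policy programming can be recovered as well.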
- …