A Theory of Regularized Markov Decision Processes
Many recent successful (deep) reinforcement learning algorithms make use of
regularization, generally based on entropy or Kullback-Leibler divergence. We
propose a general theory of regularized Markov Decision Processes that
generalizes these approaches in two directions: we consider a larger class of
regularizers, and we consider the general modified policy iteration approach,
encompassing both policy iteration and value iteration. The core building
blocks of this theory are a notion of regularized Bellman operator and the
Legendre-Fenchel transform, a classical tool of convex optimization. This
approach allows for error propagation analyses of general algorithmic schemes
of which (possibly variants of) classical algorithms such as Trust Region
Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy
Programming are special cases. This also draws connections to proximal convex
optimization, especially to Mirror Descent.
Comment: ICML 2019
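As a rough sketch of the construction this abstract refers to (notation below is ours, introduced only for illustration), the regularized Bellman operators can be written through the convex conjugate of the regularizer:

```latex
% Sketch of regularized Bellman operators (notation ours, for illustration).
% Let \Omega be a convex regularizer on the action simplex \Delta_A and write
% q_s(a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot\mid s,a)}[V(s')].
\begin{align*}
  [T_{\pi,\Omega} V](s) &= \big\langle \pi(\cdot\mid s),\, q_s \big\rangle
                           - \Omega\big(\pi(\cdot\mid s)\big),\\[2pt]
  [T_{*,\Omega} V](s)   &= \max_{p \in \Delta_A}
                           \big\{ \langle p,\, q_s \rangle - \Omega(p) \big\}
                         \;=\; \Omega^{*}(q_s),
\end{align*}
% where \Omega^{*} is the Legendre--Fenchel transform (convex conjugate) of
% \Omega. Taking \Omega to be negative entropy makes \Omega^{*} a log-sum-exp,
% recovering the "soft" Bellman backup used in Soft Q-learning.
```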
Regularized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity
This paper focuses on reinforcement learning for the regularized robust
Markov decision process (MDP) problem, an extension of the robust MDP
framework. We first introduce the risk-sensitive MDP and establish the
equivalence between risk-sensitive MDP and regularized robust MDP. This
equivalence offers an alternative perspective for addressing the regularized
RMDP and enables the design of efficient learning algorithms. Given this
equivalence, we further derive the policy gradient theorem for the regularized
robust MDP problem and prove the global convergence of the exact policy
gradient method under the tabular setting with direct parameterization. We also
propose a sample-based offline learning algorithm, namely the robust fitted-Z
iteration (RFZI), for a specific regularized robust MDP problem with a
KL-divergence regularization term and analyze the sample complexity of the
algorithm. Our results are also supported by numerical simulations.
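One concrete instance of the kind of equivalence described above (our illustration; the paper's precise statement may differ) is the Gibbs variational identity relating a KL-regularized worst case over transition models to an exponential-utility, i.e. risk-sensitive, value:

```latex
% Illustrative regularized-robust / risk-sensitive equivalence (our example).
% For a nominal kernel P_0(\cdot\mid s,a), weight \lambda > 0, and value V:
\[
  \inf_{P \ll P_0(\cdot\mid s,a)}
    \Big\{ \mathbb{E}_{s' \sim P}\big[V(s')\big]
           + \tfrac{1}{\lambda}\,\mathrm{KL}\big(P \,\big\|\, P_0(\cdot\mid s,a)\big) \Big\}
  = -\tfrac{1}{\lambda}\,
    \log \mathbb{E}_{s' \sim P_0(\cdot\mid s,a)}\big[e^{-\lambda V(s')}\big],
\]
% i.e. the KL-regularized adversarial expectation equals an exponential
% (risk-sensitive) certainty equivalent under the nominal model P_0.
```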
In-Sample Policy Iteration for Offline Reinforcement Learning
Offline reinforcement learning (RL) seeks to derive an effective control
policy from previously collected data. To circumvent errors due to inadequate
data coverage, behavior-regularized methods optimize the control policy while
concurrently minimizing deviation from the data collection policy.
Nevertheless, these methods often exhibit subpar practical performance,
particularly when the offline dataset is collected by sub-optimal policies. In
this paper, we propose a novel algorithm employing in-sample policy iteration
that substantially enhances behavior-regularized methods in offline RL. The
core insight is that by continuously refining the policy used for behavior
regularization, in-sample policy iteration gradually improves itself while
implicitly avoiding queries of out-of-sample actions, thereby averting catastrophic learning
failures. Our theoretical analysis verifies its ability to learn the in-sample
optimal policy, exclusively utilizing actions well-covered by the dataset.
Moreover, we propose competitive policy improvement, a technique applying two
competitive policies, both of which are trained by iteratively improving over
the best competitor. We show that this simple yet potent technique
significantly enhances learning efficiency when function approximation is
applied. Lastly, experimental results on the D4RL benchmark indicate that our
algorithm outperforms previous state-of-the-art methods in most tasks.
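A minimal sketch of the loop structure described above, assuming user-supplied learners; the helper names (fit_critic, improve_policy, etc.) are ours and the paper's actual objectives and update rules are not reproduced here:

```python
# Hypothetical sketch of behavior-regularized policy iteration in which the
# reference (regularization) policy is itself refreshed each iteration.
# All names and update rules are illustrative, not the paper's code.

def in_sample_policy_iteration(dataset, num_iters, make_policy, make_critic,
                               fit_critic, improve_policy):
    """dataset: offline transitions; the callables are user-supplied learners."""
    reference = make_policy()   # start from (an estimate of) the behavior policy
    policy = make_policy()
    critic = make_critic()
    for _ in range(num_iters):
        # Policy evaluation: fit Q for the current policy using only
        # transitions in the dataset (no out-of-sample action queries).
        critic = fit_critic(critic, dataset, policy)
        # Policy improvement: maximize Q while staying close to the
        # current reference policy (behavior regularization).
        policy = improve_policy(critic, dataset, reference)
        # Key idea from the abstract: the reference is refreshed with the
        # improved policy, so the regularization tracks an improving target.
        reference = policy
    return policy
```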
Batch Policy Learning under Constraints
When learning policies for real-world domains, two important questions arise:
(i) how to efficiently use pre-collected off-policy, non-optimal behavior data;
and (ii) how to mediate among different competing objectives and constraints.
We thus study the problem of batch policy learning under multiple constraints,
and offer a systematic solution. We first propose a flexible meta-algorithm
that admits any batch reinforcement learning and online learning procedure as
subroutines. We then present a specific algorithmic instantiation and provide
performance guarantees for the main objective and all constraints. To certify
constraint satisfaction, we propose a new and simple method for off-policy
policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves
strong empirical results in different domains, including in a challenging
problem of simulated car driving subject to multiple constraints such as lane
keeping and smooth driving. We also show experimentally that our OPE method
outperforms other popular OPE techniques on a standalone basis, especially in a
high-dimensional setting.
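The abstract does not spell out the meta-algorithm; a common pattern for constrained batch policy learning, sketched below with hypothetical helper names, alternates a batch-RL best response to a Lagrangian reward with an online-learning update of the multipliers:

```python
import numpy as np

# Illustrative Lagrangian-style meta-loop for batch policy learning under
# constraints. Helper names (batch_rl_best_response, estimate_constraint_values)
# are placeholders, not the paper's API.

def constrained_batch_learning(dataset, thresholds, num_rounds,
                               batch_rl_best_response,
                               estimate_constraint_values, lr=0.1):
    """thresholds[i] is the allowed level for constraint cost i (lower is better)."""
    lam = np.ones(len(thresholds))          # Lagrange multipliers
    policies = []
    for _ in range(num_rounds):
        # Best response: any batch RL subroutine run on the Lagrangian reward
        # r - sum_i lam[i] * c_i, using only the pre-collected dataset.
        policy = batch_rl_best_response(dataset, lam)
        policies.append(policy)
        # Off-policy evaluation of each constraint cost for the new policy.
        costs = np.asarray(estimate_constraint_values(dataset, policy))
        # Online-learning update of the multipliers (projected gradient ascent
        # here; exponentiated-gradient variants are equally natural).
        lam = np.maximum(0.0, lam + lr * (costs - np.asarray(thresholds)))
    # Return the per-round policies; the deployed policy is typically
    # their uniform mixture.
    return policies
```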