Rectangularity and duality of distributionally robust Markov Decision Processes
The main goal of this paper is to discuss several approaches to the formulation
of distributionally robust counterparts of Markov Decision Processes, where the
transition kernels are not specified exactly but rather are assumed to be
elements of the corresponding ambiguity sets. The intent is to clarify some
connections between the game and static formulations of distributionally robust
MDPs, and to delineate the role that rectangularity of the ambiguity sets plays
in determining these connections.
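For orientation (our own notation, not a formula reproduced from the paper): an (s,a)-rectangular ambiguity set factorizes across state-action pairs, so nature may choose each transition law independently,

    \mathcal{P} \;=\; \bigtimes_{(s,a) \in S \times A} \mathcal{P}_{s,a}, \qquad \mathcal{P}_{s,a} \subseteq \Delta(S),

and, for a cost-minimizing controller with discount factor \gamma, the robust Bellman operator decomposes state-action-wise as

    (TV)(s) \;=\; \min_{a \in A} \; \sup_{p \in \mathcal{P}_{s,a}} \Big\{ c(s,a) + \gamma \sum_{s'} p(s')\, V(s') \Big\}.

Roughly speaking, it is this product structure that allows the game and static formulations to be related; the paper makes those connections precise.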
Distributionally Robust Markov Decision Processes and their Connection to Risk Measures
We consider robust Markov Decision Processes with Borel state and action
spaces, unbounded cost and finite time horizon. Our formulation leads to a
Stackelberg game against nature. Under integrability, continuity and
compactness assumptions we derive a robust cost iteration for a fixed policy of
the decision maker and a value iteration for the robust optimization problem.
Moreover, we show the existence of deterministic optimal policies for both
players. This is in contrast to classical zero-sum games. In the case where the
state space is the real line, we show under some convexity assumptions that the
interchange of supremum and infimum is possible with the help of Sion's minimax
theorem. Further, we consider the problem with special ambiguity sets. In
particular, we identify cases in which the robust optimization problem coincides
with the minimization of a coherent risk measure. In the final section we
discuss two applications: a robust LQ problem and a robust problem for managing
regenerative energy.
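A minimal finite-state sketch of the robust value iteration described above (a simplification, not the paper's Borel-space, finite-horizon construction; the ambiguity sets are abstracted into a worst-case-expectation oracle and all names are ours):

    import numpy as np

    def robust_value_iteration(cost, worst_case_exp, n_states, n_actions,
                               gamma=0.95, tol=1e-8, max_iter=10_000):
        # cost[s, a]             : one-stage cost
        # worst_case_exp(s, a, V): nature's best response, i.e. the supremum of
        #                          E_p[V] over the ambiguity set P(s, a)
        V = np.zeros(n_states)
        for _ in range(max_iter):
            Q = np.array([[cost[s, a] + gamma * worst_case_exp(s, a, V)
                           for a in range(n_actions)]
                          for s in range(n_states)])
            V_new = Q.min(axis=1)              # the decision maker minimizes cost
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        return V, Q.argmin(axis=1)             # robust value and a deterministic policy

Returning the argmin reflects the abstract's point that a deterministic optimal policy exists for the decision maker.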
Distributionally Robust Optimization for Sequential Decision Making
The distributionally robust Markov Decision Process (MDP) approach asks for a
distributionally robust policy that achieves the maximal expected total reward
under the most adversarial distribution of uncertain parameters. In this paper,
we study distributionally robust MDPs in which the ambiguity sets for the
uncertain parameters are of a format that can easily encode both
generalized-moment and statistical-distance information about the uncertainty.
In this way, we generalize existing work on distributionally robust MDPs with
generalized-moment-based and statistical-distance-based ambiguity sets:
information from the former class, such as moments and dispersions, is
incorporated into the latter class, which depends critically on empirical
observations of the uncertain parameters. We show that, under this format of
ambiguity sets, the
resulting distributionally robust MDP remains tractable under mild technical
conditions. To be more specific, a distributionally robust policy can be
constructed by solving a sequence of one-stage convex optimization subproblems.
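As an illustration of one such one-stage subproblem (a sketch only; the specific combination of a KL ball with a first-moment box, and all names below, are ours rather than the paper's ambiguity-set format), the adversary's worst-case choice over a finite-support distribution is a convex program:

    import cvxpy as cp
    import numpy as np

    def one_stage_worst_case(v, p_nom, support, rho, mean_lo, mean_hi):
        # v       : decision maker's value-to-go at each of the n support points
        # p_nom   : nominal (empirical) distribution over those points
        # support : the support points themselves, used in the moment constraint
        p = cp.Variable(len(v), nonneg=True)
        mean = cp.sum(cp.multiply(support, p))
        constraints = [
            cp.sum(p) == 1,
            cp.sum(cp.kl_div(p, p_nom)) <= rho,   # statistical-distance information
            mean >= mean_lo,                      # generalized-moment information
            mean <= mean_hi,
        ]
        problem = cp.Problem(cp.Minimize(p @ v), constraints)  # most adversarial distribution
        problem.solve()
        return problem.value, p.value

Because the objective is linear in p and every constraint is convex, each such subproblem is tractable, which is the structural point the abstract emphasizes.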
Distributionally Robust Model-based Reinforcement Learning with Large State Spaces
Three major challenges in reinforcement learning are complex dynamical systems
with large state spaces, costly data acquisition processes, and the deviation of
real-world dynamics from the training environment encountered at deployment. To
overcome these issues, we study distributionally robust Markov decision
processes with continuous state spaces under the widely used Kullback-Leibler,
chi-square, and total variation uncertainty sets. We propose a model-based
approach that utilizes Gaussian Processes and the maximum variance reduction
algorithm to efficiently learn multi-output nominal transition dynamics,
leveraging access to a generative model (i.e., a simulator). We further
establish the statistical sample complexity of the proposed method for the
different uncertainty sets. These complexity bounds are independent of the
number of states and extend beyond linear dynamics, ensuring the effectiveness
of our approach in identifying near-optimal distributionally-robust policies.
The proposed method can be further combined with other model-free
distributionally robust reinforcement learning methods to obtain a near-optimal
robust policy. Experimental results demonstrate the robustness of our algorithm
to distributional shifts and its superior performance in terms of the number of
samples needed.
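For context, the inner worst case over a KL uncertainty set is usually evaluated through its classical one-dimensional dual, inf_{KL(P||P0) <= rho} E_P[V] = sup_{beta > 0} { -beta log E_{P0}[exp(-V / beta)] - beta rho }. Below is a sketch under our own simplifying assumption of a finite-support nominal model (the paper instead works with GP-learned continuous dynamics):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def kl_robust_expectation(v, p_nom, rho):
        # Worst-case E_P[v] over the KL ball {P : KL(P || P_nom) <= rho},
        # computed via the one-dimensional dual over beta > 0.
        def negative_dual(beta):
            x = -v / beta
            m = x.max()                                   # log-sum-exp stabilisation
            log_mgf = m + np.log(p_nom @ np.exp(x - m))
            return -(-beta * log_mgf - beta * rho)
        res = minimize_scalar(negative_dual, bounds=(1e-6, 1e3), method="bounded")
        return -res.fun

The chi-square and total variation sets mentioned above admit analogous dual reformulations of the inner problem.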
On the Foundation of Distributionally Robust Reinforcement Learning
Motivated by the need for a robust policy in the face of environment shifts
between training and deployment, we contribute to the theoretical
foundation of distributionally robust reinforcement learning (DRRL). This is
accomplished through a comprehensive modeling framework centered around
distributionally robust Markov decision processes (DRMDPs). This framework
obliges the decision maker to choose an optimal policy under the worst-case
distributional shift orchestrated by an adversary. By unifying and extending
existing formulations, we rigorously construct DRMDPs that embrace various
modeling attributes for both the decision maker and the adversary. These
attributes include the granularity of adaptability, covering history-dependent,
Markov, and Markov time-homogeneous dynamics for the decision maker and the adversary.
Additionally, we delve into the flexibility of shifts induced by the adversary,
examining SA- and S-rectangularity. Within this DRMDP framework, we investigate
conditions for the existence or absence of the dynamic programming principle
(DPP). From an algorithmic standpoint, the existence of the DPP has significant
implications, as the vast majority of existing data- and computationally
efficient RL algorithms rely on the DPP. To study its existence, we
comprehensively examine combinations of controller and adversary attributes,
providing streamlined proofs grounded in a unified methodology. We also offer
counterexamples for settings in which a DPP with full generality is absent.
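As a concrete point of reference (standard fixed-point forms in our notation, not results reproduced from the paper): when the DPP holds, the stationary robust value function satisfies, under SA-rectangularity,

    V^*(s) \;=\; \max_{a \in A} \; \inf_{p \in \mathcal{P}_{s,a}} \Big\{ r(s,a) + \gamma \, \mathbb{E}_{s' \sim p}\big[V^*(s')\big] \Big\},

whereas under S-rectangularity the adversary picks one kernel per state, coupling the actions, and randomized policies may be required:

    V^*(s) \;=\; \max_{\pi_s \in \Delta(A)} \; \inf_{p_s \in \mathcal{P}_{s}} \sum_{a \in A} \pi_s(a) \Big\{ r(s,a) + \gamma \sum_{s'} p_s(s' \mid a)\, V^*(s') \Big\}.

The question studied in the paper is when fixed-point equations of this kind remain valid as the adaptability of the controller and the adversary varies.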
Distributionally Robust Off-Dynamics Reinforcement Learning: Provable Efficiency with Linear Function Approximation
We study off-dynamics Reinforcement Learning (RL), where the policy is
trained on a source domain and deployed to a distinct target domain. We aim to
solve this problem via online distributionally robust Markov decision processes
(DRMDPs), where the learning algorithm actively interacts with the source
domain while seeking optimal performance under the worst possible dynamics
that lie within an uncertainty set of the source domain's transition kernel. We
provide the first study on online DRMDPs with function approximation for
off-dynamics RL. We find that DRMDPs' dual formulation can induce nonlinearity,
even when the nominal transition kernel is linear, leading to error
propagation. By designing a d-rectangular uncertainty set using the total
variation distance, we remove this additional nonlinearity and bypass the error
propagation. We then introduce DR-LSVI-UCB, the first provably efficient online
DRMDP algorithm for off-dynamics RL with function approximation, and establish
a polynomial suboptimality bound that is independent of the state and action
space sizes. Our work takes a first step towards a deeper understanding of
the provable efficiency of online DRMDPs with linear function approximation.
Finally, we substantiate the performance and robustness of DR-LSVI-UCB through
different numerical experiments.
Comment: 30 pages, 4 figures. To appear in the proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS).
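To illustrate why total variation sets are convenient here (a finite-state sketch of the inner problem only, not the authors' DR-LSVI-UCB algorithm), the worst case over a TV ball is attained by a simple reallocation of probability mass and stays piecewise linear in the value function:

    import numpy as np

    def tv_worst_case_expectation(v, p_nom, rho):
        # Worst-case (minimal) E_P[v] over {P : TV(P, P_nom) <= rho} on a finite
        # state space: move up to rho of probability mass from the highest-value
        # states onto the single lowest-value state.
        p = np.array(p_nom, dtype=float)
        s_min = int(np.argmin(v))
        budget = min(rho, 1.0 - p[s_min])
        for s in np.argsort(v)[::-1]:          # take mass from large values first
            if budget <= 0 or s == s_min:
                continue
            moved = min(p[s], budget)
            p[s] -= moved
            p[s_min] += moved
            budget -= moved
        return float(p @ v), p

No exponential or logarithmic transform of the value appears, which is one way to see how a TV-based construction can sidestep the extra nonlinearity mentioned above.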
On the Convergence of Modified Policy Iteration in Risk Sensitive Exponential Cost Markov Decision Processes
Modified policy iteration (MPI) is a dynamic programming algorithm that
combines elements of policy iteration and value iteration. The convergence of
MPI has been well studied in the context of discounted and average-cost MDPs.
In this work, we consider the exponential-cost risk-sensitive MDP formulation,
which is known to provide some robustness to model parameters. Although policy
iteration and value iteration have been well studied in the context of
risk-sensitive MDPs, MPI remains unexplored. We provide the first proof that MPI
also converges for the risk-sensitive problem in the case of finite state and
action spaces. Since the exponential-cost formulation involves a multiplicative
Bellman equation, our main contribution is a convergence proof that differs
substantially from existing results for discounted and risk-neutral average-cost
problems, as well as from risk-sensitive value and policy iteration approaches. We
conclude our analysis with simulation results, assessing MPI's performance
relative to alternative dynamic programming methods like value iteration and
policy iteration across diverse problem parameters. Our findings highlight
risk-sensitive MPI's enhanced computational efficiency compared to both value
and policy iteration techniques.
Comment: 25 pages, 3 figures. Under review at Operations Research.
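For reference, a sketch of MPI in the familiar risk-neutral discounted-cost setting (the paper's analysis concerns the multiplicative, exponential-cost operator instead, and all names below are ours): each iteration performs one greedy improvement step followed by m sweeps of the fixed-policy operator, so m = 0 recovers value iteration and large m approaches policy iteration.

    import numpy as np

    def modified_policy_iteration(P, c, gamma=0.95, m=10, tol=1e-8, max_iter=1000):
        # P[a, s, s'] : transition kernels, c[s, a] : one-stage costs
        n_actions, n_states, _ = P.shape
        V = np.zeros(n_states)
        rows = np.arange(n_states)
        for _ in range(max_iter):
            # greedy (policy improvement) step
            Q = c + gamma * np.einsum("ast,t->sa", P, V)
            pi = Q.argmin(axis=1)
            V_new = Q.min(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, pi
            V = V_new
            # partial evaluation: m extra sweeps of the fixed-policy operator
            for _ in range(m):
                EV = np.einsum("ast,t->sa", P, V)
                V = c[rows, pi] + gamma * EV[rows, pi]
        return V, pi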