Absorbing Markov Decision Processes
In this paper, we study discrete-time absorbing Markov Decision Processes
(MDP) with measurable state space and Borel action space with a given initial
distribution. For such models, solutions to the characteristic equation that
are not occupation measures may exist. Several necessary and sufficient
conditions are provided to guarantee that any solution to the characteristic
equation is an occupation measure. Under the so-called continuity-compactness
conditions, it is shown that the set of occupation measures is compact in the
weak-strong topology if and only if the model is uniformly absorbing. Finally,
it is shown that the occupation measures are characterized by the
characteristic equation and an additional condition. Several examples are
provided to illustrate our results.
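The characteristic equation has a transparent finite-state analogue. The sketch below is ours, not the paper's (which works in measurable state spaces): for a small absorbing chain under a fixed stationary policy, with invented numbers, the occupation measure solves the characteristic equation as a linear system.

```python
import numpy as np

# Toy illustration (invented data): a 3-state absorbing chain under a fixed
# stationary policy. P is the substochastic transition matrix among the
# transient states; the mass missing from each row is absorbed.
P = np.array([[0.2, 0.5, 0.0],
              [0.0, 0.3, 0.4],
              [0.1, 0.0, 0.2]])
gamma = np.array([1.0, 0.0, 0.0])  # initial distribution over transient states

# The characteristic equation for the state occupation measure mu reads
#   mu = gamma + P^T mu,
# i.e. expected visits = initial mass + mass flowing in from other states.
mu = np.linalg.solve(np.eye(3) - P.T, gamma)

r = np.array([1.0, 2.0, 0.5])      # one-step rewards
total_reward = mu @ r               # expected total reward until absorption
print(mu, total_reward)
```

Since the chain is strictly substochastic, the unique solution of the linear system is automatically the occupation measure; the paper's conditions address exactly when this identification can fail in general models.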
A convex programming approach for discrete-time Markov decision processes under the expected total reward criterion
In this work, we study discrete-time Markov decision processes (MDPs) under constraints, with Borel state and action spaces, where all the performance functions take the form of the expected total reward (ETR) criterion over the infinite time horizon. One of our objectives is to propose a convex programming formulation for this type of MDP. It will be shown that the values of the constrained control problem and the associated convex program coincide, and that if there exists an optimal solution to the convex program then there exists a stationary randomized policy which is optimal for the MDP. It will also be shown that, in the framework of constrained control problems, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum over the set of stationary randomized policies. We consider standard hypotheses such as the so-called continuity-compactness conditions and a Slater-type condition. Our assumptions are weak enough to cover cases that have not yet been addressed in the literature. An example is presented to compare our results with those of the literature.
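Although the paper's convex program lives on Borel spaces, its finite-dimensional shadow is an ordinary linear program over occupation measures. The following sketch (all data invented; substochastic transitions stand in for the conditions that keep the ETR finite) solves one constrained instance with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical finite instance: two states, two actions; transitions are
# substochastic so the expected total reward is finite.
nS, nA = 2, 2
Q = np.zeros((nS, nA, nS))          # Q[s, a, s'] = P(s' | s, a)
Q[0, 0] = [0.5, 0.3]
Q[0, 1] = [0.1, 0.6]
Q[1, 0] = [0.2, 0.4]
Q[1, 1] = [0.3, 0.3]
gamma = np.array([0.7, 0.3])        # initial distribution
r = np.array([[1.0, 0.2], [0.0, 2.0]])   # reward, to be maximized
c = np.array([[0.0, 1.0], [1.0, 0.5]])   # cost, constrained from above

# Variables: mu[s, a] >= 0, flattened. Flow (characteristic) constraints:
#   sum_a mu(s', a) = gamma(s') + sum_{s,a} Q(s' | s, a) mu(s, a)
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = (s == sp) - Q[s, a, sp]

budget = 2.0
res = linprog(-r.ravel(),            # linprog minimizes, so negate
              A_ub=c.ravel()[None, :], b_ub=[budget],
              A_eq=A_eq, b_eq=gamma, bounds=(0, None))
mu = res.x.reshape(nS, nA)
print(res.status, mu, -res.fun)
```

A maximizer mu of this LP yields a stationary randomized policy by normalization, pi(a | s) = mu(s, a) / sum_a mu(s, a) wherever the denominator is positive, mirroring the paper's passage from the convex program back to an optimal policy.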
Extreme occupation measures in Markov decision processes with a cemetery
In this paper, we consider a Markov decision process (MDP) with a Borel state
space containing a distinguished absorbing state (cemetery), and a Borel
action space. We consider the space of finite occupation measures restricted
to the states outside the cemetery, and the extreme points in it. It is
possible that some strategies have infinite
occupation measures. Nevertheless, we prove that every finite extreme
occupation measure is generated by a deterministic stationary strategy. Then,
for this MDP, we consider a constrained problem with total undiscounted
criteria and constraints, where the cost functions are nonnegative. By
assumption, the strategies inducing infinite occupation measures are not
optimal. Then, our second main result is that, under mild conditions, the
solution to this constrained MDP is given by a mixture of finitely many
occupation measures generated by deterministic stationary strategies.
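The extreme-point statement can be sanity-checked in a finite toy model (our construction, with invented numbers; the paper works on Borel spaces): the occupation measure of a randomized stationary strategy lies in the convex hull of those of the deterministic stationary strategies, so only the deterministic ones can be extreme.

```python
import numpy as np
from scipy.optimize import linprog
from itertools import product

# Finite toy check: 2 states, 2 actions, substochastic kernel (uniformly
# transient), so every stationary strategy has a finite occupation measure.
nS, nA = 2, 2
Q = np.array([[[0.5, 0.3], [0.1, 0.6]],
              [[0.2, 0.4], [0.3, 0.3]]])   # Q[s, a, s']
gamma = np.array([0.7, 0.3])

def occupation(policy):                     # policy[s, a] = prob of a in s
    P = np.einsum('sa,sap->sp', policy, Q)  # induced state transition matrix
    nu = np.linalg.solve(np.eye(nS) - P.T, gamma)   # expected state visits
    return (policy * nu[:, None]).ravel()   # mu(s, a) = nu(s) * policy(a | s)

# occupation measures of the four deterministic stationary strategies
det = [occupation(np.eye(nA)[list(acts)])
       for acts in product(range(nA), repeat=nS)]

# a randomized strategy's occupation measure ...
mu_rand = occupation(np.full((nS, nA), 0.5))

# ... is a convex combination of the deterministic ones (feasibility LP:
# find nonnegative weights summing to 1 that reproduce mu_rand)
A_eq = np.vstack([np.column_stack(det), np.ones(len(det))])
res = linprog(np.zeros(len(det)), A_eq=A_eq,
              b_eq=np.append(mu_rand, 1.0), bounds=(0, None))
print(res.status, res.x)                    # status 0: such weights exist
```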
Goodhart's Law in Reinforcement Learning
Implementing a reward function that perfectly captures a complex task in the
real world is impractical. As a result, it is often appropriate to think of the
reward function as a proxy for the true objective rather than as its
definition. We study this phenomenon through the lens of Goodhart's law, which
predicts that increasing optimisation of an imperfect proxy beyond some
critical point decreases performance on the true objective. First, we propose a
way to quantify the magnitude of this effect and show empirically that
optimising an imperfect proxy reward often leads to the behaviour predicted by
Goodhart's law for a wide range of environments and reward functions. We then
provide a geometric explanation for why Goodhart's law occurs in Markov
decision processes. We use these theoretical insights to propose an optimal
early stopping method that provably avoids the aforementioned pitfall and
derive theoretical regret bounds for this method. Moreover, we derive a
training method that maximises worst-case reward, for the setting where there
is uncertainty about the true reward function. Finally, we evaluate our early
stopping method experimentally. Our results lay a foundation for a
theoretically principled study of reinforcement learning under reward
misspecification.
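The qualitative prediction, that optimizing a proxy helps at first and then hurts, can be reproduced in a deliberately minimal toy (our construction, not the paper's): gradient ascent on a misspecified proxy overshoots the true optimum, so the true objective peaks at an intermediate step, exactly where an early stopping rule would want to halt.

```python
import numpy as np

# Toy illustration (invented functions): a 1-D "policy" theta, a true
# objective, and a proxy with an additive misspecification term.
def true_reward(theta):
    return -(theta - 1.0) ** 2          # true optimum at theta = 1

def proxy_reward(theta):
    return true_reward(theta) + 0.8 * theta   # imperfect proxy, optimum at 1.4

theta, lr = 0.0, 0.05
true_along_the_way = []
for _ in range(200):
    grad = -2.0 * (theta - 1.0) + 0.8   # d(proxy)/d(theta)
    theta += lr * grad                   # ascend the PROXY
    true_along_the_way.append(true_reward(theta))

best_step = int(np.argmax(true_along_the_way))
# True performance rises while theta approaches 1, then falls as proxy
# optimization drags theta on toward 1.4: Goodhart's law in one dimension.
print(best_step, true_along_the_way[best_step], true_along_the_way[-1])
```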
Analysis and Simplex-type Algorithms for Countably Infinite Linear Programming Models of Markov Decision Processes.
The class of Markov decision processes (MDPs) provides a popular framework which covers a wide variety of sequential decision-making problems. We consider infinite-horizon discounted MDPs with countably infinite state space and finite action space. Our goal is to establish theoretical properties and develop new solution methods for such MDPs by studying their linear programming (LP) formulations. The LP formulations have countably infinite numbers of variables and constraints and therefore are called countably infinite linear programs (CILPs). General CILPs are challenging to analyze or solve, mainly because useful theoretical properties and techniques of finite LPs fail to extend to general CILPs. Another goal of this thesis is to deepen the limited current understanding of CILPs, resulting in new algorithmic approaches to find their solutions.
Recently, Ghate and Smith (2013) developed an implementable simplex-type algorithm for solving a CILP formulation of a non-stationary MDP with finite state space. We establish rate of convergence results for their simplex algorithm with a particular pivoting rule and another existing solution method for such MDPs, and compare empirical performance of the algorithms. We also present ways to accelerate their simplex algorithm.
The class of non-stationary MDPs with finite state space can be considered to be a subclass of stationary MDPs with countably infinite state space. We present a simplex-type algorithm for solving a CILP formulation of a stationary MDP with countably infinite state space that is implementable (using only finite data and computation in each iteration). We show that the algorithm finds a sequence of policies that improves monotonically and converges to optimality in value, and present a numerical illustration.
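The "implementable" requirement above, finite data and computation per iteration, can be illustrated with a simple truncation sketch (this is not the thesis's simplex algorithm): on a countably infinite state space, each computation touches only a finite truncation, and the value obtained converges as the truncation grows.

```python
import numpy as np

# Sketch (invented instance): a discounted MDP on the countable state space
# {0, 1, 2, ...}. Each call only touches the finite truncation {0, ..., N}.
beta = 0.9                           # discount factor

def value_at_zero(N, iters=500):
    # actions: "stay" (reward 1, stay put) or "move" (reward 2, state + 1)
    V = np.zeros(N + 2)              # V[N + 1] pessimistically pinned at 0
    for _ in range(iters):           # value iteration on the truncation
        stay = 1.0 + beta * V[:N + 1]
        move = 2.0 + beta * V[1:N + 2]
        V[:N + 1] = np.maximum(stay, move)
    return V[0]

vals = [value_at_zero(N) for N in (5, 10, 20, 40)]
print(vals)   # increasing in N, approaching 2 / (1 - beta) = 20
```

Each truncation underestimates the true optimal value (always moving right earns 2 per step, worth 2/(1 - beta) = 20 from any state), and the estimates improve monotonically as N grows, the kind of finite-computation convergence guarantee an implementable CILP method must provide.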
An important extension of the MDPs considered so far is the class of constrained MDPs, which optimize an objective function while satisfying constraints, typically on budget, quality, and so on. For constrained non-stationary MDPs with finite state space, we provide a necessary and sufficient condition for a feasible solution of the CILP formulation to be an extreme point. Since simplex-type algorithms are expected to navigate between extreme points, this result sets a foundation for developing a simplex-type algorithm for constrained non-stationary MDPs.
PhD thesis, Industrial and Operations Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113486/1/ilbinlee_1.pd