Absorbing Markov Decision Processes
In this paper, we study discrete-time absorbing Markov Decision Processes
(MDP) with measurable state space and Borel action space with a given initial
distribution. For such models, solutions to the characteristic equation that
are not occupation measures may exist. Several necessary and sufficient
conditions are provided to guarantee that any solution to the characteristic
equation is an occupation measure. Under the so-called continuity-compactness
conditions, it is shown that the set of occupation measures is compact in the
weak-strong topology if and only if the model is uniformly absorbing. Finally,
it is shown that the occupation measures are characterized by the
characteristic equation and an additional condition. Several examples are
provided to illustrate our results.
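The characteristic equation has a transparent finite-state analogue. The sketch below is ours, not the paper's (which works in measurable state spaces): for a small absorbing chain under a fixed stationary policy, with invented numbers, the occupation measure solves the characteristic equation as a linear system.

```python
import numpy as np

# Toy illustration (invented data): a 3-state absorbing chain under a fixed
# stationary policy. P is the substochastic transition matrix among the
# transient states; the mass missing from each row is absorbed.
P = np.array([[0.2, 0.5, 0.0],
              [0.0, 0.3, 0.4],
              [0.1, 0.0, 0.2]])
gamma = np.array([1.0, 0.0, 0.0])  # initial distribution over transient states

# The characteristic equation for the state occupation measure mu reads
#   mu = gamma + P^T mu,
# i.e. expected visits = initial mass + mass flowing in from other states.
mu = np.linalg.solve(np.eye(3) - P.T, gamma)

r = np.array([1.0, 2.0, 0.5])      # one-step rewards
total_reward = mu @ r               # expected total reward until absorption
print(mu, total_reward)
```

Since the chain is strictly substochastic, the unique solution of the linear system is automatically the occupation measure; the paper's conditions address exactly when this identification can fail in general models.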
A convex programming approach for discrete-time Markov decision processes under the expected total reward criterion
In this work, we study discrete-time Markov decision processes (MDPs) under constraints, with Borel state and action spaces, where all the performance functions take the form of the expected total reward (ETR) criterion over the infinite time horizon. One of our objectives is to propose a convex programming formulation for this type of MDP. It will be shown that the values of the constrained control problem and the associated convex program coincide, and that if there exists an optimal solution to the convex program then there exists a stationary randomized policy which is optimal for the MDP. It will also be shown that, in the framework of constrained control problems, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum over the set of stationary randomized policies. We consider standard hypotheses such as the so-called continuity-compactness conditions and a Slater-type condition. Our assumptions are weak enough to cover cases that have not yet been addressed in the literature. An example is presented to compare our results with those of the literature.
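Although the paper's convex program lives on Borel spaces, its finite-dimensional shadow is an ordinary linear program over occupation measures. The following sketch (all data invented; substochastic transitions stand in for the conditions that keep the ETR finite) solves one constrained instance with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical finite instance: two states, two actions; transitions are
# substochastic so the expected total reward is finite.
nS, nA = 2, 2
Q = np.zeros((nS, nA, nS))          # Q[s, a, s'] = P(s' | s, a)
Q[0, 0] = [0.5, 0.3]
Q[0, 1] = [0.1, 0.6]
Q[1, 0] = [0.2, 0.4]
Q[1, 1] = [0.3, 0.3]
gamma = np.array([0.7, 0.3])        # initial distribution
r = np.array([[1.0, 0.2], [0.0, 2.0]])   # reward, to be maximized
c = np.array([[0.0, 1.0], [1.0, 0.5]])   # cost, constrained from above

# Variables: mu[s, a] >= 0, flattened. Flow (characteristic) constraints:
#   sum_a mu(s', a) = gamma(s') + sum_{s,a} Q(s' | s, a) mu(s, a)
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = (s == sp) - Q[s, a, sp]

budget = 2.0
res = linprog(-r.ravel(),            # linprog minimizes, so negate
              A_ub=c.ravel()[None, :], b_ub=[budget],
              A_eq=A_eq, b_eq=gamma, bounds=(0, None))
mu = res.x.reshape(nS, nA)
print(res.status, mu, -res.fun)
```

A maximizer mu of this LP yields a stationary randomized policy by normalization, pi(a | s) = mu(s, a) / sum_a mu(s, a) wherever the denominator is positive, mirroring the paper's passage from the convex program back to an optimal policy.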
Extreme occupation measures in Markov decision processes with a cemetery
In this paper, we consider a Markov decision process (MDP) with a Borel state
space containing a distinguished absorbing state (cemetery), and a Borel
action space. We consider the space of finite occupation measures restricted
to the states outside the cemetery, and the extreme points in it. It is
possible that some strategies have infinite
occupation measures. Nevertheless, we prove that every finite extreme
occupation measure is generated by a deterministic stationary strategy. Then,
for this MDP, we consider a constrained problem with total undiscounted
criteria and constraints, where the cost functions are nonnegative. By
assumption, the strategies inducing infinite occupation measures are not
optimal. Then, our second main result is that, under mild conditions, the
solution to this constrained MDP is given by a mixture of finitely many
occupation measures generated by deterministic stationary strategies.
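The extreme-point statement can be sanity-checked in a finite toy model (our construction, with invented numbers; the paper works on Borel spaces): the occupation measure of a randomized stationary strategy lies in the convex hull of those of the deterministic stationary strategies, so only the deterministic ones can be extreme.

```python
import numpy as np
from scipy.optimize import linprog
from itertools import product

# Finite toy check: 2 states, 2 actions, substochastic kernel (uniformly
# transient), so every stationary strategy has a finite occupation measure.
nS, nA = 2, 2
Q = np.array([[[0.5, 0.3], [0.1, 0.6]],
              [[0.2, 0.4], [0.3, 0.3]]])   # Q[s, a, s']
gamma = np.array([0.7, 0.3])

def occupation(policy):                     # policy[s, a] = prob of a in s
    P = np.einsum('sa,sap->sp', policy, Q)  # induced state transition matrix
    nu = np.linalg.solve(np.eye(nS) - P.T, gamma)   # expected state visits
    return (policy * nu[:, None]).ravel()   # mu(s, a) = nu(s) * policy(a | s)

# occupation measures of the four deterministic stationary strategies
det = [occupation(np.eye(nA)[list(acts)])
       for acts in product(range(nA), repeat=nS)]

# a randomized strategy's occupation measure ...
mu_rand = occupation(np.full((nS, nA), 0.5))

# ... is a convex combination of the deterministic ones (feasibility LP:
# find nonnegative weights summing to 1 that reproduce mu_rand)
A_eq = np.vstack([np.column_stack(det), np.ones(len(det))])
res = linprog(np.zeros(len(det)), A_eq=A_eq,
              b_eq=np.append(mu_rand, 1.0), bounds=(0, None))
print(res.status, res.x)                    # status 0: such weights exist
```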
Goodhart's Law in Reinforcement Learning
Implementing a reward function that perfectly captures a complex task in the
real world is impractical. As a result, it is often appropriate to think of the
reward function as a proxy for the true objective rather than as its
definition. We study this phenomenon through the lens of Goodhart's law, which
predicts that increasing optimisation of an imperfect proxy beyond some
critical point decreases performance on the true objective. First, we propose a
way to quantify the magnitude of this effect and show empirically that
optimising an imperfect proxy reward often leads to the behaviour predicted by
Goodhart's law for a wide range of environments and reward functions. We then
provide a geometric explanation for why Goodhart's law occurs in Markov
decision processes. We use these theoretical insights to propose an optimal
early stopping method that provably avoids the aforementioned pitfall and
derive theoretical regret bounds for this method. Moreover, we derive a
training method that maximises worst-case reward, for the setting where there
is uncertainty about the true reward function. Finally, we evaluate our early
stopping method experimentally. Our results lay a foundation for a
theoretically principled study of reinforcement learning under reward
misspecification.
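The qualitative prediction, that optimizing a proxy helps at first and then hurts, can be reproduced in a deliberately minimal toy (our construction, not the paper's): gradient ascent on a misspecified proxy overshoots the true optimum, so the true objective peaks at an intermediate step, exactly where an early stopping rule would want to halt.

```python
import numpy as np

# Toy illustration (invented functions): a 1-D "policy" theta, a true
# objective, and a proxy with an additive misspecification term.
def true_reward(theta):
    return -(theta - 1.0) ** 2          # true optimum at theta = 1

def proxy_reward(theta):
    return true_reward(theta) + 0.8 * theta   # imperfect proxy, optimum at 1.4

theta, lr = 0.0, 0.05
true_along_the_way = []
for _ in range(200):
    grad = -2.0 * (theta - 1.0) + 0.8   # d(proxy)/d(theta)
    theta += lr * grad                   # ascend the PROXY
    true_along_the_way.append(true_reward(theta))

best_step = int(np.argmax(true_along_the_way))
# True performance rises while theta approaches 1, then falls as proxy
# optimization drags theta on toward 1.4: Goodhart's law in one dimension.
print(best_step, true_along_the_way[best_step], true_along_the_way[-1])
```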
Analysis and Simplex-type Algorithms for Countably Infinite Linear Programming Models of Markov Decision Processes.
The class of Markov decision processes (MDPs) provides a popular framework which covers a wide variety of sequential decision-making problems. We consider infinite-horizon discounted MDPs with countably infinite state space and finite action space. Our goal is to establish theoretical properties and develop new solution methods for such MDPs by studying their linear programming (LP) formulations. The LP formulations have countably infinite numbers of variables and constraints and therefore are called countably infinite linear programs (CILPs). General CILPs are challenging to analyze or solve, mainly because useful theoretical properties and techniques of finite LPs fail to extend to general CILPs. Another goal of this thesis is to deepen the limited current understanding of CILPs, resulting in new algorithmic approaches to find their solutions.
Recently, Ghate and Smith (2013) developed an implementable simplex-type algorithm for solving a CILP formulation of a non-stationary MDP with finite state space. We establish rate of convergence results for their simplex algorithm with a particular pivoting rule and another existing solution method for such MDPs, and compare empirical performance of the algorithms. We also present ways to accelerate their simplex algorithm.
The class of non-stationary MDPs with finite state space can be considered to be a subclass of stationary MDPs with countably infinite state space. We present a simplex-type algorithm for solving a CILP formulation of a stationary MDP with countably infinite state space that is implementable (using only finite data and computation in each iteration). We show that the algorithm finds a sequence of policies that improves monotonically and converges to optimality in value, and present a numerical illustration.
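The "implementable" requirement above, finite data and computation per iteration, can be illustrated with a simple truncation sketch (this is not the thesis's simplex algorithm): on a countably infinite state space, each computation touches only a finite truncation, and the value obtained converges as the truncation grows.

```python
import numpy as np

# Sketch (invented instance): a discounted MDP on the countable state space
# {0, 1, 2, ...}. Each call only touches the finite truncation {0, ..., N}.
beta = 0.9                           # discount factor

def value_at_zero(N, iters=500):
    # actions: "stay" (reward 1, stay put) or "move" (reward 2, state + 1)
    V = np.zeros(N + 2)              # V[N + 1] pessimistically pinned at 0
    for _ in range(iters):           # value iteration on the truncation
        stay = 1.0 + beta * V[:N + 1]
        move = 2.0 + beta * V[1:N + 2]
        V[:N + 1] = np.maximum(stay, move)
    return V[0]

vals = [value_at_zero(N) for N in (5, 10, 20, 40)]
print(vals)   # increasing in N, approaching 2 / (1 - beta) = 20
```

Each truncation underestimates the true optimal value (always moving right earns 2 per step, worth 2/(1 - beta) = 20 from any state), and the estimates improve monotonically as N grows, the kind of finite-computation convergence guarantee an implementable CILP method must provide.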
An important extension of the MDPs considered so far is the class of constrained MDPs, which optimize an objective function while satisfying constraints, typically on budget, quality, and so on. For constrained non-stationary MDPs with finite state space, we provide a necessary and sufficient condition for a feasible solution of the CILP formulation to be an extreme point. Since simplex-type algorithms are expected to navigate between extreme points, this result sets a foundation for developing a simplex-type algorithm for constrained non-stationary MDPs.
PhD thesis, Industrial and Operations Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113486/1/ilbinlee_1.pd