Note on Discounted Continuous-Time Markov Decision Processes with a Lower Bounding Function

Authors: Guo Xin, Piunovskiy Alexey, Zhang Yi
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 02/12/2016

In this paper, we consider the discounted continuous-time Markov decision process (CTMDP) with a lower bounding function. In this model, the negative part of each cost rate is bounded by the drift function, say $w$, whereas the positive part is allowed to be arbitrarily unbounded. Our focus is on the existence of a stationary optimal policy for the discounted CTMDP problems out of the more general class. Both constrained and unconstrained problems are considered. Our investigations are based on a useful transformation for nonhomogeneous Markov pure jump processes that has not yet been widely applied to the study of CTMDPs. This technique was not employed in previous literature, but it clarifies the roles of the imposed conditions in a rather transparent way. As a consequence, we withdraw and weaken several conditions commonly imposed in the literature.

arXiv.org e-Print Archive

University of Liverpool Repository

Crossref

University of Birmingham Research Portal

Average optimality for continuous-time Markov decision processes under weak continuity conditions

Author: Zhang Yi
Publication date: 04/03/2014

This article considers the average optimality for a continuous-time Markov decision process with Borel state and action spaces and an arbitrarily unbounded nonnegative cost rate. The existence of a deterministic stationary optimal policy is proved under a different and more general set of conditions than in the previous literature: the controlled process can be explosive, the transition rates can be arbitrarily unbounded and are weakly continuous, the multifunction defining the admissible action spaces can be neither compact-valued nor upper semi-continuous, and the cost rate is not necessarily inf-compact.

arXiv.org e-Print Archive

Crossref

University of Birmingham Research Portal

Smoothing Policies and Safe Policy Gradients

Authors: Papini Matteo, Pirotta Matteo, Restelli Marcello
Publication date: 08/05/2019

Policy gradient algorithms are among the best candidates for the much-anticipated application of reinforcement learning to real-world control tasks, such as the ones arising in robotics. However, the trial-and-error nature of these methods introduces safety issues whenever the learning phase itself must be performed on a physical system. In this paper, we address a specific safety formulation, where danger is encoded in the reward signal and the learning agent is constrained to never worsen its performance. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify those meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimators. By a joint, adaptive selection of these meta-parameters, we obtain a safe policy gradient algorithm.

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Politecnico di Milano

UPF Digital Repository

Risk Aversion in Finite Markov Decision Processes Using Total Cost Criteria and Average Value at Risk

Authors: Carpin Stefano, Chow Yin-Lam, Pavone Marco
Publication date: 01/01/2016

In this paper we present an algorithm to compute risk-averse policies in Markov Decision Processes (MDP) when the total cost criterion is used together with the average value at risk (AVaR) metric. Risk-averse policies are needed when large deviations from the expected behavior may have detrimental effects, and conventional MDP algorithms usually ignore this aspect. We provide conditions for the structure of the underlying MDP ensuring that approximations for the exact problem can be derived and solved efficiently. Our findings are novel inasmuch as average value at risk has not previously been considered in association with the total cost criterion. Our method is demonstrated in a rapid deployment scenario, whereby a robot is tasked with the objective of reaching a target location within a temporal deadline where increased speed is associated with increased probability of failure. We demonstrate that the proposed algorithm not only produces a risk-averse policy reducing the probability of exceeding the expected temporal deadline, but also provides the statistical distribution of costs, thus offering a valuable analysis tool.
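For readers unfamiliar with the AVaR metric named in this abstract, the following is a minimal sketch of how it can be estimated from a finite sample of total costs. It is not the algorithm of the paper. It assumes the common Rockafellar-Uryasev form, AVaR_alpha(X) = min over s of s + E[(X - s)^+] / (1 - alpha), with alpha read as a confidence level in [0, 1); the paper may use a different convention. The function name `empirical_avar` and the sample data are hypothetical.

```python
def empirical_avar(costs, alpha):
    """Empirical average value at risk of a sample of costs.

    Assumed convention (not taken from the paper): alpha in [0, 1) is a
    confidence level, so alpha = 0 gives the plain mean and larger alpha
    weights the high-cost tail more heavily.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    xs = [float(c) for c in costs]
    if not xs:
        raise ValueError("costs must be non-empty")
    n = len(xs)

    def objective(s):
        # Rockafellar-Uryasev objective: s + E[(X - s)^+] / (1 - alpha)
        return s + sum(max(x - s, 0.0) for x in xs) / (n * (1.0 - alpha))

    # The objective is convex and piecewise linear with kinks at the
    # sample points, so its minimum is attained at one of them.
    return min(objective(s) for s in xs)


# Hypothetical total costs of two policies with the same mean cost.
safe_costs = [10, 10, 10, 10]
risky_costs = [5, 5, 5, 25]
safe_avar = empirical_avar(safe_costs, 0.75)
risky_avar = empirical_avar(risky_costs, 0.75)
```

Both hypothetical samples have the same mean, but the second has a larger AVaR at level 0.75 because of its single high-cost outcome, which is the kind of tail behavior a risk-averse criterion penalizes.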

arXiv.org e-Print Archive

Crossref

eScholarship - University of California