Note on discounted continuous-time Markov decision processes with a lower bounding function
In this paper, we consider the discounted continuous-time Markov decision
process (CTMDP) with a lower bounding function. In this model, the negative
part of each cost rate is bounded by the drift function, whereas the
positive part is allowed to be arbitrarily unbounded. Our focus is on the
existence of a stationary policy that is optimal for the discounted CTMDP
problems within the more general class of policies. Both constrained and
unconstrained problems are
considered. Our investigations are based on a useful transformation for
nonhomogeneous Markov pure jump processes that has not yet been widely applied
to the study of CTMDPs. This technique was not employed in previous literature,
but it clarifies the roles of the imposed conditions in a rather transparent
way. As a consequence, we withdraw and weaken several conditions commonly
imposed in the literature.
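For orientation, the discounted criterion can be illustrated on a much simpler model than the one the abstract treats. The sketch below assumes a finite state space, finitely many actions and bounded transition rates, and it uses uniformization, a standard technique that is not the transformation the paper refers to. The names `rates`, `cost_rates` and `discount` are hypothetical. It iterates the optimality equation discount * V(i) = min over a of { c(i, a) + sum over j of q(j | i, a) V(j) }.

```python
def discounted_values(rates, cost_rates, discount, tol=1e-10, max_iter=100_000):
    """Value iteration for a finite discounted CTMDP (illustrative only).

    rates[i][a][j] is the transition rate q(j | i, a), with each row summing
    to zero; cost_rates[i][a] is the cost rate c(i, a); discount > 0.
    """
    n = len(rates)
    # Uniformization constant: strictly larger than every exit rate -q(i | i, a).
    lam = 1.0 + max(-rates[i][a][i] for i in range(n) for a in range(len(rates[i])))
    v = [0.0] * n
    for _ in range(max_iter):
        new_v = []
        for i in range(n):
            best = float("inf")
            for a, q in enumerate(rates[i]):
                # With p = I + q / lam, the optimality equation becomes
                # (discount + lam) V(i) = c(i, a) + lam * sum_j p(j | i, a) V(j).
                expected = v[i] + sum(q[j] * v[j] for j in range(n)) / lam
                best = min(best, (cost_rates[i][a] + lam * expected) / (discount + lam))
            new_v.append(best)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v
```

In a two-state example where state 0 incurs cost rate 1 and jumps at rate 1 to a cost-free absorbing state, the value at state 0 with discount 1 is 1 / (1 + 1) = 0.5, which the iteration recovers. Cost rates may be negative here, which loosely mirrors the role of a lower bound on the cost in the abstract.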
Average optimality for continuous-time Markov decision processes under weak continuity conditions
This article considers the average optimality for a continuous-time Markov
decision process with Borel state and action spaces and an arbitrarily
unbounded nonnegative cost rate. The existence of a deterministic stationary
optimal policy is proved under a different and general set of conditions as
compared to the previous literature; the controlled process can be explosive,
the transition rates can be arbitrarily unbounded and are weakly continuous,
the multifunction defining the admissible action spaces can be neither
compact-valued nor upper semi-continuous, and the cost rate is not necessarily
inf-compact.
Smoothing Policies and Safe Policy Gradients
Policy gradient algorithms are among the best candidates for the much
anticipated application of reinforcement learning to real-world control tasks,
such as the ones arising in robotics. However, the trial-and-error nature of
these methods introduces safety issues whenever the learning phase itself must
be performed on a physical system. In this paper, we address a specific safety
formulation, where danger is encoded in the reward signal and the learning
agent is constrained to never worsen its performance. By studying actor-only
policy gradient from a stochastic optimization perspective, we establish
improvement guarantees for a wide class of parametric policies, generalizing
existing results on Gaussian policies. This, together with novel upper bounds
on the variance of policy gradient estimators, allows us to identify those
meta-parameter schedules that guarantee monotonic improvement with high
probability. The two key meta-parameters are the step size of the parameter
updates and the batch size of the gradient estimators. By a joint, adaptive
selection of these meta-parameters, we obtain a safe policy gradient algorithm.
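The interplay between step size and gradient accuracy can be sketched with a textbook smoothness argument; this is not the paper's own bound. Assume the objective J is L-smooth and the estimated gradient g_hat is within `grad_error` of the true gradient in Euclidean norm. Then J(theta + step * g_hat) - J(theta) >= step * ||g_hat|| * (||g_hat|| - grad_error) - (L / 2) * step^2 * ||g_hat||^2. All names below are hypothetical, and in practice `grad_error` would come from a high-probability bound that shrinks as the batch size grows.

```python
def improvement_lower_bound(grad_norm, grad_error, smoothness, step):
    """Worst-case improvement of one gradient step under the stated assumptions."""
    linear_gain = step * grad_norm * (grad_norm - grad_error)
    curvature_loss = 0.5 * smoothness * step ** 2 * grad_norm ** 2
    return linear_gain - curvature_loss


def safe_step_size(grad_norm, grad_error, smoothness):
    """Step size maximizing the lower bound; zero when no improvement is guaranteed."""
    if smoothness <= 0.0:
        raise ValueError("smoothness must be positive")
    if grad_norm <= grad_error:
        # The estimate is too noisy to certify any improvement: do not move.
        return 0.0
    return (grad_norm - grad_error) / (smoothness * grad_norm)
```

At the maximizing step the bound equals (||g_hat|| - grad_error)^2 / (2 L), so a smaller estimation error, for instance from a larger batch, permits a larger step and a larger guaranteed gain.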
Risk Aversion in Finite Markov Decision Processes Using Total Cost Criteria and Average Value at Risk
In this paper we present an algorithm to compute risk averse policies in
Markov Decision Processes (MDP) when the total cost criterion is used together
with the average value at risk (AVaR) metric. Risk averse policies are needed
when large deviations from the expected behavior may have detrimental effects,
and conventional MDP algorithms usually ignore this aspect. We provide
conditions for the structure of the underlying MDP ensuring that approximations
for the exact problem can be derived and solved efficiently. Our findings are
novel inasmuch as average value at risk has not previously been considered in
association with the total cost criterion. Our method is demonstrated in a
rapid deployment scenario, whereby a robot is tasked with the objective of
reaching a target location within a temporal deadline where increased speed is
associated with increased probability of failure. We demonstrate that the
proposed algorithm not only produces a risk averse policy reducing the
probability of exceeding the expected temporal deadline, but also provides the
statistical distribution of costs, thus offering a valuable analysis tool.
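As a reminder of what the AVaR metric measures, the following minimal sketch estimates it from a list of sampled total costs. It is not the algorithm of the paper. It assumes the convention in which AVaR at level alpha is, roughly, the mean of the worst (1 - alpha) fraction of costs, computed through the Rockafellar-Uryasev form s + E[(X - s)^+] / (1 - alpha) evaluated at the empirical value at risk s. The function name and arguments are hypothetical.

```python
import math


def empirical_avar(costs, alpha):
    """Average value at risk of sampled costs at level alpha in [0, 1)."""
    if not costs:
        raise ValueError("costs must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    xs = sorted(costs)
    n = len(xs)
    # Empirical value at risk: smallest sample s with P(X <= s) >= alpha.
    k = max(math.ceil(alpha * n) - 1, 0)
    value_at_risk = xs[k]
    # Mean excess over the value at risk, rescaled by the tail probability.
    mean_excess = sum(max(x - value_at_risk, 0.0) for x in xs) / n
    return value_at_risk + mean_excess / (1.0 - alpha)
```

With alpha = 0 the estimate reduces to the sample mean, and as alpha approaches 1 it concentrates on the largest observed costs, which is why the metric penalizes rare but severe deadline overruns.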