Value Iteration for Long-run Average Reward in Markov Decision Processes
Markov decision processes (MDPs) are standard models for probabilistic
systems with non-deterministic behaviours. Long-run average rewards provide a
mathematically elegant formalism for expressing long-term performance. Value
iteration (VI) is one of the simplest and most efficient algorithmic approaches
to MDPs with other properties, such as reachability objectives. Unfortunately,
a naive extension of VI does not work for MDPs with long-run average rewards,
as there is no known stopping criterion. In this work our contributions are
threefold. (1) We refute a conjecture related to stopping criteria for MDPs
with long-run average rewards. (2) We present two practical algorithms for MDPs
with long-run average rewards based on VI. First, we show that a combination of
applying VI locally for each maximal end-component (MEC) and VI for
reachability objectives can provide approximation guarantees. Second, extending
the above approach with a simulation-guided on-demand variant of VI, we present
an anytime algorithm that is able to deal with very large models. (3) Finally,
we present experimental results showing that our methods significantly
outperform the standard approaches on several benchmarks.
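As a point of reference for the VI baseline this abstract builds on, the following is a minimal, illustrative sketch of value iteration for a reachability objective on a toy MDP. The function, the example MDP, and all names are invented for illustration; the absolute-difference stopping rule used here is exactly the kind of naive criterion that, per the abstract, has no known sound analogue for long-run average rewards.

```python
def value_iteration(states, actions, trans, target, eps=1e-8):
    """Maximize reachability probability on a toy MDP.
    trans[(s, a)] = list of (next_state, probability)."""
    v = {s: 1.0 if s in target else 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            acts = actions.get(s)
            if s in target or not acts:
                continue  # target states and states without actions stay fixed
            best = max(
                sum(p * v[t] for t, p in trans[(s, a)])
                for a in acts
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < eps:  # naive stopping criterion
            return v

# Tiny invented example: from s0, action 'a' reaches the goal with
# probability 0.5 and otherwise loops; action 'b' moves to a sink.
states = ["s0", "goal", "sink"]
actions = {"s0": ["a", "b"]}
trans = {
    ("s0", "a"): [("goal", 0.5), ("s0", 0.5)],
    ("s0", "b"): [("sink", 1.0)],
}
v = value_iteration(states, actions, trans, target={"goal"})
```

Repeating action 'a' reaches the goal with probability 1 in the limit, so the computed value of s0 converges to 1.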
Efficient Strategy Iteration for Mean Payoff in Markov Decision Processes
Markov decision processes (MDPs) are standard models for probabilistic
systems with non-deterministic behaviours. Mean payoff (or long-run average
reward) provides a mathematically elegant formalism to express performance
related properties. Strategy iteration is one of the solution techniques
applicable in this context. While in many other contexts it is the technique of
choice due to advantages over e.g. value iteration, such as precision or
the possibility of domain-knowledge-aware initialization, it is rarely used for
MDPs, since it scales worse than value iteration there. We provide several
techniques that speed up strategy iteration by orders of magnitude for many
MDPs, eliminating the performance disadvantage while preserving all of its
advantages.
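To illustrate the generic evaluate-then-improve loop of strategy (policy) iteration, here is a hedged sketch on a discounted toy objective. The evaluation step for mean payoff in the paper's setting is more involved; the MDP, the function, and all names below are invented for illustration.

```python
def policy_iteration(states, actions, trans, reward, gamma=0.9, eps=1e-10):
    """Generic strategy-iteration loop on a discounted toy objective:
    evaluate the current strategy, then greedily improve it, and stop
    once no action change strictly improves the value."""
    pol = {s: actions[s][0] for s in states}  # arbitrary initial strategy
    while True:
        # Strategy evaluation by iterating the Bellman equation for pol.
        v = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                new = reward[(s, pol[s])] + gamma * sum(
                    p * v[t] for t, p in trans[(s, pol[s])])
                delta = max(delta, abs(new - v[s]))
                v[s] = new
            if delta < eps:
                break
        # Greedy strategy improvement.
        stable = True
        for s in states:
            def q(a):
                return reward[(s, a)] + gamma * sum(
                    p * v[t] for t, p in trans[(s, a)])
            best = max(actions[s], key=q)
            if q(best) > q(pol[s]) + 1e-12:  # only switch on strict improvement
                pol[s] = best
                stable = False
        if stable:
            return pol, v

# Invented two-state example: staying in s0 pays 1 per step, while
# moving to s1 ("go") pays nothing once but then earns 2 per step.
states = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay"]}
trans = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "go"): [("s1", 1.0)],
    ("s1", "stay"): [("s1", 1.0)],
}
reward = {("s0", "stay"): 1.0, ("s0", "go"): 0.0, ("s1", "stay"): 2.0}
pol, v = policy_iteration(states, actions, trans, reward)
```

With gamma = 0.9, giving up the immediate reward and moving to s1 is optimal (value 18 versus 10 for staying), so the loop switches s0's action to "go" in one improvement step and then terminates.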
Robust Satisfaction of Temporal Logic Specifications via Reinforcement Learning
We consider the problem of steering a system with unknown, stochastic
dynamics to satisfy a rich, temporally layered task given as a signal temporal
logic formula. We represent the system as a Markov decision process in which
the states are built from a partition of the state space and the transition
probabilities are unknown. We present provably convergent reinforcement
learning algorithms to maximize the probability of satisfying a given formula
and to maximize the average expected robustness, i.e., a measure of how
strongly the formula is satisfied. We demonstrate via a pair of robot
navigation simulation case studies that reinforcement learning with robustness
maximization performs better than probability maximization in terms of both
probability of satisfaction and expected robustness.
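The paper's algorithms operate on an MDP built from a state-space partition and use signal temporal logic robustness; as a loose illustration of the underlying tabular reinforcement-learning machinery only, here is a generic Q-learning sketch on a made-up chain environment, where the terminal reward stands in for satisfying a reachability-style specification. The environment, hyperparameters, and all names are invented and are not the paper's method.

```python
import random

def q_learning(step, states, actions, start, episodes=5000,
               alpha=0.2, gamma=0.95, epsilon=0.3, horizon=30):
    """Tabular Q-learning sketch. `step(s, a)` returns (s', reward, done)."""
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            # Epsilon-greedy action selection.
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda b: q[(s, b)]))
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
            if done:
                break
    return q

# Hypothetical chain environment with unknown (to the learner) dynamics:
# "right" advances with probability 0.9, "left" moves back
# deterministically; reward 1 is earned on reaching the goal state.
GOAL = 3

def step(s, a):
    if a == "right":
        s2 = s + 1 if random.random() < 0.9 else s
    else:
        s2 = max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

random.seed(0)  # for reproducibility of this toy run
q = q_learning(step, states=range(GOAL + 1),
               actions=["left", "right"], start=0)
```

After training, the greedy strategy prefers "right" in the non-goal states, i.e. it steers toward the region whose reward proxies satisfaction of the specification.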
Robust satisfaction of temporal logic specifications via reinforcement learning
We consider the problem of steering a system with unknown, stochastic dynamics to satisfy a rich, temporally-layered task given as a signal temporal logic formula. We represent the system as a finite-memory Markov decision process whose transition probabilities are unknown and whose states are built from a partition of the state space. We present provably convergent reinforcement learning algorithms to maximize the probability of satisfying a given specification and to maximize the average expected robustness, i.e., a measure of how strongly the formula is satisfied. Robustness allows us to quantify progress towards satisfying a given specification. We demonstrate via a pair of robot navigation simulation case studies that, due to the quantification of progress towards satisfaction, reinforcement learning with robustness maximization performs better than probability maximization in terms of both probability of satisfaction and expected robustness with a low number of training examples.
- …