Value Iteration for Long-run Average Reward in Markov Decision Processes
Markov decision processes (MDPs) are standard models for probabilistic
systems with non-deterministic behaviours. Long-run average rewards provide a
mathematically elegant formalism for expressing long term performance. Value
iteration (VI) is one of the simplest and most efficient algorithmic approaches
to MDPs with other properties, such as reachability objectives. Unfortunately,
a naive extension of VI does not work for MDPs with long-run average rewards,
as there is no known stopping criterion. In this work our contributions are
threefold. (1) We refute a conjecture related to stopping criteria for MDPs
with long-run average rewards. (2) We present two practical algorithms for MDPs
with long-run average rewards based on VI. First, we show that a combination of
applying VI locally for each maximal end-component (MEC) and VI for
reachability objectives can provide approximation guarantees. Second, extending
the above approach with a simulation-guided on-demand variant of VI, we present
an anytime algorithm that is able to deal with very large models. (3) Finally,
we present experimental results showing that our methods significantly
outperform the standard approaches on several benchmarks.
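As a concrete illustration of the reachability-objective building block mentioned above, here is a minimal value-iteration sketch on a toy MDP. The MDP, state names, and the simple fixed-point convergence threshold are illustrative assumptions only; this is not the paper's MEC-based algorithm, and the naive stopping criterion used here is exactly what the paper shows is insufficient for long-run average rewards:

```python
# Toy MDP (assumed example): each state maps actions to distributions
# over successor states, given as lists of (probability, next_state).
mdp = {
    "s0": {"a": [(0.5, "s1"), (0.5, "s2")], "b": [(1.0, "s2")]},
    "s1": {"a": [(1.0, "goal")]},
    "s2": {"a": [(1.0, "s2")]},  # sink: goal unreachable from here
    "goal": {},
}

def reachability_vi(mdp, target, eps=1e-8):
    """Value iteration for max reachability probability to `target`."""
    v = {s: (1.0 if s == target else 0.0) for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if s == target or not actions:
                continue
            # Bellman update: best action maximizes expected successor value.
            new = max(sum(p * v[t] for p, t in dist)
                      for dist in actions.values())
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < eps:  # naive stopping criterion; adequate here
            return v

values = reachability_vi(mdp, "goal")
# values["s0"] = 0.5: action "a" reaches s1 (then goal) with probability 0.5.
```

The same Bellman-update loop underlies VI for other objectives; what changes is the update rule and, as the abstract stresses, whether a sound stopping criterion exists.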
Correlated Equilibria in Competitive Staff Selection Problem
This paper deals with an extension of the concept of correlated strategies to
Markov stopping games. The Nash equilibrium approach to solving nonzero-sum
stopping games may give multiple solutions. An arbitrator can suggest to each
player the decision to be applied at each stage based on a joint distribution
over the players' decisions. This is a form of equilibrium selection. Examples
of correlated equilibria in nonzero-sum games related to the staff selection
competition in the case of two departments are given. Utilitarian, egalitarian,
republican and libertarian concepts of correlated equilibria selection are
used.
Comment: The idea of this paper was presented at Game Theory and Mathematical
Economics, International Conference in Memory of Jerzy Los (1920-1998),
Warsaw, September 200
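The arbitrator's role described above can be made concrete with a one-shot example. The sketch below checks whether a joint distribution over recommendations is a correlated equilibrium in the classic Chicken game; the game and distribution are stand-ins, since the paper's staff-selection games are Markov stopping games, which this static check does not model:

```python
# Chicken game (assumed example): actions D (dare) and C (chicken).
actions = ["D", "C"]
payoff = {  # (a1, a2) -> (u1, u2)
    ("D", "D"): (0, 0), ("D", "C"): (7, 2),
    ("C", "D"): (2, 7), ("C", "C"): (6, 6),
}
# Arbitrator's joint distribution over recommendation profiles.
mu = {("D", "C"): 1/3, ("C", "D"): 1/3, ("C", "C"): 1/3, ("D", "D"): 0.0}

def is_correlated_equilibrium(mu, payoff, actions, tol=1e-9):
    """No player should gain by deviating from any recommended action."""
    for player in (0, 1):
        for rec in actions:          # action recommended to `player`
            for dev in actions:      # candidate deviation
                gain = 0.0
                for other in actions:
                    prof = (rec, other) if player == 0 else (other, rec)
                    alt = (dev, other) if player == 0 else (other, dev)
                    gain += mu[prof] * (payoff[alt][player]
                                        - payoff[prof][player])
                if gain > tol:       # profitable deviation exists
                    return False
    return True
```

Here `mu` is a correlated equilibrium whose expected payoffs (5, 5) exceed those of the mixed Nash equilibrium, which is the kind of improvement the selection concepts in the paper rank and choose among.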
An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes.
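The idea of an exploration parameter that is annealed during search can be sketched with generic Boltzmann exploration. This is a stand-in for intuition only: the paper's value-of-information criterion and its specific cooling schedule differ from the assumed `1/log(t+1)` temperature used here:

```python
import math
import random

def boltzmann_bandit(means, episodes, seed=0):
    """Boltzmann exploration on Bernoulli arms with an annealed temperature."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # pulls per arm
    est = [0.0] * k       # running mean-reward estimates
    for t in range(1, episodes + 1):
        # Cooling schedule (assumed form): high temperature -> exploration,
        # low temperature -> exploitation of current estimates.
        tau = max(0.05, 1.0 / math.log(t + 1))
        prefs = [math.exp(est[a] / tau) for a in range(k)]
        z = sum(prefs)
        # Sample an arm proportionally to its preference weight.
        r = rng.random() * z
        arm, acc = 0, prefs[0]
        while acc < r:
            arm += 1
            acc += prefs[arm]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        est[arm] += (reward - est[arm]) / counts[arm]
    return counts, est

counts, est = boltzmann_bandit([0.2, 0.8], episodes=5000)
```

With the temperature cooled over time, pulls concentrate on the better arm, which is the qualitative behaviour behind the logarithmic-regret claim.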
Weighted Markov Decision Processes with perturbation
In this paper we consider weighted reward MDPs with
perturbation. We prove the existence of a
δ-optimal simple ultimately deterministic policy under
the assumption of “scalar value”. We also prove
that there exists a δᵢ-optimal simple ultimately deterministic
policy in the perturbed weighted MDP, for
all ε ∈ [0, ε*), even without the assumption of “scalar
value”.
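The perturbation setting can be illustrated by evaluating a fixed policy under a perturbed transition kernel P_ε = (1 − ε)P + εQ. The chain, rewards, and discount factor below are illustrative assumptions; the paper's weighted-reward criterion and ultimately deterministic policies are not modelled here:

```python
# Two-state chain under a fixed policy (assumed example).
P = [[0.9, 0.1], [0.2, 0.8]]   # nominal transition matrix
Q = [[0.5, 0.5], [0.5, 0.5]]   # perturbation kernel
r = [1.0, 0.0]                 # per-state rewards
beta = 0.9                     # discount factor

def discounted_value(eps, iters=2000):
    """Iterate v <- r + beta * P_eps v until (near) convergence."""
    P_eps = [[(1 - eps) * P[i][j] + eps * Q[i][j] for j in range(2)]
             for i in range(2)]
    v = [0.0, 0.0]
    for _ in range(iters):
        v = [r[i] + beta * sum(P_eps[i][j] * v[j] for j in range(2))
             for i in range(2)]
    return v

v_nominal = discounted_value(0.0)
v_perturbed = discounted_value(0.05)
```

Small ε shifts the value function only slightly, which is the continuity intuition behind seeking policies that stay near-optimal for all ε ∈ [0, ε*).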