
    Value Iteration for Long-run Average Reward in Markov Decision Processes

    Markov decision processes (MDPs) are standard models for probabilistic systems with non-deterministic behaviours. Long-run average rewards provide a mathematically elegant formalism for expressing long-term performance. Value iteration (VI) is one of the simplest and most efficient algorithmic approaches to MDPs with other properties, such as reachability objectives. Unfortunately, a naive extension of VI does not work for MDPs with long-run average rewards, as there is no known stopping criterion. In this work our contributions are threefold. (1) We refute a conjecture related to stopping criteria for MDPs with long-run average rewards. (2) We present two practical algorithms for MDPs with long-run average rewards based on VI. First, we show that a combination of applying VI locally for each maximal end-component (MEC) and VI for reachability objectives can provide approximation guarantees. Second, extending the above approach with a simulation-guided on-demand variant of VI, we present an anytime algorithm that is able to deal with very large models. (3) Finally, we present experimental results showing that our methods significantly outperform the standard approaches on several benchmarks.
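The naive VI scheme that the abstract contrasts against can be sketched as follows: iterate the Bellman optimality operator and stop when successive iterates are close. The 3-state MDP below is a hypothetical toy example, not from the paper; note that for reachability this stopping test is the standard one, whereas the paper's point is that no analogous criterion is known to be sound for long-run average rewards.

```python
# Value iteration for maximum reachability probability in a toy MDP.
# transitions[s][a] = list of (successor, probability) pairs.
transitions = {
    0: {"a": [(1, 0.5), (2, 0.5)], "b": [(0, 1.0)]},
    1: {"a": [(1, 1.0)]},   # target state (absorbing)
    2: {"a": [(2, 1.0)]},   # sink state (absorbing)
}
target = {1}

def value_iteration(eps=1e-8):
    # v[s] approximates the max probability of reaching the target from s.
    v = {s: (1.0 if s in target else 0.0) for s in transitions}
    while True:
        new_v = {}
        for s, acts in transitions.items():
            if s in target:
                new_v[s] = 1.0
                continue
            # Bellman optimality step: best action's expected value.
            new_v[s] = max(sum(p * v[t] for t, p in succ)
                           for succ in acts.values())
        if max(abs(new_v[s] - v[s]) for s in v) < eps:
            return new_v
        v = new_v

print(value_iteration()[0])  # prints 0.5 for this toy MDP
```

From state 0, action "a" reaches the target with probability 0.5 and action "b" self-loops, so the fixpoint is 0.5.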

    Correlated Equilibria in Competitive Staff Selection Problem

    This paper deals with an extension of the concept of correlated strategies to Markov stopping games. The Nash equilibrium approach to solving nonzero-sum stopping games may give multiple solutions. An arbitrator can suggest to each player the decision to be applied at each stage based on a joint distribution over the players' decisions. This is a form of equilibrium selection. Examples of correlated equilibria in nonzero-sum games related to the staff selection competition in the case of two departments are given. Utilitarian, egalitarian, republican and libertarian concepts of correlated equilibria selection are used. Comment: The idea of this paper was presented at Game Theory and Mathematical Economics, International Conference in Memory of Jerzy Los (1920-1998), Warsaw, September 200
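The arbitrator construction described above can be illustrated on a one-shot game: a joint distribution over action profiles is a correlated equilibrium when no player gains by deviating from a recommended action. The sketch below uses the standard 2x2 "chicken" game as a stand-in (the staff-selection stopping game from the paper is not reproduced here).

```python
# Hypothetical 2x2 "chicken" game; payoff[(r, c)] = (row payoff, col payoff),
# with actions D(are) and C(hicken).
payoff = {("D", "D"): (0, 0), ("D", "C"): (7, 2),
          ("C", "D"): (2, 7), ("C", "C"): (6, 6)}
# Arbitrator's joint distribution: 1/3 each on (D,C), (C,D), (C,C).
mu = {("D", "C"): 1/3, ("C", "D"): 1/3, ("C", "C"): 1/3}

def is_correlated_eq(mu, payoff, actions=("D", "C")):
    # Obedience check for the row player: given recommendation `rec`,
    # no deviation `dev` should raise the conditional expected payoff.
    for rec in actions:
        for dev in actions:
            gain = sum(q * (payoff[(dev, c)][0] - payoff[(rec, c)][0])
                       for (r, c), q in mu.items() if r == rec)
            if gain > 1e-12:
                return False
    # Symmetric check for the column player.
    for rec in actions:
        for dev in actions:
            gain = sum(q * (payoff[(r, dev)][1] - payoff[(r, rec)][1])
                       for (r, c), q in mu.items() if c == rec)
            if gain > 1e-12:
                return False
    return True

print(is_correlated_eq(mu, payoff))  # prints True
```

This distribution gives each player an expected payoff of 5, above the mixed Nash payoff, which is the usual motivation for letting an arbitrator correlate the recommendations.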

    An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits

    In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes. Comment: Entrop
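The general shape of such a strategy, an exploration parameter annealed over episodes, can be sketched with Boltzmann (softmax) exploration on Bernoulli arms. The arm means, cooling schedule, and temperature floor below are illustrative assumptions, not the paper's value-of-information update.

```python
import math, random

random.seed(0)
true_means = [0.2, 0.5, 0.8]   # hypothetical Bernoulli arm means
counts = [0] * 3               # pulls per arm
est = [0.0] * 3                # running mean reward estimates

def softmax_pick(tau):
    # Sample an arm with probability proportional to exp(est / tau):
    # high tau -> near-uniform exploration, low tau -> greedy exploitation.
    ws = [math.exp(q / tau) for q in est]
    z = sum(ws)
    r, acc = random.random() * z, 0.0
    for i, w in enumerate(ws):
        acc += w
        if r <= acc:
            return i
    return len(ws) - 1

for t in range(1, 5001):
    tau = max(0.05, 1.0 / math.log(t + 2))   # annealed temperature
    a = softmax_pick(tau)
    reward = 1.0 if random.random() < true_means[a] else 0.0
    counts[a] += 1
    est[a] += (reward - est[a]) / counts[a]  # incremental mean update

print(counts.index(max(counts)))  # index of the most-pulled arm
```

As the temperature cools, pulls concentrate on the empirically best arm; the paper's contribution is showing that a sufficiently fast schedule for its information parameter yields logarithmic regret.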

    Weighted Markov Decision Processes with perturbation

    In this paper we consider weighted reward MDPs with perturbation. We prove the existence of a delta-optimal simple ultimately deterministic policy under the assumption of “scalar value”. We also prove that there exists a delta-i-optimal simple ultimately deterministic policy in the perturbed weighted MDP, for all ε ∈ [0, ε*), even without the assumption of “scalar value”.
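A weighted reward criterion of the kind studied here combines a discounted value with a long-run average reward. The sketch below computes such a combination for a fixed policy on a two-state Markov chain; the chain, the reward vector, and the mixing weight are all hypothetical choices for illustration.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # transition matrix under a fixed policy
r = np.array([1.0, 0.0])          # per-state reward
beta, lam = 0.9, 0.5              # discount factor and mixing weight

# Discounted value: v = (I - beta * P)^{-1} r.
v_disc = np.linalg.solve(np.eye(2) - beta * P, r)

# Long-run average reward: g = pi . r, with pi the stationary
# distribution, i.e. the eigenvector of P^T for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
g = pi @ r

# Weighted criterion per state: convex mix of the two values.
w = lam * v_disc + (1 - lam) * g
print(w)
```

For this chain the stationary distribution is (2/3, 1/3), so the average reward is 2/3; a perturbation analysis as in the paper would study how such values behave as the transition matrix is perturbed.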