
    Multi-Objective Approaches to Markov Decision Processes with Uncertain Transition Parameters

    Markov decision processes (MDPs) are a popular model for performance analysis and optimization of stochastic systems. The parameters governing the stochastic behavior of an MDP are estimated from empirical observations of a system, so their values are not known precisely. Different types of MDPs with uncertain, imprecise or bounded transition rates or probabilities and rewards exist in the literature. Commonly, the analysis of models with uncertainties amounts to searching for the most robust policy, i.e., the goal is to generate a policy with the greatest lower bound on performance (or, symmetrically, the lowest upper bound on costs). However, hedging against an unlikely worst case may lead to losses in other situations. In general, one is interested in policies that behave well in all situations, which results in a multi-objective view on decision making. In this paper, we consider policies for the expected discounted reward measure of MDPs with uncertain parameters. In particular, the approach is defined for bounded-parameter MDPs (BMDPs) [8]. In this setting the worst, best and average case performances of a policy are analyzed simultaneously, which yields a multi-scenario multi-objective optimization problem. The paper presents and evaluates approaches to compute the pure Pareto optimal policies in the value vector space.
    Comment: 9 pages, 5 figures, preprint for VALUETOOLS 201
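    As a concrete illustration of the worst-case scenario in such a BMDP analysis, the following minimal sketch evaluates a fixed policy against interval-bounded transition probabilities: for each state, an adversary shifts probability mass, within the given bounds, toward the lowest-valued successors. The data layout (arrays p_lo, p_hi, r indexed by state and action) is an assumption for illustration, not the paper's implementation.

        import numpy as np

        def adversarial_probs(p_lo, p_hi, v):
            # Pick a distribution within [p_lo, p_hi] that minimizes the
            # expected value of v: start from the lower bounds and push the
            # remaining slack toward the lowest-value successors.
            p = p_lo.copy()
            slack = 1.0 - p.sum()
            for s in np.argsort(v):          # ascending value order
                step = min(p_hi[s] - p_lo[s], slack)
                p[s] += step
                slack -= step
                if slack <= 0:
                    break
            return p

        def worst_case_policy_value(policy, p_lo, p_hi, r, gamma=0.95, iters=1000):
            # p_lo[s, a], p_hi[s, a] bound the transition row P(. | s, a);
            # r[s, a] is the reward. Returns the guaranteed (lower-bound)
            # discounted value of the given deterministic policy.
            n = len(policy)
            v = np.zeros(n)
            for _ in range(iters):
                v_new = np.empty(n)
                for s in range(n):
                    a = policy[s]
                    p = adversarial_probs(p_lo[s, a], p_hi[s, a], v)
                    v_new[s] = r[s, a] + gamma * p @ v
                v = v_new
            return v

    The best-case value is obtained symmetrically by pushing slack toward the highest-valued successors, and the pair of runs gives the worst/best coordinates of a policy's value vector.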

    Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

    We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White, 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.
    Comment: arXiv admin note: text overlap with arXiv:1508.0341
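    A rough sketch of one step of an emphatic-TD-style update with linear features, in which the introduced parameter β, rather than the discount γ, sets the decay rate of the importance-sampling (follow-on) term, as the abstract describes. The trace definitions below follow the original ETD recursion with unit interest and are assumptions here, not the paper's precise algorithm.

        import numpy as np

        def etd_beta_step(theta, e, F, phi, phi_next, reward,
                          rho, rho_prev, gamma, lam, beta, alpha):
            # Follow-on trace: the original ETD decays it at rate gamma;
            # here beta controls the decay of the importance-sampling term.
            F = 1.0 + beta * rho_prev * F
            M = lam + (1.0 - lam) * F              # emphasis weighting
            e = rho * (gamma * lam * e + M * phi)  # eligibility trace
            delta = reward + gamma * phi_next @ theta - phi @ theta
            theta = theta + alpha * delta * e
            return theta, e, F

    Setting beta = gamma recovers the original ETD(λ) step, while smaller beta shortens the effective memory of the importance-sampling products, which is the bias-for-variance trade the abstract refers to.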

    Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty

    We consider Markov decision processes under parameter uncertainty. Previous studies have restricted attention to the case where uncertainties among different states are uncoupled, which leads to conservative solutions. In contrast, we introduce an intuitive concept, termed "Lightning Does not Strike Twice," to model coupled uncertain parameters. Specifically, we require that the system can deviate from its nominal parameters only a bounded number of times. We give probabilistic guarantees indicating that this model represents real-life situations, and devise tractable algorithms for computing optimal control policies using this concept.
    Comment: ICML201
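    One natural way to read the bounded-deviation requirement is as a dynamic program over an augmented state (s, k), where k is the number of adversarial deviations the system may still make. The finite-horizon sketch below is an illustrative rendering of that idea under assumed inputs (a nominal and a single worst-case deviated transition row per state-action pair); it is not the paper's algorithm.

        import numpy as np

        def budgeted_robust_values(P_nom, P_dev, r, horizon, budget):
            # P_nom[a][s] : nominal transition row for action a in state s
            # P_dev[a][s] : worst-case deviated row (one "lightning strike")
            # V[t, s, k]  : optimal worst-case return with k deviations left
            nS, nA = r.shape
            V = np.zeros((horizon + 1, nS, budget + 1))
            for t in range(horizon - 1, -1, -1):
                for s in range(nS):
                    for k in range(budget + 1):
                        best = -np.inf
                        for a in range(nA):
                            nominal = P_nom[a][s] @ V[t + 1, :, k]
                            if k > 0:
                                # The adversary may spend one deviation now.
                                deviated = P_dev[a][s] @ V[t + 1, :, k - 1]
                                worst = min(nominal, deviated)
                            else:
                                worst = nominal
                            best = max(best, r[s, a] + worst)
                        V[t, s, k] = best
            return V[0]

    The augmented budget coordinate is what couples the uncertainty across states: once the adversary has spent its k deviations, the remaining dynamics are nominal, which is exactly what keeps the solution from being uniformly conservative.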

    Average optimality for continuous-time Markov decision processes in Polish spaces

    This paper is devoted to studying average optimality in continuous-time Markov decision processes with fairly general state and action spaces. The criterion to be maximized is the expected average reward. The transition rates of the underlying continuous-time jump Markov processes are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. We first provide two optimality inequalities with opposed directions, and give suitable conditions under which the existence of solutions to the two optimality inequalities is ensured. Then, from the two optimality inequalities we prove the existence of optimal (deterministic) stationary policies by using the Dynkin formula. Moreover, we present a "semimartingale characterization" of an optimal stationary policy. Finally, we use a generalized Potlatch process with control to illustrate the difference between our conditions and those in the previous literature, and then further apply our results to average optimal control problems of generalized birth-death systems, upwardly skip-free processes and two queueing systems. The approach developed in this paper is slightly different from the "optimality inequality approach" widely used in the previous literature.
    Comment: Published at http://dx.doi.org/10.1214/105051606000000105 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)
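    For a countable state space, the two opposed optimality inequalities take roughly the following form (a sketch; the paper works in Polish spaces, with integration against a transition kernel in place of the sums, and q(s'|s,a) denotes the transition rates, with negative diagonal entries):

        g \le \sup_{a \in A(s)} \Big\{ r(s,a) + \sum_{s' \in S} q(s' \mid s, a)\, u(s') \Big\} \quad \text{for all } s \in S,
        g \ge \sup_{a \in A(s)} \Big\{ r(s,a) + \sum_{s' \in S} q(s' \mid s, a)\, w(s') \Big\} \quad \text{for all } s \in S.

    Here g is a candidate optimal average reward and u, w are relative-value functions; under the paper's conditions, the two inequalities together yield an average optimal stationary policy via the Dynkin formula.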

    Power Aware Wireless File Downloading: A Constrained Restless Bandit Approach

    This paper treats power-aware throughput maximization in a multi-user file downloading system. Each user can receive a new file only after its previous file is finished. The file state processes of the users act as coupled Markov chains that form a generalized restless bandit system. First, an optimal algorithm is derived for the case of one user; the algorithm maximizes throughput subject to an average power constraint. Next, the one-user algorithm is extended to a low-complexity heuristic for the multi-user problem. The heuristic uses a simple online index policy, and its effectiveness is shown via simulation. For simple 3-user cases where the optimal solution can be computed offline, the heuristic is shown to be near-optimal for a wide range of parameters.
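    A minimal sketch of the kind of online index policy the abstract describes: each user with an active file download gets an index trading expected throughput against power cost, and the scheduler serves the user with the largest positive index (or idles). The index form and the price parameter v are illustrative assumptions, not the paper's heuristic.

        def index_policy(users, v):
            # users: list of dicts with per-user state, e.g.
            #   {"busy": bool,            # user has an active file download
            #    "rate": float,           # expected service rate if served
            #    "power": float}          # transmit power if served
            # v: price on power; larger v makes the scheduler more
            #    power-conservative (an average-power knob).
            best_user, best_index = None, 0.0
            for i, u in enumerate(users):
                if not u["busy"]:
                    continue              # no file in progress for this user
                index = u["rate"] - v * u["power"]   # throughput minus power cost
                if index > best_index:
                    best_user, best_index = i, index
            return best_user              # None means: stay idle this slot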

    Evolutionary game of coalition building under external pressure

    We study the fragmentation-coagulation (or merging and splitting) evolutionary control model introduced recently by one of the authors, in which N small players can form coalitions to resist the pressure exerted by the principal. It is a continuous-time Markov chain, and the players have a common reward to optimize. We study the behavior as N grows and show that the problem converges to a (one-player) deterministic optimization problem in continuous time, on an infinite-dimensional state space.
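    To make the finite-N model concrete, the toy simulation below runs a merging-and-splitting continuous-time Markov chain on coalition sizes; the rate choices are illustrative assumptions, and the control and reward components of the model are omitted.

        import random

        def simulate_coalitions(n_players, t_end, merge_rate=1.0, split_rate=0.5):
            # State: a list of coalition sizes summing to n_players.
            # Each unordered pair of coalitions merges at rate merge_rate;
            # each coalition of size >= 2 splits one member off at split_rate.
            coalitions = [1] * n_players          # start from singletons
            t = 0.0
            while t < t_end:
                n = len(coalitions)
                total_merge = merge_rate * n * (n - 1) / 2
                total_split = split_rate * sum(1 for c in coalitions if c >= 2)
                total = total_merge + total_split
                if total == 0:
                    break
                t += random.expovariate(total)    # time to the next event
                if random.random() < total_merge / total:
                    i, j = random.sample(range(n), 2)      # merge two coalitions
                    merged = coalitions[i] + coalitions[j]
                    coalitions = [c for k, c in enumerate(coalitions) if k not in (i, j)]
                    coalitions.append(merged)
                else:
                    i = random.choice([k for k, c in enumerate(coalitions) if c >= 2])
                    coalitions[i] -= 1                     # split one member off
                    coalitions.append(1)
            return coalitions

    As N grows, the empirical distribution of coalition sizes in such a chain concentrates, which is the regime in which the abstract's deterministic limit problem arises.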