
    Multi-Objective Approaches to Markov Decision Processes with Uncertain Transition Parameters

    Markov decision processes (MDPs) are a popular model for performance analysis and optimization of stochastic systems. The parameters governing the stochastic behavior of an MDP are estimated from empirical observations of a system, so their values are not known precisely. Different types of MDPs with uncertain, imprecise or bounded transition rates or probabilities and rewards exist in the literature. Commonly, the analysis of models with uncertainties amounts to searching for the most robust policy, i.e., the goal is to generate a policy with the greatest lower bound on performance (or, symmetrically, the lowest upper bound on costs). However, hedging against an unlikely worst case may lead to losses in other situations. In general, one is interested in policies that behave well in all situations, which results in a multi-objective view on decision making. In this paper, we consider policies for the expected discounted reward measure of MDPs with uncertain parameters. In particular, the approach is defined for bounded-parameter MDPs (BMDPs) [8]. In this setting the worst, best and average case performances of a policy are analyzed simultaneously, which yields a multi-scenario multi-objective optimization problem. The paper presents and evaluates approaches to compute the pure Pareto optimal policies in the value vector space.
    Comment: 9 pages, 5 figures, preprint for VALUETOOLS 201
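    As a concrete illustration of the worst-case scenario in such a BMDP analysis, the following minimal sketch evaluates a fixed policy against interval-bounded transition probabilities: for each state, an adversary shifts probability mass, within the given bounds, toward the lowest-valued successors. The data layout (arrays p_lo, p_hi, r indexed by state and action) is an assumption for illustration, not the paper's implementation.

        import numpy as np

        def adversarial_probs(p_lo, p_hi, v):
            # Pick a distribution within [p_lo, p_hi] that minimizes the
            # expected value of v: start from the lower bounds and push the
            # remaining slack toward the lowest-value successors.
            p = p_lo.copy()
            slack = 1.0 - p.sum()
            for s in np.argsort(v):          # ascending value order
                step = min(p_hi[s] - p_lo[s], slack)
                p[s] += step
                slack -= step
                if slack <= 0:
                    break
            return p

        def worst_case_policy_value(policy, p_lo, p_hi, r, gamma=0.95, iters=1000):
            # p_lo[s, a], p_hi[s, a] bound the transition row P(. | s, a);
            # r[s, a] is the reward. Returns the guaranteed (lower-bound)
            # discounted value of the given deterministic policy.
            n = len(policy)
            v = np.zeros(n)
            for _ in range(iters):
                v_new = np.empty(n)
                for s in range(n):
                    a = policy[s]
                    p = adversarial_probs(p_lo[s, a], p_hi[s, a], v)
                    v_new[s] = r[s, a] + gamma * p @ v
                v = v_new
            return v

    The best-case value is obtained symmetrically by pushing slack toward the highest-valued successors, and the pair of runs gives the worst/best coordinates of a policy's value vector.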

    Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

    We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White, 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.
    Comment: arXiv admin note: text overlap with arXiv:1508.0341
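    A rough sketch of one step of an emphatic-TD-style update with linear features, in which the introduced parameter β, rather than the discount γ, sets the decay rate of the importance-sampling (follow-on) term, as the abstract describes. The trace definitions below follow the original ETD recursion with unit interest and are assumptions here, not the paper's precise algorithm.

        import numpy as np

        def etd_beta_step(theta, e, F, phi, phi_next, reward,
                          rho, rho_prev, gamma, lam, beta, alpha):
            # Follow-on trace: the original ETD decays it at rate gamma;
            # here beta controls the decay of the importance-sampling term.
            F = 1.0 + beta * rho_prev * F
            M = lam + (1.0 - lam) * F              # emphasis weighting
            e = rho * (gamma * lam * e + M * phi)  # eligibility trace
            delta = reward + gamma * phi_next @ theta - phi @ theta
            theta = theta + alpha * delta * e
            return theta, e, F

    Setting beta = gamma recovers the original ETD(λ) step, while smaller beta shortens the effective memory of the importance-sampling products, which is the bias-for-variance trade the abstract refers to.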

    Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty

    We consider Markov decision processes under parameter uncertainty. Previous studies have restricted attention to the case where uncertainties among different states are uncoupled, which leads to conservative solutions. In contrast, we introduce an intuitive concept, termed "Lightning Does not Strike Twice," to model coupled uncertain parameters. Specifically, we require that the system can deviate from its nominal parameters only a bounded number of times. We give probabilistic guarantees indicating that this model represents real-life situations, and devise tractable algorithms for computing optimal control policies using this concept.
    Comment: ICML201
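    One natural way to read the bounded-deviation requirement is as a dynamic program over an augmented state (s, k), where k is the number of adversarial deviations the system may still make. The finite-horizon sketch below is an illustrative rendering of that idea under assumed inputs (a nominal and a single worst-case deviated transition row per state-action pair); it is not the paper's algorithm.

        import numpy as np

        def budgeted_robust_values(P_nom, P_dev, r, horizon, budget):
            # P_nom[a][s] : nominal transition row for action a in state s
            # P_dev[a][s] : worst-case deviated row (one "lightning strike")
            # V[t, s, k]  : optimal worst-case return with k deviations left
            nS, nA = r.shape
            V = np.zeros((horizon + 1, nS, budget + 1))
            for t in range(horizon - 1, -1, -1):
                for s in range(nS):
                    for k in range(budget + 1):
                        best = -np.inf
                        for a in range(nA):
                            nominal = P_nom[a][s] @ V[t + 1, :, k]
                            if k > 0:
                                # The adversary may spend one deviation now.
                                deviated = P_dev[a][s] @ V[t + 1, :, k - 1]
                                worst = min(nominal, deviated)
                            else:
                                worst = nominal
                            best = max(best, r[s, a] + worst)
                        V[t, s, k] = best
            return V[0]

    The augmented budget coordinate is what couples the uncertainty across states: once the adversary has spent its k deviations, the remaining dynamics are nominal, which is exactly what keeps the solution from being uniformly conservative.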

    Average optimality for continuous-time Markov decision processes in Polish spaces

    This paper is devoted to studying average optimality in continuous-time Markov decision processes with fairly general state and action spaces. The criterion to be maximized is the expected average reward. The transition rates of the underlying continuous-time jump Markov processes are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. We first provide two optimality inequalities with opposed directions, and give suitable conditions under which the existence of solutions to the two optimality inequalities is ensured. Then, from the two optimality inequalities we prove the existence of optimal (deterministic) stationary policies by using the Dynkin formula. Moreover, we present a "semimartingale characterization" of an optimal stationary policy. Finally, we use a generalized Potlatch process with control to illustrate the difference between our conditions and those in the previous literature, and then further apply our results to average optimal control problems of generalized birth-death systems, upwardly skip-free processes and two queueing systems. The approach developed in this paper is slightly different from the "optimality inequality approach" widely used in the previous literature.
    Comment: Published at http://dx.doi.org/10.1214/105051606000000105 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)
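    For a countable state space, the two opposed optimality inequalities take roughly the following form (a sketch; the paper works in Polish spaces, with integration against a transition kernel in place of the sums, and q(s'|s,a) denotes the transition rates, with negative diagonal entries):

        g \le \sup_{a \in A(s)} \Big\{ r(s,a) + \sum_{s' \in S} q(s' \mid s, a)\, u(s') \Big\} \quad \text{for all } s \in S,
        g \ge \sup_{a \in A(s)} \Big\{ r(s,a) + \sum_{s' \in S} q(s' \mid s, a)\, w(s') \Big\} \quad \text{for all } s \in S.

    Here g is a candidate optimal average reward and u, w are relative-value functions; under the paper's conditions, the two inequalities together yield an average optimal stationary policy via the Dynkin formula.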

    Power Aware Wireless File Downloading: A Constrained Restless Bandit Approach

    This paper treats power-aware throughput maximization in a multi-user file downloading system. Each user can receive a new file only after its previous file is finished. The file state processes of the users act as coupled Markov chains that form a generalized restless bandit system. First, an optimal algorithm is derived for the case of one user; the algorithm maximizes throughput subject to an average power constraint. Next, the one-user algorithm is extended to a low-complexity heuristic for the multi-user problem. The heuristic uses a simple online index policy, and its effectiveness is shown via simulation. For simple 3-user cases where the optimal solution can be computed offline, the heuristic is shown to be near-optimal for a wide range of parameters.
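    A minimal sketch of the kind of online index policy the abstract describes: each user with an active file download gets an index trading expected throughput against power cost, and the scheduler serves the user with the largest positive index (or idles). The index form and the price parameter v are illustrative assumptions, not the paper's heuristic.

        def index_policy(users, v):
            # users: list of dicts with per-user state, e.g.
            #   {"busy": bool,            # user has an active file download
            #    "rate": float,           # expected service rate if served
            #    "power": float}          # transmit power if served
            # v: price on power; larger v makes the scheduler more
            #    power-conservative (an average-power knob).
            best_user, best_index = None, 0.0
            for i, u in enumerate(users):
                if not u["busy"]:
                    continue              # no file in progress for this user
                index = u["rate"] - v * u["power"]   # throughput minus power cost
                if index > best_index:
                    best_user, best_index = i, index
            return best_user              # None means: stay idle this slot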

    Evolutionary game of coalition building under external pressure

    We study the fragmentation-coagulation (or merging and splitting) evolutionary control model introduced recently by one of the authors, in which N small players can form coalitions to resist the pressure exerted by the principal. It is a continuous-time Markov chain, and the players have a common reward to optimize. We study the behavior as N grows and show that the problem converges to a (one-player) deterministic optimization problem in continuous time, on an infinite-dimensional state space.
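    To make the finite-N model concrete, the toy simulation below runs a merging-and-splitting continuous-time Markov chain on coalition sizes; the rate choices are illustrative assumptions, and the control and reward components of the model are omitted.

        import random

        def simulate_coalitions(n_players, t_end, merge_rate=1.0, split_rate=0.5):
            # State: a list of coalition sizes summing to n_players.
            # Each unordered pair of coalitions merges at rate merge_rate;
            # each coalition of size >= 2 splits one member off at split_rate.
            coalitions = [1] * n_players          # start from singletons
            t = 0.0
            while t < t_end:
                n = len(coalitions)
                total_merge = merge_rate * n * (n - 1) / 2
                total_split = split_rate * sum(1 for c in coalitions if c >= 2)
                total = total_merge + total_split
                if total == 0:
                    break
                t += random.expovariate(total)    # time to the next event
                if random.random() < total_merge / total:
                    i, j = random.sample(range(n), 2)      # merge two coalitions
                    merged = coalitions[i] + coalitions[j]
                    coalitions = [c for k, c in enumerate(coalitions) if k not in (i, j)]
                    coalitions.append(merged)
                else:
                    i = random.choice([k for k, c in enumerate(coalitions) if c >= 2])
                    coalitions[i] -= 1                     # split one member off
                    coalitions.append(1)
            return coalitions

    As N grows, the empirical distribution of coalition sizes in such a chain concentrates, which is the regime in which the abstract's deterministic limit problem arises.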