Computing Optimal Stationary Policies for Multi-objective Markov Decision Processes

Abstract

This paper describes a novel algorithm called CON-MODP for computing Pareto optimal policies for deterministic multi-objective sequential decision problems. CON-MODP is a value iteration based multi-objective dynamic programming algorithm that computes only stationary policies. We observe that, to guarantee convergence to the unique Pareto optimal set of deterministic stationary policies, the algorithm needs to perform a policy evaluation step on particular policies that are inconsistent in the single state being expanded. We prove that the algorithm converges to the Pareto optimal set of value functions and policies for deterministic infinite horizon discounted multi-objective Markov decision processes. Experiments show that CON-MODP is much faster than previous multi-objective value iteration algorithms.
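To make the setting concrete, the sketch below shows plain multi-objective value iteration with Pareto pruning for a deterministic MDP, i.e. the baseline that CON-MODP extends; it does not include CON-MODP's consistency operator or the policy evaluation step on inconsistent policies described above. All function names, the toy transition/reward tables, and the iteration count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pareto_front(vectors):
    """Keep only value vectors not dominated by any other (maximization)."""
    front = []
    for v in vectors:
        dominated = any(np.all(w >= v) and np.any(w > v) for w in vectors)
        if not dominated and not any(np.array_equal(v, u) for u in front):
            front.append(v)
    return front

def mo_value_iteration(transitions, rewards, gamma=0.9, iters=50):
    """
    Baseline multi-objective value iteration for a deterministic MDP.

    transitions[s][a] -> next state
    rewards[s][a]     -> reward vector (np.ndarray)
    Returns, for each state, a set of Pareto optimal value vectors.
    """
    states = list(transitions.keys())
    dim = next(iter(rewards[states[0]].values())).shape
    V = {s: [np.zeros(dim)] for s in states}
    for _ in range(iters):
        V_new = {}
        for s in states:
            # Back up every (action, successor value vector) combination,
            # then prune dominated vectors.
            candidates = [rewards[s][a] + gamma * v
                          for a, s_next in transitions[s].items()
                          for v in V[s_next]]
            V_new[s] = pareto_front(candidates)
        V = V_new
    return V

# Tiny two-state, two-objective example (hypothetical numbers).
transitions = {0: {"a": 0, "b": 1}, 1: {"a": 1, "b": 0}}
rewards = {
    0: {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])},
    1: {"a": np.array([0.5, 0.5]), "b": np.array([0.0, 0.0])},
}
for s, front in mo_value_iteration(transitions, rewards).items():
    print(s, [np.round(v, 2) for v in front])
```

Because each state stores a set of value vectors rather than a single value, the backup enumerates all action/successor-vector combinations before pruning; this is the source of the cost that CON-MODP reduces by restricting attention to stationary policies.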

This paper was published in Utrecht University Repository.
