Two heuristics for solving POMDPs having a delayed need to observe

Abstract

A common heuristic for solving Partially Observable Markov Decision Problems (POMDPs) is to first solve the underlying Markov Decision Process (MDP) and then construct a POMDP policy by performing a fixed-depth lookahead search in the POMDP, evaluating the leaf nodes with the MDP value function. A problem with this approximation is that it does not account for the need to choose actions in order to gain information about the state of the world, particularly when those observation actions are needed at some point in the future. This paper proposes two heuristics that improve on the MDP approximation in POMDPs where there is a delayed need to observe. The first approximation, introduced in [2], is the even-odd POMDP, in which the world is assumed to be fully observable every other time step. The even-odd POMDP can be converted into an equivalent MDP, the even-MDP, whose value function captures some of the sensing costs of the original POMDP. An online policy consisting of a one-step lookahead search combined with the value function of the even-MDP gives an approximation to the POMDP's value function that is at least as good as the method based on the value function of the underlying MDP. The second POMDP approximation applies to a special kind of POMDP that we call the Cost Observable Markov Decision Problem (COMDP). In a COMDP, the actions are partitioned into those that change the state of the world and those that are pure observation actions. For such problems we describe the chain-MDP algorithm, which in many cases is able to capture more of the sensing costs than the even-odd POMDP approximation. We prove that both heuristics compute value functions that are upper bounded by (i.e., better than) the value function of the underlying MDP and, in the case of the even-MDP, also lower bounded by the POMDP's optimal value function. We show cases where the chain-MDP online policy is better than, equal to, or worse than the even-MDP online policy.

Keywords: Cost Observable Markov Decision Problem, POMDP, COMDP, Partially Observable Markov Decision Problem
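The following is a minimal sketch, not the paper's code, of the baseline heuristic the abstract describes: a fixed-depth expectimax lookahead over belief states whose leaf nodes are scored by the underlying MDP's value function. It assumes a discrete POMDP given as dense tables T[s][a][s'], O[a][s'][o], and R[s][a]; all identifiers are illustrative assumptions, not names from the paper.

```python
# Sketch of fixed-depth lookahead in belief space with MDP leaf evaluation.
# T[s][a][s2]: transition probabilities; O[a][s2][o]: observation probabilities;
# R[s][a]: expected immediate reward; V_mdp[s]: value function of the
# underlying (fully observable) MDP. All names are hypothetical.

def belief_update(b, a, o, T, O):
    """Bayes filter: b'(s') is proportional to O[a][s'][o] * sum_s b[s] * T[s][a][s']."""
    nb = [O[a][s2][o] * sum(b[s] * T[s][a][s2] for s in range(len(b)))
          for s2 in range(len(b))]
    z = sum(nb)
    return [p / z for p in nb]  # caller guarantees Pr(o | b, a) > 0

def lookahead_value(b, depth, T, O, R, V_mdp, gamma=0.95):
    """Depth-limited expectimax value of belief b."""
    if depth == 0:
        # Leaf: pretend the state becomes fully observable from here on
        # (the MDP approximation the paper's heuristics improve on).
        return sum(b[s] * V_mdp[s] for s in range(len(b)))
    best = float("-inf")
    n_actions, n_obs = len(O), len(O[0][0])
    for a in range(n_actions):
        q = sum(b[s] * R[s][a] for s in range(len(b)))  # expected immediate reward
        for o in range(n_obs):
            # Pr(o | b, a), marginalizing over the successor state s'
            po = sum(b[s] * T[s][a][s2] * O[a][s2][o]
                     for s in range(len(b)) for s2 in range(len(b)))
            if po > 0:
                q += gamma * po * lookahead_value(
                    belief_update(b, a, o, T, O), depth - 1, T, O, R, V_mdp, gamma)
        best = max(best, q)
    return best
```

An online policy then executes, at the current belief, the root action with the largest lookahead Q-value. Under this reading, the paper's first heuristic amounts to running the same one-step lookahead with V_mdp replaced by the even-MDP value function, which folds some sensing costs into the leaf evaluation.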
