Hindsight is Only 50/50: Unsuitability of MDP based Approximate POMDP Solvers for Multi-resolution Information Gathering
Partially Observable Markov Decision Processes (POMDPs) offer an elegant framework to model sequential decision making in uncertain environments. Solving POMDPs online is an active area of research, and given the size of real-world problems, approximate solvers are used. Recently, a few approaches have been suggested for solving POMDPs by using MDP solvers in conjunction with imitation learning. MDP-based POMDP solvers work well in some cases while failing catastrophically in others. The main failure point of such solvers is that MDP solvers have no motivation to gather information: under their assumptions, the environment is either already known as fully as it can be, or the uncertainty will disappear after the next step. For POMDP problems, however, gathering information can lead to efficient solutions. In this paper we derive a set of conditions under which MDP-based POMDP solvers are provably sub-optimal, and we use the well-known tiger problem to demonstrate this sub-optimality. We show that multi-resolution, budgeted information gathering cannot be addressed using MDP-based POMDP solvers. The contribution of the paper helps identify the properties of a POMDP problem for which MDP-based POMDP solvers are inappropriate, enabling better design choices.

Comment: 6 pages, 1 figure
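To make this failure mode concrete, here is a minimal, self-contained sketch of the tiger problem using the classic reward values (+10 treasure, -100 tiger, -1 listen; the paper's exact setup may differ). A hindsight-optimal expert that knows the state never listens, so an agent imitating it from a uniform belief is reduced to a 50/50 guess between the doors, while a policy that pays for information comes out ahead:

```python
TREASURE, TIGER, LISTEN_COST = 10.0, -100.0, -1.0
P_CORRECT = 0.85  # accuracy of the noisy observation from listening

# Hindsight (fully observed) expert: it knows the tiger's door and always
# opens the other one for +10. Listening is strictly dominated in hindsight:
#   Q(s, open_correct) = 10  >  Q(s, listen) = -1 + gamma * 10,
# so the expert's demonstrations contain no listen actions at all.

# An agent imitating those demonstrations under the uniform belief (0.5, 0.5)
# can only guess a door at random -- "hindsight is only 50/50":
value_imitator = 0.5 * TREASURE + 0.5 * TIGER
print(f"imitate the hindsight expert: {value_imitator:+.1f}")  # -45.0

# Information gathering pays: listen once, then open the indicated door.
value_listen = LISTEN_COST + P_CORRECT * TREASURE + (1 - P_CORRECT) * TIGER
print(f"listen once, then open:       {value_listen:+.1f}")    # -7.5

def belief_update(b_left: float, heard_left: bool) -> float:
    """Bayes update of P(tiger behind the left door) after one listen."""
    p = P_CORRECT if heard_left else 1.0 - P_CORRECT
    return p * b_left / (p * b_left + (1.0 - p) * (1.0 - b_left))

# Two consistent observations push the belief to ~0.97, at which point even
# paying the listening cost twice leaves opening with positive expected value.
b = belief_update(belief_update(0.5, True), True)
value_two = 2 * LISTEN_COST + b * TREASURE + (1.0 - b) * TIGER
print(f"listen twice, then open:      {value_two:+.1f}")       # +4.7
```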
Robust Asymmetric Learning in POMDPs
Policies for partially observed Markov decision processes can be efficiently
learned by imitating policies for the corresponding fully observed Markov
decision processes. Unfortunately, existing approaches for this kind of
imitation learning have a serious flaw: the expert does not know what the
trainee cannot see, and so may encourage actions that are sub-optimal, even
unsafe, under partial information. We derive an objective to instead train the
expert to maximize the expected reward of the imitating agent policy, and use
it to construct an efficient algorithm, adaptive asymmetric DAgger (A2D), that
jointly trains the expert and the agent. We show that A2D produces an expert
policy that the agent can safely imitate, in turn outperforming policies
learned by imitating a fixed expert.

Comment: ICML 2021
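As a toy illustration of the objective described above (not the paper's actual algorithm, which jointly trains the expert and agent with policy-gradient and imitation updates), the sketch below adds a third, safe action to a one-step tiger-style problem. An expert trained on its own reward demonstrates door-opening that the partially observed agent cannot reproduce; an expert trained on the imitating agent's expected reward, as in the A2D objective, learns to demonstrate the safe action instead. All names and reward values here are illustrative:

```python
import math

# States: 0 = tiger left, 1 = tiger right (uniform prior).
# Actions: 0 = open left, 1 = open right, 2 = stay safe (reward 0).
R = [[-100.0, 10.0, 0.0],   # tiger behind the left door
     [10.0, -100.0, 0.0]]   # tiger behind the right door

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def agent_policy(theta):
    """The agent's observation is uninformative here, so its best imitation
    of the expert is the expert's state-marginal action distribution."""
    pi_l, pi_r = softmax(theta[0]), softmax(theta[1])
    return [0.5 * a + 0.5 * b for a, b in zip(pi_l, pi_r)]

def agent_reward(theta):
    """Expected reward of the imitating agent under the uniform prior."""
    pi = agent_policy(theta)
    return sum(0.5 * pi[a] * R[s][a] for s in (0, 1) for a in range(3))

# Naive asymmetric imitation: the expert maximises its OWN reward, so it
# deterministically opens the correct door -- advice the agent cannot follow.
theta_naive = [[-9.0, 9.0, -9.0],   # tiger left  -> open right
               [9.0, -9.0, -9.0]]   # tiger right -> open left

def train_a2d_expert(iters=300, lr=0.1, eps=1e-4):
    """Train the expert to maximise the AGENT's expected reward (the
    objective from the abstract), via finite-difference coordinate ascent."""
    theta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    for _ in range(iters):
        for s in (0, 1):
            for a in range(3):
                theta[s][a] += eps
                up = agent_reward(theta)
                theta[s][a] -= 2.0 * eps
                down = agent_reward(theta)
                theta[s][a] += eps                      # restore
                theta[s][a] += lr * (up - down) / (2.0 * eps)
    return theta

theta_a2d = train_a2d_expert()
print(f"naive expert -> agent reward: {agent_reward(theta_naive):+.1f}")  # -45.0
print(f"A2D-style    -> agent reward: {agent_reward(theta_a2d):+.1f}")    # ~0.0
print("A2D-style expert now demonstrates 'stay safe':",
      [round(p, 2) for p in agent_policy(theta_a2d)])
```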