Stochastic Shortest Path with Energy Constraints in POMDPs
We consider partially observable Markov decision processes (POMDPs) with a
set of target states and positive integer costs associated with every
transition. The traditional optimization objective (stochastic shortest path)
asks to minimize the expected total cost until the target set is reached. We
extend the traditional framework of POMDPs to model energy consumption, which
represents a hard constraint. The energy levels may increase and decrease with
transitions, and the hard constraint requires that the energy level must remain
positive at every step until the target is reached. First, we present a novel algorithm for solving POMDPs with energy levels, building on existing POMDP solvers and using RTDP as its main method. Our second contribution concerns policy representation. For larger POMDP instances, the policies computed by existing solvers are too large to be understandable. We present an automated procedure, based on machine learning techniques, that extracts the important decisions of a policy, allowing us to compute succinct, human-readable policies. Finally, we show experimentally that our algorithm performs well and computes succinct policies on a number of POMDP instances from the literature that were naturally enhanced with energy levels.
Comment: Technical report accompanying a paper published in the proceedings of AAMAS 201
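The hard energy constraint can be made concrete with a small product construction: fold the current energy level into the state, and send any transition that would exhaust the energy to an absorbing failure state. The sketch below illustrates this idea only; it is not the paper's construction (whose algorithm builds on RTDP over the energy-extended model), and names such as `product_with_energy` are hypothetical.

```python
# Minimal sketch: folding a bounded energy level into the state space of a
# POMDP so that a standard solver can be applied. Target handling and
# observations are omitted for brevity; all names are illustrative.

def product_with_energy(states, actions, trans, delta, e_max):
    """Build product states (s, e) with e in 1..e_max; energy <= 0 is fatal.

    trans(s, a) -> iterable of (s_next, prob)
    delta(s, a) -> integer energy change of taking action a in state s
    """
    SINK = ("dead", 0)  # absorbing state: the hard constraint was violated
    prod_states = [(s, e) for s in states for e in range(1, e_max + 1)]
    prod_states.append(SINK)

    def prod_trans(se, a):
        s, e = se
        if se == SINK:
            return [(SINK, 1.0)]
        e_next = min(e + delta(s, a), e_max)  # cap recharging at e_max
        if e_next <= 0:                       # energy exhausted: fail
            return [(SINK, 1.0)]
        return [((s2, e_next), p) for s2, p in trans(s, a)]

    return prod_states, prod_trans
```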
Perseus: Randomized Point-based Value Iteration for POMDPs
Partially observable Markov decision processes (POMDPs) form an attractive
and principled framework for agent planning under uncertainty. Point-based
approximate techniques for POMDPs compute a policy based on a finite set of
points collected in advance from the agent's belief space. We present a
randomized point-based value iteration algorithm called Perseus. The algorithm
performs approximate value backup stages, ensuring that in each backup stage
the value of each point in the belief set is improved; the key observation is
that a single backup may improve the value of many belief points. Contrary to
other point-based methods, Perseus backs up only a (randomly selected) subset
of points in the belief set, sufficient for improving the value of each belief
point in the set. We show how the same idea can be extended to deal with continuous action spaces. Experimental results show the potential of Perseus in large-scale POMDP problems.
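The core of Perseus is its randomized backup stage, which is simple enough to sketch. The following is a minimal Python rendering of that stage under a standard alpha-vector value representation; the `backup` operator (the usual point-based Bellman backup) is assumed as given, and the names are illustrative rather than taken from any released implementation.

```python
import random
import numpy as np

def perseus_backup_stage(B, V, backup):
    """One Perseus value-update stage over a fixed belief set B.

    B      : list of belief points (numpy arrays over states)
    V      : current set of alpha-vectors (list of numpy arrays)
    backup : backup(b, V) -> alpha-vector from a point-based Bellman
             backup at belief b
    Returns V_next such that every b in B has value at least as high
    as under V.
    """
    value = lambda b, W: max(np.dot(alpha, b) for alpha in W)
    V_next = []
    not_improved = list(B)  # beliefs whose value has not yet improved
    while not_improved:
        b = random.choice(not_improved)   # randomly pick one belief
        alpha = backup(b, V)              # back up at b only
        if np.dot(alpha, b) >= value(b, V):
            V_next.append(alpha)          # keep the improving vector
        else:
            # fall back to the best old vector for b
            V_next.append(max(V, key=lambda a: np.dot(a, b)))
        # a single new vector may improve many beliefs at once
        not_improved = [bp for bp in not_improved
                        if value(bp, V_next) < value(bp, V)]
    return V_next
```

The loop exploits the key observation from the abstract: one backed-up vector may improve the value of many belief points, so the stage typically terminates after far fewer than |B| backups.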
Strengthening Deterministic Policies for POMDPs
The synthesis problem for partially observable Markov decision processes
(POMDPs) is to compute a policy that satisfies a given specification. Such
policies have to take the full execution history of a POMDP into account,
rendering the problem undecidable in general. A common approach is to use a
limited amount of memory and randomize over potential choices. Yet, this
problem is still NP-hard and often computationally intractable in practice. A
restricted problem is to use neither history nor randomization, yielding
policies that are called stationary and deterministic. Previous approaches to
compute such policies employ mixed-integer linear programming (MILP). We
provide a novel MILP encoding that supports sophisticated specifications in the
form of temporal logic constraints. It is able to handle an arbitrary number of
such specifications. Yet, randomization and memory are often mandatory to
achieve satisfactory policies. First, we extend our encoding to deliver a
restricted class of randomized policies. Second, based on the results of the
original MILP, we employ a preprocessing of the POMDP to encompass memory-based
decisions. The advantages of our approach over state-of-the-art POMDP solvers
lie (1) in the flexibility to strengthen simple deterministic policies without
losing computational tractability and (2) in the ability to enforce the
provable satisfaction of arbitrarily many specifications. The latter point allows trade-offs between the performance and safety aspects of typical POMDP examples to be taken into account. We show the effectiveness of our method on a broad range of benchmarks.
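For context, the baseline MILP for stationary deterministic policies (which the paper's encoding extends with temporal-logic constraints) can be sketched compactly. The version below, written with the PuLP modeling library, handles only a discounted-reward objective with deterministic observations; it is a simplified illustration, not the paper's encoding, and every name in it is hypothetical.

```python
import pulp

def deterministic_policy_milp(S, A, Z, obs, T, R, b0, gamma=0.95, M=1e3):
    """Sketch of the classical MILP for a stationary deterministic
    observation-based policy maximizing expected discounted reward.

    obs[s]  -> observation emitted by state s (deterministic here)
    T[s][a] -> list of (s_next, prob); R[s][a] -> immediate reward
    M must dominate the value range for the big-M trick to be sound.
    """
    m = pulp.LpProblem("det_pomdp_policy", pulp.LpMaximize)
    V = {s: pulp.LpVariable(f"V_{s}", lowBound=-M, upBound=M) for s in S}
    sig = {(z, a): pulp.LpVariable(f"sig_{z}_{a}", cat="Binary")
           for z in Z for a in A}
    # each observation selects exactly one action
    for z in Z:
        m += pulp.lpSum(sig[z, a] for a in A) == 1
    # Bellman upper bounds, deactivated by big-M for non-chosen actions
    for s in S:
        for a in A:
            m += (V[s] <= R[s][a]
                  + gamma * pulp.lpSum(p * V[s2] for s2, p in T[s][a])
                  + M * (1 - sig[obs[s], a]))
    # maximize the value of the initial belief b0
    m += pulp.lpSum(b0[s] * V[s] for s in S)
    m.solve()
    return {z: next(a for a in A if sig[z, a].value() > 0.5) for z in Z}
```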
Algorithms for stochastic finite memory control of partially observable systems
A partially observable Markov decision process (POMDP) is a mathematical framework for planning and control problems in which actions have stochastic effects and observations provide uncertain state information. It is widely used for research in decision-theoretic planning and reinforcement learning. To cope with partial observability, a policy (or plan) must use memory, and previous work has shown that a finite-state controller provides a good policy representation. This thesis considers a previously developed bounded policy iteration algorithm for POMDPs that finds policies taking the form of stochastic finite-state controllers. Two improvements of this algorithm are developed. The first is a simplification of the basic linear program used to find improved controllers, which yields a considerable speed-up over the original algorithm. The second is a branch-and-bound algorithm for adding the best possible node to the controller, which provides an error bound and a test for global optimality. Experimental results show that these enhancements significantly improve the algorithm's performance.
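The basic linear program in question is the node-improvement LP of bounded policy iteration: for a fixed controller node, search for randomized action and successor distributions that raise the node's value in every state by some epsilon > 0. A hedged sketch follows, again using PuLP with illustrative data structures; the thesis's simplification and its branch-and-bound extension are not reproduced here.

```python
import pulp

def improve_node(n, S, A, Z, N, T, O, R, V, gamma=0.95):
    """Sketch of the bounded-policy-iteration node-improvement LP.

    T[s][a]      -> list of (s_next, prob)
    O[(s2, a)][z] -> prob of observing z in s2 after taking action a
    V[(m, s)]    -> current value of controller node m at state s
    Returns (epsilon, action distribution, joint successor distribution).
    """
    lp = pulp.LpProblem("node_improvement", pulp.LpMaximize)
    eps = pulp.LpVariable("eps")
    x = {a: pulp.LpVariable(f"x_{a}", lowBound=0) for a in A}
    y = {(a, z, m): pulp.LpVariable(f"y_{a}_{z}_{m}", lowBound=0)
         for a in A for z in Z for m in N}
    lp += eps                                    # maximize the improvement
    lp += pulp.lpSum(x[a] for a in A) == 1       # x is a distribution
    for a in A:
        for z in Z:
            # y marginalizes back to x for every observation
            lp += pulp.lpSum(y[a, z, m] for m in N) == x[a]
    for s in S:
        future = pulp.lpSum(
            p * O[(s2, a)][z] * V[(m, s2)] * y[a, z, m]
            for a in A for s2, p in T[s][a] for z in Z for m in N)
        # the new node must beat the old value in every state by eps
        lp += (pulp.lpSum(R[s][a] * x[a] for a in A) + gamma * future
               >= V[(n, s)] + eps)
    lp.solve()
    return (eps.value(), {a: x[a].value() for a in A},
            {k: v.value() for k, v in y.items()})
```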
On Polynomial Sized MDP Succinct Policies
Policies of Markov Decision Processes (MDPs) determine the next action to
execute from the current state and, possibly, the history (the past states).
When the number of states is large, succinct representations are often used to
compactly represent both the MDPs and the policies in a reduced amount of
space. In this paper, some problems related to the size of succinctly
represented policies are analyzed. Namely, it is shown that some MDPs have
policies that can only be represented in space super-polynomial in the size of
the MDP, unless the polynomial hierarchy collapses. This fact motivates the
study of the problem of deciding whether a given MDP has a policy of a given
size and reward. Since some algorithms for MDPs work by finding a succinct
representation of the value function, the problem of deciding the existence of
a succinct representation of a value function of a given size and reward is
also considered.
Task-Guided Inverse Reinforcement Learning Under Partial Information
We study the problem of inverse reinforcement learning (IRL), where the
learning agent recovers a reward function using expert demonstrations. Most of
the existing IRL techniques make the often unrealistic assumption that the
agent has access to full information about the environment. We remove this
assumption by developing an algorithm for IRL in partially observable Markov
decision processes (POMDPs). The algorithm addresses several limitations of
existing techniques that do not take the information asymmetry between the
expert and the learner into account. First, it adopts causal entropy, rather than the entropy used in most existing IRL techniques, as the measure of the likelihood of the expert demonstrations, thereby avoiding a common source of algorithmic complexity. Second, it incorporates task specifications expressed in temporal
logic into IRL. Such specifications may be interpreted as side information
available to the learner a priori in addition to the demonstrations and may
reduce the information asymmetry. Nevertheless, the resulting formulation is
still nonconvex due to the intrinsic nonconvexity of the so-called forward
problem, i.e., computing an optimal policy given a reward function, in POMDPs.
We address this nonconvexity through sequential convex programming and
introduce several extensions to solve the forward problem in a scalable manner.
This scalability allows computing policies that incorporate memory and outperform memoryless policies, at the expense of added computational cost. We
demonstrate that, even with severely limited data, the algorithm learns reward
functions and policies that satisfy the task and induce a similar behavior to
the expert by leveraging the side information and incorporating memory into the
policy.
Comment: Initial submission to ICAPS 202
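To fix ideas, the outer loop of feature-matching maximum-entropy-style IRL can be sketched in a few lines. This is a deliberately simplified illustration, not the paper's sequential-convex-programming formulation; in particular, `solve_forward` stands in for the nonconvex forward problem (computing a policy and its feature expectations for a candidate reward), which is exactly the step the paper makes scalable.

```python
import numpy as np

def irl_outer_loop(phi, expert_feat, solve_forward, n_iters=100, lr=0.1):
    """Illustrative feature-matching IRL outer loop (assumed pieces only).

    phi           : |S| x d feature matrix, r_theta(s) = phi[s] @ theta
    expert_feat   : d-vector of empirical expert feature expectations
    solve_forward : theta -> d-vector of feature expectations under a
                    (near-)optimal policy for r_theta; in a POMDP this
                    forward problem is itself nonconvex and expensive
    """
    theta = np.zeros(phi.shape[1])
    for _ in range(n_iters):
        learner_feat = solve_forward(theta)
        grad = expert_feat - learner_feat   # likelihood-gradient direction
        theta += lr * grad
        if np.linalg.norm(grad) < 1e-4:     # feature expectations matched
            break
    return theta
```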