Expectation Optimization with Probabilistic Guarantees in POMDPs with Discounted-sum Objectives
Partially-observable Markov decision processes (POMDPs) with discounted-sum
payoff are a standard framework to model a wide range of problems related to
decision making under uncertainty. Traditionally, the goal has been to obtain
policies that optimize the expectation of the discounted-sum payoff. A key
drawback of the expectation measure is that even low probability events with
extreme payoff can significantly affect the expectation, and thus the obtained
policies are not necessarily risk-averse. An alternate approach is to optimize
the probability that the payoff is above a certain threshold, which allows
obtaining risk-averse policies, but ignores optimization of the expectation. We
consider the expectation optimization with probabilistic guarantee (EOPG)
problem, where the goal is to optimize the expectation ensuring that the payoff
is above a given threshold with at least a specified probability. We present
several results on the EOPG problem, including the first algorithm to solve it.
Comment: Full version of a paper published at IJCAI/ECAI 201
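To make the two criteria concrete, here is a minimal Monte-Carlo sketch; it is not the paper's algorithm, and the function names and toy episode generator are hypothetical. For episodes sampled from one fixed policy, it estimates both the expected discounted payoff and the probability that the payoff clears the threshold, the two quantities the EOPG problem combines.

```python
# Hypothetical sketch, not the paper's algorithm: Monte-Carlo estimates of
# the two quantities the EOPG problem combines, for one fixed policy.
import random

def discounted_payoff(rewards, gamma=0.95):
    """Discounted-sum payoff of a single episode."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

def eopg_estimate(episodes, threshold, min_prob, gamma=0.95):
    """Return (expected payoff, P[payoff >= threshold], guarantee holds?).

    episodes: reward sequences sampled by executing the policy.
    """
    payoffs = [discounted_payoff(ep, gamma) for ep in episodes]
    expectation = sum(payoffs) / len(payoffs)
    prob_above = sum(p >= threshold for p in payoffs) / len(payoffs)
    return expectation, prob_above, prob_above >= min_prob

# Toy usage: a policy is EOPG-feasible only if the probabilistic guarantee
# holds; among feasible policies one then maximizes the expectation.
episodes = [[random.random() for _ in range(30)] for _ in range(1000)]
print(eopg_estimate(episodes, threshold=8.0, min_prob=0.9))
```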
The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models
Partially Observable Markov Decision Processes (POMDPs) are used to model
environments where the full state cannot be perceived by an agent. The agent
therefore needs to reason over its past observations and actions.
However, simply remembering the full history is generally intractable due to
the exponential growth of the history space. A probability distribution over
the true state, the belief, can serve as a sufficient statistic of the history,
but computing it requires access to the model of the environment and is often
intractable. While state-of-the-art algorithms use recurrent neural networks to
compress the observation-action history, aiming to learn a sufficient
statistic, they lack guarantees of success and can lead to sub-optimal
policies. To overcome this, we propose the Wasserstein Belief
Updater, an RL algorithm that learns a latent model of the POMDP and an
approximation of the belief update. Our approach comes with theoretical
guarantees on the quality of the approximation, ensuring that the beliefs it
outputs allow learning the optimal value function.
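For reference, the exact model-based belief update that such approaches approximate can be sketched as follows, assuming a tabular POMDP with known transition and observation tensors (these interfaces are assumptions, not the paper's learned latent model):

```python
# Minimal sketch of the classical model-based belief update that latent
# approaches approximate; T and O are assumed tabular POMDP parameters.
import numpy as np

def belief_update(belief, action, observation, T, O):
    """Exact Bayes filter for a discrete POMDP.

    belief: (S,) distribution over states
    T: (S, A, S) transition probabilities T[s, a, s']
    O: (A, S, Z) observation probabilities O[a, s', z]
    Returns the posterior b'(s') proportional to
    O[a, s', z] * sum_s T[s, a, s'] * b(s).
    """
    predicted = belief @ T[:, action, :]      # sum_s b(s) T(s, a, s')
    unnormalized = O[action, :, observation] * predicted
    norm = unnormalized.sum()
    if norm == 0.0:                           # observation impossible under model
        raise ValueError("observation inconsistent with the model")
    return unnormalized / norm
```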
Offline RL with Observation Histories: Analyzing and Improving Sample Complexity
Offline reinforcement learning (RL) can in principle synthesize better behavior
than any found in a dataset consisting only of suboptimal trials. One way that this
can happen is by "stitching" together the best parts of otherwise suboptimal
trajectories that overlap on similar states, to create new behaviors where each
individual state is in-distribution, but the overall returns are higher.
However, in many interesting and complex applications, such as autonomous
navigation and dialogue systems, the state is partially observed. Even worse,
the state representation is unknown or not easy to define. In such cases,
policies and value functions are often conditioned on observation histories
instead of states. It is then not clear whether the same kind of "stitching"
is feasible at the level of observation histories, since two different
trajectories always have different histories, and thus "similar states" that
might enable effective stitching cannot be leveraged.
Theoretically, we show that standard offline RL algorithms conditioned on
observation histories suffer from poor sample complexity, in accordance with
the above intuition. We then identify sufficient conditions under which offline
RL can still be efficient -- intuitively, it needs to learn a compact
representation of history comprising only features relevant for action
selection. We introduce a bisimulation loss that captures the extent to which
this happens, and propose that offline RL can explicitly optimize this loss to
improve worst-case sample complexity. Empirically, we show that across a
variety of tasks, either our proposed loss improves performance or the value of
this loss is already minimized as a consequence of standard offline RL,
indicating that it correlates well with good performance.
Comment: 21 pages, 4 figure
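A rough sketch of one common form of bisimulation-style loss over history encodings follows; the paper's exact loss may differ, and the encoder phi and the batch layout here are placeholders:

```python
# Hypothetical sketch of a bisimulation-style loss on history encodings;
# phi maps an observation-action history to a compact representation.
import numpy as np

def bisimulation_loss(phi, histories, rewards, next_encodings, gamma=0.99):
    """Encourage ||phi(h_i) - phi(h_j)|| to match a bisimulation-style
    distance: reward difference plus discounted successor-encoding distance."""
    loss, pairs = 0.0, 0
    for i in range(len(histories)):
        for j in range(i + 1, len(histories)):
            z_i, z_j = phi(histories[i]), phi(histories[j])
            target = abs(rewards[i] - rewards[j]) + gamma * np.linalg.norm(
                next_encodings[i] - next_encodings[j])
            loss += (np.linalg.norm(z_i - z_j) - target) ** 2
            pairs += 1
    return loss / max(pairs, 1)
```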
Shielding in Resource-Constrained Goal POMDPs
We consider partially observable Markov decision processes (POMDPs) modeling
an agent that needs a supply of a certain resource (e.g., electricity stored in
batteries) to operate correctly. The resource is consumed by the agent's
actions and can be replenished only in certain states. The agent aims to
minimize the expected cost of reaching some goal while preventing resource
exhaustion, a problem we call resource-constrained goal optimization (RSGO). We
take a
two-step approach to the RSGO problem. First, using formal methods techniques,
we design an algorithm computing a "shield" for a given scenario: a
procedure that observes the agent and prevents it from using actions that might
eventually lead to resource exhaustion. Second, we augment the POMCP heuristic
search algorithm for POMDP planning with our shields to obtain an algorithm
solving the RSGO problem. We implement our algorithm and present experiments
showing its applicability to benchmarks from the literature.
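As a simplified runtime illustration of what a shield does (the paper constructs shields with formal-methods techniques; the cost tables here are hypothetical conservative bounds):

```python
# Simplified illustration, not the paper's construction: a shield that
# blocks actions whose worst-case resource cost could leave the agent
# unable to reach a replenishment state. All tables are hypothetical.
def shielded_actions(state, resource_level, actions, worst_case_cost,
                     min_cost_to_reload):
    """Return the subset of actions that keep resource exhaustion avoidable.

    worst_case_cost[state][a]: maximum resource consumed by action a in state.
    min_cost_to_reload[state]: conservative bound on the resource needed to
        reach a replenishment state from any successor of `state`.
    """
    safe = []
    for a in actions:
        remaining = resource_level - worst_case_cost[state][a]
        if remaining >= min_cost_to_reload[state]:
            safe.append(a)
    return safe

# A planner such as POMCP then searches only over shielded_actions(...),
# so every policy it produces keeps a replenishment state reachable.
```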
Sample-based Search Methods for Bayes-Adaptive Planning
A fundamental issue for control is acting in the face of uncertainty about the
environment. Amongst other things, this induces a trade-off between exploration
and exploitation. A model-based Bayesian agent optimizes its return by
maintaining a posterior distribution over possible environments and considering
all possible future paths. This optimization is equivalent to solving a Markov
Decision Process (MDP) whose hyperstate comprises the agent's beliefs about the
environment as well as its current state in that environment. The resulting
process is called a Bayes-Adaptive MDP (BAMDP). Even for MDPs with only a few
states, it is generally intractable to solve the corresponding BAMDP exactly.
Various heuristics have been devised, but those that are computationally
tractable often perform indifferently, whereas those that perform well are
typically so expensive as to be applicable only in small domains with limited
structure. Here, we develop new tractable methods for planning in BAMDPs based
on recent advances in solving large MDPs and general partially observable MDPs.
Our algorithms are sample-based, plan online in a way that is focused on the
current belief, and, critically, avoid expensive belief updates during
simulations. In discrete domains, we use Monte-Carlo tree search to search
forward in an aggressive manner. The derived algorithm can scale to large MDPs
and provably converges to the Bayes-optimal solution asymptotically. We then
consider a more general class of simulation-based methods in which
approximation methods can be employed to let value function estimates
generalize between hyperstates during search. This allows us to tackle
continuous domains. We validate our approach empirically in standard domains by
comparison with existing approximations. Finally, we explore Bayes-adaptive
planning in environments modelled by rich, non-parametric probabilistic models.
We demonstrate that a fully Bayesian agent can be advantageous in the
exploration of complex, and even infinite, structured domains.
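The root-sampling idea sketched in this abstract, drawing one model per simulation so that no belief updates happen inside the search, might look roughly like this (a minimal sketch; the posterior.sample() and mdp.step() interfaces are assumptions for illustration):

```python
# Minimal sketch of root sampling for Bayes-adaptive MCTS: each simulation
# draws ONE model from the posterior at the root and simulates inside that
# fixed MDP, so no belief updates occur during the simulation itself.
# The `posterior` and sampled `mdp` interfaces are assumptions.
import math
from collections import defaultdict

def bayes_adaptive_mcts(posterior, root_state, actions, n_sims=1000,
                        depth=50, gamma=0.95, c=1.0):
    N = defaultdict(int)       # visits of (history node, action)
    Nn = defaultdict(int)      # visits of history node
    Q = defaultdict(float)     # running action-value estimates

    def simulate(mdp, s, d, node):
        if d == 0:
            return 0.0
        # UCT selection over the statistics stored at this history node.
        a = max(actions, key=lambda a: Q[(node, a)] + c * math.sqrt(
            math.log(Nn[node] + 1) / (N[(node, a)] + 1)))
        s2, r = mdp.step(s, a)              # step in the SAMPLED model
        ret = r + gamma * simulate(mdp, s2, d - 1, node + (a, s2))
        Nn[node] += 1
        N[(node, a)] += 1
        Q[(node, a)] += (ret - Q[(node, a)]) / N[(node, a)]
        return ret

    for _ in range(n_sims):
        mdp = posterior.sample()            # root sampling: one model per run
        simulate(mdp, root_state, depth, node=(root_state,))
    return max(actions, key=lambda a: Q[((root_state,), a)])
```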
- …