Safe Learning for Near Optimal Scheduling
In this paper, we investigate the combination of synthesis, model-based learning, and online sampling techniques to obtain safe and near-optimal schedulers for a preemptible task scheduling problem. Our algorithms can handle Markov decision processes (MDPs) with 10^20 states and beyond, which cannot be handled by state-of-the-art probabilistic model checkers. We provide probably approximately correct (PAC) guarantees for learning the model. Additionally, we extend Monte-Carlo tree search with advice, computed using safety games or obtained using the earliest-deadline-first scheduler, to safely explore the learned model online. Finally, we implemented our algorithms and compared them empirically against shielded deep Q-learning on large task systems.
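A minimal sketch of how such advice can be wired into the tree-search selection step, assuming hypothetical names (Node, advice_allows): candidate actions are filtered through the advice (e.g., membership in the winning region of a safety game, or the earliest-deadline-first recommendation) before standard UCT scoring, so the search only ever expands actions the safety analysis permits. This is an illustration, not the authors' implementation.

    from dataclasses import dataclass, field
    import math

    @dataclass
    class Node:
        state: object
        actions: list
        children: dict = field(default_factory=dict)   # action -> child Node
        visits: int = 0
        total_reward: float = 0.0

    def select_action(node, advice_allows, c=1.4):
        # Keep only the actions the advice permits for this state.
        safe = [a for a in node.actions if advice_allows(node.state, a)]
        assert safe, "sound advice must leave at least one permitted action"

        def uct(a):
            child = node.children.get(a)
            if child is None or child.visits == 0:
                return float("inf")        # visit unexplored safe actions first
            exploit = child.total_reward / child.visits
            explore = c * math.sqrt(math.log(node.visits) / child.visits)
            return exploit + explore

        return max(safe, key=uct)
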
Shielding in Resource-Constrained Goal POMDPs
We consider partially observable Markov decision processes (POMDPs) modeling
an agent that needs a supply of a certain resource (e.g., electricity stored in
batteries) to operate correctly. The resource is consumed by the agent's actions
and can be replenished only in certain states. The agent aims to minimize the
expected cost of reaching some goal while preventing resource exhaustion, a
problem we call \emph{resource-constrained goal optimization} (RSGO). We take a
two-step approach to the RSGO problem. First, using formal methods techniques,
we design an algorithm computing a \emph{shield} for a given scenario: a
procedure that observes the agent and prevents it from using actions that might
eventually lead to resource exhaustion. Second, we augment the POMCP heuristic
search algorithm for POMDP planning with our shields to obtain an algorithm
solving the RSGO problem. We implement our algorithm and present experiments
showing its applicability to benchmarks from the literature.
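For intuition, a possible shape of such a shield, with illustrative names throughout (worst_case_cost, min_level_needed): an action is blocked whenever, after paying its worst-case consumption, the remaining resource could drop below the minimal level needed, from some successor, to still reach a replenishment state; that threshold map is the kind of object the offline formal-methods step would precompute. This is a sketch under those assumptions, not the paper's algorithm.

    def shield(state, level, actions, worst_case_cost, min_level_needed):
        """Return the actions that cannot lead to resource exhaustion.

        worst_case_cost(state, action): maximal consumption of the action;
        min_level_needed[successor]: minimal resource required from that
        successor to still reach a replenishment state (assumed to be
        precomputed offline). All names here are illustrative.
        """
        permitted = []
        for action, successors in actions:
            remaining = level - worst_case_cost(state, action)
            if all(remaining >= min_level_needed[s] for s in successors):
                permitted.append(action)
        return permitted
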
Barrier Functions for Multiagent-POMDPs with DTL Specifications
Multi-agent partially observable Markov decision processes (MPOMDPs) provide a framework to represent heterogeneous autonomous agents subject to uncertainty and partial observation. In this paper, given a nominal policy provided by a human operator or a conventional planning method, we propose a technique based on barrier functions to design a minimally interfering safety-shield ensuring satisfaction of high-level specifications in terms of linear distribution temporal logic (LDTL). To this end, we use sufficient and necessary conditions for the invariance of a given set based on discrete-time barrier functions (DTBFs), and we formulate sufficient conditions for finite-time DTBFs to study finite-time convergence to a set. We then show that different LDTL mission/safety specifications can be cast as a set of invariance or finite-time reachability problems. We show that the proposed method for safety-shield synthesis can be implemented online by a sequence of one-step greedy algorithms. Finally, we demonstrate the efficacy of the proposed method using experiments involving a team of robots.
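As a point of reference, a standard invariance condition from the discrete-time barrier function literature reads as follows (a sketch; the paper's necessary-and-sufficient and finite-time variants refine this). With beliefs $b_k$ and a candidate safe set $\mathcal{S} = \{\, b : h(b) \ge 0 \,\}$,

\[
  h(b_{k+1}) - h(b_k) \;\ge\; -\alpha\big(h(b_k)\big),
  \qquad \alpha \in \mathcal{K},\ \ \alpha(r) < r \ \text{for all } r > 0,
\]

guarantees that $h(b_0) \ge 0$ implies $h(b_k) \ge 0$ for every $k$, i.e., $\mathcal{S}$ is invariant along the closed-loop trajectory.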
Certified Reinforcement Learning with Logic Guidance
This paper proposes the first model-free Reinforcement Learning (RL) framework to synthesise policies for unknown, continuous-state Markov Decision Processes (MDPs), such that a given linear temporal property is satisfied. We convert the given property into a Limit-Deterministic Büchi Automaton (LDBA), namely a finite-state machine expressing the property. Exploiting the structure of the LDBA, we shape a synchronous reward function on-the-fly, so that an RL algorithm can synthesise a policy resulting in traces that probabilistically satisfy the linear temporal property. This probability (certificate) is also calculated in parallel with policy learning when the state space of the MDP is finite: as such, the RL algorithm produces a policy that is certified with respect to the property. Under the assumption of a finite state space, theoretical guarantees are provided on the convergence of the RL algorithm to an optimal policy maximising the above probability. We also show that our method produces "best available" control policies when the logical property cannot be satisfied. In the general case of a continuous state space, we propose a neural network architecture for RL and we empirically show that the algorithm finds satisfying policies, if such policies exist. The performance of the proposed framework is evaluated via a set of numerical examples and benchmarks, where we observe an improvement of one order of magnitude in the number of iterations required for policy synthesis, compared to existing approaches whenever available.

Comment: This article draws from arXiv:1801.08099, arXiv:1809.0782
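A minimal sketch of the on-the-fly product construction behind such reward shaping, with illustrative names (env_step, ldba_delta): the agent runs on the product of the MDP and the LDBA and is rewarded exactly when the automaton enters an accepting set, so maximising expected reward pushes the policy toward property-satisfying traces. The paper's exact shaping may differ.

    def product_step(env_step, ldba_delta, s, q, a, accepting):
        """One transition of the product of the MDP and the LDBA.

        env_step(s, a) -> (s_next, label): MDP step plus the atomic
        propositions observed in s_next;
        ldba_delta(q, label) -> q_next: automaton move on that label.
        Names and signatures are assumptions, not the authors' API.
        """
        s_next, label = env_step(s, a)
        q_next = ldba_delta(q, label)
        reward = 1.0 if q_next in accepting else 0.0   # pay only on acceptance
        return (s_next, q_next), reward
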
Safe Policy Synthesis in Multi-Agent POMDPs via Discrete-Time Barrier Functions
A multi-agent partially observable Markov decision process (MPOMDP) is a
modeling paradigm used for high-level planning of heterogeneous autonomous
agents subject to uncertainty and partial observation. Despite their modeling
efficiency, MPOMDPs have not received significant attention in safety-critical
settings. In this paper, we use barrier functions to design policies for
MPOMDPs that ensure safety. Notably, our method does not rely on discretization of the belief space or on finite memory. To this end, we formulate sufficient and
necessary conditions for the safety of a given set based on discrete-time
barrier functions (DTBFs) and we demonstrate that our formulation also allows
for Boolean compositions of DTBFs for representing more complicated safe sets.
We show that the proposed method can be implemented online by a sequence of
one-step greedy algorithms as a standalone safe controller or as a
safety-filter given a nominal planning policy. We illustrate the efficiency of
the proposed methodology based on DTBFs using a high-fidelity simulation of
heterogeneous robots.

Comment: 8 pages and 4 figures
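A possible shape of the one-step greedy safety-filter, with illustrative names throughout (predict, barriers, nominal): the conjunction of several DTBF safe sets is encoded by taking the pointwise minimum of the individual barrier functions (one Boolean composition the abstract alludes to), and the nominal action is returned whenever it satisfies the one-step barrier condition. A sketch under these assumptions, not the paper's implementation.

    def one_step_greedy_filter(belief, joint_actions, predict, barriers,
                               nominal, alpha=lambda r: 0.5 * r):
        """Safety-filter built from DTBFs over beliefs.

        predict(belief, a): predicted next belief under joint action a;
        barriers: functions h_i whose safe sets are {b : h_i(b) >= 0};
        their conjunction is min_i h_i. All names are illustrative.
        """
        h = lambda b: min(h_i(b) for h_i in barriers)  # AND of safe sets
        h_now = h(belief)
        safe = [a for a in joint_actions
                if h(predict(belief, a)) - h_now >= -alpha(h_now)]
        assert safe, "no joint action satisfies the barrier condition"
        a_nom = nominal(belief)
        return a_nom if a_nom in safe else safe[0]     # defer to nominal when safe
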