Human-in-the-Loop Synthesis for Partially Observable Markov Decision Processes
We study planning problems where autonomous agents operate inside
environments that are subject to uncertainties and not fully observable.
Partially observable Markov decision processes (POMDPs) are a natural formal
model to capture such problems. Because of the potentially huge or even
infinite belief space in POMDPs, synthesis with safety guarantees is, in
general, computationally intractable. We propose an approach that aims to
circumvent this difficulty: in scenarios that can be partially or fully
simulated in a virtual environment, we actively integrate a human user to
control an agent. While the user repeatedly tries to safely guide the agent in
the simulation, we collect data from the human input. Via behavior cloning, we
translate the data into a strategy for the POMDP. The strategy resolves all
nondeterminism and non-observability of the POMDP, resulting in a discrete-time
Markov chain (MC). The efficient verification of this MC gives quantitative
insights into the quality of the inferred human strategy by proving or
disproving given system specifications. For the case that the quality of the
strategy is not sufficient, we propose a refinement method using
counterexamples presented to the human. Experiments show that by including
humans in the POMDP verification loop, we improve the state of the art by
orders of magnitude in terms of scalability.
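To make the pipeline concrete, here is a minimal Python sketch, assuming a
small tabular POMDP: behavior cloning by majority vote over observations, the
Markov chain induced by the cloned strategy, and a basic reachability check on
that chain. The function names, the demonstration format, and the fixed-point
verification routine are illustrative assumptions, not the paper's tool (which
applies probabilistic model checking to the induced MC).

```python
# Illustrative sketch of the human-in-the-loop pipeline; all names and
# data formats are assumptions, not the authors' implementation.
from collections import Counter, defaultdict

def clone_strategy(demos):
    """Behavior cloning by majority vote: map each observation to the
    action the human chose most often under that observation."""
    votes = defaultdict(Counter)
    for trace in demos:                      # trace = [(obs, action), ...]
        for obs, act in trace:
            votes[obs][act] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in votes.items()}

def induced_chain(states, trans, obs_fn, strategy):
    """Resolve the POMDP with the cloned strategy, yielding a
    discrete-time Markov chain P[s][s'] over the hidden state space."""
    P = {s: defaultdict(float) for s in states}
    for s in states:
        a = strategy[obs_fn(s)]              # strategy acts on observations
        for s2, p in trans(s, a).items():    # trans(s, a) -> {s2: prob}
            P[s][s2] += p
    return P

def reach_probability(P, states, bad, eps=1e-9):
    """Probability of ever reaching a 'bad' state from each state,
    via fixed-point iteration (a basic MC reachability check)."""
    x = {s: 1.0 if s in bad else 0.0 for s in states}
    while True:
        x2 = {s: 1.0 if s in bad else sum(p * x[t] for t, p in P[s].items())
              for s in states}
        if max(abs(x2[s] - x[s]) for s in states) < eps:
            return x2
        x = x2
```

Checking a specification such as "the probability of reaching an unsafe state
is at most 0.1" then reduces to comparing the computed reachability values
against the threshold.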
Human-in-the-Loop Mixed-Initiative Control under Temporal Tasks
This paper considers the motion control and task planning problem of mobile
robots under complex high-level tasks and human initiatives. The assigned task
is specified as a Linear Temporal Logic (LTL) formula that consists of hard and
soft constraints. The human initiative influences the robot autonomy in two
explicit ways: with additive terms in the continuous controller and with
contingent task assignments. We propose an online coordination scheme that
encapsulates (i) a mixed-initiative continuous controller that ensures all-time
safety despite possible human errors, (ii) a plan adaptation scheme that
accommodates new features discovered in the workspace and short-term tasks
assigned by the operator during run time, and (iii) an iterative inverse
reinforcement learning (IRL) algorithm that allows the robot to asymptotically
learn the human's preference over the plan-synthesis parameters. The
results are demonstrated by both realistic human-in-the-loop simulations and
experiments.
Comment: 8 pages, 7 figures, IEEE International Conference on Robotics and Automation
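As an illustration of item (i), the sketch below blends an autonomous control
input with the human's additive term through a gain that vanishes near the
unsafe set, so safety is preserved regardless of human error. The gain shape,
the distance function, and all numbers are illustrative assumptions, not the
paper's controller.

```python
# Minimal mixed-initiative blending sketch (illustrative, not the paper's
# controller): the human term is attenuated as the robot approaches the
# unsafe set, so the autonomy retains authority where safety is at stake.
import numpy as np

def kappa(dist_to_unsafe, d_safe=1.0):
    """Gain in [0, 1]: full human authority far from the unsafe set,
    none at its boundary (dist_to_unsafe = 0)."""
    return float(np.clip(dist_to_unsafe / d_safe, 0.0, 1.0))

def mixed_initiative_input(u_robot, u_human, dist_to_unsafe):
    """Additive blending: u = u_robot + kappa(x) * u_human."""
    return u_robot + kappa(dist_to_unsafe) * u_human

# Near an obstacle (distance 0.2), the human's input is scaled down:
u = mixed_initiative_input(np.array([1.0, 0.0]),   # autonomy: move right
                           np.array([0.0, -2.0]),  # human: push down
                           dist_to_unsafe=0.2)
print(u)  # [ 1.  -0.4]
```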
Synthesis of Provably Correct Autonomy Protocols for Shared Control
We synthesize shared control protocols subject to probabilistic temporal
logic specifications. More specifically, we develop a framework in which a
human and an autonomy protocol can issue commands to carry out a certain task.
We blend these commands into a joint input to a robot. We model the interaction
between the human and the robot as a Markov decision process (MDP) that
represents the shared control scenario. Using inverse reinforcement learning,
we obtain an abstraction of the human's behavior and decisions. We use
randomized strategies to account for randomness in the human's decisions, caused
by factors such as the complexity of the task specifications or imperfect
interfaces.
We design the autonomy protocol to ensure that the resulting robot behavior
satisfies given safety and performance specifications in probabilistic temporal
logic. Additionally, the resulting strategies generate behavior that is as
similar as possible to the behavior induced by the human's commands. We solve the
underlying problem efficiently using quasiconvex programming. Case studies
involving autonomous wheelchair navigation and unmanned aerial vehicle mission
planning showcase the applicability of our approach.
Comment: Submitted to IEEE Transactions on Automatic Control
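To illustrate the blending step, the sketch below forms a per-state convex
combination of the human's and the autonomy's randomized strategies and bisects
on the human weight. This is a simplified stand-in for the quasiconvex program
in the paper: it assumes feasibility is monotone in the weight, and
`satisfies_spec` is a hypothetical hook (e.g., a call to a probabilistic model
checker on the induced model), not the authors' API.

```python
# Simplified blending sketch (not the paper's quasiconvex program):
# pi_h and pi_a map each state to a distribution over actions.

def blend(pi_h, pi_a, b):
    """Joint randomized strategy: b * human + (1 - b) * autonomy."""
    return {s: {a: b * pi_h[s][a] + (1 - b) * pi_a[s][a] for a in pi_h[s]}
            for s in pi_h}

def max_human_weight(pi_h, pi_a, satisfies_spec, tol=1e-3):
    """Bisect on the human weight b, assuming that if a weight is safe,
    every smaller weight is safe too (monotone feasibility)."""
    if not satisfies_spec(blend(pi_h, pi_a, 0.0)):
        raise ValueError("even the pure autonomy strategy violates the spec")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if satisfies_spec(blend(pi_h, pi_a, mid)):
            lo = mid   # still safe: give the human more authority
        else:
            hi = mid
    return lo          # largest verified-safe human weight (up to tol)
```

Maximizing the human weight subject to the specifications mirrors the paper's
goal of keeping the blended behavior as close to the human's commands as
possible while still satisfying the specifications.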
Blending Controllers via Multi-Objective Bandits
Safety and performance are often two competing objectives in sequential
decision-making problems. Existing performant controllers, such as controllers
derived from reinforcement learning algorithms, often fall short of safety
guarantees. In contrast, controllers that guarantee safety, such as those
derived from classical control theory, require restrictive assumptions and are
often conservative in performance. Our goal is to blend a performant and a safe
controller to generate a single controller that is safer than the performant one
and accumulates higher rewards than the safe one. To this end, we
propose a blending algorithm using the framework of contextual multi-armed
multi-objective bandits. At each stage, the algorithm observes the
environment's current context alongside an immediate reward and a cost, the
latter serving as the underlying safety measure. The algorithm then decides
which controller to
employ based on its observations. We demonstrate that the algorithm achieves
sublinear Pareto regret, a performance measure that models coherence with an
expert that always avoids picking the controller with both inferior safety and
performance. We derive an upper bound on the loss in individual objectives,
which imposes no additional computational complexity. We empirically
demonstrate the algorithm's success in blending a safe and a performant
controller in a safety-focused testbed, the Safety Gym environment. A
statistical analysis of the blended controller's total reward and cost reflects
two key takeaways: The blended controller shows a strict improvement in
performance compared to the safe controller, and it is safer than the
performant controller.
Comment: Under review at NeurIPS 202
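The following Python sketch shows one way such a selection rule can look, with
context omitted for brevity: two arms (the safe and the performant controller),
per-arm reward and cost statistics, and a Pareto-UCB-style choice among arms
whose optimistic reward/cost vectors are undominated. The class and its details
are illustrative assumptions, not the paper's algorithm.

```python
# Pareto-UCB-style selection between two controllers (illustrative
# sketch, not the paper's algorithm; context is ignored for brevity).
import math
import random

class ParetoBandit:
    def __init__(self, n_arms=2):
        self.n = [0] * n_arms          # pulls per arm
        self.reward = [0.0] * n_arms   # cumulative reward per arm
        self.cost = [0.0] * n_arms     # cumulative cost per arm
        self.t = 0

    def _optimistic(self, i):
        """Optimistic (reward, -cost) estimate with a UCB-style bonus."""
        bonus = math.sqrt(2 * math.log(self.t) / self.n[i])
        return (self.reward[i] / self.n[i] + bonus,
                -self.cost[i] / self.n[i] + bonus)

    def select(self):
        self.t += 1
        for i in range(len(self.n)):   # play every arm once first
            if self.n[i] == 0:
                return i
        vecs = [self._optimistic(i) for i in range(len(self.n))]
        front = [i for i, v in enumerate(vecs)
                 if not any(w[0] > v[0] and w[1] > v[1] for w in vecs)]
        return random.choice(front)    # any Pareto-undominated arm

    def update(self, arm, reward, cost):
        self.n[arm] += 1
        self.reward[arm] += reward
        self.cost[arm] += cost
```

At every stage the learner plays the chosen controller, observes its reward and
safety cost, and calls update; the undominated-arm rule matches the coherence
notion behind Pareto regret, in which an expert never picks an arm that is
worse in both objectives.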