Certified Reinforcement Learning with Logic Guidance
This paper proposes the first model-free Reinforcement Learning (RL)
framework to synthesise policies for unknown, continuous-state Markov
Decision Processes (MDPs) such that a given linear temporal property is
satisfied. We convert the given property into a Limit Deterministic Büchi
Automaton (LDBA), i.e. a finite-state machine expressing the property.
Exploiting the structure of the LDBA, we shape a synchronous reward function
on-the-fly, so that an RL algorithm can synthesise a policy resulting in traces
that probabilistically satisfy the linear temporal property. This probability
(certificate) is also calculated in parallel with policy learning when the
state space of the MDP is finite: as such, the RL algorithm produces a policy
that is certified with respect to the property. Under the assumption of finite
state space, theoretical guarantees are provided on the convergence of the RL
algorithm to an optimal policy, maximising the above probability. We also show
that our method produces "best available" control policies when the logical
property cannot be satisfied. In the general case of a continuous state space,
we propose a neural network architecture for RL and empirically show that
the algorithm finds satisfying policies, if such policies exist. The
performance of the proposed framework is evaluated via a set of numerical
examples and benchmarks, where we observe an improvement of one order of
magnitude in the number of iterations required for the policy synthesis,
compared to existing approaches, whenever available.
Comment: This article draws from arXiv:1801.08099, arXiv:1809.0782
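As a concrete illustration of the construction, the sketch below runs tabular Q-learning on the synchronous product of a toy MDP and a two-state automaton, with reward emitted on-the-fly whenever the automaton reaches its accepting set. The five-state chain, the "eventually goal" property, and all constants are illustrative assumptions, not the paper's actual benchmarks or LDBA construction.

```python
import random
from collections import defaultdict

# Toy automaton for "eventually goal": q0 --goal--> q1 (accepting, absorbing).
ldba_delta = {("q0", "goal"): "q1", ("q0", "other"): "q0",
              ("q1", "goal"): "q1", ("q1", "other"): "q1"}
accepting = {"q1"}

def label(state):
    """Map an MDP state to its atomic-proposition label (assumed known)."""
    return "goal" if state == 4 else "other"

def step(state, action):
    """Toy 5-state chain: move left/right, with a 10% chance the move flips."""
    if random.random() < 0.1:
        action = -action
    return max(0, min(4, state + action))

Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.99, 0.2
for episode in range(2000):
    s, q = 0, "q0"                            # product state: (MDP, automaton)
    for _ in range(50):
        if random.random() < eps:
            a = random.choice([-1, 1])
        else:
            a = max([-1, 1], key=lambda b: Q[(s, q, b)])
        s2 = step(s, a)
        q2 = ldba_delta[(q, label(s2))]
        r = 1.0 if q2 in accepting else 0.0   # reward shaped from the automaton
        target = r + gamma * max(Q[(s2, q2, b)] for b in [-1, 1])
        Q[(s, q, a)] += alpha * (target - Q[(s, q, a)])
        s, q = s2, q2
```

Because the reward is defined purely on the automaton's accepting set, maximising expected return pushes the learner toward traces that satisfy the property, which is the intuition behind the certificate the abstract describes.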
Reinforcement Learning of Action and Query Policies with LTL Instructions under Uncertain Event Detector
Reinforcement learning (RL) with linear temporal logic (LTL) objectives can
allow robots to carry out symbolic event plans in unknown environments. Most
existing methods assume that the event detector can accurately map
environmental states to symbolic events; however, uncertainty is inevitable for
real-world event detectors. Such uncertainty in an event detector generates
multiple branching possibilities over LTL instructions, confusing action
decisions. Moreover, the queries to the uncertain event detector, necessary for
the task's progress, may increase the uncertainty further. To cope with these
issues, we propose an RL framework, Learning Action and Query over Belief LTL
(LAQBL), to train an agent that can consider the diversity of LTL instructions
due to uncertain event detection while avoiding task failure due to
unnecessary event-detection queries. Our framework simultaneously learns 1) an
embedding of belief LTL, i.e. the multiple branching possibilities over LTL
instructions, using a graph neural network, 2) an action policy, and 3) a query
policy that decides whether or not to query the event detector.
Simulations in a 2D grid world and image-input robotic inspection environments
show that our method successfully learns actions to follow LTL instructions
even with uncertain event detectors.
Comment: 8 pages, Accepted by Robotics and Automation Letters (RA-L
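To make the role of the query policy concrete, here is a minimal sketch of the belief-tracking component: a Bayes update over two possible branches of an LTL instruction under a noisy event detector, with a hand-coded ambiguity threshold standing in for the learned query policy. The detector accuracy, the two-hypothesis belief, and the query cost are assumptions for illustration, not LAQBL's architecture.

```python
import numpy as np

DETECTOR_ACC = 0.8   # assumed probability the event detector reports correctly
QUERY_COST = 0.05    # assumed small penalty discouraging unnecessary queries

def bayes_update(belief, report):
    """belief[i] = P(branch i of the LTL instruction is the true one).
    'report' is the detector's (noisy) claim that branch 0's event fired."""
    likelihood = np.array([DETECTOR_ACC, 1 - DETECTOR_ACC])
    if not report:
        likelihood = 1 - likelihood
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])        # two branching possibilities, equally likely
true_branch = 0
for t in range(10):
    # Query only while the belief is still ambiguous (a stand-in for a
    # learned query policy that trades query cost against task failure).
    if belief.max() < 0.9:
        report = (np.random.rand() < DETECTOR_ACC) == (true_branch == 0)
        belief = bayes_update(belief, report)
        print(f"t={t} queried, belief={belief.round(3)} (cost {QUERY_COST})")
    else:
        print(f"t={t} act on branch {belief.argmax()}")
        break
```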
Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of LTL formulas
We demonstrate a reinforcement learning agent which uses a compositional
recurrent neural network that takes as input an LTL formula and determines
satisfying actions. The input LTL formulas have never been seen before, yet the
network performs zero-shot generalization to satisfy them. This is a novel form
of multi-task learning for RL agents where agents learn from one diverse set of
tasks and generalize to a new set of diverse tasks. The formulation of the
network enables this capacity to generalize. We demonstrate this ability in two
domains. In a symbolic domain, the agent finds a sequence of letters that is
accepted. In a Minecraft-like environment, the agent finds a sequence of
actions that conform to the formula. While prior work could learn to execute
one formula reliably given examples of that formula, we demonstrate how to
encode all formulas reliably. This could form the basis of new multi-task agents
that discover sub-tasks and execute them without any additional training, as
well as agents that follow more complex linguistic commands. The
structures required for this generalization are specific to LTL formulas, which
opens up an interesting theoretical question: what structures are required in
neural networks for zero-shot generalization to different logics?
Comment: Accepted in IROS 202
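A minimal sketch of the compositional idea: embed an LTL formula by recursing over its syntax tree, applying one small (here randomly initialised, untrained) map per operator, so that formulas never seen before still receive well-defined embeddings. The operator set, dimensions, and tuple-based syntax are illustrative assumptions; the paper's recurrent network and training procedure are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
props = {p: rng.normal(size=D) for p in ["a", "b", "c"]}   # leaf embeddings
unary = {op: rng.normal(size=(D, D))
         for op in ["not", "next", "eventually", "always"]}
binary = {op: rng.normal(size=(D, 2 * D)) for op in ["and", "or", "until"]}

def embed(formula):
    """formula: nested tuples, e.g. ("until", "a", ("eventually", "b"))."""
    if isinstance(formula, str):                  # atomic proposition
        return props[formula]
    op, *args = formula
    if op in unary:
        return np.tanh(unary[op] @ embed(args[0]))
    return np.tanh(binary[op] @ np.concatenate([embed(args[0]),
                                                embed(args[1])]))

# An unseen formula still maps to an embedding by construction.
phi = ("until", ("not", "a"), ("and", "b", ("eventually", "c")))
print(embed(phi)[:4])   # this vector would condition the agent's policy
```

The design choice doing the work is that the network's structure mirrors the formula's structure, so generalization to new formulas is generalization over recombinations of already-learned operator modules.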
Learning to Follow Instructions in Text-Based Games
Text-based games present a unique class of sequential decision-making problems
in which agents interact with a partially observable, simulated environment via
actions and observations conveyed through natural language. Such observations
typically include instructions that, in a reinforcement learning (RL) setting,
can directly or indirectly guide a player towards completing reward-worthy
tasks. In this work, we study the ability of RL agents to follow such
instructions. We conduct experiments that show that the performance of
state-of-the-art text-based game agents is largely unaffected by the presence
or absence of such instructions, and that these agents are typically unable to
execute tasks to completion. To further study and address the task of
instruction following, we equip RL agents with an internal structured
representation of natural language instructions in the form of Linear Temporal
Logic (LTL), a formal language that is increasingly used for temporally
extended reward specification in RL. Our framework both supports and highlights
the benefit of understanding the temporal semantics of instructions and of
measuring progress towards the achievement of such temporally extended behaviour.
Experiments with 500+ games in TextWorld demonstrate the superior performance
of our approach.
Comment: NeurIPS 202
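One standard way to measure progress towards an LTL instruction is formula progression (Bacchus and Kabanza): after each observation, the formula is rewritten into what remains to be satisfied. The sketch below implements progression for a small fragment of LTL over a tuple-based syntax; the fragment and the syntax are assumptions for illustration, not necessarily the representation used in the paper.

```python
def prog(phi, obs):
    """Rewrite phi given the set 'obs' of propositions true this step."""
    if phi in (True, False):
        return phi
    if isinstance(phi, str):                      # atomic proposition
        return phi in obs
    op, *args = phi
    if op == "not":
        p = prog(args[0], obs)
        return (not p) if isinstance(p, bool) else ("not", p)
    if op == "next":                              # X p: p is due next step
        return args[0]
    if op == "and":
        l, r = prog(args[0], obs), prog(args[1], obs)
        if l is False or r is False: return False
        if l is True: return r
        if r is True: return l
        return ("and", l, r)
    if op == "or":
        l, r = prog(args[0], obs), prog(args[1], obs)
        if l is True or r is True: return True
        if l is False: return r
        if r is False: return l
        return ("or", l, r)
    if op == "eventually":                        # F p  ==  p or X F p
        p = prog(args[0], obs)
        return True if p is True else phi if p is False else ("or", p, phi)
    raise ValueError(op)

# "eventually (take_key and next eventually open_door)"
phi = ("eventually", ("and", "take_key",
                      ("next", ("eventually", "open_door"))))
phi = prog(phi, {"take_key"})   # the instruction progresses after key pickup
print(phi)                      # remaining obligation now mentions open_door
```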
Temporal-Logic-Based Reward Shaping for Continuing Learning Tasks
In continuing tasks, average-reward reinforcement learning may be a more
appropriate problem formulation than the more common discounted reward
formulation. As in the discounted setting, learning an optimal policy typically
requires a large amount of training experience. Reward shaping is a common
approach for incorporating domain knowledge into reinforcement learning in
order to speed up convergence to an optimal policy. However, to the best of our
knowledge, the theoretical properties of reward shaping have thus far only been
established in the discounted setting. This paper presents the first reward
shaping framework for average-reward learning and proves that, under standard
assumptions, the optimal policy under the original reward function can be
recovered. In order to avoid the need for manual construction of the shaping
function, we introduce a method for utilizing domain knowledge expressed as a
temporal logic formula. The formula is automatically translated to a shaping
function that provides additional reward throughout the learning process. We
evaluate the proposed method on three continuing tasks. In all cases, shaping
speeds up the average-reward learning rate without any reduction in the
performance of the learned policy compared to relevant baselines.
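To make the shaping mechanism concrete, the sketch below applies potential-based shaping, i.e. shaped reward r + Φ(s') − Φ(s), inside an R-learning-style average-reward update on a toy continuing chain task, with a hand-coded potential Φ standing in for the shaping function derived from a temporal logic formula. The chain, the potential values, and the learning constants are illustrative assumptions, not the paper's automatic translation.

```python
import random
from collections import defaultdict

# Assumed potential: states closer to the recurring goal get higher values.
phi = {0: 0.0, 1: 0.5, 2: 1.0, 3: 1.5, 4: 2.0}

def env_step(s, a):
    s2 = max(0, min(4, s + a))
    r = 1.0 if s2 == 4 else 0.0     # recurring reward at the goal state
    if s2 == 4:
        s2 = 0                      # continuing task: loop back, no terminal
    return s2, r

Q = defaultdict(float)
rho = 0.0                           # running estimate of the average reward
alpha, beta, eps = 0.1, 0.01, 0.2
s = 0
for t in range(20000):
    if random.random() < eps:
        a = random.choice([-1, 1])
    else:
        a = max([-1, 1], key=lambda b: Q[(s, b)])
    s2, r = env_step(s, a)
    shaped = r + phi[s2] - phi[s]   # potential-based shaping term
    delta = shaped - rho + max(Q[(s2, b)] for b in [-1, 1]) - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    rho += beta * delta             # simplified: updated on every step
    s = s2
```

Because the shaping term telescopes along any recurrent cycle, it leaves the long-run average reward of every policy unchanged while densifying the learning signal, which is the property the paper's framework establishes for this setting.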
Nonparametric Inverse Reinforcement Learning and Approximate Optimal Control with Temporal Logic Tasks
This thesis focuses on two key problems in reinforcement learning: How can reward functions be designed to obtain intended behaviors in autonomous systems using learning-based control? Given a complex mission specification, how can the reward function be shaped to achieve fast convergence and reduce sample complexity while learning the optimal policy?

To answer these questions, the first part of this thesis investigates inverse reinforcement learning (IRL) methods, whose purpose is to learn a reward function from expert demonstrations. However, existing algorithms often assume that the expert demonstrations are generated by the same reward function. Such an assumption may be invalid, as one may need to aggregate data from multiple experts to obtain a sufficient set of demonstrations. In this first and major part of the thesis, we develop a novel method, called Non-parametric Behavior Clustering IRL. This algorithm allows one to simultaneously cluster behaviors while learning their reward functions from demonstrations that are generated by more than one expert/behavior. Our approach is built upon the expectation-maximization formulation and non-parametric clustering in the IRL setting. We apply the algorithm to learn, from driving demonstrations, multiple driver behaviors (e.g., aggressive vs. evasive driving behaviors).

In the second part, we study whether reinforcement learning can be used to generate complex behaviors specified in a formal logic, Linear Temporal Logic (LTL). Such LTL tasks may specify temporally extended goals, safety, surveillance, and reactive behaviors in a dynamic environment. We introduce reward shaping under LTL constraints to improve the rate of convergence in learning the optimal and probably correct policies. Our approach exploits the relation between reward shaping and actor-critic methods to speed up convergence and, as a consequence, reduce the number of training samples. We integrate compositional reasoning from formal methods with actor-critic reinforcement learning algorithms to initialize a heuristic value function for reward shaping. This initialization can direct the agent towards efficient planning subject to more complex behavior specifications in LTL. This investigation takes an initial step toward integrating machine learning with formal methods and contributes to building highly autonomous and self-adaptive robots for complex missions.
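To illustrate the expectation-maximization skeleton behind behavior clustering, the sketch below soft-clusters demonstrations (reduced to feature-expectation vectors) while fitting one linear reward per cluster. The fixed cluster count, the synthetic features, and the simplified E and M steps are assumptions for illustration; the thesis's non-parametric machinery and the full IRL inner loop are elided.

```python
import numpy as np

rng = np.random.default_rng(1)
# 40 synthetic demos from two behaviors, each summarized by 3 expected features.
demos = np.vstack([rng.normal([1.0, 0.0, 0.5], 0.1, size=(20, 3)),  # "aggressive"
                   rng.normal([0.0, 1.0, 0.5], 0.1, size=(20, 3))]) # "evasive"

K = 2
w = rng.normal(size=(K, 3))            # one linear reward weight vector per cluster
pi = np.full(K, 1.0 / K)               # cluster priors
for it in range(50):
    # E-step: responsibility of cluster k for demo n, treating the reward
    # w_k . f_n as the (unnormalised) log-likelihood of the demo.
    logit = demos @ w.T + np.log(pi)
    logit -= logit.max(axis=1, keepdims=True)
    resp = np.exp(logit)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-fit each reward toward the demos it is responsible for.
    w = (resp.T @ demos) / resp.sum(axis=0)[:, None]
    pi = resp.mean(axis=0)

print(np.round(w, 2))   # rows should recover the two behavior prototypes
```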