Reinforcement Learning With Temporal Logic Rewards
Reinforcement learning (RL) depends critically on the choice of reward
functions used to capture the desired behavior and constraints of a robot.
Usually, these are handcrafted by an expert designer and represent heuristics
for relatively simple tasks. Real-world applications typically involve more
complex tasks with rich temporal and logical structure. In this paper we take
advantage of the expressive power of temporal logic (TL) to specify complex
rules the robot should follow, and incorporate domain knowledge into learning.
We propose Truncated Linear Temporal Logic (TLTL) as a specification language
that is arguably well suited for robotics applications, together with
quantitative semantics, i.e., a robustness degree. We propose an RL approach to
learn tasks expressed as TLTL formulae that uses their associated robustness
degrees as reward functions, instead of manually crafted heuristics trying
to capture the same specifications. We show in simulated trials that learning
is faster and policies obtained using the proposed approach outperform the ones
learned using heuristic rewards in terms of the robustness degree, i.e., how
well the tasks are satisfied. Furthermore, we demonstrate the proposed RL
approach in a toast-placing task learned by a Baxter robot.
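To make the robustness-as-reward idea concrete, here is a minimal Python sketch that scores a trajectory against a reach-and-avoid specification using standard min/max quantitative semantics. The formula, predicates, and numbers are hypothetical illustrations, not the paper's tasks or implementation.

```python
# A minimal sketch, assuming a 2-D point robot: reward an episode by the
# robustness of "eventually reach the goal AND always avoid the obstacle".
import numpy as np

def rho_reach(traj, goal, tol):
    """Robustness of 'eventually (||x - goal|| < tol)': max over time."""
    return max(tol - np.linalg.norm(x - goal) for x in traj)

def rho_avoid(traj, obstacle, margin):
    """Robustness of 'always (||x - obstacle|| > margin)': min over time."""
    return min(np.linalg.norm(x - obstacle) - margin for x in traj)

def tltl_reward(traj, goal, obstacle):
    # Conjunction takes the minimum of its conjuncts' robustness values.
    return min(rho_reach(traj, goal, tol=0.1),
               rho_avoid(traj, obstacle, margin=0.2))

# Score one (hypothetical) episode trajectory.
traj = [np.array([0.0, 0.0]), np.array([0.5, 0.4]), np.array([1.0, 1.0])]
print(tltl_reward(traj, goal=np.array([1.0, 1.0]),
                  obstacle=np.array([0.5, 1.0])))  # positive: both satisfied
```

The min over conjuncts means the episode reward is positive only when every sub-formula is satisfied, which is what makes the robustness degree usable as a graded learning signal.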
Using Experience Classification for Training Non-Markovian Tasks
Unlike the standard Reinforcement Learning (RL) model, many real-world tasks
are non-Markovian: their rewards are predicated on state history rather than
solely on the current state. Such tasks arise frequently in practical
applications such as autonomous driving, financial trading, and medical
diagnosis, and solving them can be quite challenging. We propose a novel RL
approach to
achieve non-Markovian rewards expressed in the temporal logic LTLf (Linear
Temporal Logic over Finite Traces). To this end, an encoding of linear
complexity from LTLf into MDPs (Markov Decision Processes) is introduced to
take advantage of advanced RL algorithms. Then, a prioritized experience replay
technique based on the automaton structure (semantically equivalent to the
LTLf specification) is utilized to improve the training process. We empirically
evaluate several benchmark problems augmented with non-Markovian tasks to
demonstrate the feasibility and effectiveness of our approach.
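The core construction can be illustrated with a small sketch: pair each environment state with the state of an automaton that tracks progress toward the LTLf specification, and reward transitions into accepting states. The tiny DFA below (for "eventually a, then eventually b") and the toy environment are hypothetical; the paper's encoding and replay scheme are more general.

```python
# A minimal sketch of the product construction over a hypothetical DFA.
# DFA as {state: {symbol: next_state}}; "q2" is accepting.
DFA = {"q0": {"a": "q1", "b": "q0", "-": "q0"},
       "q1": {"a": "q1", "b": "q2", "-": "q1"},
       "q2": {"a": "q2", "b": "q2", "-": "q2"}}
ACCEPT = {"q2"}

def step_product(env_state, q, action, env_step, label):
    """One product-MDP transition: env transition plus DFA transition."""
    next_env = env_step(env_state, action)
    next_q = DFA[q][label(next_env)]        # advance automaton on the label
    reward = 1.0 if next_q in ACCEPT and q not in ACCEPT else 0.0
    return (next_env, next_q), reward

# Toy usage: the environment state is just the emitted symbol.
env_step = lambda s, a: a                   # the action chooses the symbol
label = lambda s: s
state, q, total = "-", "q0", 0.0
for a in ["-", "a", "-", "b"]:
    (state, q), r = step_product(state, q, a, env_step, label)
    total += r
print(q, total)  # q2 1.0: reward paid exactly when the DFA first accepts
```

Because the automaton state is part of the product state, the reward becomes Markovian over the product, so off-the-shelf RL algorithms apply.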
On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning
To solve a task with reinforcement learning (RL), it is necessary to formally
specify the goal of that task. Although most RL algorithms require that the
goal is formalised as a Markovian reward function, alternatives have been
developed (such as Linear Temporal Logic and Multi-Objective Reinforcement
Learning). Moreover, it is well known that some of these formalisms are able to
express certain tasks that other formalisms cannot express. However, there has
not yet been any thorough analysis of how these formalisms relate to each other
in terms of expressivity. In this work, we fill this gap in the existing
literature by providing a comprehensive comparison of the expressivities of 17
objective-specification formalisms in RL. We place these formalisms in a
preorder based on their expressive power, and present this preorder as a Hasse
diagram. We find a variety of limitations for the different formalisms, and
that no formalism is both dominantly expressive and straightforward to optimise
with current techniques. For example, we prove that each of Regularised RL,
Outer Nonlinear Markov Rewards, Reward Machines, Linear Temporal Logic, and
Limit Average Rewards can express an objective that the others cannot. Our
findings have implications for both policy optimisation and reward learning.
Firstly, we identify expressivity limitations which are important to consider
when specifying objectives in practice. Secondly, our results highlight the
need for future research which adapts reward learning to work with a variety of
formalisms, since many existing reward learning methods implicitly assume that
desired objectives can be expressed with Markovian rewards. Our work
contributes towards a more cohesive understanding of the costs and benefits of
different RL objective-specification formalisms.
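As a concrete instance of why these comparisons matter, here is a minimal sketch of a reward machine, one of the formalisms compared above: it assigns different rewards to the same environment state depending on history, something no Markovian reward over environment states can express. The two-state machine is a hypothetical illustration, not an example from the paper.

```python
# A hypothetical two-state reward machine: pay reward in state "b" only
# after event "a" has been observed at some earlier step.
class RewardMachine:
    def __init__(self):
        self.u = 0  # machine state: 0 = "a" not yet seen, 1 = "a" seen

    def step(self, label):
        """Consume the label of the new environment state; return a reward."""
        if self.u == 0 and label == "a":
            self.u = 1
            return 0.0
        return 1.0 if (self.u == 1 and label == "b") else 0.0

rm = RewardMachine()
# The same label "b" earns 0.0 before "a" and 1.0 after it: the reward
# depends on history, so it is not Markovian in the environment state.
print([rm.step(l) for l in ["b", "a", "b", "b"]])  # [0.0, 0.0, 1.0, 1.0]
```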
Automata-Guided Hierarchical Reinforcement Learning for Zero-Shot Skill Composition
An obstacle that prevents the wide adoption of (deep) reinforcement learning (RL) in control systems is its need for a large number of interactions with the environment in order to master a skill. The learned skill usually generalizes poorly across domains, and re-training is often necessary when presented with a new task. We present a framework that combines techniques from formal methods with hierarchical reinforcement learning (HRL). The set of techniques we provide allows for convenient specification of tasks with complex logic, learning of hierarchical policies (a meta-controller and low-level controllers) with well-defined intrinsic rewards using any RL method, and construction of new skills from existing ones without additional learning. We evaluate the proposed methods in a simple grid-world simulation as well as in simulation on a Baxter robot.
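The composition idea can be sketched as follows: the task automaton acts as the meta-controller, each automaton edge is served by a previously learned low-level policy, and a new task automaton over the same edge labels reuses old skills without retraining. The policies, automaton, and helper names below are hypothetical stand-ins, not the paper's implementation.

```python
# A minimal sketch; the skills, automaton, and success test are hypothetical.
low_level = {                     # skill name -> previously learned policy
    "goto_red":  lambda obs: "action_towards_red",
    "goto_blue": lambda obs: "action_towards_blue",
}

# Task FSA: state -> (skill serving the active edge, state once it succeeds).
# Composing a new task means writing a new FSA over the same skill names.
FSA = {"u0": ("goto_red", "u1"), "u1": ("goto_blue", "done")}

def meta_policy(fsa_state, obs, skill_done):
    """Run the low-level policy for the active edge; advance on success."""
    skill, nxt = FSA[fsa_state]
    if skill_done(obs, skill):
        return nxt, None                      # edge satisfied: move on
    return fsa_state, low_level[skill](obs)   # keep executing current skill

# One meta-step in which "goto_red" has not yet succeeded.
print(meta_policy("u0", obs=None, skill_done=lambda o, s: False))
```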
Learning Task Specifications from Demonstrations
Real-world applications often naturally decompose into several sub-tasks. In
many settings (e.g., robotics) demonstrations provide a natural way to specify
the sub-tasks. However, most methods for learning from demonstrations either do
not provide guarantees that the artifacts learned for the sub-tasks can be
safely recombined or limit the types of composition available. Motivated by
this deficit, we consider the problem of inferring Boolean non-Markovian
rewards (also known as logical trace properties or specifications) from
demonstrations provided by an agent operating in an uncertain, stochastic
environment. Crucially, specifications admit well-defined composition rules
that are typically easy to interpret. In this paper, we formulate the
specification inference task as a maximum a posteriori (MAP) probability
inference problem, apply the principle of maximum entropy to derive an analytic
demonstration likelihood model and give an efficient approach to search for the
most likely specification in a large candidate pool of specifications. In our
experiments, we demonstrate how learning specifications can help avoid common
problems that often arise due to ad-hoc reward composition.
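A heavily simplified sketch of the underlying intuition: prefer specifications that the demonstrations satisfy often but that random behavior satisfies rarely. This captures only the flavor of the maximum-entropy likelihood, not the paper's exact model, and the traces and candidate specifications below are hypothetical.

```python
# Rank candidate specifications: fit to demonstrations minus how easily the
# spec is satisfied by chance (estimated from random traces).
import random

def score(spec, demos, random_traces):
    p_demo = sum(map(spec, demos)) / len(demos)                  # expert fit
    p_rand = sum(map(spec, random_traces)) / len(random_traces)  # "easiness"
    return p_demo - p_rand

demos = [["a", "b"], ["a", "c", "b"]]          # hypothetical demonstrations
random.seed(0)
rand = [[random.choice("abc") for _ in range(3)] for _ in range(200)]
specs = {"eventually b": lambda t: "b" in t,
         "a before b":   lambda t: "a" in t and "b" in t
                                   and t.index("a") < t.index("b")}
# Both specs fit the demos perfectly, but "a before b" is rarer by chance,
# so it is the more informative explanation of the expert's behavior.
print(max(specs, key=lambda name: score(specs[name], demos, rand)))
```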
Formal Controller Synthesis for Continuous-Space MDPs via Model-Free Reinforcement Learning
A novel reinforcement learning scheme to synthesize policies for
continuous-space Markov decision processes (MDPs) is proposed. This scheme
enables one to apply model-free, off-the-shelf reinforcement learning
algorithms for finite MDPs to compute optimal strategies for the corresponding
continuous-space MDPs without explicitly constructing the finite-state
abstraction. The proposed approach is based on abstracting the system with a
finite MDP (without constructing it explicitly) with unknown transition
probabilities, synthesizing strategies over the abstract MDP, and then mapping
the results back over the concrete continuous-space MDP with approximate
optimality guarantees. The properties of interest for the system belong to a
fragment of linear temporal logic, known as syntactically co-safe linear
temporal logic (scLTL), and the synthesis requirement is to maximize the
probability of satisfaction within a given bounded time horizon. A key
contribution of the paper is to leverage the classical convergence results for
reinforcement learning on finite MDPs and provide control strategies maximizing
the probability of satisfaction over unknown, continuous-space MDPs while
providing probabilistic closeness guarantees. Automata-based reward functions
are often sparse; we present a novel potential-based reward shaping technique
to produce dense rewards to speed up learning. The effectiveness of the
proposed approach is demonstrated by applying it to three physical benchmarks
concerning the regulation of a room's temperature, control of a road traffic
cell, and of a 7-dimensional nonlinear model of a BMW 320i car.
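Potential-based shaping itself is a standard technique, and a minimal sketch of it over an automaton looks as follows; the potential used here (negative graph distance to the accepting states) is an assumed illustration, and the paper's construction may differ in detail.

```python
# A minimal sketch of potential-based shaping over a (hypothetical) scLTL
# automaton: Phi is higher for automaton states closer to acceptance.
GAMMA = 0.99
DIST_TO_ACCEPT = {"q0": 2, "q1": 1, "q_acc": 0}  # edges to the accepting set

def phi(q):
    return -float(DIST_TO_ACCEPT[q])

def shaped_reward(base_reward, q, q_next):
    """r' = r + gamma * Phi(q') - Phi(q): dense but policy-invariant."""
    return base_reward + GAMMA * phi(q_next) - phi(q)

print(shaped_reward(0.0, "q0", "q1"))  # > 0: progress toward acceptance
```

Because the shaping term telescopes along trajectories, it densifies the reward without changing which policies are optimal.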
A Hierarchical Reinforcement Learning Method for Persistent Time-Sensitive Tasks
Reinforcement learning has been applied to many interesting problems such as
the famous TD-gammon and the inverted helicopter flight. However, little effort
has been put into developing methods to learn policies for complex persistent
tasks and tasks that are time-sensitive. In this paper, we take a step towards
solving this problem by using signal temporal logic (STL) as the task
specification and taking advantage of the temporal abstraction that the
options framework provides. We show via simulation that a relatively
easy-to-implement algorithm combining STL and options can learn a satisfactory
policy with a small number of training cases.
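A minimal sketch of the pairing: each option runs its own low-level policy until the robustness of its STL sub-formula becomes positive, at which point control returns to the higher level. The classes, policies, and toy dynamics below are hypothetical stand-ins for the paper's algorithm.

```python
# A minimal sketch: an option terminates once its STL sub-formula's
# robustness is positive, i.e., once the sub-task is satisfied.
class Option:
    def __init__(self, policy, robustness):
        self.policy = policy          # low-level policy for the sub-task
        self.robustness = robustness  # quantitative score of the sub-formula

    def terminated(self, state):
        return self.robustness(state) > 0.0

def run_option(option, state, env_step, max_steps=100):
    """Execute one option to termination; return the resulting state."""
    for _ in range(max_steps):
        if option.terminated(state):
            break
        state = env_step(state, option.policy(state))
    return state

# Toy usage: move right along a line until "x > 1" holds.
opt = Option(policy=lambda s: 0.5, robustness=lambda s: s - 1.0)
print(run_option(opt, 0.0, env_step=lambda s, a: s + a))  # 1.5
```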
Multi-Agent Reinforcement Learning Guided by Signal Temporal Logic Specifications
Reward design is a key component of deep reinforcement learning (DRL), yet
some tasks and designers' objectives may be unnatural to define as a scalar
cost function. Among the various techniques, formal methods integrated with
DRL have
garnered considerable attention due to their expressiveness and flexibility to
define the reward and requirements for different states and actions of the
agent. However, how to leverage Signal Temporal Logic (STL) to guide
multi-agent reinforcement learning reward design remains unexplored. Complex
interactions, heterogeneous goals and critical safety requirements in
multi-agent systems make this problem even more challenging. In this paper, we
propose a novel STL-guided multi-agent reinforcement learning framework. The
STL requirements are designed to include both task specifications according to
the objective of each agent and safety specifications, and the robustness
values of the STL specifications are leveraged to generate rewards. We validate
the advantages of our method through empirical studies. The experimental
results demonstrate significant reward performance improvements compared to
MARL without STL guidance, along with a remarkable increase in the overall
safety rate of the multi-agent systems.
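The reward construction described above can be sketched as follows: each agent's reward combines the robustness of its own task specification with a shared safety specification, with min playing the role of STL conjunction. The specifications and geometry are hypothetical stand-ins.

```python
# A minimal sketch for two point agents swapping positions: each reward is
# min(task robustness, safety robustness), mirroring STL conjunction.
import numpy as np

def agent_reward(pos, goal, others, safe_dist=0.5):
    task_rho = -np.linalg.norm(pos - goal)       # closer to goal = higher
    safety_rho = min(np.linalg.norm(pos - o) - safe_dist for o in others)
    return min(task_rho, safety_rho)

positions = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
goals     = [np.array([2.0, 0.0]), np.array([0.0, 0.0])]
for i, (p, g) in enumerate(zip(positions, goals)):
    others = [q for j, q in enumerate(positions) if j != i]
    print(f"agent {i}: reward = {agent_reward(p, g, others):.2f}")
```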