8,798 research outputs found
Verifiable Reinforcement Learning via Policy Extraction
While deep reinforcement learning has successfully solved many challenging
control tasks, its real-world applicability has been limited by the inability
to ensure the safety of learned policies. We propose an approach to verifiable
reinforcement learning by training decision tree policies, which can represent
complex policies (since they are nonparametric), yet can be efficiently
verified using existing techniques (since they are highly structured). The
challenge is that decision tree policies are difficult to train. We propose
VIPER, an algorithm that combines ideas from model compression and imitation
learning to learn decision tree policies guided by a DNN policy (called the
oracle) and its Q-function, and show that it substantially outperforms two
baselines. We use VIPER to (i) learn a provably robust decision tree policy for
a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree
policy for a toy game based on Pong that provably never loses, and (iii) learn
a provably stable decision tree policy for cart-pole. In each case, the
decision tree policy achieves performance equal to that of the original DNN
policy
Certified Reinforcement Learning with Logic Guidance
This paper proposes the first model-free Reinforcement Learning (RL)
framework to synthesise policies for unknown, and continuous-state Markov
Decision Processes (MDPs), such that a given linear temporal property is
satisfied. We convert the given property into a Limit Deterministic Buchi
Automaton (LDBA), namely a finite-state machine expressing the property.
Exploiting the structure of the LDBA, we shape a synchronous reward function
on-the-fly, so that an RL algorithm can synthesise a policy resulting in traces
that probabilistically satisfy the linear temporal property. This probability
(certificate) is also calculated in parallel with policy learning when the
state space of the MDP is finite: as such, the RL algorithm produces a policy
that is certified with respect to the property. Under the assumption of finite
state space, theoretical guarantees are provided on the convergence of the RL
algorithm to an optimal policy, maximising the above probability. We also show
that our method produces ''best available'' control policies when the logical
property cannot be satisfied. In the general case of a continuous state space,
we propose a neural network architecture for RL and we empirically show that
the algorithm finds satisfying policies, if there exist such policies. The
performance of the proposed framework is evaluated via a set of numerical
examples and benchmarks, where we observe an improvement of one order of
magnitude in the number of iterations required for the policy synthesis,
compared to existing approaches whenever available.Comment: This article draws from arXiv:1801.08099, arXiv:1809.0782
Learning Task Specifications from Demonstrations
Real world applications often naturally decompose into several sub-tasks. In
many settings (e.g., robotics) demonstrations provide a natural way to specify
the sub-tasks. However, most methods for learning from demonstrations either do
not provide guarantees that the artifacts learned for the sub-tasks can be
safely recombined or limit the types of composition available. Motivated by
this deficit, we consider the problem of inferring Boolean non-Markovian
rewards (also known as logical trace properties or specifications) from
demonstrations provided by an agent operating in an uncertain, stochastic
environment. Crucially, specifications admit well-defined composition rules
that are typically easy to interpret. In this paper, we formulate the
specification inference task as a maximum a posteriori (MAP) probability
inference problem, apply the principle of maximum entropy to derive an analytic
demonstration likelihood model and give an efficient approach to search for the
most likely specification in a large candidate pool of specifications. In our
experiments, we demonstrate how learning specifications can help avoid common
problems that often arise due to ad-hoc reward composition.Comment: NIPS 201
Neural Network Repair with Reachability Analysis
Safety is a critical concern for the next generation of autonomy that is likely to rely heavily on deep neural networks for perception and control. Formally verifying the safety and robustness of well-trained DNNs and learning-enabled cyber-physical systems (Le-CPS) under adversarial attacks, model uncertainties, and sensing errors is essential for safe autonomy. This research proposes a framework to repair unsafe DNNs in safety-critical systems with reachability analysis. The repair process is inspired by adversarial training which has demonstrated high effectiveness in improving the safety and robustness of DNNs. Different from traditional adversarial training approaches where adversarial examples are utilized from random attacks and may not be representative of all unsafe behaviors, our repair process uses reachability analysis to compute the exact unsafe regions and identify sufficiently representative examples to enhance the efficacy and efficiency of the adversarial training.
The performance of our repair framework is evaluated on two types of benchmarks without safe models as references. One is a DNN controller for aircraft collision avoidance with access to training data. The other is a rocket lander where our framework can be seamlessly integrated with the well-known deep deterministic policy gradient (DDPG) reinforcement learning algorithm. The experimental results show that our framework can successfully repair all instances on multiple safety specifications with negligible performance degradation. In addition, to increase the computational and memory efficiency of the reachability analysis algorithm in the framework, we propose a depth-first-search algorithm that combines an existing exact analysis method with an over-approximation approach based on a new set representation. Experimental results show that our method achieves a five-fold improvement in runtime and a two-fold improvement in memory usage compared to exact analysis
- …