3,478 research outputs found
Recommended from our members
Chapter 2Â -Â Data-Driven Energy Efficient Driving Control in Connected Vehicle Environment
Control Regularization for Reduced Variance Reinforcement Learning
Dealing with high variance is a significant challenge in model-free
reinforcement learning (RL). Existing methods are unreliable, exhibiting high
variance in performance from run to run using different initializations/seeds.
Focusing on problems arising in continuous control, we propose a functional
regularization approach to augmenting model-free RL. In particular, we
regularize the behavior of the deep policy to be similar to a policy prior,
i.e., we regularize in function space. We show that functional regularization
yields a bias-variance trade-off, and propose an adaptive tuning strategy to
optimize this trade-off. When the policy prior has control-theoretic stability
guarantees, we further show that this regularization approximately preserves
those stability guarantees throughout learning. We validate our approach
empirically on a range of settings, and demonstrate significantly reduced
variance, guaranteed dynamic stability, and more efficient learning than deep
RL alone.Comment: Appearing in ICML 201
Correct-by-synthesis reinforcement learning with temporal logic constraints
We consider a problem on the synthesis of reactive controllers that optimize
some a priori unknown performance criterion while interacting with an
uncontrolled environment such that the system satisfies a given temporal logic
specification. We decouple the problem into two subproblems. First, we extract
a (maximally) permissive strategy for the system, which encodes multiple
(possibly all) ways in which the system can react to the adversarial
environment and satisfy the specifications. Then, we quantify the a priori
unknown performance criterion as a (still unknown) reward function and compute
an optimal strategy for the system within the operating envelope allowed by the
permissive strategy by using the so-called maximin-Q learning algorithm. We
establish both correctness (with respect to the temporal logic specifications)
and optimality (with respect to the a priori unknown performance criterion) of
this two-step technique for a fragment of temporal logic specifications. For
specifications beyond this fragment, correctness can still be preserved, but
the learned strategy may be sub-optimal. We present an algorithm to the overall
problem, and demonstrate its use and computational requirements on a set of
robot motion planning examples.Comment: 8 pages, 3 figures, 2 tables, submitted to IROS 201
- …