A Joint Imitation-Reinforcement Learning Framework for Reduced Baseline Regret
In various control task domains, existing controllers provide a baseline
level of performance that -- though possibly suboptimal -- should be
maintained. Reinforcement learning (RL) algorithms that rely on extensive
exploration of the state and action space can be used to optimize a control
policy. However, fully exploratory RL algorithms may decrease performance below
a baseline level during training. In this paper, we address the issue of online
optimization of a control policy while minimizing regret w.r.t. the baseline
policy's performance. We present a joint imitation-reinforcement learning
framework, denoted JIRL. The learning process in JIRL assumes the availability
of a baseline policy and is designed with two objectives in mind: (a)
leveraging the baseline's online demonstrations to minimize regret w.r.t.
the baseline policy during training, and (b) eventually surpassing the
baseline performance. JIRL addresses these objectives by initially learning to
imitate the baseline policy and gradually shifting control from the baseline to
an RL agent. Experimental results show that JIRL effectively accomplishes the
aforementioned objectives in several continuous action-space domains. The
results demonstrate that JIRL is comparable to a state-of-the-art algorithm in
its final performance while incurring significantly lower baseline regret
during training in all of the presented domains. Moreover, the results show a
reduction factor of up to in baseline regret over a state-of-the-art
baseline regret minimization approach.
Comment: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 202
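
The core mechanism the abstract describes, gradually shifting control from the baseline to the RL agent while logging the baseline's actions as demonstrations, can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the paper's actual algorithm: the probabilistic mixing coefficient `beta`, its geometric decay, and the `baseline_policy` / `rl_agent` interfaces are all hypothetical.

```python
import random

def run_episode(env, baseline_policy, rl_agent, beta):
    """Roll out one episode, mixing baseline and RL control.

    With probability `beta` the baseline acts and its action is recorded
    as a demonstration for imitation; otherwise the RL agent acts.
    """
    obs = env.reset()
    done = False
    while not done:
        baseline_action = baseline_policy(obs)
        if random.random() < beta:
            action = baseline_action                   # baseline keeps control
            rl_agent.add_demonstration(obs, baseline_action)
        else:
            action = rl_agent.act(obs)                 # RL agent takes control
        next_obs, reward, done, _ = env.step(action)
        rl_agent.observe(obs, action, reward, next_obs, done)
        obs = next_obs

def train(env, baseline_policy, rl_agent, episodes, beta0=1.0, decay=0.99):
    beta = beta0  # start fully under baseline control to bound regret early
    for _ in range(episodes):
        run_episode(env, baseline_policy, rl_agent, beta)
        rl_agent.update()              # combined imitation + RL losses
        beta = max(0.0, beta * decay)  # gradually hand control to the agent
```

Under this reading, early training incurs little regret because the baseline still acts most of the time, while the agent accumulates both demonstrations and on-policy experience; how `beta` should actually decay (fixed schedule vs. performance-triggered) is a design choice the sketch leaves open.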
Batch Policy Learning under Constraints
When learning policies for real-world domains, two important questions arise:
(i) how to efficiently use pre-collected off-policy, non-optimal behavior data;
and (ii) how to mediate among different competing objectives and constraints.
We thus study the problem of batch policy learning under multiple constraints,
and offer a systematic solution. We first propose a flexible meta-algorithm
that admits any batch reinforcement learning and online learning procedure as
subroutines. We then present a specific algorithmic instantiation and provide
performance guarantees for the main objective and all constraints. To certify
constraint satisfaction, we propose a new and simple method for off-policy
policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves
strong empirical results in different domains, including in a challenging
problem of simulated car driving subject to multiple constraints such as lane
keeping and smooth driving. We also show experimentally that our OPE method
outperforms other popular OPE techniques on a standalone basis, especially in a
high-dimensional setting.
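
One natural way to realize a meta-algorithm that "admits any batch reinforcement learning and online learning procedure as subroutines" is a Lagrangian loop: the batch-RL subroutine best-responds to the current multipliers, and an online-learning step moves the multipliers toward violated constraints, with OPE certifying constraint values. The sketch below assumes that formulation; `batch_rl`, `ope_estimate`, and the constraint/threshold interfaces are placeholders, not the paper's API.

```python
import numpy as np

def constrained_batch_learning(data, constraints, thresholds,
                               batch_rl, ope_estimate,
                               iters=50, lr=0.1):
    """Alternate a batch-RL best response with an online multiplier update."""
    lam = np.zeros(len(constraints))   # one Lagrange multiplier per constraint
    policies = []
    for _ in range(iters):
        # Policy player: batch RL on the Lagrangian-penalized reward.
        def penalized_reward(s, a, r, costs):
            return r - lam @ costs     # costs: per-constraint cost signals
        pi = batch_rl(data, penalized_reward)
        policies.append(pi)

        # Constraint player: off-policy estimates of each constraint value,
        # then a projected gradient step on the multipliers.
        violations = np.array([ope_estimate(data, pi, g) - tau
                               for g, tau in zip(constraints, thresholds)])
        lam = np.maximum(0.0, lam + lr * violations)
    return policies  # a mixture/average of the iterates would be used in practice
```

Here projected gradient ascent stands in for whatever online-learning subroutine is plugged in; any no-regret learner over the multipliers fits the same loop.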