An Optimal Online Method of Selecting Source Policies for Reinforcement Learning
Transfer learning significantly accelerates the reinforcement learning
process by exploiting relevant knowledge from previous experiences. The problem
of optimally selecting source policies during the learning process is of great
importance yet challenging. There has been little theoretical analysis of this
problem. In this paper, we develop an optimal online method to select source
policies for reinforcement learning. This method formulates online source
policy selection as a multi-armed bandit problem and augments Q-learning with
policy reuse. We provide theoretical guarantees of the optimal selection
process and convergence to the optimal policy. In addition, we conduct
experiments in a grid-based robot navigation domain to demonstrate its
efficiency and robustness in comparison with a state-of-the-art transfer
learning method.
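A minimal sketch of the recipe this abstract describes, not the paper's exact algorithm: treat each source policy (plus the agent's own greedy policy) as a bandit arm, pick an arm per episode with UCB, reuse the chosen policy as the behaviour policy, and keep updating a tabular Q-function. The Gym-style environment interface, the hyperparameters, and the source-policy callables are illustrative assumptions.

```python
import numpy as np

def ucb_policy_reuse_q_learning(env, source_policies, episodes=500,
                                alpha=0.1, gamma=0.95, epsilon=0.1, c=2.0):
    """Sketch: a UCB bandit chooses which source policy to reuse each
    episode while a tabular Q-function is learned. `env` is assumed to
    follow the classic Gym reset/step interface with discrete spaces."""
    n_arms = len(source_policies) + 1          # last arm = current greedy policy
    counts, values = np.zeros(n_arms), np.zeros(n_arms)
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for ep in range(episodes):
        if ep < n_arms:                        # play every arm once first
            arm = ep
        else:                                  # then pick by upper confidence bound
            arm = int(np.argmax(values + c * np.sqrt(np.log(ep + 1) / counts)))

        s, ret, done = env.reset(), 0.0, False
        while not done:
            if arm < len(source_policies) and np.random.rand() > epsilon:
                a = source_policies[arm](s)    # reuse the selected source policy
            elif np.random.rand() < epsilon:
                a = env.action_space.sample()  # explore
            else:
                a = int(np.argmax(Q[s]))       # exploit the current Q estimates
            s2, r, done, _ = env.step(a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
            s, ret = s2, ret + r

        counts[arm] += 1                       # bandit update with the episodic return
        values[arm] += (ret - values[arm]) / counts[arm]
    return Q
```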
Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning
Bayesian policy reuse (BPR) is a general policy transfer framework for
selecting a source policy from an offline library by inferring the task belief
based on some observation signals and a trained observation model. In this
paper, we propose an improved BPR method to achieve more efficient policy
transfer in deep reinforcement learning (DRL). First, most BPR algorithms use
the episodic return as the observation signal, which carries limited
information and is not available until the end of an episode. Instead, we employ the
state transition sample, which is informative and instantaneous, as the
observation signal for faster and more accurate task inference. Second, BPR
algorithms usually require numerous samples to estimate the probability
distribution of a tabular observation model, which can be expensive or
even infeasible to learn and maintain, especially when using the state
transition sample as the signal. Hence, we propose a scalable observation model
based on fitting state transition functions of source tasks from only a small
number of samples, which can generalize to any signals observed in the target
task. Moreover, we extend the offline-mode BPR to the continual learning
setting by expanding the scalable observation model in a plug-and-play fashion,
which can avoid negative transfer when faced with new unknown tasks.
Experimental results show that our method can consistently facilitate faster
and more efficient policy transfer.
Comment: 16 pages, 6 figures, under review.
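A rough sketch of the belief update at the core of this approach, assuming each source task's fitted transition function is available as a callable and using a Gaussian likelihood around its predicted next state; the likelihood form, the utility matrix, and all names here are illustrative rather than the paper's exact observation model.

```python
import numpy as np

def update_task_belief(belief, transition, task_models, sigma=0.1):
    """Update the belief over source tasks from one (s, a, s') sample,
    the informative, instantaneous signal advocated above.
    `task_models[k](s, a)` is assumed to return the next state predicted
    by task k's fitted transition function."""
    s, a, s_next = transition
    likelihoods = np.array([
        np.exp(-np.sum((s_next - model(s, a)) ** 2) / (2 * sigma ** 2))
        for model in task_models
    ])
    posterior = belief * likelihoods
    if posterior.sum() == 0:                   # guard against numerical underflow
        return np.ones_like(belief) / len(belief)
    return posterior / posterior.sum()

def select_source_policy(belief, utility):
    """Pick the library policy with the highest expected utility under the
    current belief; utility[k, j] is policy j's expected return on task k."""
    return int(np.argmax(belief @ utility))
```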
TempLe: Learning Template of Transitions for Sample Efficient Multi-task RL
Transferring knowledge among various environments is important to efficiently
learn multiple tasks online. Most existing methods directly use the previously
learned models or previously learned optimal policies to learn new tasks.
However, these methods may be inefficient when the underlying models or optimal
policies are substantially different across tasks. In this paper, we propose
Template Learning (TempLe), the first PAC-MDP method for multi-task
reinforcement learning that can be applied to tasks with varying state/action
spaces. TempLe generates transition dynamics templates, abstractions of the
transition dynamics across tasks, to gain sample efficiency by extracting
similarities between tasks even when their underlying models or optimal
policies have limited commonalities. We present two algorithms for an "online"
and a "finite-model" setting respectively. We prove that our proposed TempLe
algorithms achieve much lower sample complexity than single-task learners or
state-of-the-art multi-task methods. We show via systematically designed
experiments that our TempLe method universally outperforms the state-of-the-art
multi-task methods (PAC-MDP or not) in various settings and regimes.
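A much-simplified illustration of the template idea for tabular tasks: pool the empirical next-state counts of (state, action) pairs whose transition distributions look alike, so each group is estimated from more samples. The greedy clustering rule and the L1 threshold are illustrative choices, not the PAC-MDP procedure analysed in the paper.

```python
import numpy as np

def build_transition_templates(counts, threshold=0.2):
    """Group (state, action) pairs with similar empirical next-state
    distributions and pool their counts. `counts` maps (state, action)
    to an array of next-state visit counts."""
    templates = []                              # each: pooled counts + member pairs
    for key, c in counts.items():
        p = c / max(c.sum(), 1)
        for tmpl in templates:
            q = tmpl["counts"] / max(tmpl["counts"].sum(), 1)
            if np.abs(p - q).sum() < threshold: # close in L1 distance -> same template
                tmpl["counts"] = tmpl["counts"] + c
                tmpl["members"].append(key)
                break
        else:                                   # no existing template matched
            templates.append({"counts": c.astype(float), "members": [key]})
    return templates
```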
Asking for Help with a Cost in Reinforcement Learning
Reinforcement learning (RL) is a powerful tool for developing
intelligent agents, and the use of neural networks makes RL techniques more
scalable to challenging real-world applications, from task-oriented dialogue
systems to autonomous driving. However, one of the major bottlenecks to the
adoption of RL is efficiency, as it often takes many time steps to learn an
acceptable policy. To address this problem, we investigate the idea of
allowing the agent to ask for advice from a teacher. We formalize this
concept in a framework called ask-for-help RL, which entails augmenting a
Markov decision process with a teacher-query action that can be taken at a
fixed cost in any state. In this task, the agent faces a dilemma between
exploration, exploitation, and teacher-querying. To make this trade-off, we
propose an action selection strategy that is rooted in the classical notion
of value-of-information, and suggest a practical implementation that is based
on deep Q-learning. This algorithm, called VOE/Q, can jointly decide between
taking a particular environment action or querying the teacher, and is
sensitive to the query cost. We then perform experiments in two domains: a
maze navigation task and the Atari game Freeway. When the teacher is
excluded, the algorithm shows substantial gains over many other exploration
strategies from the literature. With the teacher included, we again find that
the algorithm outperforms baselines. By taking advantage of the teacher,
higher cumulative reward can be achieved than with standard RL alone.
Together, our results point to a promising approach to both standard RL and
ask-for-help RL.
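A minimal sketch of the ask-or-act decision, assuming an ensemble of Q-value heads whose disagreement serves as a crude stand-in for the value-of-information estimate; the actual VOE/Q estimator and its deep Q-learning machinery are not reproduced here.

```python
import numpy as np

def choose_action_or_query(q_values_per_head, query_cost):
    """Return the greedy environment action, or the teacher-query action when
    an uncertainty-based value-of-information proxy exceeds the query cost.
    `q_values_per_head` has shape (n_heads, n_actions)."""
    mean_q = q_values_per_head.mean(axis=0)
    greedy = int(np.argmax(mean_q))
    # Proxy for value of information: how much each head would gain, on
    # average, by switching away from the ensemble's greedy action.
    voi = float(np.mean(q_values_per_head.max(axis=1) - q_values_per_head[:, greedy]))
    return "ASK_TEACHER" if voi > query_cost else greedy

# With high disagreement across heads and a cheap teacher, the sketch asks for help.
heads = np.array([[0.2, 0.9], [0.8, 0.1], [0.5, 0.6]])
print(choose_action_or_query(heads, query_cost=0.1))
```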