Efficient Deep Reinforcement Learning via Adaptive Policy Transfer
Transfer Learning (TL) has shown great potential to accelerate Reinforcement
Learning (RL) by leveraging prior knowledge from past learned policies of
relevant tasks. Existing transfer approaches either explicitly compute the
similarity between tasks or select appropriate source policies to provide
guided exploration for the target task. However, directly optimizing the
target policy by alternately leveraging knowledge from appropriate source
policies, without explicitly measuring task similarity, remains an open problem. In
this paper, we propose a novel Policy Transfer Framework (PTF) to accelerate RL
by taking advantage of this idea. Our framework learns when and which source
policy is best to reuse for the target policy, and when to terminate it, by
modeling multi-policy transfer as an option learning problem. PTF can be
easily combined with existing deep RL approaches. Experimental results show it
significantly accelerates the learning process and surpasses state-of-the-art
policy transfer methods in terms of learning efficiency and final performance
in both discrete and continuous action spaces.
Comment: Accepted by IJCAI'2020
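To make the option-learning formulation concrete, below is a minimal sketch of how a PTF-style option head could decide which source policy to reuse and when to terminate it. This is an illustrative reading of the abstract, not the authors' implementation; the linear option-value and termination estimates, the epsilon-greedy schedule, and all class and variable names are assumptions.

```python
# Illustrative sketch: option-style selection over source policies,
# where each option corresponds to reusing one source policy.
import numpy as np

rng = np.random.default_rng(0)

class OptionHead:
    """Chooses which source policy to reuse and when to terminate it.

    Linear option-value and termination estimates over state features,
    standing in for the networks a deep implementation would learn.
    """

    def __init__(self, n_options, state_dim, lr=0.01):
        self.q = np.zeros((n_options, state_dim))     # option-value weights
        self.beta = np.zeros((n_options, state_dim))  # termination logits
        self.lr = lr
        self.active = None  # index of the currently reused source policy

    def select(self, state, eps=0.1):
        # Terminate the running option with probability sigmoid(beta . s);
        # otherwise keep reusing the same source policy.
        if self.active is not None:
            p_term = 1.0 / (1.0 + np.exp(-self.beta[self.active] @ state))
            if rng.random() >= p_term:
                return self.active
        # Pick a new option epsilon-greedily over option values.
        if rng.random() < eps:
            self.active = int(rng.integers(len(self.q)))
        else:
            self.active = int(np.argmax(self.q @ state))
        return self.active

    def update(self, state, option, td_target):
        # One-step semi-gradient update toward the TD target.
        td_err = td_target - self.q[option] @ state
        self.q[option] += self.lr * td_err * state
```

A natural way to wire this into a deep RL learner, consistent with the abstract's "directly optimize the target policy", is to add a loss term pulling the target policy toward the currently selected source policy and anneal it away as training progresses.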
IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse
Humans have the ability to reuse previously learned policies to solve new
tasks quickly, and reinforcement learning (RL) agents can do the same by
transferring knowledge from source policies to a related target task. Transfer
RL methods can reshape the policy optimization objective (optimization
transfer) or influence the behavior policy (behavior transfer) using source
policies. However, selecting the appropriate source policy with limited samples
to guide target policy learning has been a challenge. Previous methods
introduce additional components, such as hierarchical policies or estimations
of source policies' value functions, which can lead to non-stationary policy
optimization or heavy sampling costs, diminishing transfer effectiveness. To
address this challenge, we propose a novel transfer RL method that selects the
source policy without training extra components. Our method utilizes the Q
function in the actor-critic framework to guide policy selection, choosing the
source policy with the largest one-step improvement over the current target
policy. We integrate optimization transfer and behavior transfer (IOB) by
regularizing the learned policy to mimic the guidance policy and by combining
the two as the behavior policy. This integration significantly enhances transfer
effectiveness, surpasses state-of-the-art transfer RL baselines in benchmark
tasks, and improves final performance and knowledge transferability in
continual learning scenarios. Additionally, we show that our optimization
transfer technique is guaranteed to improve target policy learning.
Comment: 26 pages, 9 figures
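The selection rule is simple enough to sketch: score each candidate policy's action with the critic and reuse whichever scores highest. The snippet below is an illustrative reading of the abstract, not the authors' code; `critic`, `target_policy`, and `source_policies` are assumed callables, and the squared-error penalty is one plausible choice of mimicry regularizer.

```python
# Illustrative sketch: critic-guided source-policy selection plus a
# regularized actor objective (optimization transfer).
import numpy as np

def select_guidance(state, critic, target_policy, source_policies):
    """Pick the policy whose action the critic scores highest in `state`.

    The target policy itself is a candidate, so guidance never makes the
    one-step value estimate worse than learning without transfer.
    """
    candidates = [target_policy] + list(source_policies)
    scores = [critic(state, pi(state)) for pi in candidates]
    return candidates[int(np.argmax(scores))]

def actor_loss(state, critic, target_policy, guidance_policy, alpha=0.1):
    # Standard actor-critic objective plus a penalty pulling the target
    # policy's action toward the guidance action.
    a = target_policy(state)
    g = guidance_policy(state)
    return -critic(state, a) + alpha * float(np.sum((a - g) ** 2))
```

Behavior transfer then amounts to also executing the guidance policy's actions (mixed with the target policy's) when collecting data, so the same selected policy shapes both the objective and the experience.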
Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning
Bayesian policy reuse (BPR) is a general policy transfer framework for
selecting a source policy from an offline library by inferring the task belief
based on some observation signals and a trained observation model. In this
paper, we propose an improved BPR method to achieve more efficient policy
transfer in deep reinforcement learning (DRL). First, most BPR algorithms use
the episodic return as the observation signal, which contains limited information
and cannot be obtained until the end of an episode. Instead, we employ the
state transition sample, which is informative and instantaneous, as the
observation signal for faster and more accurate task inference. Second, BPR
algorithms usually require numerous samples to estimate the probability
distribution of the tabular-based observation model, which may be expensive and
even infeasible to learn and maintain, especially when using the state
transition sample as the signal. Hence, we propose a scalable observation model
based on fitting state transition functions of source tasks from only a small
number of samples, which can generalize to any signals observed in the target
task. Moreover, we extend the offline-mode BPR to the continual learning
setting by expanding the scalable observation model in a plug-and-play fashion,
which can avoid negative transfer when faced with new unknown tasks.
Experimental results show that our method can consistently facilitate faster
and more efficient policy transfer.
Comment: 16 pages, 6 figures, under review
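The core of BPR is a Bayesian belief update over source tasks, and here the observation model scores each state transition under fitted source-task transition functions. A minimal sketch follows, assuming diagonal-Gaussian predictive models; that distributional choice and the `models` interface are illustrative assumptions, not necessarily the paper's.

```python
# Illustrative sketch: one-step belief update over source tasks using
# fitted transition models as the observation model.
import numpy as np

def gaussian_likelihood(next_state, mean, std):
    # p(s' | s, a) under a diagonal-Gaussian fitted transition model.
    z = (next_state - mean) / std
    return float(np.exp(-0.5 * np.sum(z * z))
                 / np.prod(std * np.sqrt(2.0 * np.pi)))

def update_belief(belief, transition, models):
    """Bayes rule: belief[k] proportional to belief[k] * p(transition | model_k).

    `models` maps each source task to a fitted transition function that
    returns a predictive (mean, std) for the next state.
    """
    s, a, s_next = transition
    likelihoods = np.array(
        [gaussian_likelihood(s_next, *m(s, a)) for m in models]
    )
    posterior = belief * (likelihoods + 1e-12)  # avoid zeroing the belief
    return posterior / posterior.sum()
```

Because every transition triggers an update, the belief can sharpen within an episode rather than waiting for a terminal return, which is the point of the instantaneous signal.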
MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics
Transfer reinforcement learning (RL) aims at improving the learning
efficiency of an agent by exploiting knowledge from other source agents trained
on relevant tasks. However, it remains challenging to transfer knowledge
between different environmental dynamics without having access to the source
environments. In this work, we explore a new challenge in transfer RL, where
only a set of source policies collected under diverse unknown dynamics is
available for learning a target task efficiently. To address this problem, the
proposed approach, MULTI-source POLicy AggRegation (MULTIPOLAR), comprises two
key techniques. We learn to aggregate the actions provided by the source
policies adaptively to maximize the target task performance. Meanwhile, we
learn an auxiliary network that predicts residuals around the aggregated
actions, which ensures the target policy's expressiveness even when some of the
source policies perform poorly. We demonstrate the effectiveness of MULTIPOLAR
through an extensive experimental evaluation across six simulated environments
ranging from classic control problems to challenging robotics simulations,
under both continuous and discrete action spaces. The demo videos and code are
available on the project webpage: https://omron-sinicx.github.io/multipolar/.
Comment: This work was presented at IJCAI 2020. Copyright (c) 2020
International Joint Conferences on Artificial Intelligence. All rights
reserved.
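The two techniques compose naturally into one module: a learned weighted sum of the source policies' actions plus a residual network. Below is a minimal PyTorch sketch under assumed shapes and names; it is a reading of the abstract, not the released code.

```python
# Illustrative sketch: adaptive aggregation of source actions with an
# auxiliary residual network for expressiveness.
import torch
import torch.nn as nn

class Multipolar(nn.Module):
    def __init__(self, n_sources, state_dim, action_dim, hidden=64):
        super().__init__()
        # Learnable per-source, per-dimension aggregation weights.
        self.weights = nn.Parameter(torch.ones(n_sources, action_dim))
        # Auxiliary network predicting a residual around the aggregate,
        # keeping the target policy expressive even with poor sources.
        self.residual = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, source_actions):
        # source_actions: (n_sources, action_dim) from the frozen sources.
        aggregate = (self.weights * source_actions).sum(dim=0)
        return aggregate + self.residual(state)
```

The source policies stay frozen; only the aggregation weights and the residual network receive gradients from whatever RL objective trains the target policy, which is what lets transfer work without access to the source environments.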
EASpace: Enhanced Action Space for Policy Transfer
Formulating expert policies as macro actions promises to alleviate the
long-horizon issue via structured exploration and efficient credit assignment.
However, traditional option-based multi-policy transfer methods suffer from
inefficient exploration of macro-action lengths and insufficient exploitation
of useful long-duration macro actions. In this paper, a novel algorithm named
EASpace (Enhanced Action Space) is proposed, which formulates macro actions in
an alternative form to accelerate the learning process using multiple available
sub-optimal expert policies. Specifically, EASpace formulates each expert
policy into multiple macro actions with different execution times. All the
macro actions are then integrated into the primitive action space directly. An
intrinsic reward, which is proportional to the execution time of macro actions,
is introduced to encourage the exploitation of useful macro actions. A
learning rule similar to intra-option Q-learning is employed to improve
data efficiency. Theoretical analysis is presented to
show the convergence of the proposed learning rule. The efficiency of EASpace
is illustrated by a grid-based game and a multi-agent pursuit problem. The
proposed algorithm is also implemented in physical systems to validate its
effectiveness.
Comment: 15 pages
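The construction described in the abstract can be sketched directly: each (expert policy, duration) pair becomes one macro action appended to the primitive action space, and an intrinsic bonus proportional to the duration rewards committing to long macros. The duration grid, bonus coefficient, and interfaces below are illustrative assumptions.

```python
# Illustrative sketch: building the enhanced action space and the
# duration-proportional intrinsic reward.
import itertools

PRIMITIVE_ACTIONS = [0, 1, 2, 3]  # e.g. the four moves of a grid world
DURATIONS = [2, 4, 8]             # candidate macro-action execution times

def build_action_space(n_experts):
    """Primitive actions plus one macro action per (expert, duration) pair."""
    macros = list(itertools.product(range(n_experts), DURATIONS))
    return PRIMITIVE_ACTIONS + macros

def intrinsic_bonus(action, coef=0.01):
    # A bonus proportional to execution time encourages the agent to
    # exploit long-duration macros instead of abandoning them early.
    if isinstance(action, tuple):  # (expert_id, duration) macro action
        return coef * action[1]
    return 0.0                     # primitive actions get no bonus
```

An intra-option-style update would then let one executed macro also update the Q-values of the shorter macros of the same expert that it passes through, which is where the claimed data efficiency would come from.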