Expert-Free Online Transfer Learning in Multi-Agent Reinforcement Learning
Transfer learning in Reinforcement Learning (RL) has been widely studied as a
way to overcome training issues of deep RL, e.g., exploration cost, data
availability, and convergence time, by enhancing the training phase with
external knowledge. Generally, knowledge is transferred from expert agents to
novices. While this addresses the novice's problem, such transfer is effective
only if the expert agent already has a good understanding of the task. As an
alternative, in this paper we propose Expert-Free Online Transfer Learning
(EF-OnTL), an algorithm that enables expert-free, real-time, dynamic transfer
learning in multi-agent systems. No dedicated expert exists; the source agent
and the knowledge to be transferred are selected dynamically at each transfer
step based on the agents' performance and uncertainty. To improve uncertainty
estimation, we also propose State Action Reward Next-State Random Network
Distillation (sars-RND), an extension of RND that estimates uncertainty from
the RL agent-environment interaction. We demonstrate the effectiveness of
EF-OnTL against a no-transfer scenario and advice-based baselines, with and
without expert agents, in three benchmark tasks: Cart-Pole, a grid-based
Multi-Team Predator-Prey (mt-pp), and Half Field Offense (HFO). Our results
show that EF-OnTL achieves performance comparable to the advice-based
baselines while requiring neither external input nor threshold tuning, and
that it outperforms no-transfer by a margin that grows with the complexity of
the task addressed.
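The abstract does not spell out sars-RND, but the Random Network Distillation idea it extends can be sketched in a few lines: a predictor is trained to match a frozen, randomly initialised target network, and the prediction error on a given input serves as an uncertainty score, so frequently visited transitions score low and novel ones score high. The sketch below is illustrative only, assuming tiny linear maps in place of the paper's networks and a hypothetical 8-dimensional encoding of the (state, action, reward, next-state) tuple.

```python
import numpy as np

rng = np.random.default_rng(0)

IN_DIM, OUT_DIM = 8, 4  # illustrative size of a flattened (s, a, r, s') vector

W_target = rng.normal(size=(IN_DIM, OUT_DIM))  # fixed random target network
W_pred = np.zeros((IN_DIM, OUT_DIM))           # predictor, trained online

def uncertainty(x):
    """Squared prediction error of the predictor vs. the frozen target."""
    return float(np.sum((x @ W_target - x @ W_pred) ** 2))

def update_predictor(x, lr=0.05):
    """One gradient step moving the predictor toward the target's output."""
    global W_pred
    err = x @ W_pred - x @ W_target    # residual, shape (OUT_DIM,)
    W_pred -= lr * np.outer(x, err)    # gradient of the squared error

# Repeatedly visiting the same transition drives its uncertainty down,
# while an unseen transition keeps a high score.
seen = rng.normal(size=IN_DIM)
novel = rng.normal(size=IN_DIM)
before = uncertainty(seen)
for _ in range(200):
    update_predictor(seen)
after = uncertainty(seen)
print(after < before, uncertainty(novel) > after)
```

In EF-OnTL terms, a score like this would let agents flag which transitions they are least certain about, without any designated expert.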
Human-Machine Collaborative Optimization via Apprenticeship Scheduling
Coordinating agents to complete a set of tasks with intercoupled temporal and
resource constraints is computationally challenging, yet human domain experts
can solve these difficult scheduling problems using paradigms learned through
years of apprenticeship. A process for manually codifying this domain knowledge
within a computational framework is necessary to scale beyond the
"single-expert, single-trainee" apprenticeship model. However, human domain
experts often have difficulty describing their decision-making processes,
making the codification of this knowledge laborious. We propose a
new approach for capturing domain-expert heuristics through a pairwise ranking
formulation. Our approach is model-free and does not require enumerating or
iterating through a large state space. We empirically demonstrate that this
approach accurately learns multifaceted heuristics on a synthetic data set
incorporating job-shop scheduling and vehicle routing problems, as well as on
two real-world data sets consisting of demonstrations of experts solving a
weapon-to-target assignment problem and a hospital resource allocation problem.
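The pairwise ranking formulation can be illustrated with a toy example: each expert decision yields pairs of (chosen option, unchosen option), and a scoring function is fit so that chosen options score higher, e.g., via logistic regression on feature differences. Everything below, the feature dimension, the synthetic "expert" weights, and the learning rate, is a hypothetical stand-in for the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0, 0.5])  # hidden weights of a synthetic expert

# Synthetic demonstrations: for each decision, the expert picks whichever
# of two candidate tasks scores higher under its hidden heuristic.
pairs = []
for _ in range(500):
    a, b = rng.normal(size=3), rng.normal(size=3)
    pairs.append((a, b) if a @ true_w >= b @ true_w else (b, a))

# Logistic regression on feature differences: P(pos ranked above neg)
# is sigmoid(w . (pos - neg)); ascend the log-likelihood.
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    for pos, neg in pairs:
        d = pos - neg
        p = 1.0 / (1.0 + np.exp(-(w @ d)))
        w += lr * (1.0 - p) * d / len(pairs)

# The learned scorer should reproduce the expert's choices.
correct = sum((pos - neg) @ w > 0 for pos, neg in pairs)
print(correct / len(pairs))
```

Note the model-free flavour the abstract emphasises: the learner only ever sees pairwise comparisons between candidate actions, never an enumeration of the full scheduling state space.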
We also demonstrate that policies learned from human scheduling demonstrations
via apprenticeship learning can substantially improve the efficiency of a
branch-and-bound search for an optimal schedule. We employ this human-machine
collaborative optimization technique on a variant of the weapon-to-target
assignment problem. We demonstrate that this technique generates solutions
substantially superior to those produced by human domain experts at a rate up
to 9.5 times faster than an optimization approach and can be applied to
optimally solve problems twice as complex as those solved by a human
demonstrator.
Comment: Portions of this paper were published in the Proceedings of the
International Joint Conference on Artificial Intelligence (IJCAI) in 2016 and
in the Proceedings of Robotics: Science and Systems (RSS) in 2016. The paper
consists of 50 pages with 11 figures and 4 tables.
Regret Bounds for Reinforcement Learning with Policy Advice
In some reinforcement learning problems an agent may be provided with a set
of input policies, perhaps learned from prior experience or provided by
advisors. We present a reinforcement learning with policy advice (RLPA)
algorithm which leverages this input set and learns to use the best policy in
the set for the reinforcement learning task at hand. We prove that RLPA has a
sub-linear regret of \tilde O(\sqrt{T}) relative to the best input policy, and
that both this regret and its computational complexity are independent of the
size of the state and action space. Our empirical simulations support our
theoretical analysis. This suggests that RLPA may offer significant advantages
in large domains where some good prior policies are provided.
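RLPA's actual algorithm and regret analysis go beyond what the abstract states; as a loose illustration of "learning to use the best policy in the set", the sketch below treats each candidate input policy as a bandit arm and selects among them with UCB1, so play concentrates on the best policy over time. The per-episode returns and all constants are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
true_means = [0.3, 0.5, 0.8]  # hidden per-episode return of each input policy

def run_episode(i):
    """Running policy i for one episode yields a noisy return (simulated)."""
    return true_means[i] + rng.normal(scale=0.1)

n = len(true_means)
counts = np.zeros(n)  # episodes played per policy
sums = np.zeros(n)    # total return per policy
T = 2000
for t in range(T):
    if t < n:
        i = t  # play every policy once to initialise the statistics
    else:
        # Optimistic estimate: empirical mean plus an exploration bonus.
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        i = int(np.argmax(ucb))
    counts[i] += 1
    sums[i] += run_episode(i)

print(int(np.argmax(counts)))  # the best policy dominates play
```

As in RLPA's setting, the cost of this selection depends only on the number of candidate policies and the horizon, not on the size of the state and action spaces.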