Cycle-of-Learning for Autonomous Systems from Human Interaction
We discuss different types of human-robot interaction paradigms in the
context of training end-to-end reinforcement learning algorithms. We provide a
taxonomy to categorize the types of human interaction and present our
Cycle-of-Learning framework for autonomous systems that combines different
human-interaction modalities with reinforcement learning. Two key concepts
provided by our Cycle-of-Learning framework are how it handles the integration
of the different human-interaction modalities (demonstration, intervention, and
evaluation) and how to define the switching criteria between them. (Comment: Presented at AI-HRI AAAI-FSS, 2018; arXiv:1809.06606.)
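
To make the switching idea concrete, here is a minimal sketch of a modality selector driven by an agent-confidence heuristic. The thresholds and the confidence criterion are illustrative assumptions, not the paper's actual switching rules:

```python
def select_modality(policy_confidence: float) -> str:
    """Pick an interaction modality from the agent's self-confidence."""
    if policy_confidence < 0.3:
        return "demonstration"  # human shows full trajectories
    if policy_confidence < 0.7:
        return "intervention"   # human corrects the agent mid-episode
    return "evaluation"         # human only scores completed behavior

for conf in (0.1, 0.5, 0.9):
    print(conf, "->", select_modality(conf))
```

As the agent improves, the human's role shifts from costly demonstrations toward cheap evaluative feedback.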
Multi-Preference Actor Critic
Policy gradient algorithms typically combine discounted future rewards with
an estimated value function to compute the direction and magnitude of
parameter updates. However, for most Reinforcement Learning tasks, humans can
provide additional insight to constrain the policy learning. We introduce a
general method to incorporate multiple different feedback channels into a
single policy gradient loss. In our formulation, the Multi-Preference Actor
Critic (M-PAC), these different types of feedback are implemented as
constraints on the policy. We use a Lagrangian relaxation to satisfy these
constraints using gradient descent while learning a policy that maximizes
rewards. Experiments in Atari and Pendulum verify that constraints are being
respected and can accelerate the learning process. (Comment: NeurIPS Workshop on Deep RL, 2018.)
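
A minimal sketch of the Lagrangian-relaxation idea follows. The loss adds a multiplier-weighted violation term per feedback channel, and dual ascent raises a multiplier whenever its constraint is violated; the specific violation values and update rule here are placeholders, not M-PAC's exact formulation:

```python
import numpy as np

def mpac_style_loss(pg_loss: float, violations: np.ndarray,
                    lambdas: np.ndarray) -> float:
    """Penalized objective: policy-gradient loss plus weighted violations."""
    return pg_loss + float(np.dot(lambdas, violations))

lambdas = np.zeros(2)                  # one multiplier per feedback channel
lr_lambda = 0.1
for step in range(3):
    pg_loss = 1.0                      # stand-in for the usual PG loss
    violations = np.array([0.5, 0.0])  # how far each preference is violated
    loss = mpac_style_loss(pg_loss, violations, lambdas)
    # Dual ascent: raise a multiplier whenever its constraint is violated.
    lambdas = np.maximum(0.0, lambdas + lr_lambda * violations)
    print(step, round(loss, 3), lambdas)
```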
Directed Policy Gradient for Safe Reinforcement Learning with Human Advice
Many currently deployed Reinforcement Learning agents work in an environment
shared with humans, be they co-workers, users, or clients. It is desirable that
these agents adjust to people's preferences, learn faster thanks to their help,
and act safely around them. We argue that most current approaches that learn
from human feedback are unsafe: rewarding or punishing the agent a posteriori
cannot immediately prevent it from wrongdoing. In this paper, we extend Policy
Gradient to make it robust to external directives that would otherwise break
the fundamentally on-policy nature of Policy Gradient. Our technique, Directed
Policy Gradient (DPG), allows a teacher or backup policy to override the agent
before it acts undesirably, while allowing the agent to leverage human advice
or directives to learn faster. Our experiments demonstrate that DPG makes the
agent learn much faster than reward-based approaches, while requiring an order
of magnitude less advice. (Comment: Accepted at the European Workshop on Reinforcement Learning 2018, EWRL14.)
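
The override mechanism can be sketched as sampling from the policy renormalized over the actions a backup policy permits. How DPG then keeps the gradient update consistent with the actually executed distribution is elided here; the action names are illustrative:

```python
import random

def act(policy_probs: dict, unsafe: set) -> str:
    """Sample from the policy, renormalized over the allowed actions only."""
    allowed = {a: p for a, p in policy_probs.items() if a not in unsafe}
    actions, probs = zip(*allowed.items())
    return random.choices(actions, weights=probs)[0]

# The backup policy has flagged "jump" as unsafe in this state.
print(act({"left": 0.4, "right": 0.4, "jump": 0.2}, unsafe={"jump"}))
```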
DQN-TAMER: Human-in-the-Loop Reinforcement Learning with Intractable Feedback
Exploration has been one of the greatest challenges in reinforcement learning
(RL), which is a large obstacle in the application of RL to robotics. Even with
state-of-the-art RL algorithms, building a well-trained agent often requires
too many trials, mainly due to the difficulty of matching its actions with
rewards in the distant future. A remedy for this is to train an agent with
real-time feedback from a human observer who immediately gives rewards for some
actions. This study tackles a series of challenges for introducing such a
human-in-the-loop RL scheme. The first contribution of this work is a set of
experiments with a precisely modeled human observer, whose feedback is binary,
delayed, stochastic, unsustainable, and given as a natural reaction. We also propose an RL
method called DQN-TAMER, which efficiently uses both human feedback and distant
rewards. We find that DQN-TAMER agents outperform their baselines in Maze and
Taxi simulated environments. Furthermore, we demonstrate a real-world
human-in-the-loop RL application where a camera automatically recognizes a
user's facial expressions as feedback to the agent while the agent explores a
maze.
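
One way to picture the combination of distant rewards and human feedback is an action rule that blends a Q-function with a learned feedback model. The additive weighting below is a simplification, not the paper's exact formulation:

```python
import numpy as np

def select_action(q_values: np.ndarray, h_values: np.ndarray,
                  alpha: float = 0.5) -> int:
    """Blend long-horizon value (Q) with predicted human approval (H)."""
    return int(np.argmax(q_values + alpha * h_values))

q = np.array([0.1, 0.9, 0.3])   # environment-return estimates per action
h = np.array([1.0, -1.0, 0.0])  # feedback-model predictions per action
print(select_action(q, h))      # human approval steers early exploration
```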
Extending Policy from One-Shot Learning through Coaching
Humans generally teach their fellow collaborators to perform tasks through a
small number of demonstrations. The learnt task is corrected or extended to
meet specific task goals by means of coaching. Adopting a similar framework for
teaching robots through demonstrations and coaching makes teaching tasks highly
intuitive. Unlike traditional Learning from Demonstration (LfD) approaches
which require multiple demonstrations, we present a one-shot learning from
demonstration approach to learn tasks. The learnt task is corrected and
generalized using two layers of evaluation/modification. First, the robot
self-evaluates its performance and corrects the performance to be closer to the
demonstrated task. Then, coaching is used to extend the learnt policy so that
it adapts to varying task goals. Both the self-evaluation and coaching are
implemented using reinforcement learning (RL) methods. Coaching is achieved
through human feedback on the desired goal and on action modification, which
generalizes the policy to specified task goals. The proposed approach is
evaluated on a scooping task by presenting a single demonstration. The self-evaluation
framework aims to reduce the resistance to scooping in the media. To reduce the
search space for RL, we bootstrap the search with the least-resistance path
obtained from resistive force theory. Coaching is used to generalize the
learnt task policy to transfer the desired quantity of material. Thus, the
proposed method provides a framework for learning a task from one demonstration
and generalizing it using human feedback through coaching.
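
A minimal sketch of bootstrapping the RL search from a single demonstration: candidate policies are sampled as small perturbations of the demonstrated trajectory, keeping the search local. The Gaussian perturbation and variable names are assumptions for illustration:

```python
import numpy as np

def sample_candidates(demo_traj, n=5, sigma=0.05):
    """Perturb the demonstrated trajectory to seed a local policy search."""
    demo = np.asarray(demo_traj, dtype=float)
    return [demo + np.random.normal(0.0, sigma, demo.shape) for _ in range(n)]

demo = [0.0, 0.2, 0.5, 0.9]     # e.g. scoop depth along the motion
for candidate in sample_candidates(demo, n=2):
    print(np.round(candidate, 3))
```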
Risk-Aware Active Inverse Reinforcement Learning
Active learning from demonstration allows a robot to query a human for
specific types of input to achieve efficient learning. Existing work has
explored a variety of active query strategies; however, to our knowledge, none
of these strategies directly minimize the performance risk of the policy the
robot is learning. Utilizing recent advances in performance bounds for inverse
reinforcement learning, we propose a risk-aware active inverse reinforcement
learning algorithm that focuses active queries on areas of the state space with
the potential for large generalization error. We show that risk-aware active
learning outperforms standard active IRL approaches on gridworld, simulated
driving, and table setting tasks, while also providing a performance-based
stopping criterion that allows a robot to know when it has received enough
demonstrations to safely perform a task. (Comment: In proceedings of the 2nd Conference on Robot Learning (CoRL), 2018.)
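
The query-and-stop logic can be sketched as follows: ask the human about the state whose performance bound is loosest, and stop once every bound falls under a tolerance. The bound values are placeholders, not the paper's actual estimator:

```python
import numpy as np

def next_query(risk_bounds: np.ndarray, epsilon: float = 0.1):
    """Return the riskiest state index, or None once it is safe to stop."""
    worst = int(np.argmax(risk_bounds))
    return None if risk_bounds[worst] < epsilon else worst

bounds = np.array([0.02, 0.30, 0.05])  # per-state error-bound estimates
print(next_query(bounds))              # -> 1: query the human about state 1
```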
Robot Learning via Human Adversarial Games
Much work in robotics has focused on "human-in-the-loop" learning techniques
that improve the efficiency of the learning process. However, these algorithms
have made the strong assumption of a cooperating human supervisor that assists
the robot. In reality, human observers tend to also act in an adversarial
manner towards deployed robotic systems. We show that such adversarial behavior
can in fact improve the robustness of the learned models: we propose a physical
framework that leverages perturbations applied by a human adversary, guiding
the robot towards more robust models. In a manipulation task, we show that grasping success
improves significantly when the robot trains with a human adversary as compared
to training in a self-supervised manner.
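
A toy version of the data-collection loop this implies: a grasp only yields a positive training label if it also survives the human's perturbation. The `adversary` callable below is a stand-in for the real human adversary:

```python
import random

def grasp_label(grasp_holds: bool, adversary) -> int:
    """A grasp counts as a success only if it survives the perturbation."""
    return 1 if grasp_holds and adversary() else 0

# Stand-in adversary: the perturbation dislodges the object half the time.
print(grasp_label(True, adversary=lambda: random.random() > 0.5))
```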
Improving Interactive Reinforcement Agent Planning with Human Demonstration
TAMER has proven to be a powerful interactive reinforcement learning method
for allowing ordinary people to teach and personalize autonomous agents'
behavior by providing evaluative feedback. However, a TAMER agent planning with
UCT (a Monte Carlo tree search strategy) can only update states along its path,
which can incur a high learning cost, especially for a physical robot. In this
paper, we propose to drive the agent's exploration along the optimal path and
reduce the learning cost by initializing the agent's reward function via
inverse reinforcement learning from demonstration. We test our proposed method
in the Grid World RL benchmark domain with different discounts on human
reward. Our results show that learning from demonstration can allow a TAMER
agent to learn a roughly optimal policy up to the deepest search and encourage
the agent to explore along the optimal path. In addition, we find that learning
from demonstration can improve learning efficiency by reducing the total
feedback and the number of incorrect actions, and by increasing the ratio of
correct actions needed to obtain an optimal policy, allowing a TAMER agent to
converge faster.
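
Here is a minimal sketch of the initialization step, with the IRL component reduced to an average of feature counts along the demonstration. A real IRL step would fit the weights rather than average features; all names are illustrative:

```python
import numpy as np

# Features observed along the demonstrated (roughly optimal) trajectory.
demo_features = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
reward_weights = demo_features.mean(axis=0)  # crude IRL stand-in

def initial_reward(state_features: np.ndarray) -> float:
    """Reward estimate that seeds TAMER before any human feedback arrives."""
    return float(np.dot(reward_weights, state_features))

print(initial_reward(np.array([1.0, 1.0])))
```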
Actor-Critic Reinforcement Learning with Simultaneous Human Control and Feedback
This paper contributes a first study into how different human users deliver
simultaneous control and feedback signals during human-robot interaction. As
part of this work, we formalize and present a general interactive learning
framework for online cooperation between humans and reinforcement learning
agents. In many human-machine interaction settings, there is a growing gap
between the degrees-of-freedom of complex semi-autonomous systems and the
number of human control channels. Simple human control and feedback mechanisms
are required to close this gap and allow for better collaboration between
humans and machines on complex tasks. To better inform the design of concurrent
control and feedback interfaces, we present experimental results from a
human-robot collaborative domain wherein the human must simultaneously deliver
both control and feedback signals to interactively train an actor-critic
reinforcement learning robot. We compare three experimental conditions: 1)
human-delivered control signals, 2) reward-shaping feedback signals, and 3)
simultaneous control and feedback. Our results suggest that subjects provide
less feedback when simultaneously delivering feedback and control signals, and
that control-signal quality is not significantly diminished. Our data suggest
that subjects may also modify when and how they provide feedback. Through
algorithmic development and tuning informed by this study, we expect
semi-autonomous actions of robotic agents can be better shaped by human
feedback, allowing for seamless collaboration and improved performance in
difficult interactive domains. (Comment: 10 pages, 2 pages of references, 8
figures. Under review for the 34th International Conference on Machine
Learning, Sydney, Australia, 2017. Copyright 2017 by the author(s).)
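
One plausible wiring for such an interface is sketched below: human feedback enters the critic's TD error as an additive shaping term, while a human control signal, when present, overrides the policy's proposed action. The additive term and its weight are assumptions, not the paper's formulation:

```python
def td_error(reward, human_feedback, v_s, v_next, gamma=0.99, beta=0.5):
    """One-step TD error with an additive human-feedback shaping term."""
    return reward + beta * human_feedback + gamma * v_next - v_s

def executed_action(policy_action, human_action=None):
    """Human control, when present, overrides the policy's proposal."""
    return human_action if human_action is not None else policy_action

print(td_error(reward=0.0, human_feedback=1.0, v_s=0.2, v_next=0.3))
print(executed_action("forward", human_action="stop"))
```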
Learning Shaping Strategies in Human-in-the-loop Interactive Reinforcement Learning
Providing reinforcement learning agents with informationally rich human
knowledge can dramatically improve various aspects of learning. Prior work has
developed different kinds of shaping methods that enable agents to learn
efficiently in complex environments. Each of these methods, however, tailors
human guidance to agents through a specialized shaping procedure, and thus has
its own characteristics and advantages in different domains. In this paper, we
investigate the interplay between different shaping methods for more robust
learning performance. We propose an adaptive shaping algorithm which is capable
of learning the most suitable shaping method in an online manner. Results from
both simulated and real human studies in two classic domains verify its
effectiveness, shedding light on the role and impact of human factors in
human-robot collaborative learning.
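
Selecting a shaping method online reads like a bandit problem over shaping strategies. The epsilon-greedy sketch below, with a running value per method, is one plausible instantiation rather than the paper's exact algorithm; the method names are illustrative:

```python
import random

values = {"reward_shaping": 0.0, "policy_shaping": 0.0, "q_augmentation": 0.0}
counts = dict.fromkeys(values, 0)

def pick(eps: float = 0.2) -> str:
    """Epsilon-greedy choice among the candidate shaping methods."""
    if random.random() < eps:
        return random.choice(list(values))
    return max(values, key=values.get)

def update(method: str, gain: float) -> None:
    """Incremental running average of the observed learning gain."""
    counts[method] += 1
    values[method] += (gain - values[method]) / counts[method]

method = pick()
update(method, gain=1.0)  # gain: learning improvement seen this episode
print(method, values[method])
```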