Accelerating Reinforcement Learning through Implicit Imitation
Imitation can be viewed as a means of enhancing learning in multiagent
environments. It augments an agent's ability to learn useful behaviors by
making intelligent use of the knowledge implicit in behaviors demonstrated by
cooperative teachers or other more experienced agents. We propose and study a
formal model of implicit imitation that can accelerate reinforcement learning
dramatically in certain cases. Roughly, by observing a mentor, a
reinforcement-learning agent can extract information about its own capabilities
in, and the relative value of, unvisited parts of the state space. We study two
specific instantiations of this model, one in which the learning agent and the
mentor have identical abilities, and one designed to deal with agents and
mentors with different action sets. We illustrate the benefits of implicit
imitation by integrating it with prioritized sweeping, and demonstrating
improved performance and convergence through observation of single and multiple
mentors. Though we make some stringent assumptions regarding observability and possible interactions, we briefly comment on extensions of the model that relax these restrictions.
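To make the core mechanism concrete, the following is a minimal tabular sketch of the homogeneous-actions case, assuming mentor transitions can be credited to a matching learner action; the class and method names are illustrative, not the paper's implementation.

```python
import numpy as np

# Minimal sketch: the learner folds observed mentor transitions (s, s') into
# its own empirical model and then runs prioritized-sweeping-style backups.
# The tabular setting and all names here are simplifying assumptions.

class ImplicitImitator:
    def __init__(self, n_states, n_actions, gamma=0.95):
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.R = np.zeros(n_states)   # estimated state rewards
        self.V = np.zeros(n_states)
        self.gamma = gamma

    def observe(self, s, a, s_next):
        # Used for both the learner's own experience and mentor transitions
        # (with `a` the learner action presumed to reproduce the mentor's move).
        self.counts[s, a, s_next] += 1

    def backup(self, s):
        # Full Bellman backup at s under the current empirical model; in
        # prioritized sweeping, predecessors of s would be queued by the
        # magnitude of the resulting change in V[s].
        totals = self.counts[s].sum(axis=1, keepdims=True)
        P = self.counts[s] / np.maximum(totals, 1)
        self.V[s] = self.R[s] + self.gamma * (P @ self.V).max()
```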
Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning
We identify two issues with the family of algorithms based on the Adversarial
Imitation Learning framework. The first problem is implicit bias present in the
reward functions used in these algorithms. While these biases might work well
for some environments, they can also lead to sub-optimal behavior in others.
Secondly, even though these algorithms can learn from few expert demonstrations, they require a prohibitively large number of environment interactions to imitate the expert, which makes them impractical for many real-world applications. To address these issues, we propose a new algorithm called Discriminator-Actor-Critic that uses off-policy Reinforcement Learning to reduce policy-environment interaction sample complexity by an average factor of 10. Furthermore, since our reward function is designed to be unbiased, we can apply our algorithm to many problems without making any task-specific adjustments.
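A hedged sketch of the adversarial-reward side of such an algorithm is below; the discriminator architecture, placeholder dimensions, and the logit-based reward form are assumptions, and the off-policy learner (e.g., TD3 or SAC) is omitted.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2   # placeholder dimensions

# Binary classifier separating expert (state, action) pairs from policy pairs.
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                     nn.Linear(256, 1))

def disc_loss(expert_sa, policy_sa):
    # GAN-style objective: push expert pairs toward 1, policy pairs toward 0.
    bce = nn.functional.binary_cross_entropy_with_logits
    le, lp = disc(expert_sa), disc(policy_sa)
    return bce(le, torch.ones_like(le)) + bce(lp, torch.zeros_like(lp))

def reward(sa):
    # log D - log(1 - D) reduces to the raw logit when D = sigmoid(logit);
    # this form avoids rewards that are strictly positive or negative,
    # one source of the bias the abstract refers to.
    with torch.no_grad():
        return disc(sa).squeeze(-1)
```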
DropoutDAgger: A Bayesian Approach to Safe Imitation Learning
While imitation learning is becoming common practice in robotics, this
approach often suffers from data mismatch and compounding errors. DAgger is an
iterative algorithm that addresses these issues by continually aggregating
training data from both the expert and novice policies, but does not consider
the impact of safety. We present a probabilistic extension to DAgger that uses the distribution over actions provided by the novice policy for a given
observation. Our method, which we call DropoutDAgger, uses dropout to train the
novice as a Bayesian neural network that provides insight into its confidence.
Using the distribution over the novice's actions, we estimate a probabilistic
measure of safety with respect to the expert action, tuned to balance
exploration and exploitation. The utility of this approach is evaluated on the
MuJoCo HalfCheetah and in a simple driving experiment, demonstrating improved
performance and safety compared to other DAgger variants and classic imitation
learning.
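The decision rule can be sketched as follows, assuming a novice network with dropout layers; the sample count and z-score threshold are illustrative, not the paper's settings.

```python
import torch

def choose_action(novice, expert_action, obs, n_samples=32, tau=2.0):
    # Keeping dropout active at inference (MC dropout) yields samples from an
    # approximate posterior over actions.
    novice.train()
    with torch.no_grad():
        samples = torch.stack([novice(obs) for _ in range(n_samples)])
    mean, std = samples.mean(0), samples.std(0) + 1e-6
    # Per-dimension z-score of the expert action serves as a safety proxy;
    # tau trades exploration (trusting the novice) against exploitation.
    z = ((expert_action - mean) / std).abs().max().item()
    return mean if z < tau else expert_action
```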
Query-Efficient Imitation Learning for End-to-End Autonomous Driving
One way to approach end-to-end autonomous driving is to learn a policy
function that maps from a sensory input, such as an image frame from a
front-facing camera, to a driving action, by imitating an expert driver, or a
reference policy. This can be done by supervised learning, where a policy
function is tuned to minimize the difference between the predicted and
ground-truth actions. A policy function trained in this way, however, is known to suffer from unexpected behaviours due to the mismatch between the states reachable by the reference policy and those reachable by the trained policy. More advanced algorithms for imitation learning, such as DAgger, address this issue by iteratively collecting training examples from both the reference and trained policies. These algorithms often require a large number of queries to a reference policy, which is undesirable as the reference policy is often expensive. In this paper, we propose an extension of DAgger, called SafeDAgger, that is query-efficient and more suitable for end-to-end autonomous driving. We evaluate the proposed SafeDAgger in a car racing simulator and show that it indeed requires fewer queries to a reference policy. We observe a significant speedup in convergence, which we conjecture is due to the effect of automated curriculum learning.
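A minimal sketch of the query-saving control switch is given below; the safety network's size, feature interface, and threshold are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

feat_dim = 64                                   # placeholder feature size

# A small network trained (supervised, on logged data) to predict whether the
# novice's action stays within a tolerance of the reference action in a state.
safety_net = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(),
                           nn.Linear(32, 1))

def act(obs_features, novice, reference):
    with torch.no_grad():
        safe = (torch.sigmoid(safety_net(obs_features)) > 0.5).item()
    if safe:
        return novice(obs_features)    # cheap path: no reference query
    return reference(obs_features)     # query the expensive reference policy
```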
Multi-Level Discovery of Deep Options
Augmenting an agent's control with useful higher-level behaviors called
options can greatly reduce the sample complexity of reinforcement learning, but
manually designing options is infeasible in high-dimensional and abstract state
spaces. While recent work has proposed several techniques for automated option
discovery, they do not scale to multi-level hierarchies and to expressive
representations such as deep networks. We present Discovery of Deep Options
(DDO), a policy-gradient algorithm that discovers parametrized options from a
set of demonstration trajectories, and can be used recursively to discover
additional levels of the hierarchy. The scalability of our approach to
multi-level hierarchies stems from the decoupling of low-level option discovery
from high-level meta-control policy learning, facilitated by
under-parametrization of the high level. We demonstrate that using the
discovered options to augment the action space of Deep Q-Network agents can
accelerate learning by guiding exploration in tasks where random actions are
unlikely to reach valuable states. We show that DDO is effective in adding
options that accelerate learning in 4 out of 5 Atari RAM environments chosen in
our experiments. We also show that DDO can discover structure in robot-assisted surgical videos and kinematics that matches expert annotation with 72% accuracy.
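As a rough illustration of how discovered options can augment an agent's action space, consider the gym-style rollout below; the Option container and the execution loop are illustrative, not the DDO inference procedure itself.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    policy: Callable      # observation -> primitive action
    terminate: Callable   # observation -> bool

def execute(action_idx, obs, env, primitives, options):
    # Indices below len(primitives) are ordinary actions; the rest select a
    # temporally extended option that runs until its termination fires.
    if action_idx < len(primitives):
        return env.step(primitives[action_idx])
    opt = options[action_idx - len(primitives)]
    total_r, done, info = 0.0, False, {}
    while not done and not opt.terminate(obs):
        obs, r, done, info = env.step(opt.policy(obs))
        total_r += r
    return obs, total_r, done, info
```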
Cross-Domain Perceptual Reward Functions
In reinforcement learning, we often define goals by specifying rewards within
desirable states. One problem with this approach is that we typically need to
redefine the rewards each time the goal changes, which often requires some
understanding of the solution in the agent's environment. When humans are
learning to complete tasks, we regularly utilize alternative sources that guide
our understanding of the problem. Such task representations allow one to
specify goals on their own terms, thus providing specifications that can be
appropriately interpreted across various environments. This motivates our own
work, in which we represent goals in environments that are different from the
agent's. We introduce Cross-Domain Perceptual Reward (CDPR) functions, learned
rewards that represent the visual similarity between an agent's state and a
cross-domain goal image. We report results for learning the CDPRs with a deep
neural network and using them to solve two tasks with deep reinforcement
learning.
Comment: A shorter version of this paper was accepted to RLDM (http://rldm.org/rldm2017/).
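A simplified sketch of such a perceptual reward follows, substituting a fixed pretrained backbone and cosine similarity for the learned reward in the paper; the preprocessing shortcut and the torchvision weights API are assumptions.

```python
import torch
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # expose the 512-d feature vector
backbone.eval()

@torch.no_grad()
def perceptual_reward(agent_frame, goal_image):
    # Inputs are assumed to be normalized 3xHxW tensors; the usual ImageNet
    # preprocessing is omitted for brevity.
    f_s = backbone(agent_frame.unsqueeze(0))
    f_g = backbone(goal_image.unsqueeze(0))
    return torch.nn.functional.cosine_similarity(f_s, f_g).item()
```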
BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning
Allowing humans to interactively train artificial agents to understand
language instructions is desirable for both practical and scientific reasons,
but given the poor data efficiency of the current learning methods, this goal
may require substantial research efforts. Here, we introduce the BabyAI
research platform to support investigations towards including humans in the
loop for grounded language learning. The BabyAI platform comprises an
extensible suite of 19 levels of increasing difficulty. The levels gradually
lead the agent towards acquiring a combinatorially rich synthetic language
which is a proper subset of English. The platform also provides a heuristic
expert agent for the purpose of simulating a human teacher. We report baseline
results and estimate the amount of human involvement that would be required to
train a neural network-based agent on some of the BabyAI levels. We put forward
strong evidence that current deep learning methods are not yet sufficiently
sample efficient when it comes to learning a language with compositional
properties.
Comment: Accepted at ICLR 2019.
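A minimal interaction loop with the platform might look like the sketch below, assuming the babyai package and its gym registration; the level id is one example from the suite and the random policy is a stand-in for an agent.

```python
import gym
import babyai  # assumed: registers the BabyAI-* environments with gym

env = gym.make("BabyAI-GoToRedBall-v0")   # one level from the suite
obs = env.reset()
print(obs["mission"])                      # the synthetic-language instruction
done = False
while not done:
    # Random actions stand in for a learned agent here.
    obs, reward, done, info = env.step(env.action_space.sample())
```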
Multi-class Generalized Binary Search for Active Inverse Reinforcement Learning
This paper addresses the problem of learning a task from demonstration. We
adopt the framework of inverse reinforcement learning, where tasks are
represented in the form of a reward function. Our contribution is a novel
active learning algorithm that enables the learning agent to query the expert
for more informative demonstrations, thus leading to more sample-efficient
learning. For this novel algorithm (Generalized Binary Search for Inverse
Reinforcement Learning, or GBS-IRL), we provide a theoretical bound on sample
complexity and illustrate its applicability on several different tasks. To our
knowledge, GBS-IRL is the first active IRL algorithm with provable sample
complexity bounds. We also discuss our method in light of other existing
methods in the literature and its general applicability in multi-class
classification problems. Finally, motivated by recent work on learning from
demonstration in robots, we also discuss how different forms of human feedback
can be integrated in a transparent manner into our learning framework.
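The active-query idea can be sketched as follows: maintain a posterior over candidate reward hypotheses and query the state whose predicted action splits the posterior mass most evenly, in the spirit of generalized binary search. The data structures below are assumptions, not the paper's formulation.

```python
def select_query(states, hypotheses, posterior, predict_action):
    """Pick the state where the posterior over reward hypotheses disagrees
    most about the optimal action (generalized-binary-search flavour)."""
    best_state, best_score = None, -1.0
    for s in states:
        votes = {}
        for h, w in zip(hypotheses, posterior):
            a = predict_action(h, s)   # optimal action under hypothesis h
            votes[a] = votes.get(a, 0.0) + w
        score = 1.0 - max(votes.values())  # high when no action dominates
        if score > best_score:
            best_state, best_score = s, score
    return best_state
```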
A Comparison of learning algorithms on the Arcade Learning Environment
Reinforcement learning agents have traditionally been evaluated on small toy
problems. With advances in computing power and the advent of the Arcade
Learning Environment, it is now possible to evaluate algorithms on diverse and
difficult problems within a consistent framework. We discuss some challenges
posed by the Arcade Learning Environment that do not manifest in simpler
environments. We then provide a comparison of model-free, linear learning
algorithms on this challenging problem set.
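For context, the kind of model-free, linear learner evaluated in such comparisons can be sketched as a SARSA(0) update over a fixed feature map; the step size, discount, and feature construction are assumptions.

```python
import numpy as np

def sarsa_update(theta, phi_sa, r, phi_next_sa, alpha=0.01, gamma=0.99):
    # Linear SARSA(0): one gradient step on the TD error under features
    # phi(s, a), with Q(s, a) = theta . phi(s, a).
    td_error = r + gamma * theta @ phi_next_sa - theta @ phi_sa
    return theta + alpha * td_error * phi_sa
```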
EnsembleDAgger: A Bayesian Approach to Safe Imitation Learning
While imitation learning is often used in robotics, the approach frequently
suffers from data mismatch and compounding errors. DAgger is an iterative
algorithm that addresses these issues by aggregating training data from both
the expert and novice policies, but does not consider the impact of safety. We
present a probabilistic extension to DAgger, which attempts to quantify the
confidence of the novice policy as a proxy for safety. Our method,
EnsembleDAgger, approximates a Gaussian Process using an ensemble of neural
networks. Using the variance as a measure of confidence, we compute a decision
rule that captures how much we doubt the novice, thus determining when it is
safe to allow the novice to act. With this approach, we aim to maximize the
novice's share of actions, while constraining the probability of failure. We
demonstrate improved safety and learning performance compared to other DAgger
variants and classic imitation learning on an inverted pendulum and in the
MuJoCo HalfCheetah environment.
Comment: Accepted to the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019).
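A compact sketch of the ensemble decision rule is below; both thresholds are illustrative assumptions rather than the paper's tuned values.

```python
import torch

def choose_action(ensemble, expert_action, obs, eps=0.1, var_max=0.05):
    # The ensemble's mean approximates the GP posterior mean, its spread the
    # posterior variance; the novice acts only when it both agrees with the
    # expert and is confident.
    with torch.no_grad():
        preds = torch.stack([net(obs) for net in ensemble])
    mean = preds.mean(0)
    var = preds.var(0).max().item()
    close = (torch.norm(mean - expert_action) < eps).item()
    return mean if (close and var < var_max) else expert_action
```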