Deep Q-learning from Demonstrations
Deep reinforcement learning (RL) has achieved several high-profile successes
in difficult decision-making problems. However, these algorithms typically
require a huge amount of data before they reach reasonable performance; in
fact, their performance during learning can be extremely poor. This may be
acceptable in a simulator, but it severely limits the applicability of deep RL
to many real-world tasks, where the agent must learn in the real environment.
In this paper we study a setting where the agent may access data from previous
control of the system. We present an algorithm, Deep Q-learning from
Demonstrations (DQfD), that leverages even relatively small amounts of
demonstration data to massively accelerate the learning process, and that
automatically assesses the necessary ratio of demonstration data while
learning, thanks to a prioritized replay mechanism.
DQfD works by combining temporal difference updates with supervised
classification of the demonstrator's actions. We show that DQfD has better
initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN):
it starts with better scores on the first million steps on 41 of 42 games, and
on average it takes PDD DQN 83 million steps to catch up to DQfD's
performance. DQfD learns to outperform the best demonstration given in 14 of
42 games. In addition, DQfD leverages human demonstrations to achieve
state-of-the-art results for 11 games. Finally, we show that DQfD performs
better than three related algorithms for incorporating demonstration data into
DQN.
Comment: Published at AAAI 2018. Previously on arXiv as "Learning from
Demonstrations for Real World Reinforcement Learning".
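The combination just described -- temporal difference updates plus supervised classification of the demonstrator's actions -- can be sketched for a single transition as below. This is a minimal illustration, not the paper's full objective (which also includes n-step returns and L2 regularization); the `margin` and `lambda_e` values and the helper name `dqfd_loss` are illustrative assumptions.

```python
import numpy as np

def dqfd_loss(q_values, q_next, action, reward, expert_action,
              gamma=0.99, margin=0.8, lambda_e=1.0):
    """Single-transition sketch of a DQfD-style loss (illustrative only)."""
    # 1-step TD loss: (r + gamma * max_a' Q(s', a') - Q(s, a))^2
    td_target = reward + gamma * np.max(q_next)
    td_loss = (td_target - q_values[action]) ** 2

    # Large-margin supervised loss on demonstration transitions:
    # max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    # where l(a_E, a) = margin if a != a_E, else 0.
    if expert_action is not None:
        margins = np.full_like(q_values, margin)
        margins[expert_action] = 0.0
        sup_loss = np.max(q_values + margins) - q_values[expert_action]
    else:
        sup_loss = 0.0   # non-demonstration transition: TD term only
    return td_loss + lambda_e * sup_loss
```

The margin term pushes the demonstrator's action to have a Q-value at least `margin` above every other action, which is what grounds the Q-function in the demonstrations before any environment interaction.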
Inverse Reinforcement Learning with Nonparametric Behavior Clustering
Inverse Reinforcement Learning (IRL) is the task of learning a single reward
function, given a Markov Decision Process (MDP) whose reward function is left
undefined and a set of demonstrations generated by humans or experts. In
practice, however, it may be unreasonable to assume that human behaviors can
be explained by one reward function, since those behaviors may be inherently
inconsistent. Demonstrations may also be collected from various users and
aggregated to infer and predict users' behaviors. In this paper, we introduce
the Non-parametric Behavior Clustering IRL algorithm to simultaneously cluster
demonstrations and learn multiple reward functions from demonstrations that may
be generated by more than one behavior. Our method is iterative: it
alternates between clustering demonstrations into different behavior clusters
and inverse learning the reward functions until convergence. It is built upon
the Expectation-Maximization formulation and non-parametric clustering in the
IRL setting. Further, to improve computational efficiency, we remove the need
to completely solve multiple IRL problems for multiple clusters during the
iteration steps, and we introduce a resampling technique to avoid generating too
many unlikely clusters. We demonstrate the convergence and efficiency of the
proposed method through learning multiple driver behaviors from demonstrations
generated from a grid-world environment and continuous trajectories collected
from autonomous robot cars using the Gazebo robot simulator.
Comment: 9 pages, 4 figures
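The alternation this abstract describes -- soft-assigning demonstrations to behavior clusters, then re-estimating each cluster's reward -- can be illustrated with a toy EM sketch. It assumes linear rewards theta_k and a trajectory likelihood proportional to exp(theta_k . phi(traj)), and it deliberately skips the inner MDP solve a real IRL step would require; the farthest-point initialization is also an added assumption, not the paper's.

```python
import numpy as np

def cluster_demonstrations(feature_counts, n_clusters=2, n_iters=50):
    """Toy EM alternation over trajectory feature counts (illustrative)."""
    n, d = feature_counts.shape
    phi_dir = feature_counts / np.linalg.norm(
        feature_counts, axis=1, keepdims=True)
    # Farthest-point initialization of the per-cluster reward parameters.
    thetas = np.empty((n_clusters, d))
    thetas[0] = phi_dir[0]
    for k in range(1, n_clusters):
        sims = phi_dir @ thetas[:k].T
        thetas[k] = phi_dir[np.argmin(sims.max(axis=1))]
    pi = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iters):
        # E-step: soft-assign each demonstration to the behavior clusters.
        logits = feature_counts @ thetas.T + np.log(pi)
        logits -= logits.max(axis=1, keepdims=True)
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and reward parameters
        # (here just the responsibility-weighted mean feature direction).
        pi = resp.mean(axis=0)
        for k in range(n_clusters):
            mean_phi = resp[:, k] @ feature_counts / resp[:, k].sum()
            thetas[k] = mean_phi / np.linalg.norm(mean_phi)
    return resp.argmax(axis=1), thetas
```

With two populations of demonstrations that emphasize different features, the loop separates them into two clusters with distinct reward directions.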
Learning human behaviors from motion capture by adversarial imitation
Rapid progress in deep reinforcement learning has made it increasingly
feasible to train controllers for high-dimensional humanoid bodies. However,
methods that use pure reinforcement learning with simple reward functions tend
to produce non-humanlike and overly stereotyped movement behaviors. In this
work, we extend generative adversarial imitation learning to enable training of
generic neural network policies to produce humanlike movement patterns from
limited demonstrations consisting only of partially observed state features,
without access to actions, even when the demonstrations come from a body with
different and unknown physical parameters. We leverage this approach to build
sub-skill policies from motion capture data and show that they can be reused to
solve tasks when controlled by a higher-level controller.
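The state-only setup described above can be illustrated by the reward a policy receives from a discriminator trained on partially observed state features (no actions needed). A logistic discriminator here is a stand-in for the neural one, and `-log(1 - D(s))` is one common surrogate reward in adversarial imitation; this is a sketch of the idea, not the paper's exact objective.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def imitation_reward(state_features, w, b):
    """Surrogate imitation reward from a state-only discriminator.

    D(s) estimates the probability that the observed state features came
    from the demonstrator; the policy is rewarded for visiting states the
    discriminator judges demonstrator-like. (w, b: illustrative logistic
    discriminator parameters.)"""
    d = sigmoid(state_features @ w + b)
    return -np.log(1.0 - d + 1e-8)
```

Because the reward depends only on state features, it applies even when demonstrations omit actions or come from a body with different physical parameters.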
Bellman Gradient Iteration for Inverse Reinforcement Learning
This paper develops an inverse reinforcement learning algorithm aimed at
recovering a reward function from the observed actions of an agent. We
introduce a strategy to flexibly handle different types of actions with two
approximations of the Bellman Optimality Equation, and a Bellman Gradient
Iteration method to compute the gradient of the Q-value with respect to the
reward function. These methods allow us to build a differentiable relation
between the Q-value and the reward function and learn an approximately optimal
reward function with gradient methods. We test the proposed method in two
simulated environments by evaluating the accuracy of different approximations
and comparing the proposed method with existing solutions. The results show
that even with a linear reward function, the proposed method achieves accuracy
comparable to that of a state-of-the-art method that adopts a non-linear
reward function, and it is more flexible because it is defined on
observed actions instead of trajectories.
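The construction can be sketched in a tabular setting: replace the max in the Bellman optimality equation with a log-sum-exp smoothing (one approximation of this kind), which makes the Q-values differentiable in the reward, and iterate the value and gradient updates jointly. This is an illustrative reconstruction under assumed inputs (transition tensor `P[s, a, s']`, per-state reward `r`), not the paper's exact formulation.

```python
import numpy as np

def soft_q_gradient(P, r, gamma=0.9, k=10.0, n_iters=300):
    """Smoothed Q-values and their gradient w.r.t. the state rewards.

    Fixed-point iteration of Q(s,a) = r(s) + gamma * sum_s' P V(s') with
    V(s) = (1/k) log sum_a exp(k Q(s,a)), plus the matching derivative
    recursion dQ[s, a, j] = dQ(s, a)/dr(j)."""
    n_s, n_a = P.shape[0], P.shape[1]
    Q = np.zeros((n_s, n_a))
    dQ = np.zeros((n_s, n_a, n_s))
    for _ in range(n_iters):
        # Smoothed value and its exact log-sum-exp derivative weights.
        m = Q.max(axis=1, keepdims=True)
        w = np.exp(k * (Q - m))
        V = m[:, 0] + np.log(w.sum(axis=1)) / k
        w /= w.sum(axis=1, keepdims=True)        # dV/dQ = softmax weights
        dV = np.einsum('sa,saj->sj', w, dQ)
        # Bellman backup and its derivative w.r.t. the state rewards.
        Q = r[:, None] + gamma * np.einsum('sap,p->sa', P, V)
        dQ = np.eye(n_s)[:, None, :] + gamma * np.einsum('sap,pj->saj', P, dV)
    return Q, dQ
```

Once dQ/dr is available, the reward can be learned by gradient ascent on the likelihood of the observed actions under a Q-based action model.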
Integrating kinematics and environment context into deep inverse reinforcement learning for predicting off-road vehicle trajectories
Predicting the motion of a mobile agent from a third-person perspective is an
important component for many robotics applications, such as autonomous
navigation and tracking. With accurate motion prediction of other agents,
robots can plan for more intelligent behaviors to achieve specified objectives,
instead of acting in a purely reactive way. Previous work addresses motion
prediction by either only filtering kinematics, or using hand-designed and
learned representations of the environment. Instead of separating kinematic and
environmental context, we propose a novel approach to integrate both into an
inverse reinforcement learning (IRL) framework for trajectory prediction.
Instead of exponentially increasing the state-space complexity with kinematics,
we propose a two-stage neural network architecture that considers motion and
environment together to recover the reward function. The first-stage network
learns feature representations of the environment using low-level LiDAR
statistics and the second-stage network combines those learned features with
kinematics data. We collected over 30 km of off-road driving data and validated
experimentally that our method can effectively extract useful environmental and
kinematic features. We generate accurate predictions of the distribution of
future trajectories of the vehicle, encoding complex behaviors such as
multi-modal distributions at road intersections, and even show different
predictions at the same intersection depending on the vehicle's speed.
Comment: CoRL 201
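The two-stage architecture can be sketched as a forward pass: a first stage embeds low-level LiDAR statistics into environment features, and a second stage combines those features with kinematics to produce a per-cell reward. The dense layers, sizes, and scalar reward head below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def two_stage_reward(lidar_stats, kinematics, params):
    """Minimal two-stage forward pass (illustrative layer choices)."""
    W1, b1, W2, b2 = params
    # Stage 1: embed per-cell LiDAR statistics into environment features.
    env_feat = relu(lidar_stats @ W1 + b1)
    # Stage 2: concatenate environment features with kinematic features
    # and map each grid cell to a scalar reward.
    joint = np.concatenate([env_feat, kinematics], axis=-1)
    return joint @ W2 + b2
```

Keeping kinematics out of the state space and feeding it to the reward network directly is what avoids the exponential state-space blow-up the abstract mentions.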
Solving Markov Decision Processes with Reachability Characterization from Mean First Passage Times
A new mechanism for efficiently solving Markov decision processes (MDPs)
is proposed in this paper. We introduce the notion of reachability landscape
where we use the Mean First Passage Time (MFPT) as a means to characterize the
reachability of every state in the state space. We show that this reachability
characterization assesses the importance of states very well and thus provides
a natural basis for effectively prioritizing states and approximating policies.
Building on this novel observation, we design two new algorithms -- Mean First
Passage Time based Value Iteration (MFPT-VI) and Mean First Passage Time based
Policy Iteration (MFPT-PI) -- modified from state-of-the-art solution methods.
To validate our design, we performed numerical evaluations in robotic
decision-making scenarios, comparing the proposed methods with the
corresponding classic baseline mechanisms. The evaluation results show that
MFPT-VI and MFPT-PI outperform the state-of-the-art solutions in terms of both
practical runtime and number of iterations. Aside from the advantage of fast
convergence, the new methods are intuitively easy to understand and practically
simple to implement.
Comment: The paper was published in 2018 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS).
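The prioritization idea can be sketched as follows: compute the mean first passage time (MFPT) of every state to a goal state under the current greedy policy, then run in-place value-iteration sweeps in increasing-MFPT order so information propagates outward from the goal. The goal-directed ordering below is an illustrative reading of the reachability landscape, not the paper's exact MFPT-VI.

```python
import numpy as np

def mfpt_to_goal(P_pi, goal):
    """Mean first passage time h(s) to `goal` under transition matrix P_pi:
    h(goal) = 0 and h(s) = 1 + sum_s' P_pi[s, s'] h(s') otherwise."""
    n = P_pi.shape[0]
    keep = [s for s in range(n) if s != goal]
    A = np.eye(n - 1) - P_pi[np.ix_(keep, keep)]
    h = np.zeros(n)
    h[keep] = np.linalg.solve(A, np.ones(n - 1))
    return h

def mfpt_value_iteration(P, R, goal, gamma=0.9, n_sweeps=100):
    """Value iteration with MFPT-ordered Gauss-Seidel sweeps (sketch)."""
    n_s = P.shape[0]
    V = np.zeros(n_s)
    for _ in range(n_sweeps):
        # Greedy policy for the current V, and its transition matrix.
        Q = R[:, None] + gamma * np.einsum('sap,p->sa', P, V)
        pi = Q.argmax(axis=1)
        P_pi = P[np.arange(n_s), pi]
        # Sweep states from small to large MFPT, updating V in place.
        for s in np.argsort(mfpt_to_goal(P_pi, goal)):
            V[s] = (R[s] + gamma * P[s] @ V).max()
    return V
```

Because the sweeps are in-place, the result converges to the same optimal value function as standard value iteration; the MFPT ordering only changes how quickly updates propagate.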
Learning Safe Policies with Expert Guidance
We propose a framework for ensuring safe behavior of a reinforcement learning
agent when the reward function may be difficult to specify. In order to do
this, we rely on the existence of demonstrations from expert policies, and we
provide a theoretical framework for the agent to optimize in the space of
rewards consistent with its existing knowledge. We propose two methods to solve
the resulting optimization: an exact ellipsoid-based method and a method in the
spirit of the "follow-the-perturbed-leader" algorithm. Our experiments
demonstrate the behavior of our algorithm in both discrete and continuous
problems. The trained agent safely avoids states with potential negative
effects while imitating the behavior of the expert in the other states.
Comment: Appears in NeurIPS 201
ABC-LMPC: Safe Sample-Based Learning MPC for Stochastic Nonlinear Dynamical Systems with Adjustable Boundary Conditions
Sample-based learning model predictive control (LMPC) strategies have
recently attracted attention due to their desirable theoretical properties and
their good empirical performance on robotic tasks. However, prior analysis of
LMPC controllers for stochastic systems has mainly focused on linear systems in
the iterative learning control setting. We present a novel LMPC algorithm,
Adjustable Boundary Condition LMPC (ABC-LMPC), which enables rapid adaptation
to novel start and goal configurations and theoretically show that the
resulting controller guarantees iterative improvement in expectation for
stochastic nonlinear systems. We present results with a practical instantiation
of this algorithm and experimentally demonstrate that the resulting controller
adapts to a variety of initial and terminal conditions on 3 stochastic
continuous control tasks.
Comment: Workshop on the Algorithmic Foundations of Robotics (WAFR) 2020.
First two authors contributed equally.
Assessing the Usability of a Novel System for Programming Education
The authors present the results of a simple usability test performed on
line_explorer, an innovative tool aimed at letting students explore
programming. The system offers an interactive environment where students can
learn, review, and practice programming independently or through step-by-step
instruction. Students in Information Technology, Computer Science, and
Information Systems were surveyed. The findings show that students are
interested in this tool, with some groups finding it more interesting and
useful than others. The findings will help refine the user interface for the
next phase of testing, which includes changes for simplicity, usability, and
expanded topic content. Overall, the survey suggests that line_explorer in its
current design phase is more useful for IT and CS majors; however, significant
changes are still needed.
Comment: Presented at Systems, Programming, Languages and Applications:
Software for Humanity - Education (SPLASH-E) 201
Learning Dexterous Manipulation for a Soft Robotic Hand from Human Demonstration
Dexterous multi-fingered hands can accomplish fine manipulation behaviors
that are infeasible with simple robotic grippers. However, sophisticated
multi-fingered hands are often expensive and fragile. Low-cost soft hands offer
an appealing alternative to more conventional devices, but present considerable
challenges in sensing and actuation, making them difficult to apply to more
complex manipulation tasks. In this paper, we describe an approach to learning
from demonstration that can be used to train soft robotic hands to perform
dexterous manipulation tasks. Our method uses object-centric demonstrations,
where a human demonstrates the desired motion of manipulated objects with their
own hands, and the robot autonomously learns to imitate these demonstrations
using reinforcement learning. We propose a novel algorithm that blends and
selects a subset of the most feasible demonstrations to imitate on the
hardware, and we combine it with an extension of the guided policy search
framework that uses multiple demonstrations to learn generalizable neural
network policies. We demonstrate our approach on the RBO Hand 2, with learned
motor skills for turning a valve, manipulating an abacus, and grasping.
Comment: Accepted at the International Conference on Intelligent Robots and
Systems (IROS) 2016. PDF file updated for stylistic consistency.