New learning modes for sequential decision making
This thesis considers the problem in which a teacher is interested in teaching action policies to computer agents for sequential decision making. The vast majority of policy
learning algorithms offer teachers little flexibility in how policies are taught. In particular,
one of two learning modes is typically considered: 1) Imitation learning, where
the teacher demonstrates explicit action sequences to the learner, and 2) Reinforcement
learning, where the teacher designs a reward function for the learner to autonomously
optimize via practice. This is in sharp contrast to how humans teach other humans,
where many other learning modes are commonly used besides imitation and practice.
This thesis presents novel learning modes for teaching policies to computer agents, with
the eventual aim of allowing human teachers to teach computer agents more naturally
and efficiently.
Our first learning mode is inspired by how humans learn: through rounds of practice
followed by feedback from a teacher. We adopt this mode to create computer agents that
learn from several rounds of autonomous practice followed by critique feedback from a
teacher. Our results show that this mode of policy learning is more effective than pure
reinforcement learning, though important usability issues arise when used with human teachers.
Next we consider a learning mode where the computer agent can actively ask questions
to the teacher, which we call active imitation learning. We provide algorithms
for active imitation learning that are proven to require strictly less interaction with the
teacher than passive imitation learning. We also show that empirically active imitation learning algorithms are much more efficient than traditional passive imitation learning in terms of amount of interaction with the teacher.
Lastly, we introduce a novel imitation learning mode that allows a teacher to specify
shaping rewards to a computer agent in addition to demonstrations. Shaping rewards are
additional rewards supplied to an agent for accelerating policy learning via reinforcement
learning. We provide an algorithm to incorporate shaping rewards in imitation learning
and show that it learns from fewer demonstrations than pure imitation learning.
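The thesis's algorithm is not spelled out in the abstract, but the ingredient it names, shaping rewards, has a standard potential-based form: r'(s, a, s') = r(s, a, s') + gamma * phi(s') - phi(s). Below is a minimal sketch of one simple way to combine the two signals, under illustrative assumptions: Q-values are seeded from demonstrations and then refined by practice on the shaped reward.

```python
import random
from collections import defaultdict

GAMMA, GOAL = 0.9, 4

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def potential(s):
    """Hypothetical teacher-supplied potential: higher nearer the goal."""
    return float(s)

def shaped(r, s, s2):
    # Potential-based shaping preserves the optimal policy (Ng et al., 1999).
    return r + GAMMA * potential(s2) - potential(s)

# Teacher demonstrations on a 5-state chain: the expert always moves right.
demos = [(s, 1) for s in range(GOAL)]

Q = defaultdict(lambda: [0.0, 0.0])
for s, a in demos:        # imitation part: bias Q toward demonstrated actions
    Q[s][a] += 1.0

for _ in range(200):      # practice part: Q-learning on the shaped reward
    s, done, t = 0, False, 0
    while not done and t < 20:
        a = random.randrange(2) if random.random() < 0.1 else int(Q[s][1] > Q[s][0])
        s2, r, done = step(s, a)
        Q[s][a] += 0.5 * (shaped(r, s, s2) + GAMMA * max(Q[s2]) - Q[s][a])
        s, t = s2, t + 1

print([int(Q[s][1] > Q[s][0]) for s in range(GOAL + 1)])  # greedy policy
```

Because the shaping term supplies dense guidance between demonstrated states, fewer demonstrations are needed to pin down the policy, which is the effect the abstract reports.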
We wrap up by presenting a prototype User-Initiated Learning (UIL) system that
allows an end user to demonstrate procedures containing optional steps and instruct the
system to autonomously learn to predict when the optional steps should be executed, and
remind the user if they forget. Our prototype supports user-initiated demonstration and
learning via a natural interface, and has a built-in automated machine learning engine
to automatically train and install a predictor for the requested prediction problem.
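As a sketch of the prediction problem the UIL engine automates, the optional step can be treated as a binary label over context features logged at the branch point of each demonstrated procedure. Every feature below is a hypothetical stand-in, and scikit-learn substitutes for the system's built-in automated machine learning engine.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical log of demonstrated procedures: context features recorded at the
# branch point, plus whether the user executed the optional step.
# Feature 0: e.g., attachment present; feature 1: e.g., recipient is external.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = user performed the optional step

# The engine's job, reduced to its essence: train a predictor for the
# requested prediction problem and install it at the branch point.
predictor = DecisionTreeClassifier().fit(X, y)

def maybe_remind(context):
    """At run time: if the predictor says the optional step applies, remind
    the user in case they forgot it."""
    if predictor.predict([context])[0] == 1:
        print("Reminder: you may have skipped the optional step.")

maybe_remind(np.array([1, 0]))
```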
Adversarial Imitation Learning from Incomplete Demonstrations
Imitation learning aims to derive a mapping from states to actions, i.e., a
policy, from expert demonstrations. Existing methods for imitation learning
typically require all actions in the demonstrations to be fully available,
which is hard to ensure in real applications. Though algorithms for learning
with unobservable actions have been proposed, they focus solely on state
information and overlook the fact that the action sequence could still be
partially available and provide useful information for deriving the policy. In this
paper, we propose a novel algorithm called Action-Guided Adversarial Imitation
Learning (AGAIL) that learns a policy from demonstrations with incomplete
action sequences, i.e., incomplete demonstrations. The core idea of AGAIL is to
separate demonstrations into state and action trajectories, and train a policy
with state trajectories while using actions as auxiliary information to guide
the training whenever applicable. Built upon Generative Adversarial
Imitation Learning (GAIL), AGAIL has three components: a generator, a discriminator,
and a guide. The generator learns a policy with rewards provided by the
discriminator, which tries to distinguish state distributions between
demonstrations and samples generated by the policy. The guide provides
additional rewards to the generator when demonstrated actions for specific
states are available. We compare AGAIL to other methods on benchmark tasks and
show that AGAIL consistently delivers comparable performance to the
state-of-the-art methods even when the action sequence in demonstrations is
only partially available.
Comment: Accepted to the International Joint Conference on Artificial Intelligence (IJCAI-19).
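A rough sketch of how the three AGAIL components could fit together, using PyTorch; the exact losses, reward forms, and the policy-gradient update are assumptions rather than the paper's specification. The discriminator sees states only, while the guide adds reward just at those demonstration states whose actions survived.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, BATCH = 4, 3, 64

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 64), nn.Tanh(), nn.Linear(64, n_out))

policy = mlp(STATE_DIM, N_ACTIONS)  # generator: state -> action logits
disc = mlp(STATE_DIM, 1)            # discriminator: state -> "expert" logit
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

# Incomplete demonstrations: every state is observed, actions only sometimes.
demo_states = torch.randn(BATCH, STATE_DIM)
demo_actions = torch.randint(N_ACTIONS, (BATCH,))
action_mask = torch.rand(BATCH) < 0.5  # True where the action survived

# Policy rollout states (placeholder for samples from the environment).
gen_states = torch.randn(BATCH, STATE_DIM)

# Discriminator update: separate expert vs. generated *state* distributions.
d_loss = (F.binary_cross_entropy_with_logits(disc(demo_states), torch.ones(BATCH, 1))
          + F.binary_cross_entropy_with_logits(disc(gen_states), torch.zeros(BATCH, 1)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

with torch.no_grad():
    # Adversarial reward: grows as the discriminator mistakes generated
    # states for expert states (GAIL-style, states only).
    adv_reward = -F.logsigmoid(-disc(gen_states)).squeeze(1)

# Guide: extra reward only where a demonstrated action is available, here the
# policy's log-probability of that action (one plausible choice, an assumption).
logp = F.log_softmax(policy(demo_states), dim=1)
guide_reward = logp[torch.arange(BATCH), demo_actions] * action_mask.float()

# In a full implementation, adv_reward and guide_reward would feed a
# policy-gradient update (e.g., TRPO or PPO) of the generator; omitted here.
print(adv_reward.mean().item(), guide_reward.sum().item() / action_mask.sum().item())
```

The division of labor mirrors the abstract: the discriminator supplies the state-matching signal for all samples, and the guide exploits whatever fragments of the action sequence remain.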