Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning
An appropriate reward function is of paramount importance in specifying a
task in reinforcement learning (RL). Yet, it is known to be extremely
challenging in practice to design a correct reward function for even simple
tasks. Human-in-the-loop (HiL) RL allows humans to communicate complex goals to
the RL agent by providing various types of feedback. However, despite achieving
great empirical successes, HiL RL usually requires too much feedback from a
human teacher and also suffers from insufficient theoretical understanding. In
this paper, we focus on addressing this issue from a theoretical perspective,
aiming to provide provably feedback-efficient algorithmic frameworks that bring
a human into the loop to specify the rewards of given tasks. We provide an
active-learning-based RL algorithm that first explores the environment without
specifying a reward function and then asks a human teacher for only a few
queries about the rewards of the task at selected state-action pairs. After that, the
algorithm is guaranteed to return a nearly optimal policy for the task with high
probability. We show that, even in the presence of random noise in the feedback, the
algorithm needs only a number of reward queries that scales with the horizon H of the
RL environment and with the complexity of the function class representing the reward
function in order to return an ε-optimal policy for any ε > 0. In contrast, standard
RL algorithms must query the reward function at a number of state-action pairs that
grows with a quantity d that depends on the complexity of the environmental transition.
Comment: 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
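As a rough illustration of the two-phase scheme this abstract describes (reward-free exploration followed by a small number of active reward queries to a possibly noisy human teacher), the sketch below separates the two phases. The environment interface (`env.reset`, `env.step`, `env.num_actions`), the human oracle `ask_human`, and the visit-count uncertainty heuristic are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from collections import Counter

def explore_reward_free(env, num_episodes, horizon):
    """Phase 1: collect state-action coverage without observing any reward."""
    visits = []
    for _ in range(num_episodes):
        state = env.reset()
        for _ in range(horizon):
            action = np.random.randint(env.num_actions)  # uniform exploration for simplicity
            visits.append((state, action))
            state = env.step(action)                     # assumed reward-free env interface
    return visits

def active_reward_queries(visits, ask_human, budget):
    """Phase 2: spend a small query budget on the least-visited state-action pairs,
    a crude stand-in for querying where the reward is most uncertain."""
    counts = Counter(visits)
    ranked = sorted(counts, key=counts.get)               # least-visited first
    return {sa: ask_human(*sa) for sa in ranked[:budget]} # noisy human reward labels

# With these few labels one would fit a reward model over the function class
# and plan a near-optimal policy against it (omitted here).
```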
Discovering Generalizable Spatial Goal Representations via Graph-based Active Reward Learning
In this work, we consider one-shot imitation learning for object
rearrangement tasks, where an AI agent needs to watch a single expert
demonstration and learn to perform the same task in different environments. To
achieve a strong generalization, the AI agent must infer the spatial goal
specification for the task. However, there can be multiple goal specifications
that fit the given demonstration. To address this, we propose a reward learning
approach, Graph-based Equivalence Mappings (GEM), that can discover spatial
goal representations that are aligned with the intended goal specification,
enabling successful generalization in unseen environments. Specifically, GEM
represents a spatial goal specification by a reward function conditioned on i)
a graph indicating important spatial relationships between objects and ii)
state equivalence mappings for each edge in the graph indicating invariant
properties of the corresponding relationship. GEM combines inverse
reinforcement learning and active reward learning to efficiently improve the
reward function by utilizing the graph structure and domain randomization
enabled by the equivalence mappings. We conducted experiments with simulated
oracles and with human subjects. The results show that GEM can drastically
improve the generalizability of the learned goal representations over strong
baselines.
Comment: ICML 2022, the first two authors contributed equally, project page
https://www.tshu.io/GE
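To make the idea of a graph-conditioned spatial reward concrete, here is a toy sketch in which the total reward is a sum of per-edge terms and each edge first passes the relative object state through an equivalence mapping encoding what that relation is invariant to. The specific graph, mappings, and scoring function are illustrative assumptions, not GEM's actual parameterization.

```python
import numpy as np

def edge_reward(positions, edge, mapping, target):
    """Score one spatial relation between a pair of objects."""
    i, j = edge
    rel = positions[j] - positions[i]   # relative displacement
    invariant_rel = mapping(rel)        # equivalence mapping for this edge
    return -np.linalg.norm(invariant_rel - target)

def graph_reward(positions, edges, mappings, targets):
    """Total reward: sum of edge terms over the relation graph."""
    return sum(edge_reward(positions, e, mappings[e], targets[e]) for e in edges)

# Example: the relation "A is directly above B" is invariant to horizontal
# translation, so its mapping keeps only the vertical component.
positions = {"A": np.array([0.0, 1.5]), "B": np.array([3.0, 0.0])}
edges = [("B", "A")]
mappings = {("B", "A"): lambda rel: np.array([0.0, rel[1]])}
targets = {("B", "A"): np.array([0.0, 1.0])}
print(graph_reward(positions, edges, mappings, targets))  # -0.5: A is 0.5 too high
```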
Active Reward Learning for Co-Robotic Vision Based Exploration in Bandwidth Limited Environments
We present a novel POMDP problem formulation for a robot that must
autonomously decide where to go to collect new and scientifically relevant
images given a limited ability to communicate with its human operator. From
this formulation we derive constraints and design principles for the
observation model, reward model, and communication strategy of such a robot,
exploring techniques to deal with the very high-dimensional observation space
and scarcity of relevant training data. We introduce a novel active reward
learning strategy based on making queries to help the robot minimize path
"regret" online, and evaluate it for suitability in autonomous visual
exploration through simulations. We demonstrate that, in some bandwidth-limited
environments, this novel regret-based criterion enables the robotic explorer to
collect up to 17% more reward per mission than the next-best criterion.
Comment: 7 pages, 4 figures; accepted for presentation in IEEE Int. Conf. on
Robotics and Automation, ICRA '20, Paris, France, June 2020.
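A rough numeric sketch of the regret view of query selection follows: given samples from the robot's current belief over per-location rewards, the regret of committing to a path is how much reward it expects to lose relative to the best path under the true reward, and querying the operator is most valuable where resolving a location's reward would shrink that expected regret. The belief representation, locations, and paths below are illustrative assumptions, not the paper's POMDP formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior samples of per-location reward (rows: samples, cols: locations 0-3).
reward_samples = rng.normal(loc=[1.0, 0.5, 0.9, 0.1],
                            scale=[0.1, 0.8, 0.1, 0.1],
                            size=(1000, 4))
paths = {"path_1": [0, 2], "path_2": [1, 3]}  # each path visits two locations

def expected_regret(path_locs, samples):
    """Mean shortfall of this path vs. the per-sample best path."""
    path_values = {name: samples[:, locs].sum(axis=1) for name, locs in paths.items()}
    best = np.maximum.reduce(list(path_values.values()))
    return (best - samples[:, path_locs].sum(axis=1)).mean()

for name, locs in paths.items():
    print(name, expected_regret(locs, reward_samples))
# path_2's regret is driven by the uncertain location 1, so a query about
# location 1 is the natural candidate for reducing online regret.
```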
Active Inverse Reward Design
Designers of AI agents often iterate on the reward function in a
trial-and-error process until they get the desired behavior, but this only
guarantees good behavior in the training environment. We propose structuring
this process as a series of queries asking the user to compare between
different reward functions. Thus we can actively select queries for maximum
informativeness about the true reward. In contrast to approaches asking the
designer for optimal behavior, this allows us to gather additional information
by eliciting preferences between suboptimal behaviors. After each query, we
need to update the posterior over the true reward function from observing the
proxy reward function chosen by the designer. The recently proposed Inverse
Reward Design (IRD) enables this. Our approach substantially outperforms IRD in
test environments. In particular, it can query the designer about
interpretable, linear reward functions and still infer non-linear ones.
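To illustrate the active-selection step described above, the sketch below picks the comparison query (a pair of candidate proxy rewards shown to the designer) with the highest expected information gain over a discrete posterior on the true reward. The Boltzmann choice likelihood, the feature summaries of each proxy's induced behavior, and all numbers are illustrative assumptions, not the paper's exact model.

```python
import numpy as np
from itertools import combinations

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# Discrete hypothesis space over the true reward (feature weight vectors), uniform prior.
true_ws = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
prior = np.ones(len(true_ws)) / len(true_ws)
# Each proxy reward is summarized by the feature counts of the behavior it induces.
proxy_feats = {"proxy_a": np.array([2.0, 0.0]),
               "proxy_b": np.array([0.0, 2.0]),
               "proxy_c": np.array([1.0, 1.0])}

def choice_probs(w, feats, beta=3.0):
    """P(designer picks each shown proxy | true reward w): Boltzmann in true value."""
    vals = beta * np.array([f @ w for f in feats])
    vals -= vals.max()
    p = np.exp(vals)
    return p / p.sum()

def expected_info_gain(query, prior):
    feats = [proxy_feats[k] for k in query]
    lik = np.array([choice_probs(w, feats) for w in true_ws])  # [hypothesis, choice]
    p_choice = prior @ lik
    gain = entropy(prior)
    for c, pc in enumerate(p_choice):
        posterior = prior * lik[:, c] / pc
        gain -= pc * entropy(posterior)
    return gain

best = max(combinations(proxy_feats, 2), key=lambda q: expected_info_gain(q, prior))
print("most informative comparison:", best)
```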
Deep reinforcement learning from human preferences
For sophisticated reinforcement learning (RL) systems to interact usefully
with real-world environments, we need to communicate complex goals to these
systems. In this work, we explore goals defined in terms of (non-expert) human
preferences between pairs of trajectory segments. We show that this approach
can effectively solve complex RL tasks without access to the reward function,
including Atari games and simulated robot locomotion, while providing feedback
on less than one percent of our agent's interactions with the environment. This
reduces the cost of human oversight far enough that it can be practically
applied to state-of-the-art RL systems. To demonstrate the flexibility of our
approach, we show that we can successfully train complex novel behaviors with
about an hour of human time. These behaviors and environments are considerably
more complex than any that have been previously learned from human feedback.
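The standard way to turn such pairwise segment preferences into a learned reward signal is a Bradley-Terry style cross-entropy on predicted segment returns; the minimal PyTorch sketch below shows that loss, with the network size, optimizer, and data shapes as illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def segment_return(self, seg):   # seg: [T, obs_dim]
        return self.net(seg).sum()   # sum of per-step predicted rewards

def preference_loss(model, seg_a, seg_b, pref_a):
    """Cross-entropy on P(seg_a preferred over seg_b) under a Bradley-Terry model."""
    ra, rb = model.segment_return(seg_a), model.segment_return(seg_b)
    logit = ra - rb
    return nn.functional.binary_cross_entropy_with_logits(
        logit.unsqueeze(0), torch.tensor([pref_a]))

model = RewardModel(obs_dim=4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(25, 4), torch.randn(25, 4)   # two trajectory segments
loss = preference_loss(model, seg_a, seg_b, pref_a=1.0)  # human preferred seg_a
opt.zero_grad()
loss.backward()
opt.step()
```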
Feedback-efficient Active Preference Learning for Socially Aware Robot Navigation
Socially aware robot navigation, where a robot is required to optimize its
trajectory to maintain comfortable and compliant spatial interactions with
humans in addition to reaching its goal without collisions, is a fundamental
yet challenging task in the context of human-robot interaction. While existing
learning-based methods have achieved better performance than the preceding
model-based ones, they still have drawbacks: reinforcement learning depends on a
handcrafted reward function that is unlikely to effectively quantify broad social
compliance and can lead to reward exploitation problems; meanwhile, inverse
reinforcement learning suffers from the need for expensive human
demonstrations. In this paper, we propose a feedback-efficient active
preference learning approach, FAPL, that distills human comfort and expectation
into a reward model to guide the robot agent to explore latent aspects of
social compliance. We further introduce hybrid experience learning to improve
the efficiency of human feedback and samples, and evaluate the benefits of robot
behaviors learned from FAPL through extensive simulation experiments and a user
study (N=10) employing a physical robot to navigate with human subjects in
real-world scenarios. Source code and experiment videos for this work are
available at: https://sites.google.com/view/san-fapl.
Comment: To appear in IROS 2022.
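One common way to make preference queries feedback-efficient is to keep an ensemble of reward models and ask the human only about the trajectory pair on which the ensemble disagrees most about which is better. Whether FAPL uses exactly this criterion is an assumption here; the sketch below only illustrates the general idea of spending feedback where the reward model is most uncertain.

```python
import numpy as np

def pair_uncertainty(ensemble_returns_a, ensemble_returns_b):
    """Disagreement of the ensemble about P(a preferred over b)."""
    prefs = 1.0 / (1.0 + np.exp(-(ensemble_returns_a - ensemble_returns_b)))
    return prefs.std()

def select_query(candidate_pairs, ensemble_returns):
    """Pick the candidate trajectory pair with the highest ensemble disagreement."""
    return max(candidate_pairs,
               key=lambda ab: pair_uncertainty(ensemble_returns[ab[0]],
                                               ensemble_returns[ab[1]]))

# ensemble_returns[i] = predicted return of trajectory i under each ensemble member
ensemble_returns = {0: np.array([1.0, 1.1, 0.9]),
                    1: np.array([0.2, 1.5, 0.8]),
                    2: np.array([0.5, 0.6, 0.4])}
print(select_query([(0, 1), (0, 2), (1, 2)], ensemble_returns))
```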
Explore, Exploit or Listen: Combining Human Feedback and Policy Model to Speed up Deep Reinforcement Learning in 3D Worlds
We describe a method to use discrete human feedback to enhance the
performance of deep learning agents in virtual three-dimensional environments
by extending deep reinforcement learning to model the confidence and
consistency of human feedback. This enables deep reinforcement learning
algorithms to determine the most appropriate time to listen to the human
feedback, exploit the current policy model, or explore the agent's environment.
Managing the trade-off between these three strategies allows DRL agents to be
robust to inconsistent or intermittent human feedback. Through experimentation
using a synthetic oracle, we show that our technique improves the training
speed and overall performance of deep reinforcement learning in navigating
three-dimensional environments using Minecraft. We further show that our
technique is robust to highly inaccurate human feedback and can also operate
when no human feedback is given.
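A toy sketch of the explore/exploit/listen arbitration follows: the agent keeps a running estimate of how consistent the human feedback has been and listens to the human with probability proportional to that estimate, otherwise exploring or exploiting its own policy. The arbitration rule, thresholds, and action space are illustrative assumptions, not the paper's model of feedback confidence and consistency.

```python
import random

class FeedbackArbiter:
    """Decides whether to listen to human feedback, exploit the policy, or explore."""

    def __init__(self, num_actions=4, epsilon=0.1):
        self.num_actions = num_actions
        self.epsilon = epsilon            # exploration rate
        self.reliable, self.total = 1, 2  # Laplace-smoothed feedback-consistency estimate

    def record_feedback_outcome(self, feedback_was_consistent):
        """Update the running estimate of how consistent the human feedback is."""
        self.total += 1
        self.reliable += int(feedback_was_consistent)

    def act(self, policy_action, human_action=None):
        confidence = self.reliable / self.total
        if human_action is not None and random.random() < confidence:
            return human_action                        # listen to the human
        if random.random() < self.epsilon:
            return random.randrange(self.num_actions)  # explore the environment
        return policy_action                           # exploit the current policy
```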