Successor Feature Sets: Generalizing Successor Representations Across Policies
Successor-style representations have many advantages for reinforcement
learning: for example, they can help an agent generalize from past experience
to new goals, and they have been proposed as explanations of behavioral and
neural data from human and animal learners. They also form a natural bridge
between model-based and model-free RL methods: like the former they make
predictions about future experiences, and like the latter they allow efficient
prediction of total discounted rewards. However, successor-style
representations are not optimized to generalize across policies: typically, we
maintain a limited-length list of policies, and share information among them by
representation learning or generalized policy improvement (GPI). Successor-style representations also typically
make no provision for gathering information or reasoning about latent
variables. To address these limitations, we bring together ideas from
predictive state representations, belief space value iteration, successor
features, and convex analysis: we develop a new, general successor-style
representation, together with a Bellman equation that connects multiple sources
of information within this representation, including different latent states,
policies, and reward functions. The new representation is highly expressive:
for example, it lets us efficiently read off an optimal policy for a new reward
function, or a policy that imitates a new demonstration. For this paper, we
focus on exact computation of the new representation in small, known
environments, since even this restricted setting offers plenty of interesting
questions. Our implementation does not scale to large, unknown environments --
nor would we expect it to, since it generalizes POMDP value iteration, which is
difficult to scale. However, we believe that future work will allow us to
extend our ideas to approximate reasoning in large, unknown environments.
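The key property of successor-style representations described above, that they allow efficient prediction of total discounted rewards for any reward function, can be illustrated with the classic tabular successor representation. The following sketch is illustrative only; the 3-state chain and policy are assumptions, not the paper's construction, and the paper's representation generalizes well beyond this case.

```python
import numpy as np

# Tabular successor representation (SR) for one fixed policy, a minimal
# sketch. The 3-state chain and its policy-induced transition matrix
# P_pi are illustrative assumptions.
gamma = 0.9
P_pi = np.array([
    [0.0, 1.0, 0.0],   # state 0 -> state 1
    [0.0, 0.0, 1.0],   # state 1 -> state 2
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])

# SR: M[s, s'] = expected discounted number of visits to s' starting
# from s, i.e. M = sum_t gamma^t P_pi^t = (I - gamma * P_pi)^{-1}.
M = np.linalg.inv(np.eye(3) - gamma * P_pi)

# With M in hand, evaluating the policy under *any* new reward vector
# is a single matrix-vector product -- no further dynamic programming.
r = np.array([0.0, 0.0, 1.0])   # reward only in state 2
v = M @ r
print(v)                        # state values under the fixed policy
```

This is the "model-free efficiency with model-based flexibility" bridge the abstract refers to: `M` is learned once per policy, after which new reward functions are evaluated by a dot product.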
Efficient Model Learning for Human-Robot Collaborative Tasks
We present a framework for learning human user models from joint-action
demonstrations that enables the robot to compute a robust policy for a
collaborative task with a human. The learning takes place completely
automatically, without any human intervention. First, we describe the
clustering of demonstrated action sequences into different human types using an
unsupervised learning algorithm. These demonstrated sequences are also used by
the robot to learn a reward function that is representative of each type,
through the employment of an inverse reinforcement learning algorithm. The
learned model is then used as part of a Mixed Observability Markov Decision
Process formulation, wherein the human type is a partially observable variable.
With this framework, we can infer, either offline or online, the human type of
a new user that was not included in the training set, and can compute a policy
for the robot that is aligned with the preferences of this new user and
robust to deviations of the human's actions from prior demonstrations.
Finally, we validate the approach using data collected in human subject
experiments, and conduct proof-of-concept demonstrations in which a person
performs a collaborative task with a small industrial robot.
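The online inference of a partially observable human type can be sketched as a Bayesian belief update, the core filtering step in a Mixed Observability MDP. The two types and their action likelihoods below are illustrative assumptions, not the learned model from the paper.

```python
import numpy as np

# Belief update over a latent human "type", a minimal sketch of the
# partially observable component of a MOMDP. The likelihood table
# P(action | type) is an illustrative assumption.
likelihood = np.array([
    # columns = actions 0..2
    [0.7, 0.2, 0.1],   # type 0 prefers action 0
    [0.1, 0.2, 0.7],   # type 1 prefers action 2
])

def update_belief(belief, action):
    """Posterior over types after observing one human action."""
    posterior = belief * likelihood[:, action]
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])      # uniform prior over the two types
for action in [0, 0, 1]:           # an observed action sequence
    belief = update_belief(belief, action)
print(belief)                      # mass concentrates on type 0
```

In the full framework, this belief over types would condition the robot's policy, which is what makes the policy adapt to a new user not seen during training.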
Offline RL with Observation Histories: Analyzing and Improving Sample Complexity
Offline reinforcement learning (RL) can in principle synthesize behavior
closer to optimal than any in a dataset consisting only of suboptimal
trials. One way that this
can happen is by "stitching" together the best parts of otherwise suboptimal
trajectories that overlap on similar states, to create new behaviors where each
individual state is in-distribution, but the overall returns are higher.
However, in many interesting and complex applications, such as autonomous
navigation and dialogue systems, the state is partially observed. Even worse,
the state representation is unknown or not easy to define. In such cases,
policies and value functions are often conditioned on observation histories
instead of states. It is then unclear whether the same kind of
"stitching" is feasible at the level of observation histories, since two
different trajectories would always have different histories, and thus "similar
states" that might lead to effective stitching cannot be leveraged.
Theoretically, we show that standard offline RL algorithms conditioned on
observation histories suffer from poor sample complexity, in accordance with
the above intuition. We then identify sufficient conditions under which offline
RL can still be efficient -- intuitively, it needs to learn a compact
representation of history comprising only features relevant for action
selection. We introduce a bisimulation loss that captures the extent to which
this happens, and propose that offline RL can explicitly optimize this loss to
improve worst-case sample complexity. Empirically, we show that across a variety of
tasks either our proposed loss improves performance, or the value of this loss
is already minimized as a consequence of standard offline RL, indicating that
it correlates well with good performance.
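The idea of a bisimulation-style loss on history representations can be sketched as follows: the distance between two history embeddings should track the difference in immediate rewards plus the discounted distance between the embeddings of the successor histories. The linear encoder, the sampled data, and the exact loss form below are illustrative assumptions, not the paper's definition.

```python
import numpy as np

# A bisimulation-style loss on history embeddings, a minimal sketch.
# Encoder, data, and loss form are illustrative assumptions.
gamma = 0.9
rng = np.random.default_rng(0)

def encode(history, W):
    """Hypothetical linear encoder: history vector -> compact features."""
    return W @ history

def bisim_loss(W, h1, h2, r1, r2, h1_next, h2_next):
    # Embedding distance should match reward difference plus the
    # discounted distance between successor-history embeddings.
    d = np.linalg.norm(encode(h1, W) - encode(h2, W))
    target = abs(r1 - r2) + gamma * np.linalg.norm(
        encode(h1_next, W) - encode(h2_next, W))
    return (d - target) ** 2

W = rng.normal(size=(4, 8))        # embed 8-dim histories into 4 dims
h1, h2 = rng.normal(size=8), rng.normal(size=8)
h1n, h2n = rng.normal(size=8), rng.normal(size=8)
loss = bisim_loss(W, h1, h2, 1.0, 0.0, h1n, h2n)
print(loss)
```

Driving such a loss toward zero encourages the encoder to keep only features of the history that matter for rewards, which is exactly the "compact representation of history" condition identified in the abstract.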