The Green Choice: Learning and Influencing Human Decisions on Shared Roads
Autonomous vehicles have the potential to increase the capacity of roads via
platooning, even when human drivers and autonomous vehicles share roads.
However, when users of a road network choose their routes selfishly, the
resulting traffic configuration may be very inefficient. Because of this, we
consider how to influence human decisions so as to decrease congestion on these
roads. We consider a network of parallel roads with two modes of
transportation: (i) human drivers, who choose the quickest route available to
them, and (ii) a ride-hailing service that offers users an array of autonomous
vehicle ride options at different prices. In this work, we
seek to design these prices so that when autonomous service users choose from
these options and human drivers selfishly choose their resulting routes, road
usage is maximized and transit delay is minimized. To do so, we formalize a
model of how autonomous service users make choices between routes with
different price/delay values. We develop a preference-based algorithm to learn
users' preferences and, using a vehicle flow model related to the Fundamental
Diagram of Traffic, formulate a planning optimization that maximizes a social
objective; we demonstrate the benefit of the proposed routing and learning scheme.
Comment: Submitted to CDC 201
LLF-Bench: Benchmark for Interactive Learning from Language Feedback
We introduce a new benchmark, LLF-Bench (Learning from Language Feedback
Benchmark; pronounced as "elf-bench"), to evaluate the ability of AI agents to
interactively learn from natural language feedback and instructions. Learning
from language feedback (LLF) is essential for people, largely because the rich
information this feedback provides can help a learner avoid much of trial and
error and thereby speed up the learning process. Large Language Models (LLMs)
have recently enabled AI agents to comprehend natural language -- and hence AI
agents can potentially benefit from language feedback during learning like
humans do. But existing interactive benchmarks do not assess this crucial
capability: they either use numeric reward feedback or require no learning at
all (only planning or information retrieval). LLF-Bench is designed to fill
this gap. LLF-Bench is a diverse collection of sequential decision-making
tasks that includes user recommendation, poem writing, navigation, and robot
control. The objective of an agent is to interactively solve these tasks based
on their natural-language instructions and the feedback received after taking
actions. Crucially, to ensure that the agent actually "learns" from the
feedback, LLF-Bench implements several randomization techniques (such as
paraphrasing and environment randomization) so that the task is not already
familiar to the agent and the agent must be robust to various verbalizations.
In addition, LLF-Bench provides a unified OpenAI Gym interface for all its
tasks and allows users to easily configure the information the feedback
conveys (among suggestion, explanation, and instantaneous performance) to study
how agents respond to different types of feedback. Together, these features
make LLF-Bench a unique research platform for developing and testing LLF
agents.
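Since the benchmark exposes its tasks through an OpenAI Gym interface, an agent loop would presumably look like a standard Gym interaction in which observations and feedback arrive as text. The sketch below is a guess at that shape; the environment id, the observation contents, and the placeholder policy are assumptions, not the package's documented API.

```python
import gym

def my_agent(observation):
    """Placeholder policy; a real LLF agent would prompt an LLM with the task
    instruction, the current observation, and the latest language feedback."""
    return "revise the previous attempt"

env = gym.make("llf-poem-v0")  # hypothetical task id, for illustration only
obs = env.reset()
done = False
while not done:
    action = my_agent(obs)
    # Language feedback (suggestion / explanation / performance) comes back
    # alongside the observation rather than as a numeric reward alone.
    obs, reward, done, info = env.step(action)
```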
Query-Policy Misalignment in Preference-Based Reinforcement Learning
Preference-based reinforcement learning (PbRL) provides a natural way to
align RL agents' behavior with human-desired outcomes, but is often constrained
by costly human feedback. To improve feedback efficiency, most existing PbRL
methods focus on selecting queries to maximally improve the overall quality of
the reward model, but counter-intuitively, we find that this may not
necessarily lead to improved performance. To unravel this mystery, we identify
a long-neglected issue in the query selection schemes of existing PbRL studies:
Query-Policy Misalignment. We show that the seemingly informative queries
selected to improve the overall quality of the reward model may not actually
align with the RL agent's interests, thus offering little help for policy learning and
eventually resulting in poor feedback efficiency. We show that this issue can
be effectively addressed via near on-policy queries and a specially designed
hybrid experience replay, which together enforce the bidirectional query-policy
alignment. Simple yet elegant, our method can be easily incorporated into
existing approaches by changing only a few lines of code. We showcase in
comprehensive experiments that our method achieves substantial gains in both
human feedback and RL sample efficiency, demonstrating the importance of
addressing query-policy misalignment in PbRL tasks.
Comment: The first two authors contributed equally.
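The abstract's two remedies, querying near on-policy data and a hybrid experience replay, could look roughly like the following sketch. The buffer sizes, mixing ratio, and segment representation are assumptions, not the authors' implementation.

```python
import random
from collections import deque

class HybridReplay:
    """Mixes fresh (near on-policy) transitions with older ones when sampling,
    so policy learning keeps seeing data the current reward model has labeled."""
    def __init__(self, capacity=100_000, recent_size=5_000, recent_ratio=0.5):
        self.old = deque(maxlen=capacity)
        self.recent = deque(maxlen=recent_size)
        self.recent_ratio = recent_ratio

    def add(self, transition):
        self.old.append(transition)
        self.recent.append(transition)

    def sample(self, batch_size):
        n_recent = min(int(batch_size * self.recent_ratio), len(self.recent))
        batch = random.sample(self.recent, n_recent)
        batch += random.sample(self.old, min(batch_size - n_recent, len(self.old)))
        return batch

def select_near_on_policy_queries(recent_segments, n_queries):
    """Draw preference queries only from segments generated by the current policy,
    so human labels stay aligned with what the agent is actually learning."""
    return [tuple(random.sample(recent_segments, 2)) for _ in range(n_queries)]
```

The point of both pieces is the same bidirectional alignment the abstract names: queries are drawn from the agent's current behavior, and the agent trains mostly on data those queries help label.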