LLF-Bench: Benchmark for Interactive Learning from Language Feedback
We introduce a new benchmark, LLF-Bench (Learning from Language Feedback
Benchmark; pronounced as "elf-bench"), to evaluate the ability of AI agents to
interactively learn from natural language feedback and instructions. Learning
from language feedback (LLF) is essential for people, largely because the rich
information this feedback provides can help a learner avoid much of trial and
error and thereby speed up the learning process. Large Language Models (LLMs)
have recently enabled AI agents to comprehend natural language -- and hence AI
agents can potentially benefit from language feedback during learning like
humans do. But existing interactive benchmarks do not assess this crucial
capability: they either use numeric reward feedback or require no learning at
all (only planning or information retrieval). LLF-Bench is designed to fill
this gap. It is a diverse collection of sequential decision-making
tasks that includes user recommendation, poem writing, navigation, and robot
control. The objective of an agent is to interactively solve these tasks based
on their natural-language instructions and the feedback received after taking
actions. Crucially, to ensure that the agent actually "learns" from the
feedback, LLF-Bench implements several randomization techniques (such as
paraphrasing and environment randomization) to ensure that the task isn't
familiar to the agent and that the agent is robust to various verbalizations.
In addition, LLF-Bench provides a unified OpenAI Gym interface for all its
tasks and allows the users to easily configure the information the feedback
conveys (among suggestion, explanation, and instantaneous performance) to study
how agents respond to different types of feedback. Together, these features
make LLF-Bench a unique research platform for developing and testing LLF
agents.
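The interaction loop described above can be illustrated with a toy sketch. This is not the actual LLF-Bench API; the class name, feedback-type strings, and paraphrase table below are all hypothetical. It only mimics the two ideas the abstract emphasizes: the environment returns natural-language feedback whose information content is configurable, and the phrasing is randomized via paraphrasing so an agent cannot latch onto one fixed verbalization.

```python
import random

class ToyLanguageFeedbackEnv:
    """Hypothetical number-guessing environment with language feedback.

    Illustrative only -- not the LLF-Bench interface. step() returns a
    natural-language feedback string instead of a numeric reward, and the
    feedback type ("suggestion" vs. "performance") is configurable.
    """

    # Randomized paraphrases so the agent must be robust to verbalization.
    PARAPHRASES = {
        "too_low": ["Try a larger number.", "Your guess is too small."],
        "too_high": ["Try a smaller number.", "Your guess is too big."],
        "done": ["Correct!", "That's the target."],
    }

    def __init__(self, feedback_type="suggestion", seed=0):
        self.feedback_type = feedback_type
        self.rng = random.Random(seed)
        self.target = self.rng.randint(1, 100)

    def reset(self):
        # The task instruction is itself given in natural language.
        return "Guess an integer between 1 and 100."

    def step(self, action):
        done = action == self.target
        if self.feedback_type == "performance":
            # Instantaneous-performance feedback, verbalized.
            feedback = f"Distance to target: {abs(action - self.target)}"
        else:
            # Suggestion feedback with random paraphrasing.
            key = "done" if done else (
                "too_low" if action < self.target else "too_high")
            feedback = self.rng.choice(self.PARAPHRASES[key])
        return feedback, done
```

An agent using this loop reads the instruction from `reset()`, then repeatedly acts and parses the returned text, which is the "learning from language feedback" setting in miniature.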
Multi-task Hierarchical Reinforcement Learning for Compositional Tasks
This thesis presents algorithms for solving multiple compositional tasks with high sample efficiency and strong generalization ability.
Central to this work is the subtask graph, which models the structure of compositional tasks in graph form. We formulate compositional tasks as multi-task and meta-RL problems using the subtask graph and discuss different approaches to tackling them.
Specifically, we present four contributions, whose common idea is to exploit the inductive bias in the hierarchical task structure for efficient learning and strong generalization.
The first part of the thesis formally introduces the subtask graph execution problem: a modeling of the compositional task as a multi-task RL problem in which the agent receives a task description in graph form as an additional input.
We present a hierarchical architecture in which a high-level policy determines which subtask to execute and a low-level policy executes it. The high-level policy learns a modular neural network that can be dynamically assembled according to the input task description to choose the sequence of subtasks that maximizes the reward.
We demonstrate that the proposed method achieves strong zero-shot task generalization and also improves the search efficiency of an existing planning method when the two are combined.
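The subtask graph idea can be sketched minimally as follows. This is an illustrative stand-in, not the thesis's neural architecture: the "high-level policy" here is just a greedy rule that executes any subtask whose precondition subtasks are all complete, and the subtask names and reward values are made up for the example.

```python
def eligible(subtask, preconds, done):
    """A subtask is eligible if not yet done and all preconditions hold."""
    return subtask not in done and all(p in done for p in preconds.get(subtask, []))

def execute_graph(preconds, rewards):
    """Greedily execute subtasks in dependency order, summing rewards.

    preconds: dict mapping each subtask to its list of prerequisite
    subtasks (AND semantics). A stand-in for a learned high-level policy.
    """
    done, total = set(), 0.0
    progressed = True
    while progressed:
        progressed = False
        for s in preconds:
            if eligible(s, preconds, done):
                done.add(s)
                total += rewards.get(s, 0.0)
                progressed = True
    return done, total

# Example graph: "wood" and "stone" must precede "house".
preconds = {"wood": [], "stone": [], "house": ["wood", "stone"]}
rewards = {"wood": 0.1, "stone": 0.1, "house": 1.0}
```

The point of the graph form is exactly this kind of compositional generalization: a policy conditioned on `preconds` can be handed a new graph at test time without retraining, which is the zero-shot setting the abstract refers to.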
The second part studies the more general setting where the task structure is not available to the agent and must be inferred from the agent's own experience, i.e., the few-shot reinforcement learning setting.
Specifically, we combine meta-reinforcement learning with an inductive logic programming (ILP) method to explicitly infer the latent task structure, in the form of a subtask graph, from the agent's trajectories.
Our empirical study shows that the underlying task structure can be accurately inferred from a small amount of environment interaction, without any explicit supervision, in complex 3D environments with high-dimensional state and action spaces.
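The flavor of inferring structure from experience can be sketched with a much cruder rule than the thesis's ILP method: keep a candidate precedence constraint "p before s" only if it holds in every trajectory where both subtasks appear. The function name and the trajectory data below are hypothetical illustrations, not the actual inference algorithm.

```python
from itertools import permutations

def infer_preconditions(trajectories):
    """Infer candidate precedence constraints from subtask trajectories.

    Each trajectory is an ordered list of completed subtasks. The pair
    (p, s) is kept only if p precedes s in every trajectory containing
    both -- a crude stand-in for ILP-based subtask-graph inference.
    """
    subtasks = set().union(*trajectories) if trajectories else set()
    constraints = set()
    for p, s in permutations(subtasks, 2):
        co = [t for t in trajectories if p in t and s in t]
        if co and all(t.index(p) < t.index(s) for t in co):
            constraints.add((p, s))
    return constraints
```

For example, if "wood" and "house" always occur in that order across trajectories while "wood" and "stone" occur in both orders, only the former survives as a constraint; more trajectories prune more spurious orderings, which matches the abstract's claim that a small amount of interaction suffices for accurate inference.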
The third contribution extends the second by transfer-learning a prior over task structures from training tasks to unseen test tasks for faster adaptation. Although the meta-policy learned a general exploration strategy over the distribution of tasks, in the previous part the task structure was inferred from scratch independently for each task. We overcome this limitation by modeling a prior over tasks from the subtask graphs inferred via ILP, and transferring the learned prior when inferring the structure of novel test tasks. To this end, we propose a novel prior-sampling and posterior-update method that incorporates the knowledge learned from the seen task most relevant to the current task.
The last part investigates a more indirect form of inductive bias, implemented as a constraint on the trajectories rolled out by the policy in an MDP.
We present a theoretical result proving that the proposed constraint preserves optimality while reducing the policy search space.
Empirically, the proposed method improves the sample efficiency of policy gradient methods on a wide range of challenging sparse-reward tasks.
Overall, this work formalizes the hierarchical structure of compositional tasks and provides evidence that such structure exists in many important problems.
In addition, we present diverse, principled approaches to exploiting the inductive bias of hierarchical structure in MDPs under different problem settings and assumptions, and demonstrate the usefulness of this inductive bias for tackling compositional tasks.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/169951/1/srsohn_1.pd