An important goal in artificial intelligence is to create agents that can
both interact naturally with humans and learn from their feedback. Here we
demonstrate how to use reinforcement learning from human feedback (RLHF) to
improve simulated, embodied agents trained to a base level of competency
with imitation learning. First, we collected data from humans interacting with
agents in a simulated 3D world. We then asked annotators to record moments
where they believed that agents either progressed toward or regressed from
their human-instructed goal. Using this annotation data, we leveraged a novel
method, which we call "Inter-temporal Bradley-Terry" (IBT) modelling, to build
a reward model that captures human judgments (sketched below). Agents trained
to optimise rewards delivered by IBT reward models improved on all of our
metrics, including subsequent human judgment during live interactions with
agents. Altogether, our results demonstrate how one can successfully leverage
human judgments to improve agent behaviour, allowing us to use reinforcement
learning in complex, embodied domains without programmatic reward functions.
Videos of agent behaviour may be found at https://youtu.be/v_Z9F2_eKk4.
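
The exact form of the IBT objective is not stated in this abstract; the
snippet below is a minimal, illustrative sketch of one plausible reading, in
which the reward model's summed per-timestep prediction between two annotated
moments of an episode serves as a Bradley-Terry score, trained with a
logistic loss against the human progress/regress labels. The function name
ibt_loss and this particular formulation are assumptions for illustration,
not the paper's definition.

    import numpy as np

    def ibt_loss(rewards, t1, t2, label):
        """Bradley-Terry-style loss comparing two timesteps of one episode.

        rewards : per-timestep outputs r(s_t) of the reward model (1-D array)
        t1, t2  : pair of timesteps being compared, with t1 < t2
        label   : 1.0 if annotators marked progress between t1 and t2,
                  0.0 if they marked regression

        Hypothetical reading of IBT: the reward accumulated between the two
        moments plays the role of a Bradley-Terry score difference.
        """
        score = np.sum(rewards[t1:t2])             # accumulated predicted reward
        p_progress = 1.0 / (1.0 + np.exp(-score))  # sigmoid(score) = P(progress)
        # binary cross-entropy against the human annotation
        return -(label * np.log(p_progress)
                 + (1.0 - label) * np.log(1.0 - p_progress))

    # Example: a model assigning positive reward over a span that humans
    # marked as progress incurs a low loss; flipping the label raises it.
    r = np.array([0.1, 0.4, 0.3, -0.2, 0.5])
    print(ibt_loss(r, t1=0, t2=3, label=1.0))  # small loss
    print(ibt_loss(r, t1=0, t2=3, label=0.0))  # large loss

Under this reading, minimising the loss pushes the reward model to emit
positive reward over spans humans judged as progress and negative reward over
spans judged as regression, yielding a dense per-timestep signal suitable for
reinforcement learning.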