From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

Authors: Jonathan Berant, James Cohan, Hexiang Hu, Mandar Joshi, Urvashi Khandelwal, Kenton Lee, Panupong Pasupat, Peter Shaw, Kristina Toutanova
Publication date: 31 May 2023

Abstract

Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use: pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2306.00245

Last time updated on 04/06/2023