With advances in generative AI, there is now potential for autonomous agents
to manage daily tasks via natural language commands. However, current agents
are primarily created and tested in simplified synthetic environments, leading
to a disconnect with real-world scenarios. In this paper, we build an
environment for language-guided agents that is highly realistic and
reproducible. Specifically, we focus on agents that perform tasks on the web,
and create an environment with fully functional websites from four common
domains: e-commerce, social forum discussions, collaborative software
development, and content management. Our environment is enriched with tools
(e.g., a map) and external knowledge bases (e.g., user manuals) to encourage
human-like task-solving. Building upon our environment, we release a set of
benchmark tasks focusing on evaluating the functional correctness of task
completions. The tasks in our benchmark are diverse, long-horizon, and designed
to emulate tasks that humans routinely perform on the internet. We experiment
with several baseline agents, integrating recent techniques such as reasoning
before acting. The results demonstrate that solving complex tasks is
challenging: our best GPT-4-based agent only achieves an end-to-end task
success rate of 14.41%, significantly lower than the human performance of
78.24%. These results highlight the need for further development of robust
agents, that current state-of-the-art large language models are far from
perfect performance in these real-life tasks, and that WebArena can be used to
measure such progress.Comment: Our code, data, environment reproduction resources, and video
demonstrations are publicly available at https://webarena.dev