Evaluating Human-Language Model Interaction

Author: Bernstein Michael
Bommasani Rishi
Cao Hancheng
Durmus Esin
Gerard-Ursin Ines
Hardy Amelia
Kwon Minae
Ladhak Faisal
Lee Mina
Lee Tony
Li Xiang Lisa
Liang Percy
Paranjape Ashwin
Park Joon Sung
Rong Frieda
Srivastava Megha
Thickstun John
Wang Rose E.
Publication date: 10/09/2023

Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.

Comment: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI

arXiv.org e-Print Archive