3 research outputs found
Evaluating Human-Language Model Interaction
Many real-world applications of language models (LMs), such as writing
assistance and code autocomplete, involve human-LM interaction. However, most
benchmarks are non-interactive in that a model produces output without human
involvement. To evaluate human-LM interaction, we develop a new framework,
Human-AI Language-based Interaction Evaluation (HALIE), that defines the
components of interactive systems and dimensions to consider when designing
evaluation metrics. Compared to standard, non-interactive evaluation, HALIE
captures (i) the interactive process, not only the final output; (ii) the
first-person subjective experience, not just a third-party assessment; and
(iii) notions of preference beyond quality (e.g., enjoyment and ownership). We
then design five tasks to cover different forms of interaction: social
dialogue, question answering, crossword puzzles, summarization, and metaphor
generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3
and AI21 Labs' Jurassic-1), we find that better non-interactive performance
does not always translate to better human-LM interaction. In particular, we
highlight three cases where the results from non-interactive and interactive
metrics diverge and underscore the importance of human-LM interaction for LM
evaluation.Comment: Authored by the Center for Research on Foundation Models (CRFM) at
the Stanford Institute for Human-Centered Artificial Intelligence (HAI