Despite tremendous advances in dialogue systems, stable evaluation still relies on human judgments, which yield notoriously high-variance metrics due to their inherent subjectivity. Moreover, methods and labels in dialogue
evaluation are not fully standardized, especially for open-domain chats, and little work has compared or assessed the validity of these approaches. Inconsistent evaluation can misrepresent a dialogue system's performance, which becomes a major hurdle to improving it. Thus, a dimensional evaluation of
chat-oriented open-domain dialogue systems that reliably measures several
aspects of dialogue capabilities is desired. This paper presents a novel human
evaluation method to estimate the rates of many dialogue system behaviors. We use our method to evaluate four state-of-the-art open-domain dialogue systems and compare it with existing approaches. The analysis demonstrates that our
behavior method is more suitable than alternative Likert-style or comparative
approaches for dimensional evaluation of these systems.

Comment: Accepted to ACL 2023; first two authors contributed equally.