Developing high-performing dialogue systems benefits from the automatic
identification of undesirable behaviors in system responses. However, detecting
such behaviors remains challenging, as it draws on a breadth of general
knowledge and understanding of conversational practices. Although recent
research has focused on building specialized classifiers for detecting specific
dialogue behaviors, the behavior coverage is still incomplete and there is a
lack of testing on real-world human-bot interactions. This paper investigates
the ability of a state-of-the-art large language model (LLM), ChatGPT-3.5, to
perform dialogue behavior detection for nine categories in real human-bot
dialogues. We aim to assess whether ChatGPT can match specialized models and
approximate human performance, thereby reducing the cost of behavior detection
tasks. Our findings reveal that neither specialized models nor ChatGPT have yet
achieved satisfactory results for this task, falling short of human
performance. Nevertheless, ChatGPT shows promising potential and often
outperforms specialized detection models. We conclude with an in-depth
examination of the prevalent shortcomings of ChatGPT, offering guidance for
future research to enhance LLM capabilities.Comment: Accepted to SIGDIAL 202