Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems
User engagement is a critical metric for evaluating the quality of
open-domain dialogue systems. Prior work has focused on conversation-level
engagement by using heuristically constructed features such as the number of
turns and the total time of the conversation. In this paper, we investigate the
possibility and efficacy of estimating utterance-level engagement and define a
novel metric, predictive engagement, for automatic evaluation of
open-domain dialogue systems. Our experiments demonstrate that (1) human
annotators have high agreement on assessing utterance-level engagement scores;
(2) conversation-level engagement scores can be predicted from properly
aggregated utterance-level engagement scores. Furthermore, we show that the
utterance-level engagement scores can be learned from data. These scores can
improve automatic evaluation metrics for open-domain dialogue systems, as shown
by correlation with human judgements. This suggests that predictive engagement
can be used as real-time feedback for training better dialogue models.
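As a rough illustration of the aggregation idea described above, the following sketch pools utterance-level engagement scores into a conversation-level score with a simple mean; the score_utterance predictor and the mean pooling are assumptions for illustration, not the paper's exact aggregation.

```python
from statistics import mean
from typing import Callable, List

def conversation_engagement(
    utterances: List[str],
    score_utterance: Callable[[str], float],
) -> float:
    """Pool utterance-level engagement scores into one conversation-level
    score; a plain mean stands in for 'properly aggregated' scores."""
    scores = [score_utterance(u) for u in utterances]
    return mean(scores) if scores else 0.0

# Toy length-based scorer standing in for a learned engagement model.
toy_scorer = lambda u: min(len(u.split()) / 20.0, 1.0)
print(conversation_engagement(["Hi!", "Tell me more about your trip."], toy_scorer))
```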
Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4
The advent and fast development of neural networks have revolutionized the
research on dialogue systems and subsequently have triggered various challenges
regarding their automatic evaluation. Automatic evaluation of open-domain
dialogue systems remains an open challenge and has attracted the attention of
many researchers. Despite consistent efforts to improve automatic metrics'
correlation with human evaluation, there have been very few attempts to assess
their robustness across multiple domains and dimensions. Moreover, existing
metrics focus mainly on English. All of these challenges prompt the development
of automatic evaluation metrics that are reliable across multiple domains,
dimensions, and languages. This track in the 11th Dialogue System Technology
Challenge (DSTC11) is part of the ongoing effort to promote robust and
multilingual automatic evaluation metrics. This article describes the datasets
and baselines provided to participants and discusses the submission and result
details of the two proposed subtasks.
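For context on how such tracks typically report metric quality, the sketch below computes the Pearson and Spearman correlations between an automatic metric's scores and human ratings; the numbers are toy values, not track data.

```python
from scipy.stats import pearsonr, spearmanr

# Toy per-response scores: an automatic metric vs. human quality ratings.
metric_scores = [0.82, 0.41, 0.67, 0.15, 0.90]
human_ratings = [4.5, 2.0, 3.5, 1.0, 5.0]

pearson_r, _ = pearsonr(metric_scores, human_ratings)
spearman_rho, _ = spearmanr(metric_scores, human_ratings)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```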
ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems
Commonsense reasoning is omnipresent in human communications and thus is an
important feature for open-domain dialogue systems. However, evaluating
commonsense in dialogue systems is still an open challenge. We take the first
step by focusing on event commonsense that considers events and their
relations, and is crucial in both dialogues and general commonsense reasoning.
We propose ACCENT, an event commonsense evaluation metric empowered by
commonsense knowledge bases (CSKBs). ACCENT first extracts event-relation
tuples from a dialogue, and then evaluates the response by scoring the tuples
in terms of their compatibility with the CSKB. To evaluate ACCENT, we construct
the first public event commonsense evaluation dataset for open-domain
dialogues. Our experiments show that ACCENT is an efficient metric for event
commonsense evaluation, which achieves higher correlations with human judgments
than existing baselines.
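A schematic sketch of the two-step pipeline described above; the tuple extractor and the CSKB compatibility scorer are placeholders (ACCENT itself uses learned components and a concrete commonsense knowledge base), so this only conveys the control flow.

```python
from typing import List, Tuple

EventRelation = Tuple[str, str, str]  # (head event, relation, tail event)

def extract_event_relation_tuples(history: str, response: str) -> List[EventRelation]:
    """Placeholder for the tuple-extraction step; ACCENT uses a learned
    extractor, while here a fixed illustrative tuple is returned."""
    return [("PersonX misses the bus", "xReact", "frustrated")]

def cskb_compatibility(tpl: EventRelation) -> float:
    """Placeholder compatibility score against a commonsense knowledge
    base (CSKB); a real scorer would query or embed the CSKB."""
    return 0.8  # assumed value for illustration

def accent_style_score(history: str, response: str) -> float:
    tuples = extract_event_relation_tuples(history, response)
    if not tuples:
        return 0.0
    return sum(cskb_compatibility(t) for t in tuples) / len(tuples)

print(accent_style_score("A: I missed the bus again.", "B: You must feel frustrated!"))
```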
Modeling Dialogues with Hashcode Representations: A Nonparametric Approach
We propose a novel dialogue modeling framework, the first nonparametric, kernel-function-based approach to dialogue modeling, which learns hashcodes as text representations; unlike traditional deep learning models, it handles relatively small datasets well while also scaling to large ones. We also derive a novel lower bound on mutual information, used as a model-selection criterion that favors representations with better alignment between the utterances of participants in a collaborative dialogue setting, as well as higher predictability of the generated responses. As demonstrated on three real-life datasets, most prominently psychotherapy sessions, the proposed approach significantly outperforms several state-of-the-art neural-network-based dialogue systems, both in computational efficiency, reducing training time from days or weeks to hours, and in response quality, being chosen as the best model by human evaluators an order of magnitude more often than its competitors.
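As a loose illustration of hashcode-style text representations, the sketch below binarizes random projections of bag-of-words vectors (an LSH-flavored stand-in); the paper's actual construction is kernelized and nonparametric, so treat this only as an intuition aid.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def hashcodes(texts, n_bits=16, seed=0):
    """Map texts to binary hashcodes by thresholding random projections of
    their bag-of-words vectors (a simple LSH-style stand-in, not the
    paper's kernel-based construction)."""
    X = CountVectorizer().fit_transform(texts).toarray().astype(float)
    rng = np.random.default_rng(seed)
    projections = rng.standard_normal((X.shape[1], n_bits))
    return (X @ projections > 0).astype(int)

codes = hashcodes(["how are you today", "i am doing fine thanks"])
print(codes)  # one n_bits-length binary code per utterance
```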
PARSINLU: A Suite of Language Understanding Challenges for Persian
Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains concentrated on resource-rich languages like English. This work focuses on Persian, a widely spoken language for which few NLU datasets are available. The availability of high-quality evaluation datasets is a necessity for reliably assessing progress on different NLU tasks and domains. We introduce PARSINLU, the first benchmark for the Persian language that includes a range of language understanding tasks: reading comprehension, textual entailment, and so on. These datasets are collected in a multitude of ways, often involving manual annotation by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. Additionally, we present the first results of state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope PARSINLU fosters further research and advances in Persian language understanding.