2 research outputs found
Which Kind Is Better in Open-domain Multi-turn Dialog, Hierarchical or Non-hierarchical Models? An Empirical Study
Currently, open-domain generative dialog systems have attracted considerable
attention in academia and industry. Despite the success of single-turn dialog
generation, multi-turn dialog generation is still a big challenge. So far,
there are two kinds of models for open-domain multi-turn dialog generation:
hierarchical and non-hierarchical models. Recently, some works have shown that
the hierarchical models are better than non-hierarchical models under their
experimental settings; meanwhile, some works also demonstrate the opposite
conclusion. Because adequate comparisons are lacking, it is not clear which kind
of model is better in open-domain multi-turn dialog generation. Thus, in this
paper, we systematically evaluate nearly all representative hierarchical
and non-hierarchical models under the same experimental settings to check which
kind is better. Through extensive experiments, we draw the following three
important conclusions: (1) Nearly all hierarchical models are worse than
non-hierarchical models in open-domain multi-turn dialog generation, except for
the HRAN model. Further analysis shows that the excellent performance of HRAN
mainly depends on its word-level attention mechanism; (2) The performance of
other hierarchical models also improves greatly when the word-level attention
mechanism is integrated into them. The modified hierarchical
models even significantly outperform the non-hierarchical models; (3) The
word-level attention mechanism is so powerful for hierarchical
models because it leverages context information more effectively,
especially fine-grained information. Besides, we have implemented all of
the models and have already released the code
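As an illustration of what attending at the word level means, the following is a minimal sketch of generic dot-product attention over the individual word vectors of a dialog context. It is not the HRAN formulation or the authors' released code; the function name `word_level_attention`, the toy vectors, and the choice of dot-product scoring are assumptions made for this example only.

```python
import math


def word_level_attention(query, word_vectors):
    """Attend over individual context words (hypothetical helper).

    query        : list of floats, e.g. a decoder state (assumed here)
    word_vectors : list of word representations, each the same length as query
    Returns (weights, context): one weight per word, and the weighted sum.
    """
    # Score each word by its dot product with the query.
    scores = [sum(q * w for q, w in zip(query, vec)) for vec in word_vectors]

    # Numerically stable softmax over the word scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]

    # Context vector: attention-weighted sum of the word vectors.
    dim = len(query)
    context = [
        sum(wt * vec[i] for wt, vec in zip(weights, word_vectors))
        for i in range(dim)
    ]
    return weights, context


# Toy usage: two context words, the first aligned with the query.
weights, context = word_level_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because every word receives its own weight, fine-grained information in the context can influence the result directly instead of being compressed into a single utterance-level vector first.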
Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems
Many automatic evaluation metrics have been proposed to score the overall
quality of a response in open-domain dialogue. Generally, the overall quality
comprises various aspects, such as relevancy, specificity, and empathy,
and the importance of each aspect differs according to the task. For instance,
specificity is mandatory in a food-ordering dialogue task, whereas fluency is
preferred in a language-teaching dialogue system. However, existing metrics are
not designed to cope with such flexibility. For example, the BLEU score
relies only on word overlap, whereas BERTScore relies on
semantic similarity between the reference and the candidate response. Thus, they are
not guaranteed to capture the required aspects, e.g., specificity. To design a
metric that is flexible to a task, we first propose making these qualities
manageable by grouping them into three groups: understandability, sensibleness,
and likability, where likability is a combination of qualities that are
essential for a task. We also propose a simple method to combine the metrics of
each aspect into a single metric called USL-H, which stands for
Understandability, Sensibleness, and Likability in Hierarchy. We demonstrate
that the USL-H score achieves good correlations with human judgment and maintains
its configurability towards different aspects and metrics.
Comment: 15 pages, 4 figures, 7 tables, Accepted to COLING 202
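One way to picture a hierarchical combination of the three groups is sketched below. The abstract does not give the formula, so this is only an assumed illustration: the function name `hierarchical_score`, the gating scheme (sensibleness counts only to the extent the response is understandable, likability only to the extent it is also sensible), and the equal default weights are all hypothetical.

```python
def hierarchical_score(understandability, sensibleness, likability,
                       weights=(1.0, 1.0, 1.0)):
    """Combine three sub-scores in [0, 1] so that each level gates the next.

    This is an assumed scheme for illustration, not the published USL-H
    definition. The result is normalized to stay in [0, 1].
    """
    for name, value in (("understandability", understandability),
                        ("sensibleness", sensibleness),
                        ("likability", likability)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    w_u, w_s, w_l = weights
    total = w_u + w_s + w_l
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    u = understandability
    us = u * sensibleness    # sensibleness gated by understandability
    usl = us * likability    # likability gated by both lower levels
    return (w_u * u + w_s * us + w_l * usl) / total


# Toy usage: an understandable but not sensible response earns credit
# only for the first level, however likable it might otherwise be.
score = hierarchical_score(1.0, 0.0, 1.0)
```

Swapping in a different likability sub-metric (for example specificity or empathy) or changing the weights is how such a scheme would stay configurable per task.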