RankME: Reliable Human Ratings for Natural Language Generation
Human evaluation for natural language generation (NLG) often suffers from
inconsistent user ratings. While previous research tends to attribute this
problem to individual user preferences, we show that the quality of human
judgements can also be improved by experimental design. We present a novel
rank-based magnitude estimation method (RankME), which combines the use of
continuous scales and relative assessments. We show that RankME significantly
improves the reliability and consistency of human ratings compared to
traditional evaluation methods. In addition, we show that it is possible to
evaluate NLG systems according to multiple, distinct criteria, which is
important for error analysis. Finally, we demonstrate that RankME, in
combination with Bayesian estimation of system quality, is a cost-effective
alternative for ranking multiple NLG systems.
Comment: Accepted to NAACL 2018 (The 2018 Conference of the North American
Chapter of the Association for Computational Linguistics).
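The key idea of combining continuous scales with relative assessment can be illustrated with a small sketch (the data, system names, and helper function are hypothetical illustrations, not the authors' released code): each rater scores all systems side by side on a personal, unbounded magnitude scale, and per-rater scores are converted to ranks before aggregation, which removes rater-specific scale differences.

```python
# Minimal sketch of rank-based aggregation of magnitude-estimation
# ratings. Data and names are hypothetical, not the RankME release.

def to_ranks(scores):
    """Convert one rater's magnitude scores to ranks (1 = best)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {system: rank for rank, system in enumerate(order, start=1)}

# Three raters score the same outputs on personal, unbounded scales.
ratings = [
    {"sys_a": 90, "sys_b": 60, "sys_c": 30},   # rater 1 uses 0-100
    {"sys_a": 9,  "sys_b": 5,  "sys_c": 2},    # rater 2 uses 0-10
    {"sys_a": 70, "sys_b": 40, "sys_c": 10},   # rater 3, yet another scale
]

# Ranking each rater's scores makes the judgements directly comparable.
rank_sum = {}
for rater in ratings:
    for system, rank in to_ranks(rater).items():
        rank_sum[system] = rank_sum.get(system, 0) + rank

mean_rank = {s: total / len(ratings) for s, total in rank_sum.items()}
print(mean_rank)  # sys_a ranks best despite the differing raw scales
```

The raw scores disagree wildly in magnitude, but the induced rankings agree perfectly; this is the consistency gain that relative assessment buys over absolute rating scales.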
Design and implementation of a user-oriented speech recognition interface: the synergy of technology and human factors
The design and implementation of a user-oriented speech recognition interface are described. The interface enables the use of speech recognition in so-called interactive voice response systems, which can be accessed via a telephone connection. In the design of the interface a synergy of technology and human factors is achieved. This synergy is very important for making speech interfaces a natural and acceptable form of human-machine interaction. Important concepts such as interfaces, human factors and speech recognition are discussed. Additionally, an indication is given as to how the synergy of human factors and technology can be realised, by means of a sketch of the interface's implementation. An explanation is also provided of how the interface might be fruitfully integrated into different applications.
Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation
Automated metrics such as BLEU are widely used in the machine translation
literature. They have also been used recently in the dialogue community for
evaluating dialogue response generation. However, previous work in dialogue
response generation has shown that these metrics do not correlate strongly with
human judgment in the non task-oriented dialogue setting. Task-oriented
dialogue responses are expressed on narrower domains and exhibit lower
diversity. It is thus reasonable to think that these automated metrics would
correlate well with human judgment in the task-oriented setting where the
generation task consists of translating dialogue acts into a sentence. We
conduct an empirical study to confirm whether this is the case. Our findings
indicate that these automated metrics have stronger correlation with human
judgments in the task-oriented setting compared to what has been observed in
the non task-oriented setting. We also observe that these metrics correlate
even better for datasets which provide multiple ground truth reference
sentences. In addition, we show that some of the currently available corpora
for task-oriented language generation can be solved with simple models and
advocate for more challenging datasets.
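The n-gram overlap idea behind metrics like BLEU can be sketched minimally (the sentences below are hypothetical examples; full BLEU additionally applies a brevity penalty and a smoothed geometric mean over n = 1..4). The example also shows why multiple ground-truth references help: an n-gram missing from one reference may be matched in another.

```python
# Minimal sketch of modified n-gram precision, the core of BLEU-style
# metrics. Real BLEU adds a brevity penalty and combines n = 1..4 with
# a geometric mean; sentences here are hypothetical examples.
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clip each candidate n-gram count by its max count in any reference."""
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

# Multiple references give the candidate more legitimate ways to match.
refs = [
    "there are two restaurants in the city centre".split(),
    "the city centre has two restaurants".split(),
]
candidate = "there are two restaurants in the centre".split()

print(modified_precision(candidate, refs, 1))  # unigram precision
print(modified_precision(candidate, refs, 2))  # bigram precision
```

Clipping prevents a degenerate candidate from inflating its score by repeating a matching n-gram more often than any reference contains it.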