A Survey of Evaluation Metrics Used for NLG Systems
The success of Deep Learning has created a surge of interest in a wide range
of Natural Language Generation (NLG) tasks. Deep Learning has not only pushed
the state of the art in several existing NLG tasks but has also enabled
researchers to explore newer tasks such as image captioning. Such rapid
progress has necessitated the development of accurate automatic evaluation
metrics that allow us to track progress in the field. However, unlike
classification tasks, automatically evaluating NLG systems is itself a major
challenge. Several works have shown that early heuristic-based metrics such as
BLEU and ROUGE are inadequate for capturing the nuances of the different NLG
tasks. The expanding number of NLG models and the shortcomings of current
metrics have led to a rapid surge in
the number of evaluation metrics proposed since 2014. Moreover, evaluation
metrics have shifted from pre-determined heuristic-based formulae to trained
transformer models. Such rapid change over a short period has created the need
for a survey of existing NLG metrics, to help both established and new
researchers come up to speed quickly with the developments in NLG evaluation
over the last few years. Through this
survey, we first highlight the challenges involved in
automatically evaluating NLG systems. Then, we provide a coherent taxonomy of
the evaluation metrics to organize the existing metrics and to better
understand the developments in the field. We also describe the different
metrics in detail and highlight their key contributions. Next, we discuss the
main shortcomings identified in the existing metrics and describe the
methodology used to evaluate the evaluation metrics themselves. Finally, we
offer our recommendations on the next steps for improving automatic evaluation
metrics.

Comment: A condensed version of this paper is submitted to ACM CSU
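
To make the contrast between heuristic and learned metrics concrete, the
following is a minimal, self-contained Python sketch (illustrative only, not
code from the survey) of the clipped n-gram precision formula at the core of
BLEU-style metrics:

    # Illustrative sketch of a BLEU-style heuristic metric, not the
    # survey's code: clipped n-gram precision plus a brevity penalty.
    from collections import Counter
    import math

    def ngrams(tokens, n):
        """Return the list of n-grams in a token sequence."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def modified_precision(reference, hypothesis, n):
        """Clipped n-gram precision: each hypothesis n-gram is credited
        at most as many times as it occurs in the reference."""
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        return clipped / total if total else 0.0

    def bleu(reference, hypothesis, max_n=4):
        """Geometric mean of 1..max_n precisions times a brevity penalty."""
        precisions = [modified_precision(reference, hypothesis, n)
                      for n in range(1, max_n + 1)]
        if min(precisions) == 0.0:
            return 0.0
        brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
        return brevity_penalty * math.exp(
            sum(math.log(p) for p in precisions) / max_n)

    ref = "the cat sat on the mat".split()
    hyp = "a cat sat on the mat".split()
    print(f"BLEU = {bleu(ref, hyp):.3f}")  # ~0.760: high overlap, high score

Because such a formula rewards only surface n-gram overlap, a faithful
paraphrase that shares few n-grams with the reference is penalized heavily,
which is one of the inadequacies noted above; learned metrics (e.g.,
BERTScore) instead compare candidate and reference through the contextual
representations of a trained transformer.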