
Comparing automatic and human evaluation of NLG systems

By Anja Belz and Ehud Reiter


We consider the evaluation problem in Natural Language Generation (NLG) and present results for evaluating several NLG systems with similar functionality, including a knowledge-based generator and several statistical systems. We compare evaluation results for these systems by human domain experts, human non-experts, and several automatic evaluation metrics, including NIST, BLEU, and ROUGE. We find that NIST scores correlate best (>0.8) with human judgments, but that all automatic metrics we examined are biased in favour of generators that select on the basis of frequency alone. We conclude that automatic evaluation of NLG systems has considerable potential, in particular where high-quality reference texts and only a small number of human evaluators are available. However, in general it is probably best for automatic evaluations to be supported by human-based evaluations, or at least by studies that demonstrate that a particular metric correlates well with human judgments in a given domain.
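
The comparison summarised above rests on correlating automatic metric scores with human judgments across systems. The following is a minimal sketch of one such check, assuming Pearson's correlation coefficient is an acceptable measure; the abstract does not say which coefficient the authors used. The function name `pearson` and all score values are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-system scores (illustrative only, NOT results from the paper):
# one automatic metric score and one mean human rating per NLG system.
metric_scores = [0.42, 0.55, 0.61, 0.70, 0.48]
human_ratings = [3.1, 3.8, 4.0, 4.6, 3.3]

r = pearson(metric_scores, human_ratings)
print(f"Pearson r between metric and human ratings: {r:.3f}")
```

A high coefficient on a handful of systems says only that the metric ranks those systems roughly as the human judges do. It does not rule out a systematic bias of the kind the abstract reports for frequency-based generators.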

Topics: G400 Computing, Q100 Linguistics
Publisher: Association for Computational Linguistics
Year: 2006

