3 research outputs found
Unifying Human and Statistical Evaluation for Natural Language Generation
How can we measure whether a natural language generation system produces both
high quality and diverse outputs? Human evaluation captures quality but not
diversity, as it does not catch models that simply plagiarize from the training
set. On the other hand, statistical evaluation (i.e., perplexity) captures
diversity but not quality, as models that occasionally emit low-quality samples
would be insufficiently penalized. In this paper, we propose a unified
framework which evaluates both diversity and quality, based on the optimal
error rate of predicting whether a sentence is human- or machine-generated. We
demonstrate that this error rate can be efficiently estimated by combining
human and statistical evaluation, using an evaluation metric which we call
HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects
diversity defects which fool pure human evaluation and that (ii) techniques
such as annealing for improving quality actually decrease HUSE due to decreased
diversity.
Comment: NAACL Camera Ready Submission.
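The framework above reduces to estimating the optimal error of a human-vs-machine discriminator from a small feature set per sentence. A minimal sketch of that two-sample test, assuming two features per sentence (a human judgment score and a length-normalized model log-probability) and a leave-one-out k-nearest-neighbour classifier, might look like the following; the function name, the toy data, and the choice k=15 are illustrative assumptions, not necessarily the paper's exact recipe:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def huse_estimate(human_feats, model_feats, k=15):
    """HUSE-style score: twice the leave-one-out k-NN error when
    discriminating human from machine sentences in feature space.
    1.0 means indistinguishable (ideal); 0.0 means trivially separable."""
    X = np.vstack([human_feats, model_feats])
    y = np.array([0] * len(human_feats) + [1] * len(model_feats))
    errors = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i  # hold out sample i
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[keep], y[keep])
        errors += int(clf.predict(X[i:i + 1])[0] != y[i])
    return 2.0 * errors / len(X)

# Toy features: column 0 = human judgment, column 1 = log p(x) / length.
rng = np.random.default_rng(0)
human = rng.normal([0.8, -2.0], 0.1, size=(200, 2))
model = rng.normal([0.7, -1.5], 0.1, size=(200, 2))
print(huse_estimate(human, model))  # near 0 here: the samples separate easily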
Controlled Text Generation for Data Augmentation in Intelligent Artificial Agents
Data availability is a bottleneck during early stages of development of new
capabilities for intelligent artificial agents. We investigate the use of text
generation techniques to augment the training data of a popular commercial
artificial agent across categories of functionality, with the goal of faster
development of new functionality. We explore a variety of encoder-decoder
generative models for synthetic training data generation and propose using
conditional variational auto-encoders. Our approach requires only direct
optimization, works well with limited data, and significantly outperforms
previous controlled text generation techniques. Further, the generated data are
used as additional training samples in an extrinsic intent classification task,
leading to improved performance by up to 5% absolute F-score in low-resource
cases, validating the usefulness of our approach.
Comment: EMNLP WNGT workshop.
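As a concrete illustration of the proposed generator, here is a minimal PyTorch sketch of a conditional variational auto-encoder over token sequences, where both the latent code and the decoder are conditioned on an intent label; the layer sizes, names, and the GRU encoder/decoder choice are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Minimal conditional VAE: an encoder-decoder pair whose latent code
    is conditioned on an intent label, so sampling z ~ N(0, I) for a
    chosen label yields label-specific synthetic utterances."""
    def __init__(self, vocab_size, n_labels, emb=128, hid=256, z_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.label_embed = nn.Embedding(n_labels, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.to_mu = nn.Linear(hid + emb, z_dim)
        self.to_logvar = nn.Linear(hid + emb, z_dim)
        self.z_to_h = nn.Linear(z_dim + emb, hid)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, tokens, labels):
        x = self.embed(tokens)                       # (B, T, emb)
        c = self.label_embed(labels)                 # (B, emb)
        _, h = self.encoder(x)                       # h: (1, B, hid)
        hc = torch.cat([h[-1], c], dim=-1)
        mu, logvar = self.to_mu(hc), self.to_logvar(hc)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        h0 = torch.tanh(self.z_to_h(torch.cat([z, c], dim=-1))).unsqueeze(0)
        dec, _ = self.decoder(x, h0)                 # teacher forcing
        logits = self.out(dec)                       # (B, T, vocab)
        # KL(q(z|x, c) || N(0, I)), the regularizer in the ELBO
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return logits, kl
```

Training would add a token-level cross-entropy term (logits shifted against the next token) to a weighted kl term; to synthesize data for a category, one would sample z from the prior, fix the label, and decode autoregressively (the decoding loop is omitted here for brevity).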
Diversity-aware Evaluation for Paraphrase Patterns
Common evaluation metrics for paraphrase patterns do not necessarily correlate with extrinsic recognition task performance. We propose a metric that gives weight to lexical variety in paraphrase patterns; the proposed metric correlates positively with paraphrase recognition task performance, with Pearson correlations of 0.5 to 0.7 (k=10, under the “strict” judgment), significant at p < 0.01.
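The abstract does not spell out the weighting scheme, so the following is only a toy sketch of the general idea: score each set of paraphrase patterns by its lexical variety (here a simple type/token ratio, a stand-in assumption for the paper's actual weighting) and check the Pearson correlation of those scores against extrinsic recognition performance:

```python
from itertools import chain
from statistics import mean

def lexical_variety_score(pattern_sets):
    """Toy diversity-aware score: each set of paraphrase patterns is
    scored by its type/token ratio, so lexically varied sets rank higher."""
    scores = []
    for patterns in pattern_sets:
        tokens = list(chain.from_iterable(p.split() for p in patterns))
        scores.append(len(set(tokens)) / max(len(tokens), 1))
    return scores

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

sets = [["book a flight", "reserve a plane ticket"],   # lexically varied
        ["book a flight", "book a flight now"]]        # repetitive
task_performance = [0.71, 0.58]  # hypothetical extrinsic scores
print(lexical_variety_score(sets))
print(pearson(lexical_variety_score(sets), task_performance))
```

In the paper's setting one would compute such scores over many pattern sets and report the correlation with recognition performance, rather than the two-point toy comparison shown here.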