3 research outputs found
Unifying Human and Statistical Evaluation for Natural Language Generation
How can we measure whether a natural language generation system produces both
high quality and diverse outputs? Human evaluation captures quality but not
diversity, as it does not catch models that simply plagiarize from the training
set. On the other hand, statistical evaluation (i.e., perplexity) captures
diversity but not quality, as models that occasionally emit low-quality samples
would be insufficiently penalized. In this paper, we propose a unified
framework which evaluates both diversity and quality, based on the optimal
error rate of predicting whether a sentence is human- or machine-generated. We
demonstrate that this error rate can be efficiently estimated by combining
human and statistical evaluation, using an evaluation metric which we call
HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects
diversity defects which fool pure human evaluation and that (ii) techniques
such as annealing for improving quality actually decrease HUSE due to decreased
diversity.
Comment: NAACL Camera Ready Submission.
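The framework above reduces to estimating the optimal error of a human-vs-machine discriminator from a small feature set per sentence. A minimal sketch of that two-sample test, assuming two features per sentence (a human judgment score and a length-normalized model log-probability) and a leave-one-out k-nearest-neighbour classifier, might look like the following; the function name, the toy data, and the choice k=15 are illustrative assumptions, not necessarily the paper's exact recipe:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def huse_estimate(human_feats, model_feats, k=15):
    """HUSE-style score: twice the leave-one-out k-NN error when
    discriminating human from machine sentences in feature space.
    1.0 means indistinguishable (ideal); 0.0 means trivially separable."""
    X = np.vstack([human_feats, model_feats])
    y = np.array([0] * len(human_feats) + [1] * len(model_feats))
    errors = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i  # hold out sample i
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[keep], y[keep])
        errors += int(clf.predict(X[i:i + 1])[0] != y[i])
    return 2.0 * errors / len(X)

# Toy features: column 0 = human judgment, column 1 = log p(x) / length.
rng = np.random.default_rng(0)
human = rng.normal([0.8, -2.0], 0.1, size=(200, 2))
model = rng.normal([0.7, -1.5], 0.1, size=(200, 2))
print(huse_estimate(human, model))  # near 0 here: the samples separate easily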
Controlled Text Generation for Data Augmentation in Intelligent Artificial Agents
Data availability is a bottleneck during early stages of development of new
capabilities for intelligent artificial agents. We investigate the use of text
generation techniques to augment the training data of a popular commercial
artificial agent across categories of functionality, with the goal of faster
development of new functionality. We explore a variety of encoder-decoder
generative models for synthetic training data generation and propose using
conditional variational auto-encoders. Our approach requires only direct
optimization, works well with limited data, and significantly outperforms
previous controlled text generation techniques. Further, the generated data are
used as additional training samples in an extrinsic intent classification task,
leading to improved performance by up to 5% absolute F-score in low-resource
cases, validating the usefulness of our approach.
Comment: EMNLP WNGT workshop.
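As a concrete illustration of the proposed generator, here is a minimal PyTorch sketch of a conditional variational auto-encoder over token sequences, where both the latent code and the decoder are conditioned on an intent label; the layer sizes, names, and the GRU encoder/decoder choice are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    """Minimal conditional VAE: an encoder-decoder pair whose latent code
    is conditioned on an intent label, so sampling z ~ N(0, I) for a
    chosen label yields label-specific synthetic utterances."""
    def __init__(self, vocab_size, n_labels, emb=128, hid=256, z_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.label_embed = nn.Embedding(n_labels, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.to_mu = nn.Linear(hid + emb, z_dim)
        self.to_logvar = nn.Linear(hid + emb, z_dim)
        self.z_to_h = nn.Linear(z_dim + emb, hid)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, tokens, labels):
        x = self.embed(tokens)                       # (B, T, emb)
        c = self.label_embed(labels)                 # (B, emb)
        _, h = self.encoder(x)                       # h: (1, B, hid)
        hc = torch.cat([h[-1], c], dim=-1)
        mu, logvar = self.to_mu(hc), self.to_logvar(hc)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        h0 = torch.tanh(self.z_to_h(torch.cat([z, c], dim=-1))).unsqueeze(0)
        dec, _ = self.decoder(x, h0)                 # teacher forcing
        logits = self.out(dec)                       # (B, T, vocab)
        # KL(q(z|x, c) || N(0, I)), the regularizer in the ELBO
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return logits, kl
```

Training would add a token-level cross-entropy term (logits shifted against the next token) to a weighted kl term; to synthesize data for a category, one would sample z from the prior, fix the label, and decode autoregressively (the decoding loop is omitted here for brevity).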
Diversity-aware Evaluation for Paraphrase Patterns
Common evaluation metrics for paraphrase patterns do not necessarily correlate with extrinsic recognition task performance. We propose a metric that gives weight to lexical variety in paraphrase patterns; the proposed metric correlates positively with paraphrase recognition task performance, with Pearson correlations of 0.5 to 0.7 (k=10, under the “strict” judgment), significant at p < 0.01.
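The abstract does not spell out the weighting scheme, so the following is only a toy sketch of the general idea: score each set of paraphrase patterns by its lexical variety (here a simple type/token ratio, a stand-in assumption for the paper's actual weighting) and check the Pearson correlation of those scores against extrinsic recognition performance:

```python
from itertools import chain
from statistics import mean

def lexical_variety_score(pattern_sets):
    """Toy diversity-aware score: each set of paraphrase patterns is
    scored by its type/token ratio, so lexically varied sets rank higher."""
    scores = []
    for patterns in pattern_sets:
        tokens = list(chain.from_iterable(p.split() for p in patterns))
        scores.append(len(set(tokens)) / max(len(tokens), 1))
    return scores

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

sets = [["book a flight", "reserve a plane ticket"],   # lexically varied
        ["book a flight", "book a flight now"]]        # repetitive
task_performance = [0.71, 0.58]  # hypothetical extrinsic scores
print(lexical_variety_score(sets))
print(pearson(lexical_variety_score(sets), task_performance))
```

In the paper's setting one would compute such scores over many pattern sets and report the correlation with recognition performance, rather than the two-point toy comparison shown here.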