3,367 research outputs found
Referenceless Quality Estimation for Natural Language Generation
Traditional automatic evaluation measures for natural language generation
(NLG) use costly human-authored references to estimate the quality of a system
output. In this paper, we propose a referenceless quality estimation (QE)
approach based on recurrent neural networks, which predicts a quality score for
a NLG system output by comparing it to the source meaning representation only.
Our method outperforms traditional metrics and a constant baseline in most
respects; we also show that synthetic data helps to increase correlation
results by 21% compared to the base system. Our results are comparable to
results obtained in similar QE tasks despite the more challenging setting.Comment: Accepted as a regular paper to 1st Workshop on Learning to Generate
Natural Language (LGNL), Sydney, 10 August 201
Why We Need New Evaluation Metrics for NLG
The majority of NLG evaluation relies on automatic metrics, such as BLEU . In
this paper, we motivate the need for novel, system- and data-independent
automatic evaluation methods: We investigate a wide range of metrics, including
state-of-the-art word-based and novel grammar-based ones, and demonstrate that
they only weakly reflect human judgements of system outputs as generated by
data-driven, end-to-end NLG. We also show that metric performance is data- and
system-specific. Nevertheless, our results also suggest that automatic metrics
perform reliably at system-level and can support system development by finding
cases where a system performs poorly.Comment: accepted to EMNLP 201
Revisiting Grammatical Error Correction Evaluation and Beyond
Pretraining-based (PT-based) automatic evaluation metrics (e.g., BERTScore
and BARTScore) have been widely used in several sentence generation tasks
(e.g., machine translation and text summarization) due to their better
correlation with human judgments over traditional overlap-based methods.
Although PT-based methods have become the de facto standard for training
grammatical error correction (GEC) systems, GEC evaluation still does not
benefit from pretrained knowledge. This paper takes the first step towards
understanding and improving GEC evaluation with pretraining. We first find that
arbitrarily applying PT-based metrics to GEC evaluation brings unsatisfactory
correlation results because of the excessive attention to inessential systems
outputs (e.g., unchanged parts). To alleviate the limitation, we propose a
novel GEC evaluation metric to achieve the best of both worlds, namely PT-M2
which only uses PT-based metrics to score those corrected parts. Experimental
results on the CoNLL14 evaluation task show that PT-M2 significantly
outperforms existing methods, achieving a new state-of-the-art result of 0.949
Pearson correlation. Further analysis reveals that PT-M2 is robust to evaluate
competitive GEC systems. Source code and scripts are freely available at
https://github.com/pygongnlp/PT-M2.Comment: Accepted to EMNLP 202
Recommended from our members
Lexical patterns, features and knowledge resources for coreference resolution in clinical notes
Generation of entity coreference chains provides a means to extract linked narrative events from clinical notes, but despite being a well-researched topic in natural language processing, general- purpose coreference tools perform poorly on clinical texts. This paper presents a knowledge-centric and pattern-based approach to resolving coreference across a wide variety of clinical records comprising discharge summaries, progress notes, pathology, radiology and surgical reports from two corpora (Ontology Development and Information Extraction (ODIE) and i2b2/VA). In addition, a method for generating coreference chains using progressively pruned linked lists is demonstrated that reduces the search space and facilitates evaluation by a number of metrics. Independent evaluation results show an F-measure for each corpus of 79.2% and 87.5%, respectively, which offers performance at least as good as human annotators, greatly increased performance over general- purpose tools, and improvement on previously reported clinical coreference systems. The system uses a number of open-source components that are available to download
- …