Better Summarization Evaluation with Word Embeddings for ROUGE
ROUGE is a widely adopted, automatic evaluation measure for text
summarization. While it has been shown to correlate well with human judgements,
it is biased towards surface lexical similarities. This makes it unsuitable for
the evaluation of abstractive summarization, or summaries with substantial
paraphrasing. We study the effectiveness of word embeddings to overcome this
disadvantage of ROUGE. Specifically, instead of measuring lexical overlaps,
we use word embeddings to compute the semantic similarity of the words used
in summaries. Our experimental results show that our proposal is able
to achieve better correlations with human judgements when measured with the
Spearman and Kendall rank coefficients.
Comment: Pre-print - To appear in proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP
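
To illustrate the idea in the abstract above, here is a minimal sketch of replacing exact unigram matching with an embedding-based similarity. It is not the authors' formulation: the function names, the cosine-with-max scoring rule, and the two-dimensional toy vectors are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def word_similarity(w1, w2, embeddings):
    """Hypothetical similarity: 1.0 for identical words, 0.0 if either word is
    out of vocabulary, otherwise cosine similarity clipped to be non-negative."""
    if w1 == w2:
        return 1.0
    if w1 not in embeddings or w2 not in embeddings:
        return 0.0
    return max(0.0, cosine(embeddings[w1], embeddings[w2]))

def soft_unigram_recall(reference, candidate, embeddings):
    """Average, over reference words, of the best similarity to any candidate word."""
    if not reference or not candidate:
        return 0.0
    total = sum(
        max(word_similarity(r, c, embeddings) for c in candidate)
        for r in reference
    )
    return total / len(reference)

# Toy vectors invented for this example; real use would load pretrained embeddings.
TOY_EMBEDDINGS = {
    "film": [0.9, 0.1],
    "movie": [0.85, 0.2],
    "great": [0.1, 0.9],
    "excellent": [0.2, 0.85],
}

reference = ["great", "film"]
paraphrase = ["excellent", "movie"]

# Exact lexical overlap gives the paraphrase no credit at all.
exact_recall = sum(1 for r in reference if r in paraphrase) / len(reference)
# The embedding-based variant rewards semantically close words.
soft_recall = soft_unigram_recall(reference, paraphrase, TOY_EMBEDDINGS)
```

With these toy vectors the paraphrase scores zero under exact matching but close to one under the soft variant, which is the bias towards surface similarity that the abstract describes.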
Coreference resolution in clinical discharge summaries, progress notes, surgical and pathology reports: a unified lexical approach
We developed a lexical rule-based system that uses a unified approach to resolving coreference across a wide variety of clinical records comprising discharge summaries, progress notes, pathology, radiology and surgical reports from two corpora (Ontology Development and Information Extraction (ODIE) and i2b2/VA) provided for the fifth i2b2/VA shared task. Taking the unweighted mean of four coreference metrics, validation of the system against the i2b2/VA corpus attained an overall F-score of 87.7% across all mention classes, with a maximum of 93.1% for coreference of persons, and a minimum of 77.2% for coreference of tests. For the ODIE corpus the overall F-score across all mention classes was 79.4%, with a maximum of 82.0% for coreference of persons and a minimum of 13.1% for coreference of diagnostic reagents. For the ODIE corpus our results are comparable to the mean reported inter-annotator agreement with the gold standard. We discuss the four categories of errors we identified, and how these might be addressed. The system uses a number of reusable modules and techniques that may be of benefit to the research community.
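
The overall score described above is an unweighted mean over four coreference metrics. A minimal sketch of that aggregation follows; the metric names and per-metric F-scores are placeholders invented for illustration, since the abstract does not list them.

```python
from statistics import mean

def overall_f_score(per_metric_f):
    """Unweighted mean of per-metric F-scores: every metric counts equally."""
    if not per_metric_f:
        raise ValueError("at least one metric score is required")
    return mean(per_metric_f.values())

# Placeholder values only; these are NOT results from the paper.
hypothetical_scores = {
    "metric_a": 0.80,
    "metric_b": 0.85,
    "metric_c": 0.90,
    "metric_d": 0.95,
}

overall = overall_f_score(hypothetical_scores)
```

Because the mean is unweighted, a metric on which the system does poorly pulls the overall figure down as much as any other metric, regardless of how many mentions it covers.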