BLEU is Not Suitable for the Evaluation of Text Simplification
BLEU is widely considered to be an informative metric for text-to-text
generation, including Text Simplification (TS). TS includes both lexical and
structural aspects. In this paper we show that BLEU is not suitable for the
evaluation of sentence splitting, the major structural simplification
operation. We manually compiled a sentence splitting gold standard corpus
containing multiple structural paraphrases, and performed a correlation
analysis with human judgments. We find low or no correlation between BLEU and
the grammaticality and meaning preservation parameters where sentence splitting
is involved. Moreover, BLEU often negatively correlates with simplicity,
essentially penalizing simpler sentences.

Comment: Accepted to EMNLP 2018 (Short papers)
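A minimal sketch of the kind of correlation analysis the paper describes, assuming hypothetical outputs, reference paraphrases, and human simplicity ratings; it uses sacrebleu and scipy rather than the authors' exact pipeline:

    import sacrebleu
    from scipy.stats import spearmanr

    # Hypothetical data: system outputs (some with sentence splitting),
    # reference paraphrases for each, and a human simplicity rating (1-5).
    outputs = [
        "He came home. He went to sleep.",
        "The cat sat on the mat. It purred.",
        "She opened the door and walked in.",
    ]
    references = [
        ["He came home and went to sleep."],
        ["The cat sat on the mat and purred."],
        ["She opened the door and walked in."],
    ]
    human_simplicity = [4.5, 4.0, 2.5]

    # Sentence-level BLEU of each output against its references.
    bleu = [sacrebleu.sentence_bleu(o, r).score for o, r in zip(outputs, references)]

    # Rank correlation between BLEU and the human judgments; the paper finds
    # this is low or negative where sentence splitting is involved.
    rho, p = spearmanr(bleu, human_simplicity)
    print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")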
Evaluation of Automatic Video Captioning Using Direct Assessment
We present Direct Assessment, a method for manually assessing the quality of
automatically-generated captions for video. Evaluating the accuracy of video
captions is particularly difficult because for any given video clip there is no
definitive ground truth or correct answer against which to measure. Automatic
metrics such as BLEU and METEOR, drawn from techniques used in evaluating
machine translation, compare automatic video captions against a manual caption;
these were used in the TRECVid video captioning task in 2016 but are shown to
have weaknesses. The work presented here brings human assessment into the
evaluation by crowdsourcing how well a caption describes a video. We
automatically degrade the quality of some sample captions, which are assessed
manually; from this we are able to rate the reliability of each human assessor,
a factor we take into account in the evaluation. Using data from the TRECVid
video-to-text task in 2016, we show how our direct assessment method is
replicable and robust and should scale to settings where there are many
caption-generation techniques to be evaluated.

Comment: 26 pages, 8 figures
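A minimal sketch of the per-assessor score standardization that underlies Direct Assessment-style evaluation, assuming a hypothetical table of raw 0-100 ratings; the paper's full method also includes quality control of assessors via the degraded captions, which is omitted here:

    from collections import defaultdict
    from statistics import mean, stdev

    # Hypothetical crowdsourced ratings: (assessor, system, raw 0-100 score).
    ratings = [
        ("w1", "sysA", 70), ("w1", "sysB", 40), ("w1", "sysA", 80),
        ("w2", "sysA", 95), ("w2", "sysB", 85), ("w2", "sysB", 90),
    ]

    # Standardize each assessor's scores to remove individual scoring biases.
    per_assessor = defaultdict(list)
    for assessor, _, score in ratings:
        per_assessor[assessor].append(score)
    stats = {a: (mean(s), stdev(s)) for a, s in per_assessor.items()}

    # Average the standardized (z) scores per captioning system.
    per_system = defaultdict(list)
    for assessor, system, score in ratings:
        mu, sd = stats[assessor]
        per_system[system].append((score - mu) / sd if sd > 0 else 0.0)

    for system, zs in sorted(per_system.items()):
        print(system, round(mean(zs), 3))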
An Investigation of Evaluation Metrics for Automated Medical Note Generation
Recent studies have shown that doctors can save significant amounts of time
when using automatic clinical note generation
(Knoll et al., 2022). Summarization models have been used for this task to
generate clinical notes as summaries of doctor-patient conversations (Krishna
et al., 2021; Cai et al., 2022). However, assessing which model would best
serve clinicians in their daily practice is still a challenging task due to the
large set of possible correct summaries, and the potential limitations of
automatic evaluation metrics. In this paper, we study evaluation methods and
metrics for the automatic generation of clinical notes from medical
conversations. In particular, we propose new task-specific metrics and we
compare them to SOTA evaluation metrics in text summarization and generation,
including: (i) knowledge-graph embedding-based metrics, (ii) customized
model-based metrics, (iii) domain-adapted/fine-tuned metrics, and (iv) ensemble
metrics. To study the correlation between the automatic metrics and manual
judgments, we evaluate automatic notes/summaries by comparing system and
reference facts and computing factual correctness as well as hallucination
and omission rates for critical medical facts. This study relied on seven
datasets manually annotated by domain experts. Our experiments show that
automatic evaluation metrics can have substantially different behaviors on
different types of clinical notes datasets. However, the results highlight one
stable subset of metrics as the most correlated with human judgments under a
relevant aggregation of different evaluation criteria.

Comment: Accepted to ACL Findings 2023
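A minimal sketch of the fact-level comparison described above, assuming facts have already been extracted as normalized strings; the rate definitions here are illustrative, not the authors' exact formulation:

    # Hypothetical reference facts (from expert annotation) and system facts.
    reference_facts = {"reports chest pain", "no known allergies", "prescribed aspirin"}
    system_facts = {"reports chest pain", "prescribed ibuprofen"}

    matched = system_facts & reference_facts
    # Factual correctness: share of system facts supported by the reference.
    correctness = len(matched) / len(system_facts)
    # Hallucination rate: share of system facts absent from the reference.
    hallucination = len(system_facts - reference_facts) / len(system_facts)
    # Omission rate: share of reference facts missing from the system note.
    omission = len(reference_facts - system_facts) / len(reference_facts)

    print(correctness, hallucination, omission)  # 0.5 0.5 0.666...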
The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Automatic evaluation metrics capable of replacing human judgments are
critical to allowing fast development of new methods. Thus, numerous research
efforts have focused on crafting such metrics. In this work, we take a step
back and analyze recent progress by comparing the existing body of automatic
metrics and human metrics with one another. Since metrics are used based on how they
rank systems, we compare metrics in the space of system rankings. Our extensive
statistical analysis reveals surprising findings: automatic metrics -- old and
new -- are much more similar to each other than to humans. Automatic metrics
are not complementary and rank systems similarly. Strikingly, human metrics
predict each other much better than the combination of all automatic metrics
predicts a human metric. This is surprising because human metrics are
often designed to be independent and to capture different aspects of quality,
e.g., content fidelity or readability. We discuss these findings and offer
recommendations for future work in the field of evaluation.
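A minimal sketch of comparing metrics in the space of system rankings, assuming a hypothetical score matrix over five systems (all metric names and values are made up for illustration); pairwise Kendall tau between the rankings each metric induces:

    import itertools
    from scipy.stats import kendalltau

    # Hypothetical scores for five systems under three automatic metrics and
    # one human metric.
    scores = {
        "BLEU":      [31.2, 28.4, 33.0, 25.1, 30.7],
        "ROUGE-L":   [0.42, 0.39, 0.44, 0.35, 0.41],
        "BERTScore": [0.88, 0.86, 0.89, 0.83, 0.87],
        "Human":     [3.1, 3.4, 3.0, 2.6, 3.9],
    }

    # Kendall tau between the system rankings induced by each pair of metrics.
    # High tau among the automatic metrics but low tau against "Human" would
    # mirror the paper's finding that automatic metrics are not complementary.
    for a, b in itertools.combinations(scores, 2):
        tau, _ = kendalltau(scores[a], scores[b])
        print(f"{a:>9} vs {b:<9} tau = {tau:.2f}")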