We propose a novel methodology (namely, MuLER) that transforms any
reference-based evaluation metric for text generation, such as machine
translation (MT), into a fine-grained analysis tool.
Given a system and a metric, MuLER quantifies how much the chosen metric
penalizes specific error types (e.g., errors in translating names of
locations). MuLER thus enables a detailed error analysis that can guide
targeted improvement efforts for specific phenomena.
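To make the notion of a per-category penalty concrete, the sketch below estimates how much a metric (here BLEU via sacrebleu, chosen only for illustration) penalizes one error category by "fixing" that category's spans in the hypotheses and measuring the resulting score gain. The function name category_penalty, the spans_per_sent input, and the replacement-based fixing step are assumptions for illustration, not MuLER's actual procedure.

```python
from sacrebleu.metrics import BLEU

def category_penalty(hyps, refs, spans_per_sent):
    """Rough estimate of how strongly a metric penalizes one error category.

    spans_per_sent: for each sentence, a list of (hyp_span, ref_span) pairs
    belonging to the category of interest (e.g., location names), obtained
    from some external tagger or aligner (hypothetical input).
    """
    bleu = BLEU()
    original = bleu.corpus_score(hyps, [refs]).score

    # "Fix" the category's errors by copying the reference spans into the
    # hypotheses, leaving all other errors untouched.
    fixed_hyps = []
    for hyp, spans in zip(hyps, spans_per_sent):
        for hyp_span, ref_span in spans:
            hyp = hyp.replace(hyp_span, ref_span)
        fixed_hyps.append(hyp)
    fixed = bleu.corpus_score(fixed_hyps, [refs]).score

    # Score regained once this category's errors are removed; a larger gap
    # indicates the metric penalizes this category more heavily.
    return fixed - original
```

The same comparison could be run with any reference-based metric in place of BLEU, which is the sense in which such an analysis is metric-agnostic.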
We perform experiments in both synthetic and naturalistic settings to support
MuLER's validity and to showcase its usability in MT evaluation and in other
tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020,
we find consistent trends: for example, nouns and verbs are among the most
frequent POS tags, yet they are among the hardest to translate. Performance
on most POS tags improves with overall system performance, but for a few tags
it does not, and the identity of these tags changes from language to language.
Preliminary experiments with summarization reveal similar trends.