Automatic machine translation (MT) metrics are widely used to distinguish the
translation qualities of machine translation systems across relatively large
test sets (system-level evaluation). However, it is unclear if automatic
metrics are reliable at distinguishing good translations from bad translations
at the sentence level (segment-level evaluation). In this paper, we investigate
how useful MT metrics are at detecting the success of a machine translation
component when placed in a larger platform with a downstream task. We evaluate
the segment-level performance of the most widely used MT metrics (chrF, COMET,
BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state
tracking, question answering, and semantic parsing). For each task, we only
have access to a monolingual task-specific model. We calculate the correlation
between the metric's ability to predict a good/bad translation with the
success/failure on the final task for the Translate-Test setup. Our experiments
demonstrate that all metrics exhibit negligible correlation with the extrinsic
evaluation of the downstream outcomes. We also find that the scores provided by
neural metrics are not interpretable mostly because of undefined ranges. We
synthesise our analysis into recommendations for future MT metrics to produce
labels rather than scores for more informative interaction between machine
translation and multilingual language understanding.Comment: ACL 2023 Camera Read