This study investigates how Large Language Models (LLMs) leverage source and
reference data in the machine translation evaluation task, aiming to better
understand the mechanisms behind their remarkable performance on this task. We
design controlled experiments across various input modes and model types,
and employ both coarse-grained and fine-grained prompts to discern the utility
of source versus reference information. We find that reference information
significantly enhances evaluation accuracy, while, surprisingly, source
information is sometimes counterproductive, indicating that LLMs cannot fully
leverage their cross-lingual capability when evaluating translations. Further
analysis of the fine-grained evaluation and fine-tuning experiments shows
similar results. These findings also suggest a potential research direction:
fully exploiting the cross-lingual capability of LLMs to achieve better
performance in machine translation evaluation.

Comment: Accepted by ACL 2024 Findings