Large language models (LLMs) such as Chat-GPT can produce coherent, cohesive,
relevant, and fluent answers for various natural language processing (NLP)
tasks. Taking document-level machine translation (MT) as a testbed, this paper
provides an in-depth evaluation of LLMs' ability on discourse modeling. The
study fo-cuses on three aspects: 1) Effects of Discourse-Aware Prompts, where
we investigate the impact of different prompts on document-level translation
quality and discourse phenomena; 2) Comparison of Translation Models, where we
compare the translation performance of Chat-GPT with commercial MT systems and
advanced document-level MT methods; 3) Analysis of Discourse Modelling
Abilities, where we further probe discourse knowledge encoded in LLMs and
examine the impact of training techniques on discourse modeling. By evaluating
a number of benchmarks, we surprisingly find that 1) leveraging their powerful
long-text mod-eling capabilities, ChatGPT outperforms commercial MT systems in
terms of human evaluation. 2) GPT-4 demonstrates a strong ability to explain
discourse knowledge, even through it may select incorrect translation
candidates in contrastive testing. 3) ChatGPT and GPT-4 have demonstrated
superior performance and show potential to become a new and promising paradigm
for document-level translation. This work highlights the challenges and
opportunities of discourse modeling for LLMs, which we hope can inspire the
future design and evaluation of LLMs