Modeling discourse -- the linguistic phenomena that go beyond individual
sentences, is a fundamental yet challenging aspect of natural language
processing (NLP). However, existing evaluation benchmarks primarily focus on
the evaluation of inter-sentence properties and overlook critical discourse
phenomena that cross sentences. To bridge the gap, we propose Disco-Bench, a
benchmark that can evaluate intra-sentence discourse properties across a
diverse set of NLP tasks, covering understanding, translation, and generation.
Disco-Bench consists of 9 document-level testsets in the literature domain,
which contain rich discourse phenomena (e.g. cohesion and coherence) in Chinese
and/or English. For linguistic analysis, we also design a diagnostic test suite
that can examine whether the target models learn discourse knowledge. We
totally evaluate 20 general-, in-domain and commercial models based on
Transformer, advanced pretraining architectures and large language models
(LLMs). Our results show (1) the challenge and necessity of our evaluation
benchmark; (2) fine-grained pretraining based on literary document-level
training data consistently improves the modeling of discourse information. We
will release the datasets, pretrained models, and leaderboard, which we hope
can significantly facilitate research in this field:
https://github.com/longyuewangdcu/Disco-Bench.Comment: Zhaopeng Tu is the corresponding autho