M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box
  Machine-Generated Text Detection

Afzal, Osama Mohammed; Aji, Alham Fikri; Ivanov, Petar; Mahmoud, Tarek; Mansurov, Jonibek; Nakov, Preslav; Shelmanov, Artem; Su, Jinyan; Tsvigun, Akim; Wang, Yuxia; Whitehouse, Chenxi

Authors: Osama Mohammed Afzal, Alham Fikri Aji, Petar Ivanov, Tarek Mahmoud, Jonibek Mansurov, Preslav Nakov, Artem Shelmanov, Jinyan Su, Akim Tsvigun, Yuxia Wang, Chenxi Whitehouse
Publication date: 24 May 2023

Abstract

Large language models (LLMs) have demonstrated a remarkable capability to generate fluent responses to a wide variety of user queries, but this has also raised concerns about the potential misuse of such texts in journalistic, educational, and academic contexts. In this work, we aim to develop automatic systems to identify machine-generated text and to detect potential misuse. We first introduce M4, a large-scale benchmark that is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Using the dataset, we experiment with a number of methods and show that it is challenging for detectors to generalize to unseen examples if they come from different domains or are generated by different large language models. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset M4, which covers different generators, domains, and languages, will enable future research towards more robust approaches to this pressing societal problem. The M4 dataset is available at https://github.com/mbzuai-nlp/M4.

Comment: 11 pages

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2305.14902

Last time updated on 26/05/2023