Towards better language representation in Natural Language Processing: A multilingual dataset for text-level Grammatical Error Correction

Masciolini, Arianna; Kurfalı, Murathan; Zesch, Torsten

Journal article; research article; text

Authors: Arianna Masciolini, Murathan Kurfalı, Torsten Zesch
Publication date: 1 January 2025
Publisher: John Benjamins Publishing Company

Abstract

This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC differs from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in resources used to train models for Natural Language Processing tasks that, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Aside from multilingualism, the novelty of the MultiGEC dataset is that it consists of full texts (typically learner essays) rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after the shared task's competitive phase, serving as a resource for training new error correction systems and performing cross-lingual GEC studies.

Work on Swedish has been supported by Nationella Språkbanken and Huminfra, both funded by the Swedish Research Council (2018–2024, contract 2017-00626; 2022–2024, contract 2021-00176) and their participating partner institutions, as well as by the Swedish Research Council grant 2019-04129.

RISE – Research Institutes of Sweden

oai:DiVA.org:ri-78407

Last time updated on 25/12/2025

This paper was published in RISE – Research Institutes of Sweden.

Licence: info:eu-repo/semantics/openAccess