Grammatical Error Correction: A Survey of the State of the Art
Grammatical Error Correction (GEC) is the task of automatically detecting and
correcting errors in text. The task not only includes the correction of
grammatical errors, such as missing prepositions and mismatched subject-verb
agreement, but also orthographic and semantic errors, such as misspellings and
word choice errors respectively. The field has seen significant progress in the
last decade, motivated in part by a series of five shared tasks, which drove
the development of rule-based methods, statistical classifiers, statistical
machine translation, and finally neural machine translation systems which
represent the current dominant state of the art. In this survey paper, we
condense the field into a single article and first outline some of the
linguistic challenges of the task, introduce the most popular datasets that are
available to researchers (for both English and other languages), and summarise
the various methods and techniques that have been developed with a particular
focus on artificial error generation. We next describe the many different
approaches to evaluation as well as concerns surrounding metric reliability,
especially in relation to subjective human judgements, before concluding with
an overview of recent progress and suggestions for future work and remaining
challenges. We hope that this survey will serve as a comprehensive resource for
researchers who are new to the field or who want to be kept apprised of recent
developments.
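The survey highlights artificial error generation as a key technique for training GEC systems. As an illustration only (the rules and confusion sets below are hand-written assumptions, not a method from the survey), a minimal sketch of noising clean text to create synthetic source/target pairs might look like this:

```python
import random

# Toy corruption rules; real systems learn confusion sets from annotated
# corpora, but these hand-written rules illustrate the basic idea.
PREPOSITIONS = {"in", "on", "at", "for", "to", "of"}

def drop_preposition(tokens, rng):
    """Delete one preposition, simulating a missing-preposition error."""
    idxs = [i for i, t in enumerate(tokens) if t in PREPOSITIONS]
    if not idxs:
        return tokens
    i = rng.choice(idxs)
    return tokens[:i] + tokens[i + 1:]

def break_agreement(tokens, rng):
    """Crudely strip a 3rd-person -s, simulating an agreement error."""
    idxs = [i for i, t in enumerate(tokens) if t.endswith("s") and len(t) > 3]
    if not idxs:
        return tokens
    i = rng.choice(idxs)
    return tokens[:i] + [tokens[i][:-1]] + tokens[i + 1:]

def corrupt(sentence, seed=0):
    """Apply one random corruption rule to a clean sentence."""
    rng = random.Random(seed)
    tokens = sentence.split()
    rule = rng.choice([drop_preposition, break_agreement])
    return " ".join(rule(tokens, rng))

clean = "She walks to the station in the morning"
noisy = corrupt(clean)
print(noisy)  # the noised source is paired with `clean` as the target
```

The noised sentence becomes the source side and the original clean sentence the target side of a synthetic training pair.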
Assessing Grammatical Correctness in Language Learning
We present experiments on assessing the grammatical correctness of learners’ answers in a language-learning System (references to the System, and the links to the released data and code are withheld for anonymity). In particular, we explore the problem of detecting alternative-correct answers: when more than one inflected form of a lemma fits syntactically and semantically in a given context. We approach the problem with methods for grammatical error detection (GED), since we hypothesize that models for detecting grammatical mistakes can assess the correctness of potential alternative answers in a learning setting. Due to the paucity of training data, we explore the ability of pre-trained BERT to detect grammatical errors and then fine-tune it using synthetic training data. In this work, we focus on errors in inflection. Our experiments show (a) that pre-trained BERT performs worse at detecting grammatical irregularities for Russian than for English; (b) that fine-tuned BERT yields promising results on assessing the correctness of grammatical exercises; and (c) that our results establish a new benchmark for Russian. To further investigate its performance, we compare fine-tuned BERT with one of the state-of-the-art models for GED (Bell et al., 2019) on our dataset and RULEC-GEC (Rozovskaya and Roth, 2019). We release the manually annotated learner dataset, used for testing, for general use.
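GED systems such as the ones compared above are conventionally scored with token-level F0.5, which weights precision more heavily than recall. A minimal sketch of that metric (the toy gold/prediction labels are invented for illustration):

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F_beta score; GED/GEC evaluation conventionally uses F0.5,
    which weights precision twice as heavily as recall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

def score_ged(gold, pred):
    """Token-level binary error detection: label 1 marks an erroneous token."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    return f_beta(tp, fp, fn)

gold = [0, 1, 0, 0, 1, 0]  # two erroneous tokens in the reference
pred = [0, 1, 1, 0, 0, 0]  # one hit, one false alarm, one miss
print(round(score_ged(gold, pred), 3))  # → 0.5
```

With one true positive, one false positive, and one false negative, precision and recall are both 0.5, so F0.5 is 0.5.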
Beyond Hard Samples: Robust and Effective Grammatical Error Correction with Cycle Self-Augmenting
Recent studies have revealed that grammatical error correction methods in the
sequence-to-sequence paradigm are vulnerable to adversarial attack, and simply
utilizing adversarial examples in the pre-training or post-training process can
significantly enhance the robustness of GEC models to certain types of attack
without suffering too much performance loss on clean data. In this paper, we
further conduct a thorough robustness evaluation of cutting-edge GEC methods
for four different types of adversarial attacks and propose a simple yet very
effective Cycle Self-Augmenting (CSA) method accordingly. By leveraging the
augmenting data from the GEC models themselves in the post-training process and
introducing regularization data for cycle training, our proposed method can
effectively improve the model robustness of well-trained GEC models with only a
few more training epochs as an extra cost. More concretely, further training on
the regularization data can prevent the GEC models from over-fitting on
easy-to-learn samples and thus can improve the generalization capability and
robustness towards unseen data (adversarial noise/samples). Meanwhile, the
self-augmented data can provide more high-quality pseudo pairs to improve model
performance on the original testing data. Experiments on four benchmark
datasets and seven strong models indicate that our proposed training method can
significantly enhance robustness against four types of attacks without using
purposely built adversarial examples in training. Evaluation results on clean
data further confirm that our proposed CSA method significantly improves the
performance of four baselines and yields nearly comparable results with other
state-of-the-art models. Our code is available at
https://github.com/ZetangForward/CSA-GEC
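The cycle idea above can be illustrated with a toy sketch. This is not the authors' implementation (their code is at the linked repository); `toy_correct` and the loop below are invented stand-ins that show how a model's own corrections can be fed back as pseudo pairs, while unchanged sentences yield identity pairs usable as regularization data:

```python
def toy_correct(sentence):
    """Stand-in for a trained GEC model: fixes one known error type."""
    return sentence.replace("He walk ", "He walks ")

def cycle_self_augment(sources, rounds=2):
    """Re-feed model outputs as new sources; collect pairs the model still
    changes as pseudo training pairs, and pairs it leaves untouched as
    identity (regularization) pairs that discourage over-fitting."""
    pseudo_pairs, regularization = [], []
    current = list(sources)
    for _ in range(rounds):
        nxt = []
        for src in current:
            hyp = toy_correct(src)
            if hyp != src:
                pseudo_pairs.append((src, hyp))   # still being corrected
                nxt.append(hyp)                   # cycle the output back in
            else:
                regularization.append((src, src))  # identity pair
        current = nxt
    return pseudo_pairs, regularization

pairs, reg = cycle_self_augment(["He walk to school .", "She runs fast ."])
print(len(pairs), len(reg))  # → 1 2
```

The first sentence is corrected once and then converges, so later rounds contribute it as an identity pair; the already-clean sentence contributes an identity pair immediately.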
Towards Orthographic and Grammatical Clinical Text Correction: a First Approach
Grammatical Error Correction (GEC) is a subfield of Natural Language Processing that aims to automatically correct texts containing spelling, punctuation, or grammar errors. So far, it has mainly focused on texts produced by second-language learners, mostly in English. This Master's Thesis describes a first approach to Grammatical Error Correction for Spanish health records. This specific field has not been explored much until now, neither for Spanish in a general sense nor for the clinical domain specifically. For this purpose, the IMEC corpus (Informes Médicos en Español Corregidos), a new manually corrected parallel collection of electronic health records, is introduced. The corpus has been automatically annotated using ERRANT, a toolkit specialized in the automatic annotation of GEC parallel corpora, which was adapted to Spanish for this task. Furthermore, several experiments using neural networks and data augmentation are described and compared against a baseline system created specifically for this task.