Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Cho, Junhee; Kim, Gyu Tae; Kim, Gyuwan; Oh, Alice; Park, Kihyo; Park, Sungjoon; Seo, Minjoon; Yoon, Soyoung

Authors: Junhee Cho, Gyu Tae Kim, Gyuwan Kim, Alice Oh, Kihyo Park, Sungjoon Park, Minjoon Seo, Soyoung Yoon
Publication date: 27 October 2022
Abstract

Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English and Chinese. We attribute this gap to the lack of a carefully designed evaluation benchmark for Korean. In this work, we first collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) to cover a wide range of error types, and annotate them using our newly proposed tool, the Korean Automatic Grammatical error Annotation System (KAGAS). KAGAS is an edit alignment and classification tool that takes the nature of Korean into account when generating an alignment between a source sentence and a target sentence, and that identifies the error type of each aligned edit. We also present baseline models fine-tuned on our datasets. We show that a model trained on our datasets significantly outperforms the public statistical GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets.
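To make the idea of an "aligned edit" concrete, the following is a minimal sketch of extracting edits from a source/target sentence pair. It is not KAGAS: it assumes plain whitespace tokenization and uses the generic `SequenceMatcher` from Python's standard `difflib` module, with no Korean-specific alignment and no error-type classification. The function name `extract_edits` and the dictionary keys are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def extract_edits(source: str, target: str) -> list[dict]:
    """Return token-level edits that turn `source` into `target`.

    Illustrative only: whitespace tokens and a generic diff, not the
    Korean-aware alignment described in the abstract.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)

    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Unchanged spans are not edits.
            continue
        edits.append(
            {
                "op": op,  # "replace", "delete", or "insert"
                "src_span": (i1, i2),  # token offsets in the source
                "original": " ".join(src_tokens[i1:i2]),
                "correction": " ".join(tgt_tokens[j1:j2]),
            }
        )
    return edits


if __name__ == "__main__":
    # A single misspelled token in the source, corrected in the target.
    for edit in extract_edits("나는 어제 학교에 갓다", "나는 어제 학교에 갔다"):
        print(edit)
```

A real annotation system would then assign an error type to each extracted edit; that step is left out here because the abstract does not specify how the classification is done.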

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2210.14389