2kenize: Tying Subword Sequences for Chinese Script Conversion

A, Pranav; Augenstein, Isabelle

Authors: Pranav A, Isabelle Augenstein
Publication date: 1 January 2020

Abstract

Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert a pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.

Comment: Accepted to ACL 202
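The one-to-many ambiguity described in the abstract can be illustrated with a minimal sketch. This is not the 2kenize model: it enumerates every candidate conversion by brute force and ranks the candidates with a toy bigram table, which stands in for the language models the abstract mentions. The names `S2T`, `BIGRAM_COUNTS`, `candidates`, `score`, and `convert` are hypothetical, and the counts are made up for illustration.

```python
from itertools import product

# Toy one-to-many mapping table (illustrative only; real tables are far larger).
# The simplified character 发 corresponds to both 發 (emit) and 髮 (hair).
S2T = {
    "头": ["頭"],
    "发": ["發", "髮"],
    "现": ["現"],
}

# Hypothetical bigram counts standing in for a Traditional Chinese language model.
BIGRAM_COUNTS = {
    ("頭", "髮"): 50,
    ("發", "現"): 80,
}


def candidates(simplified):
    """Enumerate every Traditional string reachable through the mapping table."""
    options = [S2T.get(ch, [ch]) for ch in simplified]
    return ["".join(chars) for chars in product(*options)]


def score(text):
    """Sum the toy bigram counts over adjacent character pairs."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(text, text[1:]))


def convert(simplified):
    """Pick the highest-scoring candidate; ties go to the first one enumerated."""
    return max(candidates(simplified), key=score)


if __name__ == "__main__":
    for word in ["头发", "发现"]:
        print(word, "->", convert(word))
```

Brute-force enumeration grows exponentially with the number of ambiguous characters, so it only serves to show why context is needed to choose between mappings; it says nothing about how the proposed method searches or scores.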
