2kenize: Tying Subword Sequences for Chinese Script Conversion

Author: A Pranav
Augenstein Isabelle
Publication date: 01/01/2020

Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.

Comment: Accepted to ACL 202
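The one-to-many problem described in this abstract can be illustrated with a toy sketch. This is not the 2kenize model: the mapping table, the `KNOWN_WORDS` set (a hypothetical stand-in for a language model), and both function names are assumptions made only for illustration.

```python
from itertools import product

# Toy one-to-many table: simplified character -> possible traditional forms.
# Illustrative assumption; a real conversion table is far larger.
CANDIDATES = {
    "发": ["發", "髮"],  # one simplified character, two traditional ones
    "头": ["頭"],
    "展": ["展"],
}

# Hypothetical stand-in for a language model: a set of known traditional words.
KNOWN_WORDS = {"頭髮", "發展"}


def naive_convert(text):
    """Context-free conversion: always take the first listed mapping."""
    return "".join(CANDIDATES.get(ch, [ch])[0] for ch in text)


def contextual_convert(text):
    """Enumerate every candidate sequence and keep the best-scoring one."""
    options = [CANDIDATES.get(ch, [ch]) for ch in text]
    sequences = ["".join(seq) for seq in product(*options)]
    # Score is True for a known word, False otherwise; ties keep the first.
    return max(sequences, key=lambda s: s in KNOWN_WORDS)


print(naive_convert("头发"), contextual_convert("头发"))
print(naive_convert("发展"), contextual_convert("发展"))
```

The context-free lookup picks the same traditional character for 发 in both words, while the scored enumeration picks a different one for each. Exhaustive enumeration grows exponentially with text length, so it only works here because the inputs are two characters long.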

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

Recommended from our members

The role of HG in the analysis of temporal iteration and interaural correlation

Author: Barrett DJK
Hall DA
Publication date: 01/01/2004

Nottingham Trent Institutional Repository (IRep)

Effective Use of Chinese Structural Auxiliaries for Chinese Parsing

Author: Jin Yun
Kim Young-Gil
Li Qing
Wu Yingshun
Publication venue: City University of Hong Kong
Publication date: 01/01/2009

PACLIC 23 / City University of Hong Kong / 3-5 December 200

Waseda University Repository

Evaluation via Negativa of Chinese Word Segmentation for Information Retrieval

Author: Hsu Wen-Lian
Jiang Mike Tian-Jian
Kuo Chan-Hung
Shih Cheng-Wei
Tsai Richard Tzong-Han
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

Waseda University Repository

Word segmentation of Vietnamese texts: a comparison of approaches

Author: Dinh Quang Thang
Le Hong Phuong
Nguyen Cam Tu
Nguyen Thi Minh Huyen
Rossignol Mathias
Vu Xuan Luong
Publication venue: HAL CCSD
Publication date: 28/05/2008

We present in this paper a comparison between three segmentation systems for the Vietnamese language. The majority of Vietnamese words are built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. The identification of word boundaries in a text is therefore not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems, we also propose a standard definition for word segmentation in Vietnamese, and introduce a reference corpus developed for the purpose of evaluating such a task. The results observed confirm that the task can be handled relatively well by automatic means, although a solution needs to be found to take into account out-of-vocabulary words.
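The boundary ambiguity mentioned in this abstract can be seen with a generic greedy longest-match segmenter. This is a minimal sketch of a common baseline technique, not one of the three systems compared in the paper; the tiny `LEXICON`, the example `SENTENCE`, and the function name are assumptions chosen for illustration.

```python
# Hypothetical three-entry lexicon; each entry is one or more syllables.
LEXICON = {"học sinh", "sinh học", "học"}

# Example input: every syllable is also a word on its own, and adjacent
# syllables can pair up in more than one way.
SENTENCE = "học sinh học sinh học"


def longest_match_segment(syllables, lexicon, max_len=3):
    """Greedy left-to-right segmentation: take the longest lexicon entry
    starting at the current syllable, falling back to a single syllable."""
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted so the loop advances,
            # which is also how out-of-vocabulary syllables pass through.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


print(longest_match_segment(SENTENCE.split(), LEXICON))
```

The greedy pass commits to the first two-syllable match it finds and never considers the alternative split in which the third syllable stands alone and the last two form "sinh học". Choosing between such competing splits needs context that a purely lexicon-driven pass does not have.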

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

A Machine Translation Approach for Chinese Whole-Sentence Pinyin-to-Character Conversion

Author: Lu Bao-Liang
Yang Shaohua
Zhao Hai
Publication venue: Faculty of Computer Science, Universitas Indonesia
Publication date: 01/01/2012

Waseda University Repository