2kenize: Tying Subword Sequences for Chinese Script Conversion
Simplified Chinese to Traditional Chinese character conversion is a common
preprocessing step in Chinese NLP. Despite this, current approaches have poor
performance because they do not take into account that a simplified Chinese
character can correspond to multiple traditional characters. Here, we propose a
model that can disambiguate between mappings and convert between the two
scripts. The model is based on subword segmentation, two language models, as
well as a method for mapping between subword sequences. We further construct
benchmark datasets for topic classification and script conversion. Our proposed
method outperforms previous Chinese character conversion approaches by 6 points
in accuracy. These results are further confirmed in a downstream application,
where 2kenize is used to convert a pretraining dataset for topic classification.
An error analysis reveals that our method's particular strengths are in dealing
with code-mixing and named entities.
Comment: Accepted to ACL 202
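The core disambiguation problem the abstract describes, one simplified character mapping to several traditional characters, can be illustrated with a minimal sketch: enumerate candidate traditional sequences and score them with a target-side language model. The mapping table and bigram scores below are illustrative toy data, and the scoring is a plain bigram sum, not the actual 2kenize subword model.

```python
# Hedged sketch: resolving one-to-many Simplified -> Traditional mappings
# by scoring candidate output sequences with a (toy) target-side LM.
from itertools import product

# A simplified character can correspond to multiple traditional characters.
S2T = {
    "发": ["發", "髮"],  # 發 "emit/issue" vs 髮 "hair"
    "头": ["頭"],
}

# Toy bigram scores standing in for a traditional-script language model.
BIGRAM = {("頭", "髮"): 0.9, ("頭", "發"): 0.1}

def convert(simplified, lm=BIGRAM):
    """Pick the candidate traditional sequence with the best LM score."""
    chars = [S2T.get(c, [c]) for c in simplified]
    best, best_score = None, float("-inf")
    for cand in product(*chars):
        score = sum(lm.get(pair, 0.0) for pair in zip(cand, cand[1:]))
        if score > best_score:
            best, best_score = "".join(cand), score
    return best

print(convert("头发"))  # → 頭髮 ("hair"), not 頭發
```

A real system would replace the brute-force enumeration with dynamic programming over subword segmentations, but the candidate-generation-plus-LM-scoring structure is the same.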
Radical-Enhanced Chinese Character Embedding
We present a method that leverages radicals for learning Chinese character
embeddings. A radical is a semantic and phonetic component of a Chinese
character. It plays an important role, as characters with the same radical
usually have similar semantic meaning and grammatical usage. However, existing
Chinese processing algorithms typically regard the word or character as the
basic unit and ignore the crucial radical information. In this paper, we fill
this gap by leveraging radicals to learn continuous representations of Chinese
characters.
We develop a dedicated neural architecture to effectively learn character
embeddings and apply them to Chinese character similarity judgment and Chinese
word segmentation. Experimental results show that our radical-enhanced method
outperforms existing embedding learning algorithms on both tasks.
Comment: 8 pages, 4 figures
Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization
Existing methods for Chinese word segmentation (CWS) usually rely on a large
number of labeled sentences to train word segmentation models, which are
expensive and time-consuming to annotate. Fortunately, unlabeled data is
usually easy to collect, and many high-quality Chinese lexicons are available
off-the-shelf, both of which can provide
useful information for CWS. In this paper, we propose a neural approach for
Chinese word segmentation which can exploit both lexicon and unlabeled data.
Our approach is based on a variant of the posterior regularization algorithm, and
the unlabeled data and lexicon are incorporated into model training as indirect
supervision by regularizing the prediction space of CWS models. Extensive
experiments on multiple benchmark datasets in both in-domain and cross-domain
scenarios validate the effectiveness of our approach.
Comment: 7 pages, 11 figures, accepted by the 2019 World Wide Web Conference (WWW '19)
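The abstract's key mechanism, using a lexicon to regularize the prediction space of a CWS model, can be sketched as follows: positions covered by a lexicon match have their predicted label distributions restricted to the B/M/E/S tags the lexicon implies. The greedy longest-match and hard masking below illustrate the general idea of constraining predictions; the paper's actual posterior regularization objective is a soft, training-time constraint, not this hard filter.

```python
# Hedged sketch: lexicon matches constrain the BMES label space per position.
LEXICON = {"自然语言", "处理"}
LABELS = ["B", "M", "E", "S"]

def lexicon_constraints(sentence, lexicon=LEXICON, max_len=6):
    """Per character, the label set a greedy longest lexicon match allows
    (None = unconstrained)."""
    allowed = [None] * len(sentence)
    i = 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + l] in lexicon:
                tags = ["S"] if l == 1 else ["B"] + ["M"] * (l - 2) + ["E"]
                for k, t in enumerate(tags):
                    allowed[i + k] = {t}
                i += l
                break
        else:
            i += 1
    return allowed

def regularize(posteriors, allowed):
    """Zero out disallowed labels and renormalize each position's distribution."""
    out = []
    for dist, ok in zip(posteriors, allowed):
        if ok is None:
            out.append(dist)
            continue
        masked = {lab: (p if lab in ok else 0.0) for lab, p in dist.items()}
        z = sum(masked.values()) or 1.0
        out.append({lab: p / z for lab, p in masked.items()})
    return out

sent = "自然语言处理"
uniform = [{lab: 0.25 for lab in LABELS} for _ in sent]
constrained = regularize(uniform, lexicon_constraints(sent))
print([max(d, key=d.get) for d in constrained])  # → ['B', 'M', 'M', 'E', 'B', 'E']
```

Even an uninformed (uniform) model is pushed to segment 自然语言 / 处理 correctly once the lexicon constraint is applied, which is the sense in which the lexicon acts as indirect supervision.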