Radical-Enhanced Chinese Character Embedding
We present a method that leverages radicals for learning Chinese character
embeddings. A radical is a semantic and phonetic component of a Chinese character.
It plays an important role, as characters with the same radical usually have
similar semantic meanings and grammatical usage. However, existing Chinese
processing algorithms typically regard the word or character as the basic unit and
ignore the crucial radical information. In this paper, we fill this gap by
leveraging radicals to learn continuous representations of Chinese characters.
We develop a dedicated neural architecture to learn character embeddings
effectively and apply it to Chinese character similarity judgement and Chinese
word segmentation. Experimental results show that our radical-enhanced method
outperforms existing embedding learning algorithms on both tasks.
Comment: 8 pages, 4 figures
Dual Long Short-Term Memory Networks for Sub-Character Representation Learning
Characters have commonly been regarded as the minimal processing unit in
Natural Language Processing (NLP). However, many non-Latin languages have
hieroglyphic writing systems, involving a large alphabet with thousands or
millions of characters. Each character is composed of even smaller parts, which
are often ignored in previous work. In this paper, we propose a novel
architecture employing two stacked Long Short-Term Memory Networks (LSTMs) to
learn sub-character-level representations and capture a deeper level of semantic
meaning. To build a concrete study and substantiate the efficiency of our
neural architecture, we take Chinese Word Segmentation as a case study.
Among those languages, Chinese is a typical case, in which every
character contains several components called radicals. Our networks employ a
shared radical-level embedding to solve both Simplified and Traditional Chinese
Word Segmentation without extra Traditional-to-Simplified Chinese conversion.
In this highly end-to-end way, word segmentation is significantly
simplified compared to previous work. Radical-level embeddings can also
capture deeper semantic meaning below the character level and improve the
system's learning performance. By tying radical and character embeddings together,
the parameter count is reduced while semantic knowledge is shared and
transferred between the two levels, substantially boosting performance. On 3 out of 4
Bakeoff 2005 datasets, our method surpassed state-of-the-art results by up to
0.4%. Our results are reproducible; source code and corpora are available on
GitHub.
Comment: Accepted & forthcoming at ITNG-201
Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF
We present a character-based model for joint segmentation and POS tagging for Chinese. The bidirectional RNN-CRF architecture for general sequence tagging is adapted and applied with novel vector representations of Chinese characters that capture rich contextual information and sub-character-level features. The proposed model is extensively evaluated and compared with a state-of-the-art tagger on CTB5, CTB9 and UD Chinese. The experimental results indicate that our model is accurate and robust across datasets of different sizes, genres and annotation schemes. We obtain state-of-the-art performance on CTB5, achieving an F1-score of 94.38 for joint segmentation and POS tagging.
Peer reviewed