Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task
Supervised Chinese word segmentation has entered the deep learning era which
reduces the hassle of feature engineering. Recently, some researchers have
treated it as character-level translation, which further simplifies model
design and building, but a performance gap remains between the
translation-based approach and other methods. In this work, we apply the best
practices from low-resource neural machine translation to Chinese word
segmentation. We build encoder-decoder models with attention, and examine a
series of techniques including regularization, data augmentation, objective
weighting, transfer learning and ensembling. Our method is generic for word
segmentation, without the need for feature engineering or model implementation.
In the closed test with constrained data, our method ties with the state of the
art on the MSR dataset and is comparable to other methods on the PKU dataset.
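The translation framing described above can be illustrated with a minimal sketch (a hypothetical example, not the paper's code; the `<sep>` boundary token is an assumption): segmentation becomes a character-level "translation" from a raw character sequence to the same characters interleaved with word-boundary tokens.

```python
def to_translation_pair(segmented_words, boundary="<sep>"):
    """Turn a gold segmentation into a character-level 'translation' pair:
    source = raw character sequence, target = characters plus boundary tokens."""
    source = [ch for word in segmented_words for ch in word]
    target = []
    for word in segmented_words:
        target.extend(word)
        target.append(boundary)
    return source, target

src, tgt = to_translation_pair(["北京", "大学"])
# src == ["北", "京", "大", "学"]
# tgt == ["北", "京", "<sep>", "大", "学", "<sep>"]
```

An encoder-decoder model with attention is then trained on such source/target pairs like any other translation task.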
Smart Chinese Reader: A Chinese Language Learning Aid with Web Browser
Smart Chinese Reader is a program based on NLP (natural language processing) technology that helps you learn the Chinese language through deep reading. It provides Chinese word segmentation, Chinese part-of-speech tagging, Chinese-to-English translation, example-sentence search, and text-to-speech conversion. Compared with dictionary apps, it lets you gain more Chinese language knowledge from a text (meanings and usages of Chinese words, patterns and even rhythms of Chinese sentences), rather than just getting through the text. It makes your Chinese learning more effective.
State-of-the-art Chinese Word Segmentation with Bi-LSTMs
A wide variety of neural-network architectures have been proposed for the
task of Chinese word segmentation.
Surprisingly, we find that a bidirectional LSTM model, when combined with
standard deep learning techniques and best practices, can achieve better
accuracy on many of the popular datasets as compared to models based on more
complex neural-network architectures.
Furthermore, our error analysis shows that out-of-vocabulary words remain
challenging for neural-network models, and many of the remaining errors are
unlikely to be fixed through architecture changes.
Instead, more effort should be devoted to exploring resources for further
improvement.
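Bi-LSTM segmenters of this kind typically cast word segmentation as per-character sequence labeling. A minimal sketch of the standard BMES tagging scheme commonly used for this (an illustration, not the paper's code):

```python
def bmes_tags(words):
    """BMES character tagging: B = begin, M = middle, E = end of a
    multi-character word, S = a single-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# "北京 / 的 / 大学生" -> B E S B M E
assert bmes_tags(["北京", "的", "大学生"]) == ["B", "E", "S", "B", "M", "E"]
```

The Bi-LSTM then predicts one such tag per character, and words are recovered by cutting at every E and S.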
Radical-Enhanced Chinese Character Embedding
We present a method to leverage radicals for learning Chinese character
embeddings. A radical is a semantic or phonetic component of a Chinese
character; it plays an important role, as characters with the same radical
usually have similar semantic meanings and grammatical usage. However,
existing Chinese processing algorithms typically regard the word or character
as the basic unit and ignore the crucial radical information. In this paper,
we fill this gap by leveraging radicals for learning continuous
representations of Chinese characters. We develop a dedicated neural
architecture to effectively learn character embeddings and apply it to
Chinese character similarity judgement and Chinese word segmentation.
Experimental results show that our radical-enhanced method outperforms
existing embedding learning algorithms on both tasks.
Comment: 8 pages, 4 figures
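One simple way radical information can enrich a character representation is by appending the radical's vector to the character's own vector. This is a hypothetical sketch with toy lookup tables and 2-dimensional vectors; the paper's actual method is a dedicated neural architecture, not this concatenation.

```python
# Toy radical lookup and embedding tables (hypothetical values).
radical_of = {"河": "氵", "湖": "氵", "想": "心"}
char_emb = {"河": [0.1, 0.2], "湖": [0.3, 0.4], "想": [0.5, 0.6]}
rad_emb = {"氵": [1.0, 0.0], "心": [0.0, 1.0]}

def radical_enhanced(ch):
    """Concatenate a character's own vector with its radical's vector, so
    characters sharing a radical share part of their representation."""
    return char_emb[ch] + rad_emb[radical_of[ch]]

# 河 and 湖 both carry the 氵 (water) radical, so their enhanced
# vectors agree on the radical half.
assert radical_enhanced("河")[2:] == radical_enhanced("湖")[2:]
```

The shared radical half is what encodes the observation that same-radical characters tend to have related meanings.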
Dual Long Short-Term Memory Networks for Sub-Character Representation Learning
Characters have commonly been regarded as the minimal processing unit in
Natural Language Processing (NLP). However, many non-Latin languages have
logographic writing systems with large inventories of thousands of
characters. Each character is composed of even smaller parts, which previous
work has often ignored. In this paper, we propose a novel
architecture employing two stacked Long Short-Term Memory Networks (LSTMs) to
learn sub-character-level representations and capture deeper semantic
meaning. To substantiate the effectiveness of our architecture, we take
Chinese Word Segmentation as a case study: Chinese is a typical such
language, in which every character contains several components called
radicals. Our networks employ a shared radical-level embedding to solve both
Simplified and Traditional Chinese Word Segmentation without extra
Traditional-to-Simplified conversion, an end-to-end design that
significantly simplifies word segmentation compared with previous work.
Radical-level embeddings also capture semantic meaning below the character
level and improve system performance. By tying radical and character
embeddings together, the parameter count is reduced while semantic knowledge
is shared and transferred between the two levels, substantially boosting
performance. On 3 out of 4 Bakeoff 2005 datasets, our method surpassed
state-of-the-art results by up to 0.4%. Our results are reproducible; source
code and corpora are available on GitHub.
Comment: Accepted & forthcoming at ITNG-201
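The embedding-tying idea can be illustrated with a toy sketch (hypothetical component decompositions and vectors, not the paper's implementation): characters draw their vectors from one shared radical-level table, so parameters are stored once per radical and reused by every character that contains it.

```python
# Hypothetical radical decompositions and a shared radical-level table.
components = {"好": ["女", "子"], "妈": ["女", "马"]}
rad_table = {"女": [0.5, 0.25], "子": [0.5, 0.75], "马": [0.0, 0.25]}

def char_vector(ch):
    """Build a character vector as the mean of its radicals' vectors,
    so all parameters live in the single shared radical-level table."""
    vecs = [rad_table[r] for r in components[ch]]
    return [sum(dims) / len(vecs) for dims in zip(*vecs)]

# Only len(rad_table) * 2 = 6 parameters are stored, yet each character
# gets its own vector, and 好 / 妈 share knowledge via the 女 component.
assert char_vector("好") == [0.5, 0.5]
```

Because the table is also valid for radicals appearing in Traditional variants, one set of parameters can serve both scripts, which is the spirit of the shared radical-level embedding described above.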