Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task
Supervised Chinese word segmentation has entered the deep learning era which
reduces the hassle of feature engineering. Recently, some researchers have
treated it as character-level translation, which further simplifies model
design and building, but a performance gap remains between the
translation-based approach and other methods. In this work, we apply the best
practices from low-resource neural machine translation to Chinese word
segmentation. We build encoder-decoder models with attention, and examine a
series of techniques including regularization, data augmentation, objective
weighting, transfer learning and ensembling. Our method is generic for word
segmentation, without the need for feature engineering or model implementation.
In the closed test with constrained data, our method ties with the state of the
art on the MSR dataset and is comparable to other methods on the PKU dataset.
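The translation framing described above can be illustrated with a minimal sketch (a hypothetical example, not the paper's code; the `<sep>` boundary token is an assumption): segmentation becomes a character-level "translation" from a raw character sequence to the same characters interleaved with word-boundary tokens.

```python
def to_translation_pair(segmented_words, boundary="<sep>"):
    """Turn a gold segmentation into a character-level 'translation' pair:
    source = raw character sequence, target = characters plus boundary tokens."""
    source = [ch for word in segmented_words for ch in word]
    target = []
    for word in segmented_words:
        target.extend(word)
        target.append(boundary)
    return source, target

src, tgt = to_translation_pair(["北京", "大学"])
# src == ["北", "京", "大", "学"]
# tgt == ["北", "京", "<sep>", "大", "学", "<sep>"]
```

An encoder-decoder model with attention is then trained on such source/target pairs like any other translation task.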
Smart Chinese Reader: A Chinese Language Learning Aid with Web Browser
Smart Chinese Reader is a program based on NLP (natural language processing) technology that helps you learn the Chinese language through deep reading. It provides Chinese word segmentation, Chinese part-of-speech tagging, Chinese-to-English translation, example-sentence search, and text-to-speech conversion. Compared with dictionary apps, it lets you gain more Chinese language knowledge from a text (meanings and usages of Chinese words, patterns and even rhythms of Chinese sentences), rather than just getting through the text. It makes your Chinese learning more effective.
State-of-the-art Chinese Word Segmentation with Bi-LSTMs
A wide variety of neural-network architectures have been proposed for the
task of Chinese word segmentation.
Surprisingly, we find that a bidirectional LSTM model, when combined with
standard deep learning techniques and best practices, can achieve better
accuracy on many of the popular datasets as compared to models based on more
complex neural-network architectures.
Furthermore, our error analysis shows that out-of-vocabulary words remain
challenging for neural-network models, and many of the remaining errors are
unlikely to be fixed through architecture changes.
Instead, more effort should be devoted to exploring resources for further
improvement.
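Bi-LSTM segmenters of this kind typically cast word segmentation as per-character sequence labeling. A minimal sketch of the standard BMES tagging scheme commonly used for this (an illustration, not the paper's code):

```python
def bmes_tags(words):
    """BMES character tagging: B = begin, M = middle, E = end of a
    multi-character word, S = a single-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# "北京 / 的 / 大学生" -> B E S B M E
assert bmes_tags(["北京", "的", "大学生"]) == ["B", "E", "S", "B", "M", "E"]
```

The Bi-LSTM then predicts one such tag per character, and words are recovered by cutting at every E and S.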
Radical-Enhanced Chinese Character Embedding
We present a method to leverage radicals for learning Chinese character
embeddings. A radical is a semantic or phonetic component of a Chinese
character; it plays an important role, as characters with the same radical
usually have similar semantic meanings and grammatical usage. However,
existing Chinese processing algorithms typically regard the word or character
as the basic unit and ignore the crucial radical information. In this paper,
we fill this gap by leveraging radicals for learning continuous
representations of Chinese characters. We develop a dedicated neural
architecture to effectively learn character embeddings and apply it to
Chinese character similarity judgement and Chinese word segmentation.
Experimental results show that our radical-enhanced method outperforms
existing embedding learning algorithms on both tasks.
Comment: 8 pages, 4 figures
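One simple way radical information can enrich a character representation is by appending the radical's vector to the character's own vector. This is a hypothetical sketch with toy lookup tables and 2-dimensional vectors; the paper's actual method is a dedicated neural architecture, not this concatenation.

```python
# Toy radical lookup and embedding tables (hypothetical values).
radical_of = {"河": "氵", "湖": "氵", "想": "心"}
char_emb = {"河": [0.1, 0.2], "湖": [0.3, 0.4], "想": [0.5, 0.6]}
rad_emb = {"氵": [1.0, 0.0], "心": [0.0, 1.0]}

def radical_enhanced(ch):
    """Concatenate a character's own vector with its radical's vector, so
    characters sharing a radical share part of their representation."""
    return char_emb[ch] + rad_emb[radical_of[ch]]

# 河 and 湖 both carry the 氵 (water) radical, so their enhanced
# vectors agree on the radical half.
assert radical_enhanced("河")[2:] == radical_enhanced("湖")[2:]
```

The shared radical half is what encodes the observation that same-radical characters tend to have related meanings.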
Dual Long Short-Term Memory Networks for Sub-Character Representation Learning
Characters have commonly been regarded as the minimal processing unit in
Natural Language Processing (NLP). However, many non-Latin languages have
logographic writing systems with large inventories of thousands of
characters. Each character is composed of even smaller parts, which previous
work has often ignored. In this paper, we propose a novel
architecture employing two stacked Long Short-Term Memory Networks (LSTMs) to
learn sub-character-level representations and capture deeper semantic
meaning. To substantiate the effectiveness of our architecture, we take
Chinese Word Segmentation as a case study: Chinese is a typical such
language, in which every character contains several components called
radicals. Our networks employ a shared radical-level embedding to solve both
Simplified and Traditional Chinese Word Segmentation without extra
Traditional-to-Simplified conversion, an end-to-end design that
significantly simplifies word segmentation compared with previous work.
Radical-level embeddings also capture semantic meaning below the character
level and improve system performance. By tying radical and character
embeddings together, the parameter count is reduced while semantic knowledge
is shared and transferred between the two levels, substantially boosting
performance. On 3 out of 4 Bakeoff 2005 datasets, our method surpassed
state-of-the-art results by up to 0.4%. Our results are reproducible; source
code and corpora are available on GitHub.
Comment: Accepted & forthcoming at ITNG-201
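The embedding-tying idea can be illustrated with a toy sketch (hypothetical component decompositions and vectors, not the paper's implementation): characters draw their vectors from one shared radical-level table, so parameters are stored once per radical and reused by every character that contains it.

```python
# Hypothetical radical decompositions and a shared radical-level table.
components = {"好": ["女", "子"], "妈": ["女", "马"]}
rad_table = {"女": [0.5, 0.25], "子": [0.5, 0.75], "马": [0.0, 0.25]}

def char_vector(ch):
    """Build a character vector as the mean of its radicals' vectors,
    so all parameters live in the single shared radical-level table."""
    vecs = [rad_table[r] for r in components[ch]]
    return [sum(dims) / len(vecs) for dims in zip(*vecs)]

# Only len(rad_table) * 2 = 6 parameters are stored, yet each character
# gets its own vector, and 好 / 妈 share knowledge via the 女 component.
assert char_vector("好") == [0.5, 0.5]
```

Because the table is also valid for radicals appearing in Traditional variants, one set of parameters can serve both scripts, which is the spirit of the shared radical-level embedding described above.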