1,725 research outputs found
Neural Word Segmentation with Rich Pretraining
Neural word segmentation research has benefited from large-scale raw texts by
leveraging them for pretraining character and word embeddings. On the other
hand, statistical segmentation research has exploited richer sources of
external information, such as punctuation, automatic segmentation and POS. We
investigate the effectiveness of a range of external training sources for
neural word segmentation by building a modular segmentation model, pretraining
the most important submodule using rich external sources. Results show that
such pretraining significantly improves the model, leading to accuracies
competitive to the best methods on six benchmarks.Comment: Accepted by ACL 201
Handwritten Character Recognition of South Indian Scripts: A Review
Handwritten character recognition is always a frontier area of research in
the field of pattern recognition and image processing and there is a large
demand for OCR on hand written documents. Even though, sufficient studies have
performed in foreign scripts like Chinese, Japanese and Arabic characters, only
a very few work can be traced for handwritten character recognition of Indian
scripts especially for the South Indian scripts. This paper provides an
overview of offline handwritten character recognition in South Indian Scripts,
namely Malayalam, Tamil, Kannada and Telungu.Comment: Paper presented on the "National Conference on Indian Language
Computing", Kochi, February 19-20, 2011. 6 pages, 5 figure
Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF
We present a character-based model for joint segmentation and POS tagging for Chinese. The bidirectional RNN-CRF architecture for general sequence tagging is adapted and applied with novel vector representations of Chinese characters that capture rich contextual information and lower-than-character level features. The proposed model is extensively evaluated and compared with a state-of-the-art tagger respectively on CTB5, CTB9 and UD Chinese. The experimental results indicate that our model is accurate and robust across datasets in different sizes, genres and annotation schemes. We obtain state-of-the-art performance on CTB5, achieving 94.38 F1-score for joint segmentation and POS tagging.Peer reviewe
- …