3 research outputs found

State-of-the-Art Vietnamese Word Segmentation

Authors: Cong Song Nguyen Duc, Jiamthapthaksin Rachsuda, Ngo Quoc Hung
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 18/06/2019

Word segmentation is the first step of any task in Vietnamese language processing. This paper reviews state-of-the-art approaches and systems for word segmentation in Vietnamese. To give an overview of all stages from building corpora to developing toolkits, we discuss the corpus-building stage, the approaches applied to word segmentation, and the existing toolkits for segmenting words in Vietnamese sentences. In addition, this study clearly shows the motivations for building corpora and applying machine learning techniques to improve the accuracy of Vietnamese word segmentation. Based on our observations, this study also reports some achievements and limitations of existing Vietnamese word segmentation systems.

Comment: 2016 2nd International Conference on Science in Information Technology (ICSITech

arXiv.org e-Print Archive

A Subword Guided Neural Word Segmentation Model for Sindhi

Authors: Ali Wazir, Kumar Jay, Kumar Rajesh, Lu Junyu, Luo Congjian, Ren Yazhou, Shao Junming, Xu Zenglin
Publication date: 30/12/2020

Deep neural networks employ multiple processing layers for learning text representations to alleviate the burden of manual feature engineering in Natural Language Processing (NLP). Such text representations are widely used to extract features from unlabeled data. Word segmentation is a fundamental and unavoidable prerequisite for many languages. Sindhi is an under-resourced language whose segmentation is challenging because it exhibits space omission and space insertion issues and lacks a labeled corpus for segmentation. In this paper, we investigate supervised Sindhi Word Segmentation (SWS) using unlabeled data with a Subword Guided Neural Word Segmenter (SGNWS) for Sindhi. To learn text representations, we incorporate subword representations into a recurrent neural architecture to capture word information at the morphemic level; the architecture takes advantage of Bidirectional Long-Short Term Memory (BiLSTM), a self-attention mechanism, and a Conditional Random Field (CRF). Our proposed SGNWS model achieves an F1 value of 98.51% without relying on feature engineering. The empirical results demonstrate the benefits of the proposed model over the existing Sindhi word segmenters.

Comment: Journal Paper, 16 page

arXiv.org e-Print Archive

Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures

Authors: Gutkin Alexander, Jansche Martin, Merkulova Tatiana
Publication date: 04/05/2020

The use of linguistic typological resources in natural language processing has been steadily gaining popularity. It has been observed that the use of typological information, often combined with distributed language representations, leads to significantly more powerful models. While linguistic typology representations from various resources have mostly been used for conditioning the models, relatively little attention has been paid to predicting features from these resources from the input data. In this paper we investigate whether the various linguistic features from the World Atlas of Language Structures (WALS) can be reliably inferred from multilingual text. Such a predictor can be used to infer structural features for a language never observed in training data. We frame this task as multi-label classification that involves predicting a set of non-mutually exclusive and extremely sparse multi-valued labels (WALS features). We construct a recurrent neural network predictor based on byte embeddings and convolutional layers and test its performance on 556 languages, providing analysis for various linguistic types, macro-areas, language families, and individual features. We show that some features from various linguistic types can be predicted reliably.

Comment: Originally prepared as a conference submission to EMNLP 201

arXiv.org e-Print Archive