3 research outputs found
State-of-the-Art Vietnamese Word Segmentation
Word segmentation is the first step of any task in Vietnamese language
processing. This paper reviews state-of-the-art approaches and systems for word
segmentation in Vietnamese. To give an overview of all stages, from building
corpora to developing toolkits, we discuss corpus construction, the
approaches applied to word segmentation, and the existing toolkits for
segmenting words in Vietnamese sentences. In addition, this study sets out
the motivations for building corpora and applying machine learning techniques
to improve the accuracy of Vietnamese word segmentation. Based on our
observations, this study also reports some achievements and limitations of
existing Vietnamese word segmentation systems.
Comment: 2016 2nd International Conference on Science in Information
Technology (ICSITech
A Subword Guided Neural Word Segmentation Model for Sindhi
Deep neural networks employ multiple processing layers for learning text
representations to alleviate the burden of manual feature engineering in
Natural Language Processing (NLP). Such text representations are widely used to
extract features from unlabeled data. Word segmentation is a fundamental
and unavoidable prerequisite for many languages. Sindhi is an under-resourced
language whose segmentation is challenging because it exhibits space omission
and space insertion issues and lacks a labeled corpus for segmentation. In this
paper, we investigate supervised Sindhi Word Segmentation (SWS) using unlabeled
data with a Subword Guided Neural Word Segmenter (SGNWS) for Sindhi. In order
to learn text representations, we incorporate subword representations into a
recurrent neural architecture to capture word information at the morphemic level;
the architecture combines Bidirectional Long Short-Term Memory (BiLSTM), a
self-attention mechanism, and a Conditional Random Field (CRF). Our proposed
SGNWS model achieves an F1 value of 98.51% without relying on feature
engineering. The empirical results demonstrate the benefits of the proposed
model over the existing Sindhi word segmenters.
Comment: Journal Paper, 16 page
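Note: neural segmenters of this kind are commonly framed as per-character sequence labelling and scored with a word-level F1. The sketch below illustrates that framing only; the abstract does not state the tag scheme, so the B/I/E/S tags, the function names, and the Latin placeholder characters are all assumptions and not the authors' implementation.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character tags (assumed scheme: B, I, E, S)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends on E (end) or S (single)
            words.append(current)
            current = ""
    if current:  # flush an unterminated word at the end of the input
        words.append(current)
    return words


def word_spans(words):
    """Map a word list to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold_words, pred_words):
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Placeholder characters stand in for Sindhi text.
predicted = tags_to_words("abcde", ["B", "E", "B", "I", "E"])
score = segmentation_f1(["ab", "c", "de"], predicted)
```

Here the predicted segmentation merges two gold words, so only one of the three gold words is recovered exactly, and precision and recall differ.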
Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures
The use of linguistic typological resources in natural language processing
has been steadily gaining popularity. It has been observed that the use of
typological information, often combined with distributed language
representations, leads to significantly more powerful models. While linguistic
typology representations from various resources have mostly been used for
conditioning the models, relatively little attention has been paid to
predicting the features in these resources from the input data. In this paper we
investigate whether the various linguistic features from the World Atlas of
Language Structures (WALS) can be reliably inferred from multi-lingual text.
Such a predictor can be used to infer structural features for a language never
observed in training data. We frame this task as multi-label classification,
which involves predicting a set of non-mutually-exclusive and extremely sparse
multi-valued labels (WALS features). We construct a recurrent neural network
predictor based on byte embeddings and convolutional layers and test its
performance on 556 languages, providing analysis for various linguistic types,
macro-areas, language families and individual features. We show that some
features from various linguistic types can be predicted reliably.
Comment: Originally prepared as a conference submission to EMNLP 201
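Note: the multi-label framing with byte-level input can be pictured with a small data-preparation sketch. It covers only how text might become byte IDs and how a language's known feature values might become a multi-hot target. The label strings, the tiny label inventory, and the function names are illustrative assumptions, not the paper's actual encoding.

```python
def text_to_byte_ids(text):
    """Encode text as UTF-8 byte values (0-255), usable as embedding indices."""
    return list(text.encode("utf-8"))


def multi_hot(observed_labels, label_index):
    """Build a multi-hot target: 1 for each feature value known for a language."""
    target = [0] * len(label_index)
    for label in observed_labels:
        target[label_index[label]] = 1
    return target


# Hypothetical inventory of "feature=value" labels; a real one would be far larger,
# and most languages would have values for only a small part of it.
labels = ["81A=SOV", "81A=SVO", "87A=Adj-Noun", "87A=Noun-Adj"]
label_index = {label: i for i, label in enumerate(labels)}

byte_ids = text_to_byte_ids("né")
target = multi_hot(["81A=SVO", "87A=Noun-Adj"], label_index)
```

Labels from different features are not mutually exclusive, which is why the target is a multi-hot vector and not a single class index.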