74 research outputs found
Efficient Convolutional Neural Networks for Diacritic Restoration
Diacritic restoration has gained importance with the growing need for
machines to understand written texts. The task is typically modeled as a
sequence labeling problem and currently Bidirectional Long Short Term Memory
(BiLSTM) models provide state-of-the-art results. Recently, Bai et al. (2018)
show the advantages of Temporal Convolutional Neural Networks (TCN) over
Recurrent Neural Networks (RNN) for sequence modeling in terms of performance
and computational resources. As diacritic restoration benefits from both
previous as well as subsequent timesteps, we further apply and evaluate a
variant of TCN, Acausal TCN (A-TCN), which incorporates context from both
directions (previous and future) rather than strictly incorporating previous
context as in the case of TCN. A-TCN yields significant improvement over TCN
for diacritization in three different languages: Arabic, Yoruba, and
Vietnamese. Furthermore, A-TCN and BiLSTM have comparable performance, making
A-TCN an efficient alternative over BiLSTM since convolutions can be trained in
parallel. A-TCN is significantly faster than BiLSTM at inference time
(270%-334% improvement in the amount of text diacritized per minute).Comment: accepted in EMNLP 201
Neural Arabic Text Diacritization: State of the Art Results and a Novel Approach for Machine Translation
In this work, we present several deep learning models for the automatic
diacritization of Arabic text. Our models are built using two main approaches,
viz. Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN),
with several enhancements such as 100-hot encoding, embeddings, Conditional
Random Field (CRF) and Block-Normalized Gradient (BNG). The models are tested
on the only freely available benchmark dataset and the results show that our
models are either better or on par with other models, which require
language-dependent post-processing steps, unlike ours. Moreover, we show that
diacritics in Arabic can be used to enhance the models of NLP tasks such as
Machine Translation (MT) by proposing the Translation over Diacritization (ToD)
approach.Comment: 18 pages, 17 figures, 14 table
Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text
Automatic Arabic diacritization is useful in many applications, ranging from
reading support for language learners to accurate pronunciation predictor for
downstream tasks like speech synthesis. While most of the previous works
focused on models that operate on raw non-diacritized text, production systems
can gain accuracy by first letting humans partly annotate ambiguous words. In
this paper, we propose 2SDiac, a multi-source model that can effectively
support optional diacritics in input to inform all predictions. We also
introduce Guided Learning, a training scheme to leverage given diacritics in
input with different levels of random masking. We show that the provided hints
during test affect more output positions than those annotated. Moreover,
experiments on two common benchmarks show that our approach i) greatly
outperforms the baseline also when evaluated on non-diacritized text; and ii)
achieves state-of-the-art results while reducing the parameter count by over
60%.Comment: Arabic text diacritization, partially-diacritized text, Arabic
natural language processin
Exploiting Arabic Diacritization for High Quality Automatic Annotation
International audienceWe present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined
Morphological, syntactic and diacritics rules for automatic diacritization of Arabic sentences
AbstractThe diacritical marks of Arabic language are characters other than letters and are in the majority of cases absent from Arab writings. This paper presents a hybrid system for automatic diacritization of Arabic sentences combining linguistic rules and statistical treatments. The used approach is based on four stages. The first phase consists of a morphological analysis using the second version of the morphological analyzer Alkhalil Morpho Sys. Morphosyntactic outputs from this step are used in the second phase to eliminate invalid word transitions according to the syntactic rules. Then, the system used in the third stage is a discrete hidden Markov model and Viterbi algorithm to determine the most probable diacritized sentence. The unseen transitions in the training corpus are processed using smoothing techniques. Finally, the last step deals with words not analyzed by Alkhalil analyzer, for which we use statistical treatments based on the letters. The word error rate of our system is around 2.58% if we ignore the diacritic of the last letter of the word and around 6.28% when this diacritic is taken into account
- …