Search CORE

97 research outputs found

Memory-based vocalization of Arabic

Author: Kübler Sandra
Mohamed Emad
Publication venue
Publication date: 01/01/2008
Field of study

The problem of vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without the short vowels, which leads to one written form having several pronunciations with each pronunciation carrying its own meaning(s). In the experiments reported here, we define vocalization as a classification problem in which we decide for each character in the unvocalized word whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that the combination of using memory-based learning with only a word internal context leads to a word error rate of 6.64%. If a lexical context is added, the results deteriorate slowly

Hochschulschriftenserver - Universität Frankfurt am Main

Morphological, syntactic and diacritics rules for automatic diacritization of Arabic sentences

Author: Chennoufi Amine
Mazroui Azzeddine
Publication venue: The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.
Publication date: 01/04/2017
Field of study

AbstractThe diacritical marks of Arabic language are characters other than letters and are in the majority of cases absent from Arab writings. This paper presents a hybrid system for automatic diacritization of Arabic sentences combining linguistic rules and statistical treatments. The used approach is based on four stages. The first phase consists of a morphological analysis using the second version of the morphological analyzer Alkhalil Morpho Sys. Morphosyntactic outputs from this step are used in the second phase to eliminate invalid word transitions according to the syntactic rules. Then, the system used in the third stage is a discrete hidden Markov model and Viterbi algorithm to determine the most probable diacritized sentence. The unseen transitions in the training corpus are processed using smoothing techniques. Finally, the last step deals with words not analyzed by Alkhalil analyzer, for which we use statistical treatments based on the letters. The word error rate of our system is around 2.58% if we ignore the diacritic of the last letter of the word and around 6.28% when this diacritic is taken into account

Elsevier - Publisher Connector

Directory of Open Access Journals

Arabic diacritization using weighted finite-state transducers

Author: Nelken Rani
Shieber Stuart
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2005
Field of study

Arabic is usually written without short vowels and additional diacritics, which are nevertheless important for several applications. We present a novel algorithm for restoring these symbols, using a cascade of probabilistic finite- state transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model. This combination of probabilistic methods and simple linguistic information yields high levels of accuracy.Engineering and Applied Science

CiteSeerX

Crossref

Harvard University - DASH

An automatically built named entity lexicon for Arabic

Author: Attia Mohammed
Monachini Monica
Toral Antonio
Tounsi Lamia
van Genabith Josef
Publication venue: European Language Resources Association
Publication date: 01/01/2010
Field of study

We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN’s instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the Lexical Markup Framework (LMF) ISO standard. We conduct a quantitative and qualitative evaluation of the lexicon against a manually annotated gold standard and achieve precision scores from 95.83% (with 66.13% recall) to 99.31% (with 61.45% recall) according to different values of a threshold

CiteSeerX

Irish Universities

DCU Online Research Access Service

Diacritization as a Translation Problem and as a Sequence Labeling Problem

Author: Nguyen T
Schlippe Tim
Vogel S.
Publication venue: AMTA
Publication date: 01/01/2008
Field of study

KITopen

Diacritization as a machine translation problem and as a sequence labeling problem

Author: Nguyen ThuyLinh
Schlippe Tim
Vogel Stephan
Publication venue: Association for Machine Translation in the Americas (AMTA)
Publication date: 03/01/2024
Field of study

In this paper we describe and compare two techniques for the automatic diacritization of Arabic text: First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word and character-based models separately and combined as well as a model which uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem, and propose a solution using conditional random fields (Lafferty et al., 2001). All these techniques are compared through word error rate and diacritization error rate both in terms of full diacritization and ignoring vowel endings. The empirical experiments showed that the machine translation approaches perform better than the sequence labeling approaches concerning the error rates

KITopen