100 research outputs found
Arabic diacritization using weighted finite-state transducers
Arabic is usually written without short vowels and additional diacritics, which are nevertheless important for several applications. We present a novel algorithm for restoring these symbols, using a cascade of probabilistic finite-state transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model. This combination of probabilistic methods and simple linguistic information yields high levels of accuracy.
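The combination of probability sources described above can be sketched in miniature: a toy scorer that interpolates a word-level and a letter-level language model to rank diacritization candidates. This is only an illustration of the idea, not the paper's transducer cascade, and all words and probabilities below are invented.

```python
import math

# Hypothetical unigram word-model and letter-bigram probabilities
# (invented values for illustration only).
word_prob = {"kataba": 0.6, "kutiba": 0.3}
letter_bigram = {("k", "a"): 0.5, ("a", "t"): 0.4, ("t", "a"): 0.4,
                 ("a", "b"): 0.3, ("b", "a"): 0.3,
                 ("k", "u"): 0.3, ("u", "t"): 0.3, ("t", "i"): 0.2,
                 ("i", "b"): 0.2}

def letter_score(word, smooth=1e-4):
    # Sum of log letter-bigram probabilities over the word.
    return sum(math.log(letter_bigram.get(pair, smooth))
               for pair in zip(word, word[1:]))

def score(word, lam=0.7, smooth=1e-4):
    # Log-linear interpolation of the word-level and letter-level models.
    return lam * math.log(word_prob.get(word, smooth)) + (1 - lam) * letter_score(word)

# Rank two candidate diacritizations of the same consonant skeleton.
best = max(["kataba", "kutiba"], key=score)
```

In a real cascade each model would be compiled into a weighted transducer and composed, but the ranking principle is the same: each knowledge source contributes a weighted log-probability.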
Combining speech with textual methods for Arabic diacritization
Spell-checking in Spanish: the case of diacritic accents
This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker’s dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo ‘continuous’ and continuó ‘he/she/it continued’, or when different diacritics make other word distinctions, as in continúo ‘I continue’. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.
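The bigram idea is simple enough to sketch directly: pick the diacritic variant whose bigram with the preceding word is most frequent in a reference corpus. The counts below are invented for illustration; a real system would derive them from a large normative corpus, as the abstract describes.

```python
# Invented bigram counts standing in for a corpus-derived model.
bigram_counts = {
    ("el", "continuo"): 12,   # "el continuo cambio"
    ("él", "continuó"): 9,    # "él continuó hablando"
    ("yo", "continúo"): 7,    # "yo continúo aquí"
}

# Diacritic variants that share one undiacritized key.
variants = {"continuo": ["continuo", "continuó", "continúo"]}

def restore(prev_word, typed):
    # Choose the variant with the highest bigram count given the
    # previous word; fall back to the typed form if none is known.
    cands = variants.get(typed, [typed])
    return max(cands, key=lambda w: bigram_counts.get((prev_word, w), 0))

restore("yo", "continuo")   # picks the first-person form under these toy counts
```

Note that this catches exactly the case the abstract highlights: both continuo and continuó are valid dictionary words, so only context can decide between them.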
Memory-based vocalization of Arabic
The problem of vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without the short vowels, which leads to one written form having several pronunciations, with each pronunciation carrying its own meaning(s). In the experiments reported here, we define vocalization as a classification problem in which we decide, for each character in the unvocalized word, whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that combining memory-based learning with only a word-internal context leads to a word error rate of 6.64%. If a lexical context is added, the results deteriorate slowly.
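The per-character classification setup can be sketched with a tiny nearest-neighbour classifier, since memory-based learning is essentially k-nearest-neighbour over stored training instances. The feature windows and instance base below are invented; a real system would store millions of instances from a vocalized corpus.

```python
def features(word, i, width=2):
    # Word-internal context: the character at position i with `width`
    # characters of left and right context, padded at word edges.
    padded = "_" * width + word + "_" * width
    j = i + width
    return padded[j - width: j + width + 1]

# Invented instance base: (feature window, short vowel following the
# character, "" meaning no vowel).
memory = [
    ("__ktb", "a"), ("_ktb_", "a"), ("ktb__", ""),
    ("__drs", "a"), ("_drs_", "a"), ("drs__", ""),
]

def overlap(a, b):
    # Crude similarity: number of matching feature positions.
    return sum(x == y for x, y in zip(a, b))

def classify(word, i):
    # 1-nearest-neighbour: return the class of the most similar instance.
    window = features(word, i)
    return max(memory, key=lambda inst: overlap(inst[0], window))[1]

vowels = [classify("ktb", i) for i in range(3)]
```

Each character is classified independently from its window, which is exactly the "word-internal context only" condition the abstract reports on.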
Normalization of Dutch user-generated content
Abstract This paper describes a phrase-based machine translation approach to normalize Dutch user-generated content (UGC). We compiled a corpus of three different social media genres (text messages, message board posts and tweets) to have a sample of this recent domain. We describe the various characteristics of this noisy text material and explain how it has been manually normalized using newly developed guidelines. For the automatic normalization task we focus on text messages, and find that a cascaded SMT system where a token-based module is followed by a translation at the character level gives the best word error rate reduction. After these initial experiments, we investigate the system's robustness on the complete domain of UGC by testing it on the other two social media genres, and find that the cascaded approach performs best on these genres as well. To our knowledge, we deliver the first proof-of-concept system for Dutch UGC normalization, which can serve as a baseline for future work.
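The cascade structure (token-level pass first, character-level pass second) can be sketched with simple substitution rules in place of the paper's SMT modules. The mappings below are invented Dutch-texting examples, not the system's learned phrase tables.

```python
# Stage 1: token-level mappings for frequent texting abbreviations
# (invented examples).
token_map = {"ff": "even", "iig": "in ieder geval"}

# Stage 2: character-level rewrites catching remaining spelling
# deviations (an invented toy rule).
char_rules = [("gt", "cht")]

def normalize(sentence):
    tokens = []
    for tok in sentence.split():
        if tok in token_map:
            # Token-level module handles the whole token at once.
            tokens.append(token_map[tok])
        else:
            # Character-level pass rewrites substrings in what remains.
            out = tok
            for src, dst in char_rules:
                out = out.replace(src, dst)
            tokens.append(out)
    return " ".join(tokens)

normalize("ff egt wachten")
```

The division of labour mirrors the cascade in the abstract: whole-token translations cover abbreviations the character model could not invent, while the character-level pass generalizes to unseen words.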
Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging
Semitic languages can be highly ambiguous, having several interpretations of the same surface forms, and morphologically rich, having many morphemes that realize several morphological features. This is further exacerbated for dialectal content, which is more prone to noise and lacks a standard orthography. The morphological features can be lexicalized, like lemmas and diacritized forms, or non-lexicalized, like gender, number, and part-of-speech tags, among others. Joint modeling of the lexicalized and non-lexicalized features can identify more intricate morphological patterns, which provide better context modeling, and further disambiguate ambiguous lexical choices. However, the different modeling granularity can make joint modeling more difficult. Our approach models the different features jointly, whether lexicalized (on the character level), where we also model surface form normalization, or non-lexicalized (on the word level). We use Arabic as a test case, and achieve state-of-the-art results for Modern Standard Arabic, with 20% relative error reduction, and Egyptian Arabic (a dialectal variant of Arabic), with 11% reduction.
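Why joint modeling of lexicalized and non-lexicalized features helps can be shown with a toy example: the independently best diacritized form and the independently best POS tag may form an incoherent pair, while a joint score prefers a consistent analysis. All analyses and scores below are invented and have no relation to the paper's neural architecture.

```python
# Candidate analyses of one ambiguous undiacritized form:
# (diacritized form, POS tag, form score, tag score, consistency bonus).
# Invented values for illustration.
analyses = [
    ("katab", "VERB", 0.5, 0.3, 0.3),   # coherent verb reading
    ("kutub", "NOUN", 0.4, 0.5, 0.3),   # coherent noun reading
    ("katab", "NOUN", 0.5, 0.5, 0.0),   # incoherent mix of the two
]

def joint_best(cands):
    # Score the lexicalized and non-lexicalized features together,
    # rewarding analyses where the pair is consistent.
    return max(cands, key=lambda a: a[2] + a[3] + a[4])

best = joint_best(analyses)
```

Picking features independently would combine the highest form score (katab, 0.5) with the highest tag score (NOUN, 0.5) and land on the incoherent third analysis; the joint score instead selects the consistent noun reading.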
- …