100 research outputs found
Arabic diacritization using weighted finite-state transducers
Arabic is usually written without short vowels and additional diacritics, which are nevertheless important for several applications. We present a novel algorithm for restoring these symbols, using a cascade of probabilistic finite-state transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model. This combination of probabilistic methods and simple linguistic information yields high levels of accuracy.
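The combination of probability sources described above can be sketched in miniature: a toy scorer that interpolates a word-level and a letter-level language model to rank diacritization candidates. This is only an illustration of the idea, not the paper's transducer cascade, and all words and probabilities below are invented.

```python
import math

# Hypothetical unigram word-model and letter-bigram probabilities
# (invented values for illustration only).
word_prob = {"kataba": 0.6, "kutiba": 0.3}
letter_bigram = {("k", "a"): 0.5, ("a", "t"): 0.4, ("t", "a"): 0.4,
                 ("a", "b"): 0.3, ("b", "a"): 0.3,
                 ("k", "u"): 0.3, ("u", "t"): 0.3, ("t", "i"): 0.2,
                 ("i", "b"): 0.2}

def letter_score(word, smooth=1e-4):
    # Sum of log letter-bigram probabilities over the word.
    return sum(math.log(letter_bigram.get(pair, smooth))
               for pair in zip(word, word[1:]))

def score(word, lam=0.7, smooth=1e-4):
    # Log-linear interpolation of the word-level and letter-level models.
    return lam * math.log(word_prob.get(word, smooth)) + (1 - lam) * letter_score(word)

# Rank two candidate diacritizations of the same consonant skeleton.
best = max(["kataba", "kutiba"], key=score)
```

In a real cascade each model would be compiled into a weighted transducer and composed, but the ranking principle is the same: each knowledge source contributes a weighted log-probability.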
Combining speech with textual methods for Arabic diacritization
Spell-checking in Spanish: the case of diacritic accents
This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker’s dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo ‘continuous’ and continuó ‘he/she/it continued’, or when different diacritics make other word distinctions, as in continúo ‘I continue’. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.
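The bigram idea is simple enough to sketch directly: pick the diacritic variant whose bigram with the preceding word is most frequent in a reference corpus. The counts below are invented for illustration; a real system would derive them from a large normative corpus, as the abstract describes.

```python
# Invented bigram counts standing in for a corpus-derived model.
bigram_counts = {
    ("el", "continuo"): 12,   # "el continuo cambio"
    ("él", "continuó"): 9,    # "él continuó hablando"
    ("yo", "continúo"): 7,    # "yo continúo aquí"
}

# Diacritic variants that share one undiacritized key.
variants = {"continuo": ["continuo", "continuó", "continúo"]}

def restore(prev_word, typed):
    # Choose the variant with the highest bigram count given the
    # previous word; fall back to the typed form if none is known.
    cands = variants.get(typed, [typed])
    return max(cands, key=lambda w: bigram_counts.get((prev_word, w), 0))

restore("yo", "continuo")   # picks the first-person form under these toy counts
```

Note that this catches exactly the case the abstract highlights: both continuo and continuó are valid dictionary words, so only context can decide between them.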
Memory-based vocalization of Arabic
The problem of vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without the short vowels, which leads to one written form having several pronunciations, with each pronunciation carrying its own meaning(s). In the experiments reported here, we define vocalization as a classification problem in which we decide, for each character in the unvocalized word, whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that combining memory-based learning with only a word-internal context leads to a word error rate of 6.64%. If a lexical context is added, the results deteriorate slowly.
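The per-character classification setup can be sketched with a tiny nearest-neighbour classifier, since memory-based learning is essentially k-nearest-neighbour over stored training instances. The feature windows and instance base below are invented; a real system would store millions of instances from a vocalized corpus.

```python
def features(word, i, width=2):
    # Word-internal context: the character at position i with `width`
    # characters of left and right context, padded at word edges.
    padded = "_" * width + word + "_" * width
    j = i + width
    return padded[j - width: j + width + 1]

# Invented instance base: (feature window, short vowel following the
# character, "" meaning no vowel).
memory = [
    ("__ktb", "a"), ("_ktb_", "a"), ("ktb__", ""),
    ("__drs", "a"), ("_drs_", "a"), ("drs__", ""),
]

def overlap(a, b):
    # Crude similarity: number of matching feature positions.
    return sum(x == y for x, y in zip(a, b))

def classify(word, i):
    # 1-nearest-neighbour: return the class of the most similar instance.
    window = features(word, i)
    return max(memory, key=lambda inst: overlap(inst[0], window))[1]

vowels = [classify("ktb", i) for i in range(3)]
```

Each character is classified independently from its window, which is exactly the "word-internal context only" condition the abstract reports on.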
Normalization of Dutch user-generated content
Abstract This paper describes a phrase-based machine translation approach to normalize Dutch user-generated content (UGC). We compiled a corpus of three different social media genres (text messages, message board posts and tweets) to have a sample of this recent domain. We describe the various characteristics of this noisy text material and explain how it has been manually normalized using newly developed guidelines. For the automatic normalization task we focus on text messages, and find that a cascaded SMT system where a token-based module is followed by a translation at the character level gives the best word error rate reduction. After these initial experiments, we investigate the system's robustness on the complete domain of UGC by testing it on the other two social media genres, and find that the cascaded approach performs best on these genres as well. To our knowledge, we deliver the first proof-of-concept system for Dutch UGC normalization, which can serve as a baseline for future work.
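The cascade structure (token-level pass first, character-level pass second) can be sketched with simple substitution rules in place of the paper's SMT modules. The mappings below are invented Dutch-texting examples, not the system's learned phrase tables.

```python
# Stage 1: token-level mappings for frequent texting abbreviations
# (invented examples).
token_map = {"ff": "even", "iig": "in ieder geval"}

# Stage 2: character-level rewrites catching remaining spelling
# deviations (an invented toy rule).
char_rules = [("gt", "cht")]

def normalize(sentence):
    tokens = []
    for tok in sentence.split():
        if tok in token_map:
            # Token-level module handles the whole token at once.
            tokens.append(token_map[tok])
        else:
            # Character-level pass rewrites substrings in what remains.
            out = tok
            for src, dst in char_rules:
                out = out.replace(src, dst)
            tokens.append(out)
    return " ".join(tokens)

normalize("ff egt wachten")
```

The division of labour mirrors the cascade in the abstract: whole-token translations cover abbreviations the character model could not invent, while the character-level pass generalizes to unseen words.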
Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging
Semitic languages can be highly ambiguous, having several interpretations of the same surface forms, and morphologically rich, having many morphemes that realize several morphological features. This is further exacerbated for dialectal content, which is more prone to noise and lacks a standard orthography. The morphological features can be lexicalized, like lemmas and diacritized forms, or non-lexicalized, like gender, number, and part-of-speech tags, among others. Joint modeling of the lexicalized and non-lexicalized features can identify more intricate morphological patterns, which provide better context modeling, and further disambiguate ambiguous lexical choices. However, the different modeling granularity can make joint modeling more difficult. Our approach models the different features jointly, whether lexicalized (on the character level), where we also model surface form normalization, or non-lexicalized (on the word level). We use Arabic as a test case, and achieve state-of-the-art results for Modern Standard Arabic, with 20% relative error reduction, and Egyptian Arabic (a dialectal variant of Arabic), with 11% reduction.
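Why joint modeling of lexicalized and non-lexicalized features helps can be shown with a toy example: the independently best diacritized form and the independently best POS tag may form an incoherent pair, while a joint score prefers a consistent analysis. All analyses and scores below are invented and have no relation to the paper's neural architecture.

```python
# Candidate analyses of one ambiguous undiacritized form:
# (diacritized form, POS tag, form score, tag score, consistency bonus).
# Invented values for illustration.
analyses = [
    ("katab", "VERB", 0.5, 0.3, 0.3),   # coherent verb reading
    ("kutub", "NOUN", 0.4, 0.5, 0.3),   # coherent noun reading
    ("katab", "NOUN", 0.5, 0.5, 0.0),   # incoherent mix of the two
]

def joint_best(cands):
    # Score the lexicalized and non-lexicalized features together,
    # rewarding analyses where the pair is consistent.
    return max(cands, key=lambda a: a[2] + a[3] + a[4])

best = joint_best(analyses)
```

Picking features independently would combine the highest form score (katab, 0.5) with the highest tag score (NOUN, 0.5) and land on the incoherent third analysis; the joint score instead selects the consistent noun reading.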
- …