2 research outputs found

    Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora

    Morphological analyzers are preprocessors for text analysis, and many text analytics applications need them to perform their tasks. The aim of this thesis is to develop standards, tools and resources that widen the scope of Arabic word structure analysis, particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text. We want to morphologically tag our Arabic Corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is required. Tag assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a subtag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis, particularly for probabilistic taggers that require training data, if some words can change grammatical tag depending on function and context; on the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context. The SALMA – Tagger is a fine-grained morphological analyzer that depends mainly on linguistic information extracted from traditional Arabic grammar books and on prior-knowledge, broad-coverage lexical resources, namely the SALMA – ABCLexicon. More fine-grained tag sets may be more appropriate for some tasks. The SALMA – Tag Set is a theory standard for encoding, which captures long-established traditional fine-grained morphological features of Arabic in a notation format intended to be compact yet transparent. The SALMA – Tagger has been used to lemmatize the 176-million-word Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an with syllable and primary stress information, as well as for fine-grained morphological tagging.
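The idea of one subtag per morpheme, rather than one tag per word, can be illustrated with a minimal Python sketch. The class names, the ASCII transliteration, and the subtag labels below are all hypothetical; they are not the SALMA – Tag Set notation, which the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import List

# The five morpheme slots named in the abstract, in word order.
SLOTS = ("proclitic", "prefix", "stem", "suffix", "enclitic")


@dataclass
class Morpheme:
    slot: str    # one of SLOTS
    form: str    # surface form, in a rough ASCII transliteration
    subtag: str  # hypothetical label, NOT an actual SALMA tag

    def __post_init__(self) -> None:
        if self.slot not in SLOTS:
            raise ValueError(f"unknown slot: {self.slot}")


@dataclass
class AnalyzedWord:
    morphemes: List[Morpheme]

    def surface(self) -> str:
        """Rebuild the word by concatenating its morphemes."""
        return "".join(m.form for m in self.morphemes)

    def tag(self) -> str:
        """Word-level tag composed of one subtag per morpheme."""
        return "+".join(m.subtag for m in self.morphemes)


# Roughly "wa-ya-ktub-uuna-haa" ("and they write it"), segmented by hand.
word = AnalyzedWord([
    Morpheme("proclitic", "wa", "CONJ"),
    Morpheme("prefix", "ya", "IMPF"),
    Morpheme("stem", "ktub", "VERB_STEM"),
    Morpheme("suffix", "uuna", "MASC_PL"),
    Morpheme("enclitic", "haa", "OBJ_PRON"),
])

print(word.surface())
print(word.tag())
```

A real analyzer would also have to handle empty slots, multiple clitics in one slot, and competing segmentations of the same word, none of which this sketch attempts.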

    Extraction of Arabic word roots: An Approach Based on Computational Model and Multi-Backpropagation Neural Networks

    Stemming is the process of extracting the root of a given word by stripping off the affixes attached to it. Many attempts have been made to address the problem of stemming Arabic words. The majority of the existing Arabic stemming algorithms require a complete set of morphological rules and large vocabulary lookup tables. Furthermore, many of them give more than one potential stem or root for a given Arabic word. According to Ahmad [11], Arabic stemming based on the morphological rules of the language is still a very difficult task due to the nature of the language itself. The limitations of the current Arabic stemming methods have motivated this research, in which we investigate a novel approach to extracting the word roots of the Arabic language, named here MUAIDI-STEMMER (Muaidi is the author's father's name). This approach attempts to exploit numerical relations between Arabic letters, avoiding the need for a list of the root and pattern of each word in the language, and giving a single root solution. The approach is composed of two phases. Phase I depends on basic calculations extracted from linguistic analysis of Arabic patterns and affixes. Phase II is based on an artificial neural network trained by the backpropagation learning rule. In this proposed phase, we formulate the root extraction problem as a classification problem and use the neural network as a classifier. This study demonstrates that a neural network can be effectively used to extract the word roots of the Arabic language. The stemmer developed is tested using 46,895 Arabic word types (that is, distinct or unique words). Error counting accuracy evaluation was employed to evaluate the performance of the stemmer. It was successful in producing the stems of 44,107 Arabic words from the given test datasets with accuracy of 94.81%.
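The general notion of stemming by stripping affixes, as defined at the start of the abstract, can be sketched in a few lines of Python. This is a naive illustration and not the MUAIDI-STEMMER: that approach relies on numerical relations between letters and a neural network, neither of which is reproduced here. The affix lists, the `min_len` guard, and the function name are assumptions chosen for the example.

```python
# Tiny, hypothetical affix lists; a real stemmer would need far more.
PREFIXES = ["وال", "ال", "و"]     # "and the", "the", "and"
SUFFIXES = ["ون", "ات", "ها", "ة"]  # common plural / pronoun / feminine endings


def strip_affixes(word: str, min_len: int = 3) -> str:
    """Remove at most one prefix and one suffix, longest match first.

    An affix is only removed if at least `min_len` letters remain,
    which keeps short words such as "ولد" from being truncated.
    """
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


print(strip_affixes("المعلمون"))  # "the teachers"
```

Note that this kind of stripping yields a stem (here the word without its article and plural ending), not the triliteral root; recovering the root also requires undoing the pattern applied to it, which is the harder problem the abstract addresses.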