research

Linguistically informed and corpus informed morphological analysis of Arabic

Abstract

Standard English PoS-taggers generally involve tag-assignment (via dictionary-lookup etc) followed by tag-disambiguation (via a context model, e.g. PoS-ngrams or Brill transformations). We want to PoS-tag our Arabic Corpus, but evaluation of existing PoS-taggers has highlighted shortcomings; in particular, about a quarter of all word tokens are not assigned a fully correct morphological analysis. Tag-assignment is significantly more complex for Arabic. An Arabic lemmatiser program can extract the stem or root, but this is not enough for full PoS-tagging; words should be decomposed into five parts: proclitics, prefixes, stem or root, suffixes and postclitics. The morphological analyser should then add the appropriate linguistic information to each of these parts of the word; in effect, instead of a tag for a word, we need a subtag for each part (and possibly multiple subtags if there are multiple proclitics, prefixes, suffixes and postclitics). Many challenges face the implementation of Arabic morphology, the rich “root-and-pattern” nonconcatenative (or nonlinear) morphology and the highly complex word formation process of root and patterns, especially if one or two long vowels are part of the root letters. Moreover, the orthographic issues of Arabic such as short vowels ( َ ُ ِ ), Hamzah (ء أ إ ؤ ئ), Taa’ Marboutah ( ة ) and Ha’ ( ه ), Ya’ ( ي ) and Alif Maksorah( ى ) , Shaddah ( ّ ) or gemination, and Maddah ( آ ) or extension which is a compound letter of Hamzah and Alif ( أا ). Our morphological analyzer uses linguistic knowledge of the language as well as corpora to verify the linguistic information. To understand the problem, we started by analyzing fifteen established Arabic language dictionaries, to build a broad-coverage lexicon which contains not only roots and single words but also multi-word expressions, idioms, collocations requiring special part-of-speech assignment, and words with special part-of-speech tags. The next stage of research was a detailed analysis and classification of Arabic language roots to address the “tail” of hard cases for existing morphological analyzers, and analysis of the roots, word-root combinations and the coverage of each root category of the Qur’an and the word-root information stored in our lexicon. From authoritative Arabic grammar books, we extracted and generated comprehensive lists of affixes, clitics and patterns. These lists were then cross-checked by analyzing words of three corpora: the Qur’an, the Corpus of Contemporary Arabic and Penn Arabic Treebank (as well as our Lexicon, considered as a fourth cross-check corpus). We also developed a novel algorithm that generates the correct pattern of the words, which deals with the orthographic issues of the Arabic language and other word derivation issues, such as the elimination or substitution of root letters

    Similar works