589 research outputs found

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost, time-efficient, scalable and extensible: it is incremental both in processing resources and in generating lexicons. The method proceeds in three steps. First, tokens are drawn from a corpus and lemmatized. Second, finite-state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources have high coverage: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially and non-vocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
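
The final step described above, using a transducer to expand a lemma into its inflected forms with their features, can be illustrated with a minimal Python sketch. It is not the authors' algorithm and does not use the Unitex transducer format: the class `ToyTransducer`, the feature tags such as `+Perf`, and the helper names are hypothetical, and the six suffixes are the usual Modern Standard Arabic perfective endings in a simplified Latin transliteration.

```python
class ToyTransducer:
    """Trie-shaped transducer: reads feature tags, emits a suffix (illustrative only)."""

    def __init__(self):
        self.arcs = {}       # (state, input_tag) -> (output_string, next_state)
        self.finals = set()  # accepting states
        self.n_states = 1    # state 0 is the start state

    def add_path(self, tags, suffix):
        """Add one path; the suffix is emitted on the last transition."""
        state = 0
        for i, tag in enumerate(tags):
            out = suffix if i == len(tags) - 1 else ""
            key = (state, tag)
            if key not in self.arcs:
                self.arcs[key] = (out, self.n_states)
                self.n_states += 1
            state = self.arcs[key][1]
        self.finals.add(state)

    def paths(self):
        """Enumerate every accepting path as (tag_tuple, emitted_suffix)."""
        stack = [(0, (), "")]
        while stack:
            state, tags, out = stack.pop()
            if state in self.finals:
                yield tags, out
            for (src, tag), (emitted, dst) in self.arcs.items():
                if src == state:
                    stack.append((dst, tags + (tag,), out + emitted))


def build_perfective_fst():
    """Hypothetical inflection class: a small subset of the perfective paradigm."""
    fst = ToyTransducer()
    fst.add_path(("+Perf", "+1", "+Sg"), "tu")
    fst.add_path(("+Perf", "+1", "+Pl"), "nā")
    fst.add_path(("+Perf", "+2", "+MascSg"), "ta")
    fst.add_path(("+Perf", "+2", "+FemSg"), "ti")
    fst.add_path(("+Perf", "+3", "+MascSg"), "a")
    fst.add_path(("+Perf", "+3", "+FemSg"), "at")
    return fst


def inflect(stem):
    """Map each feature bundle to the inflected form of the given stem."""
    return {tags: stem + suffix for tags, suffix in build_perfective_fst().paths()}


if __name__ == "__main__":
    for tags, form in sorted(inflect("katab").items()):
        print("".join(tags), form)
```

Because paths that share a prefix of tags share transitions, this six-form fragment needs only 10 transitions. A full verb paradigm is much larger, which is consistent with the abstract's point that transducers with 184 transitions are impractical to draw by hand.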

    APMorph: finite-state transducer for Amazigh pronominal morphology

    Our work presents an Amazigh pronominal morphological analyzer (APMorph) based on Xerox's finite-state transducer (XFST). Our system is built around a large lexicon named "APlex", which covers the pronouns affixed to nouns and to verbs together with the characteristics of each lemma. A set of rules is added to define the inflectional behavior and morphosyntactic links of each entry, as well as the relationships between the different lexical units. The implementation and evaluation of our approach are detailed in this article. XFST remains a relevant choice because the platform supports both analysis and generation. The robustness of our system allows it to be integrated into other natural language processing (NLP) applications, especially spellchecking, machine translation, and machine learning. This paper continues our previous work on the automatic processing of Amazigh nouns and verbs.

    Pattern-and-root inflectional morphology: the Arabic broken plural

    We present a detailed model for describing broken plurals, based on the primacy of the pattern over the root. The model derives the broken plural form from the letters of the singular and its pattern. It has been implemented and tested by encoding 3,200 lexical entries. We paid special attention to the management of language resources and dictionaries, so that the description process becomes better suited to experts in the Arabic language. The model relies on the traditional notions of pattern and root. Compared with traditional morphology, it excludes derivational morphology from the description. As in traditional Arabic dictionaries, the dictionary is organized around its updatable lexical entries, which are fully diacritized. In our model, the morphological analyzer recognizes broken plurals directly in text, relying on a dictionary of fully inflected forms and without morphophonological or orthographic rules. The classification of broken plural forms follows simple, regular and detailed principles. The encoding of singular patterns is simplified to short (v) and long (vv) vowels, without specifying their quality as damma, fatha or kasra. Morphophonological alternations of the root and orthographic variations of weak letters and hamza are encoded directly and independently of the pattern, that is, without restoring the root to its underlying form and without morphophonological rules. Broken plural forms are classified hierarchically by plural pattern, then singular pattern, then weak letters. Quadriliteral broken plurals are limited to 3 patterns subdivided into 70 classes, and triliteral broken plurals to 22 patterns subdivided into 90 classes. These 160 classes become 300 when inflectional and orthographic variations in the singular forms are taken into account.
    We present a substantially implemented model of description of the inflectional morphology of Arabic nouns, with special attention to the management of dictionaries and other language resources by Arabic-speaking linguists. Our model includes broken plurals (BPs), i.e. plurals formed by modifying the stem. It is based on the traditional notions of root and pattern of Semitic morphology. However, as compared to traditional Arabic morphology, it keeps the formal description of inflection separate from that of derivation and semantics. As in traditional Arabic dictionaries, the updatable dictionary is structured in lexical entries for lemmas, and the reference spelling is fully diacritized. In our model, morphological analysis of Arabic text is performed directly with a dictionary of words and without morphophonological rules. Our taxonomy for noun inflection is simple, orderly and detailed. We simplify the taxonomy of singular patterns by specifying vowel quantity as v or vv, and ignoring vowel quality. Root alternations and orthographical variations are encoded independently from patterns and in a factual way, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes when we take into account inflectional variations that affect only the singular. We provide a straightforward encoding scheme that we applied to 3,200 entries of BP nouns.
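
The simplification of singular patterns to vowel quantity (v or vv, ignoring vowel quality) can be illustrated with a short Python sketch. It is not the authors' encoding scheme: the function `skeleton`, the `C` symbol for consonants, and the Latin transliteration (one character per consonant, long vowels written ā, ī, ū) are assumptions made only for this example.

```python
import unicodedata

SHORT_VOWELS = set("aiu")  # fatha, kasra, damma in transliteration
LONG_VOWELS = set("āīū")   # long vowels written with a macron

def skeleton(word: str) -> str:
    """Reduce a transliterated singular to consonant slots and vowel quantity.

    Vowel quality is discarded: any short vowel becomes "v" and any long
    vowel becomes "vv". Every other character is treated as one consonant.
    """
    parts = []
    # NFC keeps a macron vowel as one code point instead of letter + accent.
    for ch in unicodedata.normalize("NFC", word):
        if ch in LONG_VOWELS:
            parts.append("vv")
        elif ch in SHORT_VOWELS:
            parts.append("v")
        else:
            parts.append("C")
    return "".join(parts)

# Class counts reported in the abstract: triliteral plus quadriliteral BP classes.
TRILITERAL_CLASSES = 90
QUADRILITERAL_CLASSES = 70
TOTAL_CLASSES = TRILITERAL_CLASSES + QUADRILITERAL_CLASSES  # 160

if __name__ == "__main__":
    for singular in ("kitāb", "kalb", "rajul"):
        print(singular, skeleton(singular))
```

Under this sketch, singulars that differ only in vowel quality fall into the same skeleton, which is the effect the abstract attributes to ignoring vowel quality when classifying singular patterns.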