6 research outputs found

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The method is low-cost and time-efficient, and it is both scalable and extensible: it is incremental in how it processes resources and in how it generates lexicons. Starting from a corpus, tokens are first drawn and lemmatized. Second, finite-state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources have high coverage: they cover more than 70% of Arabic verbs. The resources contain 16,855 verb lemmas and 11,080,355 fully vocalized, partially vocalized, and unvocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
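
    The final step described above, running a lemma through a transducer to enumerate its inflected forms, can be sketched as follows. This is only an illustrative toy, not the authors' algorithm: the transducer has 11 transitions rather than 184, the stems are given in a rough ASCII transliteration, and the names `TRANSDUCER`, `inflect`, the `<PERF>`/`<IMPF>` stem slots, and the feature codes are all assumptions made for the example. The output lines loosely follow an `inflected,lemma.features` layout.

```python
from typing import Dict, List, Tuple

# One transition: (output string or stem slot, feature fragment, next state).
# "<PERF>" and "<IMPF>" are hypothetical placeholders for the two stems.
Transition = Tuple[str, str, int]

TRANSDUCER: Dict[int, List[Transition]] = {
    0: [("<PERF>", "V:P", 1),      # perfect: stem first, then a suffix
        ("ya", "V:I3ms", 3),       # imperfect: prefix first, then the stem
        ("ta", "V:I3fs", 3),
        ("na", "V:I1p", 3)],
    1: [("a", "3ms", 2), ("at", "3fs", 2), ("tu", "1s", 2), ("naa", "1p", 2)],
    3: [("<IMPF>", "", 4)],
    4: [("u", "", 2)],             # indicative ending
}
FINAL_STATES = {2}


def inflect(lemma: str, stems: Dict[str, str]) -> List[str]:
    """Enumerate every path from state 0 to a final state."""
    results: List[str] = []
    stack = [(0, "", "")]          # (state, form so far, features so far)
    while stack:
        state, form, feats = stack.pop()
        if state in FINAL_STATES:
            results.append(f"{form},{lemma}.{feats}")
            continue
        for output, feature, next_state in TRANSDUCER.get(state, []):
            # Stem slots are replaced by the lemma's stems; other outputs are literal.
            stack.append((next_state, form + stems.get(output, output), feats + feature))
    return sorted(results)


entries = inflect("kataba", {"<PERF>": "katab", "<IMPF>": "ktub"})
for entry in entries:
    print(entry)
```

    Because the transducer is plain data, adding a person, number, or mood means adding transitions rather than code, which is one reason generating such tables automatically is attractive once they grow to well over a hundred transitions.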

    Extraction d'information à partir d'un texte arabe : extraction des entités nommées et leurs relations sémantiques

    In this thesis we address the issue of knowledge discovery in Arabic text. This task is achieved by detecting and recognizing the semantic relations between named entities. The problem of locating and extracting the named entities, as well as the semantic relations binding them, is solved with a rule-based approach in which expert rules are converted to finite-state transducers. The lack of linguistic resources and tools for Arabic NLP pushed us to build our own resources and to adapt the Unitex/GramLab tools to accomplish the tasks mentioned above. The resources are also built, then compressed and stored, using finite-state transducers.

    Using finite-state transducers to build lexical resources for Unitex Arabic package

    This paper addresses the issue of generating Arabic verbal inflectional paradigms using finite-state automata (FSA). In the proposed approach, tokens drawn from the corpus are manually lemmatized, and finite-state transducers are then applied to the lemmas to produce all possible word forms with their full morphological features. The first strength of the approach lies in the algorithm for automatically generating transducers with 184 transitions, which would be very cumbersome to design manually. The second strength is a new classification of Arabic verbs, based on our newly suggested inflection scheme, which specifies the verb inflection paradigms. All resulting resources are publicly available and currently used as an open package in the Unitex framework under the LGPL license.

    Conception d'un jeu de ressources libres pour le TAL arabe sous Unitex

    This paper describes the process of building a free Arabic package for the Unitex framework: we proposed a test corpus, chose a tag set suited to this task, and built dictionaries respecting the LADL DELA format. We describe each of these steps, particularly the building of dictionaries, for which we designed algorithms for the automatic generation of verb and noun inflection graphs. We adopt a word-based approach to inflection and define a set of themes for each lexeme. For verbs, we use five themes given by the user, and the graphs generate up to 264 inflected verbal forms; for nouns and adjectives we use one or at most two themes, and the produced graphs generate 63 inflected forms.
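
    The theme-based generation described above can be illustrated with a small table-driven sketch for a noun with two themes. It is a simplification under stated assumptions, not the paper's graphs: the paradigm below has 8 cells rather than 63, the forms are in a rough ASCII transliteration, and `Cell`, `NOUN_CELLS`, `expand`, and the feature labels are hypothetical names chosen for the example.

```python
from typing import List, NamedTuple, Tuple


class Cell(NamedTuple):
    theme: int      # index into the lexeme's list of themes
    suffix: str     # ending attached to that theme
    features: str   # illustrative feature label, not an official tag set


# Theme 0 is the singular theme, theme 1 the (broken) plural theme.
NOUN_CELLS: List[Cell] = [
    Cell(0, "un", "sg nom indef"),
    Cell(0, "an", "sg acc indef"),
    Cell(0, "in", "sg gen indef"),
    Cell(0, "aani", "du nom"),
    Cell(0, "ayni", "du obl"),
    Cell(1, "un", "pl nom indef"),
    Cell(1, "an", "pl acc indef"),
    Cell(1, "in", "pl gen indef"),
]


def expand(lemma: str, themes: List[str]) -> List[Tuple[str, str, str]]:
    """Return (inflected form, lemma, features) for every paradigm cell."""
    return [(themes[c.theme] + c.suffix, lemma, c.features) for c in NOUN_CELLS]


forms = expand("kitaab", ["kitaab", "kutub"])
for form, lemma, features in forms:
    print(f"{form}\t{lemma}\t{features}")
```

    Keeping the themes as per-lexeme input and the cells as a shared table means an irregular plural only changes the data supplied for that lexeme, not the paradigm itself.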