Recognition and translation Arabic-French of Named Entities: case of the Sport places
The recognition of Arabic Named Entities (NE) is a problem in different
domains of Natural Language Processing (NLP), such as automatic translation.
Indeed, NE translation gives access to multilingual information. This
translation does not always lead to the expected result, especially when the
NE contains a person name. For this reason, and in order to improve
translation, we can transliterate some parts of the NE. In this context, we
propose a method that integrates translation and transliteration. We used the
NooJ linguistic platform, which is based on local grammars and transducers.
In this paper, we focus on the sport domain. We first suggest a refinement of
the typological model presented at the MUC conferences. We then describe the
integration of an Arabic transliteration module into the translation system.
Finally, we detail our method and give the results of the evaluation.
Multilingual Extraction of functional relations between Arabic Named Entities using NooJ platform
10 pages. International audience. The extraction of relations between Named Entities (NE) has become an interesting research domain over the last few years. It is very useful for many applications such as Web mining, information extraction and retrieval, business intelligence, automatic filing of databases with entities and types, question answering and document summarization. Several works have addressed relation discovery in texts written in Latin-script languages, but as far as we know very little work has been done for Arabic. In this paper, we focus on functional relations between ENAMEX and ORG Arabic Named Entities. The extraction approach is rule-based and is implemented using the NooJ platform.
AmAMorph: Finite State Morphological Analyzer for Amazighe
This paper presents AmAMorph, a morphological analyzer for the Amazighe language built on the NooJ linguistic development environment. The paper begins with the development of Amazighe lexicons with large coverage formalization. The electronic lexicons built, named "NAmLex", "VAmLex" and "PAmLex", which stand for "Noun Amazighe Lexicon", "Verb Amazighe Lexicon" and "Particles Amazighe Lexicon", link inflectional, morphological, and syntactic-semantic information to the list of lemmas. Automated inflectional and derivational routines are applied to each lemma producing over inflected forms. To our knowledge, AmAMorph is the first morphological analyzer for Amazighe. It identifies the component morphemes of the forms using large coverage morphological grammars. Along with the description of how the analyzer is implemented, this paper gives an evaluation of the analyzer.
A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs
International audience. This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost and time-efficient, as well as scalable and extensible. Its extensibility lies in the fact that the method is incremental in both respects: processing resources and generating lexicons. Given a corpus, tokens are first drawn from the corpus and lemmatized. Secondly, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One of the algorithm's strengths is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. The second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs. The resources contain 16,855 verb lemmas and 11,080,355 fully vocalized, partially vocalized and unvocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
Discourse analysis of arabic documents and application to automatic summarization
Within a discourse, texts and conversations are not just a juxtaposition of words and sentences. They are instead organized in a structure in which discourse units are related to each other so as to ensure both the coherence and the cohesion of the discourse. Discourse structure has been shown to be useful in many NLP applications, including machine translation, natural language generation, automatic summarization and language technology in general. The usefulness of discourse in NLP applications mainly depends on the availability of powerful discourse parsers. To build such parsers and improve their performance, several resources have been manually annotated with discourse information within different theoretical frameworks. Most available resources are in English. Recently, several efforts have been undertaken to develop manually annotated discourse resources for other languages such as Chinese, German, Turkish, Spanish and Hindi. Surprisingly, discourse processing in Modern Standard Arabic (MSA) has received less attention, despite the fact that MSA is a language with more than 422 million speakers in 22 countries.

Computational processing of Arabic has received great attention in the literature for over twenty years. Several resources and tools have been built to deal with Arabic non-concatenative morphology and with Arabic syntax, going from shallow to deep parsing. However, the discourse level remains largely unexplored. As far as we know, the sole effort towards Arabic discourse processing is the Leeds Arabic Discourse Treebank, which extends the Penn Discourse TreeBank model to MSA.

This thesis falls within the field of Arabic natural language processing, and more specifically the discourse analysis of Arabic texts. It aims to study the contribution of semantic and discourse analysis to the automatic summarization of Arabic documents. To this end, we study Segmented Discourse Representation Theory (SDRT), which offers a logical framework for the semantic representation of sentences together with a graph representation of text structure in which discourse relations are semantic rather than intentional in nature. This theory has been studied for English, French and German, but never for Arabic. We propose to go beyond the annotation of explicit relations that link adjacent units by completely specifying the semantic scope of each discourse relation, making transparent an interpretation of the text that takes into account the semantic effects of discourse relations. In particular, we propose the first effort towards a semantically driven approach to Arabic texts following SDRT.

Our main contributions are: A study of the feasibility of building recursive and complete discourse structures of Arabic texts. In particular, we propose: An annotation scheme for the full discourse coverage of Arabic texts, in which each constituent is linked to other constituents. A document is then represented by a directed acyclic graph, which captures explicit and implicit relations as well as complex discourse phenomena such as long-distance attachments, long-distance discourse pop-ups and crossed dependencies. A novel discourse relation hierarchy. We study rhetorical relations from a semantic point of view, focusing on their effect on meaning and not on how they are lexically triggered by discourse connectives, which are often ambiguous, especially in Arabic. A thorough quantitative analysis (in terms of discourse connectives, relation frequencies, proportion of implicit relations, etc.) and qualitative analysis (inter-annotator agreement and error analysis) of the annotation campaign. An automatic discourse parser, in which we investigate both the automatic segmentation of Arabic texts into elementary discourse units and the automatic identification of explicit and implicit Arabic discourse relations. An application of our discourse parser to Arabic text summarization. We compare tree-based and graph-based discourse representations for producing indicative summaries and show that the full discourse coverage of a document is definitely a plus.
A hybrid NLP & semantic knowledgebase approach for the intelligent exploration of Arabic documents
In the contemporary era, a colossal amount of information is published daily on the Web in the form of articles, documents, reviews, blogs and social media posts. As most of this data is available only as unstructured documents, it is challenging and time-consuming to extract non-trivial, previously unknown, and potentially useful knowledge from the published documents. Hence, extracting useful knowledge from unstructured text, i.e., Information Extraction, is becoming an increasingly significant aspect of knowledge discovery.
This work focuses on Information Extraction from unstructured Arabic text, which is an especially challenging task as Arabic is a highly inflectional and derivational language. The problem is compounded by the lack of mature tools and advanced research in Arabic Natural Language Processing (NLP) in comparison with European languages, for instance.
The principal objective of this research work is to present a comprehensive methodology for integrating domain knowledge with Natural Language Processing techniques that were proven effective in solving most classification problems, in order to improve the Information Extraction process from online unstructured data. NLP tools are important because they play a key role in allowing semantic concept tagging of unstructured text, and so help realize the Semantic Web. This work presents a novel rule-based approach that uses linguistic grammar-based techniques to extract Arabic composite names from Arabic text. Our approach uniquely exploits the genitive Arabic grammar rules; in particular, the rules regarding the identification of definite nouns (معرفة) and indefinite nouns (نكرة) to support the process of extracting composite names. Furthermore, this approach does not place any constraints on the length of the Arabic composite name. The results of our experiments show an improvement in recognizing Arabic composite name entities in Arabic text.
Our research also contributes a novel, knowledge-based approach to relation extraction from unstructured Arabic text, which is based on the principles of Functional Discourse Grammar (FDG). We further improve the approach by integrating it with Machine Learning relation classification, resulting in a hybrid relation extraction algorithm that can handle especially complex Arabic sentence structures. Our relation classification was evaluated extensively through experiments, which evidenced the accuracy of the FDG relation extraction approach and the improvement gained by the Machine Learning integration.
The essential NLP algorithms of entity recognition and relation extraction were deployed in a Semantic Knowledge-base that was built from the outset to model the knowledge of the problem domain. The semantic modelling of the knowledge-base helped improve the accuracy of the NLP algorithms by leveraging relevant domain knowledge published in Open Linked Datasets. Moreover, the extracted information was semantically tagged and inserted into the Semantic Knowledge-base, which made it possible to build advanced rules that infer new, interesting information from the extracted knowledge, and to use advanced query mechanisms for intelligently exploring the mined problem domain knowledge.
Arabic nested noun compound extraction based on linguistic features and statistical measures
The extraction of Arabic nested noun compounds is significant for several research areas such
as sentiment analysis, text summarization, word categorization, grammar checking, and
machine translation. Much research has studied the extraction of Arabic noun compounds
using linguistic approaches, statistical methods, or a hybrid of both. A wide range of the
existing approaches concentrate on the extraction of the bi-gram or tri-gram noun compound.
Nonetheless, extracting a 4-gram or 5-gram nested noun compound is a challenging task due
to the morphological, orthographic, syntactic and semantic variations. Many features have an
important effect on the efficiency of extracting a noun compound such as unit-hood,
contextual information, and term-hood. Hence, there is a need to improve the effectiveness of
the Arabic nested noun compound extraction. Thus, this paper proposes a hybrid of a linguistic
approach and a statistical method with a view to enhancing the extraction of Arabic nested
noun compounds. A number of pre-processing phases are presented, including transformation,
tokenization, and normalisation. The linguistic approaches used in this study
consist of part-of-speech tagging and the named entities pattern, whereas the proposed
statistical methods consist of the NC-value, NTC-value,
NLC-value, and the combination of these association measures. The results
demonstrate that the combined association measures outperform the NLC-value,
NTC-value, and NC-value in terms of nested noun compound extraction, achieving 90%,
88%, 87%, and 81% for bigram, trigram, 4-gram, and 5-gram, respectively.
Reconnaissance automatique des entités nommées arabes et leur traduction vers le français
The translation of named entities (NEs) is a current research topic given the proliferation of electronic documents exchanged through the Internet. The need to process these documents with NLP tools has therefore become necessary and interesting. Formal or semi-formal modeling of these NEs may intervene in both the recognition and the translation processes. Indeed, it makes the accumulation of linguistic resources more reliable, limits the impact of linguistic specificities and facilitates the transformation from one representation to another. In this context, we propose a tool for the recognition of Arabic NEs and their translation into French, based primarily on a formal representation and a set of transducers. This tool integrates a transliteration module. It was implemented using the NooJ platform, and the results obtained proved to be satisfactory.