
    Normalized Google Distance for Collocation Extraction from Islamic Domain

    This study investigates the properties of Arabic collocations and classifies them according to their structural patterns in the Islamic domain. Based on linguistic information, the patterns and the variations of the collocations have been identified. Then, a system that extracts collocations from the Islamic domain based on statistical measures is described. In candidate ranking, the normalized Google distance has been adapted to measure the association between the words in the candidate set. Finally, an n-best evaluation that selects the n-best list for each association measure has been used to annotate all candidates in these lists manually. The log-likelihood ratio, t-score, mutual information, and enhanced mutual information association measures have been used in the candidate-ranking step to compare them with the normalized Google distance for Arabic collocation extraction. In our experiments, the normalized Google distance achieved the highest precision, 93%, compared with the other association measures. This strengthens our motivation to use the normalized Google distance to measure the relatedness between the constituent words of collocations, instead of the frequency-based association measures used in state-of-the-art methods. Keywords: normalized Google distance, collocation extraction, Islamic domain
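    The normalized Google distance itself is a standard formula (Cilibrasi and VitĂĄnyi); a minimal sketch of how it could rank collocation candidates follows. The function name and the idea of feeding it raw corpus frequencies are ours, not details from the paper:

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google distance between two words, given the count of
    documents containing x (fx), containing y (fy), containing both (fxy),
    and the total number of indexed documents n."""
    if fxy == 0:
        return float("inf")  # the words never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Lower NGD means a stronger association, so candidate collocations
# would be ranked in ascending order of this score.
```

Unlike frequency-based measures, the score depends only on ratios of log counts, which is why it can be adapted to web-scale hit counts.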

    An empirical study of Arabic formulaic sequence extraction methods

    This paper aims to implement what is referred to as the collocation of Arabic keywords approach for extracting formulaic sequences (FSs) in the form of high-frequency but semantically regular formulas that are not restricted to any syntactic construction or semantic domain. The study applies several distributional semantic models to automatically extract relevant FSs related to Arabic keywords. The data sets used in this experiment are drawn from a newly developed corpus-based Arabic wordlist consisting of 5,189 lexical items that represent a variety of Modern Standard Arabic (MSA) genres and regions; the new wordlist is based on overlapping frequencies derived from a comprehensive comparison of four large Arabic corpora with a total size of over 8 billion running words. Empirical n-best precision evaluation methods are used to determine the best association measures (AMs) for extracting high-frequency and meaningful FSs. The gold-standard reference FS list was developed in previous studies and manually evaluated against well-established quantitative and qualitative criteria. The results demonstrate that the MI.log_f AM achieved the best results in extracting significant FSs from the large MSA corpus, while the t-score association measure achieved the worst results.
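    The association measures and the n-best precision evaluation mentioned above can be sketched with standard bigram statistics. This is an illustrative implementation under common textbook definitions, not the paper's exact code; all function names are ours:

```python
import math

def t_score(f_xy, f_x, f_y, n):
    """t-score: observed co-occurrence count minus the count expected
    under independence, scaled by sqrt of the observed count."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) for a word pair."""
    return math.log2(f_xy * n / (f_x * f_y))

def nbest_precision(ranked_candidates, gold_standard, n):
    """Share of the top-n candidates ranked by one association measure
    that appear in a manually built gold-standard list."""
    top = ranked_candidates[:n]
    return sum(1 for c in top if c in gold_standard) / len(top)
```

Computing `nbest_precision` at several cut-offs for each measure is what allows the measures to be compared, as in the evaluation described above.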

    A hybrid approach for Arabic semantic relation extraction

    Information retrieval applications are essential tools for managing the huge amount of information on the Web, and ontologies are of great importance in these applications. The idea is that data belonging to a domain of interest are represented and semantically related in the ontology, which helps to navigate, manage, and reuse these data. Despite the growing need for ontologies, only a few works have addressed the Arabic language. Indeed, Arabic texts are highly ambiguous, especially when diacritics are absent. Besides, existing works do not cover all the types of semantic relations that are useful for structuring Arabic ontologies. Much work has been done on co-occurrence-based techniques, which lead to over-generation. In this paper, we propose a new approach for Arabic semantic relation extraction. We use vocalized texts to reduce ambiguity and propose a new distributional approach for similarity computation, which is compared to co-occurrence. We discuss our contribution through experimental results and propose some perspectives for future research.
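    A distributional similarity of the kind compared against co-occurrence above is typically computed as the cosine between context vectors. The sketch below shows that general technique under our own simplifying assumptions (bag-of-words contexts, a fixed window); it is not the authors' exact calculus:

```python
import math
from collections import Counter

def context_vector(word, sentences, window=2):
    """Build a bag-of-words context vector for `word` from a corpus
    given as a list of tokenized sentences."""
    vec = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for t in sent[lo:hi] if t != word)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    num = sum(u[k] * v[k] for k in u if k in v)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0
```

Two words are distributionally similar when they occur in similar contexts, even if they rarely co-occur directly, which is the key difference from a co-occurrence measure.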

    An empirical study on the Holy Quran based on a large classical Arabic corpus

    Distributional semantics is one of the empirical approaches to natural language processing and acquisition; it is mainly concerned with modeling word meaning using word-distribution statistics gathered from huge corpora. Many distributional semantic models are available in the literature, but none of them has so far been applied to the Quran or to Classical Arabic in general. This paper reports the construction of a very large corpus of Classical Arabic that will be used as a basis for studying the distributional lexical semantics of the Quran and Classical Arabic. It also reports the results of two empirical studies: the first applies a number of probabilistic distributional semantic models to automatically identify lexical collocations in the Quran, and the second applies the same models to the Classical Arabic corpus in an attempt to test their ability to capture lexical collocations and co-occurrences for a number of the corpus words. Results show that the MI.log_freq association measure achieved the best results in extracting significant co-occurrences and collocations from both small and large Classical Arabic corpora, while the mutual information association measure achieved the worst results.
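    The MI.log_freq measure is commonly described as pointwise mutual information scaled by the logarithm of the co-occurrence frequency, which dampens PMI's well-known bias towards rare pairs. The exact form below (PMI times log2 of frequency plus one) is an assumption on our part, since definitions vary across tools:

```python
import math

def mi_log_f(f_xy, f_x, f_y, n):
    """MI.log_freq association score: pointwise mutual information scaled
    by the log of the co-occurrence frequency (assumed form; the +1 guards
    against log(1) = 0 wiping out pairs seen once)."""
    pmi = math.log2(f_xy * n / (f_x * f_y))
    return pmi * math.log2(f_xy + 1)
```

The frequency factor explains the result reported above: plain mutual information top-ranks hapax pairs, while the scaled variant favours pairs that are both associated and well attested.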

    Organizing Contextual Knowledge for Arabic Text Disambiguation and Terminology Extraction.

    Ontologies play an important role in knowledge organization and information retrieval. Domain ontologies are composed of concepts represented by domain-relevant terms. Existing approaches to ontology construction make use of statistical and linguistic information to extract domain-relevant terms. The quality and quantity of this information influence the accuracy of terminology-extraction approaches, as well as other steps in knowledge extraction and information retrieval. This paper proposes an approach for handling domain-relevant terms drawn from Arabic non-diacriticised semi-structured corpora. As input, the structure of documents is exploited to organize knowledge in a contextual graph, which is then exploited to extract relevant terms. This network contains simple and compound nouns handled by a morphosyntactic shallow parser. The noun phrases are evaluated in terms of termhood and unithood by means of possibilistic measures. We apply a qualitative approach that weighs terms according to their positions in the structure of the document. As output, the extracted knowledge is organized as a network modeling dependencies between terms, which can be exploited to infer semantic relations. We test our approach on three domain-specific corpora. The goal of this evaluation is to check whether our model for organizing and exploiting contextual knowledge improves the accuracy of the extraction of simple and compound nouns. We also investigate the role of compound nouns in improving information retrieval results.

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in an NLP system reduces the possibility of achieving high-precision output. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language-resource building, and lexical representation. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of the adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, for the AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representational model for AMWEs that consists of a multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in Standard Arabic.
The implications of this research relate to the vital role of the AMWE lexicon, as a new lexical resource, in improving various ANLP tasks, and to the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.

    Ontology Learning from the Arabic Text of the Qur’an: Concepts Identification and Hierarchical Relationships Extraction

    Recent developments in ontology learning have highlighted the growing role ontologies play in linguistic and computational research areas such as language teaching and natural language processing. The ever-growing availability of annotations for the Qur’an text has made the acquisition of ontological knowledge promising. However, the availability of resources and tools for Arabic ontology is not comparable with that for other languages. Manual ontology development is labour-intensive and time-consuming, and it requires the knowledge and skills of domain experts. This thesis aims to develop new methods for ontology learning from the Arabic text of the Qur’an, including concept identification and hierarchical relationship extraction. The thesis presents a methodology for reducing human intervention in building an ontology from the Classical Arabic language of the Qur’an text. The set of concepts, whose generation is a crucial step in ontology learning, was produced based on a set of patterns made of lexical and inflectional information. The concepts were identified using an adapted weighting scheme that exploits a combination of knowledge sources to learn the relevance degree of a term. Statistical and domain-specific knowledge and the internal information of Multi-Word Terms (MWTs) were combined to learn the relevance of generated terms. This methodology, which represents the major contribution of the thesis, was experimentally investigated using different term-generation methods. As a result, we provide the Arabic Qur’anic Terms (AQT) resource as a training resource for machine-learning-based term extraction. This thesis also introduces a new approach for hierarchical relation extraction from the Arabic text of the Qur’an. A set of hierarchical relations occurring between identified concepts is extracted using hybrid methods, including head-modifier analysis, a set of markers for the copula construct in Arabic text, and referents.
We also compared a number of ontology alignment methods for matching ontological bilingual Qur’anic resources. In addition, a multi-dimensional resource about the Qur’an, named the Arabic Qur’anic Database (AQD), was built for Arabic computational researchers, allowing regular-expression query search over the included annotations. The search tool was successfully applied to find instances for a given complex rule made of different combined resources.

    Du terme prédicatif au cadre sémantique : méthodologie de compilation d'une ressource terminologique pour les termes arabes de l'informatique [From the predicative term to the semantic frame: a methodology for compiling a terminological resource for Arabic computing terms]

    The description of terms in traditional terminological resources is limited to certain information, such as the term (usually a noun), its definition, and its equivalent in a foreign language. This description rarely provides other information that can be very useful to users, especially if they consult resources to deepen their knowledge of a specialized domain, to master professional writing, or to find contexts in which the term is realized. Information that can be useful in this sense includes the description of the actantial structure of terms, contexts drawn from authentic sources, and the inclusion of other parts of speech such as verbs. Verbs and deverbal nouns, or predicative terminological units (PTUs), which are often ignored by classical terminology, are of great importance when it comes to expressing an action, a process, or an event. However, the description of these units requires a model of terminological description that accounts for their particular features. A number of terminologists (Condamines 1993, Mathieu-Colas 2002, Gross and Mathieu-Colas 2001, and L’Homme 2012, 2015) have proposed description models based on different theoretical frameworks. Our research proposes a methodology for the terminological description of PTUs of the Arabic language, in particular Modern Standard Arabic (MSA), according to the theory of Frame Semantics of Fillmore (1976, 1977, 1982, 1985) and its application, the FrameNet project (Ruppenhofer et al. 2010). The specialized domain in which we are interested is computing.
In our research, we rely on a corpus collected from the web, and we draw on an existing online terminological resource, the DiCoInfo (L’Homme 2008), to compile our own. Our objectives are the following. First, we lay the foundations of an MSA version of the aforementioned resource. This version has its own features: 1) we target specific units, namely verbal and deverbal PTUs; 2) the methodology developed for compiling the original DiCoInfo must be adapted to take a Semitic language into account. Afterwards, we create a framed version of this resource, in which we organize the PTUs into semantic frames following the FrameNet model. Since this framed version has a multilingual dimension, we add English and French PTUs to the resource. Our methodology consists of automatically extracting verbal and nominal terminological units (VTUs and NTUs), such as Ham~ala (Ű­Ù…Ù„) (download) and taHmiyl (ŰȘŰ­Ù…ÙŠÙ„) (downloading). To do this, we adapted an existing automatic extractor, TermoStat (Drouin 2004), to MSA. Then, using terminological validation criteria (L’Homme 2004), we validate the terminological status of a subset of the candidates. After validation, we create a terminological record with an XML editor for each retained VTU and NTU. These records contain elements such as the actantial structure of the PTUs and up to 20 annotated contexts. The last step consists of creating semantic frames from the MSA PTUs. We also associate English and French PTUs with the created frames. This association resulted in a terminological resource called “DiCoInfo: A Framed Version”, in which the PTUs that share the same semantic features and actantial structures are grouped into semantic frames. For example, the semantic frame Product_development groups PTUs such as Taw~ara (Ű·ÙˆŰ±) (develop), to develop, and dĂ©velopper.
As a result of our methodology, we obtained a total of 106 MSA PTUs compiled in the MSA version of the DiCoInfo and 57 semantic frames associated with these units in the framed version. Our research shows that MSA can be described using the methodology we have developed.