
    On the evaluation and improvement of Arabic WordNet coverage and usability

    The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-013-9237-0

    Built on the basis of the methods developed for Princeton WordNet and EuroWordNet, Arabic WordNet (AWN) is a project that combines WordNet structure compliance with Arabic particularities. In this paper, some AWN shortcomings related to coverage and usability are addressed. The use of AWN in question/answering (Q/A) helped us to evaluate the resource in depth from an experience-based perspective. Accordingly, an enrichment of AWN was built by semi-automatically extending its content: existing approaches and/or resources developed for other languages were adapted and used for AWN. The experiments conducted in Arabic Q/A have shown an improvement in both AWN coverage and usability. Concerning coverage, a large number of named entities extracted from YAGO were connected with corresponding AWN synsets. Also, a significant number of new verbs and nouns (including Broken Plural forms) were added. In terms of usability, thanks to the use of AWN, the performance of the AWN-based Q/A application registered an overall improvement with respect to the following three measures: accuracy (+9.27 % improvement), mean reciprocal rank (+3.6 improvement) and number of answered questions (+12.79 % improvement).

    The work presented in Sect. 2.2 was done in the framework of the bilateral Spain-Morocco AECID-PCI C/026728/09 research project. The research of the first two authors is done in the framework of the PROGRAMME D'URGENCE project (grant no. 03/2010). The research of the third author is done in the framework of WIQEI IRSES project (grant no. 269180) within the FP 7 Marie Curie People, DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research project and VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. We would like to thank Manuel Montes-y-Gomez (INAOE-Puebla, Mexico) and Sandra Garcia-Blasco (Bitsnbrain, Spain) for their feedback on the work presented in Sect. 2.4. Finally, we would like to thank Violetta Cavalli-Sforza (Al Akhawayn University in Ifrane, Morocco) for having reviewed the linguistic level of the entire document.

    Abouenour, L.; Bouzoubaa, K.; Rosso, P. (2013). On the evaluation and improvement of Arabic WordNet coverage and usability. Language Resources and Evaluation. 47(3):891-917. https://doi.org/10.1007/s10579-013-9237-0
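
The three usability measures named in the abstract above can be illustrated with a minimal sketch. It assumes the common definitions (accuracy as the share of questions whose top-ranked answer is correct, mean reciprocal rank as the average of 1/rank of the first correct answer, and answered questions as those with a correct answer anywhere in the returned list); the paper's exact definitions may differ, and the function names and data layout are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def reciprocal_rank(ranked_answers: Sequence[str], gold: Set[str]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if none is correct."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in gold:
            return 1.0 / rank
    return 0.0


def evaluate_qa(runs: List[Tuple[Sequence[str], Set[str]]]) -> Dict[str, float]:
    """Compute accuracy, MRR and answered-question count.

    Each run is (ranked system answers, set of acceptable gold answers).
    The definitions used here are assumptions, not taken from the paper.
    """
    n = len(runs)
    if n == 0:
        raise ValueError("at least one question is required")
    top1_correct = sum(1 for ranked, gold in runs if ranked and ranked[0] in gold)
    rr_sum = sum(reciprocal_rank(ranked, gold) for ranked, gold in runs)
    answered = sum(1 for ranked, gold in runs if any(a in gold for a in ranked))
    return {
        "accuracy": top1_correct / n,
        "mrr": rr_sum / n,
        "answered": float(answered),
    }
```

Reporting the three values side by side shows why they can move differently: a resource that pushes correct answers from rank 3 to rank 2 raises MRR without changing accuracy or the answered count.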

    Web 2.0, language resources and standards to automatically build a multilingual named entity lexicon

    This paper proposes to advance the current state of the art of automatic Language Resource (LR) building by taking into consideration three elements: (i) the knowledge available in existing LRs, (ii) the vast amount of information available from the collaborative paradigm that has emerged from the Web 2.0 and (iii) the use of standards to improve interoperability. We present a case study in which a set of LRs for different languages (WordNet for English and Spanish and Parole-Simple-Clips for Italian) are extended with Named Entities (NE) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses interoperability, an important problem currently affecting the Computational Linguistics area, by making use of the ISO LMF standard to encode this lexicon. The different steps of the procedure (mapping, disambiguation, extraction, NE identification and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively. Finally, in order to check the usefulness of the constructed resource, we apply it to a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the system’s accuracy by 28.1%. Compared to previous approaches to building NE repositories, the current proposal represents a step forward in terms of automation, language independence, number of NEs acquired and richness of the information represented.

    Construction of an ontology for intelligent Arabic QA systems leveraging the Conceptual Graphs representation

    The last decade has seen great interest in Arabic Natural Language Processing (NLP) applications. This interest is due to the prominence of Arabic, the 6th most widespread language in the world with more than 350 million native speakers. Currently, some basic Arabic language challenges related to high inflection and derivation, Part-of-Speech (PoS) tagging, and the diacritical ambiguity of Arabic text have been largely overcome. However, the development of high-level, intelligent applications such as Question Answering (QA) systems is still obstructed by the lack of ontologies and other semantic resources. In this paper, we present the construction of a new Arabic ontology leveraging the contents of Arabic WordNet (AWN) and Arabic VerbNet (AVN). This new resource has the advantage of combining the high lexical coverage and semantic relations between words existing in AWN with the formal representation of syntactic and semantic frames corresponding to verbs in AVN. The Conceptual Graphs representation was adopted in the framework of a multi-layer platform dedicated to the development of intelligent and multi-agent systems. The built ontology is used to represent key concepts in questions and documents for further semantic comparison. Experiments conducted in the context of the QA task show a promising coverage with respect to the processed questions and passages. The obtained results also highlight an improvement in the performance of Arabic QA regarding the c@1 measure.

    The work of the last author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP 7 Marie Curie, the DIANA APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.

    Abouenour, L.; Nasri, M.; Bouzoubaa, K.; Kabbaj, A.; Rosso, P. (2014). Construction of an ontology for intelligent Arabic QA systems leveraging the Conceptual Graphs representation. Journal of Intelligent and Fuzzy Systems. 27(6):2869-2881. https://doi.org/10.3233/IFS-141248
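
The c@1 measure mentioned in the abstract above is not defined there. As a minimal sketch, assuming the usual definition from QA evaluation campaigns, c@1 = (n_R + n_U * n_R / n) / n, where n_R is the number of correctly answered questions, n_U the number of questions left unanswered and n the total number of questions; the function and parameter names are hypothetical.

```python
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """c@1 under the assumed definition: (n_R + n_U * n_R / n) / n.

    Unanswered questions are credited at the system's observed accuracy
    rate, so abstaining is rewarded over answering incorrectly.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct < 0 or n_unanswered < 0 or n_correct + n_unanswered > n_total:
        raise ValueError("inconsistent question counts")
    return (n_correct + n_unanswered * n_correct / n_total) / n_total
```

When no question is left unanswered, the value reduces to plain accuracy, which makes it easy to compare against systems that always answer.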

    An analysis of machine translation errors on the effectiveness of an Arabic-English QA system

    The aim of this paper is to investigate how much the effectiveness of a Question Answering (QA) system was affected by the performance of Machine Translation (MT) based question translation. Nearly 200 questions were selected from TREC QA tracks and run through a question answering system. It was able to answer 42.6% of the questions correctly in a monolingual run. These questions were then translated manually from English into Arabic and back into English using an MT system, and then re-applied to the QA system. The system was able to answer 10.2% of the translated questions. An analysis of what sort of translation error affected which questions was conducted, concluding that factoid-type questions are less prone to translation error than others.

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.

    Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing from the published version due to the publication policies. Please contact Prof. Erik Cambria for details.

    Abstract syntax as interlingua: Scaling up the grammatical framework from controlled languages to robust pipelines

    Abstract syntax is an interlingual representation used in compilers. Grammatical Framework (GF) applies the abstract syntax idea to natural languages. The development of GF started in 1998, first as a tool for controlled language implementations, where it has gained an established position in both academic and commercial projects. GF provides grammar resources for over 40 languages, enabling accurate generation and translation, as well as grammar engineering tools and components for mobile and Web applications. On the research side, the focus in the last ten years has been on scaling up GF to wide-coverage language processing. The concept of abstract syntax offers a unified view on many other approaches: Universal Dependencies, WordNets, FrameNets, Construction Grammars, and Abstract Meaning Representations. This makes it possible for GF to utilize data from the other approaches and to build robust pipelines. In return, GF can contribute to data-driven approaches by methods to transfer resources from one language to others, to augment data by rule-based generation, to check the consistency of hand-annotated corpora, and to pipe analyses into high-precision semantic back ends. This article gives an overview of the use of abstract syntax as interlingua through both established and emerging NLP applications involving GF.

    Learning Explicit and Implicit Arabic Discourse Relations.

    We propose in this paper a supervised learning approach to identify discourse relations in Arabic texts. To our knowledge, this work represents the first attempt to focus on both explicit and implicit relations that link adjacent as well as non-adjacent Elementary Discourse Units (EDUs) within the Segmented Discourse Representation Theory (SDRT). We use the Discourse Arabic Treebank corpus (D-ATB), which is composed of newspaper documents extracted from the syntactically annotated Arabic Treebank v3.2 part3, where each document is associated with a complete discourse graph according to the cognitive principles of SDRT. Our list of discourse relations is composed of a three-level hierarchy of 24 relations grouped into 4 top-level classes. To automatically learn them, we use state-of-the-art features whose efficiency has been empirically proved. We investigate how each feature contributes to the learning process. We report our experiments on identifying fine-grained discourse relations, mid-level classes and also top-level classes. We compare our approach with three baselines that are based on the most frequent relation, discourse connectives and the features used by Al-Saif and Markert (2011). Our results are very encouraging and outperform all the baselines with an F-score of 78.1% and an accuracy of 80.6%.

    Data augmentation for named entity recognition in the German legal domain

    Named Entity Recognition over texts from the legal domain aims to recognize legal entities such as references to legal norms or court decisions. This task is commonly approached with supervised deep learning techniques that require large amounts of training data. However, especially for low-resource languages and specific domains, such training data is often scarce. In this work, we focus on the German legal domain because it is of interest to the Canarėno project, which deals with information extraction from and analysis of legal norms. The objective of the work presented in this thesis is the implementation, evaluation, and comparison of different data augmentation techniques that can be used to expand the available data and thereby improve model performance. Through experiments on different dataset fractions, we show that Mention Replacement and Synonym Replacement can effectively enhance the performance of both recurrent and transformer-based NER models in low-resource environments.
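
The abstract above names Mention Replacement without describing it. A minimal sketch of the general idea, assuming BIO-tagged sentences and a pool of mentions per entity type collected from the training data, is given below; the thesis's actual procedure may differ, and the function names, the replacement probability `p` and the `LAW` label used in the usage note are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def extract_mentions(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans from a well-formed BIO tag sequence."""
    spans: List[Tuple[int, int, str]] = []
    start, etype = None, ""
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, etype))
            start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def mention_replacement(
    tokens: List[str],
    tags: List[str],
    pool: Dict[str, List[List[str]]],
    rng: random.Random,
    p: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Swap each mention, with probability p, for a same-type mention from pool."""
    new_tokens: List[str] = []
    new_tags: List[str] = []
    cursor = 0
    for start, end, etype in extract_mentions(tags):
        # Copy the untouched context before this mention.
        new_tokens.extend(tokens[cursor:start])
        new_tags.extend(tags[cursor:start])
        mention = tokens[start:end]
        if pool.get(etype) and rng.random() < p:
            mention = rng.choice(pool[etype])
        new_tokens.extend(mention)
        # Re-emit BIO tags so they match the new mention length.
        new_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(mention) - 1))
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags
```

For example, calling `mention_replacement(["Nach", "Art.", "5", "GG", "gilt"], ["O", "B-LAW", "I-LAW", "I-LAW", "O"], {"LAW": [["§", "823", "BGB"]]}, random.Random(0), p=1.0)` replaces the norm reference while keeping the surrounding context and a consistent tag sequence, which is what lets the augmented sentence be used directly as extra training data.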