14 research outputs found

    IDRAAQ: New Arabic Question Answering system based on Query Expansion and Passage Retrieval

    Arabic has received relatively little attention from researchers in the field of Question Answering (QA). The paper presents the core modules of a new Arabic Question Answering system called IDRAAQ. These modules aim to enhance the quality of the passages retrieved for a given question. Experiments were conducted in the framework of the main task of QA4MRE@CLEF 2012, which this year includes the Arabic language. Two runs were submitted. Both runs use only the reading test documents to answer questions. The two runs differ in the answer validation process, which is more relaxed in the second run. The Passage Retrieval (PR) module of our system applies several levels of processing in order to improve the quality of the returned passages and, in turn, the performance of the whole system. The PR module of IDRAAQ is based on a keyword-based level and a structure-based level, which respectively consist of: (i) a Query Expansion (QE) process relying on Arabic WordNet semantic relations; (ii) a passage retrieval system based on the Distance Density N-gram Model. The latter level takes the passages retrieved with the QE queries and re-ranks them according to a structure-based similarity score. Named Entities are recognized by means of a mapping between the YAGO ontology and Arabic WordNet. The experiments show that, with respect to accuracy and the c@1 measure, IDRAAQ registered encouraging performance, in particular on factoid questions. The same experiments allowed us to identify the shortcomings of the system, especially when processing non-factoid questions and at the Answer Validation stage. The IDRAAQ system, which is still under construction, will integrate a Conceptual Graph-based passage re-ranking that introduces a semantic level into its PR module.
    Abouenour, L.; Bouzoubaa, K.; Rosso, P. (2012). IDRAAQ: New Arabic Question Answering system based on Query Expansion and Passage Retrieval. CELCT. http://hdl.handle.net/10251/46316
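The abstract above reports results in terms of accuracy and the c@1 measure. As a minimal sketch, assuming the usual definition of c@1 used in the QA4MRE evaluations (unanswered questions are credited at the system's overall accuracy rate), the two measures can be compared as follows. The function names and the sample counts are illustrative and are not taken from the paper.

```python
def accuracy(n_correct: int, n_total: int) -> float:
    """Fraction of all questions that were answered correctly."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_correct / n_total


def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """c@1 = (n_R + n_U * n_R / n) / n.

    n_R: correctly answered questions, n_U: questions left unanswered,
    n: total number of questions. Leaving a question unanswered is
    rewarded in proportion to the accuracy achieved overall.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct + n_unanswered > n_total:
        raise ValueError("correct + unanswered cannot exceed the total")
    return (n_correct + n_unanswered * n_correct / n_total) / n_total


# Hypothetical run: 10 questions, 4 correct, 2 left unanswered, 4 wrong.
print(accuracy(4, 10))   # 0.4
print(c_at_1(4, 2, 10))  # approximately 0.48
```

When no question is left unanswered, c@1 reduces to plain accuracy, which is why the two figures are typically reported side by side.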

    On the evaluation and improvement of Arabic WordNet coverage and usability

    The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-013-9237-0
    Built on the methods developed for Princeton WordNet and EuroWordNet, Arabic WordNet (AWN) is a project that combines compliance with the WordNet structure with the particularities of Arabic. This paper addresses some AWN shortcomings related to coverage and usability. Using AWN in question answering (Q/A) helped us to evaluate the resource in depth from an experience-based perspective. Accordingly, an enriched AWN was built by semi-automatically extending its content: existing approaches and resources developed for other languages were adapted and used for AWN. The experiments conducted in Arabic Q/A show an improvement in both AWN coverage and usability. Concerning coverage, a large number of named entities extracted from YAGO were connected to the corresponding AWN synsets, and a significant number of new verbs and nouns (including Broken Plural forms) were added. In terms of usability, the AWN-based Q/A application registered an overall improvement with respect to the following three measures: accuracy (+9.27 % improvement), mean reciprocal rank (+3.6 improvement) and number of answered questions (+12.79 % improvement).
    The work presented in Sect. 2.2 was done in the framework of the bilateral Spain-Morocco AECID-PCI C/026728/09 research project. The research of the first two authors was done in the framework of the PROGRAMME D'URGENCE project (grant no. 03/2010). The research of the third author was done in the framework of the WIQEI IRSES project (grant no. 269180) within the FP 7 Marie Curie People, the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research project and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    We would like to thank Manuel Montes-y-Gomez (INAOE-Puebla, Mexico) and Sandra Garcia-Blasco (Bitsnbrain, Spain) for their feedback on the work presented in Sect. 2.4. Finally, we would like to thank Violetta Cavalli-Sforza (Al Akhawayn University in Ifrane, Morocco) for having reviewed the linguistic level of the entire document.
    Abouenour, L.; Bouzoubaa, K.; Rosso, P. (2013). On the evaluation and improvement of arabic wordnet coverage and usability. Language Resources and Evaluation. 47(3):891-917. https://doi.org/10.1007/s10579-013-9237-0
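The Arabic WordNet entry above reports usability gains on three measures: accuracy, mean reciprocal rank (MRR) and number of answered questions. The sketch below shows one common way to compute them from ranked answer lists. The exact definitions used in the paper may differ, so the choices made here (accuracy as "correct answer at rank 1", answered as "correct answer anywhere in the list") are assumptions, as are the toy data.

```python
from typing import List, Set


def first_correct_rank(ranked: List[str], gold: Set[str]) -> int:
    """1-based rank of the first correct answer, or 0 if none is present."""
    for rank, answer in enumerate(ranked, start=1):
        if answer in gold:
            return rank
    return 0


def evaluate(runs: List[List[str]], golds: List[Set[str]]) -> dict:
    """Accuracy, MRR and answered-question count over a set of questions."""
    ranks = [first_correct_rank(r, g) for r, g in zip(runs, golds)]
    n = len(ranks)
    return {
        # Assumption: a question counts as accurate if rank 1 is correct.
        "accuracy": sum(1 for r in ranks if r == 1) / n,
        # MRR averages 1/rank; questions with no correct answer add 0.
        "mrr": sum(1.0 / r for r in ranks if r > 0) / n,
        # Assumption: "answered" means a correct answer appears at any rank.
        "answered": sum(1 for r in ranks if r > 0),
    }


# Toy data: three questions, each with a ranked list of candidate answers.
runs = [["Rabat", "Fez"], ["Paris", "Cairo"], ["Tunis", "Algiers"]]
golds = [{"Rabat"}, {"Cairo"}, {"Tripoli"}]
print(evaluate(runs, golds))
# accuracy = 1/3, mrr = (1 + 1/2 + 0) / 3 = 0.5, answered = 2
```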

    Research in Linguistic Engineering: Resources and Tools

    In this paper we revisit some of the resources and tools developed by members of the Intelligent Systems Research Group (GSI) at UPM and of the Information Retrieval and Natural Language Processing Research Group (IR&NLP) at UNED. Details about the resources developed (corpora, software) and about current interests and projects are given for the two groups. A brief summary of, and links to, open source resources and tools developed by other groups of the MAVIR consortium are also included.

    Construction of an ontology for intelligent Arabic QA systems leveraging the Conceptual Graphs representation

    The last decade has seen great interest in Arabic Natural Language Processing (NLP) applications. This interest is due to the importance of Arabic, the 6th most widespread language in the world, with more than 350 million native speakers. Some basic challenges of the Arabic language, related to its high degree of inflection and derivation, Part-of-Speech (PoS) tagging, and the diacritical ambiguity of Arabic text, have now largely been overcome. However, the development of high-level, intelligent applications such as Question Answering (QA) systems is still obstructed by the lack of ontologies and other semantic resources. In this paper, we present the construction of a new Arabic ontology that leverages the contents of Arabic WordNet (AWN) and Arabic VerbNet (AVN). This new resource has the advantage of combining the high lexical coverage and semantic relations between words found in AWN with the formal representation of the syntactic and semantic frames of verbs in AVN. The Conceptual Graphs representation was adopted in the framework of a multi-layer platform dedicated to the development of intelligent and multi-agent systems. The resulting ontology is used to represent key concepts in questions and documents for further semantic comparison. Experiments conducted in the context of the QA task show promising coverage of the processed questions and passages. The results obtained also highlight an improvement in the performance of Arabic QA with respect to the c@1 measure.
    The work of the last author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP 7 Marie Curie, the DIANA APPLICATIONS - Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    Abouenour, L.; Nasri, M.; Bouzoubaa, K.; Kabbaj, A.; Rosso, P. (2014). Construction of an ontology for intelligent Arabic QA systems leveraging the Conceptual Graphs representation. Journal of Intelligent and Fuzzy Systems. 27(6):2869-2881. https://doi.org/10.3233/IFS-141248

    Knowledge-based and data-driven approaches for geographical information access

    Geographical Information Access (GeoIA) can be defined as a way of retrieving information from textual collections that includes the automatic analysis and interpretation of the geographical constraints and terms present in queries and documents. This PhD thesis presents, describes and evaluates several heterogeneous approaches for the following three GeoIA tasks: Geographical Information Retrieval (GIR), Geographical Question Answering (GeoQA), and Textual Georeferencing (TG). The GIR task deals with user queries that search over documents (e.g. "vineyards in California") and the GeoQA task deals with questions that retrieve answers (e.g. "What is the capital of France?"). TG, on the other hand, is the task of associating one or more georeferences (such as polygons or coordinates in a geodetic reference system) with electronic documents. Current state-of-the-art AI algorithms do not yet fully understand the semantic meaning and the geographical constraints and terms present in queries and document collections. This thesis attempts to improve the effectiveness of GeoIA tasks by: 1) improving the detection, understanding, and use of part of the geographical and thematic content of queries and documents with Toponym Recognition, Toponym Disambiguation and Natural Language Processing (NLP) techniques, and 2) combining Geographical Knowledge-Based Heuristics based on common sense with Data-Driven IR algorithms. The main contributions of this thesis to the state of the art of GeoIA tasks are: 1) The presentation of 10 novel approaches for GeoIA tasks: 3 approaches for GIR, 3 for GeoQA, and 4 for Textual Georeferencing (TG). 2) The evaluation of these novel approaches in three contexts: within official evaluation benchmarks, after the evaluation benchmarks using their test collections, and with other specific datasets. Most of these algorithms have been evaluated in international evaluations and some of them achieved top-ranked state-of-the-art results, including top-performing results in the GIR (GeoCLEF 2007) and TG (MediaEval 2014) benchmarks. 3) The experiments reported in this PhD thesis show that the approaches can effectively combine Geographical Knowledge and NLP with Data-Driven techniques to improve the effectiveness measures of the three Geographical Information Access tasks investigated. 4) TALPGeoIR: a novel GIR approach that combines Geographical Knowledge ReRanking (GeoKR), NLP and Relevance Feedback (RF) and that achieved state-of-the-art results in official GeoCLEF benchmarks (Ferrés and Rodríguez, 2008; Mandl et al., 2008) and in subsequent experiments (Ferrés and Rodríguez, 2015a). This approach was evaluated on the full GeoCLEF corpus (100 topics) and showed that the GeoKR, NLP, and RF techniques, evaluated separately or in combination, improve the MAP and R-Precision results of the state-of-the-art IR algorithms TF-IDF, BM25 and InL2, with statistical significance in most of the experiments. 5) GeoTALP-QA: a scope-based GeoQA approach for Spanish and English and its evaluation with a set of questions on Spanish geography (Ferrés and Rodríguez, 2006). 6) Four Textual Georeferencing approaches for informal and formal documents that achieved state-of-the-art results in evaluation benchmarks (Ferrés and Rodríguez, 2014) and in subsequent experiments (Ferrés and Rodríguez, 2011; Ferrés and Rodríguez, 2015b).
    Postprint (published version)
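The thesis abstract above reports improvements in MAP and R-Precision. As a minimal sketch of the standard textbook definitions of these two measures (not the thesis's evaluation code), the following computes them over ranked result lists. The document identifiers and topics are made up for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    # Dividing by |relevant| penalises relevant docs that were never retrieved.
    return total / len(relevant)


def r_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Precision within the top R results, where R = number of relevant docs."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """MAP: average precision averaged over all judged topics."""
    return sum(average_precision(runs.get(t, []), rel)
               for t, rel in qrels.items()) / len(qrels)


# Hypothetical topics with made-up document identifiers.
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d9", "d7"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d7"}}
print(average_precision(runs["t1"], qrels["t1"]))  # (1/1 + 2/3) / 2
print(r_precision(runs["t1"], qrels["t1"]))        # 0.5
print(mean_average_precision(runs, qrels))
```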

    Arabic named entity recognition

    Full text link
    This doctoral thesis describes the research carried out to determine the best techniques for building an Arabic Named Entity Recognizer. Such a system would be able to identify and classify the named entities found in open-domain Arabic text. The Named Entity Recognition (NER) task helps other Natural Language Processing tasks (for example, Information Retrieval, Question Answering, Machine Translation, etc.) achieve better results thanks to the enrichment it adds to the text. The literature contains various works that investigate the NER task for a specific language or from a language-independent perspective. However, very few works studying this task for Arabic have been published to date. Arabic has a special orthography and a complex morphology, and these aspects bring new challenges for research on the NER task. A complete investigation of NER for Arabic would not only provide the techniques needed to achieve high performance, but would also provide an error analysis and a discussion of the results that benefit the NER research community. The main objective of this thesis is to meet that need. To this end we have: 1. Prepared a study of the different aspects of Arabic related to this task; 2. Analyzed the state of the art in NER; 3. Carried out a comparison of the results obtained by different machine learning techniques; 4. Developed a method based on combining different classifiers, where each classifier handles a single class of named entities and uses the feature set and machine learning technique best suited to that class. 
Our experiments were evaluated on nine test sets. Benajiba, Y. (2009). Arabic named entity recognition [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8318

    A Useful Framework for Identification and Analysis of Different Query Expansion Approaches based on the Candidate Expansion Terms Extraction Methods

    Get PDF
    Query expansion is a method for improving retrieval performance by supplementing an original query with additional terms. This process improves the quality of search engine results and helps users find the information they need. In recent years, different methods have been proposed in this area. Given this variety of approaches and the need to study their characteristics, the lack of a comprehensive classification based on candidate expansion term extraction methods, and of suitable and complete criteria for evaluating them, makes it difficult for researchers to study, compare, and evaluate query expansion methods precisely and to choose an appropriate method for their needs. Therefore, this paper presents a new framework. The proposed framework identifies three basic approaches to query expansion based on their candidate expansion term extraction methods, describes their properties, and sets out appropriate criteria for qualitative evaluation of these methods. The approaches are then evaluated qualitatively against these criteria. The systematic and structured framework proposed in this paper gives researchers a useful platform for the comparative study of existing methods in the field, for investigating their features, especially their drawbacks, in order to improve them, and for choosing an appropriate method based on their needs.

    Contribution à l’amélioration de la recherche d’information par utilisation des méthodes sémantiques: application à la langue arabe

    Get PDF
    An information retrieval system is a set of programs and modules that interfaces with the user to take and interpret a query, search the index, and return a ranking of the selected documents to that user. However, the greatest challenge for such a system is that it must cope with the large volume of multimodal and multilingual information available through document collections or the web in order to find the information that best matches users' needs. In this work, we present two contributions. In the first, we propose a new approach to query reformulation in the context of Arabic information retrieval. The principle is to represent the query as a weighted semantic tree, whose nodes represent concepts (synsets) linked by semantic relations, in order to better identify the user's information need. The tree is built using the Pseudo-Relevance Feedback method combined with the Arabic WordNet semantic resource. The experimental results show a good improvement in the performance of the information retrieval system. In the second contribution, we also propose a new approach to building an Arabic information retrieval test collection. The approach relies on combining the Pooling strategy, using search engines, with the Naïve Bayes machine learning classification algorithm. For the experiments, we created a new test collection consisting of a document collection of 632 documents and 165 queries with their relevance judgments under several topics. 
The experiments also showed the effectiveness of the Bayesian classifier for retrieving document relevance judgments; moreover, it achieved good performance after the semantic enrichment of the document collection with the word2vec model.