1,680 research outputs found
ANNOTATION MODEL FOR LOANWORDS IN INDONESIAN CORPUS: A LOCAL GRAMMAR FRAMEWORK
There is a considerable number for loanwords in Indonesian language as it has been,
or even continuously, in contact with other languages. The contact takes place via different
media; one of them is via machine readable medium. As the information in different languages
can be obtained by a mouse click these days, the contact becomes more and more intense. This
paper aims at proposing an annotation model and lexical resource for loanwords in
Indonesian. The lexical resource is applied to a corpus by a corpus processing software called
UNITEX. This software works under local grammar framewor
An automatically built named entity lexicon for Arabic
We have successfully adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWNâs instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the Lexical Markup Framework (LMF) ISO standard. We conduct a quantitative and qualitative evaluation of the lexicon against a manually annotated gold standard and achieve precision scores from
95.83% (with 66.13% recall) to 99.31% (with 61.45% recall) according to different values of a threshold
Semantic and Contextual Knowledge Representation for Lexical Disambiguation: Case of Arabic-French Query Translation
We present in this paper, an automatic query translation system in cross-language information retrieval (Arabic-French). For the lexical disambiguation, our system combines between two resources: a bilingual dictionary and a parallel corpus. To select the best translation, our method is based on a correspondence measure between two semantic networks. The first one represents the senses of ambiguous terms of the query. The second one is a semantic network contextually enriched, representing the collection of sentences responding to the query. This collection forms the knowledge base of our disambiguation method and it is obtained by alignment with the relevant sentences in Arabic. The evaluation of the proposed system shows the advantage of the contextual enrichment on the quality of the translation. We obtained a high precision, relatively proportional to the precision provided by the used alignment. Finally, our translation demonstrates its potential by comparing its Bleu score with that of Google translate.</p
On the evaluation and improvement of arabic wordnet coverage and usability
The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-013-9237-0[EN] Built on the basis of the methods developed for Princeton WordNet and EuroWordNet, Arabic WordNet (AWN) has been an interesting project which combines WordNet structure compliance with Arabic particularities. In this paper, some AWN shortcomings related to coverage and usability are addressed. The use of AWN in question/answering (Q/A) helped us to deeply evaluate the resource from an experience-based perspective. Accordingly, an enrichment of AWN was built by semi-automatically extending its content. Indeed, existing approaches and/or resources developed for other languages were adapted and used for AWN. The experiments conducted in Arabic Q/A have shown an improvement of both AWN coverage as well as usability. Concerning coverage, a great amount of named entities extracted from YAGO were connected with corresponding AWN synsets. Also, a significant number of new verbs and nouns (including Broken Plural forms) were added. In terms of usability, thanks to the use of AWN, the performance for the AWN-based Q/A application registered an overall improvement with respect to the following three measures: accuracy (+9.27 % improvement), mean reciprocal rank (+3.6 improvement) and number of answered questions (+12.79 % improvement).The work presented in Sect. 2.2 was done in the framework of the bilateral Spain-Morocco AECID-PCI C/026728/09 research project. The research of the two first authors is done in the framework of the PROGRAMME D'URGENCE project (grant no. 03/2010). The research of the third author is done in the framework of WIQEI IRSES project (grant no. 269180) within the FP 7 Marie Curie People, DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) research project and VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. We would like to thank Manuel Montes-y-Gomez (INAOE-Puebla, Mexico) and Sandra Garcia-Blasco (Bitsnbrain, Spain) for their feedback on the work presented in Sect. 2.4. We would like finally to thank Violetta Cavalli-Sforza (Al Akhawayn University in Ifrane, Morocco) for having reviewed the linguistic level of the entire document.Abouenour, L.; Bouzoubaa, K.; Rosso, P. (2013). On the evaluation and improvement of arabic wordnet coverage and usability. Language Resources and Evaluation. 47(3):891-917. https://doi.org/10.1007/s10579-013-9237-0S891917473Abbès, R., Dichy, J., & Hassoun, M. (2004). The architecture of a standard Arabic lexical database: Some figures, ratios and categories from the DIINAR.1 source program. In Workshop on computational approaches to Arabic script-based languages, Coling 2004. Geneva, Switzerland.Abouenour, L., Bouzoubaa, K., & Rosso, P. (2009a). Structure-based evaluation of an Arabic semantic query expansion using the JIRS passage retrieval system. In Proceedings of the workshop on computational approaches to Semitic languages, E-ACL-2009, Athens, Greece, March.Abouenour, L., Bouzoubaa, K., & Rosso, P. (2009b). Three-level approach for passage retrieval in Arabic question/answering systems. In Proceedings of the 3rd international conference on Arabic language processing CITALAâ09, Rabat, Morocco, May, 2009.Abouenour, L., Bouzoubaa, K., & Rosso, P. (2010a). An evaluated semantic query expansion and structure-based approach for enhancing Arabic question/answering. Special Issue in the International Journal on Information and Communication Technologies/IEEE. June.Abouenour, L., Bouzoubaa, K., & Rosso, P. (2010b). Using the YAGO ontology as a resource for the enrichment of named entities in Arabic WordNet. In Workshop LR & HLT for semitic languages, LRECâ10. Malta. May, 2010.Ahonen-Myka, H. (2002). Discovery of frequent word sequences in text. In Proceedings of the ESF exploratory workshop on pattern detection and discovery (pp. 180â189). London, UK: Springer.Al Khalifa, M., & RodrĂguez, H. (2009). Automatically extending NE coverage of Arabic WordNet using Wikipedia. In Proceedings of the 3rd international conference on Arabic language processing CITALAâ09, May, Rabat, Morocco.Alotaiby, F., Alkharashi, I., & Foda, S. (2009). Processing large Arabic text corpora: Preliminary analysis and results. In Proceedings of the second international conference on Arabic language resources and tools (pp. 78â82), Cairo, Egypt.Baker, C. F., Fillmore, C. J., & Cronin, B. (2003). The structure of the FrameNet database. International Journal of Lexicography, 16(3), 281â296.Baldwin, T., Pool, P., & Colowick, S. M. (2010). PanLex and LEXTRACT: Translating all words of all languages of the world. In Proceedings of Coling 2010, demonstration volume (pp. 37â40), Beijing.Benajiba, Y., Diab, M., & Rosso, P. (2009). Using language independent and language specific features to enhance Arabic named entity recognition. In IEEE transactions on audio, speech and language processing. Special Issue on Processing Morphologically Rich Languages, 17(5), 2009.Benajiba, Y., Rosso, P., & Lyhyaoui, A. (2007). Implementation of the ArabiQA question answering systemâs components. In Proceedings of workshop on Arabic natural language processing, 2nd Information Communication Technologies int. symposium, ICTIS-2007, April 3â5, Fez, Morocco.BenoĂŽt, S., & Darja, F. (2008). Building a free French WordNet from multilingual resources. Workshop on Ontolex 2008, LRECâ08, June, Marrakech, Morocco.Black, W., Elkateb, S., Rodriguez, H, Alkhalifa, M., Vossen, P., Pease, A., et al. (2006). Introducing the Arabic WordNet project. In Proceedings of the third international WordNet conference. Sojka, Choi: Fellbaum & Vossen (eds).Boudelaa, S., & Gaskell, M. G. (2002). A reexamination of the default system for Arabic plurals. Language and Cognitive Processes, 17, 321â343.Brini, W., Ellouze & M., Hadrich, B. L. (2009a). QASAL: Un système de question-rĂŠponse dĂŠdiĂŠ pour les questions factuelles en langue Arabe. In 9th JournĂŠes Scientifiques des Jeunes Chercheurs en GĂŠnie Electrique et Informatique, Tunisia.Brini, W., Trigui, O., Ellouze, M., Mesfar, S., Hadrich, L., & Rosso, P. (2009b). Factoid and definitional Arabic question answering system. In Post-proceedings of NOOJ-2009, June 8â10, Tozeur, Tunisia.Buscaldi, D., Rosso, P., GĂłmez, J. M., & Sanchis, E. (2010). Answering questions with an n-gram based passage retrieval engine. Journal of Intelligent Information Systems, 34(2), 113â134.Costa, R. P., & Seco, N. (2008). Hyponymy extraction and Web search behavior analysis based on query reformulation. In Proceedings of the 11th Ibero-American conference on AI: advances in artificial intelligence (pp. 1â10).Denicia-carral, C., Montes-y-GĂľmez, M., VillaseĂąor-pineda, L., & Hernandez, R. G. (2006). A text mining approach for definition question answering. In Proceedings of the 5th international conference on natural language processing, FinTalâ2006, Turku, Finland.Diab, M. T. (2004). Feasibility of bootstrapping an Arabic Wordnet leveraging parallel corpora and an English Wordnet. In Proceedings of the Arabic language technologies and resources, NEMLAR, Cairo, Egypt.El Amine, M. A. (2009). Vers une interface pour lâenrichissement des requĂŞtes en arabe dans un système de recherche dâinformation. In Proceedings of the 2nd confĂŠrence internationale sur lâinformatique et ses applications (CIIAâ09), May 3â4, Saida, Algeria.Elghamry, K. (2008). Using the web in building a corpus-based hypernymy-hyponymy Lexicon with hierarchical structure for Arabic. In Proceedings of the 6th international conference on informatics and systems, INFOS 2008. Cairo, Egypt.Elkateb, S., Black, W., Vossen, P., Farwell, D., RodrĂguez, H., Pease, A., et al. (2006). Arabic WordNet and the challenges of Arabic. In Proceedings of Arabic NLP/MT conference, London, UK.Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. MA: MIT Press.GarcĂa-Blasco, S., Danger, R., & Rosso, P. (2010). Drugâdrug interaction detection: A new approach based on maximal frequent sequences. Sociedad EspaĂąola para el Procesamiento del Lenguaje Natural, SEPLN, 45, 263â266.GarcĂa-HernĂĄndez, R. A. (2007). Algoritmos para el descubrimiento de patrones secuenciales maximales. Ph.D. Thesis, INAOE. September, Mexico.GarcĂa-HernĂĄndez, R. A., MartĂnez Trinidad, J. F., & Carrasco-ochoa, J. A. (2010). Finding maximal sequential patterns in text document collections and single documents. Informatica, 34(1), 93â101.Goweder, A., & De Roeck, A. (2001). Assessment of a significant Arabic corpus. In Proceedings of the Arabic NLP workshop at ACL/EACL, (pp. 73â79), Toulouse, France.Graff, D. (2007). Arabic Gigaword (3rd ed.). Philadelphia, USA: Linguistic Data Consortium.Graff, D., Kong, J., Chen, K., & Maeda, K. (2007). English Gigaword (3rd ed.). Philadelphia, USA: Linguistic Data Consortium.Hammou, B., Abu-salem, H., Lytinen, S., & Evens, M. (2002). QARAB: A question answering system to support the Arabic language. In Proceedings of the workshop on computational approaches to Semitic languages, ACL, (pp. 55â65), Philadelphia.Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, COLING â92 (vol. 2, pp. 539â545).Kanaan, G., Hammouri, A., Al-Shalabi, R., & Swalha, M. (2009). A new question answering system for the Arabic language. American Journal of Applied Sciences, 6(4), 797â805.Kim, H., Chen, S., & Veale, T. (2006). Analogical reasoning with a synergy of HowNet and WordNet. In Proceedings of GWCâ2006, the 3rd global WordNet conference, January, Cheju, Korea.Kipper-Schuler, K. (2006). VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. Thesis.Mohammed, F. A., Nasser, K., & Harb, H. M. (1993). A knowledge-based Arabic question answering system (AQAS). In ACM SIGART bulletin (pp. 21â33).Niles, I., & Pease, A. (2001). Towards a standard upper ontology. In Proceedings of FOIS-2 (pp. 2â9), Ogunquit, Maine.Niles, I., & Pease, A. (2003). Linking lexicons and ontologies: Mapping WordNet to the suggested upper merged ontology. In Proceedings of the 2003 international conference on information and knowledge engineering, Las Vegas, Nevada.Ortega-Mendoza, R. M., VillaseĂąor-pineda, L., & Montes-y-GĂľmez, M. (2007). Using lexical patterns to extract hyponyms from the Web. In Proceedings of the Mexican international conference on artificial intelligence MICAI 2007. November, Aguascalientes, Mexico. Lecture Notes in Artificial Intelligence 4827. Berlin: Springer.Palmer, M., P. Kingsbury, & D. Gildea. (2005). The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 21. USA: MIT Press.Pantel, P., & Pennacchiotti, M. (2006). Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of conference on computational linguistics association for computational linguistics, (pp. 113â120), Sydney, Australia.Rodriguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M., & MartĂ, A. (2008a). Arabic WordNet: Semi-automatic extensions using Bayesian Inference. In Proceedings of the the 6th conference on language resources and evaluation LREC2008, May, Marrakech, Morocco.Rodriguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M., Mart., M., et al. (2008b). Arabic WordNet: Current state and future extensions. In Proceedings of the fourth global WordNet conference, January 22â25, Szeged, Hungary.Sharaf, A. M. (2009). The Qurâan annotation for text mining. First year transfer report. School of Computing, Leeds University. December.Snow, R., Jurafsky, D., & Andrew, Y. N. (2005). Learning syntactic patterns for automatic hypernym discovery. In Lawrence K. Saul et al. (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.Suchanek, F. M., Kasneci, G., & Weikum, G. (2007). YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In Proceedings of 16th international World Wide Web conference WWWâ2007, (pp. 697â706), May, Banff, Alberta, Canada: ACM Press.Tjong Kim Sang, E., & Hofmann, K. (2007). Automatic extraction of Dutch hypernymâhyponym pairs. In Proceedings of CLIN-2006, Leuven, Belgium.Toral, A., Munoz, R., & Monachini, M. (2008). Named entity WordNet. In Proceedings of the Sixth international conference on language resources and evaluation (LRECâ08), Marrakech, Morocco.Vossen, P. (Ed.). (1998). EuroWordNet, a multilingual database with lexical semantic networks. The Netherlands: Kluwer.Wagner, A. (2005). Learning thematic role relations for lexical semantic nets. Ph.D. Thesis, University of TĂźbingen, 2005
Systematic Literature Review on Ontology-based Indonesian Question Answering System
Question-Answering (QA) systems at the intersection of natural language processing, information retrieval, and knowledge representation aim to provide efficient responses to natural language queries. These systems have seen extensive development in English and languages like Indonesian present unique challenges and opportunities. This literature review paper delves into the state of ontology-based Indonesian QA systems, highlighting critical challenges. The first challenge lies in sentence understanding, variations, and complexity. Most systems rely on syntactic analysis and struggle to grasp sentence semantics. Complex sentences, especially in Indonesian, pose difficulties in parsing, semantic interpretation, and knowledge extraction. Addressing these linguistic intricacies is pivotal for accurate responses. Secondly, template-based SPARQL query construction, commonly used in Indonesian QA systems, suffers from semantic gaps and inflexibility. Advanced techniques like semantic matching algorithms and dynamic template generation can bridge these gaps and adapt to evolving ontologies. Thirdly, lexical gaps and ambiguity hinder QA systems. Bridging vocabulary mismatches between user queries and ontology labels remains a challenge. Strategies like synonym expansion, word embedding, and ontology enrichment must be explored further to overcome these challenges. Lastly, the review discusses the potential of developing multi-domain ontologies to broaden the knowledge coverage of QA systems. While this presents complex linguistic and ontological challenges, it offers the advantage of responding to various user queries across various domains. This literature review identifies crucial challenges in developing ontology-based Indonesian QA systems and suggests innovative approaches to address these challenges
towards an optimal solution to lemmatization in arabic
Abstract Lemmatizationâcomputing the canonical forms of words in running textâis an important component in any NLP system and a key preprocessing step for most applications that rely on natural language understanding. In the case of Arabic, lemmatization is a complex task because of the rich morphology, agglutinative aspects, and lexical ambiguity due to the absence of short vowels in writing. In this paper, we introduce a new lemmatizer tool that combines a machine-learning-based approach with a lemmatization dictionary, the latter providing increased accuracy, robustness, and flexibility to the former. Our evaluations yield a performance of over 98% for the entire lemmatization pipeline. The lemmatizer tools are freely downloadable for private and research purposes
- âŚ