thesis

A hybrid NLP & semantic knowledgebase approach for the intelligent exploration of Arabic documents

Abstract

In the contemporary era, a colossal amount of information is published daily on the Web in the form of articles, documents, reviews, blogs and social media posts. As most of this data is available in the form of unstructured documents, it makes it challenging and timeconsuming to extract non-trivial, previously unknown, and potentially useful knowledge from the published documents. Hence, extracting useful knowledge from unstructured text, i.e., Information Extraction, is becoming an increasingly significant aspect of knowledge discovery. This work focuses on Information Extraction form Arabic unstructured text, which is an especially challenging task as Arabic is a highly inflectional and derivational language. The problem is compounded by the lack of mature tools and advanced research in Arabic Natural Language Processing (NLP) in comparison to European languages for instance. The principal objective of this research work is presenting a comprehensive methodology for integrating domain knowledge with Natural Language Processing techniques that were proven effective in solving most classification problems in order to improve the Information extraction process form online unstructured data. The importance of NLP tools lies in that they play a key role in allowing semantic concept tagging of unstructured text, and so realize the Semantic Web. This work presents a novel rule-based approach that uses linguistic grammar-based techniques to extract Arabic composite names from Arabic text. Our approach uniquely exploits the genitive Arabic grammar rules; in particular, the rules regarding the identification of definite nouns (معرفة) and indefinite nouns (نكرة) to support the process of extracting composite names. Furthermore, this approach does not place any constraints on the length of the Arabic composite name. The results of our experiments show that there are improvement in recognizing Arabic composite names entity in the Arabic language text. Our research also contributes a novel, knowledge-based approach to relation extraction from unstructured Arabic text, which is based on the principles of Functional Discourse Grammar (FDG). We further improve the approach by integrating it with Machine Learning relation classification, resulting in a hybrid relation extraction algorithm that can handle especially complex Arabic sentence structures. The accuracy of our relation classification efforts was extensively evaluated by means of experimental evaluation that evidenced the accuracy of the FDG relation extraction approach and the improvement gained by the Machine Learning integration. The essential NLP algorithms of entity recognition and relation extraction were deployed in a Semantic Knowledge-base that was built from the outset to model the knowledge of the problem domain. The semantic modelling of the knowledgebase aided improving the accuracy of the NLP algorithms by leveraging relevant domain knowledge published in Open Linked Datasets. Moreover, the extracted information was semantically tagged and inserted into the Semantic Knowledge-base, which facilitated building advanced rules to infer new interesting information from the extracted knowledge as well as utilising advanced query mechanisms for intelligently exploring the mined problem domain knowledge

    Similar works