    A Rules Based System for Named Entity Recognition in Modern Standard Arabic

    The amount of textual information available electronically has made it difficult for many users to find and access the right information within acceptable time. Research communities in the natural language processing (NLP) field are developing tools and techniques to alleviate these problems and help users in exploiting these vast resources. These techniques include Information Retrieval (IR) and Information Extraction (IE). The work described in this thesis concerns IE and more specifically, named entity extraction in Arabic. The Arabic language is of significant interest to the NLP community mainly due to its political and economic significance, but also due to its interesting characteristics. Text usually contains all kinds of names such as person names, company names, city and country names, sports teams, chemicals and lots of other names from specific domains. These names are called Named Entities (NE) and Named Entity Recognition (NER), one of the main tasks of IE systems, seeks to locate and classify automatically these names into predefined categories. NER systems are developed for different applications and can be beneficial to other information management technologies as it can be built over an IR system or can be used as the base module of a Data Mining application. In this thesis we propose an efficient and effective framework for extracting Arabic NEs from text using a rule based approach. Our approach makes use of Arabic contextual and morphological information to extract named entities. The context is represented by means of words that are used as clues for each named entity type. Morphological information is used to detect the part of speech of each word given to the morphological analyzer. Subsequently we developed and implemented our rules in order to recognise each position of the named entity. Finally, our system implementation, evaluation metrics and experimental results are presented.
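    The idea of clue words combined with part-of-speech information can be illustrated with a minimal sketch. It is not the thesis's rule set: the clue lists, the `NNP` tag name, the single rule, and the function `tag_entities` are all assumptions made for illustration, and the part-of-speech tags are supplied by hand rather than by a morphological analyzer.

```python
# Hypothetical clue (trigger) words per entity type; the thesis's actual
# Arabic clue lists are not given in the abstract.
CLUES = {
    "PERSON": {"السيد", "الدكتور", "الرئيس"},   # "Mr.", "Dr.", "President"
    "LOCATION": {"مدينة", "دولة"},              # "city", "state"
    "ORGANIZATION": {"شركة", "جامعة"},          # "company", "university"
}


def tag_entities(tokens, pos_tags):
    """Return (entity_text, entity_type) pairs found by one clue-word rule.

    Illustrative rule (an assumption): a clue word followed by one or more
    tokens tagged as proper nouns ("NNP") marks an entity of the clue's type.
    """
    entities = []
    i = 0
    while i < len(tokens):
        # Find the entity type whose clue list contains the current token.
        etype = next((t for t, words in CLUES.items() if tokens[i] in words), None)
        if etype is not None:
            j = i + 1
            # Extend the entity over the run of proper nouns after the clue.
            while j < len(tokens) and pos_tags[j] == "NNP":
                j += 1
            if j > i + 1:
                entities.append((" ".join(tokens[i + 1:j]), etype))
                i = j
                continue
        i += 1
    return entities


# "Dr. Ahmed Ali visited the city of Dubai" (tags assigned by hand).
tokens = ["زار", "الدكتور", "أحمد", "علي", "مدينة", "دبي"]
pos_tags = ["VBD", "NN", "NNP", "NNP", "NN", "NNP"]
print(tag_entities(tokens, pos_tags))
```

    Here the clue word fixes the entity type and the part-of-speech tags decide where the entity ends, which is one simple way to read "recognise each position of the named entity".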

    Semantic sentence similarity incorporating linguistic concepts

    A natural language allows a set of simpler ideas to be combined together to communicate much more complex ideas. This ability gives language the potential for use as a highly intuitive method of human interaction. However, this freedom of expression makes interpreting language with automation extremely challenging. Semantic sentence similarity is an approach which allows the knowledge of how to compare simpler units, such as words, to obtain a measure of similarity between two sentences. This similarity can allow existing knowledge to be applied to new situations. The objective of this research is to show that a sentence similarity model can be improved through the inclusion of Linguistic concepts, with the aim of producing a more accurate model. This presents the challenge of adapting the human focused rules of Linguistics for sentence similarity and how to evaluate individual component effects in isolation. This research successfully overcame these barriers through the development of an extensible modular framework and construction of a new mathematical model for this framework, called SARUMAN. The core contribution of the research resulted from gradually incorporating fundamental Linguistic components into SARUMAN, including: disambiguation by part of speech; treating the sentence as clauses; and advanced word interaction to handle where meanings merge. The most advanced of these models is called SCAWIT. In experiments on a small data set, each of these introduced concepts showed a statistically significant improvement in Pearson's correlation (0.05 or more) over the previous version. The produced models were capable of processing several hundred sentence pairs a second with a single processor. A further significant advance to the field of sentence similarity was the introduction of opposites to sentence similarity. This was conceptually beyond the pre-existing models and showed strong results for an extension of SCAWIT, called SANO. Other novel contributions were added through automated word sense disambiguation from WordNet definitions and the use of a properties of words model. Some of these changes have potential but did not yield significant improvement with the current knowledge base.
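    As a rough illustration of building sentence similarity from word-level comparisons and scoring a model by Pearson's correlation against human ratings, consider the sketch below. It is a generic baseline, not SARUMAN, SCAWIT, or SANO: the toy `WORD_SIM` table, the best-match averaging scheme, and all function names are assumptions.

```python
from math import sqrt

# Toy word-to-word similarity scores (hypothetical values; a real system
# might derive them from a knowledge base such as WordNet).
WORD_SIM = {
    frozenset(["car", "automobile"]): 0.9,
    frozenset(["boy", "lad"]): 0.8,
}


def word_similarity(a, b):
    """Similarity of two words in [0, 1]; identical words score 1.0."""
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset([a, b]), 0.0)


def sentence_similarity(s1, s2):
    """Symmetric average of each word's best match in the other sentence."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0

    def directed(xs, ys):
        return sum(max(word_similarity(x, y) for y in ys) for x in xs) / len(xs)

    return (directed(w1, w2) + directed(w2, w1)) / 2


def pearson(xs, ys):
    """Pearson's correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(sentence_similarity("the car", "the automobile"))
```

    A model would be evaluated by calling `pearson(model_scores, human_scores)` over a set of rated sentence pairs; comparing that value between two model versions is how an improvement "in Pearson's correlation" can be measured.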