
    Automatic Arabic Domain-Relevant Term Extraction

    Term extraction from a text corpus is an important step in knowledge acquisition, and it is the first step in many Natural Language Processing (NLP) methods and computational linguistics systems. For Arabic, there is some prior work on term extraction, but little of it attempts to extract domain-relevant terms. This research proposes a model for automatic Arabic domain-relevant term extraction from a text corpus. The proposed model uses a hybrid approach that combines linguistic and statistical methods to extract terms relevant to specific domains, based on a prevalence-and-tendency term-ranking mechanism. To realize the model, a multi-domain corpus divided into 10 domains (Economics, History, Education and Family, Religion and Fatwas, Sport, Health, Astronomy, Law, Stories, and Cooking Recipes) was used. The corpus was preprocessed by removing non-Arabic letters, punctuation, diacritics, and stop words. A vector of candidate terms was then extracted using a variable-length sliding window, dropping any window that contains a stop word. Candidate terms were ranked using the Termhood method, a statistical method that measures the distributional behavior of candidate terms within a domain and across the rest of the corpus. Each candidate term was then assigned to the domain in which it received the highest rank, yielding a domain-term matrix. This matrix was used in a simple classifier to classify a testing corpus. The resulting confusion matrix indicates that the domain-term matrix performs well as a classifier, achieving an accuracy of 100% for some domains and very good accuracy for the others. The overall accuracy of the classifier was 95%.
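
The pipeline described in the abstract (preprocessing, sliding-window candidate extraction, per-domain ranking, and classification with a domain-term matrix) can be illustrated with a minimal Python sketch. All function names, the tiny stop-word list, and the maximum window length are hypothetical. The score used here is a simple stand-in (relative frequency in the domain, weighted by how strongly the term leans toward that domain) and is not the Termhood formula from the paper, which the abstract does not give. It also assumes that stop words are kept as window boundaries until candidate extraction.

```python
import re
from collections import Counter

# Arabic diacritics (U+064B..U+0652) plus tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is not an Arabic letter (U+0621..U+064A) or whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# Tiny illustrative stop-word list (assumption, not the paper's list).
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "و"}


def preprocess(text):
    """Strip diacritics, then replace non-Arabic characters with spaces."""
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    return text.split()


def candidate_terms(tokens, max_len=3):
    """Slide windows of length 1..max_len; drop windows with a stop word."""
    out = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if not any(tok in STOP_WORDS for tok in window):
                out.append(" ".join(window))
    return out


def score_terms(corpus):
    """corpus: {domain: [documents]} -> {domain: {term: score}}.

    Stand-in score: prevalence (relative frequency inside the domain)
    times tendency (share of that rate versus the rest of the corpus).
    """
    counts = {
        d: Counter(t for doc in docs for t in candidate_terms(preprocess(doc)))
        for d, docs in corpus.items()
    }
    totals = {d: sum(c.values()) or 1 for d, c in counts.items()}
    scores = {}
    for d, c in counts.items():
        rest_total = sum(totals[o] for o in counts if o != d) or 1
        scores[d] = {}
        for term, f in c.items():
            rest_f = sum(counts[o][term] for o in counts if o != d)
            prevalence = f / totals[d]
            tendency = prevalence / (prevalence + rest_f / rest_total)
            scores[d][term] = prevalence * tendency
    return scores


def assign_terms(scores):
    """Give each term to the domain where it scores highest."""
    best = {}
    for d, terms in scores.items():
        for term, s in terms.items():
            if term not in best or s > best[term][1]:
                best[term] = (d, s)
    return best


def classify(text, best):
    """Sum the scores of known terms per domain; return the top domain."""
    votes = Counter()
    for term in candidate_terms(preprocess(text)):
        if term in best:
            domain, s = best[term]
            votes[domain] += s
    return votes.most_common(1)[0][0] if votes else None
```

With a toy corpus such as `{"sport": ["كرة القدم مباراة"], "cooking": ["طبخ الأرز ملح"]}`, calling `classify("مباراة كرة القدم", assign_terms(score_terms(corpus)))` would pick the `sport` domain, since every known term in the text belongs to it. A real evaluation would tabulate predictions against true labels in a confusion matrix, as the abstract describes.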