7,227 research outputs found

    The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification

    This research presents and compares the impact of text preprocessing, which has not been addressed before, on Arabic text classification using popular text classification algorithms: Decision Tree, K Nearest Neighbors, Support Vector Machines, and Naïve Bayes and its variations. Text preprocessing includes applying different term weighting schemes and Arabic morphological analysis (stemming and light stemming). We implemented and integrated Arabic morphological analysis tools within the leading open source machine learning tools, Weka and RapidMiner. Text classification algorithms are applied to seven Arabic corpora (3 collected in-house and 4 existing corpora). Experimental results show: (1) Light stemming with term pruning is the best feature reduction technique. (2) Support Vector Machines and Naïve Bayes variations outperform the other algorithms. (3) Weighting schemes impact the performance of the distance-based classifier.
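
    As a rough illustration of what a term weighting scheme does, the sketch below computes TF-IDF weights over tokenized documents. The abstract does not say which schemes were compared, so TF-IDF is an assumption here, and the function name `tf_idf` and the transliterated toy tokens are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = (term count / document length) * log(N / document frequency).
    This is one common TF-IDF variant, chosen here for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weighted

# Hypothetical toy corpus of already-stemmed, transliterated tokens.
docs = [
    ["iqtisad", "suq", "suq"],
    ["riyada", "suq"],
    ["riyada", "mubarah"],
]
weights = tf_idf(docs)
```

    A term that occurs in every document receives weight zero under this variant, which is why the choice of scheme can change the distances a nearest-neighbour classifier sees.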

    The classification of the modern Arabic poetry using machine learning

    In recent years, work on text classification and analysis of Arabic texts using machine learning has seen some progress, but most of this research has not focused on Arabic poetry. Because Arabic poetry is difficult to analyze, the work required the use of standard Arabic, on which “Al Arud”, the science of studying poetry, is based. This paper presents an approach that uses machine learning to classify modern Arabic poetry into four types: love poems, Islamic poems, social poems, and political poems. Each of these types usually has features that indicate the class of the poem. Despite the challenges posed by the difficulty of the rules of the Arabic language on which this classification depends, we propose a new automatic method of classifying modern Arabic poems to address these issues. The recommended method is suitable for the above-mentioned classes of poems. This study used Naïve Bayes, Support Vector Machines, and Linear Support Vector for the classification processes. Data preprocessing was an important step of the approach in this paper, as it increased the accuracy of the classification.

    Free-text keystroke dynamics authentication for Arabic language

    This study introduces an approach for user authentication using free-text keystroke dynamics which incorporates text in the Arabic language. The Arabic language has completely different characteristics to those of English. The approach followed in this study involves the use of the keyboard's key-layout. The method extracts timing features from specific key-pairs in the typed text. Decision trees were used to classify each of the users' data. In parallel, for comparison, support vector machines were also used for classification in association with an ant colony optimisation feature selection technique. The results obtained from this study are encouraging, as low false accept rates and false reject rates were achieved in the experimentation phase. This signifies that satisfactory overall system performance was achieved by using the typing attributes in the proposed approach while typing Arabic text.
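
    To make the idea of timing features from key-pairs concrete, here is a minimal sketch that averages the press-to-press latency for each consecutive pair of keys. The abstract does not specify which timings or key-pairs were used, nor how the key-layout enters, so the event format, the latency definition, and the name `digraph_latencies` are all assumptions.

```python
from collections import defaultdict
from statistics import mean

def digraph_latencies(events):
    """Compute the mean press-to-press latency per consecutive key pair.

    events: list of (key, press_time_ms, release_time_ms) in typing order.
    Returns {(key_a, key_b): mean latency in ms}. Release times are carried
    in the input but not used by this particular feature.
    """
    samples = defaultdict(list)
    for (k1, p1, _), (k2, p2, _) in zip(events, events[1:]):
        samples[(k1, k2)].append(p2 - p1)
    return {pair: mean(vals) for pair, vals in samples.items()}

# Hypothetical keystroke log: key label, press time, release time (ms).
events = [("a", 0, 80), ("b", 120, 190), ("a", 300, 370), ("b", 400, 460)]
features = digraph_latencies(events)
```

    The resulting per-pair averages could then serve as the feature vector handed to a classifier such as a decision tree.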

    New techniques for Arabic document classification

    Text classification (TC) concerns automatically assigning a class (category) label to a text document, and has increasingly many applications, particularly in organizing large document collections for browsing. It is typically achieved via machine learning, where a model is built on the basis of a typically large collection of document features. Feature selection is critical in this process, since there are typically several thousand potential features (distinct words or terms). In text classification, feature selection aims to improve the computational efficiency and classification accuracy by removing irrelevant and redundant terms (features), while retaining features (words) that contain sufficient information to help with the classification task. This thesis proposes binary particle swarm optimization (BPSO) hybridized with either K Nearest Neighbour (KNN) or Support Vector Machines (SVM) for feature selection in Arabic text classification tasks. Feature selection approaches are compared by using the selected features in conjunction with SVM, Decision Trees (C4.5), and Naive Bayes (NB) to classify a hold-out test set. Using publicly available Arabic datasets, results show that BPSO/KNN and BPSO/SVM techniques are promising in this domain. The sets of selected features (words) are also analyzed to consider the differences between the types of features that BPSO/KNN and BPSO/SVM tend to choose. This leads to speculation concerning the appropriate feature selection strategy, based on the relationship between the classes in the document categorization task at hand. The thesis also investigates the use of statistically extracted phrases of length two as terms in Arabic text classification. In comparison with the Bag of Words text representation, results show that using phrases alone as terms in an Arabic TC task decreases the classification accuracy of Arabic TC classifiers significantly, while combining bag of words and phrase-based representations may increase the classification accuracy of the SVM classifier slightly.

    Multi-Category Support Vector Machines for Identifying Arabic Topics

    It is known that Support Vector Machines were designed for binary classification. Nevertheless, it would be fruitful to extend this operation to what is called multi-category classification. For this reason, Multi-category Support Vector Machines (MSVM) are now the subject of considerable research aiming at high performance on multi-category classification tasks. This technique has recently been assessed in fields such as text categorization and cancer classification. We note that experiments conducted so far using MSVM are limited to small data sets, since its computation is more expensive. In this paper, we are interested in the use of this method, for the first time, in topic identification. The experiments conducted concern topic identification in the Arabic language. The corpora are extracted from the ALWATAN newspaper. The results show that MSVM improves on the baseline SVM method. Nevertheless, SVM still outperforms MSVM when using larger vocabulary sizes.

    KACST Arabic Text Classification Project: Overview and Preliminary Results

    Electronically formatted Arabic free-texts can be found in abundance these days on the World Wide Web, often linked to commercial enterprises and/or government organizations. Vast tracts of knowledge and relations lie hidden within these texts, knowledge that can be exploited once the correct intelligent tools have been identified and applied. For example, text mining may help with text classification and categorization. Text classification aims to automatically assign text to a predefined category based on identifiable linguistic features. Such a process has various useful applications including, but not restricted to, e-mail spam detection, web page content filtering, and automatic message routing. This paper gives an overview of the King Abdulaziz City for Science and Technology (KACST) Arabic Text Classification Project, along with some preliminary results. This project will contribute to the better understanding and elaboration of Arabic text classification techniques.