7 research outputs found

    Can Multinomial Logistic Regression Predicts Research Group using Text Input?

    Get PDF
    While submitting proposals in SISINTA, students often confuse or falsely submit their proposals to the less relevant or incorrect research group. There are 13 research groups for the students to choose from. We proposed a text classification method to help students find the best research group based on the title and/or abstract. The stages in this study include data collection, preprocessing data, classification using Logistic Regression, and evaluation of the results. Three scenarios in research group classification are based on 1) title only, 2) abstract only, and 3) title and abstract. Based on the experiments, research group classification using title-only input is the best overall. This scenario gets the most optimal results with accuracy, precision, recall, and f1-score successively at 63.68%, 64.91%, 63.68%, and 63.46%. This result is sufficient to help students find the best research group based on the text titles. In addition, lecturers can comment more elaborately since the proposals are relevant to the research groupā€™s scope

    An improved Arabic text classification method using word embedding

    Get PDF
    Feature selection (FS) is a widely used method for removing redundant or irrelevant features to improve classification accuracy and decrease the modelā€™s computational cost. In this paper, we present an improved method (referred to hereafter as RARF) for Arabic text classification (ATC) that employs the term frequency-inverse document frequency (TF-IDF) and Word2Vec embedding technique to identify words that have a particular semantic relationship. In addition, we have compared our method with four benchmark FS methods namely principal component analysis (PCA), linear discriminant analysis (LDA), chi-square, and mutual information (MI). Support vector machine (SVM), k-nearest neighbors (K-NN), and naive Bayes (NB) are three machine learning based algorithms used in this work. Two different Arabic datasets are utilized to perform a comparative analysis of these algorithms. This paper also evaluates the efficiency of our method for ATC on the basis of performance metrics viz accuracy, precision, recall, and F-measure. Results revealed that the highest accuracy achieved for the SVM classifier applied to the Khaleej-2004 Arabic dataset with 94.75%, while the same classifier recorded an accuracy of 94.01% for the Watan-2004 Arabic dataset

    Analisis Sentimen Kemungkinan Depresi dan Kecemasan pada Twitter Menggunakan Support Vector Machine

    Get PDF

    Enhanced text stemmer for standard and non-standard word patterns in Malay texts

    Get PDF
    Text stemming is a useful language preprocessing tool in the field of information retrieval, text classification and natural language processing. A text stemmer is a computer program that removes affixes, clitics and particles to obtain the root words from the derived words. Over the past few years, few text stemmers have been developed for the Malay language but unfortunately, these text stemmers suffer from various stemming errors. It is due to the difficulty in dealing with the complexity of the Malay language morphological rules. These text stemmers are developed for text stemming against affixation words only whereas there are other affixation, reduplication and compounding words in the Malay language. Furthermore, none of these text stemmers has been developed for text stemming against social media texts which comprise of the non-standard derived words. Therefore, this research study aims to improve the existing text stemmers capability of stemming affixation, reduplication and compounding words while minimising the possible stemming errors. Moreover, this research study also aims to address text stemming process for non-standard derived words on the social media platforms by removing non-standard affixes, clitics and particles. This research study adopts a multiple text stemming approach that use affix removal method and dictionary lookup in specific arrangement order to correctly stem standard and non-standard affixation, reduplication and compounding words in the standard texts and social media texts. The proposed text stemmer is evaluated against various text documents using the direct evaluation method and the text classification is used as the indirect evaluation method to validate the effectiveness of the proposed enhanced text stemmer. In general, the proposed enhanced text stemmer outperforms the baseline text stemmer. The stemming accuracy of the proposed enhanced text stemmer achieves an average of 98.7% against the standard texts and an average of 73.7% against the social media texts. Meanwhile, the performance of the proposed enhanced text stemmer in the sports news classification application achieves an average of 85% accuracy and the illicit content classification application achieves an average of 75% accuracy. Meanwhile, the baseline text stemmer achieves an average of 63.5% stemming accuracy against the standard texts but unfortunately, it is unable to stem non-standard derived words in the social media texts. The baseline text stemmer performs poorly in sports news classification and illicit content classification with an average accuracy of 78% and 63% respectively. In short, the experimental results suggest that the proposed enhanced text stemmer has promising stemming accuracy for text stemming against the standard texts and social media texts. It also influences the performance of the text classification application

    A Study of the Effects of Stemming Strategies on Arabic Document Classification

    No full text
    corecore