8 research outputs found

    Reduksi Dimensi Fitur Menggunakan Algoritma Aloft Untuk Pengelompokan Dokumen

    Full text link
    Document clustering still faces the challenge that the larger the document collection, the more features it produces, which leads to high dimensionality and can degrade the performance of clustering algorithms. One way to address this problem is dimensionality reduction. Dimensionality reduction methods such as feature selection with filter methods have been used for document clustering, but filter methods depend heavily on user input to choose the top n features from the whole collection. The ALOFT (At Least One FeaTure) algorithm can generate a feature set automatically, without any input parameter from the user. However, because ALOFT was previously used for document classification, the filter methods applied in it require class labels, so they cannot be used for document clustering. This study proposes a feature dimensionality reduction method that applies a variety of filter methods within the ALOFT algorithm for document clustering. Before dimensionality reduction, the first step is preprocessing, followed by computing TF-IDF weights. Dimensionality reduction is then performed with filter methods such as Document Frequency (DF), Term Contribution (TC), Term Variance Quality (TVQ), Term Variance (TV), Mean Absolute Difference (MAD), Mean Median (MM), and Arithmetic Mean Geometric Mean (AMGM), after which the final feature set is selected with the ALOFT algorithm. The last stage is document clustering with two different clustering methods, k-means and Hierarchical Agglomerative Clustering (HAC). The experiments show that the cluster quality produced by the proposed method with the k-means algorithm improves on the results of the variable ranking (VR) method.
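
    The filter methods listed above are per-term statistics computed over the TF-IDF matrix. The snippet below is a minimal sketch, not the paper's code: it computes four of them (DF, TV, MAD, and MM) using common textbook definitions, which are assumptions since the abstract does not spell out the formulas, on an invented toy corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus; the study uses preprocessed Indonesian documents.
docs = ["reduksi dimensi fitur untuk pengelompokan dokumen",
        "pengelompokan dokumen dengan metode clustering",
        "seleksi fitur dengan metode filter",
        "bobot tfidf untuk dokumen teks"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()          # rows: documents, columns: terms

mean = X.mean(axis=0)
df  = (X > 0).sum(axis=0)                      # Document Frequency
tv  = ((X - mean) ** 2).sum(axis=0)            # Term Variance
mad = np.abs(X - mean).mean(axis=0)            # Mean Absolute Difference
mm  = np.abs(mean - np.median(X, axis=0))      # Mean-Median

for t, a, b, c, d in zip(vec.get_feature_names_out(), df, tv, mad, mm):
    print(f"{t:15s} DF={a} TV={b:.4f} MAD={c:.4f} MM={d:.4f}")
```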

    A study of feature extraction techniques for classifying topics and sentiments from news posts

    Get PDF
    Recently, many news channels have their own Facebook pages on which news posts are released on a daily basis. These posts contain temporal opinions about social events that may change over time due to external factors, and they can also serve as a monitor of significant events happening around the world. As a result, much text mining research has been conducted in the area of Temporal Sentiment Analysis, one of whose most challenging tasks is to detect and extract the key features from news posts that arrive continuously over time. Extracting these features is difficult because of the posts' complex properties, and because posts about a specific topic may grow or vanish over time, producing imbalanced datasets. This study therefore carries out a comparative analysis of feature extraction techniques (TF-IDF, TF, BTO, IG, Chi-square) with three different n-gram features (unigram, bigram, trigram), using SVM as the classifier. The aim is to discover the optimal Feature Extraction Technique (FET) that achieves the best accuracy for both topic and sentiment classification. The analysis is conducted on three news channels' datasets. The experimental results for topic classification show that Chi-square with unigrams is the best FET compared to the other techniques. Furthermore, to overcome the problem of imbalanced data, the study combines the best FET with an oversampling technique. The evaluation shows an improvement in classifier performance, with accuracies of 93.37%, 92.89%, and 91.92% for BBC, Al-Arabiya, and Al-Jazeera, respectively, compared with the results obtained on the original datasets. The same combination (Chi-square + unigram) is used for sentiment classification and obtains accuracies of 81.87%, 70.01%, and 77.36%. However, testing the identified optimal FET on unseen, randomly selected news posts yields relatively low accuracies for both topic and sentiment classification, due to the changes of topics and sentiments over time.
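
    As a rough illustration of the kind of pipeline the study compares (n-gram features, chi-square feature selection, an SVM classifier), here is a hedged scikit-learn sketch; the posts, labels, and the value of k are invented placeholders, and the oversampling step the authors add for imbalanced data is omitted.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

# Toy posts and topic labels standing in for the Facebook news datasets.
posts = ["government announces new economic policy",
         "stock markets react to the economic news",
         "local team wins the championship final",
         "star striker injured before the derby"]
topics = ["politics", "politics", "sports", "sports"]

clf = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 1))),  # unigram term features
    ("chi2",   SelectKBest(chi2, k=5)),               # top-k terms by chi-square
    ("svm",    LinearSVC()),
])
clf.fit(posts, topics)
print(clf.predict(["another economic policy announcement"]))
```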

    Improved relative discriminative criterion using rare and informative terms and ringed seal search-support vector machine techniques for text classification

    Get PDF
    Classification has become an important task for automatically assigning documents to their respective categories. For text classification, feature selection techniques are normally used to identify important features and to remove irrelevant and noisy features, minimizing the dimensionality of the feature space. These techniques are expected, in particular, to improve the efficiency, accuracy, and comprehensibility of classification models in text labeling problems. Most feature selection techniques use document and term frequencies to rank a term. Existing feature selection techniques (e.g. RDC, NRDC) consider frequently occurring terms and ignore the counts of rarely occurring terms in a class. This study proposes the Improved Relative Discriminative Criterion (IRDC) technique, which takes the counts of rarely occurring terms into account, arguing that rarely occurring terms are as meaningful and important as frequently occurring terms in a class. The proposed IRDC is compared to the most recent feature selection techniques, RDC and NRDC. The results reveal significant improvements by the proposed IRDC for feature selection: precision by 27%, recall by 30%, macro-average by 35%, and micro-average by 30%. Additionally, this study proposes a hybrid algorithm named Ringed Seal Search-Support Vector Machine (RSS-SVM) to improve the generalization and learning capability of the SVM. The proposed RSS-SVM optimizes the kernel and penalty parameters with the help of the RSS algorithm and is compared to the most recent techniques, GA-SVM and CS-SVM. The results show significant improvements by the proposed RSS-SVM for classification: accuracy by 18.8%, recall by 15.68%, precision by 15.62%, and specificity by 13.69%. In conclusion, the proposed IRDC performs better than existing techniques because of its ability to consider rare and informative terms, and the proposed RSS-SVM performs better than existing techniques because of its ability to improve the balance between exploration and exploitation.
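
    The RSS-SVM above tunes the SVM penalty parameter C and the RBF kernel parameter gamma with the Ringed Seal Search metaheuristic. As a stand-in only, the sketch below tunes the same two parameters with an ordinary grid search; the toy data, the parameter grid, and the search strategy are illustrative assumptions and not the proposed RSS algorithm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Invented two-class corpus standing in for a benchmark text dataset.
docs = ["cheap flights and hotel deals", "book your holiday hotel now",
        "discount flights to europe", "last minute hotel offers",
        "football match ends in a draw", "team wins the league title",
        "coach praises the young striker", "injury rules out the goalkeeper"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X = TfidfVectorizer().fit_transform(docs)
search = GridSearchCV(SVC(kernel="rbf"),                 # kernel parameter: gamma
                      param_grid={"C": [0.1, 1, 10],     # penalty parameter
                                  "gamma": [0.01, 0.1, 1]},
                      cv=2)
search.fit(X, labels)
print(search.best_params_, round(search.best_score_, 3))
```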

    An efficient Bayes classifier for word classification: an application on the EU Recovery and Resilience Plans

    Get PDF
    This paper proposes the Prior Adaptive Bayes (PAB) classifier, a new algorithm to assign words appearing in a text to their respective topics. It is an adaptation of the Bayes classifier in which the posterior probabilities of the classes computed for the adjacent words are used as the prior probabilities. Simulations show an improvement of more than 20% over the standard Bayes classifier. The PAB classifier is applied to the Recovery and Resilience Plans (RRPs) of the 27 European Union member states to evaluate their alignment with the environmental dimension of the Sustainable Development Goals (SDGs) as compared to the socioeconomic one. Results show that the attention paid by the countries to the pro-environment SDGs increases with the funds per capita assigned, the gap in environmental endowment, and touristic attractiveness. Finally, the environmental dimension appears positively associated with the available GDP growth projections for the next few years.
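
    The abstract describes the PAB mechanism only at a high level: the posterior probabilities computed for the adjacent word become the prior for the current word. The toy sketch below illustrates that mechanism with an invented two-topic likelihood table; it is one reading of the abstract, not the authors' implementation.

```python
import numpy as np

# Invented topics and word likelihoods P(word | topic); a real application
# would estimate these from labelled text such as the RRP documents.
topics = ["environment", "socioeconomic"]
likelihood = {
    "renewable":  np.array([0.8, 0.2]),
    "energy":     np.array([0.6, 0.4]),
    "employment": np.array([0.2, 0.8]),
}

def pab_classify(words):
    """Assign a topic to each word; the prior for each word is the posterior
    computed for the previous (adjacent) word, starting from a uniform prior."""
    prior = np.ones(len(topics)) / len(topics)
    assignments = []
    for w in words:
        lik = likelihood.get(w, np.ones(len(topics)))  # unseen word: flat likelihood
        posterior = prior * lik
        posterior /= posterior.sum()
        assignments.append(topics[int(np.argmax(posterior))])
        prior = posterior                              # adaptive prior for next word
    return assignments

print(pab_classify(["renewable", "energy", "employment"]))
```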

    Techniques for text classification: Literature review and current trends

    Get PDF
    Automated classification of text into predefined categories has always been considered a vital method for managing and processing the vast amount of documents in digital form that are widespread and continuously increasing. This kind of web information, popularly known as digital/electronic information, takes the form of documents, conference material, publications, journals, editorials, web pages, e-mail, etc. People largely access information from these online sources rather than being limited to traditional paper sources such as books, magazines, and newspapers. The main problem is that this enormous amount of information lacks organization, which makes it difficult to manage. Text classification is recognized as one of the key techniques for organizing such digital data. In this paper we study the existing work in the area of text classification, which allows a fair evaluation of the progress made in this field to date. We have investigated the papers to the best of our knowledge and have tried to summarize all existing information in a comprehensive and succinct manner. The studies are summarized in tabular form by publication year, considering numerous key perspectives. The main emphasis is laid on the steps involved in the text classification process, viz. document representation methods, feature selection methods, data mining methods, and the evaluation technique used by each study to report results on a particular dataset.
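
    To make the surveyed steps concrete, the sketch below wires them into one scikit-learn pipeline: a document representation, a feature selection step, a data mining method, and an evaluation technique. The corpus, labels, and chosen components are placeholders, not drawn from any of the reviewed studies.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer  # document representation
from sklearn.feature_selection import SelectKBest, chi2       # feature selection
from sklearn.naive_bayes import MultinomialNB                 # data mining method
from sklearn.model_selection import cross_val_score           # evaluation technique

docs = ["interest rates rise again", "central bank holds rates steady",
        "new vaccine trial begins", "hospital reports flu outbreak",
        "markets rally on bank news", "doctors warn of flu season"]
labels = ["finance", "finance", "health", "health", "finance", "health"]

pipeline = Pipeline([
    ("represent", TfidfVectorizer()),
    ("select",    SelectKBest(chi2, k=5)),
    ("classify",  MultinomialNB()),
])
print(cross_val_score(pipeline, docs, labels, cv=3))          # accuracy per fold
```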

    Reduksi Dimensi Fitur Menggunakan Algoritma ALOFT Untuk Pengelompokan Dokumen

    Get PDF
    Document clustering still faces the challenge that, as the document collection grows, the number of features grows as well, which leads to high dimensionality and can degrade the performance and accuracy of clustering algorithms. The way to overcome this problem is dimensionality reduction. Dimensionality reduction methods such as feature selection with filter methods have been used for document clustering, but the filter method depends heavily on user input to select the top n features from the whole collection; this approach is often called variable ranking (VR). The ALOFT (At Least One FeaTure) algorithm addresses this by generating the feature set automatically, without any input parameter from the user. In previous research, ALOFT was used for document classification, so the filter methods applied in it require class labels, and such filter methods cannot be used for document clustering. This research therefore proposes a feature dimensionality reduction method that uses variations of several filter methods within the ALOFT algorithm for document clustering. Stemming in this study is performed using the derived-word lists provided by Kateglo (a dictionary, thesaurus, and glossary). Before dimensionality reduction, the first step is preprocessing, followed by computing TF-IDF term weights. The filter methods used are Document Frequency (DF), Term Contribution (TC), Term Variance Quality (TVQ), Term Variance (TV), Mean Absolute Difference (MAD), Mean Median (MM), and Arithmetic Mean Geometric Mean (AMGM). The final feature set is then selected by the ALOFT algorithm. The last phase is document clustering with two different clustering methods, k-means and Hierarchical Agglomerative Clustering (HAC). Cluster quality is evaluated with the silhouette coefficient, and the experiments compare the silhouette coefficients of the filter-method variations in ALOFT against VR feature selection. The results show that the proposed method with the k-means algorithm improves on the VR method: the clusters obtained are of "Good" quality for the TC, TV, TVQ, and MAD filters, with an average silhouette width (ASW) above 0.5.
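
    Putting the described stages together, the sketch below is a hedged reading of the pipeline: TF-IDF weights, one unsupervised filter score (Term Variance), an ALOFT-style selection in which every document contributes its highest-scoring present term (so each document keeps at least one feature), then k-means clustering scored with the silhouette coefficient. The selection rule is inferred from the algorithm's name and the abstract, and the toy corpus is invented; this is not the thesis code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Invented toy corpus; the study clusters preprocessed Indonesian documents.
docs = ["reduksi dimensi fitur dokumen", "pengelompokan dokumen dengan kmeans",
        "seleksi fitur dengan metode filter", "clustering dokumen secara hierarki",
        "bobot tfidf untuk dokumen teks", "evaluasi cluster dengan silhouette"]

X = TfidfVectorizer().fit_transform(docs).toarray()
tv = ((X - X.mean(axis=0)) ** 2).sum(axis=0)            # Term Variance filter score

selected = set()
for row in X:                                           # ALOFT-style pass over documents
    present = np.flatnonzero(row)                       # terms occurring in this document
    selected.add(int(present[np.argmax(tv[present])]))  # keep its best-scoring term
cols = sorted(selected)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, cols])
print("features kept:", len(cols),
      "silhouette:", round(silhouette_score(X[:, cols], labels), 3))
```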