8 research outputs found
Reduksi Dimensi Fitur Menggunakan Algoritma Aloft Untuk Pengelompokan Dokumen
Document clustering still faces the challenge that the larger a document collection grows, the more features it produces. This raises the dimensionality of the feature space and can degrade the performance of clustering algorithms. A common remedy is dimensionality reduction. Dimensionality-reduction methods such as feature selection with filter methods have been used for document clustering, but filter methods depend heavily on user input to select the top n features from the whole collection. The ALOFT (At Least One FeaTure) algorithm can generate a feature set automatically, without any input parameter from the user. However, because ALOFT was previously applied to document classification, the filter methods used within it require class labels, so those filter methods cannot be used for document clustering. This study proposes a feature-dimensionality-reduction method that uses variations of filter methods within the ALOFT algorithm for document clustering. Before dimensionality reduction, the first step is preprocessing, followed by TF-IDF weight calculation. Dimensionality reduction is then performed with filter methods such as Document Frequency (DF), Term Contribution (TC), Term Variance Quality (TVQ), Term Variance (TV), Mean Absolute Difference (MAD), Mean Median (MM), and Arithmetic Mean Geometric Mean (AMGM). The final feature set is then selected with the ALOFT algorithm. The last stage is document clustering with two different clustering methods, k-means and Hierarchical Agglomerative Clustering (HAC). Experiments show that the cluster quality produced by the proposed method with the k-means algorithm improves on the results of the variable-ranking (VR) method.
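The ALOFT selection step can be sketched in a few lines. This is a minimal illustration under toy assumptions, not the authors' implementation: it assumes a small document-term weight matrix, uses Term Variance (one of the label-free filters named in the abstract) as the filter score, and keeps, for each document, the highest-scoring term present in it, so every document retains at least one feature.

```python
# Sketch of ALOFT-style feature selection for clustering (no class labels).
# The matrix and the choice of Term Variance as the filter are assumptions
# for illustration only.

def term_variance(matrix):
    """TV score per term: variance of its weights across documents."""
    n_docs = len(matrix)
    scores = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        mean = sum(col) / n_docs
        scores.append(sum((w - mean) ** 2 for w in col) / n_docs)
    return scores

def aloft_select(matrix, scores):
    """For each document, keep its highest-scoring present term; the final
    feature set is the union, so every document keeps at least one feature."""
    selected = set()
    for row in matrix:
        present = [j for j, w in enumerate(row) if w > 0]
        if present:
            selected.add(max(present, key=lambda j: scores[j]))
    return sorted(selected)

docs = [  # toy TF-IDF weights, 3 documents x 4 terms
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.8, 0.0, 0.2],
    [0.7, 0.1, 0.0, 0.0],
]
tv = term_variance(docs)
features = aloft_select(docs, tv)  # every document covered, no n to choose
```

Note that no "top n" parameter appears anywhere: the size of the final set is determined by the data, which is the property the abstract contrasts with variable ranking.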
A study of feature extraction techniques for classifying topics and sentiments from news posts
Recently, many news channels have maintained their own Facebook pages, on which news posts are released on a daily basis. These posts contain temporal opinions about social events that may change over time due to external factors, and they can serve as a monitor of significant events happening around the world. As a result, much text-mining research has been conducted in the area of Temporal Sentiment Analysis, one of whose most challenging tasks is to detect and extract the key features from news posts that arrive continuously over time. Extracting these features is difficult because of the posts' complex properties, and because posts about a specific topic may grow or vanish over time, producing imbalanced datasets. This study therefore develops a comparative analysis of feature extraction techniques, examining several of them (TF-IDF, TF, BTO, IG, Chi-square) with three different n-gram features (unigram, bigram, trigram) and using SVM as the classifier. The aim is to discover the optimal Feature Extraction Technique (FET), the one achieving the best accuracy for both topic and sentiment classification. The analysis is conducted on three news channels' datasets. The experimental results for topic classification show that Chi-square with unigrams is the best FET compared with the other techniques. Furthermore, to overcome the problem of imbalanced data, this study combines the best FET with an oversampling technique. The evaluation results show an improvement in classifier performance, reaching higher accuracies of 93.37%, 92.89%, and 91.92% for BBC, Al-Arabiya, and Al-Jazeera, respectively, compared with those obtained on the original datasets. The same combination (Chi-square + unigram) was used for sentiment classification and obtained accuracies of 81.87%, 70.01%, and 77.36%. However, testing the recognized optimal FET on unseen, randomly selected news posts yielded relatively low accuracies for both topic and sentiment classification, due to the change of topics and sentiments over time.
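The chi-square FET compared above scores a term from a 2x2 contingency table of term occurrence versus class membership. The following is a generic sketch of that statistic with invented toy counts, not the study's code:

```python
# Chi-square term scoring for feature selection (generic sketch).

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 contingency table:
    n11 = in-class docs containing the term, n10 = in-class docs without it,
    n01 = out-of-class docs with the term, n00 = out-of-class docs without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

# A term that appears only in the target class scores highest, while a term
# spread evenly across classes scores zero; ranking terms by this score and
# keeping the top ones is the selection step.
discriminative = chi_square(5, 0, 0, 5)  # -> 10.0
uninformative = chi_square(2, 2, 2, 2)   # -> 0.0
```

For n-gram variants, the same statistic is simply computed over unigram, bigram, or trigram occurrence counts instead of single words.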
Improved relative discriminative criterion using rare and informative terms and ringed seal search-support vector machine techniques for text classification
Classification has become an important task for automatically assigning documents to their respective categories. For text classification, feature selection techniques are normally used to identify important features and to remove irrelevant and noisy ones, minimizing the dimensionality of the feature space. These techniques are expected to improve the efficiency, accuracy, and comprehensibility of classification models in text-labeling problems. Most feature selection techniques use document and term frequencies to rank a term. Existing techniques (e.g. RDC, NRDC) consider frequently occurring terms and ignore the counts of rarely occurring terms in a class. This study proposes the Improved Relative Discriminative Criterion (IRDC), which takes the counts of rarely occurring terms into account, arguing that rare terms are as meaningful and important as frequently occurring terms in a class. The proposed IRDC is compared with the most recent feature selection techniques, RDC and NRDC. The results reveal significant improvements by IRDC for feature selection: 27% in precision, 30% in recall, 35% in macro-average, and 30% in micro-average. Additionally, this study proposes a hybrid algorithm, Ringed Seal Search-Support Vector Machine (RSS-SVM), to improve the generalization and learning capability of the SVM. RSS-SVM optimizes the kernel and penalty parameters with the help of the RSS algorithm and is compared with the most recent techniques, GA-SVM and CS-SVM. The results show significant improvements by RSS-SVM for classification: 18.8% in accuracy, 15.68% in recall, 15.62% in precision, and 13.69% in specificity. In conclusion, the proposed IRDC performs better than existing techniques because of its ability to consider rare and informative terms, and the proposed RSS-SVM performs better because of its ability to improve the balance between exploration and exploitation.
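The exploration/exploitation balance credited to RSS-SVM can be illustrated with a generic stochastic search over the SVM penalty C and kernel width gamma. This is not the Ringed Seal Search algorithm itself; the fitness function below is a stand-in quadratic with a known peak rather than cross-validated SVM accuracy, and the move rules are a simplified caricature of any such metaheuristic:

```python
# Generic exploration/exploitation search over SVM hyperparameters (sketch).
# The fitness is a toy stand-in; a real tuner would evaluate cross-validated
# accuracy of an SVM with penalty C and RBF width gamma.
import random

def cv_accuracy(c, gamma):
    """Stand-in fitness with a known optimum at C=10, gamma=0.5."""
    return -((c - 10) ** 2 + (gamma - 0.5) ** 2)

def tune_svm(iters=2000, seed=0):
    rng = random.Random(seed)
    best = (rng.uniform(0.1, 100), rng.uniform(1e-3, 1.0))
    best_fit = cv_accuracy(*best)
    for _ in range(iters):
        if rng.random() < 0.5:
            # exploration: sample anywhere in the search box
            cand = (rng.uniform(0.1, 100), rng.uniform(1e-3, 1.0))
        else:
            # exploitation: small Gaussian perturbation of the current best
            # (a real implementation would also clip back into bounds)
            cand = (best[0] + rng.gauss(0, 1.0), best[1] + rng.gauss(0, 0.05))
        fit = cv_accuracy(*cand)
        if fit > best_fit:
            best, best_fit = cand, fit
    return best

best_c, best_gamma = tune_svm()  # converges near (10, 0.5) on the toy fitness
```

Pure exploration finds the right region slowly; pure exploitation gets stuck near the starting point. Mixing the two moves, which is what the abstract credits RSS with doing well, is what makes the search reliable.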
An efficient Bayes classifier for word classification: an application on the EU Recovery and Resilience Plans
This paper proposes the Prior Adaptive Bayes (PAB) classifier, a new algorithm for assigning the words appearing in a text to their respective topics. It is an adaptation of the Bayes classifier in which the posterior probabilities of the classes computed for the adjacent word are used as the prior probabilities for the current word. Simulations show an improvement of more than 20% over the standard Bayes classifier. The PAB classifier is applied to the Recovery and Resilience Plans (RRPs) of the 27 European Union member states to evaluate their alignment with the environmental dimension of the Sustainable Development Goals (SDGs) as compared to the socioeconomic one. The results show that the attention countries pay to the pro-environment SDGs increases with the funds per capita assigned, the gap in environmental endowment, and touristic attractiveness. Finally, the environmental dimension appears positively associated with the available GDP growth projections for the next few years.
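The prior-adaptation mechanism can be sketched as follows. The topics, words, and likelihood values here are invented for illustration and the paper's actual estimation procedure is not reproduced; the one idea taken from the abstract is that each word's posterior over topics becomes the prior for the next word:

```python
# Sketch of a Prior Adaptive Bayes-style word classifier.
# LIKELIHOOD holds toy values of P(word | topic); these are assumptions,
# not estimates from the paper.

LIKELIHOOD = {
    "solar":  {"environment": 0.08,  "economy": 0.01},
    "panel":  {"environment": 0.05,  "economy": 0.02},
    "budget": {"environment": 0.005, "economy": 0.2},
}

def pab_classify(words, topics=("environment", "economy")):
    prior = {t: 1.0 / len(topics) for t in topics}  # uniform start
    labels = []
    for w in words:
        # Bayes' rule: posterior proportional to prior * likelihood
        post = {t: prior[t] * LIKELIHOOD[w][t] for t in topics}
        z = sum(post.values())
        post = {t: p / z for t, p in post.items()}
        labels.append(max(post, key=post.get))
        prior = post  # the adjacent word's posterior becomes the next prior
    return labels

labels = pab_classify(["solar", "panel", "budget"])
```

Compared with a standard Bayes classifier, the carried-over prior lets a run of topic-consistent words reinforce each other, which is plausibly where the reported improvement over the word-by-word baseline comes from.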
Techniques for text classification: Literature review and current trends
Automated classification of text into predefined categories has long been considered a vital method for managing and processing the vast amount of documents in digital form, which are widespread and continuously increasing. This kind of web information, popularly known as digital/electronic information, takes the form of documents, conference material, publications, journals, editorials, web pages, e-mail, etc. People largely access information from these online sources rather than being limited to archaic paper sources such as books, magazines, and newspapers. The main problem is that this enormous body of information lacks organization, which makes it difficult to manage. Text classification is recognized as one of the key techniques for organizing such digital data. In this paper we study the existing work in the area of text classification, allowing a fair evaluation of the progress made in this field to date. We have investigated the papers to the best of our knowledge and have tried to summarize all existing information in a comprehensive and succinct manner. The studies are summarized in tabular form by publication year, considering numerous key perspectives. The main emphasis is laid on the various steps involved in the text classification process, viz. document representation methods, feature selection methods, data mining methods, and the evaluation technique used by each study to carry out its results on a particular dataset.
Reduksi Dimensi Fitur Menggunakan Algoritma ALOFT Untuk Pengelompokan Dokumen
Document clustering still faces the challenge that as the volume of documents grows, the number of term features grows as well. This raises the dimensionality of the feature space and can degrade the performance and accuracy of clustering algorithms. The way to overcome this problem is dimensionality reduction. Dimensionality-reduction methods such as feature selection with filter methods have been used for document clustering, but a filter method depends heavily on user input to select the top n features from the whole collection; this approach is often called variable ranking (VR). The ALOFT (At Least One FeaTure) algorithm addresses this problem by generating a feature set automatically, without any input parameter from the user. In previous research, ALOFT was used for document classification, and the filter methods applied within it require class labels, so those filter methods cannot be used for document clustering.

This study therefore proposes a feature-dimensionality-reduction method that uses variations of several filter methods within the ALOFT algorithm for document clustering. Stemming in this study is performed using the derived-word lists provided by Kateglo (an Indonesian dictionary, thesaurus, and glossary). After the preprocessing phase, term weights are computed with TF-IDF. The dimensionality-reduction phase uses filter methods such as Document Frequency (DF), Term Contribution (TC), Term Variance Quality (TVQ), Term Variance (TV), Mean Absolute Difference (MAD), Mean Median (MM), and Arithmetic Mean Geometric Mean (AMGM). The final feature set is then selected with the ALOFT algorithm. The last stage is document clustering with two different clustering methods, k-means and Hierarchical Agglomerative Clustering (HAC).

The quality of the resulting clusters is evaluated with the silhouette coefficient, comparing the silhouette values of the filter-method variations within ALOFT against feature selection by VR. The experiments show that the cluster quality produced by the proposed method with the k-means algorithm improves on the results of the VR method: the clusters obtained meet the "Good" criterion for the TC, TV, TVQ, and MAD filters, with an average silhouette width (ASW) above 0.5.