12 research outputs found

    An Improved Similarity Matching based Clustering Framework for Short and Sentence Level Text

    Text clustering plays a key role in navigation and browsing. For efficient text clustering, large amounts of information must be grouped into meaningful clusters. Many existing text clustering techniques suffer from issues such as high time and space complexity, an inability to capture the relational and contextual attributes of words, limited robustness, and risks of privacy exposure. To address these issues, an efficient text-based clustering framework is proposed. The Reuters dataset is used as input. After preprocessing, the similarity between words is computed using cosine similarity. The similarities between components are compared and vector data is created, from which the clustering particle is computed. To optimize the clustering results, mutation is applied to the vector data. The performance of the proposed framework is analyzed using Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and processing time. The experimental results show that the proposed framework produces better MSE, PSNR, and processing time than the existing Fuzzy C-Means (FCM) and Pairwise Random Swap (PRS) methods.
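The similarity step described in the abstract can be illustrated with a minimal sketch: cosine similarity between bag-of-words term-count vectors. The function name and example sentences below are illustrative, not taken from the paper (which works on the Reuters dataset).

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = Counter("oil prices rose as markets reacted".split())
doc2 = Counter("markets reacted to rising oil prices".split())
doc3 = Counter("heavy rain flooded the coastal town".split())

print(cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3))  # True
```

Documents sharing vocabulary score close to 1, unrelated documents close to 0, which is what lets a threshold or centroid-based procedure group them into clusters.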

    Reduksi Dimensi Fitur Menggunakan Algoritma Aloft Untuk Pengelompokan Dokumen

    Document clustering still faces the challenge that larger document collections produce ever more features, leading to high dimensionality and poor performance of the clustering algorithm. One way to address this is dimensionality reduction. Dimension-reduction methods such as filter-based feature selection have been used for document clustering, but filter methods depend heavily on user input to pick the top n features of the whole collection. The ALOFT (At Least One FeaTure) algorithm can produce a feature set automatically, with no input parameter from the user. However, ALOFT was previously applied to document classification, and the filter methods it used require class labels, so they cannot be used for document clustering. This study proposes a feature-dimension-reduction method that applies a variety of filter methods within the ALOFT algorithm for document clustering. Before dimension reduction, the documents are preprocessed and tf-idf weights are computed. Dimension reduction is then performed with filter methods such as Document Frequency (DF), Term Contribution (TC), Term Variance Quality (TVQ), Term Variance (TV), Mean Absolute Difference (MAD), Mean Median (MM), and Arithmetic Mean Geometric Mean (AMGM). The final feature set is selected with the ALOFT algorithm, and the documents are finally clustered with two different methods, k-means and Hierarchical Agglomerative Clustering (HAC). The experiments show that the cluster quality produced by the proposed method with the k-means algorithm improves on the results of the VR (variable ranking) method.
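The core ALOFT idea described above can be sketched in a few lines: scan each document and keep its best term under some filter score, so every document retains at least one feature. The names and the document-frequency filter choice below are illustrative; the study combines ALOFT with several filters (DF, TC, TV, etc.).

```python
from collections import Counter

def aloft_select(docs: list, score: dict) -> set:
    """Return a feature set guaranteed to contain >= 1 top-ranked term per document."""
    selected = set()
    for terms in docs:
        # keep this document's highest-scoring term (At Least One FeaTure)
        selected.add(max(terms, key=lambda t: score[t]))
    return selected

docs = [{"bank", "loan"}, {"loan", "rate"}, {"goal", "match"}]
df = Counter(t for d in docs for t in d)  # document frequency as the filter score
features = aloft_select(docs, df)
```

No user-supplied "top n" parameter is needed: the size of the selected set falls out of the documents themselves, which is the property the abstract highlights.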

    Genetic Algorithm to Optimize k-Nearest Neighbor Parameter for Benchmarked Medical Datasets Classification

    Computer-assisted medical diagnosis is a major machine learning problem being actively researched. General classifiers learn from the data itself through a training process, since an expert may lack the experience to determine parameters. This research proposes a methodology based on the machine learning paradigm that integrates a search heuristic inspired by natural evolution, the genetic algorithm, with one of the simplest and most widely used learning algorithms, k-nearest Neighbor. The genetic algorithm is used for feature selection and parameter optimization, while k-nearest Neighbor is used as the classifier. The proposed method is evaluated on five benchmark medical datasets from the University of California Irvine Machine Learning Repository and compared with the original k-NN and other feature selection algorithms, i.e., forward selection, backward elimination, and greedy feature selection. Experimental results show that the proposed method achieves good performance with a significant improvement (t-test p-value of 0.0011).
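A hedged sketch of the approach described above: a genetic algorithm whose chromosome encodes a binary feature mask plus the k parameter, scored by leave-one-out k-NN accuracy. The toy dataset, operators, and hyperparameters are my own simplifications, not the authors' exact setup.

```python
import random
from collections import Counter

random.seed(0)

# Toy dataset: features 0 and 1 separate the two classes; 2 and 3 are noise.
data = [([i, i, random.random(), random.random()], 0) for i in range(10)]
data += [([i + 20, i + 20, random.random(), random.random()], 1) for i in range(10)]

def knn_accuracy(mask, k):
    """Leave-one-out accuracy of k-NN restricted to the masked features."""
    correct = 0
    for i, (x, y) in enumerate(data):
        dists = sorted(
            (sum((x[f] - x2[f]) ** 2 for f in range(4) if mask[f]), y2)
            for j, (x2, y2) in enumerate(data) if j != i
        )
        votes = Counter(lbl for _, lbl in dists[:k])
        correct += votes.most_common(1)[0][0] == y
    return correct / len(data)

def fitness(chrom):
    mask, k = chrom[:4], chrom[4]
    return knn_accuracy(mask, k) if any(mask) else 0.0

def evolve(pop_size=12, generations=15):
    # chromosome = 4 mask bits + one k value
    pop = [[random.randint(0, 1) for _ in range(4)] + [random.choice([1, 3, 5])]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randint(1, 4)         # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:          # bit-flip mutation on the mask
                i = random.randint(0, 3)
                child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print("best mask:", best[:4], "k =", best[4], "accuracy:", fitness(best))
```

On this separable toy data the GA readily discards the noise features; on real datasets, fitness would be cross-validated accuracy on the training split.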

    Implementasi Algoritma Genetika pada k-nearest neighbours untuk Klasifikasi Kerusakan Tulang Belakang

    Abstract: Spinal disorders are experienced by about two-thirds of adults and are the second most common complaint after headaches. Classifying spinal disorders is difficult because it requires a radiologist to analyze Magnetic Resonance Imaging (MRI) images. A Computer-Aided Diagnosis (CAD) system can help radiologists detect abnormalities of the spine more effectively. The vertebral column dataset has three classes of spinal disorder, namely herniated disk, spondylolisthesis, and normal, derived from features extracted from MRI images. The dataset is processed in five experiments using split validation with varying proportions of training and testing data. This study proposes implementing a genetic algorithm with the k-nearest neighbours algorithm to improve the accuracy of spinal disorder classification. The genetic algorithm is used for feature selection and for optimizing the parameters of the k-nearest neighbours algorithm. The results show that the proposed method yields a significant improvement in spinal disorder classification, producing an average accuracy of 93% over the five experiments, compared with an average accuracy of only 82.54% for the plain k-nearest neighbours algorithm. Keywords: genetic algorithm, k-nearest neighbours, spinal disorder, vertebral column
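The split-validation protocol mentioned above can be sketched minimally: shuffle once, hold out a fraction for testing, and repeat with different ratios. The ratios and toy samples below are illustrative, not the paper's exact splits.

```python
import random

def split_validation(samples, train_ratio, seed=42):
    """Shuffle and split into (train, test) by the given training ratio."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

samples = list(range(100))
for ratio in (0.9, 0.8, 0.7, 0.6, 0.5):  # e.g. five experiments
    train, test = split_validation(samples, ratio)
    print(ratio, len(train), len(test))
```

Averaging the classifier's accuracy over the five splits gives the per-method figures the abstract reports.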

    Unsupervised text Feature Selection using memetic Dichotomous Differential Evolution

    Feature Selection (FS) methods have been studied extensively in the literature and are a crucial component of machine learning techniques. However, unsupervised text feature selection has not been well studied for document clustering problems. Feature selection can be modelled as an optimization problem because of the large number of possibly valid solutions. In this paper, a memetic method that combines Differential Evolution (DE) with Simulated Annealing (SA) for unsupervised FS is proposed. Because each feature is represented by one of only two values, indicating its presence or absence, a binary version of differential evolution is used. A dichotomous DE serves as this binary version, and the proposed method is named Dichotomous Differential Evolution Simulated Annealing (DDESA). It uses dichotomous mutation instead of the standard DE mutation so as to be more effective for binary representations. The Mean Absolute Distance (MAD) filter is used as the internal evaluation measure for feature subsets. The proposed method was compared with other state-of-the-art methods, including standard DE combined with SA (named DESA in this paper), on five benchmark datasets. The micro and macro F-scores, the Average Distance of Document to Cluster (ADDC), and the Reduction Rate (RR) were used as evaluation measures. Test results showed that the proposed DDESA outperformed the other tested methods at unsupervised text feature selection.
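One plausible reading of a dichotomous (binary) mutation operator, sketched for illustration only; the exact operator in DDESA may differ. Where the two difference parents agree, the base vector's bit is kept; where they disagree, the mutant bit is drawn at random.

```python
import random

def dichotomous_mutation(base, r2, r3, rng=random):
    """Binary mutant: keep base bit where r2 and r3 agree; randomize where they differ."""
    return [b if x == y else rng.randint(0, 1)
            for b, x, y in zip(base, r2, r3)]

base = [1, 0, 1, 1, 0]   # 1 = feature kept, 0 = feature dropped
r2   = [1, 1, 0, 1, 0]
r3   = [1, 0, 0, 0, 0]
mutant = dichotomous_mutation(base, r2, r3)
```

This keeps the search inside {0, 1} directly, avoiding the thresholding step that a real-valued DE mutation would require for a binary feature-selection problem.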

    Multi-objective of wind-driven optimization as feature selection and clustering to enhance text clustering

    Text clustering groups objects of similar categories. One issue is that the initial centroids influence the operation of the system, which can become trapped in local optima; a second is the impact of a huge number of features on the determination of good initial centroids. Dimensionality can be reduced by feature selection, so Wind Driven Optimization (WDO) was employed for feature selection to remove unimportant words from the text. In addition, the current study integrates WDO a second time as a novel clustering optimization technique to determine the most suitable initial centroids. The WDO meta-heuristic is thus employed as a multi-objective method for the first time: first as an unsupervised feature selection method (WDOFS) and second as a clustering algorithm (WDOC). WDOC outperformed Harmony Search and Particle Swarm in terms of F-measure, reaching 93.3%; moreover, text clustering performance improves by 0.9% when the proposed clustering runs on the proposed feature selection, which removes more than 50 percent of the features. The best multi-objective result reached an F-measure of 98.3%.

    Reduksi Dimensi Fitur Menggunakan Algoritma ALOFT Untuk Pengelompokan Dokumen

    Document clustering still faces the challenge that, as a document collection grows, it produces ever more features, leading to high dimensionality and potentially poor performance and accuracy of the clustering algorithm. The way to overcome this is dimension reduction. Dimension-reduction methods such as filter-based feature selection have been used for document clustering, but filter methods are highly dependent on user input to select the top n features of the whole collection; this approach is often called variable ranking (VR). The ALOFT (At Least One FeaTure) algorithm addresses this by generating a feature set automatically, without user input. In previous research, ALOFT was used for document classification, and the filter methods it used require class labels, so those filter methods cannot be used for document clustering. This research therefore proposes a feature-dimension-reduction method using variations of several filter methods within the ALOFT algorithm for document clustering. Stemming in this research is performed using the derived-word forms provided by Kateglo (a dictionary, thesaurus, and glossary). Before dimension reduction, the documents are preprocessed and term weights are calculated using tf-idf. The filter methods used in this study are Document Frequency (DF), Term Contribution (TC), Term Variance Quality (TVQ), Term Variance (TV), Mean Absolute Difference (MAD), Mean Median (MM), and Arithmetic Mean Geometric Mean (AMGM). The final feature set is then selected by the ALOFT algorithm, and the last phase clusters the documents with two different methods, k-means and Hierarchical Agglomerative Clustering (HAC). Cluster quality is evaluated with the silhouette coefficient, comparing the silhouette values of the filter-method variants in ALOFT against VR feature selection. The experiments show that the proposed method with the k-means algorithm improves on the results of the VR method, producing clusters rated "Good" for the TC, TV, TVQ, and MAD filters, with an average silhouette width (ASW) of more than 0.5.
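The silhouette coefficient used above as the cluster-quality measure can be computed from first principles; here is a small self-contained sketch on 1-D points for brevity (the thesis applies it to tf-idf document vectors).

```python
def silhouette(points, labels):
    """Mean silhouette width: s = (b - a) / max(a, b) per point, averaged."""
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        # a: mean distance to the other points of the same cluster
        same = [abs(p - q) for j, (q, m) in enumerate(zip(points, labels))
                if m == l and j != i]
        a = sum(same) / len(same) if same else 0.0
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(abs(p - q) for q, m in zip(points, labels) if m == other)
            / labels.count(other)
            for other in set(labels) if other != l
        )
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
lbl = [0, 0, 0, 1, 1, 1]
print(round(silhouette(pts, lbl), 3))  # → 0.973
```

Values near 1 indicate tight, well-separated clusters; the thesis's "Good" criterion corresponds to an average silhouette width above 0.5.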

    Advances in Meta-Heuristic Optimization Algorithms in Big Data Text Clustering

    This paper presents a comprehensive survey of meta-heuristic optimization algorithms for text clustering applications and highlights their main procedures. These Artificial Intelligence (AI) algorithms are recognized as promising swarm intelligence methods for their ability to solve machine learning problems, especially text clustering problems. The paper reviews the relevant literature on meta-heuristic-based text clustering applications, including many variants such as basic, modified, hybridized, and multi-objective methods; presents the main procedures of text clustering with critical discussion; reports their advantages and disadvantages; and recommends potential future research paths. The main keywords considered in this paper are text, clustering, meta-heuristic, optimization, and algorithm.

    An enhanced binary bat and Markov clustering algorithms to improve event detection for heterogeneous news text documents

    Event Detection (ED) identifies events in various types of data. Building an ED model for news text documents greatly helps decision-makers in various disciplines improve their strategies. However, identifying and summarizing events from such data is a non-trivial task due to the large volume of published heterogeneous news text documents, which creates a high-dimensional feature space that degrades the overall performance of baseline ED methods. To address this problem, this research presents an enhanced ED model with improved methods for the crucial phases of the ED pipeline: Feature Selection (FS), ED, and summarization. The work focuses on the FS problem, automatically detecting events through a novel wrapper FS method based on an Adapted Binary Bat Algorithm (ABBA) and an Adapted Markov Clustering Algorithm (AMCL), termed ABBA-AMCL. These adaptations were developed to overcome premature convergence in BBA and the fast convergence rate of MCL. Furthermore, the study proposes four summarization methods to generate informative summaries. The enhanced ED model was tested on 10 benchmark datasets and 2 Facebook news datasets, and the effectiveness of ABBA-AMCL was compared with 8 FS methods based on meta-heuristic algorithms and 6 graph-based ED methods. The empirical and statistical results show that ABBA-AMCL surpassed the other methods on most datasets. The key representative features demonstrate that ABBA-AMCL successfully detects real-world events from the Facebook news datasets, with 0.96 Precision and 1 Recall on dataset 11, and 1 Precision and 0.76 Recall on dataset 12. To conclude, the novel ABBA-AMCL bridges the research gap and resolves the curse of high-dimensional feature space for heterogeneous news text documents; the enhanced ED model can organize news documents into distinct events and provide policymakers with valuable information for decision making.
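The Precision and Recall figures quoted above follow the standard definitions; a minimal computation on hypothetical detection counts (the counts are illustrative, not from the thesis) shows how such figures arise:

```python
def precision_recall(tp, fp, fn):
    """Precision = tp/(tp+fp); Recall = tp/(tp+fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 24 correctly detected events, 1 false alarm, 0 missed events
p, r = precision_recall(tp=24, fp=1, fn=0)
print(round(p, 2), r)  # → 0.96 1.0
```

A recall of 1 thus means no true event was missed, while a precision below 1 indicates some spurious detections.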