5 research outputs found
Feature selection for high dimensional imbalanced class data using harmony search
Misclassification costs of minority class data in real-world applications can be very high. This problem is especially challenging when the data is also high-dimensional, since high dimensionality increases overfitting and reduces model interpretability. Feature selection has recently become a popular way to address this problem by identifying the features that best predict a minority class. This paper introduces a novel feature selection method called SYMON, which uses symmetrical uncertainty and harmony search. Unlike existing methods, SYMON uses symmetrical uncertainty to weigh features with respect to their dependency on class labels. This helps to identify features that are powerful in retrieving the least frequent class labels. SYMON also uses harmony search to formulate the feature selection phase as an optimisation problem and select the best possible combination of features. The proposed algorithm is able to deal with situations where a set of features have the same weight by incorporating two vector tuning operations embedded in the harmony search process. In this paper, SYMON is compared against various benchmark feature selection algorithms developed to address the same issue. Our empirical evaluation on different micro-array data sets using the G-Mean and AUC measures confirms that SYMON is comparable or superior to current benchmarks.
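The symmetrical uncertainty weighting that SYMON relies on is a standard information-theoretic measure; a minimal sketch of it for discrete features (function and variable names are illustrative, not from the paper):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a discrete sample."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalised to [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))    # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy     # I(X; Y)
    denom = h_x + h_y
    return 0.0 if denom == 0 else 2 * mutual_info / denom

# A feature identical to the labels scores 1; an independent one scores 0.
labels  = [0, 0, 1, 1]
f_good  = [0, 0, 1, 1]   # perfectly predictive of the labels
f_noise = [0, 1, 0, 1]   # statistically independent of the labels
print(symmetrical_uncertainty(f_good, labels))   # 1.0
print(symmetrical_uncertainty(f_noise, labels))  # 0.0
```

Because SU is normalised by the two marginal entropies, it does not favour features with many distinct values the way raw information gain does, which is what makes it usable as a per-feature weight.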
Improved probabilistic distance based locality preserving projections method to reduce dimensionality in large datasets
In this paper, dimensionality reduction in large datasets is achieved using the proposed distance-based Non-negative Matrix Factorization (NMF) technique, which is intended to solve the data dimensionality problem. Here, NMF and distance measurement aim to resolve the non-orthogonality problem caused by increased dataset dimensionality. The method initially partitions the datasets, organizes them into a defined geometric structure, and captures the dataset structure through a distance-based similarity measurement. The proposed method is designed to fit dynamic datasets and includes the intrinsic structure using data geometry. The complexity of the data is further reduced using an Improved Distance-based Locality Preserving Projection. The proposed method is evaluated against existing methods in terms of accuracy, average accuracy, mutual information and average mutual information.
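The abstract does not spell out its distance-based similarity measurement; a common building block in locality preserving projections is the heat-kernel adjacency weight, sketched here under that assumption (all names are illustrative):

```python
from math import exp

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def heat_kernel_weights(points, t=1.0):
    """W[i][j] = exp(-||x_i - x_j||^2 / t): nearby points get weights near 1,
    distant points get weights near 0, preserving local geometry."""
    n = len(points)
    return [[exp(-sq_dist(points[i], points[j]) / t) for j in range(n)]
            for i in range(n)]

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
W = heat_kernel_weights(points)
print(round(W[0][1], 3))   # close to 1: the two nearby points
print(W[0][2] < 1e-6)      # True: the distant point is effectively disconnected
```

A projection that minimises the weighted sum of squared distances between projected neighbours then preserves exactly this local structure, which is the core idea of LPP-style methods.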
Performance Study Of Uncertainty Based Feature Selection Method On Detection Of Chronic Kidney Disease With SVM Classification
Chronic Kidney Disease (CKD) is a disorder that impairs kidney function. Early signs of CKD are very difficult to detect: patients often show no symptoms until they have lost 25% of their kidney function. Therefore, early detection and effective treatment are needed to reduce the mortality rate of CKD sufferers. In this study, the authors diagnose the CKD dataset using the Support Vector Machine (SVM) classification method to obtain accurate diagnostic results. The authors propose a comparison of the results of applying feature selection methods to find the best candidate features for improving classification. The testing process compares the Symmetrical Uncertainty (SU) and Multivariate Symmetrical Uncertainty (MSU) feature selection methods, with SVM as the classification method. Several experimental scenarios were carried out using the SU and MSU feature selection methods on the CKD dataset. The results show that the MSU feature selection method with an 80%:20% data split produces nine important features with an accuracy of 0.9, sensitivity of 0.84 and specificity of 1.0; on the ROC graph, the MSU curve shows a higher true positive rate than false positive rate. Thus, classification using the MSU feature selection method is better than the SU feature selection method, achieving 90% accuracy.
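The reported accuracy, sensitivity and specificity all follow from a binary confusion matrix; a minimal sketch (names are illustrative, and the toy labels below are not the CKD data):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall of the positive class) and specificity
    computed from the four confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy    = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    return accuracy, sensitivity, specificity

# Toy example: 5 positives, 5 negatives; one positive is missed.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
acc, sens, spec = classification_metrics(y_true, y_pred)
print(acc, sens, spec)   # 0.9 0.8 1.0
```

A specificity of 1.0 with lower sensitivity, as in the study's MSU result, means no healthy case was misclassified but some CKD cases were missed.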
The effectiveness of feature selection methods for imbalanced text classification
The distribution of text data across classes is often imbalanced. This situation has a negative impact on
the performance of classifiers in the text classification process. Many studies have been performed on
imbalanced text classification. The feature selection stage, which is one of the important stages of the
text classification process, is also critical in the imbalanced text classification problem. The effect of
feature selection methods on the classification of imbalanced texts has been thoroughly investigated
in this study. To this end, many experiments were carried out with three different classifiers and
nine different feature selection methods on two different data sets. In addition, the success of the
feature selection methods was also observed with different numbers of features. Nine feature
selection methods, namely NDM, DFSS, PFS, POISSON, CHI2, IG, GINI, DFS and MDFS, were evaluated.
Experimental results were obtained with Support Vector Machine (SVM), Decision Tree (DTREE), and Naïve
Bayes (MNB) classifiers. On the Reuters-21578 dataset, the DFS and CHI2 feature selection methods
obtained the highest Macro-F1 scores, at approximately 80. On the SPAM SMS dataset, the DFS feature
selection method obtained the highest score of 95 and the CHI2 method obtained 94. The DFS and CHI2
feature selection methods are thus seen to be more successful than the others for imbalanced text
classification.
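The CHI2 score evaluated above is typically computed per term and class from a 2x2 document-count contingency table; a minimal sketch, assuming the standard chi-square formulation for text feature selection (names are illustrative):

```python
def chi2_term_score(a, b, c, d):
    """Chi-square score of a term for a class from a 2x2 contingency table.

    a: docs in the class containing the term
    b: docs outside the class containing the term
    c: docs in the class without the term
    d: docs outside the class without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom

# A term that occurs mostly inside the class scores high ...
print(chi2_term_score(40, 10, 20, 30))   # ≈ 16.667
# ... while a term distributed evenly across classes scores zero.
print(chi2_term_score(25, 25, 25, 25))   # 0.0
```

Features are then ranked by their highest per-class score, and the top-k terms are kept, which is how the different feature counts in the experiments above would be produced.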
Filter + GA Based Approach to Feature Selection for Classification
This paper presents a new approach to selecting a reduced number of features in databases. Every database has a given number of features, but some of these features can be redundant or even harmful and can confuse the classification process. The proposed method applies a filter attribute measure and a binary-coded Genetic Algorithm to select a small subset of features. The importance of these features is judged by applying the K-nearest neighbor (KNN) classification method. The best reduced subset of features, the one with high classification accuracy on the given databases, is adopted. The classification accuracy obtained by the proposed method is compared with results reported recently in publications on twenty-eight databases. The proposed method performs satisfactorily on these databases and achieves higher classification accuracy with a smaller number of features.
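A binary-coded GA over feature masks can be sketched as below. The paper's filter measure and KNN-based fitness are not reproduced here; a toy additive fitness stands in for KNN accuracy, and all names and parameters are illustrative:

```python
import random

def ga_feature_selection(n_features, fitness, pop_size=30, generations=60,
                         p_mut=0.1, seed=0):
    """Binary-coded GA: each individual is a 0/1 mask over the features."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        def tournament():
            return max(rng.sample(pop, 3), key=fitness)
        nxt = [best[:]]                                # elitism: keep the best mask
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_features)         # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ (rng.random() < p_mut) for b in child]  # bit-flip mutation
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness)
    return best

# Toy fitness standing in for KNN accuracy: reward the three informative
# features (0, 2, 4) and penalise subset size, favouring small subsets.
GOOD = {0, 2, 4}
def fitness(mask):
    return sum(mask[i] for i in GOOD) - 0.2 * sum(mask)

best = ga_feature_selection(6, fitness)
print(sorted(i for i, b in enumerate(best) if b))   # indices of selected features
```

The size penalty in the fitness is what drives the GA toward the "smaller number of features" outcome the paper reports; in the real method, the filter measure would pre-screen attributes and KNN accuracy would replace the toy reward.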