5 research outputs found

    Feature selection for high dimensional imbalanced class data using harmony search

    Misclassification costs for minority-class data in real-world applications can be very high. The problem is especially challenging when the data is also high-dimensional, since high dimensionality increases overfitting and reduces model interpretability. Feature selection has recently become a popular way to address this problem by identifying the features that best predict a minority class. This paper introduces a novel feature selection method called SYMON, which uses symmetrical uncertainty and harmony search. Unlike existing methods, SYMON uses symmetrical uncertainty to weigh features with respect to their dependency on class labels. This helps to identify features that are powerful in retrieving the least frequent class labels. SYMON also uses harmony search to formulate the feature selection phase as an optimisation problem and select the best possible combination of features. The proposed algorithm is able to deal with situations where a set of features have the same weight by incorporating two vector tuning operations embedded in the harmony search process. In this paper, SYMON is compared against various benchmark feature selection algorithms developed to address the same issue. Our empirical evaluation on different micro-array data sets using the G-Mean and AUC measures confirms that SYMON is comparable to or better than the current benchmarks.
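    The abstract does not give SYMON's exact formulation, but the standard symmetrical uncertainty measure it builds on can be sketched in plain Python (function names here are illustrative, not the paper's code):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(feature, labels):
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), normalised to [0, 1]."""
    h_x = entropy(feature)
    h_y = entropy(labels)
    # Conditional entropy H(Y | X): entropy of the labels within each
    # feature value, weighted by how often that value occurs.
    n = len(feature)
    h_y_given_x = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        h_y_given_x += (len(subset) / n) * entropy(subset)
    info_gain = h_y - h_y_given_x
    denom = h_x + h_y
    return 2.0 * info_gain / denom if denom else 0.0
```

    A feature that determines the class label exactly scores 1; a feature independent of the labels scores 0, which is why SU-style weights can surface features tied to rare classes.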

    Improved probabilistic distance based locality preserving projections method to reduce dimensionality in large datasets

    In this paper, dimensionality reduction in large datasets is achieved using the proposed distance-based Non-negative Matrix Factorization (NMF) technique, which is intended to solve the data dimensionality problem. Here, NMF and distance measurement aim to resolve the non-orthogonality problem caused by increased dataset dimensionality. The method initially partitions the datasets, organizes them into a defined geometric structure, and captures the dataset structure through a distance-based similarity measurement. The proposed method is designed to fit dynamic datasets, and it includes the intrinsic structure using data geometry. The complexity of the data is further reduced using an Improved Distance-based Locality Preserving Projection. The proposed method is evaluated against existing methods in terms of accuracy, average accuracy, mutual information, and average mutual information.
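    The abstract does not spell out its distance-based similarity measurement; a common choice in locality-preserving-projection methods is a heat-kernel affinity over pairwise distances, sketched below (the function name and the `sigma` parameter are assumptions for illustration, not the paper's design):

```python
import math

def affinity_matrix(points, sigma=1.0):
    """Heat-kernel similarity W[i][j] = exp(-||x_i - x_j||^2 / sigma):
    nearby points get weights near 1, distant points near 0, which is
    how LPP-style methods encode local geometric structure."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            W[i][j] = math.exp(-d2 / sigma)
    return W
```

    The projection itself is then chosen to keep points with large W[i][j] close together in the reduced space.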

    Performance Study Of Uncertainty Based Feature Selection Method On Detection Of Chronic Kidney Disease With SVM Classification

    Chronic Kidney Disease (CKD) is a disorder that impairs kidney function. Early signs of CKD are very difficult to detect until patients have lost about 25% of their kidney function. Therefore, early detection and effective treatment are needed to reduce the mortality rate of CKD sufferers. In this study, the authors diagnose the CKD dataset using the Support Vector Machine (SVM) classification method to obtain accurate diagnostic results. The authors compare the results of applying feature selection methods to find the best feature candidates for improving the classification result. The testing process compares the Symmetrical Uncertainty (SU) and Multivariate Symmetrical Uncertainty (MSU) feature selection methods, with SVM as the classification method. Several experimental scenarios were carried out with the SU and MSU feature selection methods on the CKD dataset. The results show that the MSU feature selection method with an 80%:20% data split produces nine important features, with an accuracy of 0.9, sensitivity of 0.84, and specificity of 1.0; on the ROC graph, the MSU curve shows a higher true positive rate than false positive rate. Thus, classification using the MSU feature selection method is better than the SU feature selection method, at 90% accuracy.
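    The reported accuracy, sensitivity, and specificity all derive from a binary confusion matrix; a minimal sketch of how the three metrics are computed (the function name is illustrative, not the paper's code):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall on the positive class), and
    specificity (recall on the negative class) from raw predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity
```

    A specificity of 1.0, as reported for MSU, means no negative case was misclassified as positive (zero false positives).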

    The effectiveness of feature selection methods for imbalanced text classification

    The distribution of text data across classes is often imbalanced. This situation has a negative impact on the performance of classifiers in the text classification process. Many studies have been performed on imbalanced text classification. The feature selection stage, which is one of the important stages of the text classification process, is also critical in the imbalanced text classification problem. The effect of feature selection methods on the classification of imbalanced texts has been thoroughly investigated in this study.
In this direction, many experiments were carried out with three different classifiers and nine different feature selection methods on two different data sets. In addition, the success of the feature selection methods was observed at different numbers of features. Nine feature selection methods, called NDM, DFSS, PFS, POISSON, CHI2, IG, GINI, DFS, and MDFS, were evaluated. Experimental results were obtained with the Support Vector Machine (SVM), Decision Tree (DTREE), and Naïve Bayes (MNB) classifiers. On the Reuters-21578 dataset, the DFS and CHI2 feature selection methods obtained approximately 80 as the highest Macro-F1 score. On the SPAM SMS dataset, the DFS feature selection method obtained 95 and the CHI2 method 94 as the highest Macro-F1 scores. The feature selection methods DFS and CHI2 are thus more successful than the others for imbalanced text classification.
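    CHI2, one of the two best-performing selectors above, scores a term by the chi-square statistic between term presence and a class. A minimal one-term sketch over bag-of-words documents (the function name and data layout are assumptions for illustration, not the paper's implementation):

```python
def chi2_term_score(docs, labels, term, cls):
    """Chi-square association between the presence of `term` and class `cls`.
    Each doc is a set of words; A/B/C/D form the 2x2 contingency table."""
    A = sum(1 for d, y in zip(docs, labels) if term in d and y == cls)
    B = sum(1 for d, y in zip(docs, labels) if term in d and y != cls)
    C = sum(1 for d, y in zip(docs, labels) if term not in d and y == cls)
    D = sum(1 for d, y in zip(docs, labels) if term not in d and y != cls)
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0
```

    For imbalanced collections, the top-scoring terms per class are kept as features, which is why the choice of selector strongly affects minority-class recall.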

    Filter–GA Based Approach to Feature Selection for Classification

    This paper presents a new approach to selecting a reduced number of features in databases. Every database has a given number of features, but some of these features can be redundant or even harmful, and can confuse the classification process. The proposed method applies a filter attribute measure and a binary-coded Genetic Algorithm to select a small subset of features. The importance of these features is judged by applying the K-nearest neighbour (KNN) classification method. The reduced feature subset with the highest classification accuracy on the given databases is adopted. The classification accuracy obtained by the proposed method is compared with results reported recently in publications on twenty-eight databases. The proposed method performs satisfactorily on these databases, achieving higher classification accuracy with a smaller number of features.
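    The filter + binary-coded GA + KNN pipeline described above can be sketched roughly as follows. This is a simplified illustration: the population size, genetic operators, and the leave-one-out 1-NN fitness are assumptions, not the paper's exact design.

```python
import random

def loo_1nn_accuracy(X, y, mask):
    """Leave-one-out 1-nearest-neighbour accuracy using only the masked features."""
    idx = [j for j, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # Nearest other sample, measured on the selected features only.
        best = min((k for k in range(len(X)) if k != i),
                   key=lambda k: sum((X[i][j] - X[k][j]) ** 2 for j in idx))
        correct += y[best] == y[i]
    return correct / len(X)

def ga_feature_select(X, y, pop_size=16, generations=25, mutation_rate=0.05, seed=0):
    """Binary-coded GA over feature masks: each chromosome is a bit vector
    marking which features are kept; KNN accuracy drives the fitness."""
    rng = random.Random(seed)
    n = len(X[0])
    def fitness(mask):
        # Reward accuracy first; among equally accurate masks, prefer fewer features.
        return (loo_1nn_accuracy(X, y, mask), -sum(mask))
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]              # keep the better half (elitism)
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)             # one-point crossover
            child = [bit ^ (rng.random() < mutation_rate)   # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)
```

    On data where one feature separates the classes and another is noise, the size penalty in the fitness pushes the GA toward the smaller, informative mask.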