23,028 research outputs found

    Feature selection for high dimensional imbalanced class data using harmony search

    Misclassification costs for minority-class data in real-world applications can be very high. This problem is especially challenging when the data are also high-dimensional, because high dimensionality increases overfitting and reduces model interpretability. Feature selection has recently become a popular way to address this problem by identifying the features that best predict a minority class. This paper introduces a novel feature selection method called SYMON, which uses symmetrical uncertainty and harmony search. Unlike existing methods, SYMON uses symmetrical uncertainty to weigh features with respect to their dependency on class labels. This helps to identify features that are powerful in retrieving the least frequent class labels. SYMON also uses harmony search to formulate the feature selection phase as an optimisation problem to select the best possible combination of features. The proposed algorithm can deal with situations where a set of features have the same weight by incorporating two vector tuning operations embedded in the harmony search process. In this paper, SYMON is compared against various benchmark feature selection algorithms that were developed to address the same issue. Our empirical evaluation on different microarray data sets using G-Mean and AUC measures confirms that SYMON is comparable to or better than current benchmarks.
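As a rough illustration of the weighting idea mentioned in this abstract, the sketch below computes symmetrical uncertainty between one discrete feature and the class labels, using the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). It is not SYMON itself: the function names are hypothetical, the feature is assumed to be already discretised, and the harmony search step is not shown.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(feature, labels):
    """SU = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1].

    Assumes `feature` is already discretised (hypothetical helper,
    not taken from the paper).
    """
    h_x = entropy(feature)
    h_y = entropy(labels)
    if h_x + h_y == 0:
        # Both variables are constant, so there is nothing to share.
        return 0.0
    h_xy = entropy(list(zip(feature, labels)))  # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    print(symmetrical_uncertainty([0, 0, 1, 1], labels))  # feature mirrors the labels
    print(symmetrical_uncertainty([0, 1, 0, 1], labels))  # feature independent of labels
```

A score near 1 means the feature almost determines the class label, and a score near 0 means it carries little information about it.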

    Entropy Based Fuzzy Support Vector Machine (EFSVM) for the Classification of Imbalanced Microarray Data

    DNA microarray data contain gene expression measurements with small sample sizes and a very large number of features. Class imbalance is also a common problem in microarray data: it occurs when a dataset is dominated by a majority class that has significantly more instances than the minority classes. A classification method is therefore needed that can handle both high dimensionality and class imbalance. SVM is a classification method capable of handling large or small samples, nonlinearity, high dimensionality, overfitting, and local minima. SVM has been widely applied to DNA microarray data classification and has been reported to give the best performance among machine learning methods. However, SVM treats all samples as equally important, so on imbalanced data its results are biased against the minority class. EFSVM is proposed to overcome this problem. The method assigns a fuzzy membership to each input point and reformulates the SVM so that different input points make different contributions to the classifier. Samples with higher class certainty, as measured by entropy, are assigned larger fuzzy memberships. Minority-class samples receive large fuzzy memberships, so EFSVM pays more attention to them. Because DNA microarray data are high-dimensional with a very large number of features, feature selection is performed first using FCBF. In this study, imbalanced DNA microarray data are classified using EFSVM after FCBF feature selection. The results show that feature selection improves classification performance. Overall, EFSVM achieves the highest AUC compared with SVM and FSVM, although on data to which feature selection has been applied, FSVM and EFSVM give nearly the same results.
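The entropy-based membership idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the EFSVM formulation from the paper: the class certainty of each majority sample is estimated from the labels of its `k` nearest neighbours, minority samples always receive membership 1.0, and `fuzzy_memberships`, `k`, and `beta` are hypothetical names and parameters chosen for this example.

```python
from math import dist, log2


def binary_entropy(p):
    """Entropy (in bits) of a two-class distribution with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def fuzzy_memberships(X, y, minority_label, k=3, beta=0.5):
    """Assign a fuzzy membership in [1 - beta, 1] to every sample.

    Assumption for this sketch: minority samples get 1.0, and majority
    samples lose weight in proportion to the entropy of the class mix
    among their k nearest neighbours (low certainty -> low membership).
    """
    memberships = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == minority_label:
            memberships.append(1.0)
            continue
        others = [j for j in range(len(X)) if j != i]
        neighbours = sorted(others, key=lambda j: dist(xi, X[j]))[:k]
        if not neighbours:
            memberships.append(1.0)
            continue
        p_min = sum(y[j] == minority_label for j in neighbours) / len(neighbours)
        memberships.append(1.0 - beta * binary_entropy(p_min))
    return memberships


if __name__ == "__main__":
    X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
    y = [0, 0, 0, 0, 1]
    print(fuzzy_memberships(X, y, minority_label=1, k=2))
```

Memberships of this kind could then be passed as per-sample weights to a weighted SVM, so that uncertain majority samples contribute less to the decision boundary.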

    Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data

    In this paper, a robust weighted score for unbalanced data (ROWSU) is proposed for selecting the most discriminative features for high dimensional gene expression binary classification with class-imbalance problem. The method addresses one of the most challenging problems of highly skewed class distributions in gene expression datasets that adversely affect the performance of classification algorithms. First, the training dataset is balanced by synthetically generating data points from minority class observations. Second, a minimum subset of genes is selected using a greedy search approach. Third, a novel weighted robust score, where the weights are computed by support vectors, is introduced to obtain a refined set of genes. The highest-scoring genes based on this approach are combined with the minimum subset of genes selected by the greedy search approach to form the final set of genes. The novel method ensures the selection of the most discriminative genes, even in the presence of skewed class distribution, thus improving the performance of the classifiers. The performance of the proposed ROWSU method is evaluated on 66 gene expression datasets. Classification accuracy and sensitivity are used as performance metrics to compare the proposed ROWSU algorithm with several other state-of-the-art methods. Boxplots and stability plots are also constructed for a better understanding of the results. The results show that the proposed method outperforms the existing feature selection procedures based on classification performance from k nearest neighbours (kNN) and random forest (RF) classifiers.

    Impact of Feature Extraction Combined with Data Sampling Methods on Heartbeat Categorization

    Dealing with class-imbalanced datasets in data analytics poses challenges, especially when the data are high-dimensional. To handle this issue, researchers often use preprocessing methods such as feature selection. Feature selection attempts to create a more informative and condensed feature set, while data sampling helps alleviate class imbalance. In our study, the aim is to explore the effectiveness of data sampling preprocessing techniques combined with feature extraction on an ECG heartbeat dataset. We evaluate three tree-based classifiers for feature extraction: Decision Tree, Random Forests (RF), and Gradient-Boosted Trees (GBT). For data sampling, we assess the effectiveness of two methods: Random Undersampling (RUS) and Synthetic Minority Oversampling (SMOTE). Performance is measured using sensitivity and specificity, two important metrics for assessing classification quality. Our findings show that the combination of RUS and GBT yields the highest performance for ECG heartbeat detection.
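For readers unfamiliar with the sampling and evaluation terms in this abstract, the following is a minimal sketch of random undersampling and of the sensitivity and specificity metrics. It uses only the Python standard library, the function names are hypothetical, and it does not reproduce the study's actual pipeline or its classifiers.

```python
import random


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for xi in rng.sample(rows, n_min):
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary problem."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    X = list(range(10))
    y = [0] * 8 + [1] * 2
    X_bal, y_bal = random_undersample(X, y)
    print(y_bal.count(0), y_bal.count(1))
    print(sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))
```

Undersampling should be applied to the training split only, so that the test split keeps the original class distribution when sensitivity and specificity are computed.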

    Hyperspectral Image Analysis with Subspace Learning-based One-Class Classification

    Hyperspectral image (HSI) classification is an important task in many applications, such as environmental monitoring, medical imaging, and land use/land cover (LULC) classification. Due to the significant amount of spectral information from recent HSI sensors, analyzing the acquired images is challenging using traditional Machine Learning (ML) methods. As the number of frequency bands increases, the number of training samples required to achieve a reasonable classification accuracy increases exponentially, a problem known as the curse of dimensionality. Therefore, separate band selection or dimensionality reduction techniques are often applied before performing any classification task over HSI data. In this study, we investigate recently proposed subspace learning methods for one-class classification (OCC). These methods map high-dimensional data to a lower-dimensional feature space that is optimized for one-class classification. In this way, no separate dimensionality reduction or feature selection procedure is needed in the proposed classification framework. Moreover, one-class classifiers can learn a data description from a single class only. Considering the imbalanced labels of the LULC classification problem and the rich spectral information (high number of dimensions), the proposed classification approach is well-suited for HSI data. Overall, this is a pioneering study focusing on subspace learning-based one-class classification for HSI data. We analyze the performance of the proposed subspace learning one-class classifiers in the proposed pipeline. Our experiments validate that the proposed approach helps tackle the curse of dimensionality along with the imbalanced nature of HSI data.