Feature selection for high dimensional imbalanced class data using harmony search
Misclassifying minority class data can be very costly in real-world applications. The problem is especially challenging when the data is also high-dimensional, because overfitting increases and model interpretability drops. Feature selection has recently become a popular way to address this problem by identifying the features that best predict a minority class. This paper introduces a novel feature selection method called SYMON, which uses symmetrical uncertainty and harmony search. Unlike existing methods, SYMON uses symmetrical uncertainty to weigh features with respect to their dependency on class labels. This helps to identify features that are powerful in retrieving the least frequent class labels. SYMON also uses harmony search to formulate the feature selection phase as an optimisation problem to select the best possible combination of features. The proposed algorithm is able to deal with situations where a set of features have the same weight by incorporating two vector tuning operations embedded in the harmony search process. In this paper, SYMON is compared against various benchmark feature selection algorithms that were developed to address the same issue. Our empirical evaluation on different microarray data sets using the G-Mean and AUC measures confirms that SYMON is comparable to or better than current benchmarks.
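Symmetrical uncertainty, the weighting measure named in the abstract above, is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The sketch below computes it for discrete features. It is an illustration only, not the SYMON implementation: the function names are hypothetical, and continuous microarray values are assumed to have been discretised beforehand.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(feature, labels):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x = entropy(feature)
    h_y = entropy(labels)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    h_xy = entropy(list(zip(feature, labels)))  # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)


# Toy imbalanced data: six majority samples (0) and two minority samples (1).
labels = [0, 0, 0, 0, 0, 0, 1, 1]
informative = ["lo", "lo", "lo", "lo", "lo", "lo", "hi", "hi"]
noisy = ["a", "b", "a", "b", "a", "b", "a", "b"]

print(symmetrical_uncertainty(informative, labels))  # close to 1: feature determines the label
print(symmetrical_uncertainty(noisy, labels))        # close to 0: feature is independent of the label
```

A feature that separates the minority class scores near 1 even though that class is rare, which is the property the abstract relies on when weighting features.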
Entropy Based Fuzzy Support Vector Machine (EFSVM) for Classification of Imbalanced Microarray Data
DNA microarrays are data containing gene expression with small sample sizes but a very large number of features. In addition, class imbalance is a common problem in microarray data. A classification method is therefore needed that can handle both the high-dimensional problem and the imbalance problem. SVM is a classification method capable of handling large or small samples, nonlinearity, high dimensionality, overlearning and local-minimum issues. SVM has also been widely applied to DNA microarray classification, with results showing that SVM gives the best performance among machine learning methods. However, imbalanced data is a weakness for SVM because SVM treats all samples with equal importance, which results in bias against the minority class. One method that can handle imbalanced data is EFSVM. EFSVM is able to produce the highest AUC value compared with SVM and FSVM. Because DNA microarray data is high-dimensional with a very large number of features, feature selection needs to be performed first. In this study, imbalanced DNA microarray data is classified using EFSVM, with feature selection first performed using FCBF. The classification results show that feature selection is able to improve classification performance. Adding entropy-based fuzzy membership is shown to yield the highest performance compared with SVM and FSVM; for data on which feature selection has been performed, however, FSVM and EFSVM give nearly identical results.
DNA microarrays are data containing gene expression with small sample sizes and a high number of features. Furthermore, class imbalance is a common problem in microarray data. It occurs when a dataset is dominated by a majority class that has significantly more instances than the minority classes. A classification method is therefore needed that can handle both high-dimensional and imbalanced data. SVM is a classification method capable of handling large or small samples, nonlinearity, high dimensionality, overlearning and local-minimum issues. SVM has been widely applied to DNA microarray data classification, and it has been shown to provide the best performance among machine learning methods. However, imbalanced data is a problem because SVM treats all samples with the same importance, so the results are biased against the minority class. To overcome the imbalance, EFSVM is proposed. This method applies a fuzzy membership to each input point and reformulates the SVM so that different input points make different contributions to the classifier. Samples with higher class certainty, as measured by entropy, are assigned larger fuzzy memberships. Samples from the minority classes receive large fuzzy memberships, and EFSVM pays more attention to samples with larger fuzzy membership. Because DNA microarray data is high-dimensional with a very large number of features, feature selection is performed first using FCBF. Based on the overall results, EFSVM has the highest AUC value compared to SVM and FSVM.
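The idea of assigning larger fuzzy memberships to samples with higher class certainty can be sketched as follows. The abstract does not give the membership formula, so everything specific here is an assumption made for illustration: class certainty is estimated from the label mix among the k nearest neighbours, majority samples are down-weighted linearly in that entropy, and the names `knn_class_entropy`, `fuzzy_memberships`, `k` and `beta` are hypothetical.

```python
import math


def knn_class_entropy(points, labels, index, k):
    """Entropy (natural log) of the class mix among the k nearest neighbours."""
    target = points[index]
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(points[i], target))
    neighbours = others[:k]
    p_pos = sum(1 for i in neighbours if labels[i] == 1) / len(neighbours)
    h = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            h -= p * math.log(p)
    return h


def fuzzy_memberships(points, labels, k=3, beta=0.5):
    """Minority samples (label 1) keep membership 1.0; majority samples (label 0)
    lose up to `beta` of their membership as neighbourhood entropy grows.
    This linear mapping is an assumed simplification."""
    max_entropy = math.log(2)  # entropy of a 50/50 neighbourhood
    memberships = []
    for i, label in enumerate(labels):
        if label == 1:
            memberships.append(1.0)
        else:
            h = knn_class_entropy(points, labels, i, k)
            memberships.append(1.0 - beta * h / max_entropy)
    return memberships


# Five majority points, one of which sits next to the two minority points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4), (5, 5), (5, 4)]
labels = [0, 0, 0, 0, 0, 1, 1]
weights = fuzzy_memberships(points, labels)
print(weights)
```

The resulting memberships would then be supplied as per-sample weights when training the SVM, so that uncertain majority samples near the class boundary contribute less to the decision surface.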
Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data
In this paper, a robust weighted score for unbalanced data (ROWSU) is
proposed for selecting the most discriminative feature for high dimensional
gene expression binary classification with class-imbalance problem. The method
addresses one of the most challenging problems of highly skewed class
distributions in gene expression datasets that adversely affect the performance
of classification algorithms. First, the training dataset is balanced by
synthetically generating data points from minority class observations. Second,
a minimum subset of genes is selected using a greedy search approach. Third, a
novel weighted robust score, where the weights are computed by support vectors,
is introduced to obtain a refined set of genes. The highest-scoring genes based
on this approach are combined with the minimum subset of genes selected by the
greedy search approach to form the final set of genes. The novel method ensures
the selection of the most discriminative genes, even in the presence of skewed
class distribution, thus improving the performance of the classifiers. The
performance of the proposed ROWSU method is evaluated on gene expression
datasets. Classification accuracy and sensitivity are used as performance
metrics to compare the proposed ROWSU algorithm with several other
state-of-the-art methods. Boxplots and stability plots are also constructed for
a better understanding of the results. The results show that the proposed
method outperforms the existing feature selection procedures based on
classification performance from k nearest neighbours (kNN) and random forest
(RF) classifiers. Comment: 25 pages
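The first ROWSU step, balancing the training data by synthetically generating points from minority class observations, could look like the sketch below. The abstract does not say how the synthetic points are generated; interpolating between a minority sample and one of its nearest minority neighbours (in the style of SMOTE) is an assumption, and `oversample_minority` and its parameters are hypothetical names.

```python
import math
import random


def oversample_minority(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        candidates = [p for p in minority if p is not base]
        candidates.sort(key=lambda p: math.dist(p, base))
        partner = rng.choice(candidates[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, partner)))
    return synthetic


minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = oversample_minority(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies between two real minority observations, the generated data stays inside the region the minority class already occupies.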
Impact of Feature Extraction Combined with Data Sampling Methods on Heartbeat Categorization
Dealing with class-imbalanced datasets in data analytics poses challenges, especially when the data is high-dimensional. To handle this issue, researchers often use preprocessing methods such as feature selection. Feature selection attempts to create a more informative and condensed feature set, while data sampling helps alleviate class imbalance. In our study, we aim to explore the effectiveness of data sampling preprocessing techniques combined with feature extraction on an ECG heartbeat dataset. We evaluate Decision Tree and the ensemble classifiers Random Forests (RF) and Gradient-Boosted Trees (GBT) for feature extraction. For data sampling, we assess the effectiveness of two methods: Random Undersampling (RUS) and the Synthetic Minority Oversampling Technique (SMOTE). Performance is measured using sensitivity and specificity, two important measures of classification accuracy. Our findings indicate that the combination of RUS and GBT yields the highest performance for ECG heartbeat detection.
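Random undersampling and the two reported metrics are simple enough to sketch directly. This is a minimal illustration, not the study's pipeline: the function names are hypothetical, every class is reduced to the size of the smallest one, and the toy labels are made up for the example.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class has as many as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        for x in rng.sample(items, target):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Toy data: ten majority samples (0) and three minority samples (1).
samples = list(range(13))
labels = [0] * 10 + [1] * 3
balanced_x, balanced_y = random_undersample(samples, labels)
print(balanced_y.count(0), balanced_y.count(1))

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(sensitivity_specificity(y_true, y_pred))
```

Reporting sensitivity and specificity separately shows how well each class is recovered, which plain accuracy hides when one class dominates.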
Hyperspectral Image Analysis with Subspace Learning-based One-Class Classification
Hyperspectral image (HSI) classification is an important task in many
applications, such as environmental monitoring, medical imaging, and land
use/land cover (LULC) classification. Due to the significant amount of spectral
information from recent HSI sensors, analyzing the acquired images is
challenging using traditional Machine Learning (ML) methods. As the number of
frequency bands increases, the required number of training samples increases
exponentially to achieve a reasonable classification accuracy, also known as
the curse of dimensionality. Therefore, separate band selection or
dimensionality reduction techniques are often applied before performing any
classification task over HSI data. In this study, we investigate recently
proposed subspace learning methods for one-class classification (OCC). These
methods map high-dimensional data to a lower-dimensional feature space that is
optimized for one-class classification. In this way, there is no separate
dimensionality reduction or feature selection procedure needed in the proposed
classification framework. Moreover, one-class classifiers have the ability to
learn a data description from the category of a single class only. Considering
the imbalanced labels of the LULC classification problem and rich spectral
information (high number of dimensions), the proposed classification approach
is well-suited for HSI data. Overall, this is a pioneering study focusing on
subspace learning-based one-class classification for HSI data. We analyze the
performance of the proposed subspace learning one-class classifiers in the
proposed pipeline. Our experiments validate that the proposed approach helps
tackle the curse of dimensionality along with the imbalanced nature of HSI
data.