
    CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification

    Class imbalance classification is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise classification accuracy by correctly classifying the majority class while misclassifying the minority class, yet in real-life applications the minority class instances represent the concept of greater interest. Recently, techniques based on sampling methods (under-sampling of the majority class and over-sampling of the minority class), cost-sensitive learning, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach combined with the AdaBoost boosting algorithm, called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to the RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We evaluated the performance of CUSBoost against state-of-the-art ensemble-learning methods such as AdaBoost, RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with various imbalance ratios. The experimental results show that CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets. Comment: CSITSS-201
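As a rough illustration of the cluster-based under-sampling idea behind CUSBoost (not the authors' reference implementation, which couples the sampling with each boosting round), the sketch below clusters the majority class with k-means, keeps a fraction of each cluster, and then fits AdaBoost on the reduced data; the parameters n_clusters and keep_frac are hypothetical.

```python
# Rough sketch of the CUSBoost idea: cluster-based under-sampling of the
# majority class followed by AdaBoost. Hypothetical parameters; the original
# algorithm re-samples inside each boosting round.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier

def cluster_undersample(X, y, majority_label, n_clusters=5, keep_frac=0.3, seed=0):
    rng = np.random.default_rng(seed)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    # Cluster only the majority class, then keep a fraction of each cluster
    # so the retained majority samples still cover its full distribution.
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X[maj_idx])
    kept = []
    for c in range(n_clusters):
        members = maj_idx[clusters == c]
        n_keep = max(1, int(keep_frac * len(members)))
        kept.append(rng.choice(members, size=n_keep, replace=False))
    sel = np.concatenate(kept + [min_idx])
    return X[sel], y[sel]

# Usage (majority label assumed to be 0 here):
# X_bal, y_bal = cluster_undersample(X, y, majority_label=0)
# clf = AdaBoostClassifier(n_estimators=100).fit(X_bal, y_bal)
```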

    Evaluation of F-Measure and Feature Analysis of C5.0 Implementation on Single Nucleotide Polymorphism Calling

    The volume of data in molecular biology has grown rapidly since Next-Generation Sequencing (NGS), the latest technology for sequencing DNA with high throughput, was introduced in 2000. A Single Nucleotide Polymorphism (SNP) is a DNA-based marker that can be used to identify an organism specifically. SNPs are usually exploited to optimize parent selection when producing high-quality seed for plant breeding. This paper discusses SNP calling on NGS data of cultivated soybean (Glycine max [L]. Merr) using C5.0, an improved rule-based algorithm derived from C4.5. The evaluation shows that C5.0 outperforms the other rule-based algorithm, CART, in terms of f-measure: the f-measure values for C5.0 and CART are 0.63 and 0.58, respectively. In addition, C5.0 is robust to imbalanced training data with ratios up to 1:17, but it suffers on large training datasets. C5.0's performance may be improved by applying bagging or another ensemble technique, much as CART is improved by applying bagging to the final decision. Another important consideration is using appropriate features to represent SNP candidates. Based on the information gain computed by C5.0, this paper recommends error probability, homopolymer left, mismatch alt, and mean nearby qual as features for SNP calling.
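Since C5.0 has no standard Python port, the sketch below illustrates only the evaluation side of such a comparison: a CART-style decision tree is trained on an imbalanced toy dataset (roughly 1:17, matching the ratio mentioned in the abstract) and scored with the f-measure; the data and feature ranking are illustrative stand-ins, not the paper's SNP features.

```python
# Sketch of the f-measure evaluation only: a CART-style decision tree on an
# imbalanced toy dataset (roughly 1:17), since C5.0 has no standard Python port.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for SNP candidate features; not the paper's data.
X, y = make_classification(n_samples=5000, n_features=8, weights=[0.944],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print("f-measure:", f1_score(y_te, clf.predict(X_te)))

# Impurity-based importances give an information-gain-style feature ranking.
print("feature importances:", clf.feature_importances_)
```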

    Adaptive subspace sampling for class imbalance processing

    © 2016 IEEE. This paper presents a novel oversampling technique that addresses highly imbalanced data distributions. At present, imbalanced data with anomalous class distributions and underrepresented classes are difficult to deal with using conventional machine learning technologies. In order to balance class distributions, an adaptive subspace self-organizing map (ASSOM) that combines a local mapping scheme with a globally competitive rule is proposed to artificially generate synthetic samples focused on the minority class. The ASSOM conforms to feature-invariance characteristics, including translation, scaling, and rotation, and it retains the independence of the basis vectors in each module. Specifically, the basis vectors generated by each ASSOM module avoid producing repeated representative features that offer nothing but heavy computational load. Experimental results demonstrate that the proposed ASSOM method, trained in a supervised manner, is superior to other existing oversampling techniques.
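The ASSOM itself is not reproduced here; as a loose sketch of the underlying prototype-based oversampling idea, the code below learns k-means prototypes over the minority class (standing in for the SOM codebook) and generates synthetic samples by jittering them. All parameters and names are assumptions for illustration.

```python
# Loose sketch of prototype-based oversampling for the minority class; k-means
# prototypes stand in for the SOM codebook (the paper's ASSOM is not reproduced).
import numpy as np
from sklearn.cluster import KMeans

def prototype_oversample(X_min, n_new, n_prototypes=8, noise_scale=0.1, seed=0):
    """Generate n_new synthetic minority samples around learned prototypes."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=seed).fit(X_min)
    protos = km.cluster_centers_
    # Small Gaussian jitter scaled to the per-feature spread of the minority class.
    scale = noise_scale * X_min.std(axis=0)
    picks = rng.integers(0, n_prototypes, size=n_new)
    noise = rng.normal(0.0, 1.0, size=(n_new, X_min.shape[1])) * scale
    return protos[picks] + noise

# Usage: X_min = X[y == minority_label]
#        X_syn = prototype_oversample(X_min, n_new=200)
```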

    Implementation of Sampling Techniques to Handle Imbalanced Data in Determining the Nutritional Status of Toddlers Using Learning Vector Quantization

    Toddlers require special monitoring because at this age they are vulnerable to disease and malnutrition. This study therefore aims to apply the Learning Vector Quantization (LVQ) method to classify toddlers' nutritional status into over-nourished, well-nourished, nutritionally vulnerable, and under-nourished categories. The data used in this study consist of 612 toddler nutrition records: 38 over-nourished, 491 well-nourished, 63 nutritionally vulnerable, and 20 under-nourished, so the dataset is imbalanced. The study then applies under-sampling and over-sampling techniques to address this problem. The results show that applying LVQ to the imbalanced data yields an accuracy of 84.15% but an overall accuracy of only 43.27%, whereas applying LVQ to the balanced data yields equal accuracy and overall accuracy values of 74.38%.
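LVQ is not part of scikit-learn, so the sketch below pairs a simple random over-sampling step with a minimal LVQ1 update rule; the prototype count, learning rate, and epoch count are illustrative choices, not the study's settings.

```python
# Sketch: random over-sampling to balance the classes, then a minimal LVQ1;
# learning rate, epochs and prototype count are illustrative.
import numpy as np

def random_oversample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.where(y == c)[0]
        idx.append(members)
        if n < target:  # duplicate minority-class samples with replacement
            idx.append(rng.choice(members, size=target - n, replace=True))
    sel = np.concatenate(idx)
    return X[sel], y[sel]

def lvq1_fit(X, y, prototypes_per_class=2, lr=0.05, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    protos, labels = [], []
    for c in np.unique(y):  # initialise prototypes from random class members
        members = np.where(y == c)[0]
        chosen = rng.choice(members, size=prototypes_per_class, replace=False)
        protos.append(X[chosen])
        labels.extend([c] * prototypes_per_class)
    P, L = np.vstack(protos).astype(float), np.array(labels)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            j = np.linalg.norm(P - X[i], axis=1).argmin()  # best-matching prototype
            step = lr * (X[i] - P[j])
            P[j] += step if L[j] == y[i] else -step        # attract or repel
    return P, L

def lvq1_predict(P, L, X):
    return L[np.argmin(np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2), axis=1)]

# Usage: X_bal, y_bal = random_oversample(X, y)
#        P, L = lvq1_fit(X_bal, y_bal)
#        y_pred = lvq1_predict(P, L, X_test)
```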

    The Empirical Comparison of Machine Learning Algorithm for the Class Imbalanced Problem in Conformational Epitope Prediction

    A conformational epitope is a part of a protein-based vaccine that is challenging to identify experimentally, so computational models are developed to support its identification. However, class imbalance is one of the constraints on achieving optimal performance in conformational B-cell epitope prediction. In this paper, we compare several conformational B-cell epitope prediction models built with non-ensemble and ensemble approaches. A sampling method drawn from random under-sampling, SMOTE, and cluster-based under-sampling is combined with a decision tree or SVM to build the non-ensemble models, while a random forest model and several variants of the bagging method are used to construct the ensemble models. A 10-fold cross-validation method is used to validate the models. The experimental results show that the combination of cluster-based under-sampling and a decision tree outperformed the other sampling methods in both the non-ensemble and ensemble settings. This study provides a baseline for improving existing models that deal with class imbalance in conformational epitope prediction.
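A minimal sketch of such a comparison protocol, assuming imbalanced-learn pipelines so that resampling is applied only inside each training fold; ClusterCentroids stands in for the paper's cluster-based under-sampling, and the models and scoring below are illustrative rather than the study's exact setup.

```python
# Sketch of the 10-fold cross-validated comparison; imbalanced-learn pipelines
# ensure resampling is applied only to the training folds. ClusterCentroids
# approximates cluster-based under-sampling; choices below are illustrative.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

samplers = {"random_us": RandomUnderSampler(random_state=0),
            "smote": SMOTE(random_state=0),
            "cluster_us": ClusterCentroids(random_state=0)}
learners = {"tree": DecisionTreeClassifier(random_state=0),
            "svm": SVC()}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# X, y = ...  # epitope feature matrix and labels
for s_name, sampler in samplers.items():
    for l_name, learner in learners.items():
        pipe = Pipeline([("sample", sampler), ("clf", learner)])
        # scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
        # print(s_name, l_name, scores.mean())

# Ensemble baseline for comparison:
rf = RandomForestClassifier(n_estimators=200, random_state=0)
# print("rf:", cross_val_score(rf, X, y, cv=cv, scoring="f1").mean())
```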