Class Imbalanced Learning Using the Synthetic Minority Over-sampling Technique – Nominal (SMOTE-N) Algorithm on a Childhood Tuberculosis Dataset
Abstract. Class Imbalanced Learning (CIL) is the process of learning data representations and extracting information from data with a severely skewed class distribution, in order to support effective decision-making. SMOTE-N is one of the data-level approaches in CIL and uses over-sampling: it generates synthetic instances of the minority class to balance the dataset. This research applied SMOTE-N to a Childhood Tuberculosis Dataset that has class imbalance. Over-sampling was chosen to avoid losing important information, because the Childhood Tuberculosis Dataset has a small number of instances. A Naïve Bayes Classifier was applied to the balanced dataset to evaluate the resulting model. The results show that SMOTE-N can improve CIL performance metrics.
Keywords: Class Imbalance Learning, Over-sampling, SMOTE-N, Naïve Bayes Classifier
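The over-sampling idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote_n_sketch`, the three-field records, and the class sizes are made up for illustration, and neighbours are ranked by plain Hamming distance as a simplifying assumption (SMOTE-N as originally formulated uses a modified Value Difference Metric). Each synthetic record takes, per feature, the majority value among a minority record and its nearest minority neighbours.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two nominal records differ."""
    return sum(x != y for x, y in zip(a, b))


def smote_n_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic nominal records from the minority class.

    Assumption: neighbours are ranked by Hamming distance, a simplification
    of the Value Difference Metric used by SMOTE-N proper.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base record (excluding itself)
        others = [r for r in minority if r is not base]
        neighbours = sorted(others, key=lambda r: hamming(base, r))[:k]
        # Each feature takes the majority value among base + neighbours;
        # ties go to the value seen first, i.e. the base record's value.
        new = tuple(
            Counter(column).most_common(1)[0][0]
            for column in zip(base, *neighbours)
        )
        synthetic.append(new)
    return synthetic


# Hypothetical nominal records (three yes/no fields), not real patient data.
minority = [
    ("yes", "yes", "no"),
    ("yes", "no", "no"),
    ("yes", "yes", "yes"),
    ("no", "yes", "no"),
]
majority_size = 10  # assumed size of the majority class

# Over-sample until both classes have the same number of records.
synthetic = smote_n_sketch(minority, n_new=majority_size - len(minority))
balanced_minority = minority + synthetic
print(len(balanced_minority))
```

Because every synthetic value is copied from an existing minority record, the sketch never invents a category that was not observed, which is the property that makes this style of over-sampling usable on nominal data.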
Distinct Multiple Learner-Based Ensemble SMOTEBagging (ML-ESB) Method for Classification of Binary Class Imbalance Problems
Traditional classification algorithms often fail to learn from highly imbalanced datasets because training is dominated by samples from the majority class rather than the minority class. In this paper, a Multiple Learners-based Ensemble SMOTEBagging (ML-ESB) technique is proposed. The ML-ESB algorithm is a modified SMOTEBagging technique in which the ensemble of multiple instances of a single learner is replaced by multiple distinct classifiers. The proposed ML-ESB is designed to handle only the binary class imbalance problem. The ensemble of distinct classifiers used in ML-ESB comprises Naïve Bayes, Support Vector Machine, Logistic Regression, and Decision Tree (C4.5). The performance of ML-ESB is evaluated on six binary imbalanced benchmark datasets using evaluation measures such as specificity, sensitivity, and area under the receiver operating characteristic curve. The results are compared with those of the SMOTEBagging, SMOTEBoost, and cost-sensitive MCS algorithms at different imbalance ratios (IR). The ML-ESB algorithm outperformed the other existing methods on four datasets with high dimensionality and high class IR, whereas it showed moderate performance on the remaining two datasets, which have low dimensionality and small IR values.
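As a reading aid, the evaluation measures named in this abstract can be sketched as below. The helper names and the toy labels are hypothetical, and the imbalance ratio is assumed to be the majority-class count divided by the minority-class count, which is a common convention that the abstract itself does not spell out.

```python
def imbalance_ratio(labels, minority=1):
    """Majority-class count divided by minority-class count (assumed definition)."""
    n_min = sum(1 for y in labels if y == minority)
    n_maj = len(labels) - n_min
    return n_maj / n_min


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1  # minority sample correctly detected
            else:
                fn += 1  # minority sample missed
        else:
            if p == positive:
                fp += 1  # majority sample wrongly flagged
            else:
                tn += 1  # majority sample correctly rejected
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels for illustration only: 2 minority (1) and 8 majority (0) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

ir = imbalance_ratio(y_true)
sens, spec = sensitivity_specificity(y_true, y_pred)
print(ir, sens, spec)  # 4.0 0.5 0.875
```

Reporting sensitivity and specificity separately matters under imbalance: in the toy data above, overall accuracy is 80% even though half of the minority samples are missed.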
Unified processing framework of high-dimensional and overly imbalanced chemical datasets for virtual screening
Virtual screening in drug discovery involves processing large datasets containing unknown molecules in order to find the ones that are likely to have the desired effects on a biological target, typically a protein receptor or an enzyme. Molecules are thereby classified as active or non-active in relation to the target. Misclassification of molecules in cases such as drug discovery and medical diagnosis is costly, both in time and finances. In the process of discovering a drug, it is mainly the inactive molecules classified as active towards the biological target, i.e. false positives, that cause delays in progress and high late-stage attrition. However, despite the pool of techniques available, selecting a suitable approach for each situation is still a major challenge. This PhD thesis is designed to develop a pioneering framework that enables the analysis of virtual screening of chemical compound datasets in a wide range of settings in a unified fashion. The proposed method provides a better understanding of the dynamics of innovatively combining data processing and classification methods in order to screen massive, potentially high-dimensional and overly imbalanced datasets more efficiently.