
    Comparison of the SMOTE and ADASYN Methods for Handling Imbalanced MultiClass Data

    Data mining combines several disciplines, including database systems, statistics, machine learning, and visualization, to analyze large datasets and extract useful characteristics. To address the problem of imbalanced datasets, in which the class distribution is non-uniform, this study compares the SMOTE and ADASYN methods for balancing the number of samples between the majority (negative) and minority (positive) classes. The experimental results show that combining SMOTE with a classification method handles the imbalance between the majority and minority classes and produces MCC and Gmean values with better predictive performance than the classifier alone or the classifier combined with ADASYN. For the MultiClass dataset, the highest scores were achieved by SMOTE + KNN, with an MCC of 0.64 and a Gmean of 0.74. This indicates that handling the imbalanced class distribution in the data preprocessing stage affects the MCC and Gmean scores obtained with SMOTE + KNN.
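
    A minimal sketch of the comparison described above, assuming the scikit-learn and imbalanced-learn libraries: SMOTE and ADASYN are applied as a preprocessing step before KNN and scored with MCC and G-mean. The synthetic three-class dataset and its class weights are illustrative stand-ins for the MultiClass dataset used in the study.

```python
# Hedged sketch: compare no resampling, SMOTE, and ADASYN ahead of KNN.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import matthews_corrcoef
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.metrics import geometric_mean_score

# Imbalanced three-class toy data standing in for the MultiClass dataset.
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

for name, sampler in [("none", None), ("SMOTE", SMOTE(random_state=42)),
                      ("ADASYN", ADASYN(random_state=42))]:
    Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    y_pred = KNeighborsClassifier(n_neighbors=5).fit(Xr, yr).predict(X_te)
    print(f"{name:7s} MCC={matthews_corrcoef(y_te, y_pred):.2f} "
          f"Gmean={geometric_mean_score(y_te, y_pred):.2f}")
```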

    SMOTified-GAN for class imbalanced pattern classification problems

    Class imbalance in a dataset is a major problem for classifiers: on a dataset whose majority class is positive it yields poor predictions, with a high true positive rate (TPR) but a low true negative rate (TNR). Generally, the pre-processing technique of oversampling the minority class(es) is used to overcome this deficiency. Our focus is on the hybridization of a Generative Adversarial Network (GAN) and the Synthetic Minority Over-Sampling Technique (SMOTE) to address class-imbalanced problems. We propose a novel two-phase oversampling approach involving knowledge transfer that has the synergy of SMOTE and GAN. The unrealistic or overgeneralized samples of SMOTE are transformed into a realistic distribution of data by the GAN, in settings where there is not enough minority class data for a GAN to work effectively on its own. We name it SMOTified-GAN because the GAN works on pre-sampled minority data produced by SMOTE rather than generating the samples randomly by itself. The experimental results show that the sample quality of the minority class(es) is improved on a variety of benchmark datasets, with performance up to 9% better than the next-best tested algorithm on the F1-score. Its time complexity is also reasonable, at around O(N²d²T) for a sequential algorithm.
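
    The core idea can be illustrated with a small, hedged sketch (not the authors' implementation): SMOTE first oversamples the minority class, then a small GAN refines those synthetic points by feeding them to the generator in place of random noise, while the discriminator compares the generator's outputs against real minority samples. The toy data, network sizes, and training schedule below are assumptions.

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced binary toy data; class 1 is the minority class.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

# Phase 1: SMOTE. imbalanced-learn appends the synthetic points after the originals.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
real_min = torch.tensor(X[y == 1], dtype=torch.float32)       # genuine minority rows
smote_min = torch.tensor(X_res[len(X):], dtype=torch.float32)  # SMOTE-only rows

# Phase 2: a small GAN whose generator refines SMOTE samples instead of noise.
d = X.shape[1]
G = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d))
D = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for _ in range(200):
    # Discriminator step: real minority -> 1, refined SMOTE samples -> 0.
    fake = G(smote_min).detach()
    d_loss = (bce(D(real_min), torch.ones(len(real_min), 1)) +
              bce(D(fake), torch.zeros(len(fake), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: make refined SMOTE samples look like real minority data.
    g_loss = bce(D(G(smote_min)), torch.ones(len(smote_min), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# The refined samples would replace the raw SMOTE points in the training set.
refined_minority = G(smote_min).detach().numpy()
```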

    Optimization of Imbalanced Data in Drug-Target Interaction with Sampling and Ensemble Support Vector Machine

    Imbalanced data is one of the problems that arise in prediction and classification tasks. This research focuses on handling imbalanced data in the prediction of drug-target interactions (compound-protein interactions). Compound-protein interaction databases contain many target proteins and drug compounds whose interactions have not yet been validated experimentally, and these unknown interactions make the proportion of known to unknown interactions highly imbalanced. Such a highly imbalanced interaction dataset can bias the prediction results. There are many ways to handle imbalanced data; this research implements a method that combines a Biased Support Vector Machine (BSVM), oversampling, and undersampling with an Ensemble Support Vector Machine (SVM), an approach that has previously been applied to other kinds of data, such as images. The study explores the effect of the sampling schemes combined in this method on compound-protein interaction data. The method was tested on the Nuclear Receptor, G-Protein Coupled Receptor (GPCR), and Ion Channel datasets, whose imbalance ratios are 14.6%, 32.36%, and 28.2%, respectively. Evaluation on these three datasets gives area under the curve (AUC) values of 63.4%, 71.4%, and 61.3%, and F-measure values of 54%, 60.7%, and 39%, respectively. The accuracy of the proposed method is still fairly good, even though it is lower than that of an SVM without any treatment; that higher SVM accuracy is biased, however, because its AUC and F-measure turn out to be lower. This shows that the proposed method can reduce the bias on the tested imbalanced data and improve AUC and F-measure by roughly 5%-20%.
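
    One ingredient of the approach above, the ensemble of SVMs trained on rebalanced subsets, can be sketched as follows. This is a generic illustration on synthetic data with scikit-learn, not the authors' BSVM pipeline: each ensemble member sees all minority (known-interaction) samples plus an equal-sized random undersample of the majority class, and the decision scores are averaged.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, f1_score

X, y = make_classification(n_samples=4000, n_features=20, weights=[0.93, 0.07],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

rng = np.random.default_rng(1)
minority_idx = np.where(y_tr == 1)[0]
majority_idx = np.where(y_tr == 0)[0]

scores = np.zeros(len(X_te))
n_members = 10
for _ in range(n_members):
    # Balanced subset: every minority sample plus an equal-sized majority draw.
    sub = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    idx = np.concatenate([minority_idx, sub])
    svm = SVC(kernel="rbf").fit(X_tr[idx], y_tr[idx])
    scores += svm.decision_function(X_te)
scores /= n_members                       # average the ensemble's decision scores

y_pred = (scores > 0).astype(int)
print("AUC      :", roc_auc_score(y_te, scores))
print("F-measure:", f1_score(y_te, y_pred))
```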

    COMPARISON OF DATASET OVERSAMPLING ALGORITHMS AND THEIR APPLICABILITY TO THE CATEGORIZATION PROBLEM

    The subject of research in the article is the problem of classification in machine learning in the presence of imbalanced classes in datasets. The purpose of the work is to analyze existing solutions and algorithms for the problem of dataset imbalance across different data types and industries, and to conduct an experimental comparison of algorithms. The article solves the following tasks: analyze approaches to solving the problem (preprocessing methods, learning methods, hybrid methods, and algorithmic approaches); define and describe the oversampling algorithms most often used to balance datasets; select classification algorithms that serve as a tool for establishing the quality of balancing by checking the applicability of the datasets obtained after oversampling; determine metrics for assessing the quality of classification for the comparison; and conduct experiments according to the proposed methodology. For clarity, datasets with varying degrees of imbalance were considered (the number of instances of the minority class was equal to 15, 30, 45, and 60% of the number of samples of the majority class). The following methods are used: analytical and inductive methods for determining the necessary set of experiments and building hypotheses regarding their results, and experimental and graphic methods for obtaining a visual comparative characterization of the selected algorithms. The following results were obtained: with the help of quality metrics, an experiment was conducted for all algorithms on two different datasets, the Titanic passenger dataset and a dataset for detecting fraudulent transactions in bank accounts. The results indicated the best applicability of the SMOTE and SVM SMOTE algorithms and the worst performance of Borderline SMOTE and k-means SMOTE, and at the same time described the results of each algorithm and the potential of its usage. Conclusions: the analytical and experimental method provided a comprehensive comparative description of the existing balancing algorithms. The superiority of oversampling algorithms over undersampling algorithms was demonstrated. The selected algorithms were compared using different classification algorithms. The results were presented using graphs and tables, and summarized using heat maps. The conclusions can be used when choosing the optimal balancing algorithm in the field of machine learning.
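
    A minimal sketch of such a comparison, assuming imbalanced-learn's oversamplers and a single scikit-learn classifier; the synthetic dataset, the random forest, and the roughly 15% minority-to-majority ratio are illustrative choices rather than the article's experimental setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, balanced_accuracy_score
from imblearn.over_sampling import SMOTE, SVMSMOTE, BorderlineSMOTE, KMeansSMOTE

# Minority class is roughly 15% of the majority class, one of the ratios studied.
X, y = make_classification(n_samples=5000, n_features=15, weights=[0.87, 0.13],
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

samplers = {"SMOTE": SMOTE(random_state=7),
            "SVM SMOTE": SVMSMOTE(random_state=7),
            "Borderline SMOTE": BorderlineSMOTE(random_state=7),
            "k-means SMOTE": KMeansSMOTE(random_state=7,
                                         cluster_balance_threshold=0.1)}

for name, sampler in samplers.items():
    try:
        Xr, yr = sampler.fit_resample(X_tr, y_tr)
    except RuntimeError as err:  # k-means SMOTE can fail if no cluster is minority-rich enough
        print(f"{name:17s} skipped ({err})")
        continue
    clf = RandomForestClassifier(random_state=7).fit(Xr, yr)
    y_pred = clf.predict(X_te)
    print(f"{name:17s} F1={f1_score(y_te, y_pred):.3f} "
          f"bal.acc={balanced_accuracy_score(y_te, y_pred):.3f}")
```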

    Recognition of Multiple Imbalanced Cancer Types Based on DNA Microarray Data Using Ensemble Classifiers


    Investigation into the Predictive Power of Artificial Neural Networks and Logistic Regression for Predicting Default in Chit Funds

    This study evaluated the performance of an artificial neural network (ANN) multi-layer perceptron model and a logistic regression LogitBoost (LR) model for predicting default in chit funds. The two types of default investigated were late payment of 30 days and late payment of 90 days. The dataset was split into training and validation sets using random sampling, and K-fold cross-validation was used on the training set to assess the performance of the tuning parameters. The validation set was used to compare the performance of both algorithms. Principal component analysis (PCA) was used to reduce the feature set while still explaining 95% of the variance in the data. The classes were highly imbalanced, and the Synthetic Minority Oversampling Technique (SMOTE) and downsampling were used to overcome the class imbalance. Sixteen experiments were run, eight for each of the two defaults. The three key metrics measured for these experiments were balanced accuracy, area under the ROC curve (AUC), and F1 score. After making Bonferroni's adjustment to the original p-value, statistical significance was set to 0.003 when comparing multiple experiments. In these experiments the ANN model had the best results for balanced accuracy, AUC, and F1 score. Statistical analysis using a paired t-test showed a statistically significant difference between the results of the ANN and LR models. The results also showed very little difference in the contribution of the top 20 features to the first 30 principal components used to predict default. These features included family id, income, and address. Features that contributed little or nothing to the principal components included commission, auction amount, and the nominee's relation to the chit fund member. These findings are context specific; in this case the context is chit funds from a digital chit fund operator in India.
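
    The modelling pipeline described above can be sketched as follows on synthetic data: PCA retaining 95% of the variance, SMOTE to rebalance the training data, and an MLP (the ANN) compared against plain logistic regression (a stand-in for the LogitBoost variant used in the study), scored with balanced accuracy, AUC, and F1. All data and hyperparameters here are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, f1_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # pipeline variant that allows resampling steps

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.95, 0.05],
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

models = {"ANN (MLP)": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                     random_state=3),
          "Logistic regression": LogisticRegression(max_iter=1000)}

for name, clf in models.items():
    pipe = Pipeline([("scale", StandardScaler()),
                     ("pca", PCA(n_components=0.95)),   # keep 95% of the variance
                     ("smote", SMOTE(random_state=3)),  # resampling applied to training data only
                     ("clf", clf)]).fit(X_tr, y_tr)
    proba = pipe.predict_proba(X_te)[:, 1]
    y_pred = pipe.predict(X_te)
    print(f"{name:20s} bal.acc={balanced_accuracy_score(y_te, y_pred):.3f} "
          f"AUC={roc_auc_score(y_te, proba):.3f} F1={f1_score(y_te, y_pred):.3f}")
```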

    Improving OCR Post Processing with Machine Learning Tools

    Optical Character Recognition (OCR) post processing involves data cleaning steps for documents that were digitized, such as a book or a newspaper article. One step in this process is the identification and correction of spelling and grammar errors generated by flaws in the OCR system. This work reports on our efforts to enhance post processing for large repositories of documents. The main contributions of this work are:
    • Development of tools and methodologies to build both OCR and ground truth text correspondence for training and testing the proposed techniques in our experiments. In particular, we explain the alignment problem and tackle it with our de novo algorithm, which has shown a high success rate.
    • Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
    • Application of machine learning tools to generalize past ad hoc approaches to OCR error correction. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text (a sketch follows below).
    • Use of container technology to address the state of reproducible research in OCR and in Computer Science as a whole. Many past experiments in the field of OCR are not considered reproducible research, which raises the question of whether the original results were outliers or finessed.
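
    The logistic regression step mentioned above can be illustrated with a small sketch: a model trained on features of (error, candidate) pairs scores each candidate correction, and the highest-scoring candidate is chosen. The features (edit distance, candidate frequency, context score) and the toy training data are purely illustrative assumptions, not the paper's feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row describes one (OCR error, candidate correction) pair:
# [edit_distance, log_candidate_frequency, context_match_score]
X_train = np.array([[1, 8.2, 0.9], [2, 3.1, 0.2], [1, 6.5, 0.7], [3, 2.0, 0.1],
                    [1, 7.0, 0.8], [2, 5.5, 0.3], [1, 4.0, 0.6], [4, 1.5, 0.0]])
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # 1 = candidate was the true word

model = LogisticRegression().fit(X_train, y_train)

# Candidates for the OCR token "tbe": score each and keep the most probable.
candidates = ["the", "toe", "tube"]
features = np.array([[1, 9.0, 0.95],   # "the"
                     [1, 5.0, 0.10],   # "toe"
                     [2, 5.5, 0.05]])  # "tube"
probs = model.predict_proba(features)[:, 1]
best = candidates[int(np.argmax(probs))]
print(dict(zip(candidates, probs.round(3))), "->", best)
```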

    Evaluation of SMOTE for High-Dimensional Class-Imbalanced Microarray Data
