An Improved SMOTE Algorithm Based on Genetic Algorithm for Imbalanced Data Collection

Abstract

Classification of imbalanced data is recognized as a crucial problem in machine learning and data mining. In an imbalanced dataset, minority class instances are likely to be misclassified. When the synthetic minority over-sampling technique (SMOTE) is applied to imbalanced dataset classification, the same sampling rate is set for all minority class samples when synthesizing new samples, so the over-sampling is blind to differences among individual samples. To overcome this problem, an improved SMOTE algorithm based on the genetic algorithm (GA), namely GASMOTE, is proposed. First, GASMOTE sets different sampling rates for different minority class samples, and each combination of sampling rates corresponds to an individual in the population. Second, the selection, crossover, and mutation operators of GA are applied iteratively to the population until the stopping criteria are met, which yields the best combination of sampling rates. Lastly, the best combination of sampling rates is used in SMOTE to synthesize new samples. Experimental results on 10 typical imbalanced datasets show that GASMOTE increases the F-measure value by 5.9% and the G-mean value by 1.6% compared with the SMOTE algorithm, and increases the F-measure value by 3.7% and the G-mean value by 2.3% compared with the borderline-SMOTE algorithm. GASMOTE can therefore be used as a new over-sampling technique for imbalanced dataset classification. The GASMOTE algorithm is then applied to a practical engineering problem, namely rockburst prediction on VCR rockburst datasets. The experimental results indicate that the GASMOTE algorithm can accurately predict rockburst occurrence and thus provides guidance for the design and construction of safe deep-mining engineering structures.
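
The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`smote_with_rates`, `evolve_rates`) are hypothetical, and so are the integer encoding of sampling rates, binary tournament selection, one-point crossover, elitism, and the fixed generation count used as the stopping criterion. The abstract does not state the fitness function, so the sketch takes it as a caller-supplied `fitness` argument; one plausible choice would be to score each rate vector by the F-measure or G-mean of a classifier trained on the over-sampled data.

```python
import numpy as np

def smote_with_rates(X_min, rates, k=5, rng=None):
    """Create rates[i] synthetic samples from minority sample i.

    Each synthetic point lies on the segment between sample i and one of
    its k nearest minority neighbours. Assumes at least two minority samples.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances among minority samples only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for i, rate in enumerate(rates):
        for _ in range(int(rate)):
            j = rng.choice(neighbours[i])
            gap = rng.random()  # position along the segment x_i -> x_j
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic).reshape(-1, X_min.shape[1])

def evolve_rates(X_min, fitness, max_rate=5, pop_size=20, generations=30,
                 mutation_prob=0.1, seed=0):
    """Search for a per-sample rate vector that maximizes `fitness`.

    An individual is an integer vector with one sampling rate per minority
    sample. All operator choices below are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    pop = rng.integers(0, max_rate + 1, size=(pop_size, n))
    for _ in range(generations):  # assumed stopping criterion: fixed budget
        scores = np.array([fitness(ind) for ind in pop])
        # Selection: binary tournament between randomly drawn pairs.
        a, b = rng.integers(0, pop_size, size=(2, pop_size))
        parents = np.where((scores[a] >= scores[b])[:, None], pop[a], pop[b])
        # Crossover: one-point, applied to consecutive parent pairs.
        children = parents.copy()
        for p in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n) if n > 1 else 0
            children[p, cut:] = parents[p + 1, cut:]
            children[p + 1, cut:] = parents[p, cut:]
        # Mutation: resample a gene uniformly with small probability.
        mask = rng.random(children.shape) < mutation_prob
        children[mask] = rng.integers(0, max_rate + 1, size=mask.sum())
        # Elitism: carry the best individual of this generation forward.
        children[0] = pop[np.argmax(scores)]
        pop = children
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)]

if __name__ == "__main__":
    X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    # Toy placeholder fitness: prefer rate vectors that add 6 samples in total.
    toy_fitness = lambda rates: -abs(int(rates.sum()) - 6)
    best = evolve_rates(X_min, toy_fitness, generations=20, seed=1)
    new_samples = smote_with_rates(X_min, best, k=2, rng=1)
    print(best, new_samples.shape)
```

The toy fitness in the usage example only exercises the search loop; it says nothing about classification quality. Replacing it with a cross-validated classifier score would bring the sketch closer to the procedure the abstract describes, at the cost of one model fit per individual per generation.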
