28 research outputs found

    Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)

    Full text link
    We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria and they did not (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-criteria tests while (b) fixing the weaker regions of the training data (using SMOTUNED, which is a self-tuning version of SMOTE). This approach leads to dramatically large increases in software defect predictions. When applied in a 5*5 cross-validation study for 3,681 JAVA classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict for quality. Same kind of pattern (improvement) was observed when a comparative analysis of SMOTE and SMOTUNED was done against the most recent class imbalance technique. In conclusion, for software analytic tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing.Comment: 10 pages + 2 references. Accepted to International Conference of Software Engineering (ICSE), 201

    Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)

    Full text link
    We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria and they did not (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-criteria tests while (b) fixing the weaker regions of the training data (using SMOTUNED, which is a self-tuning version of SMOTE). This approach leads to dramatically large increases in software defect predictions. When applied in a 5*5 cross-validation study for 3,681 JAVA classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict for quality. Same kind of pattern (improvement) was observed when a comparative analysis of SMOTE and SMOTUNED was done against the most recent class imbalance technique. In conclusion, for software analytic tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing.Comment: 10 pages + 2 references. Accepted to International Conference of Software Engineering (ICSE), 201

    Is One Hyperparameter Optimizer Enough?

    Full text link
    Hyperparameter tuning is the black art of automatically finding a good combination of control parameters for a data miner. While widely applied in empirical Software Engineering, there has not been much discussion on which hyperparameter tuner is best for software analytics. To address this gap in the literature, this paper applied a range of hyperparameter optimizers (grid search, random search, differential evolution, and Bayesian optimization) to defect prediction problem. Surprisingly, no hyperparameter optimizer was observed to be `best' and, for one of the two evaluation measures studied here (F-measure), hyperparameter optimization, in 50\% cases, was no better than using default configurations. We conclude that hyperparameter optimization is more nuanced than previously believed. While such optimization can certainly lead to large improvements in the performance of classifiers used in software analytics, it remains to be seen which specific optimizers should be applied to a new dataset.Comment: 7 pages, 2 columns, accepted for SWAN1

    Software Defect Prediction Using Neural Network Based SMOTE

    Get PDF
    Software defect prediction is a practical approach to improve the quality and efficiency of time and costs for software testing by focusing on defect modules. The defect prediction software dataset naturally has a class imbalance problem with very few defective modules compared to non-defective modules. Class imbalance can reduce performance from classification. In this study, we applied the Neural Networks Based Synthetic Minority Over-sampling Technique (SMOTE) to overcome class imbalances in the six NASA datasets. Neural Network based on SMOTE is a combination of Neural Network and SMOTE with each hyperparameters that are optimized using random search. The results use a nested 5-cross validation show increases Bal by 25.48% and Recall by 45.99% compared to the original Neural Network. We also compare the performance of Neural Network based SMOTE with SMOTE + Traditional Machine Learning Algorithm. The Neural Network based SMOTE takes first place in the average rank
    corecore