Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)
We report and fix an important systematic error in prior studies that ranked
classifiers for software analytics. Those studies neither (a) assessed
classifiers on multiple criteria nor (b) studied how variations in
the data affect the results. Hence, this paper applies (a) multi-criteria tests
while (b) fixing the weaker regions of the training data (using SMOTUNED, which
is a self-tuning version of SMOTE). This approach leads to dramatically large
increases in the performance of software defect predictors. When applied in a 5*5
cross-validation study for 3,681 JAVA classes (containing over a million lines
of code) from open source systems, SMOTUNED increased AUC and recall by 60% and
20% respectively. These improvements are independent of the classifier used to
predict quality. The same pattern of improvement was observed when SMOTE and
SMOTUNED were compared against a recent class-imbalance technique.
In conclusion, for software analytics tasks like
defect prediction, (1) data pre-processing can be more important than
classifier choice, (2) ranking studies are incomplete without such
pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing.

Comment: 10 pages + 2 references. Accepted to the International Conference on
Software Engineering (ICSE), 2018.
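To make the abstract's technique concrete, below is a minimal sketch of tuning SMOTE in the spirit of SMOTUNED, assuming scikit-learn and imbalanced-learn are available. The paper's SMOTUNED tunes SMOTE's parameters (neighbourhood size, amount of oversampling, distance metric) with a differential-evolution optimizer; here a plain grid search over k_neighbors and sampling_strategy stands in for that tuner, and the synthetic dataset, grid values, and classifier are illustrative choices, not the paper's setup.

```python
# Minimal sketch of SMOTE parameter tuning (a stand-in for SMOTUNED's
# differential-evolution tuner), assuming scikit-learn and imbalanced-learn.
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic stand-in for a defect dataset: ~10% of classes are "defective".
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

best_auc, best_params = 0.0, None
# Grid over SMOTE's neighbourhood size k and the amount of oversampling;
# SMOTUNED searches a similar space (plus the distance metric) automatically.
for k, ratio in product([1, 3, 5, 7], [0.5, 0.75, 1.0]):
    pipe = Pipeline([
        ("smote", SMOTE(k_neighbors=k, sampling_strategy=ratio,
                        random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ])
    # Oversampling happens inside the pipeline, so only the training
    # folds are rebalanced, never the held-out test fold.
    auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    if auc > best_auc:
        best_auc, best_params = auc, (k, ratio)

print(f"best AUC={best_auc:.3f} with k={best_params[0]}, "
      f"oversampling ratio={best_params[1]}")
```

Because the oversampler runs inside the cross-validation pipeline, only training folds are rebalanced; applying SMOTE before splitting would leak synthetic copies of test-fold minority examples into training and inflate the measured AUC.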