An extensive empirical study of inconsistent labels in multi-version-project defect data sets
The label quality of defect data sets has a direct influence on the
reliability of defect prediction models. In this study, we propose an
approach that automatically detects instances with inconsistent labels in
multi-version-project defect data sets (i.e., instances that have the same
source code but different labels across multiple versions of a software
project), and we examine the influence of such instances on the evaluation
and interpretation of defect prediction models. Based on five multi-version-project
defect data sets (either widely used or the most up-to-date in the literature)
collected by diverse approaches, we find that: (1) most versions in the
investigated defect data sets contain inconsistent labels to varying degrees;
(2) inconsistent labels in a training data set can considerably change the
prediction performance of a defect prediction model and can lead to the
identification of substantially different truly defective modules; and
(3) the importance ranking of independent variables in a defect prediction
model can shift substantially in the presence of inconsistent labels.
These findings reveal that inconsistent labels in
defect data sets can profoundly change the prediction ability and
interpretation of a defect prediction model. We therefore strongly suggest
that practitioners detect and exclude inconsistent labels from defect data
sets to avoid their potential negative influence on defect prediction models.
Moreover, researchers should improve existing defect label collection
approaches to reduce inconsistent labels. Finally, the experimental
conclusions of previous studies that used multi-version-project defect data
sets with a high ratio of inconsistent labels need to be re-examined.
Comment: 63 pages, 24 figures, 14 tables
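As a rough illustration of the detection idea described in the abstract, the sketch below flags instances whose source code is byte-identical across versions yet carries conflicting defect labels. This is a minimal sketch, not the paper's implementation: the `Instance` record, its field names, and the use of a SHA-256 content hash as the code-identity test are all assumptions for illustration.

```python
# Minimal sketch of inconsistent-label detection, assuming each dataset row
# carries a module identifier, its version, the module's source text, and a
# binary defect label. Names here are hypothetical, not from the paper.
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class Instance:
    module: str   # module identifier, e.g. a fully qualified class name
    version: str  # project version the instance was collected from
    source: str   # source code of the module in that version
    label: int    # 1 = defective, 0 = clean

def find_inconsistent_labels(instances: Iterable[Instance]) -> list[Instance]:
    """Return instances whose source code is byte-identical to another
    instance (in any version) that carries a different defect label."""
    by_hash: dict[str, list[Instance]] = defaultdict(list)
    for inst in instances:
        digest = hashlib.sha256(inst.source.encode("utf-8")).hexdigest()
        by_hash[digest].append(inst)
    inconsistent: list[Instance] = []
    for group in by_hash.values():
        labels = {inst.label for inst in group}
        if len(labels) > 1:  # same code, conflicting labels across versions
            inconsistent.extend(group)
    return inconsistent

# Example: the same unchanged class is labelled clean in v1.0 but defective
# in v1.1, so both rows are flagged as inconsistently labelled.
rows = [
    Instance("Foo", "1.0", "class Foo { int x; }", 0),
    Instance("Foo", "1.1", "class Foo { int x; }", 1),
    Instance("Bar", "1.0", "class Bar {}", 0),
]
print(find_inconsistent_labels(rows))  # -> the two Foo instances
```

A practitioner could run such a filter over each training set and drop the flagged instances before model fitting; variations (e.g., normalizing whitespace or comments before hashing) would change what counts as "the same source code".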