
    Snoring: A Noise in Defect Prediction Datasets

    Defect prediction aims at identifying software artifacts that are likely to exhibit a defect. The main purpose of defect prediction is to reduce the cost of testing and code review by letting developers focus on specific artifacts. Several researchers have worked on improving the accuracy of defect estimation models using techniques such as tuning, re-balancing, or feature selection. Ultimately, the reliability of a prediction model depends on the quality of the dataset. Therefore, effort has been spent on identifying sources of noise in the datasets and how to deal with them, including defect misclassification and defect origin. A key component of defect prediction approaches is the attribution of a defect to a project's release. Although developers might be able to attribute a defect to a specific release, in most cases a defect is attributed to the release after which it was discovered. However, in many circumstances, a defect is only discovered several releases after its introduction. This might introduce a bias in the dataset, i.e., treating the intermediate releases as defect-free and only the release of discovery as defect-prone. We call this phenomenon a "sleeping defect". We call "snoring" the phenomenon in which classes are affected only by sleeping defects, i.e., classes that would be treated as defect-free until the defect is discovered. In this work, we analyze data from more than 4,000 bugs and 600 releases of 20 open-source projects from the Apache ecosystem to investigate: 1) the magnitude of the sleeping defects, 2) the magnitude of the snoring classes, 3) whether snoring impacts the evaluation of classifiers, 4) whether snoring impacts classifier accuracy, and 5) whether removing the last releases of data is beneficial in reducing the negative impact of the snoring noise on classifier accuracy. Our results show that, on average across projects: 1) most of the defects in a project slept for more than 19% of the existing releases, 2) the missing rate is more than 50% unless we remove more than 20% of the releases, 3) the relative error in measuring classifier accuracy using a dataset with snoring is about 100% in all accuracy metrics other than AUC, 4) the presence of snoring decreases the accuracy of each of the 15 classifiers on each of the 6 accuracy metrics; for instance, Recall, F1, Kappa, and Matthews decrease by about 80%, and 5) removing one release of data is better than removing no data in all accuracy metrics; for instance, Recall, F1, Kappa, and Matthews increase by about 30%.
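    To make the snoring phenomenon concrete, here is a minimal sketch (not the authors' tooling; the column names such as labeled_defective and the single-class example are hypothetical) that flags the release-level labels a sleeping defect would corrupt and trims the last k releases, the mitigation the abstract evaluates:

```python
import pandas as pd

# One row per (class, release) with the label a conventional dataset
# would assign: defective only in the release where the bug was found.
data = pd.DataFrame({
    "class_name": ["Foo", "Foo", "Foo", "Foo"],
    "release": [1, 2, 3, 4],
    "labeled_defective": [False, False, False, True],
})

# Ground truth, known only in hindsight: the defect was introduced in
# release 1 but discovered in release 4. Releases 1-3 are "snoring":
# the class is labeled defect-free while the sleeping defect is present.
introduced_in, discovered_in = 1, 4
data["snoring"] = (
    (data["release"] >= introduced_in)
    & (data["release"] < discovered_in)
    & ~data["labeled_defective"]
)

def trim_last_releases(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Drop the newest k releases, where sleeping defects are most
    likely to still be undiscovered and thus pollute the labels."""
    cutoff = df["release"].max() - k
    return df[df["release"] <= cutoff]

print(data)                        # releases 1-3 are flagged as snoring
print(trim_last_releases(data, k=1))
```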

    Using Negative Binomial Regression Analysis to Predict Software Faults: A Study of Apache Ant

    Negative binomial regression has been proposed as an approach to predicting fault-prone software modules. However, little work has been reported on the strengths, weaknesses, and applicability of this method. In this paper, we present an in-depth study of the effectiveness of using negative binomial regression to predict fault-prone software modules under two conditions: self-assessment and forward assessment. The performance of the negative binomial regression model is also compared with that of another popular fault prediction model, binary logistic regression. The study is performed on six versions of an open-source object-oriented project, Apache Ant. The study shows that (1) the performance of forward assessment is better than, or at least the same as, that of self-assessment; (2) in predicting fault-prone modules, the negative binomial regression model could not outperform the binary logistic regression model; and (3) negative binomial regression is effective in predicting multiple errors in one module.
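    As a rough illustration of the two models compared above, the sketch below fits a negative binomial regression (fault counts per module) and a binary logistic regression (fault-prone vs. not) with statsmodels; the metric names (loc, wmc) and the data are synthetic and stand in for the study's Apache Ant metrics:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic module-level metrics, for illustration only.
rng = np.random.default_rng(0)
modules = pd.DataFrame({
    "loc": rng.integers(50, 2000, size=200),   # lines of code
    "wmc": rng.integers(1, 40, size=200),      # weighted methods per class
})
modules["faults"] = rng.poisson(0.001 * modules["loc"])

X = sm.add_constant(modules[["loc", "wmc"]].astype(float))

# Negative binomial regression models the fault *count* per module and
# tolerates overdispersion, which is why it can predict multiple
# errors in a single module.
nb = sm.GLM(modules["faults"], X,
            family=sm.families.NegativeBinomial()).fit()

# Binary logistic regression models only whether a module has any fault.
logit = sm.Logit((modules["faults"] > 0).astype(int), X).fit(disp=0)

print(nb.params)
print(logit.params)
```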