88 research outputs found

    SZZ in the time of Pull Requests

    In the multi-commit development model, programmers complete tasks (e.g., implementing a feature) by organizing their work in several commits and packaging them into a commit-set. Analyzing data from developers who use this model can help address challenging developer needs, such as knowing which features introduce a bug and assessing the risk of integrating certain features into a release. However, to do so, one first needs to identify fix-inducing commit-sets. For such an identification, the SZZ algorithm is the most natural candidate, but its performance has not yet been evaluated in the multi-commit context. In this study, we conduct an in-depth investigation of the reliability and performance of SZZ in the multi-commit model. To obtain a reliable ground truth, we take an already existing SZZ dataset and adapt it to the multi-commit context. Moreover, we devise a second, more extensive dataset created directly by developers and Quality Assurance (QA) engineers at Mozilla. Based on these datasets, we (1) test the performance of B-SZZ and its non-language-specific SZZ variations in the context of the multi-commit model, (2) investigate the reasons behind their specific behavior, and (3) analyze the impact of non-relevant commits in a commit-set and automatically detect them before using SZZ.
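
    As context, a minimal sketch of the B-SZZ heuristic the study evaluates, lifted to commit-sets: blame the lines a fix deletes onto the fix's parent to collect candidate bug-inducing commits, then map those commits to the commit-sets (e.g., pull requests) they belong to. The repository path, fix hash, and commit-to-set mapping are hypothetical placeholders; only the plain git CLI is assumed.

        import subprocess

        def git(repo, *args):
            # Run a git command inside `repo` and return its stdout as text.
            return subprocess.run(["git", "-C", repo, *args],
                                  capture_output=True, text=True, check=True).stdout

        def deleted_lines(repo, fix_commit):
            # Yield (path, old_line_number) for every line the fix removes.
            diff = git(repo, "diff", "--unified=0", f"{fix_commit}^", fix_commit)
            path, old_line = None, 0
            for line in diff.splitlines():
                if line.startswith("--- a/"):
                    path = line[len("--- a/"):]
                elif line.startswith("@@"):
                    # Hunk header "@@ -12,3 +15,0 @@": take the old-file start line.
                    old_line = int(line.split()[1][1:].split(",")[0])
                elif line.startswith("-") and not line.startswith("---"):
                    yield path, old_line
                    old_line += 1

        def bszz_candidates(repo, fix_commit):
            # Blame each deleted line in the fix's parent: the commit that last
            # touched it is a candidate bug-inducing commit (the B-SZZ heuristic).
            candidates = set()
            for path, line_no in deleted_lines(repo, fix_commit):
                blame = git(repo, "blame", "-l", "-L", f"{line_no},{line_no}",
                            f"{fix_commit}^", "--", path)
                candidates.add(blame.split()[0].lstrip("^"))
            return candidates

        def inducing_commit_sets(candidates, commit_to_set):
            # Lift commit-level candidates to their commit-sets (e.g., PR ids);
            # `commit_to_set` is a hypothetical {commit_hash: set_id} mapping.
            return {commit_to_set[c] for c in candidates if c in commit_to_set}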

    An Empirical Study of Regression Bug Chains in Linux

    Snoring: A Noise in Defect Prediction Datasets

    Defect prediction aims at identifying software artifacts that are likely to exhibit a defect. The main purpose of defect prediction is to reduce the cost of testing and code review by letting developers focus on specific artifacts. Several researchers have worked on improving the accuracy of defect estimation models using techniques such as tuning, re-balancing, or feature selection. Ultimately, the reliability of a prediction model depends on the quality of the dataset. Therefore, effort has been spent on identifying sources of noise in the datasets and on how to deal with them, including defect misclassification and defect origin. A key component of defect prediction approaches is the attribution of a defect to a project's release. Although developers might be able to attribute a defect to a specific release, in most cases a defect is attributed to the release after which the defect has been discovered. However, in many circumstances, a defect is only discovered several releases after its introduction. This might introduce a bias in the dataset, i.e., treating the intermediate releases as defect-free and only the release of discovery as defect-prone. We call this phenomenon a “sleeping defect”, and we call “snoring” the phenomenon in which classes are affected only by sleeping defects and are therefore treated as defect-free until the defect is discovered. In this work, we analyze data from more than 4,000 bugs and 600 releases of 20 open-source projects from the Apache ecosystem to investigate: (1) the magnitude of sleeping defects, (2) the magnitude of snoring classes, (3) whether snoring impacts the evaluation of classifiers, (4) whether snoring impacts classifier accuracy, and (5) whether removing the last releases of data is beneficial in reducing the negative impact of the snoring noise on classifier accuracy. Our results show that, on average across projects: (1) most of the defects in a project sleep for more than 19% of the existing releases; (2) the missing rate is more than 50% unless we remove more than 20% of the releases; (3) the relative error in measuring classifier accuracy on a dataset with snoring is about 100% in all accuracy metrics other than AUC; (4) the presence of snoring decreases the accuracy of each of the 15 classifiers in each of the 6 accuracy metrics; for instance, Recall, F1, Kappa, and Matthews decrease by about 80%; and (5) removing one release of data is better than removing no data in all accuracy metrics; for instance, Recall, F1, Kappa, and Matthews increase by about 30%.
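
    To make the snoring noise concrete, a toy sketch on hypothetical data (not the paper's dataset): a defect introduced in one release but discovered several releases later leaves the intermediate releases mislabeled as defect-free, and dropping the most recent releases before training, the mitigation the paper studies, reduces the mislabeling rate.

        # Hypothetical classes, releases, and bugs for illustration only.
        RELEASES = [1, 2, 3, 4, 5, 6]
        CLASSES = ["Foo", "Bar", "Baz"]
        # (class_name, release_introduced, release_discovered)
        BUGS = [("Foo", 2, 5), ("Bar", 4, 6)]

        def ground_truth(cls, rel):
            # True if `cls` actually contains a (possibly undiscovered) defect in `rel`.
            return any(c == cls and intro <= rel < found for c, intro, found in BUGS)

        def observed(cls, rel, snapshot):
            # Label as seen when the dataset is built at release `snapshot`:
            # only defects already discovered by then are visible.
            return any(c == cls and intro <= rel < found and found <= snapshot
                       for c, intro, found in BUGS)

        def mislabel_rate(snapshot, drop_last=0):
            # Fraction of (class, release) rows whose label disagrees with the
            # ground truth; `drop_last` removes the most recent releases.
            rows = [(c, r) for c in CLASSES
                    for r in RELEASES if r <= snapshot - drop_last]
            wrong = sum(ground_truth(c, r) != observed(c, r, snapshot)
                        for c, r in rows)
            return wrong / len(rows)

        print(mislabel_rate(snapshot=4))               # ~0.33 with snoring noise
        print(mislabel_rate(snapshot=4, drop_last=1))  # ~0.22 after dropping a release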

    Studying the Use of SZZ with Non-functional bugs

    Non-functional bugs impose a heavy cost on both software developers and end-users. Tools to reduce the occurrence, impact, and repair time of non-functional bugs can therefore provide key assistance for software developers racing to fix these issues. Classification models that focus on identifying defect-prone commits, referred to as Just-In-Time (JIT) Quality Assurance, are known to be useful in allowing developers to review risky commits. JIT models, however, leverage the SZZ approach to identify whether a past change is bug-inducing, and due to the nature of non-functional bugs, their fixes may be scattered and separate from their bug-inducing locations in the source code. Yet, prior studies that leverage or evaluate the SZZ approach do not consider non-functional bugs, leading to potential bias in the results. In this thesis, we conduct an empirical study on the results of the SZZ approach on the non-functional bugs in the NFBugs dataset and the performance bugs in Cassandra and Hadoop. We manually examine whether each identified bug-inducing change is indeed the correct bug-inducing change. Our manual study shows that a large portion of non-functional bugs cannot be properly identified by the SZZ approach. We uncover root causes of false detection that have not been previously found. We evaluate the identified bug-inducing changes based on criteria from prior research. Our results may be used to assist in future research on non-functional bugs, and they highlight the need to complement SZZ to accommodate the unique characteristics of non-functional bugs. Furthermore, we conduct an empirical study that evaluates the performance of JIT models by using them to identify bug-inducing code commits for performance-related bugs. Our findings show that JIT defect prediction classifies non-performance bug-inducing commits better than performance bug-inducing commits. However, we find that manually correcting errors in the training data only slightly improves the models. In the absence of a large number of correctly labelled performance bug-inducing commits, our findings show that combining all available training data yields the best classification results.
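
    As background on the models being evaluated, a hedged sketch of a Just-In-Time defect classifier of the kind studied here: a classifier over commit-level change metrics, trained on labels of the sort SZZ produces. The features and labels below are synthetic placeholders, not the Cassandra or Hadoop data, so the printed AUC only illustrates the pipeline.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n = 500
        # Commit-level change metrics commonly used in JIT models:
        # churn, spread, and developer experience (all synthetic here).
        X = np.column_stack([
            rng.poisson(30, n),    # lines added
            rng.poisson(10, n),    # lines deleted
            rng.poisson(3, n),     # files changed
            rng.uniform(0, 1, n),  # author experience (normalized)
        ])
        # Labels as SZZ would produce them: 1 = flagged as bug-inducing.
        # For non-functional bugs, many such labels can be wrong.
        y = rng.binomial(1, 0.2, n)

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())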