Studying the explanations for the automated prediction of bug and non-bug issues using LIME and SHAP
Context: The identification of bugs within the issues reported to an issue tracker is crucial for issue triage. Machine learning models have shown promising results for automated issue type prediction. However, beyond our own assumptions, we have only limited knowledge of how such models identify bugs. LIME and SHAP are popular techniques for explaining the predictions of classifiers.
Objective: We want to understand whether machine learning models provide explanations for the classification that are reasonable to us as humans and align with our assumptions about what the models should learn. We also want to know whether the prediction quality is correlated with the quality of the explanations.
Method: We conduct a study in which we rate LIME and SHAP explanations based on how well they explain the outcome of an issue type prediction model. For this, we rate the quality of the explanations themselves, i.e., whether they align with our expectations and whether they help us to understand the underlying machine learning model.
Comment: This registered report received an In-Principle Acceptance (IPA) in the ESEM 2022 RR track.
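As an illustration of the kind of explanation the study rates, the following minimal sketch trains a toy bug/non-bug text classifier and explains one prediction with LIME (SHAP offers an analogous interface). It assumes the Python packages lime and scikit-learn; the model and the issue titles are hypothetical stand-ins, not the study's actual setup.

```python
# A LIME explanation for a toy bug/non-bug issue classifier. The packages
# (lime, scikit-learn) are real; the model and the four issue titles are
# hypothetical stand-ins, not the study's data.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "NullPointerException when saving a file",
    "Add dark mode to the settings page",
    "Crash on startup after upgrading",
    "Improve documentation of the REST API",
]
labels = [1, 0, 1, 0]  # 1 = bug, 0 = non-bug

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(titles, labels)

explainer = LimeTextExplainer(class_names=["non-bug", "bug"])
explanation = explainer.explain_instance(
    "App crashes with NullPointerException on login",
    model.predict_proba, num_features=5)

# Each (token, weight) pair shows how strongly a word pushed the
# prediction towards "bug" (positive) or "non-bug" (negative).
for token, weight in explanation.as_list():
    print(f"{token:20s} {weight:+.3f}")
```

Rating such output amounts to asking whether the highly weighted tokens (e.g., "crashes", "NullPointerException") match what a human would consider evidence of a bug.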
Issues with SZZ: An empirical assessment of the state of practice of defect prediction data collection
Defect prediction research has a strong reliance on published data sets that
are shared between researchers. The SZZ algorithm is the de facto standard for
collecting defect labels for this kind of data and is used by most public data
sets. Thus, problems with the SZZ algorithm may have a strong indirect impact
on almost the complete state of the art of defect prediction. Recent research
uncovered potential problems in different parts of the SZZ algorithm. Within
this article, we provide an extensive empirical analysis of the defect labels
created with the SZZ algorithm. We used a combination of manual validation and adopted or improved heuristics for the collection of defect data: we established ground truth data for bug fixing commits, improved the heuristic for the identification of defect-inducing changes, and improved the assignment of bugs to releases. We conducted an empirical study on 398 releases of 38 Apache
projects and found that only half of the bug fixing commits determined by SZZ
actually fix bugs. Moreover, if a six-month time frame is used in combination with SZZ to determine which bugs affect a release, one file is incorrectly labeled as defective for every file that is correctly labeled as defective. In addition, two defective files are missed. We also explored the impact of the relatively small set of features that are available in most defect prediction data sets, as multiple publications indicate that, e.g., churn-related features are important for defect prediction. We found that the difference from using more features is negligible.
Comment: Submitted and under review. First three authors contributed equally.
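The core step of SZZ that the article scrutinizes can be sketched in a few lines: for every line a bug-fixing commit deletes, git blame on the fix's parent revision names the last commit to touch that line as a candidate bug-inducing change. Read against the numbers above, one false positive per true positive corresponds to a precision of 1/2; if the two missed files are read as per correctly labeled file, recall is roughly 1/3. The sketch below is an illustration, not the paper's tooling; the repository path, commit hash, and file are hypothetical placeholders.

```python
# A minimal sketch of the blame-based SZZ step (illustrative, not the
# paper's implementation). Repository path, commit hash, and file names
# below are hypothetical placeholders.
import subprocess

def blame_line(repo, rev, path, line_no):
    """Return the commit that last modified path:line_no as of rev."""
    out = subprocess.run(
        ["git", "-C", repo, "blame", "-l",
         "-L", f"{line_no},{line_no}", rev, "--", path],
        capture_output=True, text=True, check=True).stdout
    return out.split()[0].lstrip("^")  # "^" marks boundary commits

def candidate_inducing_commits(repo, fix_commit, removed_lines):
    """removed_lines maps file paths to line numbers deleted by the fix,
    numbered against the parent side of the diff."""
    parent = f"{fix_commit}^"  # blame the state before the fix
    return {blame_line(repo, parent, path, n)
            for path, lines in removed_lines.items() for n in lines}

# Hypothetical usage: the fix deleted lines 10 and 42 of src/io/Writer.java.
print(candidate_inducing_commits(
    "/path/to/repo", "abc123", {"src/io/Writer.java": [10, 42]}))
```

The heuristics the article validates and improves sit on top of this step, e.g., deciding which commits count as bug fixing in the first place and which blamed commits are plausible inducers.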
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits
Context: Tangled commits are changes to software that address multiple
concerns at once. For researchers interested in bugs, tangled commits mean that
they actually study not only bugs, but also other concerns irrelevant for the
study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling
and the types of changes that are tangled within bug fixing commits.
Methods: We use a crowdsourcing approach for manual labeling to validate, for each line in bug fixing commits, which changes contribute to the bug fix. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.
Results: We estimate that between 17% and 32% of all changes in bug fixing
commits modify the source code to fix the underlying problem. However, when we
only consider changes to the production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreement between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of the data is noisy without manual untangling, depending on the use case.
Conclusion: Tangled commits have a high prevalence in bug fixes and can lead
to a large amount of noise in the data. Prior research indicates that this
noise may alter results. As researchers, we should be skeptical and assume that unvalidated data is likely very noisy, until proven otherwise.
Comment: Accepted at Empirical Software Engineering.
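The consensus rule from the Methods section is easy to make concrete. The following minimal sketch (an assumed illustration, not the study's actual labeling pipeline) accepts a label for a line only when at least three of the four participants agree.

```python
# Consensus over four crowd-sourced labels per line: at least three
# matching votes are required, otherwise the line stays unresolved.
from collections import Counter

def consensus(labels, required=3):
    """Return the consensus label for one line, or None on disagreement."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= required else None

# Hypothetical labels from four participants for three changed lines.
lines = [
    ["bugfix", "bugfix", "bugfix", "refactoring"],      # consensus: bugfix
    ["test", "test", "test", "test"],                   # consensus: test
    ["bugfix", "refactoring", "bugfix", "whitespace"],  # no consensus
]
print([consensus(l) for l in lines])  # ['bugfix', 'test', None]
```

Lines that end up as None correspond to the roughly 11% of lines with active disagreement reported in the Results.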