On the feasibility of automated prediction of bug and non-bug issues
Context: Issue tracking systems are used to track and describe tasks in the
development process, e.g., requested feature improvements or reported bugs.
However, past research has shown that the reported issue types often do not
match the description of the issue.
Objective: We want to understand the overall maturity of the state of the art
of issue type prediction, with the goal of predicting whether issues are bugs, and to
evaluate whether we can improve existing models by incorporating manually specified
knowledge about issues.
Method: We train different models for the title and description of the issue
to account for the difference in structure between these fields, e.g., the
length. Moreover, we manually detect issues whose description contains a null
pointer exception, as these are strong indicators that issues are bugs.
Results: Our approach performs best overall, but not significantly different
from an approach from the literature based on the fastText classifier from
Facebook AI Research. The small improvements in prediction performance are due
to structural information about the issues we used. We found that using
information about the content of issues in form of null pointer exceptions is
not useful. We demonstrate the usefulness of issue type prediction through the
example of labelling bugfixing commits.
Conclusions: Issue type prediction can be a useful tool if the use case allows
for a certain amount of missed bug reports or if predicting too many issues as
bugs is acceptable.
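The method described above (separate models for the title and the description, plus a manually specified rule for null pointer exceptions) can be illustrated with a small sketch. This is not the authors' implementation; the classifier choice, the score averaging, and the toy training data are assumptions made for illustration only.

```python
# Minimal sketch (not the authors' implementation): a manually specified
# NullPointerException rule combined with separate text classifiers for the
# title and the description of an issue.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

NPE_PATTERN = re.compile(r"NullPointerException|\bNPE\b")

def has_npe(description: str) -> bool:
    """Null pointer exceptions in the description are strong indicators of a bug."""
    return bool(NPE_PATTERN.search(description))

# Separate models for title and description to account for their structural differences.
title_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
desc_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Toy labelled issues (1 = bug, 0 = non-bug) just to make the sketch executable.
titles = ["NPE when saving a file", "Add dark mode to the settings dialog"]
descs = ["The log shows a NullPointerException in FileSaver.save",
         "It would be nice to have a dark theme"]
labels = [1, 0]
title_model.fit(titles, labels)
desc_model.fit(descs, labels)

def predict_is_bug(title: str, description: str) -> bool:
    if has_npe(description):                 # manually specified knowledge fires first
        return True
    p_title = title_model.predict_proba([title])[0][1]
    p_desc = desc_model.predict_proba([description])[0][1]
    return (p_title + p_desc) / 2 >= 0.5     # simple average of both models

print(predict_is_bug("Crash on startup", "NullPointerException in Main.init"))  # True
```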
Issues with SZZ: An empirical assessment of the state of practice of defect prediction data collection
Defect prediction research has a strong reliance on published data sets that
are shared between researchers. The SZZ algorithm is the de facto standard for
collecting defect labels for this kind of data and is used by most public data
sets. Thus, problems with the SZZ algorithm may have a strong indirect impact
on almost the complete state of the art of defect prediction. Recent research
uncovered potential problems in different parts of the SZZ algorithm. Within
this article, we provide an extensive empirical analysis of the defect labels
created with the SZZ algorithm. We used a combination of manual validation and
adopted or improved heuristics for the collection of defect data to establish
ground truth data for bug fixing commits, improved the heuristic for the
identification of inducing changes for defects, as well as the assignment of
bugs to releases. We conducted an empirical study on 398 releases of 38 Apache
projects and found that only half of the bug fixing commits determined by SZZ
are actually bug fixing. Moreover, if a six month time frame is used in
combination with SZZ to determine which bugs affect a release, one file is
incorrectly labeled as defective for every file that is correctly labeled as
defective. In addition, two defective files are missed. We also explored the
impact of the relatively small set of features that are available in most
defect prediction data sets, as there are multiple publications that indicate
that, e.g., churn related features are important for defect prediction. We
found that the difference from using more features is negligible.
Comment: Submitted and under review. First three authors are equally contributing.
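For readers unfamiliar with SZZ, the sketch below shows only its basic step: lines deleted by a bug-fixing commit are blamed in the parent commit, and the blamed commits become candidate bug-inducing changes. It is a simplification and does not include the validated and improved heuristics discussed in the article; the repository path and commit hash in the usage comment are placeholders.

```python
# Sketch of the core SZZ step only: blame each line deleted by a bug-fixing
# commit in the parent commit to obtain candidate bug-inducing changes.
# This omits the improved heuristics from the article.
import subprocess

def candidate_inducing_commits(repo: str, fix_commit: str) -> set[str]:
    diff = subprocess.run(
        ["git", "-C", repo, "diff", "-U0", f"{fix_commit}^", fix_commit],
        capture_output=True, text=True, check=True).stdout
    inducing, current_file, old_line = set(), None, 0
    for line in diff.splitlines():
        if line.startswith("--- a/"):
            current_file = line[6:]            # file as it existed before the fix
        elif line.startswith("--- "):
            current_file = None                # e.g. a newly added file, nothing to blame
        elif line.startswith("@@"):
            # Hunk header "@@ -start,count +start,count @@" gives the old-file start line.
            old_line = int(line.split()[1].lstrip("-").split(",")[0])
        elif line.startswith("-") and not line.startswith("---") and current_file:
            blame = subprocess.run(
                ["git", "-C", repo, "blame", "--porcelain",
                 "-L", f"{old_line},{old_line}", f"{fix_commit}^", "--", current_file],
                capture_output=True, text=True, check=True).stdout
            inducing.add(blame.split()[0])     # first token is the blamed commit hash
            old_line += 1
    return inducing

# Usage (placeholder path and hash): candidate_inducing_commits("/path/to/repo", "abc123")
```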
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits
Context: Tangled commits are changes to software that address multiple
concerns at once. For researchers interested in bugs, tangled commits mean that
they actually study not only bugs, but also other concerns irrelevant for the
study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling
and the types of changes that are tangled within bug fixing commits.
Methods: We use a crowdsourcing approach for manual labeling to validate
which changes contribute to bug fixes for each line in bug fixing commits. Each
line is labeled by four participants. If at least three participants agree on
the same label, we have consensus.
Results: We estimate that between 17% and 32% of all changes in bug fixing
commits modify the source code to fix the underlying problem. However, when we
only consider changes to the production code files this ratio increases to 66%
to 87%. We find that about 11% of lines are hard to label, leading to active
disagreements between participants. Due to confirmed tangling and the
uncertainty in our data, we estimate that 3% to 47% of data is noisy without
manual untangling, depending on the use case.
Conclusion: Tangled commits have a high prevalence in bug fixes and can lead
to a large amount of noise in the data. Prior research indicates that this
noise may alter results. As researchers, we should be skeptical and assume that
unvalidated data is likely very noisy until proven otherwise.
Comment: Status: Accepted at Empirical Software Engineering
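As a small illustration of the consensus rule described in the Methods above, the sketch below applies the three-out-of-four agreement criterion to the votes for a single changed line; the label names and data layout are assumptions, not the study's actual schema.

```python
# Illustration of the consensus rule: each line receives four labels and at
# least three identical votes are required for consensus; otherwise the line
# counts as an active disagreement. Label names here are assumptions.
from collections import Counter

def consensus(votes: list[str], quorum: int = 3) -> str | None:
    """Return the agreed label, or None if the participants actively disagree."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= quorum else None

print(consensus(["bugfix", "bugfix", "test", "bugfix"]))        # "bugfix"
print(consensus(["bugfix", "refactoring", "test", "bugfix"]))   # None (disagreement)
```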
Automated Static Analysis Tools: A Multidimensional view on Software Quality Evolution
Software use is ubiquitous. The quality and the evolution of quality over long
periods of time is therefore of notable importance. Software engineering research
investigates software quality in multiple areas. One of these areas is predictive
models, in which measurements of past changes to the source code or file contents are
used to assess the quality of changes, files or even product releases. However, these
predictive models have yet to transition from research to practice on a larger scale. In
contrast, Automated Static Analysis Tools (ASATs) are used in practice and are also
part of several software quality models. ASATs are able to warn developers about
parts of the source code that violate best practices or match common defect patterns.
One downside of ASATs is false positives, i.e., warnings about parts of the code
which are not problematic. Developers have to manually assess the warnings and
annotate the code or the ASAT configuration to mitigate this. Within this thesis, we
investigate the evolution of software quality with a focus on a general purpose ASAT
for Java. Our main objective is to determine if the use of an ASAT can improve
software quality, as measured by defects, significantly enough to mitigate additional
effort by the developers to use the ASAT. We combine multiple software engineering
research techniques and data validation studies to improve the signal-to-noise ratio
to increase the validity and stability of our results. We focus on a general purpose
ASAT for the Java programming language due to the maturity of the language and
the large number of projects available for this language. Both the language and
the general purpose ASAT have been available for a long time, which allows us to
include longer periods of time for our analyses. We study how the ASAT is applied,
how the generated warnings evolve over long time periods, and how it affects the
quality of the source code in terms of defects. In addition, we include the perspective
of the developers regarding software quality improvement by measuring changes
when developers intend to improve the quality of the source code. Our studies
yield surprising insights. While our results show that ASATs have a positive impact
on software quality, the magnitude of the impact is much smaller than expected.
Moreover, we can show that corrective changes are the main driver of complexity in
software projects. They introduce more complexity than feature additions or any
other type of maintenance. In addition, we find that software quality estimation
models benefit more from size and complexity metrics than static analysis warnings
of an ASAT. Our study of developer intents to increase software quality mirrors this
result.
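The finding that quality estimation models benefit more from size and complexity metrics than from static analysis warnings can be illustrated with a small, hedged comparison: the feature names, model choice, and synthetic data below are assumptions and do not reproduce the thesis' actual experiments.

```python
# Illustrative comparison only (synthetic data, assumed feature names): train a
# defect prediction model with size/complexity metrics alone and once more with
# ASAT warning counts added, then compare cross-validated F1 scores.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "loc": rng.integers(10, 2000, n),        # file size
    "complexity": rng.integers(1, 60, n),    # McCabe-style complexity
    "warnings": rng.integers(0, 30, n),      # ASAT warning count per file
})
data["defective"] = (data["complexity"] + rng.normal(0, 10, n) > 30).astype(int)

for features in (["loc", "complexity"], ["loc", "complexity", "warnings"]):
    clf = RandomForestClassifier(random_state=0)
    f1 = cross_val_score(clf, data[features], data["defective"], cv=10, scoring="f1").mean()
    print(features, round(f1, 3))
```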
Predicting Issue Types with seBERT
Pre-trained transformer models are the current state of the art for natural
language processing. seBERT is such a model that was developed based on
the BERT architecture, but trained from scratch with software engineering data.
We fine-tuned this model for the NLBSE challenge for the task of issue type
prediction. Our model dominates the baseline fastText for all three issue types
in both recall and precision to achieve an overall F1-score of 85.7%, which is
an increase of 4.1% over the baseline.
Comment: Accepted for Publication at the NLBSE'22 Tool Competition
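A hedged sketch of how a BERT-style model can be fine-tuned for three-class issue type prediction with the Hugging Face transformers library is shown below; the checkpoint name is a placeholder rather than the actual seBERT model, and the NLBSE dataset wiring is only indicated in comments.

```python
# Hedged sketch: fine-tuning a BERT-style model for issue type prediction
# (e.g. bug / enhancement / question). MODEL_NAME is a placeholder, not the
# actual seBERT checkpoint, and the data loading is only indicated.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"  # placeholder; the paper uses seBERT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

def tokenize(batch):
    # Issue title and body would be concatenated into a single "text" field.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

args = TrainingArguments(output_dir="issue-type-model",
                         num_train_epochs=2,
                         per_device_train_batch_size=16)

# train_ds / eval_ds would be the tokenized NLBSE issue data sets.
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```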