17 research outputs found

    On the feasibility of automated prediction of bug and non-bug issues

    Context: Issue tracking systems are used to track and describe tasks in the development process, e.g., requested feature improvements or reported bugs. However, past research has shown that the reported issue types often do not match the description of the issue. Objective: We want to understand the overall maturity of the state of the art of issue type prediction with the goal of predicting whether issues are bugs, and to evaluate whether we can improve existing models by incorporating manually specified knowledge about issues. Method: We train different models for the title and description of the issue to account for the difference in structure between these fields, e.g., the length. Moreover, we manually detect issues whose description contains a null pointer exception, as these are strong indicators that issues are bugs. Results: Our approach performs best overall, but not significantly differently from an approach from the literature based on the fastText classifier from Facebook AI Research. The small improvements in prediction performance are due to the structural information about the issues that we used. We found that using information about the content of issues in the form of null pointer exceptions is not useful. We demonstrate the usefulness of issue type prediction through the example of labelling bug fixing commits. Conclusions: Issue type prediction can be a useful tool if the use case allows either for a certain amount of missed bug reports or for the prediction of too many issues as bugs.
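    The following is a minimal, illustrative sketch of the idea described in this abstract, not the authors' actual models: a generic text classifier (here a TF-IDF plus logistic regression stand-in) for the issue title, combined with a simple null-pointer-exception heuristic on the description. All training examples and names are invented for illustration.

```python
# Hedged sketch: issue type prediction (bug vs. non-bug) with a simple
# classifier plus a null-pointer-exception heuristic. Illustrative only.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; real data would come from an issue tracker.
titles = [
    "App crashes with NullPointerException on startup",
    "Add dark mode to settings page",
    "Wrong total shown in invoice summary",
    "Improve documentation for REST API",
]
labels = [1, 0, 1, 0]  # 1 = bug, 0 = non-bug

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(titles, labels)

NPE_PATTERN = re.compile(r"NullPointerException", re.IGNORECASE)

def predict_is_bug(title: str, description: str) -> bool:
    """Classify the title; treat an NPE in the description as a strong bug
    indicator (which, per the abstract, added little in practice)."""
    if NPE_PATTERN.search(description):
        return True
    return bool(clf.predict([title])[0])

print(predict_is_bug("Crash when saving file",
                     "java.lang.NullPointerException at Save.run(Save.java:42)"))
```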

    Issues with SZZ: An empirical assessment of the state of practice of defect prediction data collection

    Defect prediction research has a strong reliance on published data sets that are shared between researchers. The SZZ algorithm is the de facto standard for collecting defect labels for this kind of data and is used by most public data sets. Thus, problems with the SZZ algorithm may have a strong indirect impact on almost the complete state of the art of defect prediction. Recent research uncovered potential problems in different parts of the SZZ algorithm. Within this article, we provide an extensive empirical analysis of the defect labels created with the SZZ algorithm. We used a combination of manual validation and adopted or improved heuristics for the collection of defect data to establish ground truth data for bug fixing commits, and we improved the heuristics for the identification of defect-inducing changes as well as for the assignment of bugs to releases. We conducted an empirical study on 398 releases of 38 Apache projects and found that only half of the bug fixing commits determined by SZZ are actually bug fixing. Moreover, if a six month time frame is used in combination with SZZ to determine which bugs affect a release, one file is incorrectly labeled as defective for every file that is correctly labeled as defective. In addition, two defective files are missed. We also explored the impact of the relatively small set of features that are available in most defect prediction data sets, as multiple publications indicate that, e.g., churn related features are important for defect prediction. We found that the difference from using more features is negligible. Comment: Submitted and under review. First three authors contributed equally.
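    To make the SZZ idea concrete, here is a heavily simplified sketch of its core heuristic: blame the lines removed by a bug-fixing commit to find the commits that last modified them, which become candidate bug-inducing commits. This is not the authors' improved variant, and the repository path, commit, and line numbers in the usage note are hypothetical.

```python
# Simplified SZZ-style heuristic: for each line deleted by a bug-fixing
# commit, find the commit that last modified that line in the parent
# revision (fix_commit^). Illustrative only.
import subprocess

def szz_inducing_commits(repo: str, fix_commit: str, path: str,
                         deleted_lines: list[int]) -> set[str]:
    """Return commits that last modified the given lines of `path`
    as they were just before `fix_commit`."""
    inducing = set()
    for line in deleted_lines:
        out = subprocess.run(
            ["git", "-C", repo, "blame", "--porcelain",
             "-L", f"{line},{line}", f"{fix_commit}^", "--", path],
            capture_output=True, text=True, check=True).stdout
        # In porcelain output, the first token of the first line is the hash.
        inducing.add(out.split()[0])
    return inducing

# Hypothetical usage:
# print(szz_inducing_commits("/tmp/project", "abc1234", "src/Main.java", [10, 42]))
```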

    A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

    Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant to the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy until proven otherwise. Comment: Accepted at Empirical Software Engineering.
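    The consensus rule stated in the abstract (four labels per line, accept a label only if at least three agree) can be sketched as follows; the label names are invented for illustration.

```python
# Sketch of the majority/consensus rule described above: each changed line
# receives four labels, and a label is accepted only if at least three agree.
from collections import Counter

def consensus(labels: list[str], quorum: int = 3) -> str | None:
    """Return the agreed label, or None if no label reaches the quorum."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= quorum else None

print(consensus(["bugfix", "bugfix", "bugfix", "test"]))       # 'bugfix'
print(consensus(["bugfix", "refactoring", "test", "bugfix"]))  # None (disagreement)
```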

    Automated Static Analysis Tools: A Multidimensional view on Software Quality Evolution

    Software use is ubiquitous. The quality of software and the evolution of that quality over long periods of time are therefore of notable importance. Software engineering research investigates software quality in multiple areas. One of these areas is predictive models, in which measurements of past changes to the source code or file contents are used to assess the quality of changes, files or even product releases. However, these predictive models have yet to transition from research to practice on a larger scale. In contrast, Automated Static Analysis Tools (ASATs) are used in practice and are also part of several software quality models. ASATs are able to warn developers about parts of the source code that violate best practices or match common defect patterns. One downside of ASATs is false positives, i.e., warnings about parts of the code which are not problematic. Developers have to manually assess the warnings and annotate the code or the ASAT configuration to mitigate this. Within this thesis, we investigate the evolution of software quality with a focus on a general purpose ASAT for Java. Our main objective is to determine if the use of an ASAT can improve software quality, as measured by defects, significantly enough to justify the additional effort required of developers to use the ASAT. We combine multiple software engineering research techniques and data validation studies to improve the signal-to-noise ratio and thereby increase the validity and stability of our results. We focus on a general purpose ASAT for the Java programming language due to the maturity of the language and the large number of projects available for this language. Both the language and the general purpose ASAT have been available for a long time, which allows us to include longer periods of time in our analyses. We study how the ASAT is applied, how the generated warnings evolve over long time periods, and how it affects the quality of the source code in terms of defects. In addition, we include the perspective of the developers regarding software quality improvement by measuring changes when developers intend to improve the quality of the source code. Our studies yield surprising insights. While our results show that ASATs have a positive impact on software quality, the magnitude of the impact is much smaller than expected. Moreover, we can show that corrective changes are the main driver of complexity in software projects. They introduce more complexity than feature additions or any other type of maintenance. In addition, we find that software quality estimation models benefit more from size and complexity metrics than from static analysis warnings of an ASAT. Our study of developer intents to increase software quality mirrors this result.
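    As a purely illustrative companion to the comparison mentioned at the end of this abstract (size and complexity metrics versus ASAT warning counts as features of a quality estimation model), the sketch below sets up that kind of feature comparison on synthetic data. It is not the thesis setup and the numbers it prints carry no empirical meaning.

```python
# Illustrative-only sketch: compare a defect prediction model built from
# size/complexity metrics with one that also uses ASAT warning counts.
# All data here is synthetic; no real study results are reproduced.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
loc = rng.integers(10, 2000, n)          # lines of code (synthetic)
complexity = rng.integers(1, 50, n)      # e.g. cyclomatic complexity (synthetic)
warnings = rng.integers(0, 30, n)        # ASAT warning count (synthetic)
defective = (loc + 20 * complexity + rng.normal(0, 500, n) > 1500).astype(int)

feature_sets = {
    "size + complexity": np.column_stack([loc, complexity]),
    "size + complexity + warnings": np.column_stack([loc, complexity, warnings]),
}
for name, X in feature_sets.items():
    score = cross_val_score(LogisticRegression(max_iter=1000), X, defective,
                            cv=5, scoring="f1").mean()
    print(f"{name}: mean F1 = {score:.2f}")
```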

    Predicting Issue Types with seBERT

    Pre-trained transformer models are the current state of the art for natural language processing. seBERT is such a model: it was developed based on the BERT architecture, but trained from scratch with software engineering data. We fine-tuned this model for the NLBSE challenge for the task of issue type prediction. Our model dominates the fastText baseline for all three issue types in both recall and precision, achieving an overall F1-score of 85.7%, an increase of 4.1% over the baseline. Comment: Accepted for publication at the NLBSE'22 Tool Competition.
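    For readers unfamiliar with this kind of fine-tuning, the sketch below shows a generic Hugging Face transformers setup for three-way issue type classification. It is not the authors' pipeline: the checkpoint name is a stand-in (substitute the released seBERT weights), and the dataset variables are assumed placeholders.

```python
# Hedged sketch: fine-tuning a BERT-style model for issue type prediction
# (bug / enhancement / question) with Hugging Face transformers.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "bert-base-uncased"  # stand-in; substitute the seBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID,
                                                           num_labels=3)

def tokenize(batch):
    # Issue title and body are assumed to be concatenated into "text".
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

# `train_ds` / `eval_ds` are assumed datasets with "text" and "label" columns,
# e.g. loaded from an issue corpus such as the one used in the challenge.
# train_ds = train_ds.map(tokenize, batched=True)
# eval_ds = eval_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="issue-type-model", num_train_epochs=2,
                         per_device_train_batch_size=16)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```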