An Automatically Created Novel Bug Dataset and its Validation in Bug Prediction
Bugs are inescapable during software development due to frequent code
changes, tight deadlines, etc.; therefore, it is important to have tools to
find these errors. One way of performing bug identification is to analyze the
characteristics of buggy source code elements from the past and predict the
present ones based on the same characteristics, using e.g. machine learning
models. To support model building tasks, code elements and their
characteristics are collected in so-called bug datasets which serve as the
input for learning.
We present the BugHunter Dataset: a novel kind of automatically
constructed and freely available bug dataset containing code elements (files,
classes, methods) with a wide set of code metrics and bug information. Other
available bug datasets follow the traditional approach of gathering the
characteristics of all source code elements (buggy and non-buggy) only at one
or more pre-selected release versions of the code. Our approach, on the other
hand, captures the buggy and the fixed states of the same source code elements
from the narrowest timeframe we can identify for a bug's presence, regardless
of release versions. To show the usefulness of the new dataset, we built and
evaluated bug prediction models and achieved F-measure values over 0.74.
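As an illustration of the metric reported above, the F-measure is the harmonic mean of precision and recall (weighted by beta). The following is a minimal sketch only: the function name `f_measure` and the confusion counts are hypothetical and are not taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from confusion counts; beta=1 gives the usual F1 score."""
    if tp == 0:
        return 0.0  # no true positives: treat the score as 0
    precision = tp / (tp + fp)  # share of predicted-buggy elements that are buggy
    recall = tp / (tp + fn)     # share of buggy elements that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration, not results from the dataset.
score = f_measure(tp=80, fp=20, fn=40)
print(round(score, 3))
```

With beta=1 the expression reduces to 2*tp / (2*tp + fp + fn), which makes a handy sanity check.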
Towards Automated Performance Bug Identification in Python
Context: Software performance is a critical non-functional requirement,
appearing in many fields such as mission-critical applications, financial
systems, and real-time systems. In this work we focused on early detection of
performance bugs; our software under study was a real-time system used in the
advertisement/marketing domain.
Goal: Find a simple, easy-to-implement solution for predicting performance
bugs.
Method: We built several models using four machine learning methods commonly
used for defect prediction: C4.5 Decision Trees, Naïve Bayes, Bayesian
Networks, and Logistic Regression.
Results: Our empirical results show that a C4.5 model, using lines of code
changed, file's age and size as explanatory variables, can be used to predict
performance bugs (recall=0.73, accuracy=0.85, and precision=0.96). We show that
reducing the number of changes delivered in a commit can decrease the chance
of performance bug injection.
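For readers unfamiliar with C4.5, its split criterion is the gain ratio: information gain divided by the split information. The sketch below evaluates one threshold split on a numeric attribute such as lines of code changed. It illustrates the criterion only, not the authors' model; the helper names and the commit data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` on a numeric attribute."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # degenerate split: everything falls on one side
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    # Information gain: entropy before the split minus weighted entropy after it.
    info_gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    # Split information penalizes splits that produce very uneven partitions.
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Hypothetical commits: lines of code changed, and whether a performance bug followed (1 = yes).
lines_changed = [5, 12, 40, 180, 260, 400]
perf_bug = [0, 0, 0, 1, 1, 1]
print(gain_ratio(lines_changed, perf_bug, threshold=100))
```

A full C4.5 implementation would also search over candidate thresholds, recurse on each partition, and prune the resulting tree; none of that is shown here.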
Conclusions: We believe that our approach can help practitioners to eliminate
performance bugs early in the development cycle. Our results are also of
interest to theoreticians, establishing a link between functional bugs and
(non-functional) performance bugs, and explicitly showing that attributes used
for prediction of functional bugs can be used for prediction of performance
bugs.
Analysis and Detection of Information Types of Open Source Software Issue Discussions
Most modern Issue Tracking Systems (ITSs) for open source software (OSS)
projects allow users to add comments to issues. Over time, these comments
accumulate into discussion threads embedded with rich information about the
software project, which can potentially satisfy the diverse needs of OSS
stakeholders. However, discovering and retrieving relevant information from the
discussion threads is a challenging task, especially when the discussions are
lengthy and the number of issues in ITSs is vast. In this paper, we address
this challenge by identifying the information types presented in OSS issue
discussions. Through qualitative content analysis of 15 complex issue threads
across three projects hosted on GitHub, we uncovered 16 information types and
created a labeled corpus containing 4656 sentences. Our investigation of
supervised, automated classification techniques indicated that, when prior
knowledge about the issue is available, Random Forest can effectively detect
most sentence types using conversational features such as the sentence length
and its position. When classifying sentences from new issues, Logistic
Regression can yield satisfactory performance using textual features for
certain information types, while falling short on others. Our work represents a
nontrivial first step towards tools and techniques for identifying and
obtaining the rich information recorded in the ITSs to support various software
engineering activities and to satisfy the diverse needs of OSS stakeholders.
Comment: 41st ACM/IEEE International Conference on Software Engineering
(ICSE2019)
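The abstract above names sentence length and position as conversational features. The following is a rough sketch of how such features might be computed for a thread that has already been split into sentences. The function name, the whitespace tokenization, and the normalization of position to [0, 1] are assumptions made for illustration, not the paper's feature definitions.

```python
def conversational_features(sentences):
    """Return (length in tokens, relative position in [0, 1]) for each sentence."""
    last = max(len(sentences) - 1, 1)  # avoid division by zero for one-sentence threads
    return [(len(s.split()), i / last) for i, s in enumerate(sentences)]


# Hypothetical issue thread, already split into sentences.
thread = [
    "The build fails on Windows.",
    "Can you share the full log?",
    "Fixed in the latest commit, please retry.",
]
for sentence, feats in zip(thread, conversational_features(thread)):
    print(feats, sentence)
```

Feature tuples like these could then be passed to any standard classifier; the abstract reports Random Forest working well in that setting.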