The effects of change decomposition on code review -- a controlled experiment
Background: Code review is a cognitively demanding and time-consuming
process. Previous qualitative studies hinted at how decomposing change sets
into multiple yet internally coherent ones would improve the reviewing process.
So far, the literature has provided no quantitative analysis of this hypothesis.
Aims: (1) Quantitatively measure the effects of change decomposition on the
outcome of code review (in terms of number of found defects, wrongly reported
issues, suggested improvements, time, and understanding); (2) Qualitatively
analyze how subjects approach the review and navigate the code, building
knowledge and addressing existing issues, in large vs. decomposed changes.
Method: Controlled experiment using the pull-based development model
involving 28 software developers among professionals and graduate students.
Results: Change decomposition leads to fewer wrongly reported issues,
influences how subjects approach and conduct the review activity (by increasing
context-seeking), yet impacts neither understanding the change rationale nor
the number of found defects.
Conclusions: Change decomposition not only reduces the noise for subsequent
data analyses but also significantly supports the tasks of the developers in
charge of reviewing the changes. As such, commits belonging to different
concepts should be separated, adopting this as a best practice in software engineering.
Detecting and Characterizing Propagation of Security Weaknesses in Puppet-based Infrastructure Management
Despite being beneficial for managing computing infrastructure automatically,
Puppet manifests are susceptible to security weaknesses, e.g., hard-coded
secrets and use of weak cryptography algorithms. Adequate mitigation of
security weaknesses in Puppet manifests is thus necessary to secure computing
infrastructure that is managed with Puppet manifests. A characterization of
how security weaknesses propagate and affect Puppet-based infrastructure
management can inform practitioners on the relevance of the detected security
weaknesses, as well as help them take necessary actions for mitigation. To that
end, we conduct an empirical study with 17,629 Puppet manifests mined from 336
open source repositories. We construct Taint Tracker for Puppet Manifests
(TaintPup), for which we observe 2.4 times more precision compared to that of a
state-of-the-art security static analysis tool. TaintPup leverages
Puppet-specific information flow analysis using which we characterize
propagation of security weaknesses. From our empirical study, we observe
security weaknesses to propagate into 4,457 resources, i.e., Puppet-specific
code elements used to manage infrastructure. A single instance of a security
weakness can propagate into as many as 35 distinct resources. We observe
security weaknesses to propagate into 7 categories of resources, which include
resources used to manage continuous integration servers and network
controllers. According to our survey with 24 practitioners, propagation of
security weaknesses into data storage-related resources is rated to have the
most severe impact for Puppet-based infrastructure management.
Comment: 14 pages, currently under revie
An Automatically Created Novel Bug Dataset and its Validation in Bug Prediction
Bugs are inescapable during software development due to frequent code
changes, tight deadlines, etc.; therefore, it is important to have tools to
find these errors. One way of performing bug identification is to analyze the
characteristics of buggy source code elements from the past and predict the
present ones based on the same characteristics, using e.g. machine learning
models. To support model building tasks, code elements and their
characteristics are collected in so-called bug datasets which serve as the
input for learning.
We present the BugHunter Dataset: a novel kind of automatically
constructed and freely available bug dataset containing code elements (files,
classes, methods) with a wide set of code metrics and bug information. Other
available bug datasets follow the traditional approach of gathering the
characteristics of all source code elements (buggy and non-buggy) at only one
or more pre-selected release versions of the code. Our approach, on the other
hand, captures the buggy and the fixed states of the same source code elements
from the narrowest timeframe we can identify for a bug's presence, regardless
of release versions. To show the usefulness of the new dataset, we built and
evaluated bug prediction models and achieved F-measure values over 0.74.
Attracting Contributions to your GitHub Project
Most Open Source Software projects can only progress thanks to developers willing to voluntarily contribute. Therefore, their vitality and success largely depend on their ability to attract developers. Code hosting platforms like GitHub aim at making software development more collaborative and attractive for contributors by providing facilities such as issue-tracking, code review or team management on top of a Git repository following a pull-based model to handle external contributions. We study whether the use of these facilities actually helps to get more contributions based on a quantitative analysis over a dataset composed of all the GitHub projects created in the last two years. We discovered that most projects actually ignore them and that those that do use them do not advance faster either. A manual analysis of the most successful projects suggests that other factors, like a clear description of the contribution and governance rules for the project, have a greater impact.
Root cause prediction based on bug reports
This paper proposes a supervised machine learning approach for predicting the
root cause of a given bug report. Knowing the root cause of a bug can help
developers in the debugging process - either directly or indirectly by choosing
proper tool support for the debugging task. We mined 54755 closed bug reports
from the issue trackers of 103 GitHub projects and applied a set of heuristics
to create a benchmark consisting of 10459 reports. A subset was manually
classified into three groups (semantic, memory, and concurrency) based on the
bugs' root causes. Since the types of root cause are not equally distributed, a
combination of keyword search and random selection was applied. Our data set
for the machine learning approach consists of 369 bug reports (122 concurrency,
121 memory, and 126 semantic bugs). The bug reports are used as input to a
natural language processing algorithm. We evaluated the performance of several
classifiers for predicting the root causes for the given bug reports. Linear
Support Vector Machines achieved the highest mean precision (0.74) and recall
(0.72) scores. The created bug data set and classification are publicly
available.
Comment: 6 page