Sampling Projects in GitHub for MSR Studies
Almost every Mining Software Repositories (MSR) study requires, as a first
step, the selection of the subject software repositories. These repositories
are usually collected from hosting services like GitHub using specific
selection criteria dictated by the study goal. For example, a study related to
licensing might be interested in selecting projects explicitly declaring a
license. Once the selection criteria have been defined, utilities such as the
GitHub APIs can be used to "query" the hosting service. However, researchers
have to deal with usage limitations imposed by these APIs and a lack of
required information. For example, the GitHub search APIs allow 30 requests per
minute and, when searching repositories, only provide limited information
(e.g., the number of commits in a repository is not included). To support
researchers in sampling projects from GitHub, we present GHS (GitHub Search), a
dataset containing 25 characteristics (e.g., number of commits, license, etc.)
of 735,669 repositories written in 10 programming languages. The set of
characteristics has been derived by looking for frequently used project
selection criteria in MSR studies and the dataset is continuously updated to
(i) always provide fresh data about the existing projects, and (ii) increase
the number of indexed projects. The GHS dataset can be queried through a web
application we built that allows users to set many combinations of selection criteria
needed for a study and download the information of matching repositories:
https://seart-ghs.si.usi.ch.
Comment: Accepted to the 18th International Conference on Mining Software
Repositories (MSR 2021)
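The rate limit quoted in the abstract (30 search requests per minute) implies spacing requests roughly two seconds apart. The following is a minimal sketch of a client-side throttle under that assumption; the class name `Throttle` and its parameters are hypothetical, not part of GHS or of any GitHub client library, and the sketch makes no network calls.

```python
import time


class Throttle:
    """Space out calls so that at most `limit` happen per `window` seconds.

    Hypothetical helper for illustration only; the default of 30 calls per
    60 seconds mirrors the search API limit quoted in the abstract.
    """

    def __init__(self, limit=30, window=60.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = window / limit  # 2.0 seconds between calls for 30/min
        self._clock = clock             # injectable so the logic can be tested
        self._sleep = sleep
        self._next_allowed = None       # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay
```

A caller would invoke `wait()` immediately before each search request, so that a burst of queries is stretched out instead of being rejected by the service.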
Continuous Integration Theater
Background: Continuous Integration (CI) systems are now the bedrock of
several software development practices. Several tools such as TravisCI,
CircleCI, and Hudson, that implement CI practices, are commonly adopted by
software engineers. However, the way that software engineers use these tools
could lead to what we call "Continuous Integration Theater", a situation in
which software engineers do not employ these tools effectively, leading to
unhealthy CI practices. Aims: The goal of this paper is to determine how
commonly these unhealthy continuous integration practices are employed
in practice. Method: By inspecting 1,270 open-source projects that use
TravisCI, the most used CI service, we quantitatively studied how common it is to
use CI (1) with infrequent commits, (2) in a software project with poor test
coverage, (3) with builds that stay broken for long periods, and (4) with
builds that take too long to run. Results: We observed that 748 (60%)
projects face infrequent commits, which essentially makes the merging process
harder. Moreover, we were able to find code coverage information for 51
projects. The average code coverage was 78%, although Ruby projects have a
higher code coverage than Java projects (86% and 63%, respectively). However,
some projects with very low coverage (4%) were found. Still, we observed
that 85% of the studied projects have at least one broken build that takes more
than four days to be fixed. Interestingly, very small projects (up to 1,000
lines of code) are the ones that take the longest to fix broken builds.
Finally, we noted that, for the majority of the studied projects, the build is
executed within the 10-minute rule of thumb. Conclusions: Our results are
important to an increasing community of software engineers that employ CI
practices on a daily basis but may not be aware of bad practices that are
eventually employed.
Comment: to appear at ESEM 201
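Practice (3) above, builds that stay broken for long periods, can be illustrated with a short sketch. It assumes a build history given as time-ordered `(finished_at, passed)` pairs and measures each broken period from the first failing build to the next passing one. The function names, the input shape, and the handling of streaks that are still broken at the end of the history are assumptions made for illustration, not the authors' actual procedure; only the four-day threshold comes from the abstract.

```python
from datetime import datetime, timedelta


def broken_periods(builds):
    """Return one timedelta per broken streak.

    `builds` is an iterable of (finished_at, passed) pairs sorted by time.
    A streak runs from the first failing build to the next passing build.
    Streaks that are never fixed within the history are ignored here.
    """
    periods = []
    broken_since = None
    for finished_at, passed in builds:
        if not passed and broken_since is None:
            broken_since = finished_at          # streak starts
        elif passed and broken_since is not None:
            periods.append(finished_at - broken_since)  # streak fixed
            broken_since = None
    return periods


def has_long_broken_build(builds, threshold=timedelta(days=4)):
    """True if any broken streak took longer than `threshold` to fix."""
    return any(period > threshold for period in broken_periods(builds))
```

Applied per project, `has_long_broken_build` yields the kind of yes/no flag that could then be aggregated into a share of projects.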
Does Code Quality Affect Pull Request Acceptance? An empirical study
Background. Pull requests are a common practice for contributing and
reviewing contributions, and are employed both in open-source and industrial
contexts. One of the main goals of code reviews is to find defects in the code,
allowing project maintainers to easily integrate external contributions into a
project and discuss the code contributions. Objective. The goal of this paper
is to understand whether code quality is actually considered when pull requests
are accepted. Specifically, we aim at understanding whether code quality issues
such as code smells, antipatterns, and coding style violations in the pull
request code affect the chance of its acceptance when reviewed by a maintainer
of the project. Method. We conducted a case study among 28 Java open-source
projects, analyzing the presence of 4.7 M code quality issues in 36 K pull
requests. We analyzed further correlations by applying Logistic Regression and
seven machine learning techniques (Decision Tree, Random Forest, Extremely
Randomized Trees, AdaBoost, Gradient Boosting, XGBoost). Results. Unexpectedly,
code quality turned out not to affect the acceptance of a pull request at all.
As suggested by other works, other factors such as the reputation of the
maintainer and the importance of the feature delivered might be more important
than code quality in terms of pull request acceptance. Conclusions. Researchers
have already investigated the influence of developers' reputation on pull
request acceptance. This is the first work investigating whether the quality of
the code in pull requests affects the acceptance of the pull request. We
recommend that researchers further investigate this topic to understand if
different measures or different tools could provide some useful measures.
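To make the logistic-regression framing concrete: for a single binary predictor (for example, "the pull request contains at least one quality issue"), the fitted logistic-regression slope equals the log odds ratio of the corresponding 2x2 table, and a value near zero corresponds to no association with acceptance. The sketch below computes that quantity directly. It is a simplification for illustration, not the study's multi-feature analysis, and the function and argument names are hypothetical.

```python
import math


def log_odds_ratio(accepted_with, rejected_with, accepted_without, rejected_without):
    """Log odds ratio of acceptance for pull requests with vs. without issues.

    All four arguments are counts and must be positive.
    A result of 0.0 means the odds of acceptance are the same in both groups;
    a negative result means pull requests with issues are accepted less often.
    """
    odds_with = accepted_with / rejected_with
    odds_without = accepted_without / rejected_without
    return math.log(odds_with / odds_without)
```

With equal acceptance odds in both groups the function returns 0.0, which is the pattern a "no effect" finding would produce for such a predictor.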