Extracting Build Changes with BUILDDIFF
Build systems are an essential part of modern software engineering projects.
As software projects change continuously, it is crucial to understand how the
build system changes because neglecting its maintenance can lead to expensive
build breakage. Recent studies have investigated the (co-)evolution of build
configurations and reasons for build breakage, but they did this only on a
coarse-grained level. In this paper, we present BUILDDIFF, an approach to
extract detailed build changes from MAVEN build files and classify them into 95
change types. In a manual evaluation of 400 build-changing commits, we show
that BUILDDIFF can extract and classify build changes with an average precision
and recall of 0.96 and 0.98, respectively. We then present two studies using
the build changes extracted from 30 open source Java projects to study the
frequency and time of build changes. The results show that the top 10 most
frequent change types account for 73% of the build changes. Among them, changes
to version numbers and changes to dependencies of the projects occur most
frequently. Furthermore, our results show that build changes occur frequently
around releases. With these results, we provide the basis for further research,
such as analyzing the (co-)evolution of build files with other artifacts or
improving effort estimation approaches. Furthermore, our detailed change
information enables improvements of refactoring approaches for build
configurations and improvements of models to identify error-prone build files.
Comment: Accepted at the International Conference on Mining Software Repositories (MSR), 201
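As a rough illustration of the kind of fine-grained extraction described above, the sketch below diffs the <dependencies> section of two revisions of a Maven pom.xml and labels each difference as an addition, removal, or version update. The change-type names and helper functions are illustrative only; they are not BUILDDIFF's actual taxonomy of 95 types.

```python
# Minimal sketch: diff the <dependencies> of two Maven POM revisions.
# The change-type labels are illustrative, not BUILDDIFF's taxonomy.
import xml.etree.ElementTree as ET

NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def dependencies(pom_path):
    """Map 'groupId:artifactId' -> version for every declared dependency."""
    root = ET.parse(pom_path).getroot()
    deps = {}
    for dep in root.findall(".//m:dependencies/m:dependency", NS):
        gid = dep.findtext("m:groupId", default="", namespaces=NS)
        aid = dep.findtext("m:artifactId", default="", namespaces=NS)
        ver = dep.findtext("m:version", default="", namespaces=NS)
        deps[f"{gid}:{aid}"] = ver
    return deps

def diff_dependencies(old_pom, new_pom):
    """Yield (change_type, coordinate, detail) tuples for one commit."""
    old, new = dependencies(old_pom), dependencies(new_pom)
    for ga in new.keys() - old.keys():
        yield ("DEPENDENCY_ADDED", ga, new[ga])
    for ga in old.keys() - new.keys():
        yield ("DEPENDENCY_REMOVED", ga, old[ga])
    for ga in old.keys() & new.keys():
        if old[ga] != new[ga]:
            yield ("DEPENDENCY_VERSION_UPDATED", ga, f"{old[ga]} -> {new[ga]}")

if __name__ == "__main__":
    for change in diff_dependencies("pom_before.xml", "pom_after.xml"):
        print(change)
```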
An Introduction to Software Ecosystems
This chapter defines and presents different kinds of software ecosystems. The
focus is on the development, tooling and analytics aspects of software
ecosystems, i.e., communities of software developers and the interconnected
software components (e.g., projects, libraries, packages, repositories,
plug-ins, apps) they are developing and maintaining. The technical and social
dependencies between these developers and software components form a
socio-technical dependency network, and the dynamics of this network change
over time. We classify and provide several examples of such ecosystems. The
chapter also introduces and clarifies the relevant terms needed to understand
and analyse these ecosystems, as well as the techniques and research methods
that can be used to analyse different aspects of these ecosystems.
Comment: Preprint of chapter "An Introduction to Software Ecosystems" by Tom Mens and Coen De Roover, published in the book "Software Ecosystems: Tooling and Analytics" (eds. T. Mens, C. De Roover, A. Cleve), 2023, ISBN 978-3-031-36059-6, reproduced with permission of Springer. The final authenticated version of the book and this chapter is available online at: https://doi.org/10.1007/978-3-031-36060-
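As a purely illustrative aside (not taken from the chapter), a socio-technical dependency network of the kind mentioned above can be sketched as two edge sets: developers contributing to components, and components depending on other components. The entity names below are made up.

```python
# Illustrative sketch: a socio-technical dependency network stored as
# adjacency sets, with developer->component contribution edges and
# component->component dependency edges.
from collections import defaultdict

contributes = defaultdict(set)   # developer -> components they maintain
depends_on = defaultdict(set)    # component -> components it depends on

def add_contribution(dev, component):
    contributes[dev].add(component)

def add_dependency(component, dependency):
    depends_on[component].add(dependency)

def exposed_to(dev):
    """Components a developer depends on, transitively, through their own work."""
    seen, stack = set(), list(contributes[dev])
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(depends_on[c])
    return seen

add_contribution("alice", "plugin-a")
add_dependency("plugin-a", "lib-core")
add_dependency("lib-core", "lib-utils")
print(exposed_to("alice"))   # {'plugin-a', 'lib-core', 'lib-utils'}
```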
Vulnerable Open Source Dependencies: Counting Those That Matter
BACKGROUND: Vulnerable dependencies are a known problem in today's
open-source software ecosystems because OSS libraries are highly interconnected
and developers do not always update their dependencies. AIMS: In this paper we
aim to present a precise methodology that combines code-based analysis of patches with information on build, test, update dates, and group extracted directly from the code repository, and therefore caters to the needs of industrial practice for the correct allocation of development and audit resources. METHOD: To
understand the industrial impact of the proposed methodology, we considered the
200 most popular OSS Java libraries used by SAP in its own software. Our
analysis included 10905 distinct GAVs (group, artifact, version) when
considering all the library versions. RESULTS: We found that about 20% of the
dependencies affected by a known vulnerability are not deployed, and therefore,
they do not represent a danger to the analyzed library because they cannot be
exploited in practice. Developers of the analyzed libraries are able to fix
(and are actually responsible for) 82% of the deployed vulnerable dependencies. The
vast majority (81%) of vulnerable dependencies may be fixed by simply updating
to a new version, while 1% of the vulnerable dependencies in our sample are
halted, and therefore, potentially require a costly mitigation strategy.
CONCLUSIONS: Our case study shows that correct counting allows software development companies to receive actionable information about their library dependencies and, therefore, to correctly allocate costly development and audit resources, which are otherwise spent inefficiently when measurements are distorted.
Comment: This is a pre-print of the paper that appears, with the same title, in the proceedings of the 12th International Symposium on Empirical Software Engineering and Measurement, 201
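The counting idea can be sketched as follows; the data layout, GAV entries, and category names are hypothetical stand-ins for the paper's actual tooling and vulnerability database.

```python
# Hypothetical sketch of the counting idea: classify a project's resolved
# dependencies against a table of known vulnerable GAVs.
from dataclasses import dataclass

@dataclass(frozen=True)
class Dependency:
    group: str
    artifact: str
    version: str
    deployed: bool          # False for test/provided-only dependencies

# known vulnerable (group, artifact, version) -> first fixed version, or None if halted
VULNERABLE = {
    ("org.example", "lib-net", "1.2"): "1.3",
    ("org.example", "lib-old", "0.9"): None,
}

def classify(deps):
    counts = {"not_deployed": 0, "fixable_by_update": 0, "halted": 0}
    for d in deps:
        fix = VULNERABLE.get((d.group, d.artifact, d.version), "clean")
        if fix == "clean":
            continue
        if not d.deployed:
            counts["not_deployed"] += 1      # cannot be exploited in practice
        elif fix is not None:
            counts["fixable_by_update"] += 1 # a newer, fixed version exists
        else:
            counts["halted"] += 1            # no fixed release: needs mitigation
    return counts

deps = [
    Dependency("org.example", "lib-net", "1.2", deployed=True),
    Dependency("org.example", "lib-old", "0.9", deployed=False),
]
print(classify(deps))  # {'not_deployed': 1, 'fixable_by_update': 1, 'halted': 0}
```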
REI: An Integrated Measure for Software Reusability
To capitalize on the benefits of software reuse, an efficient selection among candidate reusable assets should be performed in terms of functional fitness and adaptability. The reusability of assets is usually measured through reusability indices. However, these do not capture all facets of reusability, such as structural characteristics, external quality attributes, and documentation. In this paper, we propose a reusability index (REI) as a synthesis of various software metrics and evaluate its ability to quantify reuse, based on the IEEE Standard on Software Metrics Validity. The proposed index is compared with existing ones through a case study on 80 reusable open-source assets. To illustrate the applicability of the proposed index, we performed a pilot study in which real-world reuse decisions were compared with the decisions suggested by the metrics (including REI). The results of the study suggest that the proposed index has the highest predictive and discriminative power; it is the most consistent in ranking reusable assets and the most strongly correlated with their levels of reuse. The findings of the paper are discussed to highlight the most important aspects of reusability assessment (interpretation of results), and interesting implications for research and practice are provided.
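To give a flavour of what a synthesized index could look like, the sketch below combines min-max-normalised metrics into a weighted score. The metric names and weights are placeholders and do not reproduce the actual REI definition from the paper.

```python
# Illustrative sketch only: a composite reusability index computed as a
# weighted sum of min-max-normalised metrics.

WEIGHTS = {            # hypothetical weights; higher = more influence
    "documentation_ratio": 0.3,
    "test_coverage": 0.3,
    "api_size": 0.2,
    "coupling": 0.2,
}
HIGHER_IS_BETTER = {"documentation_ratio", "test_coverage", "api_size"}

def normalise(value, lo, hi, higher_is_better):
    score = (value - lo) / (hi - lo) if hi > lo else 0.0
    return score if higher_is_better else 1.0 - score

def rei(asset_metrics, ranges):
    """asset_metrics: metric -> raw value; ranges: metric -> (min, max) over all assets."""
    return sum(
        WEIGHTS[m] * normalise(v, *ranges[m], m in HIGHER_IS_BETTER)
        for m, v in asset_metrics.items()
    )

ranges = {"documentation_ratio": (0, 1), "test_coverage": (0, 1),
          "api_size": (5, 500), "coupling": (0, 40)}
asset = {"documentation_ratio": 0.8, "test_coverage": 0.6,
         "api_size": 120, "coupling": 10}
print(round(rei(asset, ranges), 3))
```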
RefDiff: Detecting Refactorings in Version Histories
Refactoring is a well-known technique that is widely adopted by software
engineers to improve the design and enable the evolution of a system. Knowing
which refactoring operations were applied in a code change is valuable
information for understanding software evolution, adapting software components, merging
code changes, and other applications. In this paper, we present RefDiff, an
automated approach that identifies refactorings performed between two code
revisions in a git repository. RefDiff employs a combination of heuristics
based on static analysis and code similarity to detect 13 well-known
refactoring types. In an evaluation using an oracle of 448 known refactoring
operations, distributed across seven Java projects, our approach achieved
precision of 100% and recall of 88%. Moreover, our evaluation suggests that RefDiff achieves higher precision and recall than existing state-of-the-art approaches.
Comment: Paper accepted at the 14th International Conference on Mining Software Repositories (MSR), pages 1-11, 201
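A minimal sketch of a code-similarity heuristic in this spirit is shown below: a method that disappears from one class and reappears nearly unchanged in another is flagged as a candidate Move Method. The tokenisation and threshold are illustrative and do not reflect RefDiff's actual heuristics.

```python
# Sketch: flag candidate Move Method refactorings via token-set similarity
# between methods removed in the old revision and methods added in the new one.
import re

def tokens(body):
    return set(re.findall(r"[A-Za-z_]\w*", body))

def similarity(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def candidate_moves(old_methods, new_methods, threshold=0.8):
    """old_methods/new_methods: dict 'Class.method' -> method body source text."""
    removed = {k: v for k, v in old_methods.items() if k not in new_methods}
    added = {k: v for k, v in new_methods.items() if k not in old_methods}
    for old_id, old_body in removed.items():
        for new_id, new_body in added.items():
            if similarity(old_body, new_body) >= threshold:
                yield ("MOVE_METHOD?", old_id, new_id)

old = {"Order.total": "return items.stream().mapToInt(Item::price).sum();"}
new = {"Invoice.total": "return items.stream().mapToInt(Item::price).sum();"}
print(list(candidate_moves(old, new)))
```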
Predicting Changes to Source Code
Organizations typically use issue tracking systems (ITS) such as Jira to plan software releases and assign requirements to developers, and source control management (SCM) repositories such as Git to track historical changes to a code-base. These ITS and SCM repositories contain valuable data that remains largely untapped. As developers churn through an organization, it becomes expensive for them to spend time determining which software artifact must be modified to implement a requirement. In this work we developed, tested, and evaluated a tool called Class Change Predictor (CCP) for predicting which class will implement a requirement. Understanding which class will implement a requirement supports several software engineering tasks such as refactoring and assigning requirements to developers.
CCP is a data-mining tool that operates on top of ITS and SCM repositories and gathers a unique combination of metrics. CCP leverages requirement text to compare current requirements to past requirements and requirements to source code files. CCP performs static analysis on the code-base of each major release of the software artifact. We evaluated CCP on different open source datasets (and the Digital Democracy dataset) using several machine learning classifiers and pre-processing procedures. Our results show that we can achieve high precision on three out of four datasets. We conclude that accurate class change prediction is feasible, and we propose numerous solutions to increase future accuracy.
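One of the signals mentioned above, comparing requirement text to source code, can be sketched with TF-IDF and cosine similarity; the class corpora, requirement, and pipeline below are simplified placeholders rather than CCP's actual combination of metrics.

```python
# Simplified sketch of the text-similarity signal: rank classes by TF-IDF
# cosine similarity between a new requirement and each class's identifiers
# and comments. Names and data are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class_texts = {
    "PaymentService": "charge refund credit card payment gateway retry",
    "UserProfile": "avatar display name email preferences update profile",
    "ReportExporter": "export csv pdf report schedule download",
}

requirement = "Allow users to update their email address on their profile"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(class_texts.values()) + [requirement])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

ranked = sorted(zip(class_texts, scores), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.2f}")   # highest score = most likely to change
```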
Empirically-Grounded Construction of Bug Prediction and Detection Tools
There is an increasing demand for high-quality software, as software bugs have an economic impact not only on software projects but also on national economies in general. Software quality is achieved via the main quality assurance activities of testing and code reviewing. However, these activities are expensive, so they need to be carried out efficiently.
Auxiliary software quality tools such as bug detection and bug prediction tools help developers focus their testing and reviewing activities on the parts of software that are more likely to contain bugs. However, these tools are far from being adopted as mainstream development tools. Previous research points to their inability to adapt to the peculiarities of projects and their high rate of false positives as the main obstacles to their adoption.
We propose empirically-grounded analysis to improve the adaptability and efficiency of bug detection and prediction tools. For a bug detector to be efficient, it needs to detect bugs that are conspicuous, frequent, and specific to a software project. We empirically show that null-related bugs fulfill these criteria and are worth building detectors for. We analyze the null dereferencing problem and find that its root cause lies in methods that return null. We propose an empirical solution to this problem that relies on the wisdom of the crowd. For each API method, we extract a nullability measure that expresses how often the return value of this method is checked against null in the ecosystem of the API. We use nullability to annotate API methods with nullness annotations and to warn developers about missing and excessive null checks.
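A minimal sketch of such a nullability measure follows, assuming call-site records have already been extracted from client code by some static analysis; the records below are made up.

```python
# Sketch: for each API method, the fraction of observed call sites in the
# ecosystem whose return value is checked against null.
from collections import defaultdict

# (api_method, return_value_checked_against_null) records from client code
call_sites = [
    ("Map.get", True), ("Map.get", True), ("Map.get", False),
    ("List.size", False), ("List.size", False),
]

def nullability(sites):
    checked, total = defaultdict(int), defaultdict(int)
    for method, is_checked in sites:
        total[method] += 1
        checked[method] += is_checked
    return {m: checked[m] / total[m] for m in total}

for method, score in nullability(call_sites).items():
    # a high score suggests clients should check this method's result for null
    print(f"{method}: nullability={score:.2f}")
```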
For a bug predictor to be efficient, it needs to be optimized both as a machine learning model and as a software quality tool. We empirically show how feature selection and hyperparameter optimization improve prediction accuracy. Then we optimize bug prediction to locate the maximum number of bugs in the minimum amount of code by finding the most cost-effective combination of bug prediction configurations, i.e., dependent variables, machine learning model, and response variable. We show that using both source code and change metrics as dependent variables, applying feature selection to them, and then using an optimized Random Forest to predict the number of bugs results in the most cost-effective bug predictor.
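A hedged sketch of such a pipeline is shown below, using scikit-learn with toy data, a small feature-selection step, and a grid-searched Random Forest regressor; the metrics, parameter grid, and data are placeholders rather than the thesis's actual configuration.

```python
# Sketch: feature selection plus a tuned Random Forest predicting the number
# of bugs per file, then ranking files so review effort goes where bugs cluster.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 12))          # 12 toy code + change metrics per file
y = rng.poisson(2.0, size=200)     # toy target: number of bugs per file

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("forest", RandomForestRegressor(random_state=0)),
])
grid = GridSearchCV(
    pipeline,
    param_grid={
        "select__k": [4, 8, 12],
        "forest__n_estimators": [100, 300],
        "forest__max_depth": [None, 10],
    },
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)

predicted = grid.best_estimator_.predict(X)
top_files = np.argsort(predicted)[::-1][:10]   # files to inspect first
print(top_files)
```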
Throughout this thesis, we show how empirically-grounded analysis helps us achieve efficient bug prediction and detection tools and adapt them to the characteristics of each software project.