
    20-MAD -- 20 Years of Issues and Commits of Mozilla and Apache Development

    Data of long-lived and high-profile projects is valuable for research on successful software engineering in the wild. Having a dataset that links the different software repositories of such projects enables deeper investigations. This paper presents 20-MAD, a dataset linking the commit and issue data of Mozilla and Apache projects. It includes over 20 years of information about 765 projects, 3.4M commits, 2.3M issues, and 17.3M issue comments, and its compressed size is over 6 GB. The data contains all the typical information about source code commits (e.g., lines added and removed, message, and commit time) and issues (status, severity, votes, and summary). The issue comments have been pre-processed for natural language processing and sentiment analysis, including emoticons and valence and arousal scores. Linking code repository and issue tracker information allows studying individuals across both types of repositories and also provides more accurate time zone information for issue trackers. To our knowledge, this is the largest linked dataset, in size and in project lifetime, that is not based on GitHub. Comment: 17th International Conference on Mining Software Repositories, 2020
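    As a concrete illustration of the linking idea, the sketch below pairs commits with issues by matching Jira-style issue keys cited in commit messages. This is a generic heuristic, not the 20-MAD pipeline, and the record fields ("sha", "message", "issue_id") are hypothetical placeholders rather than the dataset's schema.

        import re

        # Jira-style issue key, e.g. "HADOOP-1234"
        ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

        def link_commits_to_issues(commits, issues):
            """Pair each commit with every issue whose key its message cites."""
            by_key = {issue["issue_id"]: issue for issue in issues}
            links = []
            for commit in commits:
                for key in ISSUE_KEY.findall(commit["message"]):
                    if key in by_key:
                        links.append((commit["sha"], key))
            return links

        # Toy usage with made-up records:
        commits = [{"sha": "abc123", "message": "HADOOP-42: fix NPE in scheduler"}]
        issues = [{"issue_id": "HADOOP-42", "status": "Resolved"}]
        print(link_commits_to_issues(commits, issues))  # [('abc123', 'HADOOP-42')]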

    Pitfalls and Guidelines for Using Time-Based Git Data

    Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data, however, time-based data is often dirty. To date, there are no studies that quantify how frequently such data is used by the software engineering research community, or that investigate its sources and quantify how often it is dirty. Depending on the research task and method used, including such dirty data could affect the research results. This paper presents an extended survey of papers that utilize time-based data, published in the Mining Software Repositories (MSR) conference series. Out of the 754 technical track and data papers published in MSR 2004-2021, at least 290 (38%) utilized time-based data. We also observed that most time-based data used in research papers comes in the form of Git commits, often from GitHub. Based on those results, we then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty Git timestamp data. Finally, we provide guidelines/best practices for researchers utilizing time-based data from Git repositories.
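    To make the notion of "dirty" timestamps concrete, here is a minimal sketch of sanity checks one might run over a commit's author and committer epoch seconds. The specific checks and their interpretation are illustrative assumptions, not the paper's exact guidelines.

        from datetime import datetime, timezone

        def timestamp_flags(author_ts, committer_ts, now=None):
            """Flag common suspicious patterns in a commit's timestamps."""
            now = now or datetime.now(timezone.utc).timestamp()
            flags = []
            if author_ts <= 0 or committer_ts <= 0:
                flags.append("epoch-or-negative")        # clock unset or bogus value
            if author_ts > now or committer_ts > now:
                flags.append("in-the-future")            # skewed or forged clock
            if committer_ts < author_ts:
                flags.append("commit-before-authoring")  # possible history rewriting
            return flags

        print(timestamp_flags(author_ts=0, committer_ts=1_600_000_000))
        # ['epoch-or-negative']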

    EALink: An Efficient and Accurate Pre-trained Framework for Issue-Commit Link Recovery

    Issue-commit links, as a type of software traceability links, play a vital role in various software development and maintenance tasks. However, they are typically deficient, as developers often forget or fail to create tags when making commits. Existing studies have deployed deep learning techniques, including pre-trained models, to improve automatic issue-commit link recovery. Despite their promising performance, we argue that previous approaches have four main problems that hinder them from recovering links in large software projects. To overcome these problems, we propose an efficient and accurate pre-trained framework called EALink for issue-commit link recovery. EALink requires far fewer model parameters than existing pre-trained methods, enabling efficient training and recovery. Moreover, we design various techniques to improve the recovery accuracy of EALink. We construct a large-scale dataset and conduct extensive experiments to demonstrate the power of EALink. Results show that EALink outperforms the state-of-the-art methods by a large margin (15.23%-408.65%) on various evaluation metrics. Meanwhile, its training and inference overhead is orders of magnitude lower than that of existing methods. Comment: 13 pages, 6 figures, published to ASE
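    For readers unfamiliar with the task shape, the sketch below ranks candidate commits for an issue with a plain bag-of-words cosine similarity. This is deliberately not EALink (which is a pre-trained neural model); it only shows what "scoring issue-commit pairs" means in practice.

        import math
        from collections import Counter

        def cosine(a: Counter, b: Counter) -> float:
            dot = sum(a[t] * b[t] for t in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def rank_commits(issue_text, commit_messages):
            """Score each commit message against the issue text, best first."""
            issue_vec = Counter(issue_text.lower().split())
            scored = [(cosine(issue_vec, Counter(m.lower().split())), m)
                      for m in commit_messages]
            return sorted(scored, reverse=True)

        print(rank_commits("fix null pointer in scheduler",
                           ["fix NPE in scheduler", "update docs", "refactor parser"]))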

    Does Code Review Speed Matter for Practitioners?

    Increasing code velocity is a common goal for a variety of software projects. The efficiency of the code review process significantly impacts how fast the code gets merged into the final product and reaches customers. We conducted a survey to study code velocity-related beliefs and practices. We analyzed 75 completed surveys from 39 industry participants and 36 participants from the open-source community. Our critical findings are (a) the industry and the open-source community hold a similar set of beliefs, (b) quick reaction time is of utmost importance and applies both to the tooling infrastructure and to the behavior of other engineers, (c) time-to-merge is the most essential code review metric to improve, (d) engineers have differing opinions about the benefits of increased code velocity for their career growth, and (e) the controlled application of the commit-then-review model can increase code velocity. Our study supports the continued need to invest in and improve code velocity regardless of the underlying organizational ecosystem.
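    Since respondents single out time-to-merge as the key metric, a minimal sketch of computing it is shown below. The record fields ("created", "merged") are hypothetical; real data would come from a Git hosting or code review API.

        from datetime import datetime, timedelta
        from statistics import median

        def time_to_merge_hours(reviews):
            """Median hours from a change's creation to its merge, over merged reviews."""
            deltas = [(r["merged"] - r["created"]) / timedelta(hours=1)
                      for r in reviews if r.get("merged")]
            return median(deltas) if deltas else None

        reviews = [
            {"created": datetime(2023, 5, 1, 9), "merged": datetime(2023, 5, 2, 9)},
            {"created": datetime(2023, 5, 3, 9), "merged": datetime(2023, 5, 3, 15)},
        ]
        print(time_to_merge_hours(reviews))  # 15.0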

    Effort Estimation of FLOSS Projects: A Study of the Linux Kernel

    Empirical research on Free/Libre/Open Source Software (FLOSS) has shown that developers tend to cluster around two main roles: "core" contributors differ from "peripheral" developers in terms of a larger number of responsibilities and a higher productivity pattern. A further, cross-cutting characterization of developers could be achieved by associating developers with "time slots", and different patterns of activity and effort could be associated with such slots. Such analysis, if replicated, could be used not only to compare different FLOSS communities and to evaluate their stability and maturity, but also to determine, within projects, how the effort is distributed in a given period, and to estimate future needs with respect to key points in the software life-cycle (e.g., major releases). This study analyses the activity patterns within the Linux kernel project, at first focusing on the overall distribution of effort and activity within weeks and days; then, dividing each day into three 8-hour time slots, it focuses on effort and activity around major releases. These analyses have the objective of evaluating effort, productivity, and types of activity globally and around major releases. They enable a comparison of these releases and patterns of effort and activities with traditional software products and processes, and in turn, the identification of company-driven projects (i.e., working mainly during office hours) among FLOSS endeavors. The results of this research show that, overall, the effort within the Linux kernel community is constant (albeit at different levels) throughout the week, signalling the need for updated estimation models, different from those used in traditional 9am-5pm, Monday-to-Friday commercial companies. It also becomes evident that the activity before a release is vastly different from that after a release, and that the changes show an increase in code complexity in specific time slots (notably in the late night hours), which will later require additional maintenance effort.
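    The day-partitioning scheme is easy to reproduce: the sketch below buckets commit timestamps into three 8-hour slots and counts activity per slot. The exact slot boundaries (00-08, 08-16, 16-24 local time) are an assumption about the paper's cut points.

        from collections import Counter
        from datetime import datetime

        SLOTS = ("night (00-08)", "office (08-16)", "evening (16-24)")

        def slot_of(ts: datetime) -> str:
            """Map a timestamp to one of three 8-hour slots by its local hour."""
            return SLOTS[ts.hour // 8]

        def activity_by_slot(commit_times):
            return Counter(slot_of(t) for t in commit_times)

        commits = [datetime(2023, 1, 9, 2), datetime(2023, 1, 9, 10), datetime(2023, 1, 9, 11)]
        print(activity_by_slot(commits))
        # A dominant office-hours slot would hint at a company-driven project.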

    RefBERT: A Two-Stage Pre-trained Framework for Automatic Rename Refactoring

    Refactoring is an indispensable practice for improving the quality and maintainability of source code in software evolution. Rename refactoring is the most frequently performed refactoring; it suggests a new name for an identifier to enhance readability when the identifier is poorly named. However, most existing works only identify renaming activities between two versions of source code, while few consider how to suggest a new name. In this paper, we study automatic rename refactoring on variable names, which is considered more challenging than other rename refactoring activities. We first point out the connections between rename refactoring and various prevalent learning paradigms, and the difference between rename refactoring and general text generation in natural language processing. Based on our observations, we propose RefBERT, a two-stage pre-trained framework for rename refactoring on variable names. RefBERT first predicts the number of sub-tokens in the new name and then generates sub-tokens accordingly. Several techniques, including constrained masked language modeling, contrastive learning, and the bag-of-tokens loss, are incorporated into RefBERT to tailor it for automatic rename refactoring on variable names. Through extensive experiments on our constructed refactoring datasets, we show that the generated variable names of RefBERT are more accurate and meaningful than those produced by the existing method.
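    The two-stage decomposition can be illustrated with stubs: stage one predicts how many sub-tokens the new name should have, and stage two fills exactly that many slots. Both functions below merely stand in for the pre-trained networks described in the paper; they are not the model itself.

        def predict_subtoken_count(context: str) -> int:
            # Stub for stage 1: a real model would classify the count from the context.
            return 2

        def fill_subtokens(context: str, n: int) -> list[str]:
            # Stub for stage 2: a real model would decode each of the n masked slots;
            # here we return a fixed example just to show the interface.
            return ["user", "count"][:n]

        def rename(context: str) -> str:
            n = predict_subtoken_count(context)
            return "_".join(fill_subtokens(context, n))

        print(rename("for x in users: total += 1"))  # -> "user_count"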

    Architectural Vulnerabilities in Plug-and-Play Systems

    Plug-and-play architectures enhance a system's extensibility by providing a framework that enables additional functionalities to be added to or removed from the system at runtime. Such frameworks are often implemented through a set of well-defined interfaces that form the extension points for the pluggable functionalities. However, plug-ins can increase the application's attack surface or introduce untrusted behavior into the system. Designing a secure plug-and-play architecture is critical and non-trivial, as the features provided by plug-ins are not known in advance. In this paper, we conduct an in-depth study of seven systems with plug-and-play architectures. In total, we have analyzed 3,183 vulnerabilities from Chromium, Thunderbird, Firefox, Pidgin, WordPress, Apache OFBiz, and OpenMRS, whose core architectures are based on a plug-and-play approach. Following a grounded theory approach, we have also identified the common security vulnerabilities related to plug-and-play architectures and mechanisms to mitigate them. We found a total of 303 vulnerabilities that are rooted in extensibility design decisions. We also observed that these plugin-related vulnerabilities were caused by 15 different types of problems. We present these 15 types of security issues observed in the case studies and the design mechanisms that could prevent such vulnerabilities. Finally, as a result of this study, we have used formal modeling to guide developers of plug-and-play systems in verifying that their architectures are free of many of these types of security issues.
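    The extension-point pattern the paper studies can be sketched as a small host that only talks to plug-ins through a well-defined interface and vets them before activation. The allow-list check here is an illustrative mitigation, not a mechanism taken from the studied systems.

        from abc import ABC, abstractmethod

        class Plugin(ABC):
            """Well-defined extension point: the host only calls this interface."""
            @abstractmethod
            def run(self, payload: str) -> str: ...

        class Host:
            def __init__(self, trusted_names: set[str]):
                self._trusted = trusted_names
                self._plugins: list[Plugin] = []

            def load(self, plugin: Plugin) -> None:
                # Reject plug-ins outside the allow-list instead of trusting arbitrary code.
                if type(plugin).__name__ not in self._trusted:
                    raise PermissionError(f"untrusted plugin: {type(plugin).__name__}")
                self._plugins.append(plugin)

            def dispatch(self, payload: str) -> list[str]:
                return [p.run(payload) for p in self._plugins]

        class Echo(Plugin):
            def run(self, payload: str) -> str:
                return payload

        host = Host(trusted_names={"Echo"})
        host.load(Echo())
        print(host.dispatch("hello"))  # ['hello']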

    Libre culture: meditations on free culture

    Libre Culture is the essential expression of the free culture/copyleft movement. This anthology, brought together here for the first time, represents the early groundwork of Libre Society thought. As creativity and ideas develop, capital works to hoard and privatize the knowledge and meaning of what is created. Expression becomes monopolized, secured within an artificial market-scarcity enclave, and finally presented as a novelty by the culture industry in order to benefit cloistered profit motives. In the way that physical resources such as forests or public services are free, Libre Culture argues for the freeing up of human ideas and expression from copyright bulwarks in all their forms.