3,335 research outputs found
An Automatically Created Novel Bug Dataset and its Validation in Bug Prediction
Bugs are inescapable during software development due to frequent code
changes, tight deadlines, etc.; therefore, it is important to have tools to
find these errors. One way to identify bugs is to analyze the characteristics
of buggy source code elements from the past and to predict, based on the same
characteristics, whether present elements are buggy, using, e.g., machine
learning models. To support model building, code elements and their
characteristics are collected in so-called bug datasets which serve as the
input for learning.
We present the \emph{BugHunter Dataset}: a novel kind of automatically
constructed and freely available bug dataset containing code elements (files,
classes, methods) with a wide set of code metrics and bug information. Other
available bug datasets follow the traditional approach of gathering the
characteristics of all source code elements (buggy and non-buggy) at one or a
few pre-selected release versions of the code. Our approach, on the other
hand, captures the buggy and the fixed states of the same source code elements
from the narrowest timeframe we can identify for a bug's presence, regardless
of release versions. To show the usefulness of the new dataset, we built and
evaluated bug prediction models and achieved F-measure values over 0.74.
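As a rough illustration of the model-building workflow such a dataset supports, here is a minimal Python sketch that trains a classifier on code metrics and evaluates it with the F-measure. The file name and metric columns are invented for illustration and are not the actual BugHunter schema.

```python
# Hypothetical sketch of the bug-prediction setup described above:
# train a classifier on static code metrics to flag buggy methods.
# The file name and columns ("loc", ..., "buggy") are illustrative,
# not the actual BugHunter schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Each row is one code element (file/class/method) with its metrics
# and a label indicating whether a bug was later fixed in it.
data = pd.read_csv("method_level_dataset.csv")        # hypothetical file
X = data[["loc", "complexity", "coupling", "churn"]]  # assumed metric columns
y = data["buggy"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# F-measure on held-out elements, analogous to the paper's evaluation.
print("F1:", f1_score(y_test, model.predict(X_test)))
```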
Massive Multi-Agent Data-Driven Simulations of the GitHub Ecosystem
Simulating and predicting planetary-scale techno-social systems poses heavy
computational and modeling challenges. The DARPA SocialSim program set the
challenge to model the evolution of GitHub, a large collaborative
software-development ecosystem, using massive multi-agent simulations. We
describe our best performing models and our agent-based simulation framework,
which we are currently extending to allow simulating other planetary-scale
techno-social systems. The challenge problem measured participants' ability,
given 30 months of metadata on user activity on GitHub, to predict the next
months' activity as measured by a broad range of metrics applied to ground
truth, using agent-based simulation. The challenge required scaling to a
simulation of roughly 3 million agents producing a combined 30 million actions,
acting on 6 million repositories, on commodity hardware. It was also important
to use the data optimally to predict the agents' next moves. We describe the
agent framework and the data analysis employed by one of the winning teams in
the challenge. Six different agent models were tested based on a variety of
machine learning and statistical methods. While no single method proved the
most accurate on every metric, the most broadly successful models sampled from
a stationary probability distribution of actions and repositories for each agent.
Two reasons for the success of these agents were their distinct
characterization of each individual user, and the fact that GitHub users change
their behavior relatively slowly.
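A minimal sketch of that winning idea, not the authors' actual framework: each agent draws its next (action, repository) pair from a stationary distribution estimated from its own activity history. The event names below are illustrative.

```python
# Minimal sketch (not the authors' framework): each agent samples its
# next (action, repository) pair from an empirical, per-agent
# stationary distribution estimated from its observed history.
import random
from collections import Counter

class GitHubAgent:
    def __init__(self, history):
        # history: list of (action, repo) events observed for this user,
        # e.g. [("PushEvent", "org/repo1"), ("IssuesEvent", "org/repo2")]
        counts = Counter(history)
        total = sum(counts.values())
        self.events = list(counts)
        self.weights = [c / total for c in counts.values()]

    def step(self):
        # Draw the next move from the fixed (stationary) distribution;
        # this exploits the observation that users change behavior slowly.
        return random.choices(self.events, weights=self.weights, k=1)[0]

agent = GitHubAgent([("PushEvent", "a/x")] * 8 + [("ForkEvent", "b/y")] * 2)
print([agent.step() for _ in range(5)])
```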
The effects of change decomposition on code review -- a controlled experiment
Background: Code review is a cognitively demanding and time-consuming
process. Previous qualitative studies hinted at how decomposing change sets
into multiple yet internally coherent ones would improve the reviewing process.
So far, the literature has provided no quantitative analysis of this hypothesis.
Aims: (1) Quantitatively measure the effects of change decomposition on the
outcome of code review (in terms of the number of defects found, wrongly reported
issues, suggested improvements, time, and understanding); (2) Qualitatively
analyze how subjects approach the review and navigate the code, building
knowledge and addressing existing issues, in large vs. decomposed changes.
Method: Controlled experiment using the pull-based development model
involving 28 software developers, both professionals and graduate students.
Results: Change decomposition leads to fewer wrongly reported issues,
influences how subjects approach and conduct the review activity (by increasing
context-seeking), yet affects neither the understanding of the change rationale
nor the number of defects found.
Conclusions: Change decomposition not only reduces the noise for subsequent
data analyses but also significantly supports the tasks of the developers in
charge of reviewing the changes. As such, commits belonging to different
concepts should be kept separate, and this should be adopted as a best practice
in software engineering.
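For illustration only, a sketch of how the outcome of such a controlled experiment could be compared across conditions with a nonparametric test; the per-subject counts below are made up and are not the paper's data.

```python
# Illustrative analysis sketch (invented numbers, not the paper's data):
# compare the count of wrongly reported issues between the large-change
# and decomposed-change groups with a nonparametric test.
from scipy.stats import mannwhitneyu

wrong_issues_large      = [4, 3, 5, 2, 4, 3, 6]   # hypothetical per-subject counts
wrong_issues_decomposed = [1, 2, 1, 0, 2, 1, 1]

stat, p = mannwhitneyu(wrong_issues_large, wrong_issues_decomposed,
                       alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
```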
The Role of Data Filtering in Open Source Software Ranking and Selection
Faced with over 100M open source projects, most empirical investigations
select a subset. Most research papers in leading venues filter projects by
some measure of popularity, with explicit or implicit arguments that unpopular
projects are not of interest, may not even represent "real" software projects,
or are simply not worthy of study. However, such filtering may have enormous
effects on the results of a study if, and precisely because, the sought-out
response or prediction is in any way related to the filtering criteria.
We exemplify the impact of this practice on research outcomes: how filtering
of projects listed on GitHub affects the assessment of their popularity. We
randomly sample over 100,000 repositories and use multiple regression to model
the number of stars (a proxy for popularity) based on the number of commits,
the duration of the project, the number of authors, and the number of core
developers. Comparing a control model fit on the entire dataset with a model
fit on a filtered sample (projects having ten or more authors), we find that,
while certain characteristics of the repository consistently predict
popularity, the filtering process significantly alters the relationships
between these characteristics and the
response. The number of commits exhibited a positive correlation with
popularity in the control sample but showed a negative correlation in the
filtered sample. These findings highlight the potential biases introduced by
data filtering and emphasize the need for careful sample selection in empirical
research of mining software repositories. We recommend that empirical work
should either analyze complete datasets such as World of Code, or employ
stratified random sampling from a complete dataset to ensure that filtering is
not biasing the results.
Comment: International Workshop on Methodological Issues with Empirical
Studies in Software Engineering (WSESE 2024)
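A hedged sketch of the comparison described above: fit the same regression of stars on repository characteristics once on the full sample and once on the filtered subsample (ten or more authors), then inspect how the coefficients change. The file and column names are assumptions, not the paper's exact variables.

```python
# Hedged sketch of full-sample vs. filtered-sample regression; the
# CSV and column names are assumptions, not the paper's variables.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

repos = pd.read_csv("github_sample.csv")  # hypothetical sampled repositories

formula = ("np.log1p(stars) ~ np.log1p(commits) + np.log1p(duration_days) + "
           "np.log1p(authors) + np.log1p(core_devs)")

full_model     = smf.ols(formula, data=repos).fit()
filtered_model = smf.ols(formula, data=repos[repos.authors >= 10]).fit()

# The sign of the commits coefficient can differ between the two fits,
# mirroring the positive-vs-negative correlation reported above.
print(full_model.params["np.log1p(commits)"],
      filtered_model.params["np.log1p(commits)"])
```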