Search CORE

4 research outputs found

A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

Author: Aghamohammadi Alireza
Ahmadabadi Matin Nili
Aktas Ethem Utku
Alam Omar
Albrecht Ella
Aldaeej Abdullah
Amit Idan
Bossenmaier Tim
Chahal Kuljit Kaur
Chakroborti Debasish
Colomo-Palacios Ricardo
Davis James
Davis Willard
Eismann Simon
Erbel Johannes
Fard Fatemeh
Ghaleb Taher Ahmed
Henley Austin Z.
Herbold Steffen
Hoy Nathaniel
Kourtzanidis Stratos
Ledel Benjamin
Lenarduzzi Valentina
Madeja Matej
Makedonski Philip
Malavolta Ivano
Marcilio Diego
Nagaria Bhaveet
Pashchenko Ivan
Qin Yihao
Rodríguez-Pérez Gema
Serebrenik Alexander
Shamasbi Simin Maleki
Singh Paramvir
Spieker Helge
Strüber Daniel
Sulir Matus
Szabados Kristof
Trautsch Alexander
Treude Christoph
Turhan Burak
Tuzun Eray
Verdecchia Roberto
Walunj Vijay
Wang Shangwen
Wickert Anna-Katharina
Wu Hongjun
Wyrich Marvin
Publication date: 01/01/2021

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowdsourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise. Comment: Status: Accepted at Empirical Software Engineerin
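The consensus rule stated in the abstract (four labels per line, consensus when at least three agree) can be sketched as follows. This is a minimal illustration only, not the authors' tooling: the function name `consensus_label` and the label strings are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agree: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_agree` participants, else None.

    Assumption: each changed line carries one label per participant
    (four in the study described above). Label names are hypothetical.
    """
    if not labels:
        return None
    # Find the most frequent label and how many participants chose it.
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


# Three of four participants agree, so the line has a consensus label.
agreed = consensus_label(["bugfix", "bugfix", "bugfix", "test"])
# A two-versus-two split gives no consensus.
split = consensus_label(["bugfix", "bugfix", "test", "test"])
```

With four labels and a threshold of three, at most one label can reach the threshold, so ties between equally frequent labels never produce a consensus.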

arXiv.org e-Print Archive

University of Oulu Repository - Jultika

Monash University Research Portal

Power Harvesting Practices and Technology Gaps for Sensor Networks

Author: Andrews Jr William H
Clayton Dwight A
Lenarduzzi Roberto
Publication venue: Office of Scientific and Technical Information (OSTI)
Publication date: 01/09/2012

UNT Digital Library

Hot Water Distribution System Program Documentation and Comparison to Experimental Data

Author: Baskin Evelyn
Craddick William G
Lenarduzzi Roberto
Wendt Robert L
Woodbury Professor Keith A.
Publication venue: Building Technologies Research and Integration Center
Publication date: 01/09/2007

In 2003, the California Energy Commission's (CEC's) Public Interest Energy Research (PIER) program funded Oak Ridge National Laboratory (ORNL) to create a computer program to analyze hot water distribution systems for single family residences, and to perform such analyses for a selection of houses. This effort and its results were documented in a report provided to CEC in March 2004 [1]. The principal objective of this effort was to compare the water and energy wasted between various possible hot water distribution systems for various house designs. It was presumed that water being provided to a user would be considered suitably warm when it reached 105 °F. Therefore, what was needed was a tool which could compute the time it takes for water reaching the draw point to reach 105 °F, and the energy wasted during this wait. The computer program used to perform the analyses was a combination of a calculational core, produced by Dr. Keith A. Woodbury, Professor of Mechanical Engineering and Director, Alabama Industrial Assessment Center, University of Alabama, and a user interface based on LabVIEW, created by Dr. Roberto Lenarduzzi of ORNL. At that time, the computer program was in a relatively rough and undocumented form, adequate to perform the contracted work but not in a condition where it could be readily used by those not involved in its generation. Subsequently, the CEC provided funding through Lawrence Berkeley National Laboratory (LBNL) to improve the program's documentation and user interface to facilitate use by others, and to compare the program's results to experimental data generated by Dr. Carl Hiller. This report describes the program and provides user guidance. It also summarizes the comparisons made to experimental data, along with options built into the program specifically to allow these comparisons. These options were necessitated by the fact that some of the experimental data required options and features not originally included in the program. A more detailed description of these program modifications, along with detailed comparisons to the experimental data, is provided in a report produced by Dr. Woodbury, which accompanies this report as Appendix H.

UNT Digital Library

A fine-grained data set and analysis of tangling in bug fixing commits

Author: Aghamohammadi A. (Alireza)
Ahmadabadi M. N. (Matin Nili)
Aktas E. U. (Ethem Utku)
Alam O. (Omar)
Albrecht E. (Ella)
Aldaeej A. (Abdullah)
Amit I. (Idan)
Bossenmaier T. (Tim)
Chahal K. K. (Kuljit Kaur)
Chakroborti D. (Debasish)
Colomo-Palacios R. (Ricardo)
Davis J. (James)
Davis W. (Willard)
Eismann S. (Simon)
Erbel J. (Johannes)
Fard F. (Fatemeh)
Ghaleb T. A. (Taher A.)
Henley A. Z. (Austin Z.)
Herbold S. (Steffen)
Hoy N. (Nathaniel)
Kourtzanidis S. (Stratos)
Ledel B. (Benjamin)
Lenarduzzi V. (Valentina)
Madeja M. (Matej)
Makedonski P. (Philip)
Malavolta I. (Ivano)
Marcilio D. (Diego)
Nagaria B. (Bhaveet)
Pashchenko I. (Ivan)
Qin Y. (Yihao)
Rodríguez-Pérez G. (Gema)
Serebrenik A. (Alexander)
Shamasbi S. M. (Simin Maleki)
Singh P. (Paramvir)
Spieker H. (Helge)
Strüber D. (Daniel)
Sulír M. (Matúš)
Szabados K. (Kristof)
Trautsch A. (Alexander)
Treude C. (Christoph)
Turhan B. (Burak)
Tuzun E. (Eray)
Verdecchia R. (Roberto)
Walunj V. (Vijay)
Wang S. (Shangwen)
Wickert A.-K. (Anna-Katharina)
Wu H. (Hongjun)
Wyrich M. (Marvin)
Publication venue: Springer Nature
Publication date: 01/01/2022

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objectives: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowdsourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusions: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.

University of Oulu Repository - Jultika