The Technical Debt Dataset
Technical Debt analysis is growing in popularity, as researchers and industry
alike adopt static code analysis tools to evaluate the quality of their code.
Despite this, empirical studies on software projects
are expensive because of the time needed to analyze the projects. In addition,
the results are difficult to compare as studies commonly consider different
projects. In this work, we propose the Technical Debt Dataset, a curated set of
project measurement data from 33 Java projects from the Apache Software
Foundation. In the Technical Debt Dataset, we analyzed all commits from
separately defined time frames with SonarQube to collect Technical Debt
information and with Ptidej to detect code smells. Moreover, we extracted all
available commit information from the git logs, the refactorings applied
(mined with Refactoring Miner), and the fault information reported in the issue trackers (Jira).
Using this information, we executed the SZZ algorithm to identify the
fault-inducing and -fixing commits. We analyzed 78K commits from the selected
33 projects, detecting 1.8M SonarQube issues, 38K code smells, 28K faults and
57K refactorings. The project analysis took more than 200 days. In this paper,
we describe the data retrieval pipeline together with the tools used for the
analysis. The dataset is made available through CSV files and an SQLite
database to facilitate queries on the data. The Technical Debt Dataset aims to
open up diverse opportunities for Technical Debt research, enabling researchers
to compare results on common projects.
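The dataset ships as CSV files and an SQLite database to facilitate queries. A query against it might look like the following minimal sketch; note that the table and column names here are invented for illustration and are not the dataset's actual schema:

```python
import sqlite3

# Hypothetical schema mimicking the dataset's commit table; the real
# Technical Debt Dataset schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE commits (
    sha TEXT PRIMARY KEY, project TEXT, is_fault_inducing INTEGER)""")
conn.executemany("INSERT INTO commits VALUES (?, ?, ?)", [
    ("a1", "commons-io", 1),
    ("b2", "commons-io", 0),
    ("c3", "ant", 1),
])
# Count fault-inducing commits (as flagged by SZZ) per project.
rows = conn.execute("""
    SELECT project, COUNT(*) FROM commits
    WHERE is_fault_inducing = 1
    GROUP BY project ORDER BY project""").fetchall()
print(rows)  # [('ant', 1), ('commons-io', 1)]
```

Keeping the data in SQLite means cross-project comparisons reduce to simple GROUP BY queries rather than re-running the analysis tools.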
Profiling Developers Through the Lens of Technical Debt
Context: Technical Debt needs to be managed to avoid disastrous consequences,
and investigating developers' habits concerning technical debt management is
invaluable information in software development. Objective: This study aims to
characterize how developers manage technical debt based on the code smells they
induce and the refactorings they apply. Method: We mined a publicly-available
Technical Debt dataset for Git commit information, code smells, coding
violations, and refactoring activities for each developer of a selected
project. Results: By combining this information, we profile developers to
recognize prolific coders, highlight activities that discriminate among
developer roles (reviewer, lead, architect), and estimate coding maturity and
technical debt tolerance.
Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?
Upon evolving their software, organizations and individual developers have to
spend a substantial effort to pay back technical debt, i.e., the fact that
software is released in a shape not as good as it should be, e.g., in terms of
functionality, reliability, or maintainability. This paper empirically
investigates the extent to which technical debt can be automatically paid back
by neural-based generative models, and in particular models exploiting
different strategies for pre-training and fine-tuning. We start by extracting a
dataset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595
open-source projects. SATD refers to technical debt instances documented (e.g.,
via code comments) by developers. We use this dataset to experiment with seven
different generative deep learning (DL) model configurations. Specifically, we
compare transformers pre-trained and fine-tuned with different combinations of
training objectives, including the fixing of generic code changes, SATD
removals, and SATD-comment prompt tuning. Also, we investigate the
applicability in this context of a recently-available Large Language Model
(LLM)-based chat bot. Results of our study indicate that the automated
repayment of SATD is a challenging task, with the best model we experimented
with able to automatically fix ~2% to 8% of test instances, depending on the
number of attempts it is allowed to make. Given the limited size of the
fine-tuning dataset (~5k instances), the model's pre-training plays a
fundamental role in boosting performance. Also, the ability to remove SATD
steadily drops if the comment documenting the SATD is not provided as input to
the model. Finally, we found general-purpose LLMs to not be a competitive
approach for addressing SATD.
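SATD is, by definition, debt that developers document themselves in comments, so mining it typically starts from keyword patterns. A minimal sketch, with an illustrative keyword list that is not the paper's actual extraction procedure:

```python
import re

# Illustrative SATD keyword patterns; real SATD classifiers use richer
# pattern sets or trained models.
SATD_PATTERN = re.compile(
    r"\b(TODO|FIXME|HACK|workaround|temporary fix)\b", re.I)

def is_satd(comment: str) -> bool:
    """Flag a code comment as self-admitted technical debt."""
    return bool(SATD_PATTERN.search(comment))

comments = [
    "// TODO: handle null input properly",
    "// computes the checksum",
    "# hack: retry twice because the API is flaky",
]
flagged = [c for c in comments if is_satd(c)]
print(len(flagged))  # 2
```

The flagged comment is exactly what the study feeds to the models as extra input; dropping it, as the results show, makes automated repayment markedly harder.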
Does Technical Debt Lead to the Rejection of Pull Requests?
Technical Debt is a term used to classify non-optimal solutions during
software development. These solutions cause several maintenance problems and
hence they should be avoided or at least documented. While a considerable
number of studies focus on the identification of Technical Debt in general,
we focus on identifying Technical Debt in pull requests.
Specifically, we conduct an investigation to reveal the different types of
Technical Debt that can lead to the rejection of pull requests. From the
analysis of 1,722 pull requests, we classify Technical Debt into seven
categories, namely design, documentation, test, build, project convention, performance, and
security debt. Our results indicate that the most common category of Technical
Debt is design with 39.34%, followed by test with 23.70% and project convention
with 15.64%. We also note that the type of Technical Debt influences the size
of pull request discussions; e.g., security and project convention debts
instigate more discussion than the other types.
Comment: Accepted at the Brazilian Symposium on Information Systems (SBSI), p. 1-7, 201
Feature Selection of Post-Graduation Income of College Students in the United States
This study investigated the most important attributes of the 6-year
post-graduation income of college graduates who used financial aid during their
time at college in the United States. The latest data released by the United
States Department of Education was used. Specifically, 1,429 cohorts of
graduates from three years (2001, 2003, and 2005) were included in the data
analysis. Three attribute selection methods, including filter methods, forward
selection, and Genetic Algorithm, were applied to the attribute selection from
30 relevant attributes. Five groups of machine learning algorithms were applied
to the dataset for classification using the best selected attribute subsets.
Based on our findings, we discuss the role of neighborhood professional degree
attainment, parental income, SAT scores, and family college education in
post-graduation incomes and the implications for social stratification.Comment: 14 pages, 6 tables, 3 figure
Sentiment Classification using N-gram IDF and Automated Machine Learning
We propose a sentiment classification method with a general machine learning
framework. For feature representation, n-gram IDF is used to extract
software-engineering-related, dataset-specific, positive, neutral, and negative
n-gram expressions. For classifiers, an automated machine learning tool is
used. In the comparison using publicly available datasets, our method achieved
the highest F1 values in positive and negative sentences on all datasets.
Comment: 4 pages, IEEE Software
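The core of the feature representation is weighting n-grams by their inverse document frequency, so rare, dataset-specific expressions stand out. A simplified sketch using plain document-frequency IDF (the paper's n-gram IDF weighting is more involved):

```python
import math
from collections import Counter

# Toy corpus standing in for software-engineering text such as reviews.
docs = [
    "this library works great",
    "this build fails again",
    "great job , works as expected",
]

def ngrams(text, n):
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

N = len(docs)
df = Counter()
for d in docs:
    # Count each unigram/bigram once per document (document frequency).
    df.update(set(ngrams(d, 1)) | set(ngrams(d, 2)))

idf = {g: math.log(N / df_g) for g, df_g in df.items()}
# Rare n-grams (df=1) get the highest weight, common ones a lower one.
print(idf["works"] < idf["build fails"])  # True
```

These weighted n-gram features are then handed to an automated machine learning tool, which handles classifier choice and hyperparameter tuning.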
Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes
Databases are widespread, yet extracting relevant data can be difficult.
Without substantial domain knowledge, multivariate search queries often return
sparse or uninformative results. This paper introduces an approach for
searching structured data based on probabilistic programming and nonparametric
Bayes. Users specify queries in a probabilistic language that combines standard
SQL database search operators with an information theoretic ranking function
called predictive relevance. Predictive relevance can be calculated by a fast
sparse matrix algorithm based on posterior samples from CrossCat, a
nonparametric Bayesian model for high-dimensional, heterogeneously-typed data
tables. The result is a flexible search technique that applies to a broad class
of information retrieval problems, which we integrate into BayesDB, a
probabilistic programming platform for probabilistic data analysis. This paper
demonstrates applications to databases of US colleges, global macroeconomic
indicators of public health, and classic cars. We found that human evaluators
often prefer the results from probabilistic search to results from a standard
baseline.
Estimating Refactoring Efforts for Architecture Technical Debt
Paying off Architectural Technical Debt by refactoring flawed code is important to control the debt and keep it as low as possible. Project managers tend to delay paying off this debt because they face difficulties in weighing the cost of refactoring against the benefits gained. To decide whether to refactor or to postpone, managers need to estimate the cost and effort required to conduct these refactoring activities, as well as to determine which flaws should be refactored first.
Our research is based on a dataset used by other researchers in the technical debt field. It includes more than 18,000 refactoring operations performed on 33 Apache Java projects. To estimate the refactoring effort expended, we applied the COCOMO II.2000 model to calculate the refactoring cost in person-month units per release. Furthermore, we investigated the correlation between the refactoring effort and two static code metrics of the refactored code, namely LOC and complexity. The research revealed a moderate correlation between the refactoring effort and each of project size and code complexity. Finally, we applied the DesigniteJava tool and machine learning practices to verify our results. From the analysis, we found a significant correlation between the ranking of architecture smells and the ranking of refactoring effort for each package.
Using machine learning practices, we took the architecture smell levels and the code metrics of each release as input to predict the refactoring effort levels of the next release. With this model, we were able to predict the higher refactoring cost levels with 93% accuracy.
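The COCOMO II effort equation the study applies can be sketched as follows. The scale-factor values below are the model's published nominal ratings and the effort multipliers are placeholders; neither comes from the paper's own calibration:

```python
# COCOMO II.2000 post-architecture effort equation:
#   PM = A * Size^E * prod(EM),  with  E = B + 0.01 * sum(SF)
A, B = 2.94, 0.91  # COCOMO II.2000 calibration constants

def cocomo_effort(ksloc, scale_factors, effort_multipliers):
    """Return estimated effort in person-months for `ksloc` KSLOC."""
    E = B + 0.01 * sum(scale_factors)
    pm = A * ksloc ** E
    for em in effort_multipliers:
        pm *= em
    return pm

# Nominal ratings for the five scale factors; all multipliers left at 1.0.
pm = cocomo_effort(10.0, [3.72, 3.04, 4.24, 3.29, 4.68], [1.0] * 7)
print(round(pm, 1))
```

Applied per release, the size of the refactored code (in KSLOC) drives the estimate, which is why the study probes the correlation between effort and LOC in the first place.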
Does Infrastructure Investment Lead to Economic Growth or Economic Fragility? Evidence from China
The prevalent view in the economics literature is that a high level of
infrastructure investment is a precursor to economic growth. China is
especially held up as a model to emulate. Based on the largest dataset of its
kind, this paper punctures the twin myths that, first, infrastructure creates
economic value, and, second, China has a distinct advantage in its delivery.
Far from being an engine of economic growth, the typical infrastructure
investment fails to deliver a positive risk adjusted return. Moreover, China's
track record in delivering infrastructure is no better than that of rich
democracies. Where investments are debt-financed, overinvesting in unproductive
projects results in the buildup of debt, monetary expansion, instability in
financial markets, and economic fragility, exactly as we see in China today. We
conclude that poorly managed infrastructure investments are a main explanation
of surfacing economic and financial problems in China. We predict that, unless
China shifts to a lower level of higher-quality infrastructure investments, the
country is headed for an infrastructure-led national financial and economic
crisis, which is likely also to be a crisis for the international economy.
China's infrastructure investment model is not one to follow for other
countries but one to avoid.
From Academia to Software Development: Publication Citations in Source Code Comments
Academic publications have been evaluated in terms of their impact on
research communities based on many metrics, such as the number of citations. On
the other hand, the impact of academic publications on industry has been rarely
studied. This paper investigates how academic publications contribute to
software development by analyzing publication citations in source code comments
in open source software repositories. We propose an automated approach for
detecting academic publications based on Named Entity Recognition, achieving
a detection accuracy of 0.90. We conduct a large-scale study of
publication citations with 319,438,977 comments collected from 25,925 active
repositories written in seven programming languages. Our findings indicate that
academic publications can be knowledge sources for software development. These
referenced publications are particularly from journals. In terms of knowledge
transfer, algorithm is the most prevalent type of knowledge transferred from
the publications, with proposed formulas or equations typically implemented in
methods or functions in source code files. In a closer look at GitHub
repositories referencing academic publications, we find that science-related
repositories are the most frequent among GitHub repositories with publication
citations, and that the vast majority of these publications are referenced by
repository owners who are different from the publication authors. We also find
that referencing older publications can lead to potential issues related to
obsolete knowledge.
Comment: 33 pages
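Before any Named Entity Recognition, candidate citations in comments can be narrowed down with simple identifier patterns. A rough sketch, where both the patterns and the sample comments are illustrative rather than the paper's actual pipeline:

```python
import re

# Illustrative DOI/arXiv patterns; the paper's NER-based detector also
# recognizes free-form title/author citations these patterns would miss.
CITATION = re.compile(
    r"(doi\.org/\S+|arXiv:\d{4}\.\d{4,5}|10\.\d{4,9}/\S+)", re.I)

comments = [
    "# Implements the algorithm from https://doi.org/10.1000/xyz123",
    "# see arXiv:1706.03762 for the attention mechanism",
    "# increment the counter",
]
cited = [c for c in comments if CITATION.search(c)]
print(len(cited))  # 2
```

Pattern matching alone misses citations given only as "Author, Title, Venue, Year", which is exactly the gap the NER-based detection closes.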