The Technical Debt Dataset
Technical Debt analysis is growing in popularity, as researchers and industry
alike adopt static code analysis tools to evaluate the quality of their code.
Despite this, empirical studies on software projects
are expensive because of the time needed to analyze the projects. In addition,
the results are difficult to compare as studies commonly consider different
projects. In this work, we propose the Technical Debt Dataset, a curated set of
project measurement data from 33 Java projects from the Apache Software
Foundation. In the Technical Debt Dataset, we analyzed all commits from
separately defined time frames with SonarQube to collect Technical Debt
information and with Ptidej to detect code smells. Moreover, we extracted all
available commit information from the git logs, the refactorings applied
(mined with Refactoring Miner), and the fault information reported in the issue trackers (Jira).
Using this information, we executed the SZZ algorithm to identify the
fault-inducing and -fixing commits. We analyzed 78K commits from the selected
33 projects, detecting 1.8M SonarQube issues, 38K code smells, 28K faults and
57K refactorings. The project analysis took more than 200 days. In this paper,
we describe the data retrieval pipeline together with the tools used for the
analysis. The dataset is made available through CSV files and an SQLite
database to facilitate queries on the data. The Technical Debt Dataset aims to
open up diverse opportunities for Technical Debt research, enabling researchers
to compare results on common projects.
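The dataset ships as CSV files and an SQLite database to facilitate queries. A query against it might look like the following minimal sketch; note that the table and column names here are invented for illustration and are not the dataset's actual schema:

```python
import sqlite3

# Hypothetical schema mimicking the dataset's commit table; the real
# Technical Debt Dataset schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE commits (
    sha TEXT PRIMARY KEY, project TEXT, is_fault_inducing INTEGER)""")
conn.executemany("INSERT INTO commits VALUES (?, ?, ?)", [
    ("a1", "commons-io", 1),
    ("b2", "commons-io", 0),
    ("c3", "ant", 1),
])
# Count fault-inducing commits (as flagged by SZZ) per project.
rows = conn.execute("""
    SELECT project, COUNT(*) FROM commits
    WHERE is_fault_inducing = 1
    GROUP BY project ORDER BY project""").fetchall()
print(rows)  # [('ant', 1), ('commons-io', 1)]
```

Keeping the data in SQLite means cross-project comparisons reduce to simple GROUP BY queries rather than re-running the analysis tools.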
Profiling Developers Through the Lens of Technical Debt
Context: Technical Debt needs to be managed to avoid disastrous consequences,
and investigating developers' habits concerning technical debt management is
invaluable information in software development. Objective: This study aims to
characterize how developers manage technical debt based on the code smells they
induce and the refactorings they apply. Method: We mined a publicly-available
Technical Debt dataset for Git commit information, code smells, coding
violations, and refactoring activities for each developer of a selected
project. Results: By combining this information, we profile developers to
recognize prolific coders, highlight activities that discriminate among
developer roles (reviewer, lead, architect), and estimate coding maturity and
technical debt tolerance.
Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?
Upon evolving their software, organizations and individual developers have to
spend a substantial effort to pay back technical debt, i.e., the fact that
software is released in a shape not as good as it should be, e.g., in terms of
functionality, reliability, or maintainability. This paper empirically
investigates the extent to which technical debt can be automatically paid back
by neural-based generative models, and in particular models exploiting
different strategies for pre-training and fine-tuning. We start by extracting a
dataset of 5,039 Self-Admitted Technical Debt (SATD) removals from 595
open-source projects. SATD refers to technical debt instances documented (e.g.,
via code comments) by developers. We use this dataset to experiment with seven
different generative deep learning (DL) model configurations. Specifically, we
compare transformers pre-trained and fine-tuned with different combinations of
training objectives, including the fixing of generic code changes, SATD
removals, and SATD-comment prompt tuning. Also, we investigate the
applicability in this context of a recently-available Large Language Model
(LLM)-based chat bot. Results of our study indicate that the automated
repayment of SATD is a challenging task, with the best model we experimented
with able to automatically fix ~2% to 8% of test instances, depending on the
number of attempts it is allowed to make. Given the limited size of the
fine-tuning dataset (~5k instances), the model's pre-training plays a
fundamental role in boosting performance. Also, the ability to remove SATD
steadily drops if the comment documenting the SATD is not provided as input to
the model. Finally, we found general-purpose LLMs to not be a competitive
approach for addressing SATD.
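SATD is, by definition, debt that developers document themselves in comments, so mining it typically starts from keyword patterns. A minimal sketch, with an illustrative keyword list that is not the paper's actual extraction procedure:

```python
import re

# Illustrative SATD keyword patterns; real SATD classifiers use richer
# pattern sets or trained models.
SATD_PATTERN = re.compile(
    r"\b(TODO|FIXME|HACK|workaround|temporary fix)\b", re.I)

def is_satd(comment: str) -> bool:
    """Flag a code comment as self-admitted technical debt."""
    return bool(SATD_PATTERN.search(comment))

comments = [
    "// TODO: handle null input properly",
    "// computes the checksum",
    "# hack: retry twice because the API is flaky",
]
flagged = [c for c in comments if is_satd(c)]
print(len(flagged))  # 2
```

The flagged comment is exactly what the study feeds to the models as extra input; dropping it, as the results show, makes automated repayment markedly harder.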
Does Technical Debt Lead to the Rejection of Pull Requests?
Technical Debt is a term used to classify non-optimal solutions during
software development. These solutions cause several maintenance problems and
hence they should be avoided or at least documented. While a considerable
number of studies focus on the identification of Technical Debt in general,
we focus on identifying Technical Debt in pull requests.
Specifically, we conduct an investigation to reveal the different types of
Technical Debt that can lead to the rejection of pull requests. From the
analysis of 1,722 pull requests, we classify Technical Debt into seven
categories, namely design, documentation, test, build, project convention, performance, and
security debt. Our results indicate that the most common category of Technical
Debt is design with 39.34%, followed by test with 23.70% and project convention
with 15.64%. We also note that the type of Technical Debt influences the size
of pull request discussions; e.g., security and project convention debts
instigate more discussion than the other types.
Comment: Accepted at the Brazilian Symposium on Information Systems (SBSI), p. 1-7, 201
Feature Selection of Post-Graduation Income of College Students in the United States
This study investigated the most important attributes of the 6-year
post-graduation income of college graduates who used financial aid during their
time at college in the United States. The latest data released by the United
States Department of Education was used. Specifically, 1,429 cohorts of
graduates from three years (2001, 2003, and 2005) were included in the data
analysis. Three attribute selection methods, including filter methods, forward
selection, and Genetic Algorithm, were applied to the attribute selection from
30 relevant attributes. Five groups of machine learning algorithms were applied
to the dataset for classification using the best selected attribute subsets.
Based on our findings, we discuss the role of neighborhood professional degree
attainment, parental income, SAT scores, and family college education in
post-graduation incomes and the implications for social stratification.Comment: 14 pages, 6 tables, 3 figure
Sentiment Classification using N-gram IDF and Automated Machine Learning
We propose a sentiment classification method with a general machine learning
framework. For feature representation, n-gram IDF is used to extract
software-engineering-related, dataset-specific, positive, neutral, and negative
n-gram expressions. For classifiers, an automated machine learning tool is
used. In the comparison using publicly available datasets, our method achieved
the highest F1 values in positive and negative sentences on all datasets.
Comment: 4 pages, IEEE Software
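The core of the feature representation is weighting n-grams by their inverse document frequency, so rare, dataset-specific expressions stand out. A simplified sketch using plain document-frequency IDF (the paper's n-gram IDF weighting is more involved):

```python
import math
from collections import Counter

# Toy corpus standing in for software-engineering text such as reviews.
docs = [
    "this library works great",
    "this build fails again",
    "great job , works as expected",
]

def ngrams(text, n):
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

N = len(docs)
df = Counter()
for d in docs:
    # Count each unigram/bigram once per document (document frequency).
    df.update(set(ngrams(d, 1)) | set(ngrams(d, 2)))

idf = {g: math.log(N / df_g) for g, df_g in df.items()}
# Rare n-grams (df=1) get the highest weight, common ones a lower one.
print(idf["works"] < idf["build fails"])  # True
```

These weighted n-gram features are then handed to an automated machine learning tool, which handles classifier choice and hyperparameter tuning.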
Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes
Databases are widespread, yet extracting relevant data can be difficult.
Without substantial domain knowledge, multivariate search queries often return
sparse or uninformative results. This paper introduces an approach for
searching structured data based on probabilistic programming and nonparametric
Bayes. Users specify queries in a probabilistic language that combines standard
SQL database search operators with an information theoretic ranking function
called predictive relevance. Predictive relevance can be calculated by a fast
sparse matrix algorithm based on posterior samples from CrossCat, a
nonparametric Bayesian model for high-dimensional, heterogeneously-typed data
tables. The result is a flexible search technique that applies to a broad class
of information retrieval problems, which we integrate into BayesDB, a
probabilistic programming platform for probabilistic data analysis. This paper
demonstrates applications to databases of US colleges, global macroeconomic
indicators of public health, and classic cars. We found that human evaluators
often prefer the results from probabilistic search to results from a standard
baseline.
Estimating Refactoring Efforts for Architecture Technical Debt
Paying off Architectural Technical Debt by refactoring flawed code is important to control the debt and keep it as low as possible. Project managers tend to delay paying off this debt because they face difficulties in weighing the cost of refactoring against the benefits gained. To decide whether to refactor or to postpone, managers need to estimate the cost and effort required to conduct these refactoring activities, as well as to determine which flaws should be refactored first.
Our research is based on a dataset used by other researchers in the technical debt field. It includes more than 18,000 refactoring operations performed on 33 Apache Java projects. To estimate the refactoring effort expended, we applied the COCOMO II.2000 model to calculate the refactoring cost in person-month units per release. Furthermore, we investigated the correlation between the refactoring effort and two static code metrics of the refactored code, namely LOC and complexity. The research revealed a moderate correlation between the refactoring effort and each of project size and code complexity. Finally, we applied the DesigniteJava tool and machine learning practices to verify our results. From the analysis, we found a significant correlation between the ranking of architecture smells and the ranking of refactoring effort for each package.
Using machine learning practices, we took the architecture smell levels and the code metrics of each release as input to predict the refactoring effort levels of the next release. With this model, we were able to predict the higher refactoring cost levels with 93% accuracy.
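The COCOMO II effort equation the study applies can be sketched as follows. The scale-factor values below are the model's published nominal ratings and the effort multipliers are placeholders; neither comes from the paper's own calibration:

```python
# COCOMO II.2000 post-architecture effort equation:
#   PM = A * Size^E * prod(EM),  with  E = B + 0.01 * sum(SF)
A, B = 2.94, 0.91  # COCOMO II.2000 calibration constants

def cocomo_effort(ksloc, scale_factors, effort_multipliers):
    """Return estimated effort in person-months for `ksloc` KSLOC."""
    E = B + 0.01 * sum(scale_factors)
    pm = A * ksloc ** E
    for em in effort_multipliers:
        pm *= em
    return pm

# Nominal ratings for the five scale factors; all multipliers left at 1.0.
pm = cocomo_effort(10.0, [3.72, 3.04, 4.24, 3.29, 4.68], [1.0] * 7)
print(round(pm, 1))
```

Applied per release, the size of the refactored code (in KSLOC) drives the estimate, which is why the study probes the correlation between effort and LOC in the first place.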
Does Infrastructure Investment Lead to Economic Growth or Economic Fragility? Evidence from China
The prevalent view in the economics literature is that a high level of
infrastructure investment is a precursor to economic growth. China is
especially held up as a model to emulate. Based on the largest dataset of its
kind, this paper punctures the twin myths that, first, infrastructure creates
economic value, and, second, China has a distinct advantage in its delivery.
Far from being an engine of economic growth, the typical infrastructure
investment fails to deliver a positive risk adjusted return. Moreover, China's
track record in delivering infrastructure is no better than that of rich
democracies. Where investments are debt-financed, overinvesting in unproductive
projects results in the buildup of debt, monetary expansion, instability in
financial markets, and economic fragility, exactly as we see in China today. We
conclude that poorly managed infrastructure investments are a main explanation
of surfacing economic and financial problems in China. We predict that, unless
China shifts to a lower level of higher-quality infrastructure investments, the
country is headed for an infrastructure-led national financial and economic
crisis, which is likely also to be a crisis for the international economy.
China's infrastructure investment model is not one to follow for other
countries but one to avoid.
From Academia to Software Development: Publication Citations in Source Code Comments
Academic publications have been evaluated in terms of their impact on
research communities based on many metrics, such as the number of citations. On
the other hand, the impact of academic publications on industry has been rarely
studied. This paper investigates how academic publications contribute to
software development by analyzing publication citations in source code comments
in open source software repositories. We propose an automated approach for
detecting academic publications based on Named Entity Recognition, achieving
a detection accuracy of 0.90. We conduct a large-scale study of
publication citations with 319,438,977 comments collected from 25,925 active
repositories written in seven programming languages. Our findings indicate that
academic publications can be knowledge sources for software development. These
referenced publications are particularly from journals. In terms of knowledge
transfer, algorithm is the most prevalent type of knowledge transferred from
the publications, with proposed formulas or equations typically implemented in
methods or functions in source code files. In a closer look at GitHub
repositories referencing academic publications, we find that science-related
repositories are the most frequent among GitHub repositories with publication
citations, and that the vast majority of these publications are referenced by
repository owners who are different from the publication authors. We also find
that referencing older publications can lead to potential issues related to
obsolete knowledge.
Comment: 33 pages
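Before any Named Entity Recognition, candidate citations in comments can be narrowed down with simple identifier patterns. A rough sketch, where both the patterns and the sample comments are illustrative rather than the paper's actual pipeline:

```python
import re

# Illustrative DOI/arXiv patterns; the paper's NER-based detector also
# recognizes free-form title/author citations these patterns would miss.
CITATION = re.compile(
    r"(doi\.org/\S+|arXiv:\d{4}\.\d{4,5}|10\.\d{4,9}/\S+)", re.I)

comments = [
    "# Implements the algorithm from https://doi.org/10.1000/xyz123",
    "# see arXiv:1706.03762 for the attention mechanism",
    "# increment the counter",
]
cited = [c for c in comments if CITATION.search(c)]
print(len(cited))  # 2
```

Pattern matching alone misses citations given only as "Author, Title, Venue, Year", which is exactly the gap the NER-based detection closes.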