BugsInPy: A database of existing bugs in Python programs to enable controlled testing and debugging studies
Lee Kuan Yew Fellowship, Singapore Management University
Boa Meets Python: A Boa Dataset of Data Science Software in Python Language
The popularity of the Python programming language has surged in recent years due to its increasing usage in Data Science. The availability of Python repositories on GitHub presents an opportunity for mining software repository research, e.g., suggesting best practices in developing Data Science applications, identifying bug patterns, recommending code enhancements, etc. To enable this research, we have created a new dataset that includes 1,558 mature GitHub projects that develop Python software for Data Science tasks. By analyzing the metadata and code, we have included projects in our dataset that use a diverse set of machine learning libraries and are managed by a variety of users and organizations. The dataset is made publicly available through the Boa infrastructure both as a collection of raw projects and in a processed form that can be used for performing large-scale analysis using the Boa language. We also present two initial applications to demonstrate the potential of the dataset that could be leveraged by the community.
JVM-hosted languages: They talk the talk, but do they walk the walk?
The rapid adoption of non-Java JVM languages is impressive: major international corporations are staking critical parts of their software infrastructure on components built from languages such as
Scala and Clojure. However, with the possible exception of Scala,
there has been little academic consideration and characterization
of these languages to date. In this paper, we examine four non-Java JVM languages and use exploratory data analysis techniques
to investigate differences in their dynamic behavior compared to
Java. We analyse a variety of programs and levels of behavior to
draw distinctions between the different programming languages.
We briefly discuss the implications of our findings for improving
the performance of JIT compilation and garbage collection on the
JVM platform.
PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection
Bug datasets are vital for enabling deep learning techniques to address
software maintenance tasks related to bugs. However, existing bug datasets
suffer from precision and scale limitations: they are either small-scale but
precise with manual validation or large-scale but imprecise with simple commit
message processing. In this paper, we introduce PreciseBugCollector, a precise,
multi-language bug collection approach that overcomes these two limitations.
PreciseBugCollector is based on two novel components: a) A bug tracker to map
the codebase repositories with external bug repositories to trace bug type
information, and b) A bug injector to generate project-specific bugs by
injecting noise into the correct codebases and then executing them against
their test suites to obtain test failure messages.
We implement PreciseBugCollector against three sources: 1) A bug tracker that
links to the National Vulnerability Database (NVD) to collect general
vulnerabilities, 2) A bug tracker that links to OSS-Fuzz to collect
general bugs, and 3) A bug injector based on 16 injection rules to
generate project-specific bugs. To date, PreciseBugCollector comprises 1,057,818 bugs
extracted from 2,968 open-source projects. Of these, 12,602 bugs are sourced from
bug repositories (NVD and OSS-Fuzz), while the remaining 1,045,216
project-specific bugs are generated by the bug injector. Considering the
challenge objectives, we argue that a bug injection approach is highly valuable
for the industrial setting, since project-specific bugs align with domain
knowledge, share the same codebase, and adhere to the coding style employed in
industrial projects.
Comment: Accepted at the industry challenge track of ASE 202
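The bug-injection idea described in this abstract (mutate correct code, then run the tests to capture a failure message) can be illustrated with a minimal sketch. This is not PreciseBugCollector's implementation: the single rule shown (flipping `<` to `<=`), the class `FlipLessThan`, the helpers `inject_bug` and `run_test`, and the toy function `count_below` are all hypothetical names chosen for illustration, and the abstract does not say what its 16 injection rules are.

```python
import ast


class FlipLessThan(ast.NodeTransformer):
    """Hypothetical injection rule: turn the first `<` comparison into `<=`."""

    def __init__(self):
        self.injected = False

    def visit_Compare(self, node):
        self.generic_visit(node)
        if not self.injected and isinstance(node.ops[0], ast.Lt):
            node.ops[0] = ast.LtE()
            self.injected = True
        return node


def inject_bug(source: str) -> str:
    """Return a mutated copy of `source` with one comparison bug injected."""
    tree = FlipLessThan().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))  # needs Python 3.9+


def run_test(source: str) -> str:
    """Run a toy test against `source`; return 'pass' or the failure message."""
    namespace = {}
    exec(source, namespace)
    try:
        result = namespace["count_below"]([1, 2, 3], 2)
        assert result == 1, f"expected 1, got {result}"
        return "pass"
    except AssertionError as err:
        return str(err)


# Toy "correct codebase" (an assumption for this sketch, not real project code).
CORRECT = """
def count_below(values, limit):
    return sum(1 for v in values if v < limit)
"""

buggy = inject_bug(CORRECT)
print(run_test(CORRECT))  # the unmodified code passes the toy test
print(run_test(buggy))    # the mutated code yields a test failure message
```

Pairing the mutated source with the failure message it triggers gives one project-specific bug record in the spirit of the approach; a real pipeline would run the project's own test suite rather than an inline assertion.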
ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair
With the growing interest in Large Language Models (LLMs) for fault
localization and program repair, ensuring the integrity and generalizability of
the LLM-based methods becomes paramount. The code in existing widely-adopted
benchmarks for these tasks was written before the rise of LLMs and may be
included in the training data of existing popular LLMs, thereby suffering from
the threat of data leakage, leading to misleadingly optimistic performance
metrics. To address this issue, we introduce "ConDefects", a novel dataset of
real faults meticulously curated to eliminate such overlap. ConDefects contains
1,254 Java faulty programs and 1,625 Python faulty programs. All these programs
are sourced from the online competition platform AtCoder and were produced
between October 2021 and September 2023. We pair each fault with fault
locations and the corresponding repaired code versions, making it tailored for
fault localization and program repair related research. We also provide
interfaces for selecting subsets based on different time windows and coding
task difficulties. While inspired by LLM-based tasks, ConDefects can be adopted
for benchmarking ALL types of fault localization and program repair methods.
The dataset is publicly available, and a demo video can be found at
https://www.youtube.com/watch?v=22j15Hj5ONk.
Comment: 5 pages, 3 figures
Study of metrics and practices for improving object oriented software quality
Modern software systems are large and complex products, consisting
of thousands of lines of code, developed, often in a distributed environment,
by dozens of developers and produced through an industrial process, usually
with a short time to market. To manage this kind of complexity and
to keep the development process under control, measurements and metrics
are required. The present thesis collects the outcomes of the research the
author carried out in the field of software metrics during the three years of
the Ph.D. studies. Software metrics are used to measure various aspects
of software development, including software features, process execution,
developers' efforts, and software quality, to name just a few. The first part of
the present thesis reports the results of the studies performed on product
metrics, with the final goal of helping software engineers better manage
programmers' efforts and, in particular, assess software quality during
software development. The second part of this dissertation presents
the outcomes of the research aimed at shedding some light on the effectiveness
and impact of some development practices on software systems.
To perform these studies I used a novel approach, based on the concept
of complex networks. Complex networks are in fact one of the best candidates
to represent software systems, enabling researchers to obtain a
deeper knowledge of the structure and evolution of a software system. We
found some meaningful statistical correlations between network metrics
and software properties. Both the theoretical framework and the reported
findings might, in principle, have a practical application to assist software
engineers dealing with specific development tasks, like bug discovery or
refactoring.