1,969 research outputs found

    Boa Meets Python: A Boa Dataset of Data Science Software in Python Language

    The popularity of the Python programming language has surged in recent years due to its increasing usage in Data Science. The availability of Python repositories on GitHub presents an opportunity for mining software repository research, e.g., suggesting best practices for developing Data Science applications, identifying bug patterns, recommending code enhancements, etc. To enable this research, we have created a new dataset that includes 1,558 mature GitHub projects that develop Python software for Data Science tasks. By analyzing the metadata and code, we have included projects that use a diverse set of machine learning libraries and are managed by a variety of users and organizations. The dataset is made publicly available through the Boa infrastructure, both as a collection of raw projects and in a processed form that can be used for large-scale analysis in the Boa language. We also present two initial applications to demonstrate the potential of the dataset that could be leveraged by the community.

    JVM-hosted languages: They talk the talk, but do they walk the walk?

    The rapid adoption of non-Java JVM languages is impressive: major international corporations are staking critical parts of their software infrastructure on components built from languages such as Scala and Clojure. However, with the possible exception of Scala, there has been little academic consideration and characterization of these languages to date. In this paper, we examine four non-Java JVM languages and use exploratory data analysis techniques to investigate differences in their dynamic behavior compared to Java. We analyse a variety of programs and levels of behavior to draw distinctions between the different programming languages. We briefly discuss the implications of our findings for improving the performance of JIT compilation and garbage collection on the JVM platform.

    PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection

    Bug datasets are vital for enabling deep learning techniques to address software maintenance tasks related to bugs. However, existing bug datasets suffer from precision and scale limitations: they are either small-scale but precise with manual validation, or large-scale but imprecise with simple commit message processing. In this paper, we introduce PreciseBugCollector, a precise, multi-language bug collection approach that overcomes these two limitations. PreciseBugCollector is based on two novel components: a) a bug tracker that maps codebase repositories to external bug repositories to trace bug type information, and b) a bug injector that generates project-specific bugs by injecting noise into correct codebases and then executing them against their test suites to obtain test failure messages. We implement PreciseBugCollector against three sources: 1) a bug tracker that links to the National Vulnerability Database (NVD) to collect general vulnerabilities, 2) a bug tracker that links to OSS-Fuzz to collect general bugs, and 3) a bug injector based on 16 injection rules to generate project-specific bugs. To date, PreciseBugCollector comprises 1,057,818 bugs extracted from 2,968 open-source projects. Of these, 12,602 bugs are sourced from bug repositories (NVD and OSS-Fuzz), while the remaining 1,045,216 project-specific bugs are generated by the bug injector. Considering the challenge objectives, we argue that a bug injection approach is highly valuable for the industrial setting, since project-specific bugs align with domain knowledge, share the same codebase, and adhere to the coding style employed in industrial projects. Comment: Accepted at the industry challenge track of ASE 202
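    The inject-then-test idea described in this abstract can be illustrated with a minimal Python sketch. It is not PreciseBugCollector's implementation: the two rules, the `clamp` function, and all helper names below are hypothetical, and the "test suite" is a hard-coded list of input/output cases.

```python
# Toy illustration of bug injection: mutate correct code, run its tests,
# and keep only the mutants that produce test failure messages.
# Everything here (rules, function names, test cases) is hypothetical.

CORRECT_SOURCE = '''
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value
'''

# Hypothetical injection rules: rule name -> (original text, replacement).
RULES = {
    "swap_less_than": ("value < low", "value > low"),
    "swap_return": ("return high", "return low"),
}


def inject_bug(source, rule_name):
    """Apply one rule to the source; return None if the rule does not match."""
    old, new = RULES[rule_name]
    if old not in source:
        return None
    return source.replace(old, new, 1)


def run_test_suite(source):
    """Execute the source and return a list of failure messages."""
    namespace = {}
    exec(source, namespace)
    clamp = namespace["clamp"]
    cases = [((5, 0, 10), 5), ((-3, 0, 10), 0), ((42, 0, 10), 10)]
    failures = []
    for args, expected in cases:
        actual = clamp(*args)
        if actual != expected:
            failures.append(f"clamp{args}: expected {expected}, got {actual}")
    return failures


def collect_injected_bugs(source):
    """Keep every mutant that fails at least one test, with its messages."""
    records = []
    for rule_name in RULES:
        buggy = inject_bug(source, rule_name)
        if buggy is None:
            continue
        failures = run_test_suite(buggy)
        if failures:
            records.append(
                {"rule": rule_name, "buggy_source": buggy, "failures": failures}
            )
    return records


if __name__ == "__main__":
    for record in collect_injected_bugs(CORRECT_SOURCE):
        print(record["rule"], record["failures"])
```

    Mutants that pass every test are discarded in this sketch, so each kept record pairs a buggy version with at least one concrete failure message.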

    Systems for AutoML Research

    ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair

    With the growing interest in Large Language Models (LLMs) for fault localization and program repair, ensuring the integrity and generalizability of LLM-based methods becomes paramount. The code in existing widely adopted benchmarks for these tasks was written before the rise of LLMs and may be included in the training data of popular LLMs, so these benchmarks are exposed to data leakage, which can lead to misleadingly optimistic performance metrics. To address this issue, we introduce "ConDefects", a novel dataset of real faults meticulously curated to eliminate such overlap. ConDefects contains 1,254 Java faulty programs and 1,625 Python faulty programs. All these programs are sourced from the online competition platform AtCoder and were produced between October 2021 and September 2023. We pair each fault with fault locations and the corresponding repaired code versions, making the dataset tailored for research on fault localization and program repair. We also provide interfaces for selecting subsets based on different time windows and coding task difficulties. While inspired by LLM-based tasks, ConDefects can be adopted for benchmarking all types of fault localization and program repair methods. The dataset is publicly available, and a demo video can be found at https://www.youtube.com/watch?v=22j15Hj5ONk. Comment: 5 pages, 3 figures

    Study of metrics and practices for improving object oriented software quality

    Modern software systems are large and complex products, consisting of thousands of lines of code, developed, often in a distributed environment, by dozens of developers and produced through an industrial process, usually with a short time to market. To manage this kind of complexity and to keep the development process under control, measurements and metrics are required. The present thesis collects the outcomes of the research the author carried out in the field of software metrics during the three years of the Ph.D. studies. Software metrics are used to measure various aspects of software development, including software features, process execution, developers' efforts, and software quality, just to name a few. The first part of the present thesis reports the results of the studies performed on product metrics, with the final goal of helping software engineers better manage programmers' efforts and particularly to assess software quality during software development. The second part of this dissertation presents the outcomes of the research aimed at shedding some light on the effectiveness and impact of some development practices on software systems. To perform these studies I used a novel approach, based on the concept of complex networks. Complex networks are in fact one of the best candidates to represent software systems, enabling researchers to obtain a deeper knowledge of the structure and evolution of a software system. We found some meaningful statistical correlations between network metrics and software properties. Both the theoretical framework and the reported findings might, in principle, have a practical application to assist software engineers dealing with specific development tasks, like bug discovery or refactoring.
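    As a rough illustration of treating a software system as a network, the Python sketch below computes in-degree and out-degree for each class in a small dependency graph. The class names and edges are invented for the example, and the abstract does not say which network metrics the thesis uses; degree is chosen here only because it is the simplest one.

```python
from collections import defaultdict

# Hypothetical dependency edges: (class that depends, class depended upon).
DEPENDENCIES = [
    ("OrderService", "Order"),
    ("OrderService", "Logger"),
    ("InvoiceService", "Order"),
    ("InvoiceService", "Logger"),
    ("Order", "Logger"),
]


def degree_metrics(edges):
    """Return {node: {"in": in_degree, "out": out_degree}} for a directed graph."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        out_degree[source] += 1  # source depends on one more class
        in_degree[target] += 1   # target is used by one more class
        nodes.update((source, target))
    return {
        node: {"in": in_degree[node], "out": out_degree[node]}
        for node in sorted(nodes)
    }


if __name__ == "__main__":
    for name, degrees in degree_metrics(DEPENDENCIES).items():
        print(f"{name}: in={degrees['in']} out={degrees['out']}")
```

    In this toy graph a class with a high in-degree, such as `Logger`, is one that many others depend on, which is the kind of structural property a network metric exposes.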
