14 research outputs found

    Code similarity and clone search in large-scale source code data

    Software development has benefited tremendously from the Internet through online code corpora that enable instant sharing of source code, developer guides, and documentation. Nowadays, duplicated code (i.e., code clones) exists not only within or across software projects but also between online code repositories and websites. We call these "online code clones." Like classic code clones between software systems, they can lead to license violations, bug propagation, and reuse of outdated code. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them have become outdated or violate their original licenses and are potentially harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework called OCD for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boilerplate code. We also found that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique using multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to that of seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse.
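
    The token-based clone search that Siamese performs can be illustrated with a small sketch. The snippet below is an illustration only, not the Siamese implementation: the tokenizer, the identifier normalisation, and the n-gram size are simplifying assumptions, and Siamese's multiple code representations and index-based scaling are omitted.

        # Minimal sketch of token-based code similarity in the spirit of clone
        # search tools such as Siamese. Illustrative only: the tokenizer, the
        # identifier normalisation, and the n-gram size are all assumptions.
        import re

        KEYWORDS = {"for", "while", "if", "else", "int", "return"}

        def tokenize(code):
            # Split into tokens; map identifiers to a placeholder so that
            # renamed variables (Type-2 clones) still match.
            tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
            return [t if t in KEYWORDS or not re.fullmatch(r"[A-Za-z_]\w*", t)
                    else "ID" for t in tokens]

        def ngrams(tokens, n=4):
            return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

        def similarity(code_a, code_b, n=4):
            # Jaccard overlap of token n-grams: 1.0 for identical token streams.
            a, b = ngrams(tokenize(code_a), n), ngrams(tokenize(code_b), n)
            return len(a & b) / len(a | b) if a | b else 0.0

        # Prints 1.0: the two loops differ only in identifier names.
        print(similarity("for (int i = 0; i < n; i++) sum += a[i];",
                         "for (int j = 0; j < n; j++) total += a[j];"))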

    Autorepairability: A New Software Quality Characteristic

    Lapvikai P., Ragkhitwetsagul C., Choetkiertikul M., et al. Autorepairability: A New Software Quality Characteristic. Proceedings of the 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2024), p. 787 (2024); https://doi.org/10.1109/SANER60148.2024.00085.
    Research on automated program repair (APR) is currently being actively conducted. APR techniques have been applied to many bugs in open-source software, but the probability of a successful fix is not very high. The authors consider that not only should APR techniques be developed, but software systems should also be developed so that their bugs can be easily fixed with APR techniques. In this paper, we propose autorepairability, a new software quality characteristic that shows how effective automated program repair techniques are for a specific code fragment, file, or project. We also show an approach to automatically measure autorepairability from the source code of a target project, and present experimental results on 1,282 Java method pairs. The use of autorepairability enables many lines of research: for example, research on development processes that yield software systems with high autorepairability, and research on refactoring that transforms software with low autorepairability into software with high autorepairability.
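
    The abstract does not detail how autorepairability is measured, but the intuition can be sketched loosely: run an APR tool over a project's known buggy fragments and take the fraction it fixes. The function below is a hypothetical illustration of that intuition, not the paper's metric; run_apr_tool stands in for invoking a real repair tool with its test-based patch validation.

        # Loose illustration of an "autorepairability" score: the share of a
        # project's buggy methods that an APR tool can fix. Hypothetical; the
        # paper's actual measurement approach may differ.
        def autorepairability(buggy_methods, run_apr_tool):
            # run_apr_tool(method) -> True if the tool produced a patch that
            # passes the test suite (placeholder for a real APR pipeline).
            if not buggy_methods:
                return 0.0
            fixed = sum(1 for method in buggy_methods if run_apr_tool(method))
            return fixed / len(buggy_methods)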

    A Taxonomy for Mining and Classifying Privacy Requirements in Issue Reports

    Digital and physical footprints are trails of user activities collected through the use of software applications and systems. As software becomes ubiquitous, protecting user privacy has become challenging. With increasing user privacy awareness and the advent of privacy regulations and policies, there is an emerging need to implement software systems that enhance the protection of personal data processing. However, existing privacy regulations and policies only provide high-level principles, which makes it difficult for software engineers to design and implement privacy-aware systems. In this paper, we develop a taxonomy that provides a comprehensive set of privacy requirements based on two well-established and widely adopted privacy regulations and frameworks: the General Data Protection Regulation (GDPR) and ISO/IEC 29100. These requirements are refined to a level that is implementable and easy for software engineers to understand, thus supporting them in complying with existing regulations and standards. We have also performed a study on how two large open-source software projects (Google Chrome and Moodle) address the privacy requirements in our taxonomy by mining their issue reports. The paper discusses how the collected issues were classified, and presents the findings and insights generated from our study.
    Comment: Submitted to IEEE Transactions on Software Engineering on 23 December 202
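
    As a rough illustration of mapping issue reports to such privacy requirements (the categories and keyword lists below are invented for the example, not drawn from the paper's GDPR/ISO 29100 taxonomy), issues can be matched to requirement categories with simple keyword rules:

        # Rough sketch of classifying issue reports against privacy-requirement
        # categories. Categories and keyword lists are illustrative placeholders.
        TAXONOMY = {
            "user consent": ["consent", "opt-in", "opt out"],
            "data erasure": ["delete account", "erase", "right to be forgotten"],
            "data minimisation": ["collect only", "unnecessary data", "minimise"],
        }

        def classify_issue(issue_text):
            # Return every requirement category whose keywords appear in the issue.
            text = issue_text.lower()
            return [category for category, keywords in TAXONOMY.items()
                    if any(keyword in text for keyword in keywords)]

        # Prints ['user consent', 'data erasure'].
        print(classify_issue("Users cannot erase their history; add a delete "
                             "account option and an opt-in for tracking."))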

    Mining the Characteristics of Jupyter Notebooks in Data Science Projects

    Nowadays, numerous industries have exceptional demand for data science skills, such as data analysis, data mining, and machine learning. The computational notebook (e.g., Jupyter Notebook) is a well-known data science tool adopted in practice. Kaggle and GitHub are two platforms that data science communities use for knowledge sharing, skill practice, and collaboration. While tutorials and guidelines for novice data scientists are available on both platforms, only a small number of Jupyter Notebooks receive high numbers of votes from the community. A high-voted notebook is considered well documented, easy to understand, and reflective of best data science and software engineering practices. In this research, we aim to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and popular Jupyter Notebooks for data science projects on GitHub. We plan to mine and analyse the Jupyter Notebooks on both platforms. We will perform exploratory analytics, data visualization, and feature importance analysis to understand the overall structure of these notebooks and to identify common patterns and best-practice features that separate low-voted from high-voted notebooks. Upon the completion of this research, the discovered insights can serve as training guidelines for aspiring data scientists and machine learning practitioners looking to progress from a novice-ranked Jupyter Notebook on Kaggle to a deployable project on GitHub.
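
    A sketch of the planned analysis, with three invented structural features (the study's actual feature set is what the research aims to discover): .ipynb files are JSON, so simple features can be extracted directly and fed to a random forest whose feature importances indicate which properties separate low-voted from high-voted notebooks.

        # Sketch of notebook mining and feature-importance analysis. The file
        # names, features, and labels are placeholders for a mined corpus.
        import json
        from sklearn.ensemble import RandomForestClassifier

        def notebook_features(path):
            with open(path) as f:
                cells = json.load(f)["cells"]       # .ipynb files are JSON
            code = [c for c in cells if c["cell_type"] == "code"]
            markdown = [c for c in cells if c["cell_type"] == "markdown"]
            return [len(code),                              # code cell count
                    len(markdown) / max(len(cells), 1),     # documentation ratio
                    sum(len(c["source"]) for c in code)]    # total code lines

        # One feature row per notebook; label 1 = high-voted, 0 = low-voted.
        X = [notebook_features(p) for p in ["nb_high.ipynb", "nb_low.ipynb"]]
        y = [1, 0]
        model = RandomForestClassifier(random_state=0).fit(X, y)
        print(model.feature_importances_)  # which features separate the classes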

    BigCloneBench Considered Harmful for Machine Learning


    A taxonomy for mining and classifying privacy requirements in issue reports

    Context: Digital and physical trails of user activities are collected through the use of software applications and systems. As software becomes ubiquitous, protecting user privacy has become challenging. With increasing user privacy awareness and the advent of privacy regulations and policies, there is an emerging need to implement software systems that enhance the protection of personal data processing. However, existing data protection and privacy regulations provide key principles only at a high level, making it difficult for software engineers to design and implement privacy-aware systems. Objective: In this paper, we develop a taxonomy that provides a comprehensive set of privacy requirements based on four well-established personal data protection regulations and privacy frameworks: the General Data Protection Regulation (GDPR), ISO/IEC 29100, the Thailand Personal Data Protection Act (Thailand PDPA), and the Asia-Pacific Economic Cooperation (APEC) privacy framework. Methods: These requirements are extracted, refined, and classified (using the goal-based requirements analysis method) to a level at which they can be mapped to issue reports. We have also performed a study on how two large open-source software projects (Google Chrome and Moodle) address the privacy requirements in our taxonomy by mining their issue reports. Results: The paper discusses how the collected issues were classified, and presents the findings and insights generated from our study. Conclusion: Mining and classifying privacy requirements in issue reports can help organisations become aware of their state of compliance by identifying privacy requirements that have not been addressed in their software projects. The taxonomy also supports tracing the identified privacy requirements back to the regulations, standards, and frameworks with which the software projects have not complied.
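
    The traceability described in the conclusion can be pictured as a small mapping (the requirement names and clause references below are illustrative, not the paper's taxonomy): each requirement records the provisions it derives from, so a requirement with no matching issue reports points back to the regulations at risk of non-compliance.

        # Sketch of tracing privacy requirements back to source regulations.
        # Requirement names and clause references are illustrative placeholders.
        REQUIREMENT_SOURCES = {
            "data erasure": ["GDPR Art. 17", "Thailand PDPA"],
            "consent withdrawal": ["GDPR Art. 7(3)", "ISO/IEC 29100"],
            "breach notification": ["GDPR Art. 33", "APEC privacy framework"],
        }

        def unaddressed_provisions(addressed_requirements):
            # Map each requirement with no matching issue report back to the
            # provisions the project may not yet comply with.
            return {req: sources
                    for req, sources in REQUIREMENT_SOURCES.items()
                    if req not in addressed_requirements}

        # Prints the provisions behind 'consent withdrawal' and 'breach
        # notification', the requirements not yet covered by issue reports.
        print(unaddressed_provisions({"data erasure"}))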

    Automatically recommending components for issue reports using deep learning

    Today’s software development is typically driven by incremental changes made to software to implement new functionality, fix a bug, or improve its performance and security. Each change request is often described as an issue. Recent studies suggest that the set of components (e.g., software modules) relevant to the resolution of an issue is among the most important information provided with the issue, and one that software engineers often rely on. However, assigning an issue to the correct component(s) is challenging, especially for large-scale projects with up to hundreds of components. In this paper, we propose a predictive model which learns from historical issue reports and recommends the most relevant components for new issues. Our model uses Long Short-Term Memory, a deep learning technique, to automatically learn semantic features representing an issue report, and combines them with traditional textual similarity features. An extensive evaluation on 142,025 issues from 11 large projects shows that our approach outperforms one common baseline, two state-of-the-art techniques, and six alternative techniques, with an average improvement in predictive performance of 16.70%–66.31% across all projects.
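
    A minimal sketch of the model family described, in PyTorch (the dimensions and toy data are placeholders, and the paper's combination with textual-similarity features is omitted): an embedding layer feeds an LSTM whose final hidden state is projected to one relevance score per component.

        # Minimal sketch of an LSTM-based component recommender for issue
        # reports. Hyperparameters and toy data are placeholders; the paper's
        # model also adds textual-similarity features, omitted here.
        import torch
        import torch.nn as nn

        class ComponentRecommender(nn.Module):
            def __init__(self, vocab_size, num_components,
                         embed_dim=64, hidden_dim=128):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, embed_dim)
                self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
                self.out = nn.Linear(hidden_dim, num_components)

            def forward(self, token_ids):
                # token_ids: (batch, seq_len) word indices of an issue report.
                _, (hidden, _) = self.lstm(self.embed(token_ids))
                return self.out(hidden[-1])   # one score per component

        model = ComponentRecommender(vocab_size=10_000, num_components=50)
        issue = torch.randint(0, 10_000, (1, 40))   # a toy tokenized issue
        print(model(issue).topk(3).indices)   # 3 most relevant components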