
    Moving beyond Deletions: Program Simplification via Diverse Program Transformations

    To reduce the complexity of software, developers manually simplify programs (referred to in this paper as developer-induced program simplification) to shrink code size while preserving functionality. Manual simplification, however, is time-consuming and error-prone. To reduce this manual effort, rule-based approaches (e.g., refactoring) and deletion-based approaches (e.g., delta debugging) could potentially be applied to automate developer-induced program simplification. However, since there is little study of how developers simplify programs in open-source software (OSS) projects, it is unclear whether these approaches can be used effectively for this purpose. Hence, we present the first study of developer-induced program simplification in OSS projects, focusing on the types of program transformations used, the motivations behind simplifications, and the extent to which existing refactoring types cover these transformations. Our study of 382 pull requests from 296 projects reveals gaps in applying existing approaches to automate developer-induced program simplification and outlines criteria for designing automatic program simplification techniques. Motivated by these findings, and to reduce the manual effort involved, we propose SimpT5, a tool that automatically produces simplified programs (semantically equivalent programs with reduced source lines of code). SimpT5 is trained on our collected dataset of 92,485 simplified programs and uses two heuristics: (1) simplified line localization, which encodes the lines changed in simplified programs, and (2) checkers that measure the quality of generated programs. Our evaluation shows that SimpT5 is more effective than prior approaches in automating developer-induced program simplification.
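
    The checker heuristic can be pictured with a minimal sketch: a candidate simplification is accepted only if it reduces source lines of code and still behaves like the original. The sketch below assumes the project's test suite acts as the semantic-equivalence oracle; the function names (count_sloc, passes_tests, accept_simplification) are illustrative and are not SimpT5's actual interface.

        import subprocess

        def count_sloc(path: str) -> int:
            """Count non-blank, non-comment source lines (a naive heuristic)."""
            with open(path, encoding="utf-8") as f:
                return sum(
                    1 for line in f
                    if line.strip() and not line.strip().startswith(("#", "//"))
                )

        def passes_tests(test_command: list[str]) -> bool:
            """Run the test suite; exit code 0 is taken as 'behavior preserved'."""
            return subprocess.run(test_command, capture_output=True).returncode == 0

        def accept_simplification(original: str, simplified: str, test_command: list[str]) -> bool:
            """Accept a candidate only if it shrinks SLOC and still passes the tests."""
            return count_sloc(simplified) < count_sloc(original) and passes_tests(test_command)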

    Investigating Automatic Static Analysis Results to Identify Quality Problems: an Inductive Study

    Background: Automatic static analysis (ASA) tools examine source code to discover "issues", i.e., code patterns that are symptoms of bad programming practices and can lead to defective behavior. Studies in the literature have shown that these tools find defects earlier than other verification activities, but they produce a substantial number of false positive warnings. For this reason, an alternative approach is to use the set of ASA issues to identify defect-prone files and components rather than focusing on the individual issues. Aim: We conducted an exploratory study to investigate whether ASA issues can be used as early indicators of faulty files and components and, for the first time, whether they point to a decay of specific software quality attributes, such as maintainability or functionality. Our aim is to understand the critical parameters and feasibility of such an approach to feed into future research on more specific quality and defect prediction models. Method: We analyzed an industrial C# web application using the Resharper ASA tool and explored whether significant correlations exist in such a data set. Results: We found promising results when predicting defect-prone files. A set of specific Resharper categories are better indicators of faulty files than common software metrics or the collection of issues across all categories, and these categories correlate with different software quality attributes. Conclusions: Our advice for future research is to perform the analysis at the file rather than the component level and to evaluate the generalizability of the categories. We also recommend using larger datasets, as we learned that data sparseness can lead to challenges in the proposed analysis process.
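
    The core of such an analysis is a file-level correlation between issue counts and defect counts. The sketch below illustrates the idea with a Spearman rank correlation on made-up numbers; the real study used per-category Resharper issue counts from an industrial C# application.

        from scipy.stats import spearmanr

        # Hypothetical per-file counts: issues of one ASA category vs. post-release defects.
        issues  = {"A.cs": 12, "B.cs": 0, "C.cs": 7, "D.cs": 3, "E.cs": 25}
        defects = {"A.cs": 4,  "B.cs": 0, "C.cs": 2, "D.cs": 1, "E.cs": 9}

        files = sorted(issues)
        rho, p_value = spearmanr([issues[f] for f in files], [defects[f] for f in files])
        print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")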

    A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

    Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation and the detection of duplicate code, plagiarism, malware, and code smells. This paper presents a systematic literature review and meta-analysis of code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics across different applications. We initially found over 10,000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deeper investigation reveals 80 software tools, working with eight different techniques in five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while many programming languages remain unsupported. A noteworthy finding was the existence of 12 datasets related to source code similarity measurement and duplicate code, of which only eight are publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and attention to multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to maintenance.
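
    As a flavor of the lexical technique family covered by the review, the sketch below computes a Jaccard similarity over token 3-grams of two code fragments. It is a deliberately minimal illustration rather than one of the surveyed tools; practical clone detectors combine lexical, syntactic (AST), and semantic information.

        import re

        def token_ngrams(code: str, n: int = 3) -> set:
            """Tokenize naively and return the set of token n-grams."""
            tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
            return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

        def jaccard_similarity(code_a: str, code_b: str) -> float:
            """Jaccard index over token n-grams; 1.0 means identical token structure."""
            a, b = token_ngrams(code_a), token_ngrams(code_b)
            return len(a & b) / len(a | b) if (a | b) else 1.0

        print(jaccard_similarity("total = price + tax", "sum = price + tax"))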

    Towards the detection and analysis of performance regression introducing code changes

    In contemporary software development, developers commonly conduct regression testing to ensure that code changes do not affect software quality. Performance regression testing is an emerging research area within the regression testing domain in software engineering; it aims to maintain the system's performance. Conducting performance regression testing is known to be expensive. It is also complex, given the growing volume of committed code and the number of team members developing simultaneously. Many automated regression testing techniques have been proposed in prior research, yet challenges in locating and resolving performance regressions in practice remain. Directing regression testing to the commit level helps locate the root cause, but it slows down the development process. This thesis outlines motivations and solutions for locating the root causes of performance regressions. First, we challenge a deterministic state-of-the-art approach by expanding the testing data to find areas for improvement; the deterministic approach was found to be limited in its search for the best regression-locating rule. We therefore present two stochastic approaches that develop models able to learn from historical commits. The first stochastic approach views the research problem as a search-based optimization problem that seeks the highest detection rate. We apply different multi-objective evolutionary algorithms and compare them. The thesis also investigates whether simplifying the search space by combining objectives achieves comparable results. The second stochastic approach addresses the severe class imbalance any such system exhibits, since code changes that introduce regressions are rare but costly. We formulate the identification of problematic commits that introduce performance regressions as a binary classification problem that handles class imbalance. Further, the thesis provides an exploratory study of the challenges developers face in resolving performance regressions, based on questions posted on a technical forum and concerning performance regression. We collected around 2k questions discussing regressions in software execution time, all of which were manually analyzed. The study resulted in a categorization of the challenges, and we also discuss the difficulty level of performance regression issues within the development community. This study provides insights to help developers avoid the causes of regressions during software design and implementation.
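
    The second stochastic approach, as summarized above, amounts to an imbalance-aware binary classifier over commit-level features. The sketch below illustrates that framing with class weighting in scikit-learn; the features and labels are synthetic placeholders, not the thesis's actual commit metrics.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 6))              # placeholder commit features (churn, files touched, ...)
        y = (rng.random(500) < 0.05).astype(int)   # ~5% regression-introducing commits: heavily imbalanced

        clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
        scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
        print("F1 across folds:", scores.round(2))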

    Data Mining and Machine Learning for Software Engineering

    Software engineering is one of the research areas in which data mining can be applied most readily. Developers have attempted to improve software quality by mining and analyzing software data. In every phase of the software development life cycle (SDLC), large amounts of data are produced, and design, security, or other software problems may occur. In the early phases of software development, analyzing software data helps to handle these problems and leads to more accurate and timely delivery of software projects. Various data mining and machine learning studies have been conducted to deal with software engineering tasks such as defect prediction and effort estimation. This study presents the open issues in applying data mining and machine learning techniques to software engineering, along with related solutions and recommendations.

    Machine Learning for Software Engineering: A Tertiary Study

    Machine learning (ML) techniques increase the effectiveness of software engineering (SE) lifecycle activities. We systematically collected, quality-assessed, summarized, and categorized 83 reviews in ML for SE published between 2009 and 2022, covering 6,117 primary studies. The SE areas most often tackled with ML are software quality and testing, while human-centered areas appear more challenging for ML. We propose a number of ML-for-SE research challenges and actions, including: conducting further empirical validation and industrial studies on ML; reconsidering deficient SE methods; documenting and automating data collection and pipeline processes; reexamining how industrial practitioners distribute their proprietary data; and implementing incremental ML approaches.

    Large Language Models for Software Engineering: Survey and Open Problems

    This paper provides a survey of the emerging area of Large Language Models (LLMs) for Software Engineering (SE). It also sets out open research challenges for the application of LLMs to technical problems faced by software engineers. LLMs' emergent properties bring novelty and creativity, with applications right across the spectrum of software engineering activities, including coding, design, requirements, repair, refactoring, performance improvement, documentation, and analytics. However, these very same emergent properties also pose significant technical challenges; we need techniques that can reliably weed out incorrect solutions, such as hallucinations. Our survey reveals the pivotal role that hybrid techniques (traditional SE plus LLMs) have to play in the development and deployment of reliable, efficient, and effective LLM-based SE.
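
    One way to picture the hybrid (traditional SE plus LLM) idea is a filter loop in which candidate solutions from an LLM are accepted only once a conventional oracle, such as a test suite, validates them. The sketch below is a hypothetical illustration; candidates and run_tests stand in for whatever LLM call and validation step a concrete pipeline would use.

        from typing import Callable, Iterable, Optional

        def first_validated_patch(
            candidates: Iterable[str],
            run_tests: Callable[[str], bool],
        ) -> Optional[str]:
            """Return the first LLM-generated candidate accepted by a conventional oracle.

            `candidates` would come from an LLM; `run_tests` applies a candidate in a
            clean checkout and runs the project's test suite or static checks.
            """
            for patch in candidates:
                if run_tests(patch):
                    return patch
            return None

        # Hypothetical usage: filter hallucinated or incorrect patches before human review.
        # accepted = first_validated_patch(llm_candidates, apply_and_test)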