58 research outputs found

    LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities

    Full text link
    This paper investigates the application of large language models (LLM) in the domain of mobile application test script generation. Test script generation is a vital component of software testing, enabling efficient and reliable automation of repetitive test tasks. However, existing generation approaches often encounter limitations, such as difficulties in accurately capturing and reproducing test scripts across diverse devices, platforms, and applications. These challenges arise due to differences in screen sizes, input modalities, platform behaviors, API inconsistencies, and application architectures. Overcoming these limitations is crucial for achieving robust and comprehensive test automation. By leveraging the capabilities of LLMs, we aim to address these challenges and explore its potential as a versatile tool for test automation. We investigate how well LLMs can adapt to diverse devices and systems while accurately capturing and generating test scripts. Additionally, we evaluate its cross-platform generation capabilities by assessing its ability to handle operating system variations and platform-specific behaviors. Furthermore, we explore the application of LLMs in cross-app migration, where it generates test scripts across different applications and software environments based on existing scripts. Throughout the investigation, we analyze its adaptability to various user interfaces, app architectures, and interaction patterns, ensuring accurate script generation and compatibility. The findings of this research contribute to the understanding of LLMs' capabilities in test automation. Ultimately, this research aims to enhance software testing practices, empowering app developers to achieve higher levels of software quality and development efficiency.Comment: Accepted by the 23rd IEEE International Conference on Software Quality, Reliability, and Security (QRS 2023

    TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree transformation

    Full text link
    Large-scale language models have made great progress in the field of software engineering in recent years. They can be used for many code-related tasks such as code clone detection, code-to-code search, and method name prediction. However, these large-scale language models based on each code token have several drawbacks: They are usually large in scale, heavily dependent on labels, and require a lot of computing power and time to fine-tune new datasets.Furthermore, code embedding should be performed on the entire code snippet rather than encoding each code token. The main reason for this is that encoding each code token would cause model parameter inflation, resulting in a lot of parameters storing information that we are not very concerned about. In this paper, we propose a novel framework, called TransformCode, that learns about code embeddings in a contrastive learning manner. The framework uses the Transformer encoder as an integral part of the model. We also introduce a novel data augmentation technique called abstract syntax tree transformation: This technique applies syntactic and semantic transformations to the original code snippets to generate more diverse and robust anchor samples. Our proposed framework is both flexible and adaptable: It can be easily extended to other downstream tasks that require code representation such as code clone detection and classification. The framework is also very efficient and scalable: It does not require a large model or a large amount of training data, and can support any programming language.Finally, our framework is not limited to unsupervised learning, but can also be applied to some supervised learning tasks by incorporating task-specific labels or objectives. To explore the effectiveness of our framework, we conducted extensive experiments on different software engineering tasks using different programming languages and multiple datasets

    A Survey of Learning-based Automated Program Repair

    Full text link
    Automated program repair (APR) aims to fix software bugs automatically and plays a crucial role in software development and maintenance. With the recent advances in deep learning (DL), an increasing number of APR techniques have been proposed to leverage neural networks to learn bug-fixing patterns from massive open-source code repositories. Such learning-based techniques usually treat APR as a neural machine translation (NMT) task, where buggy code snippets (i.e., source language) are translated into fixed code snippets (i.e., target language) automatically. Benefiting from the powerful capability of DL to learn hidden relationships from previous bug-fixing datasets, learning-based APR techniques have achieved remarkable performance. In this paper, we provide a systematic survey to summarize the current state-of-the-art research in the learning-based APR community. We illustrate the general workflow of learning-based APR techniques and detail the crucial components, including fault localization, patch generation, patch ranking, patch validation, and patch correctness phases. We then discuss the widely-adopted datasets and evaluation metrics and outline existing empirical studies. We discuss several critical aspects of learning-based APR techniques, such as repair domains, industrial deployment, and the open science issue. We highlight several practical guidelines on applying DL techniques for future APR studies, such as exploring explainable patch generation and utilizing code features. Overall, our paper can help researchers gain a comprehensive understanding about the achievements of the existing learning-based APR techniques and promote the practical application of these techniques. Our artifacts are publicly available at \url{https://github.com/QuanjunZhang/AwesomeLearningAPR}

    GAMMA: Revisiting Template-based Automated Program Repair via Mask Prediction

    Full text link
    Automated program repair (APR) aims to fix software bugs without human intervention and template-based APR has been widely investigated with promising results. However, it is challenging for template-based APR to select the appropriate donor code, which is an important repair ingredient for generating candidate patches. Inappropriate donor code may cause plausible but incorrect patch generation even with correct fix patterns, limiting the repair performance. In this paper, we aim to revisit template-based APR, and propose GAMMA, to directly leverage large pre-trained language models for donor code generation. Our main insight is that instead of retrieving donor code in the local buggy file, we can directly predict the correct code tokens based on the context code snippets and repair patterns by a cloze task. Specifically, (1) GAMMA revises a variety of fix templates from state-of-the-art template-based APR techniques (i.e., TBar) and transforms them into mask patterns. (2) GAMMA adopts a pre-trained language model to predict the correct code for masked code as a fill-in-the-blank task. The experimental results demonstrate that GAMMA correctly repairs 82 bugs on Defects4J-v1.2, which achieves 20.59\% (14 bugs) and 26.15\% (17 bugs) improvement over the previous state-of-the-art template-based approach TBar and learning-based one Recoder. Furthermore, GAMMA repairs 45 bugs and 22 bugs from the additional Defects4J-v2.0 and QuixBugs, indicating the generalizability of GAMMA in addressing the dataset overfitting issue. We also prove that adopting other pre-trained language models can provide substantial advancement, e.g., CodeBERT-based and ChatGPT-based GAMMA is able to fix 80 and 67 bugs on Defects4J-v1.2, indicating the scalability of GAMMA. Overall, our study highlights the promising future of adopting pre-trained models to generate correct patches on top of fix patterns.Comment: Accepted to 38th IEEE/ACM International Conference on Automated Software Engineering (ASE2023

    Backdooring Neural Code Search

    Full text link
    Reusing off-the-shelf code snippets from online repositories is a common practice, which significantly enhances the productivity of software developers. To find desired code snippets, developers resort to code search engines through natural language queries. Neural code search models are hence behind many such engines. These models are based on deep learning and gain substantial attention due to their impressive performance. However, the security aspect of these models is rarely studied. Particularly, an adversary can inject a backdoor in neural code search models, which return buggy or even vulnerable code with security/privacy issues. This may impact the downstream software (e.g., stock trading systems and autonomous driving) and cause financial loss and/or life-threatening incidents. In this paper, we demonstrate such attacks are feasible and can be quite stealthy. By simply modifying one variable/function name, the attacker can make buggy/vulnerable code rank in the top 11%. Our attack BADCODE features a special trigger generation and injection procedure, making the attack more effective and stealthy. The evaluation is conducted on two neural code search models and the results show our attack outperforms baselines by 60%. Our user study demonstrates that our attack is more stealthy than the baseline by two times based on the F1 score

    A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair

    Full text link
    Large Language Models (LLMs) have been gaining increasing attention and demonstrated promising performance across a variety of Software Engineering (SE) tasks, such as Automated Program Repair (APR), code summarization, and code completion. For example, ChatGPT, the latest black-box LLM, has been investigated by numerous recent research studies and has shown impressive performance in various tasks. However, there exists a potential risk of data leakage since these LLMs are usually close-sourced with unknown specific training details, e.g., pre-training datasets. In this paper, we seek to review the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives. We first introduce {\benchmark}, a new benchmark with buggy and the corresponding fixed programs from competitive programming problems starting from 2023, after the training cutoff point of ChatGPT. The results on {\benchmark} show that ChatGPT is able to fix 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming state-of-the-art LLMs CodeT5 and PLBART by 27.5\% and 62.4\% prediction accuracy. We also investigate the impact of three types of prompts, i.e., problem description, error feedback, and bug localization, leading to additional 34 fixed bugs. Besides, we provide additional discussion from the interactive nature of ChatGPT to illustrate the capacity of a dialog-based repair workflow with 9 additional fixed bugs. Inspired by the findings, we further pinpoint various challenges and opportunities for advanced SE study equipped with such LLMs (e.g.,~ChatGPT) in the near future. More importantly, our work calls for more research on the reevaluation of the achievements obtained by existing black-box LLMs across various SE tasks, not limited to ChatGPT on APR

    Coverage Goal Selector for Combining Multiple Criteria in Search-Based Unit Test Generation

    Full text link
    Unit testing is critical to the software development process, ensuring the correctness of basic programming units in a program (e.g., a method). Search-based software testing (SBST) is an automated approach to generating test cases. SBST generates test cases with genetic algorithms by specifying the coverage criterion (e.g., branch coverage). However, a good test suite must have different properties, which cannot be captured using an individual coverage criterion. Therefore, the state-of-the-art approach combines multiple criteria to generate test cases. Since combining multiple coverage criteria brings multiple objectives for optimization, it hurts the test suites' coverage for certain criteria compared with using the single criterion. To cope with this problem, we propose a novel approach named \textbf{smart selection}. Based on the coverage correlations among criteria and the subsumption relationships among coverage goals, smart selection selects a subset of coverage goals to reduce the number of optimization objectives and avoid missing any properties of all criteria. We conduct experiments to evaluate smart selection on 400400 Java classes with three state-of-the-art genetic algorithms under the 22-minute budget. On average, smart selection outperforms combining all goals on 65.1%65.1\% of the classes having significant differences between the two approaches. Secondly, we conduct experiments to verify our assumptions about coverage criteria relationships. Furthermore, we experiment with different budgets of 55, 88, and 1010 minutes, confirming the advantage of smart selection over combining all goals.Comment: arXiv admin note: substantial text overlap with arXiv:2208.0409
    corecore