18 research outputs found

    SapFix: Automated End-To-End Repair at Scale

    Get PDF
    We report our experience with SapFix: the first deployment of automated end-to-end fault fixing, from test case design through to deployed repairs in production code. We have used SapFix at Facebook to repair 6 production systems, each consisting of tens of millions of lines of code, and which are collectively used by hundreds of millions of people worldwide

    Testing web enabled simulation at scale using metamorphic testing

    Get PDF
    We report on Facebook's deployment of MIA (Metamorphic Interaction Automaton). MIA is used to test Facebook's Web Enabled Simulation, built on a web infrastructure of hundreds of millions of lines of code. MIA tackles the twin problems of test flakiness and the unknowable oracle problem. It uses metamorphic testing to automate continuous integration and regression test execution. MIA also plays the role of a test bot, automatically commenting on all relevant changes submitted for code review. It currently uses a suite of over 40 metamorphic test cases. Even at this extreme scale, a non-trivial metamorphic test suite subset yields outcomes within 20 minutes (sufficient for continuous integration and review processes). Furthermore, our offline mode simulation reduces test flakiness from approximately 50% (of all online tests) to 0% (offline). Metamorphic testing has been widely-studied for 22 years. This paper is the first reported deployment into an industrial continuous integration system

    From start-ups to scale-ups: Opportunities and open problems for static and dynamic program analysis

    Get PDF
    This paper describes some of the challenges and opportunities when deploying static and dynamic analysis at scale, drawing on the authors' experience with the Infer and Sapienz Technologies at Facebook, each of which started life as a research-led start-up that was subsequently deployed at scale, impacting billions of people worldwide. The paper identifies open problems that have yet to receive significant attention from the scientific community, yet which have potential for profound real world impact, formulating these as research questions that, we believe, are ripe for exploration and that would make excellent topics for research projects

    Leveraging Automated Unit Tests for Unsupervised Code Translation

    Get PDF
    With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java→Python and Python→C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%

    Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

    Full text link
    Automatically detecting software failures is an important task and a longstanding challenge. It requires finding failure-inducing test cases whose test input can trigger the software's fault, and constructing an automated oracle to detect the software's incorrect behaviors. Recent advancement of large language models (LLMs) motivates us to study how far this challenge can be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows that ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs. A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version. When these two versions have similar syntax, ChatGPT is weak at recognizing subtle code differences. Our insight is that ChatGPT's performance can be substantially enhanced when ChatGPT is guided to focus on the subtle code difference. We have an interesting observation that ChatGPT is effective in inferring the intended behaviors of a buggy program. The intended behavior can be leveraged to synthesize programs, in order to make the subtle code difference between a buggy program and its correct version (i.e., the synthesized program) explicit. Driven by this observation, we propose a novel approach that synergistically combines ChatGPT and differential testing to find failure-inducing test cases. We evaluate our approach on Quixbugs (a benchmark of buggy programs), and compare it with state-of-the-art baselines, including direct use of ChatGPT and Pynguin. The experimental result shows that our approach has a much higher probability (77.8%) of finding correct failure-inducing test cases, 2.7X as the best baseline.Comment: Accepted to the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023

    A Comprehensive Study of Code-removal Patches in Automated Program Repair

    Full text link
    Automatic Program Repair (APR) techniques can promisingly help reducing the cost of debugging. Many relevant APR techniques follow the generate-and-validate approach, that is, the faulty program is iteratively modified with different change operators and then validated with a test suite until a plausible patch is generated. In particular, Kali is a generate-and-validate technique developed to investigate the possibility of generating plausible patches by only removing code. Former studies show that indeed Kali successfully addressed several faults. This paper addresses the case of code-removal patches in automated program repair investigating the reasons and the scenarios that make their creation possible, and the relationship with patches implemented by developers. Our study reveals that code-removal patches are often insufficient to fix bugs, and proposes a comprehensive taxonomy of code-removal patches that provides evidence of the problems that may affect test suites, opening new opportunities for researchers in the field of automatic program repair.Comment: New version of the manuscrip

    StubCoder: Automated Generation and Repair of Stub Code for Mock Objects

    Full text link
    Mocking is an essential unit testing technique for isolating the class under test (CUT) from its dependencies. Developers often leverage mocking frameworks to develop stub code that specifies the behaviors of mock objects. However, developing and maintaining stub code is labor-intensive and error-prone. In this paper, we present StubCoder to automatically generate and repair stub code for regression testing. StubCoder implements a novel evolutionary algorithm that synthesizes test-passing stub code guided by the runtime behavior of test cases. We evaluated our proposed approach on 59 test cases from 13 open-source projects. Our evaluation results show that StubCoder can effectively generate stub code for incomplete test cases without stub code and repair obsolete test cases with broken stub code.Comment: This paper was accepted by the ACM Transactions on Software Engineering and Methodology (TOSEM) in July 202

    Many-Objective Optimization of Non-Functional Attributes based on Refactoring of Software Models

    Full text link
    Software quality estimation is a challenging and time-consuming activity, and models are crucial to face the complexity of such activity on modern software applications. In this context, software refactoring is a crucial activity within development life-cycles where requirements and functionalities rapidly evolve. One main challenge is that the improvement of distinctive quality attributes may require contrasting refactoring actions on software, as for trade-off between performance and reliability (or other non-functional attributes). In such cases, multi-objective optimization can provide the designer with a wider view on these trade-offs and, consequently, can lead to identify suitable refactoring actions that take into account independent or even competing objectives. In this paper, we present an approach that exploits NSGA-II as the genetic algorithm to search optimal Pareto frontiers for software refactoring while considering many objectives. We consider performance and reliability variations of a model alternative with respect to an initial model, the amount of performance antipatterns detected on the model alternative, and the architectural distance, which quantifies the effort to obtain a model alternative from the initial one. We applied our approach on two case studies: a Train Ticket Booking Service, and CoCoME. We observed that our approach is able to improve performance (by up to 42\%) while preserving or even improving the reliability (by up to 32\%) of generated model alternatives. We also observed that there exists an order of preference of refactoring actions among model alternatives. We can state that performance antipatterns confirmed their ability to improve performance of a subject model in the context of many-objective optimization. In addition, the metric that we adopted for the architectural distance seems to be suitable for estimating the refactoring effort.Comment: Accepted for publication in Information and Software Technologies. arXiv admin note: substantial text overlap with arXiv:2107.0612

    Program transformation landscapes for automated program modification using Gin

    Get PDF
    Automated program modification underlies two successful research areas — genetic improvement and program repair. Under the generate-and-validate strategy, automated program modification transforms a program, then validates the result against a test suite. Much work has focused on the search space of application of single fine-grained operators — COPY, DELETE, REPLACE, and SWAP at both line and statement granularity. This work explores the limits of this strategy. We scale up existing findings an order of magnitude from small corpora to 10 real-world Java programs comprising up to 500k LoC. We decisively show that the grammar-specificity of statement granular edits pays off: its pass rate triples that of line edits and uses 10% less computational resources. We confirm previous findings that DELETE is the most effective operator for creating test-suite equivalent program variants. We go farther than prior work by exploring the limits of DELETE ’s effectiveness by exhaustively applying it. We show this strategy is too costly in practice to be used to search for improved software variants. We further find that pass rates drop from 12–34% for single statement edits to 2–6% for 5-edit sequences, which implies that further progress will need human-inspired operators that target specific faults or improvements. A program is amenable to automated modification to the extent to which automatically editing it is likely to produce test-suite passing variants. We are the first to systematically search for a code measure that correlates with a program’s amenability to automated modification. We found no strong correlations, leaving the question open
    corecore