SapFix: Automated End-To-End Repair at Scale
We report our experience with SapFix: the first deployment of automated end-to-end fault fixing, from test case design through to deployed repairs in production code. We have used SapFix at Facebook to repair 6 production systems, each consisting of tens of millions of lines of code, which are collectively used by hundreds of millions of people worldwide.
Testing web enabled simulation at scale using metamorphic testing
We report on Facebook's deployment of MIA (Metamorphic Interaction Automaton). MIA is used to test Facebook's Web Enabled Simulation, built on a web infrastructure of hundreds of millions of lines of code. MIA tackles the twin problems of test flakiness and the unknowable oracle problem. It uses metamorphic testing to automate continuous integration and regression test execution. MIA also plays the role of a test bot, automatically commenting on all relevant changes submitted for code review. It currently uses a suite of over 40 metamorphic test cases. Even at this extreme scale, a non-trivial metamorphic test suite subset yields outcomes within 20 minutes (sufficient for continuous integration and review processes). Furthermore, our offline-mode simulation reduces test flakiness from approximately 50% (of all online tests) to 0% (offline). Metamorphic testing has been widely studied for 22 years; this paper reports its first deployment into an industrial continuous integration system.
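The core idea is easier to see in miniature. Below is a minimal sketch of a metamorphic test in Python, assuming a hypothetical system under test `search(query, corpus)` whose exact output has no practical oracle; rather than checking absolute results, the test checks a relation between two related executions, which is how metamorphic testing sidesteps the unknowable oracle problem.

```python
def search(query, corpus):
    # Stand-in for the real system under test: returns matching items.
    # Its absolute correctness is assumed to be impractical to specify.
    return [doc for doc in corpus if query in doc]

def test_metamorphic_subset_relation():
    corpus = ["apple pie", "banana bread", "apple tart"]
    baseline = search("apple", corpus)
    # Metamorphic relation: shrinking the corpus must never grow the
    # result set. We compare two executions instead of checking one
    # execution against a ground-truth answer.
    follow_up = search("apple", corpus[:2])
    assert set(follow_up) <= set(baseline)

test_metamorphic_subset_relation()
```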
From start-ups to scale-ups: Opportunities and open problems for static and dynamic program analysis
This paper describes some of the challenges and opportunities of deploying static and dynamic analysis at scale, drawing on the authors' experience with the Infer and Sapienz technologies at Facebook, each of which started life as a research-led start-up and was subsequently deployed at scale, impacting billions of people worldwide. The paper identifies open problems that have yet to receive significant attention from the scientific community, yet which have the potential for profound real-world impact, formulating these as research questions that, we believe, are ripe for exploration and would make excellent topics for research projects.
Leveraging Automated Unit Tests for Unsupervised Code Translation
With little to no parallel data available for programming languages, unsupervised methods are well suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single wrong token can result in compilation failures or erroneous programs, unlike natural languages, where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We find that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the generated translations, comfortably outperforming the state of the art for all language pairs studied. In particular, for Java→Python and Python→C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
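To illustrate the filtering idea, here is a minimal Python sketch; `translate_candidates` and `passes_unit_tests` are hypothetical stand-ins for the paper's model and test harness, not its actual components. Candidate translations enter the fine-tuning corpus only if they pass automatically generated unit tests.

```python
def translate_candidates(source_fn, n=5):
    # In the real system this would sample n translations from the
    # unsupervised model; here we return placeholder strings.
    return [f"candidate_{i}(source={source_fn!r})" for i in range(n)]

def passes_unit_tests(source_fn, candidate):
    # The real harness generates unit tests for source_fn, runs them
    # against the candidate, and reports whether all tests pass.
    # Placeholder decision for illustration only:
    return hash(candidate) % 2 == 0

def build_filtered_corpus(source_functions):
    parallel_corpus = []
    for fn in source_functions:
        for cand in translate_candidates(fn):
            if passes_unit_tests(fn, cand):
                # Only test-passing pairs become parallel training data.
                parallel_corpus.append((fn, cand))
                break  # keep the first validated translation
    return parallel_corpus
```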
Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting
Automatically detecting software failures is an important task and a longstanding challenge. It requires finding failure-inducing test cases whose test input can trigger the software's fault, and constructing an automated oracle to detect the software's incorrect behaviors. Recent advances in large language models (LLMs) motivate us to study how far this challenge can be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows that ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs. A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version; when these two versions have similar syntax, ChatGPT is weak at recognizing such subtle differences. Our insight is that ChatGPT's performance can be substantially enhanced when it is guided to focus on the subtle code difference. We make the interesting observation that ChatGPT is effective at inferring the intended behavior of a buggy program. The intended behavior can be leveraged to synthesize programs, making the subtle code difference between a buggy program and its correct version (i.e., the synthesized program) explicit. Driven by this observation, we propose a novel approach that synergistically combines ChatGPT and differential testing to find failure-inducing test cases. We evaluate our approach on QuixBugs (a benchmark of buggy programs) and compare it with state-of-the-art baselines, including direct use of ChatGPT and Pynguin. The experimental results show that our approach has a much higher probability (77.8%) of finding correct failure-inducing test cases, 2.7 times that of the best baseline.
Comment: Accepted to the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023).
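The differential-testing step can be sketched in Python as follows, assuming we already hold the buggy program and a reference program synthesized from the LLM-inferred intended behavior; all names are illustrative, not the paper's implementation. Any input on which the two versions disagree is a failure-inducing test case.

```python
import random

def buggy_max(xs):
    # Seeded bug for illustration: ignores the last element.
    return max(xs[:-1])

def reference_max(xs):
    # Stands in for the program synthesized from the inferred intent.
    return max(xs)

def find_failure_inducing_input(trials=1000):
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(2, 6))]
        if buggy_max(xs) != reference_max(xs):
            # The two versions disagree: a failure-inducing test input,
            # with the reference acting as the oracle.
            return xs
    return None

print(find_failure_inducing_input())
```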
A Comprehensive Study of Code-removal Patches in Automated Program Repair
Automatic Program Repair (APR) techniques can help reduce the cost of debugging. Many relevant APR techniques follow the generate-and-validate approach: the faulty program is iteratively modified with different change operators and then validated with a test suite until a plausible patch is generated. In particular, Kali is a generate-and-validate technique developed to investigate the possibility of generating plausible patches by only removing code. Prior studies show that Kali indeed successfully addressed several faults. This paper addresses the case of code-removal patches in automated program repair, investigating the reasons and the scenarios that make their creation possible, and their relationship with patches implemented by developers. Our study reveals that code-removal patches are often insufficient to fix bugs, and proposes a comprehensive taxonomy of code-removal patches that provides evidence of the problems that may affect test suites, opening new opportunities for researchers in the field of automatic program repair.
Comment: New version of the manuscript.
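As a rough illustration of a Kali-style generate-and-validate loop that only removes code, consider the Python sketch below; the program is modeled as a list of statement strings, and `run_test_suite` is a hypothetical stand-in for compiling and running the real test suite.

```python
def run_test_suite(statements):
    # Placeholder validator: the real one executes the test suite
    # against the modified program.
    program = "\n".join(statements)
    return "buggy_statement" not in program

def kali_style_repair(statements):
    # Generate-and-validate with a single change operator: DELETE.
    for i in range(len(statements)):
        candidate = statements[:i] + statements[i + 1:]
        if run_test_suite(candidate):
            # A plausible (test-passing) patch, which the paper shows
            # is often not a correct fix.
            return candidate
    return None

patched = kali_style_repair(["x = 1", "buggy_statement", "print(x)"])
print(patched)
```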
StubCoder: Automated Generation and Repair of Stub Code for Mock Objects
Mocking is an essential unit testing technique for isolating the class under test (CUT) from its dependencies. Developers often leverage mocking frameworks to develop stub code that specifies the behaviors of mock objects. However, developing and maintaining stub code is labor-intensive and error-prone. In this paper, we present StubCoder to automatically generate and repair stub code for regression testing. StubCoder implements a novel evolutionary algorithm that synthesizes test-passing stub code guided by the runtime behavior of test cases. We evaluated our proposed approach on 59 test cases from 13 open-source projects. Our evaluation results show that StubCoder can effectively generate stub code for incomplete test cases without stub code and repair obsolete test cases with broken stub code.
Comment: This paper was accepted by the ACM Transactions on Software Engineering and Methodology (TOSEM) in July 2023.
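The paper targets Java mocking frameworks; the Python sketch below uses `unittest.mock` purely to illustrate what stub code is: the statements that pin down a mock object's behavior so the CUT can be exercised in isolation. `CheckoutService` and its gateway dependency are hypothetical examples, not from the paper.

```python
from unittest.mock import Mock

# Hypothetical class under test that depends on a payment gateway.
class CheckoutService:
    def __init__(self, gateway):
        self.gateway = gateway

    def pay(self, amount):
        return "ok" if self.gateway.charge(amount) else "declined"

# Stub code: specify the mock's behavior for this test scenario.
# Lines like these are what StubCoder synthesizes and repairs.
gateway = Mock()
gateway.charge.return_value = True

assert CheckoutService(gateway).pay(42) == "ok"
gateway.charge.assert_called_once_with(42)
```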
Many-Objective Optimization of Non-Functional Attributes based on Refactoring of Software Models
Software quality estimation is a challenging and time-consuming activity, and models are crucial to face the complexity of such an activity on modern software applications. In this context, software refactoring is a crucial activity within development life-cycles where requirements and functionalities rapidly evolve. One main challenge is that improving distinct quality attributes may require contrasting refactoring actions on software, as in the trade-off between performance and reliability (or other non-functional attributes). In such cases, multi-objective optimization can provide the designer with a wider view of these trade-offs and, consequently, can lead to identifying suitable refactoring actions that take into account independent or even competing objectives. In this paper, we present an approach that exploits NSGA-II as the genetic algorithm to search optimal Pareto frontiers for software refactoring while considering many objectives. We consider performance and reliability variations of a model alternative with respect to an initial model, the amount of performance antipatterns detected in the model alternative, and the architectural distance, which quantifies the effort to obtain a model alternative from the initial one. We applied our approach to two case studies: a Train Ticket Booking Service and CoCoME. We observed that our approach is able to improve performance (by up to 42%) while preserving or even improving the reliability (by up to 32%) of generated model alternatives. We also observed that there exists an order of preference among refactoring actions across model alternatives. We can state that performance antipatterns confirmed their ability to improve the performance of a subject model in the context of many-objective optimization. In addition, the metric that we adopted for the architectural distance seems suitable for estimating the refactoring effort.
Comment: Accepted for publication in Information and Software Technology. arXiv admin note: substantial text overlap with arXiv:2107.0612
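A minimal Python sketch of the many-objective setup follows: each model alternative is scored on the four objectives named above, and a Pareto dominance check of the kind that drives NSGA-II decides which alternatives survive. All field names and values here are illustrative, not the paper's data.

```python
def objectives(alt):
    # Performance and reliability gains are maximized; antipattern count
    # and architectural distance are minimized (hence negated).
    return (alt["perf_gain"], alt["rel_gain"],
            -alt["antipatterns"], -alt["distance"])

def dominates(a, b):
    # a dominates b if it is no worse on every objective and strictly
    # better on at least one.
    fa, fb = objectives(a), objectives(b)
    return all(x >= y for x, y in zip(fa, fb)) and \
           any(x > y for x, y in zip(fa, fb))

alts = [
    {"perf_gain": 0.42, "rel_gain": 0.32, "antipatterns": 1, "distance": 3.0},
    {"perf_gain": 0.40, "rel_gain": 0.30, "antipatterns": 2, "distance": 3.5},
]
pareto_front = [a for a in alts
                if not any(dominates(b, a) for b in alts if b is not a)]
print(pareto_front)
```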
Program transformation landscapes for automated program modification using Gin
Automated program modification underlies two successful research areas — genetic improvement and program repair. Under the generate-and-validate strategy, automated program modification transforms a program, then validates the result against a test suite. Much work has focused on the search space of application of single fine-grained operators — COPY, DELETE, REPLACE, and SWAP at both line and statement granularity. This work explores the limits of this strategy. We scale up existing findings an order of magnitude from small corpora to 10 real-world Java programs comprising up to 500k LoC. We decisively show that the grammar-specificity of statement granular edits pays off: its pass rate triples that of line edits and uses 10% less computational resources. We confirm previous findings that DELETE is the most effective operator for creating test-suite equivalent program variants. We go farther than prior work by exploring the limits of DELETE ’s effectiveness by exhaustively applying it. We show this strategy is too costly in practice to be used to search for improved software variants. We further find that pass rates drop from 12–34% for single statement edits to 2–6% for 5-edit sequences, which implies that further progress will need human-inspired operators that target specific faults or improvements. A program is amenable to automated modification to the extent to which automatically editing it is likely to produce test-suite passing variants. We are the first to systematically search for a code measure that correlates with a program’s amenability to automated modification. We found no strong correlations, leaving the question open