TorchProbe: Fuzzing Dynamic Deep Learning Compilers
Static and dynamic computational graphs represent two distinct approaches to
constructing deep learning frameworks. The former prioritizes compiler-based
optimizations, while the latter focuses on programmability and
user-friendliness. The recent release of PyTorch 2.0, which supports compiling
arbitrary deep learning programs in Python, signifies a new direction in the
evolution of deep learning infrastructure: incorporating compiler techniques in a more dynamic manner and supporting dynamic language features such as dynamic control flow and closures. Given PyTorch's seamless integration with Python,
its compiler aims to support arbitrary deep learning code written in Python.
However, the inherent dynamism of Python poses challenges to the completeness
and robustness of the compiler. While recent research has introduced fuzzing to
test deep learning compilers, there is still a lack of comprehensive analysis
on how to test dynamic features. To address this issue, we propose several code
transformations to generate test cases involving dynamic features. These
transformations preserve the program's semantics, ensuring that any discrepancy
between the transformed and original programs indicates the presence of a bug.
Through our approach, we have successfully identified twenty previously unknown bugs in the PyTorch compiler and its underlying tensor compiler, Triton.
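To make the idea concrete, here is a minimal sketch of this kind of semantics-preserving differential test using torch.compile; the transformation shown (hiding the body in a closure) is our own illustrative example, not one of TorchProbe's actual mutations:

```python
# A minimal sketch of semantics-preserving differential testing with
# torch.compile. The closure-wrapping transformation is our own
# illustrative example, not one of TorchProbe's actual mutations.
import torch

def original(x):
    # A small program exercising dynamic control flow.
    if x.sum() > 0:
        return torch.sin(x) + 1.0
    return torch.cos(x) - 1.0

def transformed(x):
    # Semantics-preserving rewrite: the same computation hidden in a closure.
    def body(y):
        if y.sum() > 0:
            return torch.sin(y) + 1.0
        return torch.cos(y) - 1.0
    return body(x)

x = torch.randn(8)
eager = original(x)
for fn in (original, transformed):
    compiled_out = torch.compile(fn)(x)
    # Any discrepancy points to a compiler bug, not a bug in the program.
    torch.testing.assert_close(compiled_out, eager)
```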
Configuring Test Generators using Bug Reports: A Case Study of GCC Compiler and Csmith
The correctness of compilers is instrumental in the safety and reliability of
other software systems, as bugs in compilers can produce executables that do
not reflect the intent of programmers. Such errors are difficult to identify
and debug. Random test program generators are commonly used in testing
compilers, and they have been effective in uncovering bugs. However, the
problem of guiding these test generators to produce test programs that are more
likely to find bugs remains challenging. In this paper, we use the code
snippets in bug reports to guide test generation. The main idea of this work is to extract insights from the bug reports about the language features that are more prone to inadequate implementation, and to use these insights to guide the test generators. We use the GCC C compiler to evaluate the
effectiveness of this approach. In particular, we first cluster the test
programs in the GCC bug reports based on their features. We then use the
centroids of the clusters to compute configurations for Csmith, a popular test
generator for C compilers. We evaluated this approach on eight versions of GCC and found that it provides higher coverage and triggers more
miscompilation failures than the state-of-the-art test generation techniques
for GCC. (Accepted at the 36th ACM/SIGAPP Symposium on Applied Computing, Software Verification and Testing Track, SAC-SVT'21.)
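A rough sketch of this pipeline, under our own assumptions about the mined features (the clustering uses scikit-learn, and the Csmith flag names should be verified against the installed version):

```python
# Sketch: cluster feature vectors mined from GCC bug-report test programs
# and turn each cluster centroid into a Csmith command line. The feature
# set, mined data, and threshold are hypothetical; verify the Csmith flag
# names against the installed version.
import numpy as np
from sklearn.cluster import KMeans

FEATURES = ["pointers", "arrays", "structs", "unions", "volatiles"]
FLAG_OFF = {f: f"--no-{f}" for f in FEATURES}  # e.g. --no-pointers

# Each row: how strongly a feature appears in one bug-report test program
# (placeholder random data standing in for the mined feature vectors).
reports = np.random.rand(200, len(FEATURES))
centroids = KMeans(n_clusters=4, n_init=10).fit(reports).cluster_centers_

def to_csmith_args(centroid, threshold=0.3):
    # Disable features that rarely appear in bug-triggering programs.
    return [FLAG_OFF[f] for f, w in zip(FEATURES, centroid) if w < threshold]

for c in centroids:
    print("csmith", " ".join(to_csmith_args(c)))
```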
PHYFU: Fuzzing Modern Physics Simulation Engines
A physical simulation engine (PSE) is a software system that simulates
physical environments and objects. Modern PSEs feature both forward and
backward simulations, where the forward phase predicts the behavior of a
simulated system, and the backward phase provides gradients (guidance) for
learning-based control tasks, such as a robot arm learning to fetch items. This
way, modern PSEs show promising support for learning-based control methods. To
date, PSEs have been widely used in highly profitable commercial applications such as games, movies, virtual reality (VR), and robotics.
Despite the widespread development and use of PSEs by academia and industrial manufacturers such as Google and NVIDIA, PSEs may produce incorrect simulations, with consequences ranging from poor user experience in entertainment to accidents in robotics-involved manufacturing and surgical operations.
This paper introduces PHYFU, a fuzzing framework designed specifically for
PSEs to uncover errors in both the forward and backward simulation phases. PHYFU mutates initial states and checks whether the PSE under test behaves consistently with respect to basic Physics Laws (PLs). We further use feedback-driven test
input scheduling to guide and accelerate the search for errors. Our study of
four PSEs covers mainstream industrial vendors (Google and NVIDIA) as well as
academic products. We successfully uncover over 5K error-triggering inputs that
generate incorrect simulation results spanning the whole software stack of PSEs. (Accepted at the 38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023.)
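The oracle can be illustrated with a toy example: mutate an initial state and assert a basic physics law. The simulator and tolerance below are hypothetical stand-ins for a real PSE:

```python
# Illustrative sketch (not PHYFU's implementation): mutate an initial state
# and assert a basic physics law. The "PSE" here is a toy free-fall
# integrator and the law is conservation of mechanical energy.
import numpy as np

G = 9.81  # gravitational acceleration, unit mass assumed

def simulate(pos, vel, dt=1e-3, steps=1000):
    # Stand-in for the engine under test: semi-implicit Euler free fall.
    for _ in range(steps):
        vel = vel + np.array([0.0, -G]) * dt
        pos = pos + vel * dt
    return pos, vel

def energy(pos, vel):
    # Kinetic plus potential energy for a unit mass.
    return 0.5 * np.dot(vel, vel) + G * pos[1]

rng = np.random.default_rng(0)
for _ in range(100):
    # Mutate the initial state, as PHYFU's mutator would.
    pos0 = rng.uniform(0.0, 10.0, size=2)
    vel0 = rng.uniform(-5.0, 5.0, size=2)
    pos1, vel1 = simulate(pos0, vel0)
    # Oracle: energy must be (approximately) conserved; a large drift
    # would flag an error-triggering input.
    assert abs(energy(pos1, vel1) - energy(pos0, vel0)) < 0.1
```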
Fuzzing Deep Learning Compilers with HirGen
Deep Learning (DL) compilers are widely adopted to optimize advanced DL
models for efficient deployment on diverse hardware. Their quality has a profound effect on the quality of compiled DL models. A recent bug study shows that the
optimization of high-level intermediate representation (IR) is the most
error-prone compilation stage; bugs in this stage account for 44.92% of all collected bugs. However, existing testing techniques do not consider features related to high-level optimization (e.g., high-level IR) and are
therefore weak in exposing bugs at this stage. To bridge this gap, we propose
HirGen, an automated testing technique that aims to effectively expose coding
mistakes in the optimization of high-level IR. The design of HirGen includes 1)
three coverage criteria to generate diverse and valid computational graphs; 2)
full use of the high-level IR's language features to generate diverse IRs; 3) three test oracles inspired by both differential testing and metamorphic testing. HirGen has successfully detected 21 bugs in TVM, with 17 bugs
confirmed and 12 fixed. Further, we construct four baselines using the
state-of-the-art DL compiler fuzzers that can cover the high-level optimization
stage. Our experiment results show that HirGen can detect 10 crashes and
inconsistencies that cannot be detected by the baselines within 48 hours. We further validate the usefulness of our proposed coverage criteria and test oracles in our evaluation.
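A minimal example of the differential-testing side of such an oracle, written against TVM's Relay API (this is our own illustration, not HirGen's generated graphs, and the API may differ across TVM versions):

```python
# Minimal differential-testing oracle over TVM's high-level IR (Relay):
# compile one module at two optimization levels and compare outputs.
# This illustrates the oracle only; HirGen additionally generates diverse
# computational graphs under its coverage criteria.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

x = relay.var("x", shape=(1, 8), dtype="float32")
f = relay.Function([x], relay.nn.relu(x) * relay.const(2.0))
mod = tvm.IRModule.from_expr(f)

inp = np.random.rand(1, 8).astype("float32")
outs = []
for opt_level in (0, 3):  # unoptimized vs. fully optimized high-level IR
    with tvm.transform.PassContext(opt_level=opt_level):
        lib = relay.build(mod, target="llvm")
    m = graph_executor.GraphModule(lib["default"](tvm.cpu()))
    m.set_input("x", inp)
    m.run()
    outs.append(m.get_output(0).numpy())

# Oracle: optimization must not change semantics.
np.testing.assert_allclose(outs[0], outs[1], rtol=1e-5)
```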
Differential Testing of Simulation-Based Virtual Machine Generators: Automatic Detection of VM Generator Semantic Gaps Between Simulation and Generated VMs
Testing and debugging language Virtual Machines (VMs) is a laborious task without the proper tooling. This complexity is aggravated when the VM targets multiple architectures. To solve this problem, simulation-based VM generator frameworks allow one to write test cases on the simulation, but those test cases do not ensure the correctness of the generated artifact due to the semantic gaps between the simulated VM and the generated VMs. We propose Test Transmutation, which extends simulation-based VM generator frameworks to support test case generation. It validates the generated VM by also running test cases generated from existing simulation test cases. The results of the generated test cases are compared against those of the simulation test cases using differential testing. Moreover, test cases are automatically mutated with non-semantics-preserving mutations. Test Transmutation detects bugs that are representative of typical VM modifications. We demonstrate its effectiveness by applying it to a set of real test cases of the Pharo VM, which allowed us to find several issues that were unknown to the VM development team. Our approach shows promising results for testing simulation-based VM generator frameworks.
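In spirit, the differential-testing step looks like the following sketch; the VM executable names are hypothetical placeholders, since the actual framework targets Pharo's simulation-based VM generator:

```python
# A hypothetical harness in the spirit of Test Transmutation: run the same
# test program on the simulated VM and on the generated VM, then compare
# outcomes (differential testing). The executable names and invocation are
# made up for illustration.
import subprocess

def run(cmd, program):
    r = subprocess.run([*cmd, program], capture_output=True,
                       text=True, timeout=60)
    return (r.returncode, r.stdout)

def differential_test(program):
    sim = run(["simulated-vm"], program)  # test case on the simulation
    gen = run(["generated-vm"], program)  # same test on the generated VM
    if sim != gen:
        # A divergence points to a semantic gap introduced by generation.
        raise AssertionError(f"divergence on {program}: {sim} vs {gen}")
```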
Test-Equivalence Analysis for Automatic Patch Generation
Automated program repair is the problem of finding a transformation (called a patch) of a given incorrect program that eliminates its observable failures. It has important applications such as providing debugging aids, automatically grading student assignments, and patching security vulnerabilities. A common challenge faced by existing repair techniques is scalability to large patch spaces, since there are many candidate patches that these techniques explicitly or implicitly consider.
The correctness criterion for program repair is often given as a suite of tests. Current repair techniques do not scale due to the large number of test executions performed by the underlying search algorithms. In this work, we address this problem by introducing a methodology of patch generation based on a test-equivalence relation (if two programs are “test-equivalent” for a given test, they produce indistinguishable results on this test). We propose two test-equivalence relations, based on runtime values and dependencies respectively, and present an algorithm that performs on-the-fly partitioning of patches into test-equivalence classes.
Our experiments on real-world programs reveal that the proposed methodology drastically reduces the number of test executions and therefore provides an order-of-magnitude efficiency improvement over existing repair techniques, without sacrificing patch quality.
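A toy sketch of value-based test-equivalence partitioning (our own illustration of the idea, not the paper's algorithm):

```python
# Toy sketch of value-based test-equivalence partitioning (our illustration
# of the idea, not the paper's algorithm). Candidate patches replacing one
# expression are grouped by the runtime value they produce on a test; only
# one representative per class needs a full test execution.
from collections import defaultdict

def partition_by_value(patches, test_input):
    """Group candidate patch expressions by their value on the test input."""
    classes = defaultdict(list)
    for patch in patches:
        value = patch(test_input)  # evaluate the patched expression
        classes[value].append(patch)
    return classes

# Candidate replacements for a buggy condition `x > 10`:
patches = [lambda x: x >= 10, lambda x: x > 9,
           lambda x: x > 11, lambda x: x != 0]
classes = partition_by_value(patches, test_input=10)
print(f"{len(patches)} patches fall into {len(classes)} equivalence classes")
# -> 4 patches fall into 2 equivalence classes: one full test run per class
#    suffices instead of one per patch.
```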