9 research outputs found
Test-case reduction for C compiler bugs
ManuscriptTo report a compiler bug, one must often find a small test case that triggers the bug. The existing approach to automated test-case reduction, delta debugging, works by removing substrings of the original input; the result is a concatenation of substrings that delta cannot remove. We have found this approach less than ideal for reducing C programs because it typically yields test cases that are too large or even invalid (relying on undefined behavior). To obtain small and valid test cases consistently, we designed and implemented three new, domain-specific test-case reducers. The best of these is based on a novel framework in which a generic fixpoint computation invokes modular transformations that perform reduction operations. This reducer produces outputs that are, on average, more than 25 times smaller than those produced by our other reducers or by the existing reducer that is most commonly used by compiler developers. We conclude that effective program reduction requires more than straightforward delta debugging
Doctor of Philosophy
dissertationAggressive random testing tools, or fuzzers, are impressively effective at finding bugs in compilers and programming language runtimes. For example, a single test-case generator has resulted in more than 460 bugs reported for a number of production-quality C compilers. However, fuzzers can be hard to use. The first problem is that failures triggered by random test cases can be difficult to debug because these tests are often large. To report a compiler bug, one must often construct a small test case that triggers the bug. The existing automated test-case reduction technique, delta debugging, is not sufficient to produce small, reportable test cases. A second problem is that fuzzers are indiscriminate: they repeatedly find bugs that may not be severe enough to fix right away. Third, fuzzers tend to generate a large number of test cases that only trigger a few bugs. Some bugs are triggered much more frequently than others, creating needle-in-the-haystack problems. Currently, users rule out undesirable test cases using ad hoc methods such as disallowing problematic features in tests and filtering test results. This dissertation investigates approaches to improving the utility of compiler fuzzers. Two components, an aggressive test-case reducer and a tamer, are added to the fuzzing workflow to make the fuzzer more user friendly. We introduce C-Reduce, an aggressive test-case reducer for C/C++ programs, which exploits rich domain-specific knowledge to output test cases nearly as good as those produced by skilled humans. This reducer produces outputs that are, on average, more than 30 times smaller than those produced by the existing reducer that is most commonly used by compiler engineers. Second, this dissertation formulates and addresses the fuzzer taming problem: given a potentially large number of random test cases that trigger failures, order them such that diverse, interesting test cases are highly ranked. Bug triage can be effectively automated, relying on techniques from machine learning to suppress duplicate bug-triggering test cases and test cases triggering known bugs. An evaluation shows the ability of this tool to solve the fuzzer taming problem for 3,799 test cases triggering 46 bugs in a C compiler
Recommended from our members
Leveraging Generated Tests
The main goal of automated test generation is to improve the reliability of a program by exposing faults to developers. To this end, testing should cover the largest possible portion of the program given a test budget (i.e., time and resources) as frequently as possible. Coverage of a program entity in testing increases our confidence in the correctness of that entity.
Generating various tests to cover a program entity is a particularly hard problem to solve for large software systems because the test inputs are complex and they often exhibit sophisticated feature interactions. As a result, current test generation techniques, such as symbolic execution or search-based testing, do not scale well to complex, large-scale systems.
This dissertation presents a test generation technique which aims to increase the frequency of coverage in large, complex software systems. It leverages the information of existing test cases to direct the automated testing. We show the results of the application of this technique to some large systems such as GCC compiler ( 850K Lines of code), and Mozillas JavaScript engine ( 120K lines of code). It increases the frequency of coverage upto the factor of 9x, compared to the state-of-the-art technique.
It also proposes non-adequate test-case reduction for reducing the size of test cases by cov-erage and mutant detection criteria. C%-coverage test reduction technique reduces a test case while preseving at least C% of coverage in the original test case. N-mutant test reduction tech-nique reduces a test cases while preserving detection of N mutants of the original test case. We evaluate the effectiveness of these test reduction techniques on different attributes of test cases.
This research suggest that the generated test cases should be treated as first-class artifacts in the software development and they can be leveraged for interesting testing tasks
Automatic test generation for the detection of performance bugs in code optimization
Software is everywhere in our daily lives, and it is important that software behaves in ways it is expected to. Testing is a widely accepted method for improving software quality. Testing detects the presence of bugs by comparing the actual outcome to the expected outcome of a computation.
Testing for correctness is a well-studied problem. Testing for correctness compares the actual outcome of computation against its expected output. Typically, the expected output of a computation is unambiguous, since computations in computer software typically have clear semantics defined by the programming language.
However, testing for performance is less studied. The expected outcome of a test may require context-knowledge not apparent in the test program itself. For example, by simply inspecting the code of a web server, one cannot determine what is the expected throughput. This makes performance testing for performance a challenging task.
Testing compilers adds another layer of complexity. For compilers, a correctness bug during compiler optimization may introduce a bug in the resulting binary, even though the bug was not present in the source code. Similarly, a performance bug during optimization may cause inconsistencies in the runtimes of equivalent programs, where equivalent programs are defined as programs with identical outcomes but whose sources may differ through semantic-preserving transformations. Performance bugs prevent compilers from producing efficient code when they have the ability to do so.
Many testing techniques have been proposed. Random testing is a powerful testing technique often associated with test generation. It allows a large testing space to be explored efficiently through sampling and is suitable for large and complex software with a large testing space, such as compilers.
Random test generation for compilers has been shown to be effective in detecting correctness bugs. However, to the best of our knowledge, there is no previous study on random test generation for performance bugs in compilers. We believe one of the main reasons is the context-dependent nature when quantifying performance headroom.
We propose a random test generation infrastructure for evaluating the performance of compilers. We quantify the performance headroom of tests by borrowing existing ideas from previous studies. Namely, when a set of equivalent programs is compiled by a compiler, all programs should aim to perform as well as the best-performing program. Additionally, when a program is compiled by a set of compilers, all compilers should aim to generate code that performs as well as the code generated by the best-performing compiler. We define metrics to evaluate compilers based on these ideas.
We used our system to evaluate four modern compilers -- Intel's ICC, GNU's GCC, the Portland Group Inc.'s PGI compiler, and Clang -- on how well they handle loop unrolling, loop interchange, and loop unroll-and-jam. Results suggest that ICC typically performs better than the other three compilers. On the other hand, our system also identified extreme outliers for ICC where, for example, one program becomes x180000 slower after unrolling a loop.
Due to the nature of random testing, we also study the methodologies required to achieve reproducible results by using statistical methods. We apply these methodologies to our compiler evaluation and provide evidence that our experiments are reproducible across different randomly generated collections of code segments
Automatic isolation of compiler errors
This paper describes a tool called vpoiso that was developed to automatically isolate errors in the vpo compiler system. The two general types of compiler errors isolated by this tool are optimization and nonoptimization errors. When isolating optimization errors, vpoiso relies on the vpo optimizer to identify sequences of changes, referred to as transformations, that result in semantically equivalent code and to provide the ability to stop performing improving (or unnecessary) transformations after a specified number have been performed. Acompilation of a typical program by vpo often results in thousands of improving transformations being performed. The vpoiso tool can automatically isolate the first improving transformation that causes incorrect output of the execution of the compiled program by using a binary search that varies the number of improving transformations performed. Not only is the illegal transformation automatically isolated, but vpoiso also identifies the location and instant the transformation is performed in vpo. Nonoptimization errors occur from problems in the front end, code generator, and necessary transformations in the optimizer. Ifanother compiler is available that can produce correct (but perhaps more inefficient) code, then vpoiso can isolate nonoptimization errors to a single function. Automatic isolation of compiler errors facilitates retargeting a compiler to a new machine, maintenance of the compiler, and supporting experimentation with new optimizations