Automated testing for GPU kernels

Abstract

Graphics Processing Units (GPUs) are massively parallel processors that offer performance and energy efficiency unmatched by conventional CPUs. These advantages, along with recent advances in GPU programmability, have led to their wide adoption across general-purpose computing domains. This, in turn, makes testing GPU kernels critical to ensure that their behaviour meets the requirements of the design and specification. Despite the advances in programmability, GPU kernels remain hard to write and analyse owing to complex memory-sharing patterns, strided memory accesses, implicit synchronisation, and the combinatorial explosion of thread interleavings. The few existing techniques for testing GPU kernels rely on symbolic execution for test generation, which incurs high overhead, scales poorly and does not handle all data types. In this thesis, we present novel approaches to measure test effectiveness and to generate tests automatically for GPU kernels. To achieve this, we address significant challenges arising from the GPU execution and memory model and from the lack of customised thread scheduling and global synchronisation. We make the following contributions.

First, we present CLTestCheck, a framework for assessing the quality of test suites developed for GPU kernels. The framework measures code coverage using three coverage metrics inspired by faults found in real kernel code. It also measures the fault-finding capability of a test suite by seeding different types of faults into the kernel and reporting a mutation score, the ratio of detected seeded faults to the total number of seeded faults.

Second, with the goal of being fast, effective and scalable, we propose CLFuzz, a test generation technique for GPU kernels that combines mutation-based fuzzing for fast test generation with selective SMT solving to cover branches that fuzzing fails to reach. Fuzz testing of GPU kernels has not been explored previously. Our fuzzer randomly mutates kernel argument values with the goal of increasing branch coverage and supports GPU-specific data types such as images. When random mutation fails to increase branch coverage, we gather path constraints for the uncovered branch conditions, add constraints that capture the GPU execution context, such as the number of threads and the work-group size, and invoke the Z3 constraint solver to generate tests for them.

Finally, to help uncover inter work-group data races and replay these bugs under fixed work-group schedules, we present CLSchedule, a schedule amplifier that simulates multiple work-group schedules under which each generated test is executed. By reimplementing the OpenCL API, CLSchedule executes a kernel with a fixed work-group schedule rather than the default arbitrary schedule. It also executes the kernel directly, without requiring the developer to manually provide boilerplate host code. A small kernel illustrating the kinds of bugs these tools target is sketched below.
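The following hypothetical OpenCL kernel (written purely for illustration; it is not drawn from the thesis or its benchmark suites, and the name `illustrate_race` is invented) shows the two challenges in miniature: a branch guarded by a magic constant, which random input mutation is unlikely to cover and which would therefore be handed to a constraint solver under an approach like the one described above, and an unsynchronised write shared across work-groups, i.e. the kind of inter work-group data race that schedule amplification aims to expose.

```c
/* Hypothetical kernel for illustration only; not taken from the
 * benchmarks studied in the thesis. */
__kernel void illustrate_race(__global const uint *in,
                              __global uint *result)
{
    size_t gid = get_global_id(0);

    /* Magic-value branch: random mutation of the input buffer is
     * unlikely to produce this exact value, so covering the branch
     * calls for constraint solving over the branch condition and the
     * GPU execution context (thread id range, work-group size). */
    if (in[gid] == 0xDEADBEEFu) {
        /* Inter work-group data race: work-items in different
         * work-groups may write result[0] concurrently, and barriers
         * only synchronise work-items within a single work-group, so
         * the final value depends on the work-group schedule. */
        result[0] = (uint)gid;
    }
}
```

Whether such a race manifests depends on which work-group happens to run last under a given schedule, which is why replaying tests under several fixed schedules can be more revealing than relying on the runtime's default arbitrary schedule.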
The outcome of our research can be summarised as follows. (1) CLTestCheck was applied to 82 publicly available GPU kernels from industry-standard benchmark suites, together with their test suites. The experiments show that CLTestCheck automatically measures the effectiveness of test suites in terms of code coverage and fault-finding capability, and reveals data races in real OpenCL kernels. (2) CLFuzz automatically generates tests that achieve close to 100% coverage and mutation score for the majority of a data set of 217 GPU kernels collected from open-source projects and industry-standard benchmarks. (3) CLSchedule explores the effect of work-group schedules on the 217 GPU kernels and uncovers data races in 21 of them.

The techniques developed in this thesis demonstrate that the effectiveness of tests developed for GPU kernels can be measured with our coverage criteria and fault-seeding methods; the results are useful for highlighting code portions that need further attention from developers. Our automated test generation and work-group scheduling approaches are also fast, effective and scalable, incurring a small overhead (0.8 seconds on average) and scaling to large kernels with complex data structures.
