Abstract-We present an automated symbolic verifier for checking the functional correctness of GPGPU kernels parametrically, for an arbitrary number of threads. Our tool PUG ♣ ❛ r ❛ checks the functional equivalence of a kernel and its optimized versions, helping debug errors introduced during memory coalescing and bank conflict elimination related optimizations. Key features of our work include: (1) a symbolic method to encode a comparative assertion across two kernel versions, and (2) techniques to overcome SMT solver restrictions through overapproximations, yielding an efficient bug-hunting method.
I. INTRODUCTION
There is an explosive growth of interest in Graphical Processing Units (GPU) for speeding up computations occurring at all application scales [11] . When properly programmed, GPUs can yield 20x to 100x performance compared to traditional CPUs. Often this requires heroic acts of programming: (i) keep the GPU threads busy; (ii) ensure coalesced data transfers from the GPU global memory to the GPU shared memory; and (iii) minimize bank conflicts during shared memory accesses. Unfortunately, bugs are frequently introduced during CUDA programming and optimization; and few tools are available to verify CUDA programs. In this paper, we present the first (to the best of our knowledge) parameterized reasoning method for GPU kernels.
GPU kernels are comprised of extremely light-weight Single Instruction Multiple Data (SIMD) threads that synchronize sparingly using barriers. These little resemble threads of C/Java that are heavy-weight, and synchronize using locks/monitors. In [13] , we introduce an SMT [22] based approach for analyzing GPU kernels through a new tool PUG that can handle kernels of thousands of lines of code -but for a fixed number (e.g., two or three) threads. It builds a symbolic model (as transition relation) according to the operational semantics of a kernel. PUG's main drawback is that it explodes in complexity when confronted with a growing number of threads during functional correctness checking (PUG often times out on four threads). This makes it very difficult to downscale a kernel and check it. While * Work done as part of the author's PhD dissertation at Utah modeling a small number (e.g., two) threads often suffices for race checking, it is almost always impossible to express functional correctness over such small thread populations.
In [14] we present a tool called GKLEE and a checking methodology that dramatically improves the capabilities in this area. GKLEE is the first concolic verifier and test generator for CUDA GPU programs. GKLEE detects several forms of data races, bank conflicts, non-coalesced memory accesses, deadlocks, and also reports thread/warp divergences accurately. GKLEE employs a new schedule generation method consisting of a canonical sequential schedule interlaced with SIMD execution within each thread warp. This avoids the exponentially of general schedule generation, and detects bugs without omissions or false alarms. GKLEE generates tests that guarantee code coverage, assisted by many new GPU-specific test minimization heuristics. GKLEE has found bugs and issues in CUDA SDK kernels, and can handle multi-kernel examples. However, GKLEE still does not offer parameterized verification capabilities, exceeding normally allocated amounts of computational resources on many small to medium examples at about 2K threads.
We show in this paper that taking a different approach to SMT-encoding than PUG or GKLEE can result in a practically feasible parameterized verification approach. We show that for many kernels, this method (called PUG ✁ ✂ ✁ ) vastly outperforms our previous methods. In our new parameterized approach, only one (parameterized) thread is modeled. Our tool PUG ✁ ✂ ✁ based on this approach tracks how data flows through the threads in consecutive computational rounds. Over these rounds, it symbolically reasons about the possible values of shared variables contributed by all threads. From one perspective, it implicitly implements the Omega Test [20] using SMT techniques. When this checking approach applies to a kernel, it is sound (no false alarms will be reported). We also propose an over-approximation approach to combat the capacity limits of SMT solvers, in order to locate bugs quickly.
One of the main applications of our method is to check the equivalence of a kernel and its optimized version. This parameterized method is particularly suitable for handling typical optimizations for CUDA kernels such as memory coalescing and bank conflict elimination (which often preserve the loop structures).
In essence, our parameterized method makes full use of the symmetric nature of SIMD computations. In the SIMD model, within each (synchronized) step, each thread performs similar computations on different data. Thus, the behavior of all the threads can be inferred by investigating one arbitrary thread. After one execution round, the threads exchange data and go the next round. Modeling such exchanges by an arbitrary number of threads is challenging; we rely on symbolic matching and SMT solving to address this problem in this paper.
We organize the paper by first presenting the generic, nonparameterized approach extended from [13] , then present the parameterized approach. We then compare the performance of these approaches on realistic CUDA programs.
II. BACKGROUND AND MOTIVATING EXAMPLES
A CUDA kernel is launched as an 1D or 2D grid of thread blocks. The total size of a 2D grid is gridDim.x ✂ gridDim.y. The coordinates of a (thread) block are ❤ blockIdx.x, blockIdx.y✐. The dimensions of each thread block are blockDim.x and blockDim.y. Each block contains blockDim.x blockDim.y threads, each with coordinates ❤ threadIdx.x, threadIdx.y✐. These threads can share information via shared memory, and synchronize via barriers ( syncthreads()). Threads belonging to distinct blocks must use the much slower global memory to communicate, and may not synchronize using barriers.
The values of gridDim and blockDim determines the configuration of the system, e.g. the sizes of the grid and each block. For a thread, blockIdx and threadIdx give its block index in the grid and its thread index in the block respectively. For brevity, we use
, and
always hold. Consider the following simple example with 2D blocks, which is simplified from the "transpose" kernel in CUDA SDK 2.0 [7] . Note that a variable with modifier shared is "global" for all threads within a block; and private variables have no modifiers.
void naiveTranspose (int * odata, int * idata, int width, int height) { int xIndex = bid.x * bdim.x + tid.x; int yIndex = bid.y * bdim.y + tid.y; if (xIndex < width && yIndex < height) { int index_in = xIndex + width * yIndex; int index_out = yIndex + height * xIndex;
// for the post-condition postcond(i < width && j < height => odata[i * height + j] == idata[j * width + i]); }
The threads transpose the array in parallel: each thread reads
and writes it to
. The functional correctness of this kernel is specified in the postcondition: the element at location
. This property should hold for all valid configurations as well as all possible input values.
This naïve kernel suffers from non-coalesced writes, and can be more than 10x slower than the following optimized kernel for large matrices. The kernel below is optimized to ensure that all global reads and writes are coalesced, and avoids bank conflicts in the shared memory. The computations between two consecutive barriers constitute a barrier interval (BI) We may use the same post-condition as before to specify the functional correctness of this optimized kernel. Moreover, the equivalence of these two kernels can be specified as: suppose the two kernels take the same inputs, i.e. the same
, then after execution they produce the same outputs (in
) for all possible configurations. The main challenge here is to show this for any number of threads and any input value.
A. Related Work
Verification of CUDA Kernels. Traditional testing methods are ineffective at producing guarantees because they assume concrete input values as well as a fixed numbers of threads, and examine only those concurrency schedules produced by the execution environment. Past efforts in thread verification have focused on multi-threaded programs synchronizing using locks and semaphores [9] . These methods are inapplicable for GPU kernels. Our work is tailored for CUDA which is very widely used; it will easily apply to emerging standards (e.g., OpenCL [16] ).
There are only few GPU-specific checkers reported in the past. GKLEE [14] builds a virtual machine (VM) modeling the thread computations on the GPU. When a GPU program is executed in the VM, the tool checks deadlocks, several forms of data races, and performance bugs (e.g. bank conflicts, non-coalesced memory accesses, thread/warp divergences). The execution in the VM considers only a fixed numbers of threads; hence only a portion of valid configurations are examined. Moreover, GKLEE executions often exceed memory/time limits on many small to medium examples containing non-trivial branches. For example, the BitonicSort kernel (of about 50 lines of code) will cause blow-up when the thread number is greater than 8.
As far as race checking and bank conflict detection goes, the techniques used in PUG can easily accommodate the use of symbolic thread identifiers. However, these straightforward extensions do now work for functional equivalence checking.
The KLEE-FP tool [6] handles OpenCL code. Its main use is in crosschecking OpenCL code against an initial scalar sequential version, and also for race detection in such code. Its approach to floating-point equivalence is based on expression normalization. KLEE-FP is not a parameterized checker, however. Its floating-point reasoning methods can be incorporated into PUG ✁ ✂ ✁ which currently lacks the ability to handle float numbers.
Parameterized Verification. There are abstraction based techniques [19] , [5] that help reduce the problem of verifying parameterized systems with infinite states to that of checking corresponding finite-state abstractions. The abstraction methods employed include counter abstraction [19] which helps abstract process identities, or environmental abstraction [5] , which provides an abstract counting method for the number of processes satisfying a given predicate. These techniques require manual effort to obtain the abstractions.
They also do not directly apply to GPUs.
There are efforts [1] , [18] that apply automatic induction to generate and verify invariants pertaining to parameterized systems. In most cases, manual effort is required to obtain the invariants, although Pnueli et al. [18] presented a way to automatically compute invariants given an appropriate abstraction relation. None of these methods consider CUDAstyle computations.
The reduction from infinite states to equivalent finite states in these works is based on finding an appropriate cut-off ❦ of the parameter of the system. The goal is to establish that a property is satisfied by ❦ processes if and only if it is satisfied by any number (❃ ❦ ) of processes. For example, [10] proposes tighter bounds of cut-off for parameterized systems, independent of the communication topology. In contrast, our technique considers only one parameterized thread and does not require symmetry reduction.
Equivalence Checking. Many approaches have been proposed for checking the equivalence of two sequential programs. For instance, equivalence checkers [21] , [24] perform a dependence graph abstraction of programs containing only affine loops. The basic idea is to use the Omega test to check whether the relations depicting the dependence graphs are equal. Unfortunately, the Omega test approach supports only linear arithmetic. Since these works do not exploit decision procedures, they are unable to handle many arithmetic transformations. They can only deal with programs with high similarities.
TVOC [2] first verifies loop transformations using a specific proof rule called Permute, and then verifies structurepreserving optimizations. It relies on extra information supplied by the compiler to generate verification conditions which are fed to an SMT solver for satisfiability checking. Zaks and Pnueli [26] also used SMT solving to verify structure-preserving optimizations. Their verifier attempts to find invariants connecting the models of the two programs. However, it is difficult to identify sufficiently precise invariants for non-trivial optimizations. Also, these checkers can handle only sequential programs.
An equivalence checking method for CUDA kernels is discussed in [23] . It makes many assumptions and restrictions on the input programs and is not parameterized. No implementation of this method is reported.
III. SMT ENCODING AND NON-PARAMETERIZED CHECKING
Although CUDA kernels are concurrent programs executing in parallel, CUDA programmers often intend to write deterministic programs whose final results are independent of the concurrent schedule. We have presented a static checker [13] and a symbolic executor [14] to determine whether a program is deterministic; and also proved that a deterministic program could be serialized such that the accesses on shared variables are executed in a sequential order (i.e. the canonical schedule). In this section we give a different (and slightly better) order to sequentialize the shared variable accesses. Unlike the one in [13] , this order does not use symbolic arrays to support schedule ids and requires only local information of the accesses. This encoding serves as the basis for comparing the parameterized method and the non-parameterized one.
A. Encoding Sequential Structures
Basic Statements. Our encoding assigns SSA indices to variables. Specifically, the following translation function constructs a logical formula from single statements and expressions, where . Note that a write to an array variable is actually modeled as an array update. We also give below a simple example of applying .
Branches. The SSA indices of the variables updated in the two clauses of a conditional statement "if 
The natural order generates the following constraint.
Formally, the combined transition system for
We give below the model of the naiveTranpose kernel. Each thread has a private copy of local variables such as
. They are referred to by
. Similarly we can obtain the model
for the optimized kernel (we use ⑨ and ❷ to refer to the source (naive) and target (optimized) kernel respectively). The encoding of the post-condition is trivial and not shown here.
Equivalence Checking Given the models TRANS 
Unfortunately, an SMT solver is unable to handle this quantified formula since the definition of TRANS is recursive over the number of threads and the solver requires a concrete ✠ to unroll the recursion. This also forbids using induction (e.g. k-induction [17] ) to to perform the proof. Moreover, the fact that our translation conjoins the models of ✠ threads will make SMT solving quite complex (and of course it would not lead to a parametric approach). The Assertion Language (for Property Checking) In addition to equivalence checking, PUG ♣ ❛ r ❛ also checks properties specified as assertions (e.g. in the postconditions). Our assertion language supports the definition of Boolean formulas using C syntax. One of the main features of this assertion language is that it allows the definition of loops, handling recursive properties and variables with symbolic values. For instance, consider a reduction kernel which computes the sum of the elements in the input array
and stores this sum in
. A suitable post-condition specifying functional correctness is the following, where ✠ is the number of elements in
In some cases, functional correctness can be specified recursively. Consider a scan kernel which computes the parallel prefix sum of the input elements. We show below a suitable postcondition.
IV. PARAMETERIZED CHECKING
This section describes how to perform parameterized encoding. The key is to calculate the value of an output element regardless of the number of threads.
A. Single Conditional Assignment
Our method builds a symbolic model according to the accesses on shared arrays. We first present a method which eliminates all the intermediate variables so that only the accesses on shared arrays are left (an optimization is presented in Section IV-C). For example, the body of the naiveTranspose contains a conditional assignment (CA) to
as follows. This can be interpreted by an mapping from
, the destination address
and the source address
respectively. This CA can be denoted as
, where
are called the range and domain of the CA respectively. Now consider the ❦ ✜ ✢ element in the output array,
. Its value comes from either (1)
and the guard holds (there is only one such s ✣ since no race occurs on ✡ ❞ ☛ t ☛ ); or (2) the old value of
. For brevity we write
. The diagram in Figure 1 indicates how
is investigated, and so on. Here we use the "xor" operator 
Therefore, we can build an SMT constraint considering only one thread (with symbolic ID s ✣
).
Note that, for a given 
. Unfortunately, existing SMT solvers often fail to handle quantified formulas (they return an inconclusive answer "unknown"). To overcome this limitation and make sure that our verifier gives conclusive answers, we derive unquantified formulas from the quantified ones and use them as the constraints. From the first formula we can derive
for a fresh variable s ✣ , which indicates that, for any
. The absence of conflicts enables us to eliminate the Figure 1 . Calculating CAs over multiple threads.
We call the derived formulas Verification Conditions. Section IV-D presents additional details of our technique.
B. Instantiation of Conditional Assignments (CA)
Now consider a more complicated case where an expression contains multiple instances of a shared variable. For example,
, where op is a binary operator, reads variable . In other words,
. For the second read
, note that we cannot use the same for the thread writing to
. These two formulas connect expression to that on
For instance, consider the optimized Transpose kernel. Let
respectively. This kernel contains two CAs: The value of the . That is, it can be obtained by the sequential composition of the two CAs. We instantiate the tids in the first and second assignments to be 
and the CA's domain
We may dig deeper into these formulas. Suppose , then the above two formulas become
If each block of threads is a square such that
, then we can derive the following formula justifying the correctness of the optimized kernel -the input array is correctly transposed no matter how many threads are considered. Note that this kernel is designed with implicit assumptions that (1) each block is square; and (2) only those threads with tid
should participate in the computation. In fact, our encoding models exactly this design and also helps reveal hidden assumptions. For example, PUG
reports a bug when the block is not square by relying on the following check for valid configurations:
Equivalence Checking. The equivalence of the two example kernels requires
provided that all above constraints hold and
. Note that only
is instantiated only once for each kernel. We need to reducing the checking on
to that on the elements in the input array
C. Barrier Interval and Control Flow
The statements between two consecutive barriers are within a Barrier Interval (BI). Since there are no conflicts within a BI, writes to the same shared variable will not fall on the same address. We may use this fact to simplify the generated constraints. Consider the following diagram where BI 1 contains multiple writes to ✈ and in BI 2 property Figure 2 . Instantiation of conditional assignments.
The non-conflicting assumption indicates that at most one
. Thus instead of writing a pair of constraints for each CA, we can combine all the CA constraints to be an embedded ite expression (here
's value right before BI 1 and BI 2 respectively). The main benefit is now we have only one quantified formula rather than ✑ formulae. In some cases (e.g. the two Transpose kernels) the quantified formula is not needed at all because
's value comes from one of the writes in BI 1.
A further optimization we employed is to keep the control flow of the BI and not eliminate all intermediate variables.
The program below (the left column) contains two conditional jumps. Instead of flattening this program to generate three CAs:
, we keep this control flow structures and generate the constraint as shown on the right. This representation, which mimics those in Section III, reduces substantially the size of the constraints and make them much more readable.
D. Quantified Formulas
We attempt to convert a quantified formula into an equivalent quantifier-free formula whenever possible. One quantified formulas we encountered so far is of the following format, where t is the thread id with domain 
We introduce a function
returns the address here). 
It is not hard to see that the ✪ functions for the two Transpose kernels are increasing and their quantified formulas can be converted in this manner. In fact, under valid configurations (e.g. the block is of square size), their spaces ❙ are continuous over the thread IDs, thus the quantified formulas will never be used and can be safely removed.
Fast Bug Hunting. If quantifier elimination is impossible, then we can further loosen the requirement of proving the properties. Our goal then would be to locate property violations quickly by ignoring the quantified formula.
Consider the following sequence. Even the quantified formulas are nonconvertible, we know conclusively that
should be true if both
hold. Thus any violation of the predicate
is able to find such bugs fast.
Coverage. Our approach may be criticized on many grounds. For example, our parameterized method employs under-approximation because of the inability of solvers to handle quantified formulas. Yet our encoding ensures that all (conditional) assignments are covered. With our proposed quantifier elimination technique and the techniques described in Section IV-C, all combinations (conjunctions) of the CAs in different BIs are encoded. We believe that in practice PUG ♣ ❛ r ❛ will miss few bugs.
Contrast with Omega
Tests. An Omega Test [20] based approach may be used to match the address of a read and the range of a CA by building a relation (over the thread IDs) from the address to the range
The main advantage of this approach is that it won't generate quantified formulas. However, Omega Tests only support linear expressions while non-linear expressions are prevalent in CUDA kernels (e.g. in the two Transpose kernels). Our SMT-based method can be regarded as an alternative to Omega Tests to handle non-linear expressions (of particular types).
E. Loops
Our method works better for kernels containing no loops. For kernels with loops, a naïve solution is to fully unroll the loop. However, loop unrolling may not scale, especially with nested loops. Also, loop bounds may involve symbolic values, making it impossible to perform loop unrolling without assigning concrete values to relevant inputs. Our solution is to align the loops or down-size the iteration space.
The loop problem becomes much less severe in equivalence checking. Typical CUDA optimizations often preserve the loop structures of the source kernel such that we may just need to compare the bodies of the loops. A similar assumption is made in [26] . For example, we can optimize the following loop where Since the operator ✰ in the body is commutative and associative, the two loop headers can be normalized to be the same. Then the two respective CAs are as follows; on them, the equivalence checking approach discussed in previous sections can be used:
When the loop alignment fails, we unroll the loops fully (to the given bounds). This happens when optimizations other than memory coalescing and bank conflict elimination are applied. We plan to port the method in [2] to deal with other typical loop transformations.
A notion of program products, cross-products, is proposed in [26] . The compiler correctness verification is reduced to checking a single program which synchronizes the original and transformed programs. However, the restriction of this cross-product based approach to structurally similar programs limits its applicability to structure-preserving transformations. Another line of work [3] extended this method to handle more asynchronous programs by (manually) setting up the synchronous points and inserting invariants into the product program. Our method is similar to these two works in the sense that we also need to align the loops to obtain structurally similar programs.
It is still a non-trivial challenge to parametrically verify structurally dissimilar sequential programs, and optimizations not conforming to specific patterns. We believe that we can overcome some of the limitations of our approach by introducing a richer set of inference rules (e.g. for complicated loop optimizations). Typical transformation rules can be verified once and for all, or over each execution [15] , [12] . Symmetry Reduction. In many cases, loop bounds in a CUDA kernel depends on the size of a block. As many CUDA kernels are designed to run on arbitrarily-sized blocks [8] , one can expect to be able to reduce block sizes to a reasonable value before running PUG ♣ ❛ r ❛ . Currently, such downscaling is done manually. We plan to develop an automatic symmetry reduction approach to identify, for a property ☛ , the minimum number of threads ♥ for which ☛ should be checked. We believe that parameterized analysis and SMT-based analysis can help determine ♥ in practice.
V. EXPERIMENTAL RESULTS
PUG ♣ ❛ r ❛ uses Z3 [25] as the SMT solver. Z3's expressions are based on bit vectors (bounded integers); thus the solving time depends on the number of bits.
We performed experiments on a laptop with an Intel Core(TM)2 Duo 1.60GHz processor and 2GB memory to check some representative kernels in CUDA SDK 2.0 Suite [7] , each of which contains both unoptimized and optimized kernels. Table II sensitive to the sizes of the bit-vectors. This may cause even the parameterized method to time-out. In this case, we must concretize some of the symbolic variables (i.e. give them concrete values, indicated by the "+C." flag) and then compare the results.
Our testing addresses two kinds of bugs. The first kind is due to incorrect configurations for running kernels: for example, using a non-square block (for the Tranpose kernel); or, using a value of ACCN that is not a power of 2 (in the Scalar Product kernel). PUG ♣ ❛ r ❛ is able to reveal violations of these assumptions. The second class of bugs is those intentionally introduced within correct kernels, e.g. by modifying the addresses of accesses on shared variables or the guards of conditional statements. has checked more kernels than shown in above tables, some of which come from GPGPU programming classes. Although they are small-medium sized programs (typically 50-200 lines of code), it still is non-trivial to verify some of these highly optimized parallel programs. Furthermore, even for a small kernel, loop unrolling often results in many CAs to be checked. It is encouraging that PUG ♣ ❛ r ❛ was able to identify bugs within few seconds on some of these kernels.
VI. CONCLUDING REMARKS
In this paper, we detailed several directions for developing parameterized equivalence verification methods for GPU programs. We then presented specific details pertaining to , we have obtained many encouraging preliminary results on several small-to-medium real-world kernels. As to our future work, many extensions are planned. In addition to finding better ways to handle quantified formulas and non-trivial loop transformations, we plan to extend PUG ♣ ❛ r ❛ to deal with more complicated programs, and incorporate it into our symbolic executor GKLEE. We also plan to extend PUG ♣ ❛ r ❛
to support floatingpoint numbers.
