Power consumption is a prevalent issue in current and future computing systems. SIMD processors amortize the power consumption of managing the instruction stream by executing the same instruction in parallel on multiple data. Therefore, in the past years, the SIMD width has steadily increased, and it is not unlikely that it will continue to do so. In this article, we experimentally study the influence of the SIMD width on the execution of data-parallel programs. We investigate how an increasing SIMD width (up to 1024) influences control-flow divergence and memory-access divergence, and how well techniques to mitigate them will work on larger SIMD widths. We perform our study on 76 OpenCL applications and show that a group of programs scales well up to SIMD width 1024, whereas another group of programs increasingly suffers from control-flow divergence. For those programs, thread regrouping techniques may become increasingly important for larger SIMD widths. We show what average speedups can be expected when increasing the SIMD width. For example, when switching from scalar execution to SIMD width 64, one can expect a speedup of 60.11, which increases to 62.46 when using thread regrouping. We also analyze the frequency of regular (uniform, consecutive) memory access patterns and observe a monotonic decrease of regular memory accesses from 82.6% at SIMD width 4 to 43.1% at SIMD width 1024.
INTRODUCTION
Data-parallel languages such as OpenCL or CUDA are a common choice for programming massively parallel compute devices. The underlying execution model is called single program multiple data (SPMD): a single program (also called kernel) is instantiated multiple times as a so-called work item. Every work item usually processes its own piece of input data. Apart from barrier synchronization, no order is imposed on the execution of work items. Current hardware and software exploit this parallelism at different granularities: usually, several work items (also called threads) are organized into a work group (also called block). These are broken down into several SIMD groups (also called warps) that are executed by a SIMD processor. Such a processor executes a single instruction on multiple input values (single instruction multiple data). A compute device may have multiple SIMD processors to execute SIMD groups in parallel. The size of a SIMD group, the SIMD width, is the number of work items that a SIMD processor can execute in parallel. It is a crucial parameter for the performance potential as well as the power consumption of a processor.

This work is part of the ECOUSS project and has been funded by the German Federal Ministry of Education and Research (BMBF) and the Intel Visual Computing Institute Saarbrücken. Authors' address: T. Schaub, S. Moll, R. Karrenberg, and S. Hack, Computer Science Department, Saarland University.
Data-Parallel Programs and Divergence
In this article, we study the influence of the SIMD width on control-flow and memory divergence of data-parallel programs. Data-parallel programs are executed on SIMD hardware in lockstep: every operation is executed for the entire SIMD group of work items before moving to the next operation. SIMD execution is essentially a power optimization. All work items in a SIMD group can share the same logic for managing the instruction stream, such as the instruction decoder and the pipeline logic. Because power consumption is the limiting factor in current and future computing systems, wider SIMD processors are imaginable. However, it is unclear how wider SIMD processors affect the performance of typical workloads. Does hardware scale efficiently when using wider SIMD units, or is it necessary to limit the SIMD width and increase the number of cores, which is most likely more expensive? In particular, the effect of wider SIMD on control-flow and memory divergence, the main reasons why several data-parallel workloads do not scale well, has not been studied.
Control-flow divergence. If multiple work items are executed in SIMD fashion, work items that execute different code paths have to be partially stalled, resulting in reduced utilization of the SIMD units and a potentially longer execution time of the kernel. Consider the example in Figure 1. Assume that the work group has four items that all take different execution paths. On a single, scalar processor, the overall latency of the whole work group is the sum of the lengths of the traces; here, 5 + 5 + 8 + 8 = 26. Executing the group on a SIMD processor (Figure 2) with SIMD width 2 gives a latency of 6 + 9 = 15 because two items are executed in parallel. Finally, a SIMD processor of width 4 executes the group with latency 10. Although the latency went down by increasing the SIMD width, the corresponding speedup is sublinear: 26/15 ≈ 1.73 instead of 2, and 26/10 = 2.6 instead of 4.

Memory-access divergence. Assume that all work items in a SIMD group of size w execute a load instruction. In the worst case, all w addresses are scattered in the address space, which results in w data fetches from the memory. This sequentializes the execution of the load instructions. However, if the addresses are uniform (i.e., all equal) or consecutive, only a single load has to be issued. Scatter/gather units of GPUs (and modern CPUs) usually coalesce neighboring memory accesses into fewer fetches to mitigate this problem. Many CPUs, however, do not have scatter/gather units. On those machines, it is the job of the compiler to prove uniformity/consecutiveness by static analysis and/or to insert code variants that test for this property dynamically.
Several compiler techniques have been proposed to mitigate control-flow and memory-access divergence. Control-flow divergence is tackled by static analyses that try to prove that a branch does not diverge based on the values that flow into the branch condition [Coutinho et al. 2011; Karrenberg and Hack 2012]. If the analysis fails to prove a branch nondivergent, the compiler may insert runtime tests that will bypass code regions if a nondivergent branch is detected [Shin 2007]. Furthermore, several techniques have been proposed to reorganize SIMD groups (in both software and hardware) at certain points during the runtime of the program [Wald 2011; Fung et al. 2007; Sartori and Kumar 2012; Han and Abdelrahman 2011; Horn 2005; Billeter et al. 2009]. The goal of these compaction techniques is to create SIMD groups from work items that follow the same control-flow path to avoid divergence. Compaction can mitigate divergence effectively but can have a significant overhead caused by the reorganization.
Memory divergence is usually handled by the hardware directly (by coalescing) or by dynamic tests that the compiler inserts. For example, an OpenCL driver that targets the SSE instruction set can check if the addresses of a load in a SIMD group are consecutive. In that case, the load can be handled by a single instruction. Otherwise, all values have to be loaded separately and be packed into a vector register, which causes significant overhead.
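The dynamic test described above can be sketched as follows. This is a minimal illustration in Python rather than actual SSE code; the function name `simd_load`, the list-based memory model, and the word-granular addresses are assumptions made for the example and are not taken from any OpenCL driver.

```python
def simd_load(memory, addrs):
    """Load one value per work item; return (values, issued_loads).

    memory: list modelling word-addressed global memory (assumption).
    addrs:  one address per work item of the SIMD group, in lane order.
    """
    base = addrs[0]
    # Dynamic test: are the addresses base, base+1, ..., base+w-1?
    if all(a == base + lane for lane, a in enumerate(addrs)):
        # Fast variant: a single vector load covers the whole group.
        return memory[base:base + len(addrs)], 1
    # Slow variant: one scalar load per work item, then pack the results.
    return [memory[a] for a in addrs], len(addrs)


memory = list(range(100, 116))
print(simd_load(memory, [4, 5, 6, 7]))  # consecutive: one load
print(simd_load(memory, [7, 0, 3, 3]))  # scattered: four loads
```

The second return value only counts loads; the cost of the test itself and of packing the vector register is not modelled here.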
Contributions. The contributions this article makes are twofold. First, we assess the impact of increasing the SIMD width on control-flow and memory divergence of a wide range of existing, real-world OpenCL workloads. In total, we examine 76 benchmarks with 105 kernels and 17,677 inputs with respect to control-flow divergence, memory-access divergence, and compaction by means of a novel, hardware-independent dynamic program analysis. Second, we investigate the effectiveness of existing compiler techniques (compaction and code variants) and static analyses for data-parallel programs in the context of growing SIMD widths. Our key findings are the following:

- A majority of the workloads we examine scale well up to higher SIMD widths. A vast majority of the executed kernel runs do not exhibit more than three different control-flow paths per SIMD group. Thus, control-flow divergence "saturates" already at lower SIMD widths. The other kernels do not scale well. Although a relative speedup of $1 + \delta$ for $\delta < 1$ from width $w$ to width $2w$ seems acceptable at first glance, this mediocre speedup accumulates to only $(1+\delta)^k$ when going from sequential to $2^k$-wide SIMD. For example, if a kernel has a relative speedup of 1.8, its absolute speedup from sequential to 256-wide SIMD is only 110×.
- For increasing SIMD widths, stragglers become a significant problem. Stragglers are a minority of work items that iterate loops significantly longer than the other work items in their SIMD group. The larger the SIMD group, the worse the impact of stragglers on the utilization of the SIMD lanes.
- Compaction techniques can compensate for control-flow divergence effectively. However, for several benchmarks, we observed that even very aggressive compaction approaches start to degrade for higher SIMD widths. In the worst case, these approaches can increase the latency of the kernel run. Whether powerful compaction can be implemented efficiently is a different question. We find that as a consequence of the performance degradation for high SIMD widths, smaller SIMD widths may spend more time on compaction than higher ones.
- Memory access patterns become less regular as the SIMD width increases: our benchmarks experience fewer uniform, consecutive, and coalescable memory accesses for high SIMD widths. Furthermore, the number of memory accesses that could be classified by static divergence analyses decreases as well. Moreover, we find that dynamic code variants will be less profitable with increasing SIMD widths.

Fig. 3. Overview of the interplay between the OpenCL driver and Diana. The driver instruments the kernel IR. During execution, the driver produces the expected output as well as an execution trace. The trace is analyzed by Diana postmortem.
The rest of the article is structured as follows. Section 2 briefly describes our analysis framework, as well as the features that we extract from the programs studied in the core of the article. Section 3 presents a detailed empirical evaluation of control-flow and memory divergence for varying SIMD widths of a large set of OpenCL programs. Section 4 relates this work to previous work, and Section 5 concludes.
DYNAMIC ANALYSIS FRAMEWORK
We present a framework to analyze the dynamic behavior of OpenCL kernels on SIMD architectures, which is depicted in Figure 3. It consists of two phases. First, a modified OpenCL driver instruments the kernel code to record execution events. The execution events are starting and finishing a work item, entering a basic block, reaching and passing a barrier, and accessing global memory. The instrumented kernel is then executed in scalar mode (i.e., SIMD width 1). From the recorded events, a tool called Diana (Divergence analyzer) derives properties of a SIMD execution of the kernel for a given SIMD width of w. To this end, Diana processes the work items in the same order as the modified OpenCL driver. Vectorization, the formation of SIMD groups, is done by grouping consecutive work items.
In general, when executing a kernel in SIMD fashion, all paths starting at conditional branches have to be executed. These divergent paths reconverge at the immediate post dominator [Fung et al. 2007] . Diana simulates linearized lockstep execution of a set T of traces by computing a super trace with the algorithm shown in Figure 4 : each trace t ∈ T starts at the same entry block. The progress vector p reflects how many steps each trace has performed. We simulate a reconvergence stack and operate on the top-most stack entry. A stack entry determines which traces are currently active and at which block these traces reconverge. In each step, we determine which of the active traces executes which block next. If the traces try to execute different blocks, we push one stack entry per successor and set their reconvergence point to the immediate post dominator. If all traces execute the same block, we check if we have reached the reconvergence point. If so, we pop the current stack entry. If not, we increase the progress vector by one for the respective traces. Finally, if there is no successor block, execution is finished.
As an example, consider building a super trace from the traces given in Figure 1 . Blocks a and b are executed sequentially in lockstep. Then, work item 2 continues with block d while the rest jump to block c before control flow reconverges at block e. This pushes two new stack entries, which in turn execute c and d. Right after e, the threads diverge again. The first new stack entry immediately reconverges, because work items 1 and 2 jump to post dominator f right away. The second stack entry diverges itself and ultimately yields b c d e. Thus, the final super trace is a b c d e b c d e f.
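The super trace construction can be sketched in a few lines of Python. This is an illustrative approximation, not Diana's implementation: it uses recursion in place of the explicit reconvergence stack, takes the immediate post dominators as a hand-written table `IPDOM`, and assumes that all traces start at the same entry block and end at the same exit block. The four traces in `TRACES` are reconstructed from the prose description of Figure 1 and the trace lengths 5, 5, 8, and 8, so they are an assumption as well.

```python
def super_trace(traces, ipdom):
    """Simulate lockstep execution of `traces`; return the executed blocks."""
    pos = [0] * len(traces)  # progress vector: next index per trace
    out = []

    def run(active, stop):
        # Execute `active` traces until each has reached `stop` or its end.
        while True:
            nxt = {}  # next block -> traces that want to execute it
            for i in active:
                if pos[i] < len(traces[i]) and traces[i][pos[i]] != stop:
                    nxt.setdefault(traces[i][pos[i]], []).append(i)
            if not nxt:
                return  # everybody waits at the reconvergence point
            if len(nxt) == 1:
                block, items = next(iter(nxt.items()))
                out.append(block)  # one SIMD step for this block
                for i in items:
                    pos[i] += 1
            else:
                # Divergence: run every successor up to the immediate
                # post dominator of the block that was executed last.
                some = next(iter(nxt.values()))[0]
                branch = traces[some][pos[some] - 1]
                for block in sorted(nxt):
                    run(nxt[block], ipdom[branch])

    run(list(range(len(traces))), None)
    return out


def simd_latency(traces, w, ipdom):
    """Sum of super trace lengths over groups of w consecutive traces."""
    return sum(len(super_trace(traces[g:g + w], ipdom))
               for g in range(0, len(traces), w))


# Reconstructed from the description of Figure 1 (assumption).
TRACES = ["abcef", "abdef", "abcebcef", "abcebdef"]
IPDOM = {"a": "b", "b": "e", "c": "e", "d": "e", "e": "f"}

print("".join(super_trace(TRACES, IPDOM)))                  # abcdebcdef
print([simd_latency(TRACES, w, IPDOM) for w in (1, 2, 4)])  # [26, 15, 10]
```

Under these assumptions, the sketch reproduces the super trace a b c d e b c d e f as well as the latencies 26, 15, and 10 of the introductory example.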
Using this simulation technique, Diana derives properties of the kernel's SIMD execution related to control flow (Section 2.1) and memory accesses (Section 2.2), which we detail in the remainder of this section.
Control-Flow Divergence Analysis
The control-flow divergence analysis assesses the impact of control-flow divergence on the latency of the kernel and the utilization of the SIMD units of a given width $w$. We measure the latency of a SIMD group while executing some kernel by counting the number of executed SIMD instructions $I_w$. Note that $I_1$ is the instruction count for scalar execution. For SIMD width $w$, Diana simulates lockstep execution by partitioning all traces into groups of size $w$ and creating super traces for each group. The latency of the kernel is then the sum of the latencies of all of its groups. Note that Diana only considers a single SIMD processor because the effects of control-flow and memory divergence are independent of the number of SIMD processors.
The utilization $U_w$ of SIMD execution reflects the utilization of the SIMD lanes during execution. Values less than 1 are caused by control-flow divergence. To compute it, we have to relate the latency $I_w$ to the latency of a SIMD execution that does not suffer from control-flow divergence. In the case of executing $k \le w$ work items on a $w$-wide SIMD processor, determining the shortest SIMD execution corresponds to the shortest common supersequence problem, which is NP-complete in terms of $k$ [Maier 1978]. However, to the best of our knowledge, it is unknown whether the problem remains NP-complete when we allow for compaction (i.e., reassigning threads at branches) in the case that one work group contains more threads than a SIMD processor can execute (i.e., $k > w$).
Compaction techniques have been invented to eliminate the overhead of control-flow divergence. In this work, we study them for two reasons. First, if the effects of control-flow divergence become more severe with increasing SIMD width, then compaction techniques, although they have significant overhead, could be effective at higher SIMD widths. Second, we use the latency $E_w$ of an execution with compaction as the baseline for utilization:

$$U_w = \frac{E_w}{I_w}$$

If the branches of an application never diverge, the utilization is 1. However, the larger the gap between $E_w$ and $I_w$ gets, the lower the utilization will be. The overhead is $1 - U_w$, that is, the fraction of instructions that compaction can be expected to remove.
Execution manager instruction count. We assess the efficiency of one particular approach to software-based compaction: the execution manager of Kerr et al. [2012]. At the current state of the art, this approach promises to most effectively eliminate control-flow divergence, because to the best of our knowledge, it is the approach that has the most freedom to rearrange threads. Hence, the execution manager instruction count $E_w$ can be seen as the best possible "practical" SIMD execution of the kernel. We briefly outline the most important concepts of the execution manager. Whenever a group of $w$ work items diverges, the work items are stopped and added to the waiting set of their respective successors. Execution is resumed with a newly formed set of $w$ work items waiting at the same block. These work items are executed until they diverge again and so on. If there is no block with $w$ stopped work items, $w$ new work items (or all remaining work items if there are fewer than $w$) are launched. If all work items have already been launched, execution is resumed with fewer than $w$ active work items. If a work item reaches a barrier, it can only be resumed after all other work items of that work group have reached the barrier. This is implemented by splitting the set of stopped work items into one set for each work group at barriers.
We simulate a variant of the execution manager on a machine with SIMD width $w$ to count the number of executed instructions $E_w$. We refer to how often a group of work items is stopped as $Y_w$.
There are several degrees of freedom when implementing the execution manager. When there is no block with $w$ stopped work items, which block should be chosen next for execution? When there are multiple blocks with $w$ stopped work items, which one should be chosen? When resuming from a block with more than $w$ stopped work items, which work items should be selected to continue? When executing with fewer than $w$ work items, it can be beneficial to eagerly stop at some point to reconverge with other work items to increase the number of active work items in the SIMD group. Should this eager reconverging be done, and if so, where?
We experimented with different choices regarding these issues and validated experimentally that we get the best results (concerning $E_w$) with the following settings. We resume from the block that comes first in a post dominator order whenever there is no block with $w$ stopped work items. We resume execution with the first $w$ work items that were stopped at a block. Whenever the number of active work items drops below $w$, we forcefully yield if there are work items waiting at a block that dominates the current block.
Note that $E_w$ (using the execution manager as described in the previous paragraph) is not a lower bound for the number of instructions on a SIMD processor. There are (pathological) examples where it is beneficial to not reconverge at the post dominator. Consider the traces in Figure 5. Not reconverging immediately yields better results than we have seen in Figure 2. Additionally, as we present in Section 3, $E_w$ can be larger than $I_w$ on real benchmarks.
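To make the bookkeeping of instruction counts and regrouping operations concrete, the following Python sketch simulates a strongly simplified execution manager on recorded traces. It is not the variant evaluated in this article: it handles a single work group without barriers, omits eager yielding, and uses a caller-supplied `block_order` as a stand-in for the post dominator order. The function name and the example traces (reconstructed from the description of Figure 1) are assumptions.

```python
from collections import defaultdict


def execution_manager(traces, w, block_order):
    """Return (instructions, yields) of a simplified compaction run."""
    pos = [0] * len(traces)         # next index per work item
    waiting = defaultdict(list)     # block -> stopped work items
    unlaunched = list(range(len(traces)))
    instructions = 0                # counterpart of E_w in this sketch
    yields = 0                      # counterpart of Y_w in this sketch

    while unlaunched or any(waiting.values()):
        full = [b for b in block_order if len(waiting[b]) >= w]
        if full:
            # Resume the first w work items stopped at a full block.
            group = [waiting[full[0]].pop(0) for _ in range(w)]
        elif unlaunched:
            # No full block: launch up to w new work items.
            group, unlaunched = unlaunched[:w], unlaunched[w:]
        else:
            # Everything is launched: resume a partially filled block.
            block = next(b for b in block_order if waiting[b])
            group, waiting[block] = waiting[block], []

        # Lockstep execution until the group diverges or finishes.
        while group:
            targets = {traces[i][pos[i]] for i in group}
            if len(targets) > 1:
                for i in group:
                    waiting[traces[i][pos[i]]].append(i)
                yields += 1
                break
            instructions += 1
            for i in group:
                pos[i] += 1
            group = [i for i in group if pos[i] < len(traces[i])]
    return instructions, yields


TRACES = ["abcef", "abdef", "abcebcef", "abcebdef"]  # assumption
print(execution_manager(TRACES, 2, "abcdef"))  # (13, 2)
print(execution_manager(TRACES, 4, "abcdef"))  # (12, 3)
```

For these example traces, the sketch needs 13 instructions at width 2, compared with 15 for plain lockstep execution, but 12 instead of 10 at width 4. The latter is an artifact of this simplified policy, yet it illustrates how regrouping can also increase the instruction count.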
Execution manager costs. We want to estimate the maximum costs of code performing the tasks of the execution manager as a representative for software-based compaction. Again, we estimate the costs by instruction counts. The execution manager involves significant overhead for storing state, moving data, and so forth, so the difference $I_w - E_w$ has to be large enough to exhibit potential for a software-based approach. We are interested in the compaction potential $CP_w$, the average costs for performing a regrouping operation. The total costs (number of instructions) for performing compaction are given by the average costs per regrouping times the number of regrouping operations (i.e., $CP_w \cdot Y_w$). On the other hand, the number of saved instructions is $I_w - E_w$. The costs and gains of compaction cancel out if $I_w - E_w = CP_w \cdot Y_w$. Therefore, the compaction potential is the maximum average number of instructions for a regrouping operation:

$$CP_w = \frac{I_w - E_w}{Y_w}$$

The higher $CP_w$ is, the more likely it is that software-based compaction using the execution manager approach will result in performance gains.
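As a small worked example, the two derived quantities can be computed directly from the counters. The sketch assumes that utilization is the ratio of $E_w$ to $I_w$, as the surrounding text suggests, and that the compaction potential follows from the break-even condition $I_w - E_w = CP_w \cdot Y_w$. The function names and the input numbers are purely illustrative and are not measurements from the benchmarks.

```python
def utilization(i_w, e_w):
    """U_w: compacted instruction count relative to plain SIMD count."""
    return e_w / i_w


def compaction_potential(i_w, e_w, y_w):
    """CP_w: instructions that one regrouping may cost at break-even.

    Returns None if no regrouping took place (y_w == 0), because the
    quotient is undefined in that case (a choice made for this sketch).
    """
    if y_w == 0:
        return None
    return (i_w - e_w) / y_w


# Illustrative numbers only: I_w = 15, E_w = 13, Y_w = 2.
print(utilization(15, 13))              # about 0.87
print(compaction_potential(15, 13, 2))  # 1.0
```

A negative result signals that compaction already increases the instruction count before any regrouping costs are charged.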
Memory Pattern Analysis
Another goal of our study is to assess how the nature of memory accesses evolves if the SIMD width increases. Additionally, we investigate to what extent static divergence analyses can classify memory accesses and if dynamic code variants are profitable when static analyses fail to classify a memory access. We evaluate only global memory accesses, since accesses to the global memory system usually incur much higher latencies than local memory. For SIMD hardware, the following memory access patterns are relevant because a compiler can make use of special instructions that result in improved performance.
Uniform. A memory operation is uniform if all active work items of the SIMD group access the same memory address.
Uniform access can be implemented by a single (if the hardware permits it: scalar) memory instruction. The accessed memory addresses can still differ between SIMD groups.
Consecutive. A memory operation is consecutive if all active work items of the SIMD group access consecutive addresses.
Consecutive access can be implemented by an efficient vector load or store (if all items are active).
Ranged. A memory operation is ranged if it is neither uniform nor consecutive but all active addresses are in the range of a SIMD register.
Ranged accesses can be performed by a single memory transaction of w words and a shuffle. The shuffling is either done in software or by a scatter/gather unit in hardware.
Note that some patterns may coincide on certain compute devices (e.g., on modern GPUs, consecutive and ranged accesses show the same performance due to shuffle units). To stay independent of particular hardware incarnations, we chose to make a distinction between them.
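The three patterns can be told apart with a simple classifier. The following Python sketch is one possible reading of the definitions above and not Diana's implementation: it assumes word-granular addresses, takes only the addresses of the active work items in lane order, treats "consecutive" as each address being its predecessor plus one, and treats "ranged" as all addresses falling into a window of $w$ words. The label "divergent" for everything else is our own naming.

```python
def classify_access(addrs, w):
    """Classify one SIMD memory operation.

    addrs: word addresses of the active work items, in lane order.
    w:     SIMD width in words (size of one SIMD register).
    """
    if all(a == addrs[0] for a in addrs):
        return "uniform"
    if all(b == a + 1 for a, b in zip(addrs, addrs[1:])):
        return "consecutive"
    if max(addrs) - min(addrs) < w:
        return "ranged"
    return "divergent"  # hypothetical label for all remaining accesses


print(classify_access([100, 100, 100, 100], 4))  # uniform
print(classify_access([100, 101, 102, 103], 4))  # consecutive
print(classify_access([103, 100, 102, 101], 4))  # ranged
print(classify_access([0, 100, 200, 300], 4))    # divergent
```

Alignment requirements and inactive lanes in the middle of a group are ignored here; a real driver would have to take both into account.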
All applications were executed with the AMD APP SDK OpenCL driver v2.8, modified to extract execution events that are analyzed offline by Diana. The applications chosen for this study cover a diverse set ranging from the microbenchmarks of the AMD APP SDK (e.g., Mandelbrot, NBody, MatrixTranspose, PrefixSum) over more complex applications and benchmark suites (clSURF [Mistry et al. 2011], Rodinia [Che et al. 2009], Parboil [Stratton et al. 2012]) to real-world applications (Bullet physics engine, X264, GEGL). We strive to cover the whole spectrum of data-parallel applications. Clearly, this goal can never be met no matter how many applications are considered. However, we believe that the set used in this work covers all major areas. Overall, our benchmark suite uses 105 kernels and executes a grand total of 17,677 kernel runs.
We first present the results of our performance model for normal SIMD execution when increasing the SIMD width. Then, we discuss the influence of compaction, followed by a validation of our model against real hardware on some manually vectorized microbenchmarks. Finally, we evaluate the impact of increasing the SIMD width on memory access patterns.
Application Performance Estimation
We quantify the performance characteristics of the benchmarks for SIMD widths ranging from 1 to 1024. With this data, we attempt to answer the following questions. What speedup can we expect when using a machine with SIMD width $w$ instead of computing in scalar fashion? What speedup can we expect when transitioning from a machine with SIMD width $w$ to a machine with width $2w$? As mentioned earlier, we use the kernel instruction counters ($I_w$ and $E_w$) to estimate application performance. We define the following terms. Total speedup describes the speedup when comparing scalar execution with SIMD execution (i.e., $I_1/I_w$). By normalized total speedup, we refer to the result of linearly mapping a total speedup from the interval $[1, w]$ to the interval $[1, 2]$. Normalization is required to compare total speedups for different widths. Relative speedup describes the speedup when doubling the SIMD width (i.e., $I_w/I_{2w}$). A kernel run refers to the execution of all work items of one kernel for one given set of input data.
Relative speedups. Ideally, the SIMD instruction count $I_w$ drops by 50% when doubling the SIMD width, because every instruction handles twice as many work items in that case. This is equivalent to a speedup of 2. In the worst case, the instruction count does not drop at all, corresponding to a speedup of 1.
The speedups for the benchmarks are depicted in the first table of Figure 6. The majority of kernel runs show speedups of factors 1.9 and larger, independent of the SIMD width. In our model, good speedups imply that control-flow divergence only increases a little when increasing the SIMD width. The reason is that the number of different executed traces is smaller than the SIMD width. Hence, divergence "saturates" already at lower SIMD widths. This is typical for kernels that contain no loops or have loop bounds that are uniform (not dependent on the work item ID). As an example, the particle simulation in the Bullet physics engine uses six if-statements to check for particle-wall collisions. Although collisions with the floor are quite common, collisions with the walls are the exception and the ceiling is never hit. Furthermore, collisions with two walls at the same time are even rarer. As a result, a typical run only executes three paths through the kernel. Figure 7 shows the number of different traces per SIMD group. Observe that a large majority of SIMD groups contains three or fewer different traces.

Fig. 6. Frequency of speedups of all benchmarks when increasing the SIMD width. For normal SIMD execution, the first table shows the relative speedups and the second table shows normalized absolute speedups. The third table shows normalized absolute speedups for compacted execution using the execution manager. We highlighted environments of local maxima. Note that the group of speedups around 1.8 in the first table translates to speedups decaying from 1.8 to 1.5 in the second table. Furthermore, the execution manager is able to improve the absolute speedups of most benchmarks to the maximum as seen in the third table. However, some applications do not benefit from this approach, and hence the left cluster (marked with lighter gray) does not disappear completely.
For a vast majority of runs, control-flow divergence "saturates" at low SIMD widths because they execute fewer than three different paths through the CFG.
The speedups for widths 2, 4, 256, 512, and 1024 are worse than the speedups in the range of 8 to 128. We assume that after width 4, many of the possibly diverging paths are already taken for most kernels. Consider a kernel containing two nested conditional statements. These result in four paths through that region. At SIMD width 4, it is possible to take all paths in one SIMD execution. If this happens, divergence does not increase for this program part when switching to higher widths. Since low nesting depths are more common, this effect is mostly noticeable for small SIMD widths. This is reflected in the low number of different traces per SIMD group shown in Figure 7. For SIMD width 8, 80% of the SIMD groups do not diverge at all. This number goes down to about 60% for width 512, which is still very high. For each SIMD width, only a small fraction of all SIMD groups contains more than three traces. This means that for many applications, since maximal divergence already occurs at a small SIMD width, the SIMD width can be increased further without risking additional divergence.
For widths of 256 and higher, stragglers in loops become a larger problem for some of our benchmarks. Stragglers are a minority of work items (possibly only one) in a SIMD group that iterate a loop much more often than the majority. In this situation, the whole group has to wait for the stragglers to finish until it can proceed. During that time, most of the SIMD lanes are inactive. The existence of stragglers is supported by the following evidence. Figure 8 shows the percentiles of the normalized number of loop iterations over all runs. We count the number of loop header executions (one number to count all loop headers) per kernel run and normalize that number by the number of loop header executions for width 1. If all work items in a SIMD group iterate the loop (almost) equally often, the quotient will be (only a little bigger than) 1. However, in the presence of stragglers, the quotient will increase. This can be observed in the graph for the .9, .95, and especially the .99 percentiles. For example, Parboil's breadth-first search benchmark executes one work item per node, and each work item iterates over its node's successors. Consequently, the number of loop iterations depends on the input data and therefore varies strongly among the work items. This is a typical scenario for stragglers: many work items in a SIMD group will have finished iterating over their successors while waiting for the few nodes with a significantly higher successor count. Furthermore, the fraction of speedups that is exactly 1 increases for width 1024. Note that a speedup of exactly 1 when switching to width $2w$ is only possible if $w$ is larger than or equal to the number of work items. In that case, all work items belong to the same SIMD group. Using a machine with width $2w$ does not change the number of necessary instructions; it only increases the number of idle lanes.
With increasing SIMD width, stragglers start deteriorating SIMD utilization in loops for some kernels (e.g., 1% of the runs execute ≈ 25× as many loop iterations as in a sequential run).
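The effect of stragglers on such a loop quotient can be illustrated with a small Python sketch. It relies on one plausible reading of the metric, which is an assumption on our part: every lockstep loop iteration of a SIMD group is charged to all lanes of the group, and the sum is divided by the number of loop iterations in scalar execution. The function name and the iteration counts are made up for illustration.

```python
def loop_quotient(iterations, w):
    """Lane-weighted loop iterations at width w relative to width 1.

    iterations: loop trip count of every work item, in launch order.
    """
    lane_slots = 0
    for g in range(0, len(iterations), w):
        group = iterations[g:g + w]
        # The whole group iterates as often as its slowest work item.
        lane_slots += len(group) * max(group)
    return lane_slots / sum(iterations)


uniform = [3] * 8             # every work item iterates equally often
straggler = [2] * 7 + [16]    # one work item iterates much longer

print(loop_quotient(uniform, 4))    # 1.0
print(loop_quotient(straggler, 4))  # 2.4
print(loop_quotient(straggler, 8))  # about 4.27
```

With a single straggler, the quotient grows with the SIMD width because more and more lanes idle while the straggler finishes its loop.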
Absolute speedups. The second table in Figure 6 shows normalized total speedups. As an example, we see that 6,036 runs have a normalized total speedup of 1.97 for width 256. The total speedups for that width are in the interval [1, 256]. We normalize a given total speedup $s = I_1/I_w$ for width $w$ linearly to the interval [1, 2], that is, to $\frac{s + w - 2}{w - 1}$. This means that for these runs, $\frac{(I_1/I_{256}) + 254}{255} \approx 1.97$. There are two major groups of kernel runs. The first group shows (almost) optimal total speedups. The normalized total speedups of the second group become worse for increasing SIMD widths. We observe that there are no "jumps"; that is, the normalized speedups decrease smoothly. How can the second group exist despite the fact that almost all relative speedups are so high? Assume that a kernel has constant relative speedups of 1.9; then the absolute speedup for width 256 is $1.9^8 \approx 170$. The overhead of control-flow divergence accumulates.
For the majority of kernel runs, doubling the SIMD width results in a relative speedup >1.95. The absolute speedup of the remaining runs suffers from smaller relative speedups that accumulate to unsatisfactory absolute speedups (e.g., relative speedup of 1.9 results in absolute speedup of 170× on width 256).
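The accumulation of relative speedups and the normalization to the interval [1, 2] can be checked with a few lines of Python. The sketch assumes a constant relative speedup per doubling step and the linear normalization $(s + w - 2)/(w - 1)$, which is consistent with the expression used for width 256 in the text; the function names are ours.

```python
def accumulated_speedup(relative, w):
    """Total speedup at width w = 2**k for a constant relative speedup."""
    k = w.bit_length() - 1  # number of doubling steps from width 1 to w
    return relative ** k


def normalized(s, w):
    """Map a total speedup s from [1, w] linearly to [1, 2] (w > 1)."""
    return (s + w - 2) / (w - 1)


for rel in (2.0, 1.9, 1.8):
    s = accumulated_speedup(rel, 256)
    print(rel, round(s), round(normalized(s, 256), 2))
```

A relative speedup of 1.9 accumulates to roughly 170 at width 256 and one of 1.8 to roughly 110, far below the ideal factor of 256.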
An example. We examine two interesting kernels in detail. Figure 9 shows the instruction counts for a cell ID computation run from the Bullet physics engine. The kernel scales perfectly up to width 512 (i.e., $I_w/I_{2w} = 2$ for $w = 1, 2, 4, \ldots, 256$). This is also reflected in the fact that $I_w$ and $E_w$ coincide. Note that $I_{512} = I_{1024}$ since there are only 512 work items. This is a typical application in which all work items execute the same code path. In contrast, the collision detection kernel shown in Figure 10 uses multiple code paths. In fact, in this particular run, only 2 out of 512 work items follow the same path throughout the whole kernel. We observe that the speedups range from 1.69 to 1.76 for SIMD widths from 4 to 512. Again, there is no speedup when going from 512 to 1024. The small range of speedups up to 512 suggests that divergence increases steadily.

$O_w$ is a naïve lower bound on the number of instructions of any SIMD execution. Note that both axes are log scale. Up to $w = 64$, the speedup potential of compacted execution steadily increases, meaning that control flow diverges more often and in a way that can be alleviated with compaction. However, for $w = 128$ and $w = 256$, compaction using the execution manager actually results in an increase of executed instructions. At $w = 512$ and higher, the work group size is no greater than $w$, which results in $I_w = E_w$. Right: Normalized absolute speedups for the same run. These values are normalized as in Figure 6.
Note the accumulating overhead in the collision detection kernel. Although speedups of roughly 1.7 look reasonable, the total speedup when going from SIMD width 1 directly to width 512 (i.e., $I_1/I_{512}$) is a rather disappointing factor of 137 (instead of the maximum of 512). Not only for this very large SIMD width but also for smaller widths like 64, the total speedup is far from linear: $I_1/I_{64} \approx 26$.
We observe that increasing the SIMD width yields diminishing returns for many benchmarks. Hardware vendors will have to decide whether lower production costs justify the suboptimal speedups when they choose between wider or more SIMD units. This choice also greatly depends on the expected workload. Alternatively, compaction can mitigate some control-flow divergence as discussed in the next section.
Compaction Profitability Estimation
In this section, we discuss if and to what extent compaction techniques can efficiently compensate for the sublinear speedup observed for some of the investigated kernels. We consider two different scenarios. First, in zero-cost compaction, we assume that rearranging work items comes for free. Second, we compute the compaction potential $CP_w$, that is, the maximum number of instructions compaction can spend to still be efficient on SIMD width $w$.
Zero-cost compaction. We assess whether compaction improves performance if it can be done at runtime essentially for free (e.g., using hardware support). The third table in Figure 6 shows absolute speedups using the execution manager discussed in Section 2.1. Let's have another look at width 256. We see that 12 runs show a total normalized speedup of 1.97. The numbers are normalized as before. Note that these 12 runs are not necessarily a subset of the 6,036 runs with the same speedup when not performing compaction.
Observe that the "slower" groups from the second table disappeared, whereas the number of runs showing a speedup of 2 increased. At width 1024, the speedups suffer a bit, again due to SIMD groups with fewer than 1024 work items. Overall, the total speedups improved almost to the maximum. Thus, if compaction can be implemented with no overhead in hardware, it can remove the drawbacks of control-flow divergence effectively.
Recall that it may happen that after compaction by the execution manager, the number of executed instructions increases (i.e., $E_w > I_w$). We find this phenomenon for a total of 338 runs (combining all SIMD widths). Figure 11 shows the exact number per width. This shows that the execution manager approach is not optimal and can in fact increase the number of executed instructions in real-world applications across all SIMD widths. Note that we use the execution manager instruction count to measure the utilization. Since the execution manager count is not optimal, we overapproximate utilization. Compared to an optimal SIMD execution using compaction, the utilization is therefore potentially even worse than presented in this article. However, as discussed in Section 2.1, we do not believe that optimal SIMD execution with compaction is within the scope of a hardware/software implementation.
Example revisited. Reconsider the collision detection example from Figure 10 and recall that the total speedup for SIMD width 64 is merely 26. If the total speedup factor for the same width is computed from the execution manager instruction counters (i.e., $I_1/E_{64}$), the factor increases to almost 64. This is a strong indication that this application will significantly benefit from compaction. However, as seen in Figure 10, $E_w$ surpasses $I_w$ for widths 128 and 256. This is a case where the execution manager shows suboptimal behavior. Interestingly, this example shows that despite the aggressive dynamic reordering the execution manager performs, common SIMD execution can be more efficient. This suggests that there is still demand for research to better understand the effects of compaction on SIMD execution.
Compaction potential. We use $CP_w$ to assess the maximum amortized costs for compaction. Figure 12 shows the density of estimated maximum costs for the kernels for different SIMD widths. We ignored the 338 kernel runs where $E_w > I_w$ because the compaction potential loses its meaning in that case.
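To make the notion of an amortized instruction budget concrete, the following Python sketch computes a per-yield budget from the instruction counters. It rests on an assumption that is not confirmed by this section: that the compaction potential is the number of instructions saved by compaction divided by the number of yields. The exact definition is the one given where the compaction potential is introduced; the function name and the example numbers are hypothetical.

```python
from typing import Optional


def compaction_potential(i_w: int, e_w: int, yields: int) -> Optional[float]:
    """Instruction budget per yield below which compaction still pays off.

    ASSUMPTION (not taken from the article): CP_w = (I_w - E_w) / yields.
    i_w    -- instructions of plain SIMD execution at width w
    e_w    -- instructions of compacted execution at width w
    yields -- number of yields performed by the compacting execution
    """
    if e_w > i_w:
        return None   # compaction made things worse; potential is meaningless
    if yields == 0:
        return 0.0    # no regrouping happened, so there is nothing to amortize
    return (i_w - e_w) / yields


# Hypothetical counters: 600 instructions saved over 10 yields.
print(compaction_potential(1000, 400, 10))  # 60.0
```

Runs for which the function returns `None` correspond to the excluded case in which the compacted execution issues more instructions than the plain SIMD execution.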
Again, we can identify two groups of kernels. The first group shows no compaction potential. This is mostly because of kernel runs showing little to no dynamic control flow or runs for which the number of yields outweighs the difference in the instruction counters. On the other hand, kernels in the second group exhibit promising compaction potential. Their compaction potential starts at approximately 140 instructions for SIMD width 2 and falls off to 60 instructions at width 32. From there, the potential slowly decays to 50 instructions at width 1024. The actual gain depends on the costs of operations that the idealized machine (see Section 2.1) does for free: rearranging work items and storing execution state. The costs of the execution manager are unknown; therefore, we cannot tell the minimum compaction potential required for the execution manager to be beneficial. However, we can say that it is more beneficial for small SIMD widths.
Note that these numbers may be further improved by tweaking the execution manager (see Section 2.1). As mentioned previously, we obtain almost optimal (absolute speedup = SIMD width) values for $E_w$ for many applications (namely those that show an absolute normalized speedup of 2 in Figure 6). For these kernels, the compaction potential can only increase by lowering the number of yields. Obviously, this is always a trade-off: yielding more often gives more freedom to rearrange SIMD groups, but maybe there are cases where yielding only negligibly impacts the number of instructions. Therefore, it might be beneficial to yield only at a set of carefully selected branches.
If compaction comes for free, it effectively compensates control-flow divergence across all SIMD widths. However, the time budget for compaction decreases when the SIMD width increases.

Model Validation against SSE/AVX Hardware
In this section, we validate our performance model by calibrating data points for $w = 4$ and $w = 8$ on existing hardware. To get robust and meaningful results, we implemented a set of microbenchmarks in scalar C, vectorized them manually, and executed them in a controlled, OpenCL-like environment with $2^{16}$ work items in a single dimension. We deliberately did not validate our model with respect to a proprietary OpenCL driver because we do not know and cannot influence how the driver executes the code. The performance results that we observed with several proprietary OpenCL drivers were inconclusive even for very simple kernels. For instance, drivers cannot be forced to automatically vectorize for a certain width and sometimes decide to not vectorize a kernel at all. For manually vectorized OpenCL code, we encountered situations where $w = 4$ scaled linearly as expected, but $w = 8$ did not show any further improvement or even slowed down compared to $w = 4$. This is a driver issue, not a limitation or property of SIMD execution.
For each microbenchmark, we compute the actual speedup from measured execution times on existing hardware and the simulated speedup using the techniques presented previously. For actual speedups, we use the median execution time of 101 individual runs without warming up to ensure reproducibility of the results. We define the deviation ratio as the ratio of simulated to actual speedups. A deviation ratio of 1 means that the simulated results match the measurements. Figure 13 shows a comparison of our performance model with these measured results. Clearly, the results of the simulation reflect the actual speedups for a large majority of the microbenchmarks.
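The deviation ratio can be computed as in the following Python sketch. The helper names and the timing values are hypothetical and only illustrate the definition above: the actual speedup is the ratio of the median scalar time to the median vectorized time, and the deviation ratio divides the simulated speedup by it.

```python
from statistics import median


def actual_speedup(scalar_times, simd_times):
    """Speedup measured on hardware, using the median of the individual runs."""
    return median(scalar_times) / median(simd_times)


def deviation_ratio(simulated_speedup, scalar_times, simd_times):
    """Ratio of simulated to actual speedup; 1 means simulation and measurement agree."""
    return simulated_speedup / actual_speedup(scalar_times, simd_times)


# Hypothetical timings in milliseconds (the study uses 101 runs per benchmark).
scalar = [8.0, 8.2, 7.9]
simd4 = [2.0, 2.1, 1.9]
print(deviation_ratio(4.0, scalar, simd4))  # 1.0
```

A ratio above 1 means that the simulation predicts a larger speedup than was measured; a ratio below 1 means the opposite.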
Classification of Memory Access Patterns
In this section, we investigate the effects of increasing the SIMD width on the behavior of memory operations. This means that we analyze normal SIMD execution without compaction again to prevent drawing false conclusions. We leave the assessment of the effects of compaction on memory access patterns for future work. Furthermore, we explore the precision of static divergence analyses and investigate the profitability of dynamic code variants.
We call a memory instruction pure if all of its executions follow the same memory access pattern (see Section 2.2). A memory access is pure if it is issued by a pure memory instruction. We show how control-flow divergence directly affects purity of memory accesses in the code example of Figure 14 .
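The following Python sketch illustrates how address vectors could be classified and how purity follows from the classification. The pattern definitions used here are simplifying assumptions (the authoritative ones are those of Section 2.2): uniform means that all lanes access the same address, consecutive means that lane i accesses the base address plus i elements, and ranged means that all addresses fall into a window of as many elements as there are lanes. All names and addresses are illustrative.

```python
def classify(addresses, elem_size):
    """Classify one SIMD address vector (one address per lane)."""
    base = addresses[0]
    if all(a == base for a in addresses):
        return "uniform"
    if all(a == base + i * elem_size for i, a in enumerate(addresses)):
        return "consecutive"
    # Assumed definition of "ranged": addresses lie within a window of
    # len(addresses) elements, in arbitrary order.
    if max(addresses) - min(addresses) < len(addresses) * elem_size:
        return "ranged"
    return "divergent"


def is_pure(executions, elem_size):
    """A memory instruction is pure if all its executions follow one pattern."""
    return len({classify(v, elem_size) for v in executions}) == 1


consecutive = [100, 104, 108, 112]
uniform = [200, 200, 200, 200]

# Uniform condition: each execution uses one vector or the other.
print(is_pure([consecutive, uniform], 4))   # False: two patterns occur

# Divergent condition: lanes are blended from both vectors.
mask = [True, False, True, False]
blended = [c if m else u for m, c, u in zip(mask, consecutive, uniform)]
print(classify(blended, 4))                 # divergent
```

The blended vector mimics a store whose pointer is selected per lane by a divergent condition: although each source vector is regular, the combination is not.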
We consider pure memory accesses separately because impure accesses cannot be classified by static analyses. For each category (pure and impure accesses), Figures 15 and 16 show two statistics that aggregate our data differently. First, we count the number of accesses per pattern per kernel. The "Per kernel" column shows the arithmetic mean of all of these distributions. Thus, the "Per kernel" column reflects the access behavior of the average kernel. Second, the "Per memory access" column shows the normalized count of accesses per pattern over all runs of all kernels. The "Per memory access" column thus shows the profile of the average memory access in the benchmarks.

Fig. 14. If condition is uniform, then every memory access by the store at the end is either uniform or consecutive. If condition always evaluates to the same value for all kernel runs, then the store is pure. If condition is divergent, then any memory access by the store is impure because ptr is a vector created by blending a consecutive and a uniform address vector. Despite the observable patterns in some executions, the store itself is impure and cannot be statically specialized for either address pattern.

Fig. 15. Fractions of accesses from pure memory instructions. On the right-hand side, we see the expected type of a pure memory access over all traces. In the left-hand side plot, we show the expected type of a pure memory access when drawing a kernel uniformly at random. The complementary fractions of impure accesses are shown in Figure 16.

Discussion. First, consider the pure memory accesses in Figure 15. Independent of the SIMD width, uniform memory accesses take a major share of the pure memory accesses. This is due to kernels such as calNumEigenValueInterval of the AMD SDK that alone contribute already 6.14% uniform memory accesses at SIMD width 4. However, the average kernel shows far more consecutive memory accesses.
For instance, the image processing kernels from the GEGL benchmark suite operate on a per-pixel basis and only have consecutive and no uniform accesses. In general, the number of consecutive accesses decays slowly, whereas the number of ranged accesses increases. We attribute this to two different factors. First, the last SIMD group of a kernel run is padded in case the global work size is not a multiple of the SIMD width. Padding necessarily demotes uniform and consecutive patterns to ranged accesses. Second, traces are enumerated in the order they were started. The OpenCL driver terminates a local work group before starting the next. If the work size is multidimensional, this means that traces whose width is wider than the first local dimension are no longer linear in the first global dimension.
Fig. 17. Decay of rangedness in event 8 of the Intersect kernel of Luxmark. The second column shows how often the event was scheduled, which depends on the grouped traces as visualized in Figure 2. The third column shows the last time the event was rescheduled with at least one ranged access. We exclude the cases where only a single straggler remains.

Increasing the SIMD width may break memory access patterns. Consider a multiple of the simulated SIMD width as the first dimension of the work size. In that case, the first coordinate in every SIMD trace is consecutive, and all others are uniform. However, if the simulated SIMD width does not divide the first dimension of the work size, we observe irregular patterns in the coordinate vectors. As these coordinate vectors are often used in address computations, this disrupts uniform and consecutive access patterns. This effect is strong at the step to SIMD width 1024, which is larger than the first dimension of most work groups. The fraction of ranged memory accesses is nonmonotonic in the SIMD width. If the SIMD width is doubled, the range of the addresses in a vector can be twice as big to form a ranged access.
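The effect of the SIMD width on the coordinate vectors can be reproduced with a small Python sketch. It assumes a two-dimensional work size whose work items are enumerated with the first dimension varying fastest and grouped into SIMD traces in that order; local work groups and padding of the last trace are deliberately ignored, and all names are illustrative.

```python
def simd_traces(dim0, dim1, width):
    """Group the work items of a dim0 x dim1 work size into SIMD traces,
    enumerating the first dimension fastest (simplifying assumption)."""
    items = [(x, y) for y in range(dim1) for x in range(dim0)]
    return [items[i:i + width] for i in range(0, len(items), width)]


def is_uniform(values):
    return all(v == values[0] for v in values)


def is_consecutive(values):
    return all(v == values[0] + i for i, v in enumerate(values))


def regular_fraction(dim0, dim1, width):
    """Fraction of traces whose first coordinate is consecutive
    and whose second coordinate is uniform."""
    traces = simd_traces(dim0, dim1, width)
    regular = sum(
        1 for t in traces
        if is_consecutive([x for x, _ in t]) and is_uniform([y for _, y in t])
    )
    return regular / len(traces)


print(regular_fraction(16, 4, 8))    # 1.0: width divides the first dimension
print(regular_fraction(16, 4, 32))   # 0.0: every trace spans two rows
```

With a first dimension of 12 and width 8, two out of every three traces remain regular in this model, which illustrates the irregular coordinate vectors that appear when the width does not divide the first dimension.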
Ranged instructions that are neither uniform nor consecutive appear to be far less relevant. At SIMD width 512, consecutive and uniform instructions still cover 40.61% of all memory accesses.
In Figure 16, the remaining accesses that could not be attributed to pure memory instructions are categorized. In general, we observe that ranged accesses only seem to arise from decaying consecutive accesses. At SIMD width 4, 43.96% of all memory accesses follow a pattern that cannot be attributed to a pure memory instruction. These are mostly uniform. The picture for the average kernel ("Per kernel" column) is slightly different. First, similar to pure accesses, we mostly observe consecutive accesses. Second, consecutive accesses are not strictly decaying but fluctuate with increasing SIMD width. The reason for this is twofold: pure consecutive memory instructions become impure when doubling the SIMD width; at the same time, impure consecutive accesses demote to ranged accesses.
With increasing SIMD width, memory access patterns become less uniform, less consecutive, and less pure.
An example: memory behavior in Luxmark. In the following, we discuss the Luxmark benchmark as an extreme case of SIMD sensitivity in memory divergence. Figure 17 shows the memory access profile of a load instruction of the Luxmark benchmark. It is representative of a set of memory instructions in the main traversal loop of the Intersect kernel. The memory operation accesses the current node of a tree that serves as an acceleration structure. These accesses are highly divergent because different work items descend into different branches of the tree. We observe that increasing the SIMD width not only affects the loop trip count due to stragglers; the number of loop iterations in which memory accesses are ranged also decreases rapidly: starting from a SIMD width of 32 and at least up to 1024, the first loop iteration is always the last with any ranged memory access.
Precision of static divergence analyses. Several OpenCL drivers use static divergence analyses [Karrenberg and Hack 2012; Coutinho et al. 2011; Sampaio et al. 2014] to classify memory accesses. We implemented one existing analysis [Karrenberg and Hack 2012] in our framework and report on its precision. Across all SIMD widths, the static divergence analysis proves about 40.36% of all pure uniform accesses and more than 59.54% of the pure consecutive accesses. Hence, future work could look at the unclassified accesses in detail and potentially improve on existing static analyses. Increasing SIMD widths do not affect the effectiveness of the static analysis; this is expected, as in Karrenberg's analysis the SIMD width only influences the classification of consecutive values.
Dynamic code variant profitability estimation. Our data shows that many memory instructions are either impure and can therefore not be classified at compile time or that state-of-the-art techniques fail to recognize their patterns. However, our data can be used to develop a heuristic to introduce dynamic code variants. Dynamic code variants for memory instructions are optimized code paths for cases where the address vector follows an optimizable pattern. In the following, we only consider memory instructions that the static divergence analysis marked as divergent. This subsumes all impure and part of the pure instructions that could not be classified. Pure memory instructions that are successfully classified by a static analysis can be optimized without code variants and are therefore not considered here. Each dot in Figure 18 stands for a memory instruction marked with the access pattern that it follows most frequently. The figure shows the access pattern frequency (x axis) and how often the instruction is executed on average (y axis, upper graph). A memory instruction far to the top right of the upper graph will benefit the most from an optimized code path.
Further, we evaluate how a simple heuristic performs that decides whether a dynamic code variant should be inserted. The heuristic cannot reason about the absolute number of executions, because kernel parameters and work size are unknown at compile time. Thus, we assume that we are given a static analysis that can estimate the pattern frequency. For simplicity, we consider that a decision threshold on the frequency estimate is used. Every instruction that lies above the decision threshold will get a dynamic code variant. In the lower two charts of Figure 18 , we evaluate how sensitive this method is to the quality of the estimate.
We report precision and recall for the two optimizable access patterns, uniform and consecutive. The numbers are based on the average number of memory accesses. Then, we show the number of instructions (y axis, lower graph) that lie above the decision threshold.
We observe that the pattern frequencies are concentrated at the two extreme ends. This means that the decision threshold should be placed in the range of 35% to 55%, where precision and recall are higher than 80%. Thus, a static analysis that lower bounds the pattern frequency can make a reasonable decision to insert optimized paths. However, Figure 19 shows that for larger SIMD widths, fewer and fewer instructions can benefit from dynamic execution paths.
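The threshold heuristic can be sketched in Python as follows. The sketch assumes one particular, access-weighted reading of precision and recall that is not spelled out here: precision is the share of accesses on selected instructions that actually follow the pattern, and recall is the share of all pattern-following accesses that are covered by selected instructions. The instruction profiles in the example are hypothetical.

```python
def precision_recall(instructions, threshold):
    """instructions: list of (pattern_frequency, average_executions) pairs.
    An instruction gets a code variant if its frequency reaches the threshold.
    ASSUMPTION: both metrics are weighted by the number of accesses."""
    selected = [(f, n) for f, n in instructions if f >= threshold]
    hits = sum(f * n for f, n in selected)          # accesses served by a variant
    covered = sum(n for _, n in selected)           # accesses paying for the check
    total = sum(f * n for f, n in instructions)     # all pattern-following accesses
    precision = hits / covered if covered else 1.0
    recall = hits / total if total else 1.0
    return precision, recall


# Hypothetical profile with frequencies clustered at both extremes.
profile = [(0.95, 1000), (0.90, 500), (0.10, 800), (0.05, 200)]
p, r = precision_recall(profile, 0.5)
print(round(p, 3), round(r, 3))  # 0.933 0.94
```

Because the frequencies in such a profile cluster near 0 and 1, any threshold between the clusters selects the same instructions, which is why a rough frequency estimate suffices.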
The most frequent access pattern of divergent memory instructions either occurs very often or is barely relevant. Given a rough frequency estimate, a decision threshold is effective to decide whether to use a code variant.
Threats to Validity
The validity of our study is threatened by several assumptions and choices we made:
(1) The set of benchmarks could be too small and/or too biased. Other programs that we did not consider could exhibit different control-flow and memory divergence behavior. In particular, the memory divergence results have to be interpreted and extrapolated with care.
(2) Our performance model ignores caches and other performance-enhancing hardware. Therefore, the numbers that we present cannot directly be taken as a performance prediction for a specific kind of accelerator. By ignoring cache misses, our study gives best-case results. It investigates to what extent the problematic properties of SIMD execution influence program execution in the best of all cases. It is not to be expected that cache misses improve on this best case. This best-case analysis is what architects and compiler writers need: when writing a compiler, one will rarely perform a code transformation that degrades performance when something is a cache hit. Nevertheless, in this article, we give experimental evidence that our performance model is representative for at least one concrete hardware platform.
(3) The execution manager [Kerr et al. 2012] or its configuration could be an inappropriate choice for a baseline. Although unexpected (see the discussion in Section 2.1), other approaches (see Section 4) could exhibit better performance.
(4) The reported memory patterns were observed in the work item order of the modified OpenCL driver. Memory patterns are sensitive to the work item ordering, and results may vary with other OpenCL drivers. However, we expect drivers to not exercise that freedom but instead stick to the order given by the application developer.
(5) We assess the behavior of specific implementations of the benchmark applications to get a realistic view of their properties. For example, some kernels may have been written and optimized specifically for GPUs that have a fixed SIMD width of 32 or 64. Such kernels may exhibit unexpected performance behavior when increasing the SIMD width beyond this value, which could influence our results.
RELATED WORK
Magni et al. [2013] conducted experiments on the effects of different thread coarsening factors on specific GPU and CPU architectures. Their work is similar to ours in that thread coarsening also increases the number of work items executed in lockstep, but the effect is different from changing the architecture's SIMD width. Lashgar et al. [2013] simulated GPUs with different warp sizes and hypothetical hardware to eliminate either control-flow divergence or memory-access divergence. They estimate the performance implications of these designs. Kerr et al. [2009] analyzed the efficiency of CUDA microbenchmarks in terms of control-flow divergence, memory behavior, and parallelism on GPUs. They used a simulator for the PTX instruction set instead of a trace-based approach and only conducted their experiments on a GPU with a fixed SIMD width of 32. In addition, their approach is tied to the GPU, since they simulate memory timings and caches and take coalescing as given. Additionally, they consider barriers as reconvergence points (as opposed to reconverging at the immediate post dominator), which negatively affects all of their metrics. Alternative reconvergence schemes have also been evaluated by Fung et al. [2007], who find that immediate postdominator reconvergence is almost optimal. Burtscher et al. [2012] developed metrics for control-flow and memory irregularity to categorize kernels for GPU execution and evaluated the effects of optimizations on these metrics. Coutinho et al. [2010] instrumented CUDA kernels with profiling code to dynamically detect both the location and the volume (severity) of control-flow divergences. CuMAPz [Kim and Shrivastava 2011] is a tool to estimate memory performance of a CUDA application on a given machine. The tool enables performance comparison of different program designs by simulating their respective memory behavior. The official CUDA Visual Profiler also performs dynamic analysis of both control-flow divergence and memory access patterns for Nvidia GPUs. Baghsorkhi et al. [2010] estimate execution time of a CUDA kernel analytically. Their model also includes an estimation for control-flow divergence. A variety of work exists where OpenCL or CUDA kernels are dynamically tested for race conditions, which requires techniques related to our memory pattern analysis [Collingbourne et al. 2012; Boyer et al. 2008; Zheng et al. 2011].

There are several implementations of compaction, such as those of Fung et al. [2007], Fung and Aamodt [2011], Kerr et al. [2012], Zhang et al. [2011], and Meng et al. [2010]. The execution manager by Kerr has been discussed in Section 2.1. Fung and Aamodt's Thread Block Compaction [Fung and Aamodt 2011] is a hardware approach that works similarly to the execution manager; that is, work items are stopped whenever control flow in a SIMD group diverges, and new groups are formed with work items stopped at the same block. Regrouping only happens inside of a work group. The formation of new groups is done lane aware; in other words, a work item always stays in the same SIMD lane in all groups. New groups are formed by collecting work items that wait at a block. When a new work item is added to a group and finds its lane already taken, the group is started. At the immediate post dominator, the original assignment of work items to SIMD groups is restored. Rhu and Erez [2013] proposed SIMD lane permutation to counteract the shortcomings of lane awareness. G-Streamline by Zhang et al. [2011] uses the CPU to determine loop counters and then executes groups of work items with similar loop counters on the GPU. The evaluation of all these works is limited to their respective approach (e.g., to the target hardware's SIMD width). Meng et al. [2010] propose dynamic warp subdivision. Their technique dynamically splits a SIMD group when some of its work items are ready while others are waiting for memory operations, so the ready items can pull ahead. The splits allow for more flexible scheduling. Splitting can also be used to implement control flow.
CONCLUSION
We studied the behavior of data-parallel applications when executed on hardware with different SIMD widths where each SIMD lane computes one work item of the application.
Our analysis framework collects execution traces of OpenCL applications, analyzes them a posteriori, and simulates SIMD execution of arbitrary width. We evaluated a diverse set of 76 applications from various sources, ranging from microbenchmarks to real-world applications such as a physics engine, video decoding, photorealistic rendering, and image processing.
Our study shows that the majority of kernel runs scale well because they exhibit almost no control-flow divergence. In fact, the number of distinct control-flow paths through those kernels is small (<4), such that making SIMD wider does not cause further control-flow divergence. The rest of the kernel runs suffer severely from control-flow divergence: their mediocre relative speedups when doubling the SIMD width accumulate to a poor overall speedup. Our data indicates that work group rearrangement techniques (compaction) can compensate for control-flow divergence. However, the margin for efficient compaction is small and decreases when increasing the SIMD width.
Memory access patterns become less regular with increasing SIMD widths. The number of impure accesses that cannot be classified by static analyses increases for larger SIMD widths. Furthermore, the frequency with which one particular, regular access pattern (uniform, consecutive, ranged) can be observed for an impure access decreases as well. This reduces the profitability of dynamic code variants or checks that test for a particular pattern at runtime.
