It is unquestionable that successive hardware generations have significantly improved GPU computing workload performance over the last several years. Moore's law and DRAM scaling have respectively increased single-chip peak instruction throughput by 3X and off-chip bandwidth by 2.2X from NVIDIA's GeForce 8800 GTX in November 2006 to its GeForce GTX 580 in November 2010. However, raw capability numbers typically underestimate the improvements in real application performance over the same time period, due to significant architectural feature improvements.
To demonstrate the effects of architecture features and optimizations over time, we conducted experiments on a set of benchmarks from diverse application domains for multiple GPU architecture generations to understand how much performance has truly been improving for those workloads. First, we demonstrate that certain architectural features make a huge difference in the performance of unoptimized code, such as the inclusion of a general cache, which can improve performance by 2-4x in some situations. Second, we describe what optimization patterns have been most essential and widely applicable for improving performance for GPU computing workloads across all architecture generations. Some important optimization patterns included data layout transformation, converting scatter accesses to gather accesses, GPU workload regularization, and granularity coarsening, each of which improved performance on some benchmark by over 20%, sometimes by a factor of more than 5x. While hardware improvements to baseline unoptimized code can reduce the speedup magnitude, these patterns remain important for even the most recent GPUs. Finally, we identify which added architectural features created significant new optimization opportunities, such as increased register file capacity or reduced bandwidth penalties for misaligned accesses, which increase performance by 2x or more in the optimized versions of relevant benchmarks.
While no community or field springs out of nothing, the modern field of GPU computing had a major inflection point approximately five years ago with the first support for C-based programming languages for general computation on GPUs. Very quickly, the community discovered and published what worked well on GPU platforms and what didn't at first. As the years progressed, GPU architects and application researchers continually pushed at the boundaries of what GPUs could do effectively, significantly improving performance for many workloads. At this juncture, with five years of experience and a new academic conference explicitly dedicated to novel parallel computing platforms and applications, we would like to examine some GPU computing workloads and see how far we have come, and what is most important to learn from the current state of the art as we continue to move forward.
We would like to focus on two major aspects of the GPU computing field over the last five years. The first is the optimization and programming patterns that have shaped optimized applications for GPU architectures. Several design principles of GPU architectures have been and will likely continue to be very consistent, such as SIMT and high degrees of multithreading. We have surveyed many GPU computing applications and kernels and distilled what we believe to be several key optimization techniques and design considerations for high-performance GPU computing workloads. These techniques deserve to be covered in some detail and in a way that can be understood across domains, because they reflect the common patterns a new GPU programmer in any domain should learn. In addition, we believe that these optimization patterns will continue to gain broader relevance over time, because the optimization patterns we present are fundamentally about implementing scalable, efficient parallel programs on an architecture with many cores, vector execution, and limited memory bandwidth. We present our focused discussion on such optimization patterns in Section 3.
Second, we present new experiments exploring the design space of GPU architecture itself, which has been both remarkably consistent and increasingly friendly to application programmers over the last several years. The tipping point of GPU computing five years ago coincided with the GPU [...]
ARCHITECTURE EVOLUTION
GPU computing workloads are extremely diverse. Literally hundreds of GPU application optimization papers have been published over the past few years alone. The GPU Computing Gems books alone document dozens, from very diverse fields [4, 5]. The SHOC OpenCL benchmark suite (with some equivalent CUDA benchmarks) has multiple "levels", benchmarking increasingly complicated codes [3]. The SHOC level 0 benchmarks are essentially microbenchmarks for stress-testing things like memory bandwidth. Level 1 benchmarks are simple, common primitives such as reduction, scan, SGEMM, and others. Level 2 benchmarks (of which there is only one at time of writing) are considered full applications. The Rodinia [2] and Parboil [8] benchmark suites overlap somewhat with each other and with the SHOC benchmarks, mixing "primitives" with "applications". Both Parboil and Rodinia have CUDA and OpenCL implementations, except for two Rodinia benchmarks lacking OpenCL support at time of writing. Because one of our goals was to reimplement and reoptimize our benchmarks for multiple GPU devices, we chose to work with the Parboil benchmark suite, which has both reasonably sized benchmarks for reimplementation and scripting support for multiple implementations and platforms.
All our experiments were conducted on one of three GPUs, which are summarized in Table 1. The first system houses an NVIDIA GeForce 9800 GX2, of which we only use one of the included GPU chips. We chose this device as opposed to a slightly older G80-generation device because of the added support for atomic operations to global memory, without which some of our benchmarks would be practically impossible to implement. Atomic operations to shared memory are not supported in this device, but can often be emulated by barriers or clever uses of shared memory consistency and warp scheduling assumptions valid for that device. Such emulated atomic operations come at a higher software development, performance, and portability cost. Additionally, the 9800 GX2 architecture carries a harsh penalty for any imperfectly coalesced memory accesses, generating many DRAM line accesses even for contiguous but simply misaligned accesses.
Experiments on a second system executed on a single GPU of an NVIDIA Tesla S1070. That GPU, and others of compute capability 1.2 and higher, support atomic operations to shared memory, allowing for more robust and efficient communication among threads in a block for certain programming patterns. In addition, the S1070 removes the harshest penalties for uncoalesced accesses, such that a misaligned access will only transfer the two DRAM lines straddled by the contiguous addresses. In general, the S1070's coalescing unit will generate the fewest DRAM transactions necessary to satisfy the requests from a single warp-instruction.
Finally, the third system consists of an NVIDIA GTX 480. The "Fermi" GPU generation, designated by compute capabilities 2.0 and higher and including the GTX 480, adds a general cache hierarchy to the global memory system, and increased shared memory capacity for those kernels needing it. In addition, an extension to its instruction set allows the CUDA compiler to automatically target certain access patterns to the constant memory cache for high broadcast performance.
Clearly, at time of publication all of these devices will be two or more years old, which for scientific computing implies that the performance results may not be directly relevant for future hardware. However, the insights gained from analyzing the feature sets should continue to be informative. While NVIDIA is unlikely to remove standardized hardware features in future generations, determining which features proved most impactful can help shape the accelerator market as a whole towards the most useful feature set for accelerated workloads.
OPTIMIZATION PATTERNS FOR PARALLEL CHIP ARCHITECTURES
A segment of the parallel programming community has long been interested in characterizing the programming patterns that are effective for parallel systems [7, 6]. However, in our conversations with some of the authors in that field, they have confided that they sometimes struggle with the fact that once a parallel program is implemented, the optimization process involves software development practices completely outside the domain of their structural patterns. We would therefore like to begin the academic discussion of a set of patterns systematizing the optimization of parallel programs. The optimization patterns were drawn in particular from our informal survey of the GPU Computing Gems contributions [4, 5], and from a focused and detailed analysis of the Parboil benchmarks.
Because accelerators are highly parallel devices, many of the techniques specifically address general performance issues that arise from programming a highly parallel shared memory architecture, such as contention and load imbalance. Some techniques are not specific to highly parallel architectures, but avoid especially severe performance cliffs given the design of today's accelerator architectures, such as the especially software-driven approaches to effective bandwidth utilization and locality management. Still other techniques are specifically targeted towards leveraging the benefits of a hybrid system, using the versatility of the CPU to not only process necessarily sequential code regions but to also precondition GPU kernel inputs such that kernels can be optimized further than would be possible for general input.
Finally, parallel architectures are fundamentally a collection of sequential processing units. When a parallel architecture is well used, the performance limitation of a program on that architecture is the efficiency of the sequential programs running on each execution unit. Therefore, sequential program performance optimization is still an area of interest for the SPMD code executed on the accelerator. We will not discuss those techniques here, as they are well studied and not unique to parallel programming systems, but acknowledge that immature compiler technology will sometimes necessitate direct programmer implementations of "trivial" code optimizations.
We firmly believe that every one of the patterns we describe here has been explained in previous work, but note that previous descriptions of these patterns as they apply to GPU workloads are typically embedded within implementations of specific workloads. While we do not take credit for being the first to discover any of these individual transformations, we believe that there is useful insight to be gained by consolidating summaries of all those we found applied to the Parboil benchmarks in a way that highlights their generality to a variety of GPU computing workloads. By gathering the optimization patterns together, anchored by the real benchmarks using them, we can study how they interact with one another, their variations among different applications, and their individual and cumulative results on real hardware systems.
We demonstrate the impact of each individual pattern by presenting, for benchmarks where that pattern was particularly relevant, performance improvements from the highest-performing code we could write without that optimization to the highest-performing code we have. Except where noted otherwise, results in this section are collected from the NVIDIA Tesla S1070 system described in Section 2.
Tiling
Tiling is perhaps the most widely used and understood technique for best utilizing a tiered memory hierarchy. While the technique is fundamentally the same in sequential code optimization, the actual implementation can vary with the design of an architecture's memory hierarchy, as shown in Figure 1. Tiling in the context of a CPU architecture with a large-capacity, implicitly managed cache hierarchy typically means writing regions of code that operate intensively on smaller sections of memory. The regions could then be repeated many times for different sections, or tiles, of data. The application need not explicitly define the region of memory being operated on, as the hardware should automatically respond to the heavy usage of certain regions and retain those regions in the cache.
One of the most obvious differences of current GPU architecture is explicitly managed on-chip memory, such as on the right side of Figure 1. To use the small-capacity, high-bandwidth scratchpad, software must explicitly move data into it before use. The threads themselves are mediators between DRAM and scratchpad, under the direction of source code written by application programmers.
Recent GPUs have also added small implicit caches to their general memory system, providing a hybrid of implicitly and explicitly managed locality mechanisms.
Clearly, neither caches nor scratchpads in current GPUs were designed for CPU-style temporal locality and thread-private tiles, but for overlapping accesses among threads. A block of threads may collectively have 16kB of private cache or scratchpad space even when all thread contexts are active, which is often sufficient to hold worthwhile-sized tiles of data. Therefore, the software technique of tiling is still applicable for GPUs, but very often must take the form of cooperative tiling using the shared resources of several threads for sufficient impact.
The performance impacts of tiling are significant, as shown in Table 3. The 3x improvement in performance for the stencil benchmark corresponds to the fact that memory tiling reduces the amount of data accessed from global memory per iteration from 5 words per thread to 1.25 words per thread on average. Performance does not increase by a full factor of 4x primarily because some accesses are still misaligned, not fully utilizing DRAM bandwidth. Although the results in Table 3 are for a cacheless GPU, our experiments in Section 4.3 verify, as any CPU high-performance programmer will assert, that software tiling is still critical for architectures with implicit caches.
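The traffic reduction that cooperative tiling buys can be illustrated with a small sequential model. The sketch below is not the Parboil stencil kernel: it assumes a hypothetical 1D 3-point stencil, models one thread block per tile, and counts how many words each version reads from "global" memory. The names `stencil_naive`, `stencil_tiled`, and the tile size of 16 are illustrative assumptions.

```python
def stencil_naive(data):
    """3-point stencil where every output re-reads its three inputs from 'global' memory."""
    out, reads = [], 0
    for i in range(1, len(data) - 1):
        out.append(data[i - 1] + data[i] + data[i + 1])
        reads += 3
    return out, reads


def stencil_tiled(data, tile):
    """Same stencil, but each block first stages its tile plus a two-element halo locally."""
    out, reads = [], 0
    n_out = len(data) - 2
    for start in range(0, n_out, tile):
        width = min(tile, n_out - start)
        scratch = data[start:start + width + 2]  # models the scratchpad copy
        reads += len(scratch)                    # one global read per staged word
        for j in range(width):
            out.append(scratch[j] + scratch[j + 1] + scratch[j + 2])
    return out, reads


data = list(range(66))  # 64 interior outputs
naive_out, naive_reads = stencil_naive(data)
tiled_out, tiled_reads = stencil_tiled(data, tile=16)
print(naive_reads / len(naive_out), tiled_reads / len(tiled_out))
```

In this toy setting the global reads per output drop from 3 words to (16 + 2) / 16 words. The remaining overhead is the halo, which shrinks relative to the tile as the tile grows.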
Privatization
Privatization is the transformation of taking some data that was once common or shared among parallel tasks and duplicating it such that different parallel tasks have a private copy on which to operate. Parallel threads typically operate most efficiently when they can operate completely independently, avoiding coordination with other threads, but many parallel algorithms require threads to interact to obtain a final result. Privatization is applied to isolate regions of code where threads can operate independently and efficiently, before eventually combining results.
Figure 2 shows a common multi-level privatization pattern reflecting the hierarchical task decomposition common among highly parallel systems such as clusters or single-chip GPUs. Working up from the bottom of Figure 2, a global result is built from the partial results from many independent tasks (thread blocks in the case of a GPU). Those partial results are each in turn constructed from many more "private" results. This kind of privatization has applications in many different kinds of algorithms. Collective operations such as sorting or reductions will use this pattern, as will data structures such as histograms or queues.
One limitation of privatization is that the data footprint of the copies and the overhead of combining the copies scale with the amount of parallelism being exploited. This is why privatization is an extremely powerful technique for today's CMPs, with a relatively small number of threads, but somewhat limited for the levels of thread parallelism in highly multithreaded architectures. Often the "private" results are still shared by several GPU threads due to resource limitations, but are intended to be constructed with as little inter-thread cooperation as possible. As shown in Table 4, the BFS Parboil benchmark privatizes the output work queues, resulting in a 3x performance improvement over an unprivatized implementation. Privatization allows the BFS kernels to exchange more costly global memory atomic operations for shared memory atomic operations, and also collects irregular updates in shared memory before bulk-committing results to the global queue in a more regular pattern, improving bandwidth. For the histogram benchmark, the privatization transformation was ineffective for the S1070 due to shared memory capacity limitations; we therefore report GTX 480 speedups in Table 4 for that benchmark.
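As a rough illustration of the two-level structure described above, the following sequential sketch models a privatized histogram. It is an assumption-laden toy rather than the Parboil code: each "block" accumulates into its own private copy (standing in for shared memory) and then commits one update per touched bin to the global result. The function names and the block size are hypothetical.

```python
def histogram_shared(keys, num_bins):
    """Every input updates the single shared histogram directly."""
    hist, global_updates = [0] * num_bins, 0
    for k in keys:
        hist[k] += 1          # would be a contended global atomic on a GPU
        global_updates += 1
    return hist, global_updates


def histogram_privatized(keys, num_bins, block_size):
    """Each block fills a private copy, then merges it into the global result."""
    hist, global_updates = [0] * num_bins, 0
    for start in range(0, len(keys), block_size):
        private = [0] * num_bins              # models a per-block scratchpad copy
        for k in keys[start:start + block_size]:
            private[k] += 1                   # block-local update
        for b, count in enumerate(private):
            if count:
                hist[b] += count              # one global update per touched bin
                global_updates += 1
    return hist, global_updates


keys = [i % 4 for i in range(64)]
shared_hist, shared_updates = histogram_shared(keys, num_bins=4)
private_hist, private_updates = histogram_privatized(keys, num_bins=4, block_size=16)
print(shared_updates, private_updates)
```

The sketch also makes the stated limitation visible: the private copies cost `num_bins` words per block, so the footprint and the merge work both grow with the number of blocks.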
Scatter to Gather Transformation
A few Parboil applications demonstrate a computation pattern where an input datum would either contribute to many output elements, or contribute to one or more statically unknown output elements, as shown in Figure 3. In either case, a common pattern for sequential implementation is to examine each input element, determine the output elements it affects, and update each one before moving on to the next input element.
This method works poorly as parallelism scales, because the output accesses are either contentious or random or both. Examining the previous techniques, we see that tiling is very effective on input data, and privatization is very effective on output data. However, a kernel implemented with a scattering approach has no input read sharing to tile, and no outputs with multiple updates from the same thread to privatize. In these situations, it is often very important to transform the code such that input elements are read-shared, but output elements are private to a parallel task. This is more palatable than the converse case because shared reads can be handled much more efficiently than conflicting writes, which typically require more costly atomic operations and coherence enforcement. A conversion to gather accesses means that privatization can be applied to output writes, reducing their cost, while techniques such as tiling can be applied to improve shared read efficiency.
Scatter-to-gather transformation works exceptionally well when the range of inputs affecting an output can be found without direct examination of the input data contents. If this input-to-output mapping cannot be done statically, sometimes the transformation must be supplemented with a binning operation. The Parboil Histogram benchmark gains about 20% performance on a Fermi architecture by using a gather-based approach instead of a scattering approach. Results are presented on the GTX 480 because the gather approach for the Parboil Histogram benchmark is only effective for GPUs with sufficient shared memory space to privatize a reasonable portion of the output histogram. The S1070 system does not have sufficient scratchpad capacity for the scatter-to-gather transformation to improve histogram benchmark performance, so the transformation is inapplicable as an optimization for that architecture.
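The difference between the two formulations is easiest to see when the input-to-output mapping is static. The sketch below assumes a hypothetical 1D problem in which each input contributes to every output within a fixed `RADIUS`; it is not taken from any Parboil benchmark. In the scatter form the parallel unit would be the input element, so several tasks write the same output; in the gather form the parallel unit is the output element, so each output has a single writer and the inputs are only read.

```python
RADIUS = 1  # hypothetical, statically known range of influence


def scatter(inputs):
    """Loop over inputs; each one updates all outputs it affects (conflicting writes)."""
    n = len(inputs)
    out = [0] * n
    for i, v in enumerate(inputs):
        for o in range(max(0, i - RADIUS), min(n, i + RADIUS + 1)):
            out[o] += v   # would need an atomic update if inputs ran in parallel
    return out


def gather(inputs):
    """Loop over outputs; each one reads the inputs that affect it (shared reads)."""
    n = len(inputs)
    out = [0] * n
    for o in range(n):
        acc = 0           # private accumulator, written once at the end
        for i in range(max(0, o - RADIUS), min(n, o + RADIUS + 1)):
            acc += inputs[i]
        out[o] = acc
    return out


values = [1, 2, 3, 4, 5]
scatter_out = scatter(values)
gather_out = gather(values)
print(scatter_out, gather_out)
```

Because neighbouring outputs in the gather form read overlapping input ranges, the shared reads are exactly the kind of access that tiling can then serve from on-chip memory.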
Binning
A gather operation can be difficult to orchestrate without a method of determining, based on output location, which inputs contribute to that location. In the Parboil Histogram benchmark, for instance, a set of work-groups redundantly reads a section of the input data from off-chip DRAM, but each only processes the set falling within its own output range. In general, the bandwidth cost of reexamining data scales with the amount of parallelism. Therefore, for some applications it is beneficial to first create a data structure mapping output locations to a small subset of the input data that may affect each output location, reducing the redundant reading of data. This data structure creation is called "binning", because it often reflects a sorting of input elements into bins representing a region of space containing those input elements. In the example of Figure 4, the unsorted data keys are examined and sorted into an array. If the input dataset were very regular, the sorting by key alone would likely create an efficiently accessible data structure. However, in the presence of irregularity, there will be either empty or overflowing bins for any fixed bin size, which should be addressed by some combination of the following two optimization patterns: regularization and compaction.
Figure 4: An example showing the binning, regularization, and compaction optimization patterns.
Binning can improve system performance in several ways. If the GPU is performing both the binning and the computation, the overhead of binning can be outweighed by the improved efficiency of the main compute kernels. Alternatively, the binning operation could be offloaded to the CPU, potentially making better use of all available system resources. Binning is applicable in particular for the CutCP benchmark, as shown in Table 6. The speedups from binning are often very high, because binning a kernel's input data changes the fundamental computational complexity of the kernel algorithm. A scatter-based kernel may not need binning to get comparable computational complexity, but even for scattering kernels, binning is important because it can facilitate privatization of tiles of output data.
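The complexity argument can be made concrete with a small model of a cutoff-limited gather. The sketch below is only loosely inspired by cutoff computations such as CutCP and does not reproduce that benchmark: it assumes 1D points, a hypothetical `CUTOFF` that doubles as the bin width, and counts how many inputs each version examines.

```python
from collections import defaultdict

CUTOFF = 1.0  # hypothetical interaction radius, also used as the bin width


def count_neighbors_brute(points, grid):
    """Gather without binning: every output examines every input."""
    result, examined = [], 0
    for g in grid:
        n = 0
        for p in points:
            examined += 1
            if abs(p - g) <= CUTOFF:
                n += 1
        result.append(n)
    return result, examined


def count_neighbors_binned(points, grid):
    """Gather with binning: each output examines only its three nearest bins."""
    bins = defaultdict(list)
    for p in points:
        bins[int(p // CUTOFF)].append(p)      # the binning pass
    result, examined = [], 0
    for g in grid:
        home = int(g // CUTOFF)
        n = 0
        for b in (home - 1, home, home + 1):  # only bins that can hold neighbours
            for p in bins.get(b, []):
                examined += 1
                if abs(p - g) <= CUTOFF:
                    n += 1
        result.append(n)
    return result, examined


points = [0.25, 0.5, 3.75, 7.0, 7.5, 12.25]
grid = [0.0, 4.0, 8.0]
brute_counts, brute_examined = count_neighbors_brute(points, grid)
binned_counts, binned_examined = count_neighbors_binned(points, grid)
print(brute_examined, binned_examined)
```

The brute-force version examines every input for every output, while the binned version examines only inputs in nearby bins, which is the change in computational complexity the text refers to. The variable-length Python lists hide the fixed-bin-size problem that regularization and compaction address.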
Regularization
Load imbalance has been one of the banes of parallel processing throughout its history. Typically load imbalance is exacerbated when the level of parallelism being exploited increases. Architectures exploiting SIMD or SIMT vector processing suffer from low-level imbalance if the tasks assigned to different execution lanes process different amounts or kinds of work. GPU architectures are no exception. Furthermore, if threads co-executing in a thread block have imbalanced loads, the shared resources of the entire thread block may be occupied until the last thread completes, potentially reducing the real amount of thread-level parallelism available for the architecture to exploit.
Some applications that exhibit load imbalance can predict at run-time where and how the load imbalance will occur. In the example of Figure 4, we assume that the program can count the number of data elements for each key at much lower cost than actually calculating each element's contribution to its particular output. A preprocessing step can limit the amount of imbalance in work units executed on the GPU by identifying regions of load imbalance and proactively addressing them. In our example, during the binning process, elements that "overflow" a bin can be put in a separate data structure, which can be processed by some method less sensitive to load imbalance.
Regularization is the optimization pattern of preconditioning GPU kernel input to improve performance. Among the Parboil benchmarks, there are examples of processing irregular work separately using a GPU kernel insensitive to imbalance, and of offloading irregular work for the CPU to process concurrently with the accelerated kernel. Other cases have no visible impact on the kernel code except that load imbalance and warp divergence are on average improved, resulting in higher performance. Regularization increases the efficiency of the primary accelerated kernels handling the majority of the processing, resulting in higher system performance overall, with impact as listed in Table 7.
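The overflow handling described for Figure 4 can be sketched as a preprocessing pass. This is a minimal model under stated assumptions, not the benchmarks' implementation: the bin capacity of 2 and the function name `regularize` are hypothetical, and the overflow list simply stands in for whatever imbalance-tolerant path (a second kernel or the CPU) would consume it.

```python
def regularize(keys, num_bins, capacity):
    """Sort item indices into fixed-capacity bins; spill the excess to an overflow list."""
    bins = [[] for _ in range(num_bins)]
    overflow = []
    for item, key in enumerate(keys):
        if len(bins[key]) < capacity:
            bins[key].append(item)   # regular work for the main kernel
        else:
            overflow.append(item)    # irregular work, processed separately
    return bins, overflow


keys = [0, 0, 0, 0, 0, 1, 2, 2]      # key 0 is heavily over-represented
bins, overflow = regularize(keys, num_bins=3, capacity=2)
max_bin_load = max(len(b) for b in bins)
print(bins, overflow, max_bin_load)
```

After the pass, no bin holds more than `capacity` items, so the work assigned to each lane or thread block in the main kernel is bounded, while the items in `overflow` are still processed exactly once elsewhere.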
Compaction
Compaction has been a technique within extremely parallel, shared-memory systems and programming models for quite some time as well. The fundamental issue is that when parallel work units produce a varying number of output elements into statically allocated output buffers, the buffer size must be overprovisioned. Because tasks determine output locations statically, unused holes or spaces in the output are the consequence of overprovisioning, such as those bins marked by X's in Figure 4. Output gaps interleaved with useful data cause bandwidth efficiency to drop for DRAM and cache architectures operating on transactions of larger, contiguous memory chunks. Compaction is a method of coordinating parallel tasks to dynamically determine output locations such that no holes are introduced.
If compaction were a separate processing step, as depicted in Figure 4, it would simply move all the useful data elements into contiguous addresses, filling in the holes, while keeping track of where each output section begins, as it will be data dependent [1]. More often, and in the MRI-Gridding and BFS Parboil benchmarks where GPU computation produces compacted output, the compaction is integrated into the kernel producing output itself.
The benefits from compaction primarily stem from the reduced memory footprint of the compacted data format. Performance impacts are typically only meaningful for bandwidth-bound kernels, and even then only minimally if the overprovisioned regions of the buffers are mostly contiguous. Thus, we quote not performance results but memory capacity reduction effects in Table 8. Compaction is essential for the MRI-Gridding benchmark in particular, for which we cannot even run uncompacted versions of reasonable datasets on most GPUs due to insufficient global memory capacity.
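One common way to determine data-dependent output locations is an exclusive prefix sum over the per-task output counts. The sketch below uses that approach as an assumption; the text does not state which mechanism the benchmarks use, and kernels with integrated compaction may instead reserve space with atomic increments. The function name `compact` and the sample data are illustrative.

```python
def compact(per_task_outputs):
    """Pack variable-length per-task outputs contiguously using an exclusive scan."""
    counts = [len(o) for o in per_task_outputs]
    offsets, running = [], 0
    for c in counts:                 # exclusive prefix sum of the counts
        offsets.append(running)
        running += c
    packed = [None] * running        # exactly as large as the useful data
    for task, outputs in enumerate(per_task_outputs):
        base = offsets[task]         # where this task's output section begins
        for j, value in enumerate(outputs):
            packed[base + j] = value
    return packed, offsets


# Five tasks, each of which could emit up to three elements.
per_task_outputs = [[7, 8], [], [9], [], [1, 2, 3]]
packed, offsets = compact(per_task_outputs)
overprovisioned_slots = 3 * len(per_task_outputs)
print(packed, offsets, overprovisioned_slots, len(packed))
```

Here a statically allocated buffer would need 15 slots, while the compacted buffer needs 6, and `offsets` records where each task's section begins.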
Data Layout Transformation
DRAM systems supporting both CPU and GPU architectures are designed to transfer data in large, contiguous lines or rows. Poor usage of CPU cache lines or GPU coalesced bursts will result in poor performance. However, GPU coalescing rules are somewhat harsher, because of the shorter time window over which the software could make use of a data burst from DRAM before any unused data is "dropped", requiring retransmission if needed at a future time. In some GPU architectures, the window is instantaneous, only exposed to a single SIMD instruction. More recent architectures introduce a small degree of caching extending this window, but because of the high degree of threading and the cache's low capacity, the window in practice is still very small. This is in contrast to CPU cache lines, which will typically sit in the cache for a longer period of time before being replaced.
Programmers work within the DRAM system design with well-chosen data traversal orders or task index organization. If the elements in question are single-word data and closely associated with task indexes, a good choice of task index to element index mapping is typically sufficient to get good memory system performance. However, that pleasant situation is not always feasible. Sometimes, the data elements needed within a particular time window are not naturally adjacent to each other in the memory address space. Take, for example, the diagram in Figure 5, which shows a warp accessing fields from a set of cells for various layouts. In the top case, using C standard data structure layout, the warp accesses addresses with a large stride between them, requiring multiple memory lines of mostly unused data to fulfill the requests. The middle case of Figure 5 shows equivalent accesses with a structure-of-arrays layout, a common transformation. However, even the middle case results in a large distance between the addresses of two fields, accesses to which are likely to happen very close together in time.
Depending on the memory system design, performance can be improved further by more complex layout transformation [9], perhaps resulting in a layout like that depicted at the bottom of Figure 5, where accesses to multiple fields will request adjacent, contiguous regions of memory. Specific examples of data layout transformation in the Parboil benchmarks include several instances of array-of-structure to structure-of-array transformations, a matrix transposition in SGEMM, and a transposed sparse matrix data storage format in SpMV. We specifically isolate the data layout transformation effect for the LBM benchmark, with an order-of-magnitude speedup as shown in Table 9. Overall, transformations for the purposes of achieving coalescing often achieve very high performance gains, as for LBM, while layout transformations for improving memory-level parallelism or avoiding moderate partition camping can effect a more modest improvement. Sung et al. report speedups ranging from 5% to 30% for already coalesced accesses in different benchmarks [9].
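The effect of each layout on a warp's memory requests can be estimated by computing addresses and counting the distinct memory lines they fall in. The sketch below is a back-of-the-envelope model, not a description of any particular GPU: the 128-byte line, 4-byte words, 32-thread warp, 16 fields per cell, and the tiled address formula (meant to resemble the bottom layout of Figure 5) are all assumptions chosen for illustration.

```python
LINE_BYTES = 128   # assumed memory line size
WORD_BYTES = 4     # assumed field size
WARP_SIZE = 32     # assumed warp width


def aos_address(cell, field, num_fields):
    """Array of structures: all fields of one cell are adjacent."""
    return (cell * num_fields + field) * WORD_BYTES


def soa_address(cell, field, num_cells):
    """Structure of arrays: each field is one long contiguous array."""
    return (field * num_cells + cell) * WORD_BYTES


def tiled_address(cell, field, num_fields, tile=WARP_SIZE):
    """Hypothetical tiled layout: warp-sized chunks of each field, interleaved by field."""
    return (((cell // tile) * num_fields + field) * tile + cell % tile) * WORD_BYTES


def lines_touched(addresses):
    """Number of distinct memory lines covered by a set of byte addresses."""
    return len({a // LINE_BYTES for a in addresses})


num_fields, num_cells = 16, 1024
warp = range(WARP_SIZE)  # thread t reads field 0 of cell t
aos_lines = lines_touched(aos_address(c, 0, num_fields) for c in warp)
soa_lines = lines_touched(soa_address(c, 0, num_cells) for c in warp)
tiled_lines = lines_touched(tiled_address(c, 0, num_fields) for c in warp)
soa_field_gap = soa_address(0, 1, num_cells) - soa_address(0, 0, num_cells)
tiled_field_gap = tiled_address(0, 1, num_fields) - tiled_address(0, 0, num_fields)
print(aos_lines, soa_lines, tiled_lines, soa_field_gap, tiled_field_gap)
```

Under these assumptions the array-of-structures access touches 16 lines for 32 useful words, while both transformed layouts touch one. The tiled layout additionally places a cell's next field 128 bytes away instead of 4096, so consecutive field accesses request adjacent lines.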
Granularity Coarsening
Granularity coarsening has been anecdotally described in many application optimization papers, perhaps most rigorously by Volkov in regards to linear algebra kernels [10]. When a larger task is decomposed into a set of fine-grained work-items, there is almost invariably some amount of overhead introduced in the problem decomposition. The overhead may vary for different algorithms and kernels, but almost every kernel will exhibit some inefficiencies in recalculating values like address offsets or other seemingly "small" operations in many threads. The finer the decomposition, typically the larger the overhead incurred. In addition to innate implementation inefficiencies, most real systems incur some fixed costs creating or scheduling parallel tasks, and communication operations tend to become more costly as the number of communicating tasks grows.
The CUDA and OpenCL programming models lend themselves to an "elemental" style of decomposition, where the source code of the kernel is scalar, processing a single element, with as many threads created as there are elements to process. With this extreme level of decomposition, the level of redundancy and other inefficiencies can be surprisingly high, but difficult to address within the elemental-function methodology as the cost of communicating between different tasks is still higher than the cost of redundant computation.
Granularity coarsening is essentially a de-parallelization of a program. Instead of executing code where each thread processes one element, each thread processes several. Figure 6 shows a coarsening transformation by a factor of six. By putting several threads together, redundant operations that were previously executed once by each original thread have their redundant executions reduced by a factor of the degree of coarsening. Furthermore, what had been shared reads or conflicting writes to a variable in the untransformed code become private uses of data. In the example of Figure 6, although task parallelism was reduced by a factor of six, the total number of operations required to compute the full output was reduced by nearly two-thirds. The efficiency gains make incremental coarsening worthwhile so long as the amount of parallelism is still sufficient to occupy the parallel resources of the device. Examples of specific efficiency gains are shown in Table 10.
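The redundancy argument can be checked with a simple operation count. The sketch below is a hypothetical toy, not the example of Figure 6: each modeled thread performs one redundant setup operation (computing a row base offset) plus one operation per element, and coarsening lets several elements share a single setup. The function names and problem sizes are illustrative assumptions.

```python
def fine_grained(width, height):
    """One modeled thread per element; each thread recomputes the row base offset."""
    out, ops = [], 0
    for y in range(height):
        for x in range(width):
            row_base = y * width     # redundant across all threads of the row
            ops += 1
            out.append(row_base + x)
            ops += 1
    return out, ops


def coarsened(width, height, factor):
    """One modeled thread per `factor` elements; the row base is computed once per thread."""
    out, ops = [], 0
    for y in range(height):
        for x0 in range(0, width, factor):
            row_base = y * width     # shared by all elements of this thread
            ops += 1
            for x in range(x0, min(x0 + factor, width)):
                out.append(row_base + x)
                ops += 1
    return out, ops


fine_out, fine_ops = fine_grained(width=12, height=2)
coarse_out, coarse_ops = coarsened(width=12, height=2, factor=6)
print(fine_ops, coarse_ops)
```

In this toy the thread count drops from 24 to 4 while the operation count drops from 48 to 28. The saving depends on how much of each thread's work is redundant setup, and the reduced thread count is only acceptable while enough parallelism remains to occupy the device.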
Summary
Table 11 shows a compact representation of which optimization patterns were relevant for each benchmark. Note that the table does not convey the relative importance of each optimization pattern to each benchmark. In our experience, a given benchmark's performance improvement due to optimization is typically dominated by one or two optimizations, with others making smaller contributions. Also, note that certain optimization patterns are widely applicable, such as granularity coarsening, while others are only applicable to applications with certain characteristics, such as binning. Finally, we would like to point out that some of the optimization patterns are clustered. Regularization and compaction, for instance, are typically combined rather than applied separately, because both are applicable for similar workload characteristics.
We are not necessarily convinced that these optimization [...]
Figure 7: Performance of code optimized for each successive GPU generation, plotted against raw throughput and bandwidth scaling for comparison.
[...] a moderate impact on optimized workload performance include increased register file capacity, which boosted the performance of applications such as SGEMM and SAD in particular because of the extensive register tiling of those benchmarks. The performance improvement for register-tiled benchmarks came less from the opportunity for additional register tiling, which reaches asymptotically low incremental benefits and had little impact in practice, and more from the architecture's ability to increase occupancy for the same degree of register tiling.
The single feature with the most performance impact overall was the global memory cache added in the GTX 480 generation. Even for optimized codes, scratchpad usage can introduce meaningful inefficiencies into the software. That overhead is typically overcome by the performance improvement due to captured locality otherwise unattainable in the absence of a cache, but it does put scratchpads at a disadvantage to implicit caches for certain workloads. Furthermore, the GTX 480's cache captures what private scratchpads never can: shared locality among different thread blocks and access patterns with irregular locality. The spmv benchmark performance increases for the GTX 480 primarily from the caching of irregular accesses to the dense vector. The stencil benchmark benefits from caching because any memory tiling approach in that benchmark must address the fact that the size of the input tile needed to compute an output tile does not match the output tile size. Thread block sizes must be chosen to fit either the input or output tile size, resulting in inefficiencies from idle threads or increased software complexity for explicitly copying input tiles, respectively.
In addition, the tile borders overlap with the working sets of other thread blocks, exposing locality that cannot be captured with private scratchpad memory. The version of the stencil benchmark optimized for the GTX 480 actually avoids memory tiling, improving the efficiency of the instruction stream by relying on the cache and thread block scheduling policy to capture locality. The cache also significantly improves the BFS benchmark's performance by caching the end of the output queue while threads in a block incrementally add to its tail.
Surprisingly, atomic operations to shared memory had less performance impact than we expected. On further analysis, we found that privatization optimizations had reduced contention on shared memory locations requiring atomic updates to the point that the overhead of our software atomic updates, which scales with contention, was not so high as to make hardware-assisted atomics indispensable. While our iterative atomic update methods were limited to certain situations, the versions optimized for the 9800 GX2 targeted those situations specifically, resulting in sufficient atomic update performance.
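How privatization lowers contention can be sketched with a simple histogram model. The code below is a CPU-side emulation under stated assumptions, not the benchmark implementation: threads are loop iterations, "contention" is approximated as the number of updates aimed at the single hottest location, and the mapping of thread `tid` to private copy `tid % n_copies` is a hypothetical choice made for illustration.

```python
def shared_histogram(keys, n_bins):
    """All threads update a single shared histogram.

    Returns the histogram and the update count of the hottest location,
    used here as a rough proxy for contention on that location.
    """
    hist = [0] * n_bins
    for key in keys:                    # each iteration models one thread
        hist[key] += 1
    return hist, max(hist)


def privatized_histogram(keys, n_bins, n_copies):
    """Threads update one of `n_copies` private histograms, merged at the end.

    Assumption: thread `tid` uses copy `tid % n_copies`.
    """
    copies = [[0] * n_bins for _ in range(n_copies)]
    for tid, key in enumerate(keys):
        copies[tid % n_copies][key] += 1
    # Merge step: one pass over the private copies per bin.
    merged = [sum(copy[b] for copy in copies) for b in range(n_bins)]
    hottest = max(max(copy) for copy in copies)
    return merged, hottest


if __name__ == "__main__":
    keys = [0] * 48 + [1] * 16          # heavily skewed input
    print(shared_histogram(keys, n_bins=4))
    print(privatized_histogram(keys, n_bins=4, n_copies=8))
```

With this skewed input, the hottest shared location receives 48 updates, while with eight private copies no single location receives more than 6. Under a cost model in which update overhead grows with contention, that reduction is what makes a software update loop tolerable.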
Finally, we note that the BFS benchmark in particular does not scale very well with regard to the number of SMs in the system. The BFS implementation, which in certain kernels fills the GPU and performs device-wide barrier synchronizations, does not perform as well on the S1070 as on the narrower 9800 GX2 and GTX 480 devices. As machine widths are likely to increase on average in the future, it seems reasonable to expect that using atomic operations for chip-wide synchronization will become increasingly inefficient, and should perhaps be avoided if possible.
Baseline Performance Improvements
Figure 8 shows the performance improvement of a single, optimization-agnostic implementation across the different GPU generations, again plotted against the raw throughput and bandwidth improvements of the devices themselves. The definition of "unoptimized" is somewhat slippery, because it is always possible to write less efficient code by doing some kind of useless computation. Our philosophy while writing these baseline versions was to write the simplest functional code that seemed reasonable to us. We cannot claim that the baseline versions of all the benchmarks are equivalently unoptimized, but we believe we can still learn some useful insights by paying attention to what "inefficiencies" are automatically mitigated or eliminated by particular architectures.
Generally, the performance trends are clearly positive, and significantly higher in magnitude than the improvement of the optimized code versions. In several instances, one architecture generation brings order-of-magnitude speedups over the previous generation. These are mostly benchmarks whose uniform or misaligned accesses achieved artificially poor memory bandwidth on the 9800 GX2, and which surged in performance when those limitations were removed in the S1070. The Fermi generation improved global memory broadcast accesses further by automatically promoting them to use the constant memory cache. Broadcast accesses are those where each thread in a warp loads from exactly the same address in a particular instruction. The GPU's constant cache supports this access pattern with very high performance. The constant cache design of the GTX 480 architecture enables the CUDA compiler to automatically transform accesses to use it under certain conditions, which reduces pressure on the general global memory cache and results in significant speedups for unoptimized mri-q, tpacf, and sgemm implementations.
Despite the raw bandwidth improvements of the S1070 over the 9800 GX2, the strided access pattern of the unoptimized lbm benchmark saw practically no performance improvement. It was not until the cache of the GTX 480 that its performance meaningfully improved. The GTX 480 cache also had a significant impact on the performance of codes that had shared locality in the accesses among thread blocks that was not exploited by explicit memory tiling, in particular the stencil benchmark.
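A rough model helps explain why a strided pattern gains little from additional raw bandwidth. The sketch below counts how many memory segments one warp touches when 32 consecutive threads each read the same field of consecutive cells, first in an array-of-structures layout (strided) and then in a structure-of-arrays layout (unit stride). The 128-byte segment size, 4-byte elements, and 20 fields per cell are illustrative assumptions, not measured properties of the lbm benchmark or of any particular device.

```python
WARP_SIZE = 32        # threads issuing one load instruction together
ELEM_BYTES = 4        # assumed single-precision elements
SEGMENT_BYTES = 128   # assumed granularity of a memory transaction


def segments_touched(addresses):
    """Number of distinct memory segments covered by a set of byte addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


def aos_address(cell, field, n_fields):
    """Array of structures: all fields of one cell are adjacent in memory."""
    return (cell * n_fields + field) * ELEM_BYTES


def soa_address(cell, field, n_cells):
    """Structure of arrays: the same field of adjacent cells is adjacent."""
    return (field * n_cells + cell) * ELEM_BYTES


def warp_segments(layout, field, n_fields, n_cells):
    """Segments touched when a warp reads `field` of cells 0..WARP_SIZE-1."""
    cells = range(WARP_SIZE)
    if layout == "aos":
        addrs = [aos_address(c, field, n_fields) for c in cells]
    elif layout == "soa":
        addrs = [soa_address(c, field, n_cells) for c in cells]
    else:
        raise ValueError("layout must be 'aos' or 'soa'")
    return segments_touched(addrs)


if __name__ == "__main__":
    print(warp_segments("aos", field=0, n_fields=20, n_cells=1024))
    print(warp_segments("soa", field=0, n_fields=20, n_cells=1024))
```

Under these assumptions the strided layout touches 20 segments for a single warp-wide load, whereas the transformed layout touches 1. Most of the bytes fetched in the strided case are not used by that instruction, which is the kind of waste that a cache or a layout transformation can recover.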
4.3 Optimization & Architecture Interactions
Finally, we examine the performance improvements of optimization for each benchmark and GPU generation, with results presented in Figure 9. The overall trend is significantly downward, implying that optimizations in general are becoming less critical over time. Conversely, we can say that many of the performance cliffs avoided by optimization are becoming less steep with successive architecture generations. However, there are some exceptions. The binning optimization pattern, exemplified by the cutcp benchmark in particular, results in consistently high speedups due to the change in fundamental algorithmic complexity, as should be expected. Also, while architectures are becoming slightly less sensitive to imperfect access patterns, good data layout remains extremely important, as exemplified by the lbm benchmark's 5x performance improvement from layout transformation, even on the Fermi architecture. For the sgemm benchmark, register tiling results in consistently high speedups. For such "simple" codes, the primary bottleneck is instruction stream efficiency: how many instructions compute necessary floating-point operations relative to how many instructions calculate addresses or move memory around. Even when artificial bandwidth inefficiencies are addressed by the Fermi architecture, a significant speedup can be expected from good register tiling.
Figure 9: Speedup of optimizations for each GPU generation.
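The instruction-stream argument for register tiling can be made concrete with a simplified count. The sketch below assumes a hypothetical matrix-multiply inner loop in which, per step along the shared dimension, a thread holding a `tile_m` by `tile_n` block of outputs loads `tile_m` values of one operand and `tile_n` of the other, then issues one multiply-add per output. Address arithmetic and loop overhead are ignored, so the figures are optimistic and purely illustrative.

```python
from fractions import Fraction


def fma_fraction(tile_m, tile_n):
    """Fraction of inner-loop instructions that are useful multiply-adds.

    Assumed per-step instruction mix for one thread:
      loads: tile_m + tile_n   (one column slice and one row slice)
      FMAs:  tile_m * tile_n   (one per output held in registers)
    """
    fmas = tile_m * tile_n
    loads = tile_m + tile_n
    return Fraction(fmas, fmas + loads)


if __name__ == "__main__":
    for tile in (1, 2, 4, 8):
        frac = fma_fraction(tile, tile)
        print(f"{tile}x{tile} register tile: {frac} of instructions are FMAs")
```

In this model an untiled thread spends only one instruction in three on arithmetic, while a 4x4 register tile raises that to two in three and an 8x8 tile to four in five. The shrinking increments are consistent with the diminishing returns from ever-larger tiles noted earlier.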
CONCLUSIONS
Hundreds of articles have been published on optimizing applications for GPUs, and for good reason. In this paper, we have verified that for nearly all applications, hand optimization has great performance rewards for GPU architectures. Additionally, those application optimizations are worth explaining and sharing, because certain patterns of optimization are applicable for a wide range of workloads. Each optimization pattern discussed in this paper was somewhat applicable for at least two benchmarks, and critically important for one or two as well.
We can also verify that some of the "worst" days of GPU computing are now behind us. Although legacy GPUs will still linger in the marketplace for several years, NVIDIA and other vendors seem to be getting on track with the design philosophy that unoptimized code exists, matters, and must be addressed. While major optimizations like binning or good choice of data layout should continue to be forefront in the minds of developers, others we are beginning to think of as good things to do if there is time instead of essential for getting any kind of reasonable performance.
Finally, based on experiments on past architectures, we can say with some confidence that if there were some way of making these optimization patterns unnecessary, it would have been done by now. Many of the optimization patterns could be applied to any parallel system, and are still relevant for today's multicore CPUs after decades of research and experience with high-performance parallel systems.
While innovation may still surprise us, it seems like manual program optimization, and in particular the optimization patterns we have presented in this paper, will continue to be relevant for parallel architectures in general, and GPUs specifically, for years to come. Software developers for high-performance applications would do well to brush up on these optimization patterns, and to continue to publish optimization insights either applying these general patterns to specific contexts, or possibly describing new optimization patterns.
