Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. In addition, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.
Introduction
For the past two decades, a majority of high-performance parallel computing research, including much visualization research, has focused on the large, commercial off-the-shelf (COTS)-based distributed memory system as the primary platform. These systems have become more powerful due to a combination of faster processors, faster memory, and faster interconnect. Up until recently, growth in processor speed and capacity more or less followed Moore's law: transistors shrank, allowing more of them to be packed into a given unit area of silicon, and clock rates increased. As heat and power constraints may limit future gains, multi-core processors have emerged as industry's solution.
In computational science research communities, recent research has focused on the problem of adapting algorithms to make effective use of this 'new' platform, the multi-core processor. Achieving optimal algorithm performance typically involves a careful examination of how an algorithm moves data through the memory hierarchy. Unfortunately, the cache hierarchies and memory management systems evolve at a very fast pace. Increasingly, there is often a non-obvious relationship between algorithmic parameters such as blocking strategies, their impact on memory utilization, and in turn the relationship with runtime issues.
For example, on GPUs, performance can be highly sensitive to the degree to which threads in a thread block all execute the same code: performance will degrade when there is conditional branching and threads execute different code paths. Recent work has shown there is not necessarily a clear correlation between the amount of memory traffic and absolute runtime (Datta et al. 2009b ), a result confirmed by our study here. At best, multi-processor algorithmic performance models, which are derived using a combination of architectural knowledge and empirical runtime observations, can predict performance bounds, but do not always offer advice on how algorithms should be tuned to produce optimal performance, particularly in light of architecture-specific characteristics, such as thread divergence on GPUs.
As a result, the computational science community has evolved a technique known as 'auto-tuning' which is methodology for finding the combinations of tunable algorithmic parameters that result in the best performance of an algorithm on a particular platform for a particular problem size. This approach has been used with success to optimize the performance of stencil-based codes on multi-core CPUs and many-core GPUs (Datta et al. 2008 (Datta et al. , 2009b Hollingsworth and Tiwari 2010; Kamil et al. 2010; Williams et al. 2010) . Auto-tuning is based upon the premise that one can enumerate all possible tunable configurations of an algorithm. As the number of potential permutations of configuration can be quite large, search strategies have emerged as a research area to avoid having to perform an exhaustive search of all possible configurations.
The contribution of this work is an application of autotuning principles (the systematic exploration of tunable algorithmic parameters, known algorithmic optimizations, and memory layout strategies) to study the relative performance gain of a staple visualization algorithm, raycasting volume rendering, on two multi-core CPUs and one many-core GPU by measuring absolute runtime as well as memory cache utilization (L2 cache misses via hardware performance counters). The comparison of hardware performance counter and runtime data helps to provide deeper insight into the relationship between memory traffic, cache utilization, and absolute runtime for volume rendering using known algorithmic optimizations. Our test results reveal a wide variation in runtime performance across all platforms and datasets, as much as 254% on the CPU and 265% on the GPU. The settings that produce the best performance vary from problem to problem and platform to platform, often in a non-obvious way. For example, our results indicate the best-performing configurations on the GPU occur at a crossover between memory utilization and thread divergence, a result that is likely to be extremely difficult, if not impossible, to predict with an empirical performance model.
The auto-tuning methodology, which is familiar to the computational science community, is relatively unexplored in the visualization community. The settings and algorithmic optimizations that result in the 'best performance' will change as platforms evolve: our results suggest very different settings than earlier work due to the evolution in memory and processor technology. The focus of this work is a thorough study of the performance and memory utilization characteristics of known algorithmic optimizations on modern platforms. The findings reveal insights about their performance on modern platforms, including the fact that an optimization (Z-order memory layout) initially designed to be 'cache friendly' on decade-old single-core technology is surprisingly beneficial on modern GPUs. The autotuning methodology will likely prove useful in suggesting new algorithmic optimizations and evolution on future platforms where core density increases along with the cost of moving data through the memory hierarchy.
Previous work
Levoy's formulation for raycasting volume rendering is the process of casting rays through a 3D volume and integrating color and opacity along the ray's length to produce a final pixel color (Levoy 1988) . A common approach for parallelizing this staple visualization algorithm produces what is known as 'hybrid volume rendering' (Ma et al. 1993; Tiwari and Huntsberger 1994; Ma 1995; Bajaj et al. 1999) , which is a two-stage algorithm where data is partitioned across processors to be rendered with raycasting (Drebin et al. 1988; Levoy 1988; Sabella 1988; Upson and Keeler 1988) . Each processor generates a partial image, and these are combined in a second algorithmic stage using compositing (Drebin et al. 1988; Levoy 1988; Upson and Keeler 1988) . Nieh and Levoy (1992) describe a parallel ray tracing volume rendering algorithm that runs on a sharedmemory platform and uses a task-queue image partitioning technique where screen pixels are partitioned amongst processors. Palmer et al. (1998) studied two parallel partitioning and dynamic load balancing algorithms to explore the tradeoffs between their memory hierarchy performance. They suggest that image-order decomposition strategies suffer from a lack of locality in accessing the 3D volume data: during the ray integration loop, the most resource intensive part of the raycasting volume rendering algorithm, a given ray may need to access any voxel within the source volume. This lack of locality can result in 'cache thrashing', which is a relatively low level of cache reuse. They report that object-parallel partitionings scale well, and this form of partitioning has been adopted as the basis for parallel work decomposition in many subsequent works (e.g. Hsu 1993; DeMarle et al. 2003; Cavin et al. 2005; Yu et al. 2008) . In contrast, our work here is a more comprehensive, systematic exploration of the relationship between algorithmic optimization and tunable algorithmic parameters (image tile size, work assignment strategy, and alternative memory layouts for the source data, and algorithmic optimizations) and their impact on algorithm performance in terms of runtime and cache utilization measured via hardware performance counters.
Looking more closely at the impact of memory accesses and algorithmic performance, Grim et al. (2004) explored the performance impact resulting when varying the size of data blocks allocated to processors in object-order parallel volume rendering on a shared-memory, multi-core CPU platform. Their objective is to minimize the number of repeated voxel loads, so their approach is to use an 'advancing ray-front' approach (Law and Yagel 1996) combined with small data blocks that will fit within cache so as to exploit spatial locality. In contrast, our approach is to exploit temporal, rather than spatial coherence, in an image-parallel approach that is portable to multi-core CPUs and many-core GPUs, as well as comparing performance of alternative memory layouts.
With the aim of exploring a memory layout that is more 'cache friendly', Pascucci and Frank (2001) used an indexing scheme based upon the Lebesgue, or Z-order, space-filling curve to improve and optimize progressive visualization algorithms. This data layout has desirable spatial locality properties: at any point in the mesh, nearby points are nearby in memory or storage. In our study, we wish to compare the performance impact of this type of data layout in sharedmemory parallel raycasting volume rendering with arrayorder memory layout and with interactions of other algorithmic optimizations and tunable algorithmic parameters.
Whereas CPU-based volume rendering techniques span both object-and image-order parallel approaches, GPUbased volume rendering algorithms tend to exploit imagelevel parallelism. Examples including implementing volume rendering as a fragment program with raycasting (Krüeger and Westermann 2003; Stegmaier et al. 2005; Gobbetti et al. 2008; . The basic idea with these implementations is to draw a proxy geometry where the fragment program, invoked during rasterization, performs the actual raycasting volume rendering. The degree of parallelism is a function of the number of fragment programs that can be run at once by a given GPU. With this approach, there is no direct opportunity for tuning algorithm performance in terms of parameters such as block size, which may provide the opportunity for performance gains through improved temporal cache locality. Marsalek et al. (2008) implement a raycasting volume rendering kernel in CUDA. As with our CUDA implementation, theirs is an image-parallel approach where each CUDA thread performs raycasting on an image pixel. Their results suggest that CUDA kernels perform at a rate commensurate with earlier fragment-based approaches. They achieve a performance boost by a process of manual experimentation with thread block size, the number of threads per thread block, to find a good balance between register and shared memory use. In contrast, we systematically explore and report on the performance impact of varying the thread block size as well as alternative memory layouts.
In recent years, there has been a great deal of activity focusing on optimizing the performance of codes on parallel computing platforms. Given the diversity of computer architectures, along with their rapid change, a technique known as 'auto-tuning' has emerged as a way to tune codes to achieve optimal performance (Datta et al. 2008; Hollingsworth and Tiwari 2010; Williams et al. 2010) . Auto-tuning is based upon the premise that one can enumerate all possible tunable configurations of an algorithm for a given problem size. As the number of potential permutations of configuration can be quite large, search strategies have emerged as a research area to avoid having to perform an exhaustive search of all possible configurations. Alternately, some recent research, such as that by de la Cruz and Araya-Polo (2011), focuses on deriving a predictive performance model for a 3D stencil computation code, which has a uniform and predictable memory access pattern. The problem we are studying here, raycasting volume rendering, is representative of a class of algorithm that performs 'unstructured' or 'non-uniform' memory access. In particular, since our implementation uses perspective projection during rendering (Foley et al. 1990 ), each of the rays through the volume will diverge from one another along their length moving away from the viewpoint. The result is that each ray's memory access pattern is different than all other ray's memory access patterns. Therefore, this application is well suited for auto-tuning. While we perform what amounts to an exhaustive search of the space of tunable algorithmic parameters, interesting future work could focus on using one of several different search strategies to avoid the expensive exhaustive search.
Our work extends previous studies to explore the effect of varying the size of the image-tile partitioning and memory layout options in a shared-memory volume renderer on modern multi-core CPU and many-core GPU platforms. Specifically, we wish to understand the relationship between various tunable algorithmic parameters, namely tile size and shape, and memory layout alternative, with performance on a variety of platforms run on a variety of different problems. Do different size or shape tiles or different memory layouts result in better use of memory hierarchy? What is the difference in performance that results from different configurations? Is the performance and memory utilization sufficiently variable to warrant use of auto-tuning for large, production runs? Does a known algorithmic optimization, such as early ray termination, offer a performance gain on a platform such as the GPU where there is a performance penalty for thread divergence?
These questions help us to better understand how tunable algorithmic parameters can affect performance on multi-and many-core systems. While the work in this study focuses on single-node rather than extreme concurrency configurations, our work here helps to reveal how to improve performance at extreme concurrencies by speeding the shared-memory parallel raycasting volume rendering by establishing algorithmic parameter settings that result in optimal, or near-optimal, performance. These single-socket configurations then comprise building blocks in large-scale multi-core hybrid-parallel and hybrid volume rendering applications executed at extreme concurrency, as in the case of recent work by Howison et al. (2010) , or in situations where multiple GPUs are employed as part of a distributedmemory parallel implementation ).
System design and implementation
Our implementation is a raycasting volume renderer that follows Levoy's formulation (Levoy 1988) and that is parallelized using a shared-memory programming model. In our parallel implementation, each thread is responsible for casting rays into the volume, performing color and opacity integration along its ray, and writing the result into a final image buffer. We are using an image-space decomposition: there is one single copy of the volume data, the work is divided up amongst threads such that each thread is assigned a separate part of the image space. We use this algorithm design pattern and parallelization strategy on both multi-core CPUs (Section 3.1) and many-core GPUs (Section 3.2), along with alternative memory layouts (Section 3.3).
3.1 Multi-core CPU implementation 3.1.1 Shared memory volume rendering. For the implementation in this study, our implementation uses POSIX threads (Butenhof 1997) as the shared memory programming model and execution environment. Upon startup, the application launches a user-specified number of rendering threads. Each of these threads is a slave that executes an event loop. A master thread tells all slave threads to execute a particular task, e.g., render the volume. A pair of shared memory barriers serve to synchronize the communication between the master and slave threads in the event loop. Prior to releasing the slave threads for rendering, the master thread will compute a number of values that are common across all slave rendering threads, e.g., inverse of modelview matrix, etc., and store these values into a sharedmemory data structure that is visible to all threads.
3.1.2 Shared memory parallel work/block decomposition. The approach we use for parallelism in this study is to have each thread independently cast rays into the volume to produce final image pixels; the master thread computes a list of work assignments for each thread, then each thread independently processes the list of work assignments. Each work assignment, or work block, consists of a spatially disjoint region of the final image, and a single thread will perform raycasting over all pixels in its assigned work block, then move on to the next work block. We explore two methods for assigning work blocks to threads: (1) a static assignment, where the master assigns blocks to threads in a round-robin fashion; and (2) a dynamic assignment where threads request work blocks from a single work queue. The user specifies an arbitrary tessellation of the output image that forms the basis for the work assignments: blocks of size 4 Â 4, 8 Â 8, 512 Â 1, 1 Â 512, etc., pixels.
We explored a configuration, called 'block synchronization', in which all threads encounter a shared barrier after completing work on a given block. The idea is to force all threads to process blocks in lock-step, with the intent of inducing better memory utilization by having the multiple worker threads accessing nearby regions of the source data. For the tests run in this study, we use the dynamic work assignment algorithm on multi-core CPU implementation, as in Nieh and Levoy (1992) . We determined through empirical testing that it performs better than the static work assignment approaches. Those results are not shown here due to space limits, and are consistent with earlier findings (Nieh and Levoy 1992; Palmer et al. 1998 ) that suggest static work assignment algorithms can suffer from load imbalance.
Concurrency.
To explore the effects of varying thread concurrency and of non-uniform memory access, we ran tests across lower concurrency levels than the number of available CPU cores. However, to maintain the same image and data volume size in these tests, we used a hybrid parallelism approach that combines the image-tile decomposition described above with a domain decomposition to distribute the data volume across groups of threads. In this way, we still ran as many threads as the number of cores by using multiple groups at the given concurrency level, with each group allocating its own memory buffer disjoint from the other groups (see Table 1 ). This functionality was already implemented for large-scale tests we conducted on distributed-memory systems using a hybrid parallelism approach (Howison et al. 2011) . For the CPU tests, we invoked the -Mconcur¼numa flag in the PGI compiler to link in a thread affinity library that prevented threads from migrating across cores.
Many-core GPU implementation
Our GPU raycasting kernel implementation is essentially the same as the CPU version. However, the way work blocks are assigned to GPU threads differs from the CPU implementation owing to the data-parallel nature of the CUDA language: the image is considered as a 2D CUDA grid that is divided into CUDA thread blocks. Each thread block corresponds to an image tile, and each individual CUDA thread in each thread block performs raycasting of a single pixel in the image. Therefore, an M Â N thread block corresponds to an M Â N pixel image tile. The CUDA runtime manages the dynamic scheduling of thread blocks on the GPU, assigning up to eight blocks at a time to each of the GPU's multiprocessors (NVIDIA Corporation 2010). The GPU uses hardware context switches between the blocks within a multiprocessor to hide memory latency. We used version 3.2 of the CUDA compiler and runtime and version 270.41.19 of the NVIDIA driver. ECC support was enabled. During compilation, we targeted CUDA 'compute capability' 2.0 (NVIDIA Corporation 2010) to take advantage of additional optimizations available on the Fermi-series architecture. We also configured shared memory to be used as L1 cache for global memory requests.
Memory layout
We explore two alternative ways of laying out source data in memory. In the array-order layout, a 3D source volume S is indexed such that a data value at S½i; j; k is at memory location i þ j Ã xsize þ k Ã xsize Ã ysize. The Z-order method is essentially a way of laying out a 2 n tree in memory in 1D fashion. Unlike array-order indexing, Z-order indexing has the desirable property that at any [i, j, k] location, accessing a point that is nearby in index space is also nearby in physical memory location. We use the formulation described by Pascucci and Frank (2001) where we compute the Z-order index through a series of bitwise operations.
To optimize algorithm runtime for both memory layouts, we implement a data structure that supports rapid memory index computation for both layout schemes. During volume rendering, we compute the index of a S½i; j; k location for array-order indexing using two table lookups and three additions, and for Z-order indexing, using three table lookups and two logical OR operations. In our GPU implementation, these lookup tables are stored in the GPU's constant memory for fast access. In both implementations, we found that a single lookup operation was significantly faster than the 12n bitwise operations (2n bit shifts, n bitwise ANDs, and n bitwise ORs for each dimension) needed to calculate the Z-order index on the fly, or the two additions and three multiplications to calculate the arrayorder index.
Experiment results

Methodology
Given that we are using an image-order algorithm, and that we are dividing up the work by having each thread work on a subset of the total image, we adopt the term 'block size' to refer to the size of the image tile assigned to a thread in the CPU version or a CUDA thread block in the GPU version.
Our primary research goal is to determine how much performance varies as a function of block size and layout of data in memory by measuring runtime and cache utilization characteristics.
The runtime we measure is the elapsed time required to perform raycasting and store a resulting pixel buffer. It does not take into account I/O time for either storing an image or loading the data, which is known to account for a sizeable fraction of total rendering time in large-data visualization applications. We are not trying to solve the I/O problem, and our focus on rendering only is appropriate for many use cases, such as creating multiple images from the same dataset as is typical of interactive visualization applications. For this same reason, we do not include the time required by our CUDA implementation for memory transfers between host and device.
Our second performance metric is L2 cache misses, which offers a more detailed picture of memory traffic. To obtain this data, we use the Performance Application Programming Interface (PAPI) 1 for the CPU platforms. PAPI requires a kernel module as well as application instrumentation via a simple API to specify which hardware counters to monitor. 2 On the GPU platform, we used the profiler available with the CUDA Toolkit version 3.2. 1 Conventional thinking is that there is a strong correlation between lower memory traffic levels and increased performance, however this is not always the case (Datta et al. 2009a ).
Platforms and datasets.
For this study, we make use of several different multi-and many-core systems. We ran our tests on a single node of each of the following systems, each having core/cache configurations shown in Table 2: carver.nersc.gov has 400 nodes, each consisting of dual quad-core 2.67 GHz Intel 5550 Nehalem processors and between 24 GB and 48 GB of DDR3 RAM. hopper.nersc.gov has 6384 nodes, each consisting of dual 12-core 2.1 GHz AMD 6172 MagnyCours processors and 32 GB of DDR3 RAM. oscar.ccv.brown.edu has two GPU test nodes, each consisting of dual 2.5 GHz Intel 5630 Nehalem processors and 24 GB of RAM. The GPU accelerator is a NVIDIA Fermi M2050, a many-core GPU with 448 'CUDA cores' grouped as 14 'multiprocessors' CUDA3-ProgGuide sharing 3 GB of GDDR5 memory. The M2050 is installed as a PCI-E x16v2 add-on card in these IBM iDataPlex nodes.
Our dataset is a 512 3 volume that we flattened from an adaptively refined mesh produced by a combustion simulation. Figure 1 shows the rendering of the dataset across the 10 representative views we used for our tests. We chose these views to span a diverse range of memory stride and access patterns during the ray integration stage. Our shading calculations require a gradient field with three elements per every element in the scalar data set. This gradient field can be precomputed and cached in memory, but it increases the memory footprint by a factor of three. Alternatively, the gradient field can be computed as needed inside the shading calculation. This choice poses a classic trade-off of storage space versus compute time. We experimented with both methods, using central differences in both cases to compute the gradient vector, and found that on-thefly computation was faster for both the CPU and GPU implementations by an average of 8.2% on the Intel/Nehalem, 19.4% on the AMD/MagnyCours, and 80.7% on the NVI-DIA/Fermi. The additional cache pressure of loading in precomputed gradients appears to be more detrimental than the six-point stencil operation needed to recompute the gradient at each ray integration step. The stencil operation has significant spatial and temporal locality, however, since the scalar values that are needed are either spatially nearby (especially in the case of Z-ordered memory), or are loaded at a similar time by the previous or consequent integration step. Conversely, this locality is not exploited when loading precomputed gradients from a separate memory buffer. 4.1.2 Parameter sweep/auto-tuning. The term 'auto-tuning' has come to mean the process of evaluating the performance of all potential implementations of a kernel, potentially using some advanced search heuristics to avoid performing an exhaustive search of all possible permutations. In our case here, it means sweeping through tunable algorithmic parameters to find the combination resulting in the best performance. Unlike a 'production' auto-tuning system, which would include a code generation phase followed by a lengthy run, we are performing only the parameter sweep to obtain performance measurements for each such configuration.
In our CPU tests, we explore 64 different image tile sizes where width and height both vary over f1; 2; 4; 8; 16; 32; 64; 128g. A complete test battery consists of approximately 5120 tests on the AMD/MagnyCours: 10 views, 64 block sizes, with and without early ray termination (ERT), two memory layouts, and two concurrency levels, and only the dynamic work assignment. The test battery on the Intel/ Nehalem was similar, with three work assignment methods but only 8-way concurrency, for a total of 7680 tests. In addition, we reran another 7680 tests on the Intel/Nehalem at 1way, 2-way, and 4-way concurrency, but with only the dynamic work assignment, in order to measure the runtime variation across concurrency levels reported in Table 3 .
On the GPU, we ran 880 tests over 22 different thread block sizes, with width and height varying over f1; 2; 4; 8; 16g (excluding the 1 Â 1, 1 Â 2, and 2 Â 1 blocks), across two memory layouts, and with and without ERT. We excluded the blocks with fewer than four threads because the NVIDIA/Fermi requires at least four threads per block to saturate the computational throughput of a warp of execution (as described in more detail later in Section 4.5).
Variation in runtime
To begin, we first wished to determine whether block size impacts performance, and if so, by how much, as measured overall by percentage variation. The summary results, Figure 1 . Output from our shared-memory parallel raycasting volume renderer, showing the 10 views we averaged our results over. Each frame is a 512 Â512 image rendered from a 512 3 dataset of a combustion simulation output. Simulation data courtesy of J. Bell and M. Day, Lawrence Berkeley National Laboratory. Table 3 . Percentage variation in runtime (averaged over 10 views) across all block sizes, broken down by levels of concurrency, memory layout, and ERT. Block size has a dramatic impact on performance, particularly as concurrency increases to all available CPU cores. On the GPU, variation was greatest for Z-ordered memory, because the performance of the fastest block configurations increased considerably with Z-ordered access. shown in Table 3 , indicate that block size can have a dramatic impact on performance. From Table 3 , we draw several conclusions. First, the amount of variation increases with increasing concurrency on the CPUs. This result is expected since there is increased opportunity for reuse of data that is shared among more threads. Block sizes that better utilize the cache hierarchy can perform much better, increasing the variation in performance when compared with less optimal block sizes. Second, block size has a nearly equal effect on variation for both array-and Z-ordered layouts in CPU memory. Although Z-ordering improves the spatial locality of data within the same 3D neighborhood, it is still sensitive to changes to spatial locality caused by different block sizes. Third, variation is greater for Z-ordering in GPU memory by more than a factor of three when compared with array ordering (see Section 4.7 for more details). Figure 2 amplifies the results shown in Table 3 for the Intel/Nehalem platform. The parameter sweep through block sizes at a single concurrency produces a 2D array of performance data. However, we are interested in the patterns that may emerge across concurrency levels, so by 'stacking' the 2D arrays at each concurrency level, we have a 3D dataset to which we can apply traditional visualization techniques for inspection. The corners enclosed by the red contour correspond to block sizes that performed worse relative to others at that concurrency level. The innermost blue contour shows the region of best performance, which is composed of medium-sized blocks and is consistent across concurrency levels. We see relatively lower variation at lower concurrency levels, and the largest relative outliers (the red contour) at 8-way concurrency. Again, both memory layouts exhibit similar performance characteristics in terms of variation across block size.
It is likely that larger block sizes suffer from poor load balancing since the granularity of the image tiles is so much larger. In this application, the amount of work each thread performs while processing an image tile is a function of scene characteristics, and there is not necessarily the same amount of actual work per block. In addition, the raycasting algorithm employs an 'empty-space skipping' (Müller et al. 2006 ) optimization that skips pixels that are predetermined not to lie within the bounding box of the oriented 3D volume after it has been projected onto the 2D view plane. Therefore, image tiles that have less coverage of the 3D volume can end up calculating substantially fewer ray integration steps.
Intel/Nehalem
On the Intel/Nehalem, medium-sized blocks show the best performance in terms of both faster runtime and lower levels of L2 cache misses (Figure 3 ) regardless of work assignment algorithm or memory layout. With smaller block sizes, we also see consistently higher runtimes. This result is likely due to a relatively lower level of temporal cache coherence, as evidenced by the relatively higher rate of L2 cache misses associated with smaller block sizes.
With the static work assignment algorithm, we conducted an experiment aimed at inducing temporal cache coherency. This algorithmic option, which appears in Figure 3 as 'static/ sync', forces all threads synchronize at the block level. The intent is to force all threads to access nearby memory locations in a synchronous fashion. This configuration performed so poorly that we abandoned further investigation. The poor performance is most likely due to load imbalance. In contrast, the dynamic work assignment shows better overall performance due to its better load balancing characteristics. Runtimes (averaged over 10 views, then normalized across concurrency level relative to the best runtime) from the Intel/ Nehalem across four levels of concurrency presented in 3D using isocontours. Two of the axes are block sizes, while the third axis is concurrency. Larger and smaller block sizes produce relatively poorer performance, particularly at higher concurrency levels, while medium-sized blocks produce relatively better performance at all concurrency levels.
The Z-ordered memory layout results in consistently better performance due to better use of the memory hierarchy. We see consistently lower levels of L2 cache misses when using the Z-ordered layout.
AMD/MagnyCours
The AMD/MagnyCours shows similar trends to the Intel/Nehalem: medium-sized blocks performed best, and there are benefits for both ERT and Z-ordering (see Figure 4 ).
We ran tests at two different concurrencies to look for effects due to non-uniform memory access (NUMA), where a core on one socket accesses memory with affinity to a different socket. Although our MagnyCours test system has two twelve-core processors in two physical sockets, these 24 cores actually share four memory controllers, so we use the term 'NUMA socket' to refer to the set of six cores that share a memory controller.
In the six-way concurrency test, we allocated four disjoint memory buffers and used the first-touch policy of the Linux kernel to set the affinity of each buffer to its own Figure 3 . Runtime (a) and L2 cache misses (b) averaged across 10 views fro 8-way concurrency on the Intel/Nehalem platform with varying block size, memory layout, work assignment method, and early ray termination (ERT).
NUMA socket. In this test, we executed four groups of six threads, with each group scheduled on its own NUMA socket. Our theory is that this configuration should lead to more uniform memory access, and therefore better performance. Surprisingly, we found the opposite to be true. At 24-way concurrency, with only a single memory buffer shared by threads executing across all 24 cores, performance was better across all conditions: block size, memory layout, and with and without ERT, even though this meant that 18 of those cores were accessing memory on a different NUMA socket. Moreover, Figure 4 shows that the block sizes that exhibited the fewest L2 cache misses are in the six-way concurrency condition, even though those block sizes had longer runtimes than their corresponding trials in the 24-way concurrency condition. We suspect that there are other important effects at the L3 cache level that we are not capturing with our L2 cache measurements. Exploring this issue further and refining our measurement of the cache hierarchy will be the subject of future work.
NVIDIA/Fermi
On the GPU, we see a wide variation in runtime across different block sizes ( Figure 5 ). Both Z-ordering and ERT show benefits for the more optimal block sizes. The best configurations on the NVIDIA/Fermi are the 4 Â 8 and 8 Â 4 thread blocks with Z-ordering and ERT. Performance appears to vary with the total number of threads in a block, leading to the diagonal striping in Figure 5 .
Even though CUDA thread blocks require a minimum of 32 threads to saturate computational throughput, many of the block sizes with 16 or fewer threads still perform well because of the branching nature of our algorithm, which causes divergence among CUDA threads. A thread block is executed in a single-instruction-multiple-thread (SIMT) fashion in which 'warps' of 32 threads are executed across four clock cycles in subsets of eight threads that share a common instruction. If those eight threads do not share a common instruction, such as when conditionals cause branching code paths, the threads diverge and must be executed serially. This situation is prevalent in our 1  2  4  8  16  32  64  128  1  2  4  8  16  32  64  128  1  2  4  8  16  32  64  128  1  2  4  8  16  32  64 algorithm. For example, imagine a thread block owning a region of the image that only partially covers the data volume. Some of the threads immediately exit because of the empty-space skipping optimization in our algorithm, while the other threads proceed to cast rays through the volume. Even the threads that proceed together with raycasting may have rays of different lengths, which will cause divergence and load imbalance. Since a warp must be scheduled across at least four clock cycles, using fewer than four threads per thread block will guarantee under-utilization, so we excluded those configurations from our sweep. Empirically, the sweet spot for thread block size is 16 or 32 threads, depending on the memory ordering and whether ERT is enabled. Many block sizes with 16 threads perform well even though this is less than the warp size of 32 threads, indicating the complex interaction of the CUDA runtime and warp scheduler in handling branching for this particular algorithm and problem. It is also likely that larger thread blocks exhibited greater load imbalance because the variation in ray lengths tends to increase with block size.
Surprisingly, the small thread blocks that displayed the worst performance also exhibited the fewest L2 cache misses (see Figure 5 ). Yet, the converse is not true: the most optimal block sizes do not show the most L2 cache misses. Instead, L2 cache misses appear to rise uniformly with the total number of threads in a block, leading to the same diagonal striping as seen in the runtime plot. Therefore, we conclude that achieving the best performance on the GPU is a trade-off between using enough threads to saturate a warp and using few enough to maintain good cache utilization. We also find that, as in the CPU tests, L2 cache misses are systematically less when using the Z-ordered memory layout on the GPU because of the improved spatial locality.
Interestingly, the NVIDIA CUDA Programming Guide (NVIDIA Corporation 2010) says: 'The effect of execution configuration on performance for a given kernel call generally depends on the kernel code. Experimentation is therefore recommended'. Our experiments show a wide variation in performance depending upon thread block size. While there is little surprise that such variation exists, the amount of variation, as much as 265%, is somewhat unexpected, as is the fact that the optimal block size for one problem is not the same as for another problem when run on the same platform.
Algorithmic optimization: ERT
We tested an algorithmic optimization called ERT, where a conditional within the inner-most loop that iterates along an individual ray tests the condition of the ray reaching full opacity. If it has, the loop is aborted since any further steps will not contribute to the color or opacity of that pixel. This optimization is dependent both on the input data and on the transfer function that determines how data values map to color and opacity. For our specific data set and transfer function, we see approximately 10% fewer integration steps when ERT is enabled, which is directly reflected in our reported runtimes: in cases where ERT is enabled, we see a 10-15% improvement in absolute runtime.
To demonstrate the relationship between the transfer function and the benefits of ERT, we ran an additional test with a 'shallower' transfer function (called 'B' in Figure 6 ) that did not penetrate as deep into the volume. With ERT enabled, this transfer function exhibited 19.7% fewer integration steps and ran 19.1% faster, as expected. However, the image rendered with transfer function 'B' has important differences in the areas and features of the dataset that are visualized, which may or may not be appropriate according to the application.
The benefits of ERT are highly dependent upon scene characteristics and transfer function, so it is impossible to say with certainty how much ERT will help algorithm performance in general. In earlier work investigating the use of ERT in raycasting volume rendering (Krüeger and Westermann 2003) , an opaque volume exhibited a 300% improvement in runtime, while a semi-transparent volume exhibited no improvement in runtime from ERT and instead sustained In an early test case where there was no reduction in raycasting steps from ERT for a particular scene, we measured a penalty of 5%. In summary, the effects of ERT are highly variable: there are combinations of data sets and transfer function where ERT has no effect, and others where it lead to a dramatic reduction in the number of integration steps.
Algorithmic optimization: Z-ordered memory
Z-ordered memory outperforms array-ordered memory on all platforms and at all concurrency levels, and these performance gains increase with concurrency (see Figure 7) . The benefits of Z-ordering at higher concurrency are likely due to the larger penalties for non-contiguous access and cache misses in shared memory systems that service many cores, as on the AMD/MagnyCours and NVIDIA/Fermi. While the Intel/Nehalem and AMD/MagnyCours show a modest gain from Z-ordering, ranging from 3% to 29%, the NVIDIA/Fermi exhibits gains of 146% at 64-way concurrency, i.e. 64 threads per thread block. As more cores access memory through a shared memory controller and subsystem, the improved locality of Z-ordered memory access and correspondingly lower L2 miss rates becomes increasingly beneficial. Thus, the NVIDIA/Fermi, whose memory subsystem must service 14 multiprocessors each with many thread blocks in flight, sees the largest improvements from Z-ordering. While the slower block configurations improve somewhat with Z-ordering, the best configurations improve greatly, leading to larger variation for Z-ordering than array ordering.
Because many datasets are not already stored with a Z-ordered layout, there is in practice a cost associated with re-ordering the data buffer, which we have not included. However, in the use case where multiple visualizations will be generated from the same data set, this initial cost is amortized.
Conclusions and future work
The main point of this study has been to explore the relationship between tunable algorithm parameters and known algorithmic optimizations and the resulting impact on performance of a staple visualization algorithm on modern multi-core and many-core platforms. Our results suggest a wide variation in performance can result: up to 254% on multi-core CPUs and 265% on many-core GPUs. We used the findings of this study to set tunable algorithmic parameters for a set of extreme-concurrency runs (Howison et al. 2010 (Howison et al. , 2011 that required literally millions of CPU hours; by finding and using optimal settings for tunable algorithmic parameters, we in effect saved millions of (a) Transfer function 'A' (b) Transfer function 'B' Figure 6 . Two different transfer functions have different benefits from early ray termination (ERT), but also yield images that accentuate different features of the underlying dataset. For the results presented in this paper, we used transfer function 'A'. Figure 7 . The performance gains (averaged across thread block size, with ERT) from using Z-ordered memory instead of arrayordered increase with concurrency. The gains are most notable on the NVIDIA/Fermi, where we count the number of threads per block as the concurrency.
additional CPU hours that would have been spent executing an application in a non-optimal configuration. This work, which uses a well-established methodology for finding optimal performance, shows that such a methodology can be useful for visualization algorithms as well, and that the algorithmic parameters that produce the best performance vary from problem to problem and platform to platform, often in a non-obvious way. Our GPU results underscore this result: the best performance results in what appears to be a crossover point between cache utilization and thread warp size and thread divergence that occurs due to this particular algorithm. This approach helps empirically establish algorithmic parameters that may be difficult, if not impossible, to determine using a performance model that predicts performance bounds, and has been widely used in the computational science community with great success (Datta et al. 2009b, a; Kamil et al. 2010 ).
The algorithm we study, raycasting volume rendering with perspective projection, uses an unstructured, or irregular memory access pattern. Each ray will, in effect, execute a different traversal path through memory, and the set of paths and the number of computational steps each requires is a function of runtime parameters, such as viewpoint, data set, and color transfer function. Other visualization algorithms exhibit similar memory access characteristics, such as parallel streamline computation (Camp et al. 2011) , and could benefit from this performance optimization methodology.
An avenue for future work would be to expand the scope of algorithmic parameters and options and discover how they inform performance within the context of the autotuning approach. Does the cost of branch prediction in the inner ray integration loop outweigh the benefit of an acceleration option such as ERT? In object-order (bricking) approaches, what is the relationship between brick size/ shape and actual observed cache utilization? How well would a bricking strategy work on the GPU, and does its use improve memory access patterns on the GPU? Would an alternate encoding of data in bricks, such as using a space-filling curve layout, result in better or worse utilization of the memory hierarchy, and what is the relationship of that performance with brick size and shape? How does a change in final image resolution alter the choice of block size? Could predictive performance models be enhanced to include some statistical estimation of runtime given known variability in algorithm performance that depends upon data or runtime characteristics? What is the impact of multi-field data, particularly with respect to memory utilization, and do the lessons learned for scalar data apply to complex data types?
Another avenue of future work would be to explore the efficacy of alternative programming environments, such as MapReduce (Dean and Ghemawat 2008) , for use on multiand many-core systems. Recent work by Stuart and Owens (2011) pursues this line of investigation for several computational kernels. Interestingly, their implementation of a GPU-capable MapReduce library 'relaxes' some of the MapReduce semantics to expose GPU-centric features, such as thread block size, although it is not clear whether this kind of control, which is necessary for auto-tuning, exists outside of Stuart and Owens' GPMR library, for example on multi-core CPU implementations. Even though our shared-memory implementation does not have an explicit 'reduce' step, one desirable outcome would be the ability to write code once that runs on many different types of platforms. Given that the settings that produce optimal performance vary by platform, dataset, and other runtime attributes, there seems to be a clear benefit to using an autotuning methodology to find those that produce optimal performance rather than by-hand coding.
