Abstract-Graphic process units (GPUs) are well suited to computing-intensive tasks and are among the fastest solutions to perform Computed Tomography (CT) reconstruction. As previous research shows, the bottleneck of GPU-implementation is not the computational power, but the memory bandwidth. We propose a cache-aware memory-scheduling scheme for the backprojection, which can ensure a better load-balancing between GPU processors and the GPU memory. The proposed reshuffling method can be directly applied on existing GPU-accelerated CT reconstruction pipelines. The experimental results show that our optimization can achieve speedup ranging from 1.18-1.48. Our cache-optimization method is particular effective for lowresolution volumes with high resolution projections.
II. GPU ARCHITECTURE
All Current GPUs are based on single instruction multiple data (SIMD) design. Here we take a recently released GPU -NVIDIA GeForce GTX 480 as an example to discuss the GPU architecture. Fig. 1 . These processors within one SMP can share an L1 cache and access a fast on-chip user-controllable cacheshared memory. One shared memory access costs 2 clock cycles in the best case, compared to hundreds of clock cycles for off-chip memory (also called device/global memory) access. The amount of shared memory and L1 cache in one SMP is user-configurable (16KB + 48KB or 48KB + 16 KB). All 15 SMPs will share a single chunk of L2 cache (768KB). Bilinear filtering is not done in SMPs but in texture filtering units. The peak filtering rate for the GTX 480 is 42G texels/s. There is a separate texture cache dedicated for filtering (average 12KB per SMP) while it still uses the same L2 cache and device memory.
Device memory is an off-chip memory that stores the input data and receives the output from the processors. The GTX 480 has 1.5GB DDR5 device memory with peak bandwidth 177.4 GB/s. The maximum GPU global bandwidth can only be achieved by memory coalescing, which essentially translates into 1 memory instruction per 128 bytes. This implies 32 neighbouring threads (a warp) read/write within a 128-byte-aligned segment. With proper alignment, sequential mapping of threads to memory address will yield a coalesced memory access pattern.
NVIDIA GPUs can be programmed via a C-like API -CUDA. CUDA is a general purpose API which exposes more control over how a task is computed on the GPU hardware, as compared to graphics-based APIs (CG, GLSL). The task- CUDA kernels are executed in terms of thread blocks, whereas the total task is called grid. On the hardware level, each block is mapped to a single SMP. In the back-projection stage of the CT reconstruction, SMPs are assigned to different regions of the resulting volume sequentially. This enables a mapping where the grid-block decomposition in CUDA corresponds to the volumetric reconstructed 3D dataset.
III. CONE-BEAM BACK-PROJECTION
In the back-projection computation, the commonly used method is the voxel-driven method. For example, slabs in the volume (Fig. 2) are mapped to thread blocks in CUDA. This volume-to-slabs decomposition has been used in [2] [3] [4] . Another type of mapping is based on decomposing the volume into horizontal tiles and each tile is mapped to a CUDA thread block, as used in [5] , [6] . While the former use 2D square tiles, the latter uses 1D linear tiles. We take the slab-based approach as an example to show how to optimize the hardware mapping to improve cache hit rate.
Each 3D slab of the reconstructed-volume is assigned to a SMP. In the depth dimension, the brick extends through the reconstructed volume. The slab is further divided into vertical tiles. Here, the product of the tile's width w and height h needs to be below the maximum number of threads per block L. In the NVIDIA Fermi GPU, L = 1024. The tile's width should be a multiple of 32 to ensure a coalesced writing pattern.
When performing the back-projection inside each slab, the order of execution of this back-projection follows along the set of 2D vertical tiles arranged in depth order. After we finish back-projecting, we incrementally store the resulting slices back in global memory.
The mapping of the kernel to the projection geometry is depicted in Fig. 2 . The concurrent execution of threads inside an SMP is limited by the maximum number of threads that can be run there. The NVIDIA 480 has 15 SMPs. A set of projections i a to i b is processed at the same time. Then the total projected area processed by the GPU, arising from projection i a to projection i b is:
In Eq. 1, max{thread} is the maximum resident threads for each SMP, is the projection angle for i th projection and l is the ratio of source-object distance over source-detectordistance. Assume is the constant representing the amount of L2 caches, the projected area should be smaller than the cache limit (Eq. 2) to ensure better cache hit-rate. Back-projection implementations [1] [2] [3] [4] [5] [6] all use texture memory for projection data storage to deal with irregular memory fetching pattern. Using texture memory has several advantages. First, the trilinear/bilinear interpolation is supported at hardware level, which can reduce the computation cost of GPU processors. Second, hardwarescheduled cache can benefit data reading with locality. Last, except locality, texture memory has no other constraint such as coalescing or confliction. 
IV. BACK-PROJECTION ORDERING
According to different orderings of back-projection loops, the published methods can be classified into three categories.
A. Single Projection Method FOR each projection FOR each (slab) slice along the depth dim Accumulate projected values in the current slice Write results to current slice END END
The idea behind this scheme is to obtain a maximum reuse of the SMPs' L2 cache holding the projection. The locality of the texture reading is increased since the amount of overspreading of the reading location is restrained. Then for each slice there is a cache-miss at the beginning but not likely thereafter. The disadvantage of this method is memory-writing overhead. N projections will require reading and writing the computed results N times. Also, when we perform incremental writing on the global memory, the L2 cache will also cache the output slices. This undesirable cache behavior will reduce the effective capacity of L2 cache. Keck et. al. [3] used this method in CUDA-based SART. This method is better when cache-miss penalty is larger than memory writing overhead.
B. Multiple Projection Method FOR each (slab) slice FOR each projection Accumulate projected values in the current slice END Write results to current slice END
Xu and Muller [1] used this method but their work was based on the classic graphics pipeline. This method does not have writing output overhead (N projection only need to write the computed result once). The cache performance will be the bottleneck of this approach. The downside of this method is that cache for the input projection will be depleted very fast.
Noël et. al. [6] also used this method in their tile-based decomposition CUDA kernel. This method is better when cache-miss pentaly is less than memory writing overhead. Okitsu et. al. [5] used this method in their tile-based decomposition kernel. This method requires the trade-off between better cache and less output. The hybrid ordering improves the cache hit rate. The downside of this method is that the innermost loop for J projections is usually unrolled to avoid conditional branches. But it will require more registers and will result in register spilling which will reduce concurrency in the GPU.
C. Hybrid Ordering
This method helps explore the balance between singleprojection and multiple-projections. The innermost loop number J needs to be small enough to not exceed the L2 cache size (Eq. 2). On the other hand, J needs to be large enough such that the number of slice updates in global memory is minimized.
V. CACHE-AWARE METHOD

A. 3D Reshuffling
As for projection angles, there can be two cases: 1. Back-Projection within [45,135), [225, 315) degrees. A 2D slice with area is projected into a region with area within 0, √2/2 √2/2 , the cache problem is not very severe. Problem only arise when projections are concentrated on 45+90k degrees, where k N.
2. Back-Projection around [135,225), [-45, 45 ) degree. A 2D slice with area is projected into a region with area within √2/2 √2/2 , . Then the SMP will be depleted of cache frequently.
We propose a method using a 3D transpose to improve the cache-hit rate. Note that rotating the volume by 90 degree will change case 2 into case 1, but writing the results directly into global memory will cause the coalescing problem. The reason can be explained for the 2D case since we essentially switch the row-major storage into column-major, which is not sequentially stored in memory any more but scattered into different memory segments. We use shared memory as a buffer to perform the matrix transpose inside each SMP before writing to the device memory, to preserve the coalescing pattern of the output. Also, we avoid shared memory bank conflicts by using a 33×4 byte pitch. This will facilitate 32 parallel memory-conflict free threads. Another copy of the volume need to be allocated in the device since the 3D transpose is not in-place.
The reshuffling from XYZ to ZYX is listed below: 1. Load a 2D XZ slices into shared memory 2. 2D transpose to ZX slices 3. Write to ZYX in device memory
VI. RESULTS AND DISCUSSION
The implementation was based on NVIDIA's CUDA 3.2 and on the GTX 480. All CUDA Kernels had the same configuration: 32×8 threads per block. The inputs were 364 1024×768 projections on a half circal trajectory. We reconstructed a 3D volume within the FOV with two different resolution, 256 3 and 512 3 . We tested the different running times for the Single projection method (S), Multiple projection method (M) and Hybrid ordering method (H). We then added the comparison to our optimized 3D transpose methods (T). As shown in Table I , our transpose method had better performance regardless of the back-projection loop-ordering. The 3D Transpose method used 16x16 CUDA blocks. Although it wasted some bandwidth, as opposed to the 32x32 CUDA block, since the memory accesses were not fully aligned, the 16x16 configuration was experimentally faster than 32x32 and it achieves ocupancy 1. Our 3D transpose method took 2ms for a 256 volume and 20ms for a 512 3 Volume. The final back-projection performance is shown in Table II.  For a 256 3 volume, our remapping method with sharedmemory-enabled transpose achieved a total speedup of 1.48. For a 512 3 volume, our remapping method achieved a total speedup of 1.18.
Besides NVIDIA, other major GPU manufactures are ATI and Intel. In this paper, we focused on NVIDIA's GPUs to evaluate different acceleration techniques. Although the specification of GPUs may change, the concept and algorithm introduced in the paper can generally port to a different GPUs by using a cross-vendor programming API -OPENCL.
VII. CONCLUSIONS
We present a volume reshuffling scheme optimizing the locality of the memory reading pattern in GPU-based backprojection. Our scheme can yield better cache hit rate and can be added to variety of existing methods with different backprojection ordering [2] [3] [4] , [6] , except for those implementations that use a square-tile in block-level mapping [5] . The results show our method is particularly useful for
