Although there has been progress in applying GPU-technology to Computed-Tomography reconstruction algorithms, much of the work has concentrated on optimizing reconstruction performance for smaller, medical-scale datasets. Industrial CT datasets can vary widely in size and number of projections. With the new advancements in high resolution cameras, it is entirely possible that the Industrial CT community may soon need to pursue a 100-megapixel detector for CT applications. To reconstruct such a massive dataset, simply adding extra GPUs would not be an option as memory and storage bottlenecks would result in prolonged periods of GPU downtime, thus negating performance gains. Additionally, current reconstruction algorithms would not be sufficient due to the various bottlenecks in the processor hardware. Past work has shown that CT reconstruction is an irregular problem for large-scale datasets on a GPU due to the massively parallel environment. This work proposes a high-performance, multi-GPU, modularized approach to reconstruction where computation, memory transfers, and disk I/O are optimized to occur in parallel while accommodating the irregular nature of the computation kernel. Our approach utilizes a dynamic MIMD-type of architecture in a hybrid environment of CUDA and OpenMP. The modularized approach showed an improvement in load-balancing and performance such that a 1 trillion voxel volume was reconstructed from 10,000 100 megapixel projections in less than a day.
INTRODUCTION
Computed Tomography (CT) uses a series of X-ray images taken at various angles around an object to generate cross-sectional slices to approximate the object's interior and exterior structure. These slices are then combined to form a reconstructed, 3-dimensional volume of the imaged object.
1 While CT imaging is typically associated with the medical field, it is prevalent in industry and commonly used for internal component inspection and other nondestructive evaluation applications. While the reconstruction process is similar in both fields, the datasets used in industry are around 4, 000 3 volumetric pixels (voxels) in size compared to the 128 3 − 1, 000 3 voxel volumes in the medical field because medical imaging must limit the patient's radiation exposure time. The larger datasets require more processing power and computation time and can require multiple days to complete even on a multi-core, multi-threaded computer. Further, with improvements in Gigapixel camera architecture 2 and new advancements in high resolution cameras such as those by the European Space Agency 3 and DARPA, 4 industrial dataset projections may soon be 100-megapixels (10,000 × 10,000) in size, and, with our current reconstruction techniques, would not be possible to reconstruct in a reasonable amount of time.
Many efforts have been attempted to bridge the gap between the processing times of medical and industrial datasets by utilizing the massively parallel architecture of the Graphics Processor Unit (GPU). 5 GPUs can provide an orders-of-magnitude performance gain on highly parallel problems such as CT reconstruction where the same mathematical operations are done on multiple voxels in parallel. 6 However, even with multiple GPUs, industrial reconstruction can still take hours to complete, and with larger datasets in the foreseeable future, the industrial imaging community needs greater processing power and capabilities to handle such datasets. The naïve solution of adding more GPUs would give diminishing returns on runtime improvement. This is due to hardware limitations in bandwidth and memory on both the host and GPU device, adding more GPUs will create a bandwidth bottleneck and result in prolonged periods of GPU idle time.
One approach to improve performance on industrial size datasets, described in Ref. 7 , implements an irregular algorithm that exploits the GPU's unique cache structure and executes small X-ray data prefetches from the host to the device(s). While this approach, when using 8 GPUs on a 64 Gigavoxel dataset, roughly tripled voxel throughput compared to a more direct port of the CPU-based CT reconstruction to GPU kernels, it still experienced bottlenecks while transferring reconstructed slices from GPU memory onto storage media, which caused suboptimal GPU utilization since GPUs were idling while waiting for a memory transfer and storage media write to complete. This paper will present a further improvement to the previously mentioned irregular algorithm that takes a modularized, MIMD approach to ameliorate the disk transfer bottleneck and maximize CPU and GPU resource utilization. It will then use the large-scale (1 Teravoxel), synthetic dataset from Ref. 7 to evaluate the algorithm's performance and the GPU scalability in order to predict this method's applicability to current and future industrial datasets.
APPROACH
As mentioned in 1, the approach from Ref. 7 suffered from prolonged GPU idle time when performing storage operations. The bottleneck was a consequence of stalling GPU computation while writing the reconstructed image planes to storage media. Although this freed memory for the next image plane sub-volume, it severely underutilized GPU resources. The approach outlined in this paper addresses the inefficiency by storing the image planes in host memory and writing them to storage media in parallel once a GPU has reconstructed its given sub-volume. After a GPU has transferred its image planes to the host, the next sub-volume reconstruction is immediately executed, leading to drastically improved utilization of the GPU processor.
More specifically, dedicated threads are dynamically allocated on the host machine to exclusively handle either prefetching the X-ray images from storage media, controlling the GPUs, or transferring the images planes to storage media. Once the X-ray images are in host memory, the GPUs reconstruct sub-volume blocks by iteratively requesting X-ray images from host memory, computing the image planes, and transferring the reconstructed image planes to host memory. While the GPU controlling threads reconstruct sub-volumes, the data transfer threads write the image planes stored in host memory to storage media.
To accommodate environments with limited memory, the total volume is partitioned so that for each section, an appropriate number of X-ray images are prefetched so that all the X-ray images and reconstructed image planes for that section can fit into host memory. Due to the varying number of X-ray images needed to reconstruct a single slice throughout the volume, the number of X-ray images must be dynamically assigned during reconstruction. Additionally, the GPUs treat each volume block as a total volume and iteratively reconstruct sub-volumes of that block. Once all the reconstructed image planes for a block are written to storage media, the host memory is freed and the next batch of X-ray images are read into host memory for the next volume block. By partitioning the volume to accommodate a variety of systems, the process is not limited to systems with a massive amount of free host memory. Note that for industrial applications, if the total volume was entirely stored on host memory, the memory requirements would be massive, on the order of 100 Gigabytes to several Terabytes.
IMPLEMENTATION
At a high level, this algorithm runs in two main loops, an outer loop for partitioning the entire volume into sub-volumes and an inner loop for iterating though each sub-volume and executing the GPU kernels. The size of each sub-volume is dynamically determined by maximizing the number of X-ray images that can be prefetched while maintaining that all relevant X-ray images and reconstructed image planes fit in host memory.
For a given sub-volume, once the X-ray images have been read into host memory from storage, host threads are allocated to be either GPU controlling or write threads, which allows for GPU execution and image plane transfer from host memory to storage media to occur simultaneously compared to the processes being disjoint in the method described in Ref. 7 . Figure 1 and Figure 2 give a visual representation of how the parallelization affects execution. Additionally, as first described in Ref. 7 , GPU textures are fully exploited by using small sets of X-ray images to be prefetched from host memory to device memory. The inner loop finishes once the sub-volume has been successfully written to storage by the CPU threads, and the outer loop finishes once the entire volume has been reconstructed (see Algorithm 1 for a high-level description).
The key irregular features of this approach are its utilization of GPU textures, dynamic partitioning of X-ray images to fit into available host memory, and CPU thread allocation to parallelize computation and image plane transfer to storage media.
GPU Textures
This feature follows the non-modular implementation from Ref. 7 . The fast texture cache hit rate of the GPUs is exploited by dynamically fitting the number of X-ray images to transfer to the GPU to a user inputed slice-totexture ratio. Setting the ratio low (between 0 and 1) implies most of the GPU memory is reserved for storing finished image planes and the small amount left for X-ray images fits into the texture cache. Although the GPU needs to transfer more X-ray images to and from its texture cache to reconstruct all the image planes, its runtime improves because of a high texture cache hit rate along with an improved L1, L2, and constant cache hit rate. Overall, performance improves by going against standard recommendations to maximize data transfer from the host to the GPU. 
Data Partitioning
In order to accommodate not only workstations with different amounts of system memory but also the varying number of voxels to reconstruct and differing acquisition geometries, the number of image planes to reconstruct in the inner loop must be dynamically determined. The challenge in this determination is in the irregular number of X-ray images needed to reconstruct a given image plane throughout the volume. In order to maximize performance, a constant number of image planes per iteration cannot be used.
At the beginning of each outer loop iteration, the number of image planes to reconstruct is determined by calculating the host memory required to reconstruct all remaining image planes (the memory must hold the image planes and necessary X-ray images). If the host memory required is greater than 90 percent of the free memory on the system, the number of image planes to reconstruct is decremented by one, and the required memory is recalculated. This process is repeated until the required host memory is less than or equal to 90 percent of the free host memory. The 90 percent value is used versus 100 percent to account for any inaccuracies in the calculations and to allow space for OS-specific requirements. Although this linear search for the maximum number of image planes is not of optimal efficiency, the extra time to run the search is negligible when compared to the overall performance of the algorithm.
Once the maximum number of image planes to reconstruct is calculated, CPU threads are initialized to prefetch the required X-ray images into host memory. Once all X-ray images are stored in CPU memory, the inner loop begins the reconstruction on that section of image planes. This process repeats until the entire volume is reconstructed.
CPU Thread Allocation
As detailed above, once the X-ray images necessary for reconstructing a sub-volume are in system memory, the inner loop begins reconstruction. However, before the inner loop starts, CPU threads are assigned to be either GPU controlling threads or image plane transfer threads. Since the number of GPUs available on the system is known a priori, one CPU thread is assigned to each GPU.
* Then, the remaining number of CPU threads on the * If there are not as many CPU threads as GPUs on the system, one CPU thread is still assigned to each GPU and the operating system's thread scheduler is expected to accommodate the excess CPU threads system (total number of CPU threads minus number of GPUs) are assigned to write the reconstructed image planes from host memory to storage. † The CPU threads assigned to each GPU communicate to the CPU write threads that an image plane is ready to be written to disk through an integer array the size of the number of image planes being reconstructed in a given inner loop.
‡ Once a GPU is finished reconstructing, the CPU thread allocated to it sets the associated cells in the communication array to 1. Each CPU write thread continually loops through a disjoint subset of the array's cells to avoid memory collisions and the same image plane being written to storage twice, checking if any cell is 1. If a cell is 1, then that CPU thread sets that cell back to 0 and writes the associated image plane to storage media. Once all the image planes from a sub-volume are transfered to storage, the inner loop ends, host memory is freed, and the outer loop repeats. This scheme of dynamic task partitioning allows for modularity between the GPU reconstruction and image plane writing. Further, the number of CPU threads used is determined by the limitations of the system, meaning a wider variety of workstations can run this algorithm.
Algorithm 1
numW riteT hreads = totalN umT hreads − numGP U s worldSliceDone = 0 while worldSliceDone ≤ worldT otalSlices do determine totalSlices so X-ray images and output fit in memory slicesDone = 0 while sliceDone < totalSlices do read in X-ray images once per outer loop spawn threads with ids from 0 to totalN umT hreads − 1 if thread's id < numGP U s then thread controlling a GPU assign thread to GPU run GPU compute kernel transfer slices to CPU update write array else numW rittenSlices = 0 while numW rittenSlices < totalSlices do for cell in subset of write array do if cell is 1 then set cell to 0 write that image plane from CPU memory to storage end if end for end while end if end while synchronize CPU threads worldSliceDone += slicesDone end while 4. ANALYSIS
Evaluation
The experiments were performed on a workstation running Windows Server 2008 consisting of 2 Intel(R) Xeon(R) E5-2687W octo-core processors clocked at 3.1 GHz with hyper-threading for a total of 16 CPU cores, 32 threads, † If there are more GPUs than CPU threads, one CPU thread is assigned to be the transfer thread ‡ Determined as described in 3.2 512 GB RAM, and 2 Nvidia S2090 devices connected via 4 PCI-E 2.0 x 16 host interface cards. Each S2090 contains 4 Tesla M2090 GPUs where each GPU has 6 GB of GDDR5 memory and 16 multiprocessors where each multiprocessor has 32 CUDA cores. The workstation has an eight drive Raid 0 array managed by an Intel RAID Controller (RS25AB080) with 1 GB of DDR3 Cache. The drives in the array are 3TB Hitachi 7.2K RPM SATA 6Gb/s Drives.
The CPU-based time collected measured total reconstruction time, which included all memory transfers, GPU kernel launches, and disk I/O operations. The time was measured for the non-modular GPU version (version A), the dynamic memory allocation and parallelized disk I/O GPU version (version B), a OpenMP parallel CPU-based version (version C), and a single thread CPU-based version (version D). All were run using a 10,000 x 10,000 x 10,000 voxel (1 Teravoxel) volume reconstructed from 10,000 100-megapixel x-ray projection images. Versions A and B were run using 1, 2, 4, 6, and 8 GPUs. The measurements for the 1 and 2 GPU runs were taken on the first 200 and the middle 200 image planes and then averaged. The other GPU runs' measurements were taken on all 10,000 image planes. For version C and D, due to the extremely long run time per single image plane (about 0.4 hours for C and.0 4.5 hours for D), the measurements were only taken on image planes 300 to 303, averaged, and then extrapolated for the full 10,000 image planes.
Lastly, the GPU kernels were compiled using CUDA version 5.0, and the CPU code was written in C++ using Visual Studio 2008's C++ compiler. Figure 3 illustrates voxel processing throughput of the modular GPU version (version B) compared to the nonmodular version (version A) of the Teravoxel reconstruction using 1, 2, 4, 6, and 8 GPUs. For a single GPU, version B's voxel throughput is not noticeably improved compared to A's (0.27 versus 0.24) because the host system is able to keep up with the memory transfers and storage I/O required, meaning the GPU does not experience significant idle time in either version. As the number of GPUs increases, the GPUs in version A are forced to wait longer to begin processing the next set of X-ray images because of the increased demand on limited storage I/O. Due to version B's storage I/O modularization and parallel GPU kernel launches, it better handles a higher number of GPUs and has improved scalability with respect to the number of GPUs. 
Results

CONCLUSION
For CT reconstructions in the industrial field, utilizing the massively parallel architecture of the GPU can dramatically improve performance. This orders-of-magnitude speedup is important for industrial applications since the processing times for their larger datasets (on the order of Gigavoxels) using a CPU-based algorithm can take days to complete. The algorithm described in Ref.
7 is one such GPU-based CT reconstruction which goes beyond the direct port of a CPU-based algorithm to GPU kernels and improves performance even more by taking advantage of the GPU-specific capabilities and hardware. (Gigavoxels in size), it will noticeably hurt performance on future-sized datasets (Teravoxels in size). These datasets are soon to be realized due to the advancements in high performances cameras, and the industrial community needs a CT reconstruction technique that can handle such massive amounts of data.
The implementation described in this paper modifies the one from Ref. 7 to dynamically modularize the memory transfers and parallelize the GPU computations and storage media writing. These improvements cause it to outperform the non-modular version on a future-sized dataset 1 Teravoxel in size.
This work has shown that transforming a CPU-based algorithm into one using GPUs involves not only rethinking the nature of the computations but also restructuring the memory and storage media usage. Future improvements to this implementation include expanding it to a cluster implementation where each node has access to at least one GPU. The cluster version will also need to dynamically determine how many image planes each node should reconstruct in a way that takes into account not only the memory limits of the node but also the number and type of GPU. This and future work will help prepare the industrial community for the technological advancements to come in CT reconstruction.
ACKNOWLEDGMENTS
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DEAC04-94AL85000.
