Purpose: Interventional reconstruction of 3-D volumetric data from C-arm CT projections is a computationally demanding task. Hardware acceleration is not optional but mandatory for interventional image processing and, in particular, for image reconstruction, due to the high performance demands. Several groups have published fast analytical 3-D reconstruction on highly parallel hardware such as GPUs to mitigate this issue. The authors show that the performance of modern CPU-based systems is of the same order as that of current GPUs for static 3-D reconstruction and outperforms them for a recent motion compensated (3-D + time) image reconstruction algorithm. Methods: This work investigates two algorithms: Static 3-D reconstruction and a recent motion compensated algorithm. The evaluation was performed using a standardized reconstruction benchmark, RABBITCT, to obtain comparable results, as well as two additional clinical data sets.
I. INTRODUCTION
Recent C-arm systems with flat panel detectors allow the rotational acquisition of high-resolution projection images for 3-D volume reconstruction. 1 The additional intraprocedural 3-D information enables advanced techniques for diagnosis (e.g., perfusion imaging), planning, navigation, treatment, and follow-up. Recently, the reconstruction of moving structures has also become a matter of research. Time-resolved image data can support complex interventional procedures in cardiology, such as transcutaneous valve replacement, implantation of biventricular pacemakers, and the assessment of myocardial perfusion. 2 For seamless integration into clinical workflows, it is highly desirable to introduce as little time penalty as possible. Therefore, the 3-D reconstruction has to be performed concurrently with the acquisition of the projection images (on-the-fly), finishing just after the acquisition. Table I lists common clinical protocols that show the targeted time range. In order to fulfill these real-time requirements, hardware acceleration is mandatory. Of course, the real-time requirement applies to the complete image processing chain, from image acquisition through preprocessing to 3-D reconstruction. However, since the reconstruction step is the major bottleneck, we focus on that part in this work.
Hardware acceleration has recently been a topic of high importance in all areas of medical image computing, e.g., registration, 3 segmentation, 4 and reconstruction. 5 The choice of the most suitable hardware platform is domain-specific and depends on the amount of data, the computational load, and the parallelism in the algorithm. Standard reconstruction algorithms are embarrassingly parallel and hence can exploit the high level of parallelism provided by modern hardware, such as the vector processing units of current CPUs, the eight processing elements of IBM's CELL processor, 6 or the many shader cores of modern graphics processing units (GPUs). For 3-D reconstruction, recent publications have demonstrated that GPUs have a considerable speed advantage over CPUs for standard algorithms. 7 However, algorithmic innovations in the field of motion compensated (3-D + time) image reconstruction 8-10 may lead to a shift back to CPUs in the future. The reduced degree of parallelism in these novel methods results from various issues, e.g., irregular memory access patterns or data dependencies.
The main contribution of this paper is twofold. First, we show that the gap between GPUs and CPUs is narrowing for static 3-D reconstruction due to CPU evolution. Second, we demonstrate that novel, more complex algorithms can benefit greatly from modern CPU architectures with large caches.
II. METHODS AND MATERIALS

II.A. Medical image computing architectures: CPU vs GPU
Computing performance is affected by several factors. Hardware performance depends strongly on the number of computing elements, their clock speed, and the memory hierarchy. The performance of a specific algorithm, on the other hand, depends, e.g., on the ratio of read to write operations or the ratio of memory accesses to compute operations. For comparing high-end CPUs and GPUs, we examine two current GPUs from NVIDIA® and three different Intel® Xeon® servers (cf. Table II). In the following, we give a brief overview of both architectures and the implications for algorithmic performance. CPUs have always been general purpose processors able to execute arbitrary computing tasks. Over time, parallel execution units were added. Current CPUs are able to perform a single instruction on four data elements simultaneously [single instruction, multiple data (SIMD)] with their vector units and feature up to eight cores on a single chip; with four chips in one system, this makes 32 cores in a single unit. CPUs feature a cache hierarchy of three levels of increasing size and access to the huge main memory. Cache sizes reach up to 16 MB on a single chip, and main memory can comprise hundreds of GB.
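To illustrate the four-wide SIMD processing mentioned above, the following minimal C++ sketch uses SSE intrinsics to multiply four packed floats with a single instruction (the function and its use are purely illustrative and not taken from the benchmarked implementation):

    #include <xmmintrin.h>

    // Scale four consecutive floats by s using one packed multiply instruction.
    void scale4(float* dst, const float* src, float s)
    {
        __m128 vs = _mm_set1_ps(s);            // broadcast the scalar to all 4 lanes
        __m128 v  = _mm_loadu_ps(src);         // load 4 floats (unaligned)
        _mm_storeu_ps(dst, _mm_mul_ps(v, vs)); // 4 multiplications in one instruction
    }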
GPUs were developed for one specific task: Rasterization. Only recently have they become usable for generic computations, thanks to their programmable shaders. Current GPUs feature up to 448 stream processors, corresponding to the individual lanes of traditional SIMD units. GPUs provide shared memory and read-only caches for texture memory, and each card is equipped with its own memory, introducing an extra level in the memory hierarchy. The GPU's memory has a very high bandwidth but is smaller than main memory (up to 6 GB). Medical image processing algorithms consist of several common patterns which benefit differently from each platform. Data interpolation is a part of most algorithms, e.g., in registration, reconstruction, and segmentation. GPUs, with their dedicated texture sampling hardware, can greatly accelerate this operation. A prominent example is ray-casting with trilinear interpolation. 11 Typical medical data sets comprise a huge amount of data. The high memory bandwidth of GPUs is beneficial if the device memory is large enough to hold the data. However, the data have to be transferred from the main memory to the device and back. CPUs benefit from their huge caches, allowing fast read and write access to data from main memory.
II.B. Impact of CPU evolution on static 3-D reconstruction
Today's reconstruction implementations are usually based on the FDK method. 12 We focused on the backprojection and measured the performance on already preprocessed data sets. This preprocessing includes physical corrections (beam hardening and scatter), redundancy weighting, and filtering according to Strobel et al. 1 As reported in the literature, 13 preprocessing makes up only about 10% of the total runtime. Algorithm 1 shows the basic steps of the backprojection of one projection image. For each projection image, all voxels x = (x, y, z) are updated with the corresponding detector value, which is determined by perspective projection of the voxel onto the detector. The FDK method is embarrassingly parallel and hence can exploit the high level of parallelism provided by modern hardware. On GPUs, it benefits from hardware-supported interpolation of projection images and the high memory bandwidth. However, several aspects of CPU development have raised CPU performance to a similar level. The number of cores has grown to 32 in one system, which, combined with four-wide SIMD, corresponds to 128 GPU-style processing lanes. Current CPUs also feature an increased memory bandwidth due to their integrated memory controllers and faster RAM technology.
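For concreteness, the following C++ sketch outlines the voxel update of Algorithm 1 for a single projection image; the data layout, matrix storage, and distance weighting are simplifying assumptions, not the benchmarked code:

    #include <cstddef>
    #include <vector>

    struct Projection {
        std::vector<float> data;   // filtered projection image, row-major
        int width, height;
        float P[3][4];             // 3x4 perspective projection matrix of this view
    };

    // vol: N^3 volume; voxel (i, j, k) maps to world coordinates via spacing/origin.
    void backproject(std::vector<float>& vol, int N, float spacing, float origin,
                     const Projection& proj)
    {
        for (int k = 0; k < N; ++k) {
            float z = origin + k * spacing;
            for (int j = 0; j < N; ++j) {
                float y = origin + j * spacing;
                for (int i = 0; i < N; ++i) {
                    float x = origin + i * spacing;
                    // Perspective projection of the voxel onto the detector.
                    const auto& P = proj.P;
                    float u = P[0][0]*x + P[0][1]*y + P[0][2]*z + P[0][3];
                    float v = P[1][0]*x + P[1][1]*y + P[1][2]*z + P[1][3];
                    float w = P[2][0]*x + P[2][1]*y + P[2][2]*z + P[2][3];
                    float du = u / w, dv = v / w;
                    int ui = (int)du, vi = (int)dv;
                    if (ui < 0 || vi < 0 || ui + 1 >= proj.width || vi + 1 >= proj.height)
                        continue;      // voxel projects outside the detector
                    // Bilinear interpolation of the detector value.
                    float fu = du - ui, fv = dv - vi;
                    const float* d = proj.data.data();
                    int s = proj.width;
                    float val = (1-fu)*(1-fv)*d[vi*s + ui]   + fu*(1-fv)*d[vi*s + ui+1]
                              + (1-fu)*fv  *d[(vi+1)*s + ui] + fu*fv    *d[(vi+1)*s + ui+1];
                    // Distance weighting and accumulation into the voxel.
                    vol[(std::size_t)k*N*N + (std::size_t)j*N + i] += val / (w * w);
                }
            }
        }
    }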
We applied various optimization techniques to the FDK algorithm to exploit the parallel processing capabilities of current CPUs and GPUs. They are detailed below.
II.B.1. CPU-optimization techniques
The backprojection problem is embarrassingly parallel since there are no data dependencies between the voxels. For each projection image, all voxels can be updated independently of all others. Only concurrent updates of the same voxel from different viewing directions have to be avoided, since they could cause incorrect results.
II.B.1.a. Multithreading and vectorization. Multithreading was implemented using Intel's Threading Building Blocks library by partitioning the reconstructed volume into many subvolumes that were processed by individual computing threads. Critical code parts, i.e., the inner loop, were manually vectorized using SIMD intrinsics.
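A minimal sketch of this partitioning scheme with TBB might look as follows, reusing the Projection type from the sketch above; update_voxel stands in for the vectorized inner loop and is hypothetical, and the grain sizes are illustrative:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range3d.h>

    // Partition the volume into subvolumes; TBB assigns them to worker threads.
    void backproject_parallel(float* vol, int N, const Projection& proj)
    {
        tbb::parallel_for(
            // grain sizes (4, 64, 128) yield subvolumes of 128 x 64 x 4 voxels
            tbb::blocked_range3d<int>(0, N, 4, 0, N, 64, 0, N, 128),
            [&](const tbb::blocked_range3d<int>& r) {
                for (int k = r.pages().begin(); k != r.pages().end(); ++k)
                    for (int j = r.rows().begin(); j != r.rows().end(); ++j)
                        for (int i = r.cols().begin(); i != r.cols().end(); ++i)
                            update_voxel(vol, N, i, j, k, proj); // SIMD inner kernel
            });
    }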
II.B.1.b. Loop transformations. Loop transformations (loop unrolling, loop fusion) can reduce the number of branches in the code and enable the compiler to schedule instructions better, such that latencies are hidden and overall instruction throughput is improved.
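As an illustration, a 4-way manually unrolled inner loop could look like this (purely schematic; update_voxel as in the previous sketch):

    // Process the innermost voxel loop in groups of four independent updates:
    // fewer branch instructions and more freedom for the compiler to hide latencies.
    void row_update_unrolled(float* vol, int N, int j, int k, const Projection& proj)
    {
        int i = 0;
        for (; i + 4 <= N; i += 4) {
            update_voxel(vol, N, i,     j, k, proj);
            update_voxel(vol, N, i + 1, j, k, proj);
            update_voxel(vol, N, i + 2, j, k, proj);
            update_voxel(vol, N, i + 3, j, k, proj);
        }
        for (; i < N; ++i)   // remainder loop for N not divisible by 4
            update_voxel(vol, N, i, j, k, proj);
    }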
II.B.1.c. Temporal blocking. Temporal blocking is a technique to improve the use of the fast caches. Voxels are updated several times before being stored back to main memory. Consequently, the bandwidth requirements of the algorithm are reduced. Since backprojection is bandwidth-limited on most platforms, temporal blocking can significantly reduce the overall execution time.
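The following sketch conveys the idea; for_each_subvolume, Subvolume, and update_subvolume are hypothetical helpers, and the block size is a tuning parameter:

    #include <algorithm>
    #include <vector>

    // Apply a block of B projections to each cache-resident subvolume before
    // moving on, so each subvolume is read from and written to main memory
    // only once per block instead of once per projection.
    void backproject_blocked(float* vol, int N, const std::vector<Projection>& projs)
    {
        const std::size_t B = 8;   // projections per temporal block
        for (std::size_t p0 = 0; p0 < projs.size(); p0 += B) {
            std::size_t p1 = std::min(p0 + B, projs.size());
            for_each_subvolume(vol, N, [&](Subvolume sv) {
                for (std::size_t p = p0; p < p1; ++p)
                    update_subvolume(sv, projs[p]);   // subvolume stays in cache
            });
        }
    }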
II.B.1.d. Early termination. As mentioned before, subvolumes were used for multithreading. For each subvolume, the computation is aborted early if the subvolume is not in the field-of-view (FOV) of the current projection image. To check this, the eight corner voxels are projected onto the detector plane. Then, the area of intersection between this "shadow" and the detector is calculated.
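A simple bounding-box variant of this test could be sketched as follows, using the Projection type from above (the actual implementation computes the exact intersection area):

    #include <algorithm>

    // Project the eight corners of a subvolume onto the detector and test
    // whether the bounding box of the resulting "shadow" overlaps the detector.
    bool intersects_fov(const float corners[8][3], const Projection& proj)
    {
        float umin = 1e30f, umax = -1e30f, vmin = 1e30f, vmax = -1e30f;
        for (int c = 0; c < 8; ++c) {
            float x = corners[c][0], y = corners[c][1], z = corners[c][2];
            const auto& P = proj.P;
            float w = P[2][0]*x + P[2][1]*y + P[2][2]*z + P[2][3];
            float u = (P[0][0]*x + P[0][1]*y + P[0][2]*z + P[0][3]) / w;
            float v = (P[1][0]*x + P[1][1]*y + P[1][2]*z + P[1][3]) / w;
            umin = std::min(umin, u); umax = std::max(umax, u);
            vmin = std::min(vmin, v); vmax = std::max(vmax, v);
        }
        return umax >= 0.0f && vmax >= 0.0f &&
               umin < (float)proj.width && vmin < (float)proj.height;
    }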
II.B.2. GPU-optimization techniques
Some of the CPU-optimization strategies (see Sec. II B 1) were also applied on the GPU. GPUs are able to switch threads quickly and thereby hide memory latency. Therefore, the volume was partitioned into many chunks which are processed in parallel. The parallelization and optimization strategies were adapted from Scherl et al. 14 First of all, the hardware texture interpolation units were utilized for projection access. Further, memory accesses must be coalesced on GPUs to achieve good throughput. Therefore, the problem was reorganized as a 2-D problem, where each kernel thread loops over all voxels in the y-direction. Further details can be found in the original paper. 14
II.C. Motion estimation for 3-D + time reconstruction
In a recent MICCAI contribution, 9 a novel algorithm was presented that is able to estimate 4-D nonperiodic motion patterns using a time-continuous cubic B-spline motion model. The most time-consuming part of this method is the motion estimation, as illustrated in Algorithm 2. Here, x = (x, y, z) is the coordinate on the control grid, x′ denotes the transformed position, and t refers to the time. In each iteration of the optimization procedure, the derivative of the FDK reconstruction formula with respect to the motion model parameters has to be computed. This computation is analogous to an FDK reconstruction. However, instead of accumulating voxel values in the innermost loop, the derivative of the B-spline tensor product is evaluated in four dimensions. Cubic B-splines depend on four sample points in each dimension. In total, this results in 256 (4⁴) times more data to be stored. For the application of interest (cardiac vasculature reconstruction), only a sparse set of voxels has to be considered, ranging from 0.1% to 2%. However, for future applications and other algorithms, the number of voxels can be as high as 100%. A GPU implementation requires sparse sampling if high-resolution volumes are to be reconstructed. In general, GPUs have less memory than CPUs, and therefore CPUs will be able to handle larger volumes. This can be nicely illustrated by the state-of-the-art Tesla™ C2070 GPU, which is equipped with 6 GB onboard memory. For the presented problem, it allows only the derivative of a 184³ dense volume to be completely stored in the device memory (4 bytes × 256 × 184³ = 5.94 GB).
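To make the 4⁴ factor concrete, the following sketch evaluates the 256 tensor-product weights of a uniform cubic B-spline for one (x, y, z, t) sample; the basis functions are the standard uniform cubic B-spline blending functions, and the accumulation target is illustrative:

    // Local basis weight i (0..3) of the uniform cubic B-spline at fraction t in [0,1).
    float bspline3(float t, int i)
    {
        switch (i) {
            case 0:  return (1 - t) * (1 - t) * (1 - t) / 6.0f;
            case 1:  return (3*t*t*t - 6*t*t + 4) / 6.0f;
            case 2:  return (-3*t*t*t + 3*t*t + 3*t + 1) / 6.0f;
            default: return t * t * t / 6.0f;
        }
    }

    // Accumulate the 4*4*4*4 = 256 tensor-product weights for one sample,
    // given the fractional positions (fx, fy, fz, ft) within the control grid.
    void accumulate_weights(float fx, float fy, float fz, float ft, float grad[256])
    {
        int n = 0;
        for (int a = 0; a < 4; ++a)
            for (int b = 0; b < 4; ++b)
                for (int c = 0; c < 4; ++c)
                    for (int d = 0; d < 4; ++d)
                        grad[n++] += bspline3(fx, a) * bspline3(fy, b)
                                   * bspline3(fz, c) * bspline3(ft, d);
    }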
II.C.1. CPU-optimization techniques
Optimization was straightforward, analogous to the static case (cf. Sec. II B). The inner loop now contains the computation of the derivative for each voxel with respect to the B-spline model parameters. On the CPU, multithreading was implemented using OpenMP. The sparse set of voxels was distributed equally among all computing threads. The inner loops within the derivative computation were unrolled.
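A minimal sketch of this parallelization (compiled with OpenMP support; the voxel list and derivative kernel are hypothetical placeholders, and a per-thread gradient reduction is omitted for brevity):

    #include <vector>

    struct SparseVoxel { int i, j, k; };

    // Distribute the sparse voxel set equally among threads; each thread
    // evaluates the (unrolled) B-spline derivative for its voxels.
    void motion_derivative(const std::vector<SparseVoxel>& voxels,
                           const Projection& proj, float* gradient)
    {
        #pragma omp parallel for schedule(static)
        for (long n = 0; n < (long)voxels.size(); ++n)
            derivative_for_voxel(voxels[n], proj, gradient); // hypothetical kernel
    }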
II.C.2. GPU-optimization techniques
The volume and the current projection are stored in the GPU device memory to avoid frequent data transfers over the PCIe bus. A sparse representation of the volume was used to reduce the memory requirements. The backprojection part was optimized analogously to the static case (cf. Sec. II B), e.g., by use of the texture units. In contrast to the static case, coalesced memory access cannot be achieved, since the offset of each memory access depends on the motion.
III. EXPERIMENTAL SETUP
III.A. Computer systems
All methods were benchmarked on three CPU-based many-core systems from Intel and on two NVIDIA® Tesla™ GPUs. The main specifications of these systems are summarized in Table II. The first CPU system comprises two quad-core Xeon® 5550 processors (eight cores total). The second one features four hexa-core Xeon® X7460 processors (24 cores). The third one has four eight-core Xeon® processors (32 cores). Hyperthreading was enabled on the eight-core and 32-core systems; hence, they exposed twice as many virtual processors to the operating system. In addition to the CPU-based systems, we ran the tests on two workstations, each equipped with an NVIDIA® Tesla™ accelerator card (C1060 and C2050, respectively) with CUDA 3.1.
All tests on CPUs were performed using 64-bit Linux and the Intel compiler, version 11.1. All CPU-based systems used in this report are preproduction systems; production hardware is expected to deliver similar performance levels.
III.B. Static 3-D reconstruction
The evaluation of our implementation of static 3-D reconstruction was performed using the open reconstruction benchmark RABBITCT. 15 It comprises a public data set and a test framework that allow for the presentation of comparable results. The task of RABBITCT is to implement a standard FDK algorithm and reconstruct volumes of 256³, 512³, or 1024³ voxels from a standardized data set. In addition to performance measurements, it allows easy verification of the implementation's correctness.
In this work, we reconstructed a volume of 512³ voxels. Three different data sets were used for evaluation and verification of the optimized static 3-D reconstruction (cf. Table III). Data set (A) is the publicly available RABBITCT data set, and data sets (B) and (C) are clinical cases acquired with a Siemens C-arm system. Table III lists the number of projections and the dimensions of the projection images for all three data sets.
We evaluated three different optimization levels. The baseline was defined by our vectorized and multithreaded implementation 16 ("Base" in Table IV). "Cache" denotes a cache-optimized implementation with added temporal blocking. Finally, "E.term" is the most optimized version with temporal blocking and early termination. The subvolume size for Cache and E.term was empirically chosen to be 128 × 64 × 4, as this configuration performed best over all architectures and data sets.
III.C. Motion estimation for 3-D + time reconstruction
The experiment for this paper was modeled after a clinical protocol used in Ref. 9. The B-spline motion model was parametrized by 5³ control points in space and 35 in time, resulting in a total of 3 × 5³ × 35 = 13,125 degrees of freedom. We used 133 projection images with a size of 960 × 960 pixels. The reconstructed volume had a size of 256³ voxels. The sampling of the volume was set to 5%. The mean execution time for one iteration was measured by averaging 100 iterations. A typical reconstruction would require 50-100 iterations.
IV. RESULTS AND DISCUSSION
IV.A. Static 3-D reconstruction
Table IV shows the results for all four computer systems and for three different optimization methods. Depending on the system under investigation, the optimization techniques have varying impact. The effect of optimized cache usage due to temporal blocking correlates with the relative memory bandwidth available to each core. The eight-core Intel® Xeon® 5500 server has sufficient bandwidth, and thus the speed-up was only up to 1.06×. On the 32-core Intel® Xeon® server, we observed a speed-up of at least 1.14×. Finally, the 24-core X7460 system is severely bandwidth-starved, hence the good result of at least 1.35×. This observation demonstrates the impact of the higher memory bandwidth of the Nehalem microarchitecture. The proposed method of early termination further reduced the reconstruction time to about 72%-82% of the cache-optimized version on all three systems (1.22-1.39×). The effectiveness of the approach can be seen by looking at the numbers of a single data set, e.g., (A). With the chosen subvolume size of 128 × 64 × 4 voxels, about 24% of the subvolumes are outside of the FOV. The reconstruction time is 75%-78% of the Cache implementation on all systems. In the end, the newest GPU (Tesla™ C2050) was the fastest system, more than twice as fast as the best CPU system.
IV.B. Motion estimation for 3-D + time reconstruction
Figure 1 shows the results for the three CPU-based systems with varying numbers of threads. All three CPU systems performed better than the GPU. We observed an almost linear speed-up with the number of threads used. On Nehalem-based systems, the maximum speed-up was even higher than the physical number of cores due to hyperthreading. The 32-core Intel® Xeon® server outperformed the faster GPU (C1060) by a factor of 63.26 when using 64 threads. Note that the C2050 performed worse than the C1060 because the drivers and compiler optimizations for the Fermi architecture were still under development. We further observed side effects of pinning threads to specific CPU cores, which is usually done to increase performance. When using only one or two threads on the Xeon® 5500 system, pinning resulted in plummeting performance. We suppose this happens because the cores with threads pinned to them get too hot, disabling the built-in overclocking mechanism ("Turbo Mode").
The relatively poor performance of the GPU may be explained by the frequent memory write-access operations. Due to its architecture (cf. Sec. II A), the CPU benefits from its fast caches, which are not present on current GPUs. Further, the four nested inner loops of the derivative computation cause high register pressure on the GPU. Consequently, fewer threads could be executed simultaneously, reducing the ability to hide memory latency.
V. CONCLUSION AND OUTLOOK
In this paper, we have shown two things. First, we have shown that static 3-D reconstruction can be computed on-the-fly on a high-end multi-CPU server, although GPUs are still faster. Our study only investigated the major part of the reconstruction, the backprojection, which accounts for more than 90% of the computation time. Regarding the real-time requirements in a realistic scenario, additional preprocessing and data transfers have to be performed. As an example, we consider the most demanding protocol from Table I [Head (LC)], with an acquisition time of 20 s. Preprocessing roughly adds 1-2 s to the 18.82 s on the CPU or the 8.98 s on the C2050, respectively. Data transfer does not need to be considered, as it can be hidden by streaming projections from the system to the workstation in the background. Further, we demonstrated that a more complex algorithm, in our example a motion compensated 3-D + time reconstruction, can benefit greatly from the large caches of modern CPU architectures.
A Tesla™ C2050 GPU was 2.1 times faster for the static problem, while the 32-core Intel® Xeon® server was more than 60 times faster for the motion estimation part of the motion compensated algorithm. This nicely illustrates that, for complex algorithms, a careful combination of both acceleration platforms can lead to an optimal result. This point is also reflected in the future hardware development of two major vendors, Intel® and NVIDIA®, which indicates that CPUs and GPUs are becoming more alike. Our results suggest that to achieve maximum performance, an expensive high-end server is necessary. However, even an off-the-shelf quad-core workstation is able to outperform a Tesla™ C1060 GPU by more than a factor of four.
In summary, we have shown that the ever increasing complexity of medical image computing algorithms requires paying close attention to the hardware architecture in order to enable their integration into the clinical workflow.
ACKNOWLEDGMENTS
Thanks to the Regional Computing Center Erlangen (RRZE) for support and hardware access. Thanks to the staff and facilities of the Intel® Manycore Testing Lab (http://software.intel.com/en-us/articles/intel-many-core-testing-lab/). This work was funded by a research grant from Intel.
