Abstract-Iterative image reconstruction can dramatically improve the image quality in X-ray computed tomography (CT), but the computation involves iterative steps of 3D forward-and back-projection, which impedes routine clinical use. To accelerate forward-projection, we analyze the CT geometry to identify the intrinsic parallelism and data access sequence for a highly parallel hardware architecture. To improve the efficiency of this architecture, we propose a water-filling buffer to remove pipeline stalls, and an out-of-order sectored processing to reduce the off-chip memory access by up to three orders of magnitude. We make a floating-point to fixed-point conversion based on numerical simulations and demonstrate comparable image quality at a much lower implementation cost. As a proof of concept, a 5-stage fully pipelined, 55-way parallel separable-footprint forward-projector is prototyped on a Xilinx Virtex-5 FPGA for a throughput of 925.8 million voxel projections/s at 200 MHz clock frequency, 4.6 times higher than an optimized 16-threaded program running on an 8-core 2.8-GHz CPU. A similar architecture can be applied to back-projection for a complete iterative image reconstruction system. The proposed algorithm and architecture can also be applied to hardware platforms such as graphics processing unit and digital signal processor to achieve significant accelerations.
Forward-Projection Architecture for Fast Iterative
Image Reconstruction in X-Ray CT Abstract-Iterative image reconstruction can dramatically improve the image quality in X-ray computed tomography (CT), but the computation involves iterative steps of 3D forward-and back-projection, which impedes routine clinical use. To accelerate forward-projection, we analyze the CT geometry to identify the intrinsic parallelism and data access sequence for a highly parallel hardware architecture. To improve the efficiency of this architecture, we propose a water-filling buffer to remove pipeline stalls, and an out-of-order sectored processing to reduce the off-chip memory access by up to three orders of magnitude. We make a floating-point to fixed-point conversion based on numerical simulations and demonstrate comparable image quality at a much lower implementation cost. As a proof of concept, a 5-stage fully pipelined, 55-way parallel separable-footprint forward-projector is prototyped on a Xilinx Virtex-5 FPGA for a throughput of 925.8 million voxel projections/s at 200 MHz clock frequency, 4.6 times higher than an optimized 16-threaded program running on an 8-core 2.8-GHz CPU. A similar architecture can be applied to back-projection for a complete iterative image reconstruction system. The proposed algorithm and architecture can also be applied to hardware platforms such as graphics processing unit and digital signal processor to achieve significant accelerations.
Index Terms-Algorithm and architecture co-optimization, hardware acceleration, iterative image reconstruction, separable footprint projection, X-ray computed tomography.
I. INTRODUCTION

X
-RAY computed tomography (CT) is a widely used medical imaging method that produces three-dimensional (3D) images of the inside of a body from many two-dimensional (2D) X-ray images. A 2D X-ray image captures X-ray photons that pass through a body. As different materials attenuate X-ray differently, they can be effectively differentiated by their attenuation coefficients. Using many X-ray images taken around an axis of rotation, the attenuation coefficient of each volume element (voxel) can be reconstructed, providing high-resolution imaging for medical diagnosis.
In current clinical practice, a single CT scan using a state-ofthe-art helical CT scanner records up to several thousand X-ray images taken in multiple rotations as the patient's body is moved slowly through the scanner. The projections are captured on an array of detector cells and a dedicated computer is used for image construction. Efficient algorithms, such as filtered backprojection (FBP) [1] and its variants, are in common commercial use to handle large projection data sets and reconstruct images at sufficient throughput. However, being an analytical algorithm, FBP disregards the effects of noise. To improve the image quality and/or reduce X-ray dose, statistical image reconstruction methods have been proposed [2] , [3] . These methods are based on accurate projection models and measurement statistics, and formulated as a maximum likelihood (ML) estimation. Iterative algorithms such as conjugate gradient (CG) [4] , coordinate descent (CD) [5] and ordered subsets (OS) [6] , have been proposed. These algorithms find the minimizer of a cost function by iterative forward-and back-projection. Iterations increase the compute load substantially over FBP and impede routine clinical use.
Recently, a separable footprint (SF) projection algorithm was designed to simplify the forward-projection by approximating the voxel footprints as separable functions [7] . The SF projector has high accuracy and favorable speed, but it is still very computationally intensive: each forward-and back-projection requires on the order of 100 billion floating-point multiply-accumulate (MAC) operations, requiring minutes or longer for each forward-and back-projection on a state-of-the-art multicore microprocessor [8] .
High-performance computing platforms have been proposed to accelerate image reconstruction. For example, graphics processing unit (GPU) has recently been demonstrated to achieve 10 to 100 times speedup over a microprocessor for image reconstruction [9] , [10] . As a vector processor, GPU can be programmed for efficient parallel processing [11] . Provided with sufficient memory bandwidth, GPU accomplished a 30 times speedup of cone-beam Feldkamp (FDK) back-projection over a system based on 12 2.6-GHz dual-core Xeon processors [9] , and a 12 times speedup of algebraic reconstruction [10] . Field-programmable gate array (FPGA) is an another family of hardware platforms that enable more flexibility in mapping parallel computation with an improved efficiency. It was shown to accomplish a 6 times speedup of the cone-beam Feldkamp (FDK) back-projection [9] , [12] . However, existing GPU and FPGA implementations are tailored to analytical reconstruction algorithms or algebraic reconstruction methods [9] , [10] , [12] , [13] , and challenges still remain in mapping statistical iterative algorithms.
In this paper, we propose architecture and algorithm co-optimization for iterative image reconstruction. We show through numerical simulation that iterative image reconstruction algorithm can be robust to quantization noise. Even with a much shorter word length and coarse quantization, the resulting noise introduced to the reconstructed image is limited, causing no perceptual degradation in image quality. The results provide the basis of a fixed-point quantization that cuts the memory bandwidth and reduces the complexity of arithmetic operations, thus enabling more parallel implementations.
We propose a highly efficient hardware architecture based on a thorough geometry analysis that helps simplify complex control loops, eliminate data dependencies, and maximize temporal and spatial locality of reference. In particular, we present algorithm restructuring to take advantage of loop-level parallelism, water-filling buffer to minimize pipeline stalls, and out-of-order scheduling to compress off-chip memory bandwidth to enable more parallel architectures.
A prototype 55-way parallel SF forward-projector is demonstrated on a Xilinx Virtex-5 FPGA [14] as a proof of concept. The design is capable of completing 925.8 million voxel projections/s. The proposed architecture is also applicable to backprojection and motivates more efficient designs on alternative hardware platforms including GPU and digital signal processors (DSP). The numerical and geometrical insights can be employed in both software and hardware implementations of iterative image reconstruction to achieve significant accelerations.
II. BACKGROUND
Current generation CT systems have a cone-beam projection geometry, illustrated in Fig. 1 [3] , [15] - [17] . The X-ray source rotates on a circle centered at on the plane. The angle indexes the projection view measured from positive -axis to X-ray source. For each angle , the source emits X-rays that project the volume onto the detector. The transaxial direction is perpendicular to and the axial direction is parallel to .
A. Statistical Iterative Image Reconstruction
A CT system captures a large series of projections at different view angles, recorded as sinogram. Mathematically, sinogram can be modeled as , where represents the volume being imaged, is the system matrix, or the forward-projection model, and denotes measurement noise. The goal of image reconstruction is to estimate the 3D image from the measured sinogram . A statistical image reconstruction method performs the ML estimation of based on detector measurement statistics. The estimation can be formulated as a solution to a weighted least square (WLS) problem [3] , [18] .
(1) where is a diagonal matrix with entries based on photon measurement statistics [3] . A solution to (1) satisfies [18] . If is invertible, the unique solution to (1) is given by , where , the adjoint of the system matrix, represents the back-projection model. This solution can be interpreted as the weighted back-projection of , followed by a deconvolution filter . As the deconvolution filter has a high pass characteristic, the deconvolved image is affected by high frequency noise [18] . One approach to control this noise is to add a penalty term to form a penalized weighted least square (PWLS) [3] , [18] cost function: (2) where is known as the regularizer and is a regularization parameter. One example of is an edge-preserving regularizer [19] .
Minimizing (2) requires iterative methods [4] - [6] . In this paper we consider a diagonally preconditioned gradient descent method to solve (2) [6] , [18] :
The solution is obtained iteratively. In each iteration, a new 3D image estimate is obtained by updating the previous image with a chosen step, the negative gradient of the cost function scaled by . Fig. 2 shows a block diagram of this iterative approach. To start, the CT scanner produces the measured sinogram, and the FBP algorithm is used to estimate the initial image , followed by computed forward-projection to obtain the computed sinogram . The error between the computed and measured sinogram is back-projected , then offset by a regularization term. The result is scaled by , and used to improve the initial image to produce . The image is iteratively updated to minimize the cost function.
B. Forward-and Back-Projection
Forward and back-projection are the most computationally intense operations in iterative image reconstruction due to the large size of the system matrix . It is infeasible to store , thus the forward-projection , and back-projection in (3) are computed on the fly.
The forward-projection is mathematically based on the Radon transform. The Radon transform of a 3D volume at view angle is described by the line integrals [7] : (4) where is the line that connects the X-ray source and the detector cell at . In a practical implementation, a 3D continuous volume is discretized to a collection of volume elements, or voxels , where is the voxel coordinate. The grid spacings are and dimensions are along the directions. Let be the common voxel basis function, defined as a cubic function, , and be the location of voxel . We have (5) To account for the finite detector cell size, the projection is convolved with the detector blur . Following a common assumption that the detector blur is shift invariant, independent of the view angle , and acts only along and coordinates, then the ideal noiseless forward-projection on the detector cell centered at is given by (6) where (7) and (8) where is the footprint of voxel and is the blurred footprint. For a detailed description of this derivation, see [18] . The separable footprint (SF) method [7] approximates the blurred footprint function as the product of and , thus (6) is approximated as (9) Based on (9), one complete forward-projection involves multiplication and summation over six nested loops:
, and . For a practical object made up of more than 10 million voxels, a SF forward-projection that comprises more than 900 view angles, as in a commercial axial CT scanner [3] , requires on the order of 100 billion multiply-accumulate (MAC) operations. In the following sections, we explore architecture and algorithm co-optimization to accelerate the SF forward-projection.
For the sake of completeness, we briefly summarize backprojection. Back-projection is the operation that smears the projection in detector space back into the object space to reconstruct the 3D volume [18] . Back-projection is mathematically described as (10) where is the weighted difference between measured sinogram and the computed sinogram . Similarly, the SF method approximates back-projection as (11) Note that the equations governing forward-and back-projection are similar and they also share a common architecture. In this paper, we will focus the discussions on forward-projection, but the results can also be applied to back-projection.
III. QUANTIZATION ERROR INVESTIGATION
Iterative CT image reconstruction algorithms are usually implemented in 32-bit single-precision floating-point quantization. Floating-point arithmetic costs more hardware resources and longer latency than integer (or fixed-point) operations. The substantially smaller area and higher speed provide strong incentives for using fixed-point operations. However, fixed-point quantization introduces errors that may degrade image quality. We show in the following that good image quality can be achieved with appropriate quantization choice and sufficient number of iterations.
Our experiment was done using a 61-slice test volume, with each slice made up of 320 320 voxels. Errors are defined in reference to a baseline that is the image reconstructed using 32-bit floating-point quantization after 1 000 iterations. We converted floating-point to fixed-point and varied the word length and quantization of each parameter and operand. Mean absolute error (MAE) and root mean square error (RMSE) of the image update in every iteration were measured compared to the baseline. The errors are expressed in Hounsfield unit (HU), which is a linear transformation of the linear attenuation coefficient (the attenuation coefficient of water at standard pressure and temperature is defined as 0 HU and that of the air is HU). We used an OS algorithm [6] with 82 subsets which is a variation of (3) that uses a subset of the projection views for each update. Fig. 3 comprises the 32-bit floating-point quantization and the fixed-point quantization described in Table I . We use the notation to denote a fixed-point format with before the radix point and after the radix point. The experiment confirms that the fixed-point quantization errors introduced can be limited to fairly low levels. More iterations can help suppress the errors, and the word length can be increased to reduce the errors further if necessary. Fig. 4 shows the images obtained by iterative image reconstruction as well as the absolute pixel-by-pixel differences between the reconstructed image using 32-bit floating-point quantization and the reconstructed image using fixed-point quantization. Three representative slices in the region of interest are shown from left to right. The vast majority of the pixel errors remain relatively small. We observe no perceptual difference between floating-point and fixed-point reconstructed images. These initial results suggest that the iterative image reconstruction algorithm can be robust to quantization error. The property allows us to simplify the hardware with much more efficient integer arithmetic and smaller memory.
IV. ARCHITECTURE AND ALGORITHM CO-OPTIMIZATION
Forward-and back-projection are the core and most computationally intense building blocks of iterative image reconstruction. A simplistic forward-projection architecture includes image memory on the input and detector memory on the output as in Fig. 5 ; back-projection exchanges the positions of image and detector memory but its processing architecture is similar. In a state-of-the-art commercial CT scanner, the image and detector datasets are up to 1 GB in size. Such enormous datasets can only be accommodated in off-chip memory, and input and output data are selectively brought to on-chip memory (cache) for processing. The on-chip memory is smaller but much faster and sometimes immediately accessible by the processor, while the larger off-chip memory interface is much slower and costs a longer latency to access. Iterative image reconstruction algorithm in its original form requires moving of large datasets on and off chip constantly, resulting in a low throughput due to limited off-chip memory interface.
Parallelism can be used to improve the throughput, but it further increases memory bandwidth. The architecture can be pipelined, though its throughput is far from ideal due to loop-carried dependencies from geometry processing. In the following we investigate the projection geometry and design algorithms and architectures to reduce the memory bottleneck and improve the efficiency of parallel and pipelined architectures.
A. Projection Geometry
The projection geometry is central to the proposed algorithms and architectures. Fig. 6 illustrates the X-ray projection of a single voxel of dimension centered at . We define the magnification factor as the ratio of the source-to-detector distance (which is a constant in cone-beam geometry) over the distance between the source and . (The magnification factors of all voxels in an axial column are equal.)
is maximized when the voxel is closest to the X-ray source and minimized when the voxel is furthest to the X-ray source, i.e., where FOV, or field of view, is the diameter of the volume that is reconstructed from all view angles, and is the source-torotation-center distance. Now, consider the position of a voxel relative to the X-ray source-the transaxial width of a voxel's projection is maximized if the transaxial diagonal of the voxel is perpendicular to the line joining the X-ray source and the center of the voxel, illustrated in Fig. 7 . Considering both the magnification and the transaxial diagonal of the voxel, the transaxial span of the projection of a voxel, quantized to the axial spacing of the detector grid, is (13) where denotes ceiling.
The magnification factor in (12) can also be used to derive the axial span. Typically the axial spacing of the detector grid is designed to match the voxel grid by having . Therefore, on average one voxel maps to one detector cell along the axial direction. However, grid misalignment and geometry cause multiple consecutive voxels in an axial column to project to a single detector cell, as shown in Fig. 8 . The axial height of a voxel's projection is minimized if the voxel is located Fig. 9 . It follows that the number of voxels in an axial column that project to a single detector cell is (14) For a numerical example, substituting sample helical conebeam geometry parameters given in Table II, we get  and , i.e., one voxel's projection spans at most 11 detector cells along the transaxial direction, and at most 3 consecutive voxels in an axial column project to one detector cell.
B. Loop-Level Parallelism and Water-Filling
The SF forward-projection algorithm contains six layers of nested loops (9): (view angle), ( index), ( index), ( index), ( index) and ( index) for each forward-projection. The innermost loop computes the transaxial projection of a voxel. As discussed in the previous section, one voxel projects to a row of up to detector cells, each of which can be evaluated independently. Thus we exploit loop-level parallelism by allocating multiply-accumulate (MAC) units and detector memory banks for the transaxial projection, as shown in Fig. 10 . The quantization study showed that the transaxial projection can be carried out in a 16-bit 16-bit fixed-point multiply followed by a 28-bit accumulate. To operate at a high clock frequency, e.g., 200 MHz on a Xilinx Virtex-5 FPGA, we pipeline the MAC unit to 3 stages: multiply (MU), add (AD), and write back (WB). Let be the wordlength of that is stored in the detector memory and be the clock frequency. Then, the required read and write bandwidth to the on-chip detector memory is b/s. Since one complete transaxial projection block uses MAC units, the total on-chip detector memory bandwidth is b/s. The outer loop can be easily pipelined, but it is complicated by loop-carried dependencies: multiple voxels in an axial column can project to a single detector cell, as illustrated in Fig. 8 , so the pipeline would have to be stalled, waiting for write back to complete before next add. The 3-stage pipeline chart in Fig. 11 shows that one pipeline bubble is necessary to resolve data dependency. A deeper pipeline will result in more stalls.
The mismatch between the voxel grid and detector grid requires the joint consideration between the loop and the loop. To eliminate loop-carried dependencies, we propose an algorithm transformation to merge the two loops. In the transformed algorithm, for each -th detector cell, we identify the group of contiguous voxels along the axial column that project to the cell and sum up the contributions. In particular, we allocate shift registers, each providing one candidate voxel (because up to voxels in an axial column project to a single detector cell), as in Fig. 12 . Each candidate voxel is multiplied by its axial footprint and the contributions are summed, which is equivalent to a partial unrolling of the loop. An example is shown in Fig. 13 using 2 and the input multiplexer will direct the new voxel input to . Note that in the above example, one new voxel is brought in the water-filling buffer every cycle to support the average input consumption rate. The average consumption rate is one input per clock cycle because and are designed to be matched as previously described. However, the actual input consumption varies every cycle and prefetching is needed to avoid stalling the pipeline. A longer shift register and prefetching guarantee a lower stall rate, but increase latency and resource usage. We experimentally verified the stall rate versus shift register length, and the results are listed in Table III . We choose a 2-stage shift registers in our prototype design for a stall rate %. A lower stall rate is possible with longer shift registers.
The new water-filling architecture can be implemented using 3 MAC units that are pipelined in two stages: read (RE) and sum (SU), which augment the 3-stage pipeline in Fig. 11 to 5 stages as in Fig. 14 . Pipeline bubbles due to loop-carried dependencies have been eliminated to achieve an average throughput of voxel projections/s. The required on-chip image memory bandwidth is b/s with as the voxel wordlength. Substituting parameters from Table II , %, and MHz that is typical of an FPGA platform, the proposed projection module completes 185.2 million voxel projections/s and requires an on-chip image memory bandwidth of 2.6 Gb/s and detector memory bandwidth of 123.2 Gb/s. In the following section, we propose out-of-order scheduling to reduce the detector memory bandwidth.
A complete forward-projection module consisting of the water-filling axial projection and parallel transaxial projection has been synthesized on a Xilinx Virtex-5 XC5VLX155T FPGA and the device usage is listed in Table IV .
C. Out-of-Order Scheduling
We could further parallelize the and loops, but it would increase the memory bandwidth. Absent of any temporal locality of reference, the off-chip memory bandwidth will be easily saturated as we continue to parallelize. To circumvent the difficulty, we compress the off-chip memory bandwidth by an out-of-order access schedule that maximizes the temporal locality of reference. To explain the approach, note that the voxels along a line cast projections onto the same block of detector cells, thus the on-chip memory can be reused without resorting to off-chip access, as shown in Fig. 15 . Based on this observation, we design an out-of-order scheduling algorithm as follows: (1) divide the detector into sectors as in Fig. 16(a) ; (2) draw the upper and lower edge of each sector by connecting the X-ray source and the upper and lower end of the sector; (3) determine the set of voxels whose projections lie entirely in each sector. Assign the set of voxels to a projection module for processing to maximize the detector memory's locality of reference.
If we choose the sectors to be non-overlapping as in Fig. 16(a) , some voxels will be missed as their projections do not completely lie in any sector. Adjacent sectors will have to overlap by an amount to ensure all voxels are accounted for. (Recall that is the maximum transaxial span of a voxel's projection. An overlap of or more is not necessary.) For simplicity of implementation, we choose a fixed overlap of in making sectors. Now another problem arises with the choice of a fixed overlap, as some voxels will be counted twice in adjacent sectors, as shown in Fig. 16(b) . To avoid double counting, we keep track of the upper and lower edge of each sector.
The out-of-order schedule can be computed in design time and stored in memory. The required memory is , where is the wordlength to store the coordinate pair. Using the sample geometry in Table II , the out-of-order schedule memory takes 796.5 MB. If we take into account the multiple rotations in a CT scan that repeat view angles and only voxels inside the FOV, the out-of-order schedule memory size is reduced to 86.3 MB, which is still significant.
To further shrink the out-of-order schedule memory, we design a run-length encoding to compress the schedule. The encoding scheme is illustrated in Fig. 17 : we store the voxel coordinates along of , and encode and store of as the run length from . of becomes the of and the encoding follows a similar fashion. The direction to count run length depends on the view angle , as described in Table V . For a numerical example, if we choose a sector size of , the out-of-order schedule memory can be compressed by an order of magnitude to 8 MB.
Table VI lists a few more example sector sizes based on the geometry in Table II. If we choose sector size , with a fixed sector-sector overlap of , the detector is divided into 89 sectors. A sector covers an average of voxel columns. Sectors are processed sequentially. After finishing one sector, we move forward by a stride of to the new sector. The external memory access is reduced to only banks every sector. When , the off-chip detector memory bandwidth of the proposed projection module described in the previous section is reduced to 245.2 Mb/s. As we increase the sector size, both the stride and sector coverage increase, resulting in an almost constant off-chip memory bandwidth. A larger sector size requires a larger on-chip memory but a smaller out-of-order schedule memory.
The out-of-order scheduling requires sectored processing. The number of on-chip detector memory banks has to be increased from to . Since a projection covers only an segment of the sector, a rotator and an inverse rotator are needed to select the detector memory banks. The rotator-based architecture can be implemented using multiplexers and it incurs a high routing overhead. An alternative selector-based architecture allocates sec transaxial projection blocks, and each block can be enabled or disabled by the write enable to the corresponding memory bank. The comparison between the rotator-based and the selector-based architecture is illustrated in Fig. 18 with FPGA synthesis results listed in Table VII . A selector-based architecture uses fewer logic units or FPGA slices, but more MAC units or DSP48E slices. In both architectures, a small sector size results in more efficient use of hardware.
The detector memory is dual-port to support one read and one write per cycle for the read-accumulate-write operation. To enable loading and unloading from off-chip memory without stalling the computation, we increase the number of detector memory banks from to . While memory banks are accessed for the projection of the current sector, the remaining banks are being unloaded/loaded to/from off-chip memory. To avoid stalling the pipeline, the loading and unloading time by the memory banks should be no greater than the time spent on the projection computation. This condition can be easily met in the proposed sectored processing. Fig. 19 . Inputs are read from the image memory, held by the water-filling buffer before being processed by the partially-unrolled axial projection block. Transaxial projections are performed in parallel and the results are accumulated in the detector memory. A selector-based architecture orchestrates sectored processing following an out-of-order schedule. A summary of the architecture metrics is listed in Table VIII. The projection module has been mapped to a Xilinx Virtex-5 XC5VLX155T FPGA [14] and the device utilization is listed in Table IX . We followed the sample geometry in Table II and chose a small sector size with . The projection MHz. The substantially reduced off-chip memory bandwidth allows us to parallelize the design further by multiple projection modules. The Xilinx Virtex-5 XC5VLX155T FPGA can accommodate 5 parallel projection modules, and the device utilization is shown in Table IX . The parallel projection modules will be assigned to non-adjacent sectors, so they will be able to operate independently for a 55-way parallel computation towards a combined average throughput of 925.8 million voxel projections/s at MHz. The 55-way parallel forward-projector is integrated with two DDR400 64-bit DRAM channels that each provides up to 25.6 Gb/s off-chip memory interface. One DRAM channel is used as the off-chip image memory and the other as the off-chip detector memory. This 55-way parallel design completes one forward-projection of a 320 320 61 test object over 3,625 views in 6.31 seconds. The same task implemented in C requires 31.1 seconds of execution time on an 8-core 2.8-GHz Intel processor for a throughput of 203.0 million voxel projections/s. The C program uses 16 threads, and is optimized based on the projection geometry.
VI. CONCLUSION
We present algorithm and architecture techniques to construct a highly efficient hardware-based forward-projection for iterative image reconstruction. The solutions are based on a study of the projection geometry which uncovers loop-level parallelism, locality of reference, as well as geometric mismatch between the object grid and the projection grid. We exploit loop-level parallelism and spatial locality of reference to unroll inner loops for a high throughput. However, geometric mismatches and offchip memory access bottleneck limit the achievable throughput. A water-filling buffer is thus created to bridge the geometric mismatch and remove the pipeline stalls, and an out-of-order schedule is designed to compress the off-chip memory access. The cost of implementing these schemes is kept low by judicious considerations of buffer length used in the water-filling buffer, sector size and architecture used in the sectored processing, as well as run-length encoding designed to compress the out-of-order schedule memory.
The resulting architecture is fully pipelined and can be parallelized for a very high throughput. We demonstrate the design in a 5-stage pipelined, 55-way parallel forward-projector implemented on a Xilinx Virtex-5 XC5VLX155T FPGA that achieves an average throughput of 925.8 million voxel projections/s at a clock frequency of 200 MHz. Note that the throughput is limited by the number of MAC units available on this device, as a Virtex-5 XC5VLX155T FPGA contains only 128 DSP48E slices. The latest Xilinx Virtex-7 devices offer up to 3,600 DSP slices [20] , which will allow for a much higher throughput potential.
The proposed architecture can be adopted for back-projection for a complete iterative image reconstruction system, which is part of our future work. Testing fixed-point quantization of higher-resolution images also remains our future work. The proposed algorithm and architecture techniques also apply to designs that are built on alternative hardware platforms, such as GPU and DSP to achieve significant accelerations.
