Forward and Backward projections are two computational costly steps in tomography image reconstruction such as Positron Emission Tomography (PET). To speed-up reconstruction time, a hardware projection/backprojection pair has been built following algorithm architecture adequacy principles. Thanks to an original memory access strategy based on an 3D adaptive and predictive memory cache, the external memory wall has been overcome. Thus, for both projector architectures several units run efficiently. Each unit reaches a computational throughput close to 1 operation per cycle.
INTRODUCTION
Reconstruction of images in tomography is cpu intensive and usually postponed. There are several implementations of reconstruction systems to speed up its computation : PC clusters 1, 2 , DSP 3 , multi-processor machines [4] [5] [6] , Cell 7, 8 , GPU 9 or FPGA [10] [11] [12] . All these implementations have to face the memory bottleneck due to the limited bandwidth of the main memory. Our projector and backprojector IPs reach an efficient computing throughput. Indeed, this projector/backprojector hardware pair overcomes the memory access bottleneck thanks to a prefetching memory strategy.
projectors which correspond to our hardware IPs. On the other hand, the 3D-EM algorithm defines the projection and backprojection steps as a matrix multiplication. Because this matrix is defined by the number of bins (number of events into one detector) and the number of voxels, its size is too huge to be stored without factorisations, compressions or approximations 13 . Our study will be focused on the geometric matrix system in which each element (i,j) corresponds to the probability that a photon pair produced in voxel j is detected on the captor i.
The projection and backprojection steps in iterative algorithms can be done either through a computation on the fly of the matrix system by efficient geometric operators, or thanks to a compressed system matrix built beforehand. In this latter case, the system matrix is computed by analytic pre-computation 14 , Monte Carlo simulations 15 or experimental measurements 16 . Moreover, not only sparsity and symmetry are exploited in order to get a reasonable matrix size, but also compression techniques with loss 14 . When system matrix is compute on the fly, the standard manner is to use a projection/backprojection pair made of a ray driven projector and a bi-linear voxel driven backprojector. This unmatched pair is known to be a convergent solution for EM algorithms 17 .
ALGORITHMS

3D-RP
The 3D ReProjection (3D-RP) 18 algorithm computes a first estimation of the volume by a 2D Filtered BackProjection (FBP2D) made of a 2D Ramp Filtering step F r followed by a 2D backprojection BP 2D . Afterwards, a 3D Forward Projection FP 3D of the estimated volume is proceeded in order to complete the projection data. Finally the volume is reconstructed through a 3D Filtered Backprojection (FBP3D) made of a 3D Colsher Filtering step F c followed by a 3D backprojection BP 3D .
3D-EM
The Expectation Maximisation algorithm applied to Emission Tomography 19, 20 is an iterative technique for computing the maximum likelihood estimates of the activity density f from the measured data g. The acquisition system is modelized by the matrix system H = (h ij ). At each iteration step, a new volume estimationf (n+1) is computed from the last volume estimationf (n) as described below :
(SC) Sinogram Comparison between the measured data g i and the estimated dataĝ (n) :
(BN) Backprojection Normalisation with application of the sensibility image s j to the backprojected image
(IC) Image Correction with the application of the correction image Figure 1 . A backprojector unit is composed of a 3D backprojection operator joined with a 3D-AP cache. The memory access strategy is based on a fast and small cache memory inside the SoPC. It predicts the needs of the 3D Back-Projection which draws a 3D sinusoid into the 3D memory space of the projection data. Therefore the cache succeeds to mask the 4-10 cycles latency of the slower and bigger external SDRAM memory. 
Projector IP
The ray casting 3D projector IP 22 achieves a high computation throughput thanks to a synchronised ray casting of LORs drawing close paths in the volume. This temporal and spatial locality in the volume accesses allows an efficient use of the hierarchical 3D adaptive and predictable cache.
TOWARDS A HARDWARE/SOFTWARE RECONSTRUCTION
A Hardware/Software Architecture on a SoPC
The reconstruction system uses a board with a System on Programmable Chip (SoPC) connected to a PC host. The SoPC technology solution is a middle way between a completely software or hardware (ASIC) implementation. Thanks to its programmable logic resources (FPGA) a dedicated architecture can be implemented. This architecture can be seen as a co-processor integrated in a classical system with an embedded processor and memory. In this way, a hardware/software partitioning can be done on the chip according to the computation steps. Moreover, the PC processor can be used as an additional computation resource.
3D-RP Reconstruction
The filtering steps (F r and F c ) are done on the PC host processor whereas the projection and backprojection steps are done on the SoPC board. The projector (F P 3D ) defines a hardware configuration of the SoPC board and the backprojector (BP 2D and BP 3D ) another one. To compute a 3D-RP reconstruction, we will need to configure one chip as a backprojector for BP 2D and BP 3D , and another one will be configured as a projector for the F P 3D step.
3D-EM Reconstruction
As illustred on figure 3, the 3D-EM reconstruction system is mainly composed of two SoPCs : one for a bin flow architecture (FP 3D and SC) and another one for a voxel flow architecture (BP 3D , BN and IC). This two architectures are made of a chain of modules associated each one to a computation step. A buffer at the output of each hardware modules stores a block of proceeded data which are ready to be used by the next module.
These steps are done one after another because of the limited bandwidth of the external memory. The last module of the chain uses double buffering technique to compute a block of data and as the same time writes the previous block of data computed.
Bin Flow Architecture
This architecture is made of a 3D projector FP 3D and a module for the SC step. This one is defined only as a divider of bins which can be conceived with a computation throughput of one division per cycle. Therefore, the computing time of the SC step is limited mainly by the external memory access : one read access on a block of measured data g i and one write access on a block of sinogram comparison (n) i .
Voxel Flow Architecture
This architecture is made of a 3D backprojector BP 3D and two multipliers for the BN and IC steps. The hardwired multiplier on Virtex chip can be efficiently used to reach a computation throughput of one multiplication per cycle. As for the SC step, the BN and IC steps are limited mainly by the external memory access : one read access on a block of sensibility image voxels s j for the BN step, one read access on a block of image voxelŝ f (n) and one write access on a block of image voxelsf (n+1) for the IC step. has been used as data set. A brain phantoms has been reconstructed with the software tomography distribution STIR version 2.0 24 and with a software version of the 3D-RP and 3D-EM as well. The goal of this comparison is to validate our projection/backprojection pair. STIR and our projection/backprojection pair are quite similar. STIR one is made of a ray tracing projector using a variation of Siddon's algorithm 25 and an incremental, beamwise interpolating backprojector using Cho's algorithm 26, 27 . Original and reconstructred phantoms are presented on figure 4.
QUALITY OF RECONSTRUCTION
A convergent projector/backprojector pair
On table 5.1, STIR and our software are compared through the Mean Absolute Percentage Error (MAPE) for analytic and iterative algorithms. For 3D-RP, STIR and our reconstructions are very close with only a MAPE of 0, 63%. For iterative reconstruction, we don't use a regularization step as STIR OSMAPOSL does. As a matter of fact, the MAPE is higher and reaches 2.42% after 30 iterations. Nevertheless, this result proves that our projection/backprojection pair allows an acceptable 3D-EM convergence.
3D-RP(Gac et al)/3D-RP(STIR) 3D-EM(Gac et al)/OSMAPOSL(STIR) MAPE
0.63% 2.42% Table 1 . Mean Absolute Percentage Error (MAPE) computed between STIR and our software for analytic and iterative algorithms.
Fixed/Floating point arithmetic
If the software backprojector used to validate the overall reconstruction uses a fixed point arithmetic as our hardware backprojector (FPGA technology doesn't allow floating point arithmetic), it's not the case for the software projector. As a consequence, the software projection/bacprojection pair is not a 100 % bit true version of the hardware pair. Thus the sudy above doesn't reflect the exact quality of reconstruction that we can expected from the hardware pair. Nevertheless, we have observed on previous work 21 that backprojection is quite robust to noise computation. Indeed, between two software backprojectors, one using a fixed point arithmetic and the other one a floating point arithmetic, a MAPE of only 0, 13% is observed. And as projector and backprojector use quite similar dynamic of computation, we can expect an equivalent robustness to computation noise. This means that the results presented on table 5.1 are not too far away from the quality of reconstruction that we can expect with our hardware projector/backprojector pair. .
RECONSTRUCTION TIMES
Estimation
The SoPC time performance presented at this section are based on the measured performances of our projectors IPs 21, 22 and the estimated performances of the additional hardware modules (only for the 3D-EM algorithm). The computing time of our projectors IPs has been measured on a Virtex 2 Pro at 35 Mhz. From these measures, we get a reconstruction time in clock cycles per operation. As Virtex 2 Pro technology is an older generation than Pentium 4 one, we have scaled our Virtex 2 Pro results to a 200 MHz architecture that can be reasonably obtained on a Virtex 4 chip.
For 3D-RP (table 2) and 3D-EM (table 3) , we compute for each step the processing time through the multiplication of the cycles per operation (Cycles/Op) times the number of operation per element (Nb Op/Elt) times the number of element (Nb Elt). For backprojection, an element corresponds to a reconstructed voxel and an operation to the accumulation of one bin to the reconstructed voxel. For forward projection, an element corresponds to a volume integration along a LOR and an operation to the computation of a line intersection between a LOR and a voxel. The projection data correspond to a Siemens HR+ scanner. They are made of 239 + 162 = 401 sinograms for 3D-RP and of 239 sinograms for 3D-EM. The volume is made of 128 * 128 * 63 voxels.
A 3D backprojector with 8 units can be put on a Virtex 4 FX60 and a 3D projector with 8 units on another one. With these 8 units 2D/3D backprojection and 3D forward projection reach respectively a computational throughput of about 0,25 cycles per operation and 0,5 cycles per operation. To the total reconstruction time an overhead of 20 % has been added corresponding to the communication cost between each step.
Elt
Nb Table 3 . Evaluation of the reconstruction time and cycle efficiency of our reconstruction system for one iteration of the 3D-EM algorithm
Comparison with STIR
We have compared the expected reconstruction time of our SoPC system with STIR reconstruction time measured on a Pentium 4 for 3D-RP and 3D-EM, respectively on Table 5 . Reconstruction time and cycle efficiency measured for the software OSMAPOSL STIR reconstruction done on a Pentium 4 and expected on our reconstruction system.
Discussion
As presented on table 6, a speed-up of 7.5 can be expected for 3D-RP and 3.5 for 3D-EM. Moreover, one can observed that speed up factors achieved by the backprojector is better than the forward projector for both reconstruction algorithms (3D-RP and 3D-EM). This can be explained by the greater complexity of the 3D projector to be efficiently parallelised. Indeed, throwing of LORs have to be synchronised to keep a good temporal and spatial locality when passing through the whole volume. Moreover STIR takes advantage of the exploitation of symmetries to reduce the number of rays thrown which is not the case for our current projector.
Because STIR backprojection is ray-driven and as the number of LORs is less for 3D-EM, its number of operation is less. As a consequence, speed-up factor of our backprojector compared with STIR one, is less for 3D-EM (12) than for 3D-RP (17.5). The better efficiency of STIR projector for 3D-EM compared with 3D-RP is due to the exploitation of the axial symmetry more important in the projection data for 3D-EM. For all these reasons, speed up factor is twice less for 3D-EM (3.5) than for 3D-RP (7). For the same reasons, the 3D-RP efficiency ratio (120) which is the ratio between our Cycles/Voxels efficiency and STIR one's, is higher than 3D-EM ratio (60). Table 6 . Speed up factors and Efficiency ratio between our reconstruction system and STIR software implementation.
SPEED-UP FACTOR EFFICIENCY
CONCLUSION
An analytic and iterative 3D PET reconstruction system, integrating a hardware projection/backprojection pair has been described. Thanks to an architecture algorithm adequacy methodology, 3D projection and 2D/3D backprojection reach an efficient computing throughput. For both 3D-RP and 3D-EM algorithm, 3D backprojection is accelerated by an order of magnitude and 3D projection remains the main computational costly steps. Improvements can be reached with a special effort on the 3D projector conception. For instance, exploitation of the LORs symmetries as STIR projector does, can give a better speed up factor. In comparison with the software implementations of STIR, an acceleration of 7 for 3D-RP and 3.5 for 3D-EM can be expected. The reconstruction system presents a notably better cycle efficiency which let's hope better speed up factor with higher system frequencies on the next FPGA generations or on an ASIC implementation.
