Abstract - This paper presents a novel FPGA architecture for fast three-dimensional (3D) image reconstruction of digital holograms. The architecture performs the 3D rendering with the Fresnel transform. The implementation keeps on-chip hardware resource consumption low even for large digital holograms by buffering intermediate results in off-chip memory for subsequent computation. The off-chip memory is integrated through a network-on-chip (NOC) platform for efficient data access.
Introduction
Three-dimensional (3D) digital imaging is gaining importance in applications such as metrology, biology, industrial inspection, and consumer electronics. Digital holography (DH) [6, 7] has been found to be effective for 3D imaging: it records the wavefront of a 3D object with charge-coupled devices (CCDs). The 3D image of the object under observation can then be reconstructed by digital diffraction computation on the recorded hologram. Several techniques can be used for diffraction computation, including the Fresnel transform method, the convolution method, and the angular spectrum method. Although these methods are effective, they share the drawback of high computational complexity. The fast Fourier transform (FFT) can be used to accelerate the computation. However, real-time 3D reconstruction may still be difficult on computers with limited computational capacity.
Graphics processing units (GPUs) used for general-purpose computation can accelerate the diffraction computation. A number of GPU-based implementations [1, 8, 9] for diffraction computation have been proposed. These implementations exploit the many-core capability of GPUs to offer a significant increase in throughput at the expense of higher power consumption. They may therefore not be well suited to mobile or embedded devices with strict limits on power dissipation.
To reduce the power consumption, field programmable gate arrays (FPGAs) have been adopted to implement diffraction computation. The architectures in [2, 3, 10] are based on the convolution approach. The work in [4] implements the angular spectrum method in hardware. The study in [5] realizes a Fresnel transform architecture. Although these implementations provide high-throughput computation, area cost does not appear to be a primary concern in these architectures. For example, the circuit in [5] consumes a large amount of on-chip memory to pipeline the Fresnel transform efficiently. The on-chip memory consumption may grow enormously for large digital holograms. Therefore, it would be difficult to implement diffraction computation circuits for large digital holograms in FPGA devices with limited hardware resources.
One way to solve the problem is to use off-chip memory for the buffers. However, some designs focus only on stand-alone FPGA implementation and do not consider integrating the design into system-on-chip (SOC) or network-on-chip (NOC) platforms. Off-chip memory access may therefore be inefficient, especially for the column-wise one-dimensional (1D) FFT operations, which require access to non-consecutive memory locations.
The objective of this paper is to present a novel FPGA-based coprocessor for 3D image reconstruction. The circuit is able to perform the Fresnel transform in real time without large hardware resource utilization. Moreover, the circuit is capable of actively accessing the off-chip memory without the aid of other processors or direct memory access (DMA) controllers. Large digital holograms can then be processed by FPGA devices with limited hardware resources. The proposed circuit has been implemented on the Altera DE4 development board, where it acts as a hardware accelerator in a Qsys-based NOC platform. Experimental results reveal that the proposed architecture is effective for applications where high-speed computation, low hardware cost, and low power consumption are desired.
Fresnel Transform for Digital Holography
The proposed architecture performs diffraction computation for a DH microscopic imaging system. The resulting hologram, denoted by η, can be captured by CCDs and stored in a digital computer. Given the hologram η, an object's image B in a plane parallel to the hologram plane at distance z can be reconstructed by the Fresnel transform as follows:
where λ is the wavelength of the light source, and (p, q) and (r, s) are the coordinates on the hologram and image planes, respectively.
Since the hologram η is discretized by the CCD, a discrete representation of the Fresnel transform is necessary for DH. Suppose the digital recording/sampling operations produce N × N samples of η with sampling interval ∆f in both the x and y directions. Direct discretization of the Fresnel integral gives the following:
where η_{x,y} is the (x, y)-th sample of the discretized hologram η, 0 ≤ x, y ≤ N − 1, and
is the inverse of ∆ f scaled by λ z N .
The Proposed Architecture
The proposed architecture aims to compute eq. (2) on an FPGA. It contains three units: the pre-transform unit, the FFT unit, and the post-transform unit.
The goal of the pre-transform unit is to compute
$$\rho_{x,y} = \eta_{x,y}\,\mu_x\,\mu_y. \qquad (3)$$
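The FFT unit then takes the Fourier transform of ρ_{x,y}. The result produced by the FFT unit, termed τ_{u,v}, is given by
$$\tau_{u,v} = \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} \rho_{x,y}\,\exp\!\left(-j\,\frac{2\pi(ux+vy)}{N}\right). \qquad (5)$$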
Define
By substituting eqs. (5) and (6) into eq. (2), it follows that
$$\varepsilon_{u,v} = \alpha\,\omega_u\,\omega_v\,\tau_{u,v}. \qquad (7)$$
Therefore, once τ_{u,v} is available, the post-transform unit computes α × ω_u × ω_v × τ_{u,v} to find ε_{u,v}. In addition, the unit computes φ_{u,v}, the phase of ε_{u,v}, for hologram reconstruction.
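The three-unit dataflow can be sketched in software as follows. This is a minimal illustration, not the hardware design: the tables mu_x, mu_y, omega_u, omega_v and the scalar alpha are passed in as opaque inputs because their definitions are not reproduced here, the function names are hypothetical, and a direct DFT with an assumed exp(−j2πkt/N) sign convention stands in for the FFT unit.

```python
import cmath

def dft_1d(seq):
    """Direct 1D DFT; stands in for the hardware FFT (sign convention assumed)."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def three_stage_transform(eta, mu_x, mu_y, omega_u, omega_v, alpha):
    """eta: N x N samples; mu_*, omega_*: length-N tables; alpha: scalar."""
    n = len(eta)
    # Pre-transform unit: rho[x][y] = eta[x][y] * mu_x[x] * mu_y[y]
    rho = [[eta[x][y] * mu_x[x] * mu_y[y] for y in range(n)] for x in range(n)]
    # FFT unit, row pass: one 1D transform per row
    row_pass = [dft_1d(row) for row in rho]
    # FFT unit, column pass: one 1D transform per column
    tau = [[0j] * n for _ in range(n)]
    for v in range(n):
        column = dft_1d([row_pass[x][v] for x in range(n)])
        for u in range(n):
            tau[u][v] = column[u]
    # Post-transform unit: eps[u][v] = alpha * omega_u[u] * omega_v[v] * tau[u][v]
    return [[alpha * omega_u[u] * omega_v[v] * tau[u][v] for v in range(n)]
            for u in range(n)]
```

With all tables set to 1 and a single nonzero sample at the origin, every output equals alpha times that sample, which is a quick sanity check of the three stages.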
As shown in Figure 1, all three units are attached to the NOC system built with Altera Qsys. Each unit has a dedicated network interface for accessing data outside the unit. The NOC system also includes a NIOS II embedded processor and a DRAM controller. The NIOS II processor coordinates the pipelined operations among the units. The DRAM controller serves the off-chip memory accesses requested by the embedded processor and the three units.
Pre-transform Unit
The operation of the pre-transform unit is based on eq. (3). The unit therefore involves the computation of µ_x and µ_y, and multiplications. To accelerate the computation, the values of µ_x and µ_y can be pre-computed and stored in tables. Because 0 ≤ x, y ≤ N − 1, µ_x and µ_y each take only N different values once λ, z, and ∆f are known. Each of the two tables therefore contains N entries. Figure 2 shows the architecture of the pre-transform unit, which contains an address generation unit (AGU), a controller, two tables, two complex-number multipliers, and two buffers. The AGU generates the addresses for reading η_{x,y} from off-chip RAM into the read buffer. The controller then generates the indices x and y for loading the µ_x and µ_y values from the tables. The multipliers compute ρ_{x,y}, which is first stored in the write buffer and then sent back to off-chip RAM for the subsequent FFT operations.
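The reason two N-entry tables suffice can be illustrated with a short sketch. It assumes that µ is a quadratic phase (chirp) factor, as is common in discrete Fresnel transforms; the exact definition is not reproduced here, so the constant coeff (standing for a value fixed by λ, z, and ∆f) and the function names are hypothetical. Under that assumption the 2D factor separates into a product of two 1D table entries, so N² values never need to be stored.

```python
import cmath

def chirp_table(n, coeff):
    """N-entry table of exp(j * coeff * k^2), k = 0..N-1 (assumed chirp form)."""
    return [cmath.exp(1j * coeff * k * k) for k in range(n)]

def pre_transform(eta, mu_x, mu_y):
    """rho[x][y] = eta[x][y] * mu_x[x] * mu_y[y], using two table lookups per sample."""
    n = len(eta)
    return [[eta[x][y] * mu_x[x] * mu_y[y] for y in range(n)] for x in range(n)]

# Usage: with eta = 1 everywhere, rho reproduces the full 2D chirp
# exp(j * coeff * (x^2 + y^2)) from only 2 * N stored values.
n, coeff = 8, 0.01
mu = chirp_table(n, coeff)
rho = pre_transform([[1.0] * n for _ in range(n)], mu, mu)
```

Each output sample costs two complex multiplications, which matches the two multipliers in the unit.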
Because the multipliers operate on complex numbers in floating-point format, it may be difficult to complete a multiplication in a single clock cycle. In our design, every multiplication takes multiple clock cycles. To enhance the throughput, all multipliers are fully pipelined. Therefore, in addition to generating indices, the controller also coordinates the components in the circuit for pipelined operation.
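The benefit of full pipelining can be seen from a simple cycle count. The sketch below is an idealized model under stated assumptions: depth is a hypothetical multiplier latency in clock cycles (the actual value is not given here), one sample enters the pipeline per cycle, and memory stalls are ignored.

```python
def cycles_pipelined(num_samples, depth):
    """First result appears after `depth` cycles, then one result per cycle."""
    return depth + num_samples - 1

def cycles_unpipelined(num_samples, depth):
    """Each multiplication must finish before the next one starts."""
    return depth * num_samples

# Usage: an N x N hologram with N = 512 and a hypothetical depth of 10 cycles.
samples = 512 * 512
speedup = cycles_unpipelined(samples, 10) / cycles_pipelined(samples, 10)
```

For large sample counts the ratio approaches depth, so the multi-cycle latency is almost entirely hidden.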
FFT Unit
The goal of the FFT unit is to compute τ_{u,v} given by eq. (5). The FFT unit consists of an AGU, a controller, and a one-dimensional FFT (1D-FFT) module. To perform the two-dimensional FFT (2D-FFT) with the 1D-FFT module, rows of the array {ρ_{x,y}, 0 ≤ x, y ≤ N − 1} are loaded from off-chip RAM and processed one at a time. The FFT unit writes each result directly back to the same row in the off-chip memory. After the row operations are completed, the column operations proceed in the same manner. After all the column operations are completed, the array stored in the off-chip RAM is {τ_{u,v}, 0 ≤ u, v ≤ N − 1}, the 2D-FFT of {ρ_{x,y}, 0 ≤ x, y ≤ N − 1}.
We use the Altera FFT MegaCore function to implement the 1D-FFT module. Because one row or one column is processed at a time, the transform length of the FFT is N. The 1D-FFT module has a single data input and a single data output, and it is fully pipelined. In addition, the input/output dataflow of the module can operate in streaming mode, which allows the module to process a continuous input data stream and produce a continuous output data stream.
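The row-then-column procedure with in-place write-back can be sketched as follows. This is a software illustration only: a recursive radix-2 FFT replaces the Altera FFT MegaCore function, a flat Python list in row-major order stands in for the off-chip RAM, and the function names are hypothetical.

```python
import cmath

def fft_1d(seq):
    """Recursive radix-2 FFT; len(seq) must be a power of two."""
    n = len(seq)
    if n == 1:
        return list(seq)
    even = fft_1d(seq[0::2])
    odd = fft_1d(seq[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_2d_in_place(mem, n):
    """mem: flat list of n*n complex values in row-major order, overwritten in place."""
    for r in range(n):
        # Row pass: read one row, transform it, write it back to the same row.
        mem[r * n:(r + 1) * n] = fft_1d(mem[r * n:(r + 1) * n])
    for c in range(n):
        # Column pass: same procedure on stride-n elements.
        mem[c::n] = fft_1d(mem[c::n])

# Usage: a constant 4 x 4 array transforms to a single nonzero DC term.
mem = [1.0 + 0j] * 16
fft_2d_in_place(mem, 4)
```

Because every line is written back to where it was read, no second full-size array is needed.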
To perform the 2D-FFT with the 1D-FFT module, the AGU in the FFT unit generates the addresses for loading the source data from off-chip RAM and for writing the results produced by the 1D-FFT module back to off-chip RAM. Because the 1D-FFT module has a single data input and a single data output, two addresses are generated in each clock cycle: one for loading data and another for writing a result. In addition, because the 1D-FFT module is fully pipelined and able to operate in streaming mode, consecutive rows (or columns) can be fed to the module seamlessly. This is accomplished with read and write buffers, each holding one row (or column) of the source data (or results).
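The address sequences produced for rows and columns can be sketched as follows. The sketch assumes the N × N array is stored in row-major order and that each complex sample occupies 8 bytes (two 32-bit single-precision words); the function name, the base parameter, and the storage layout are illustrative assumptions rather than details of the actual AGU.

```python
def line_addresses(index, n, column=False, base=0, word_bytes=8):
    """Byte addresses of the n samples in one row or column of a row-major N x N array."""
    if column:
        start, stride = index, n      # column: samples are n words apart
    else:
        start, stride = index * n, 1  # row: samples are adjacent
    return [base + (start + k * stride) * word_bytes for k in range(n)]

# Usage for N = 4: row 1 is contiguous, column 1 jumps by N words each step.
row_addrs = line_addresses(1, 4)               # [32, 40, 48, 56]
col_addrs = line_addresses(1, 4, column=True)  # [8, 40, 72, 104]
```

The strided column pattern is the non-consecutive access that makes the column pass the harder case for off-chip memory.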
Post-transform Unit
The post-transform unit is responsible for reconstructing the object image ε_{u,v} using eq. (7). As depicted in Figure 3, the architecture of the post-transform unit is similar to that of the pre-transform unit: it comprises an AGU, a controller, two tables, three multipliers, and two buffers. An additional arctan circuit is required for phase computation.
In the post-transform unit, the tables store the pre-computed values of ω_u and ω_v. As with µ_x and µ_y, because 0 ≤ u, v ≤ N − 1, each of the tables for ω_u and ω_v contains N entries. The controller in the post-transform unit operates in a similar fashion to that of the pre-transform unit: it produces the indices (i.e., the u and v values) for loading the ω_u and ω_v values from the tables. The AGU generates the off-chip RAM addresses for loading τ_{u,v}. The result of the multiplication, ε_{u,v}, is then passed to the arctan circuit to compute the phase φ_{u,v}. Finally, the phase is stored back to off-chip RAM.
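The post-transform step and the phase extraction can be sketched in a few lines. This is an illustration under assumptions: the table contents and alpha are placeholders, the function name is hypothetical, and math.atan2 stands in for the arctan circuit, whose actual implementation is not described here.

```python
import math

def post_transform_phase(tau, omega_u, omega_v, alpha):
    """Return phi[u][v], the phase of eps[u][v] = alpha * omega_u[u] * omega_v[v] * tau[u][v]."""
    n = len(tau)
    phi = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            # Three complex multiplications, matching the three multipliers in the unit.
            eps = alpha * omega_u[u] * omega_v[v] * tau[u][v]
            # Four-quadrant arctangent of imaginary over real part.
            phi[u][v] = math.atan2(eps.imag, eps.real)
    return phi

# Usage with unit tables: the phase of tau passes through unchanged.
tau = [[1j, 1 + 0j], [1 + 1j, -1j]]
ones = [1 + 0j, 1 + 0j]
phi = post_transform_phase(tau, ones, ones, 1 + 0j)
```

A four-quadrant arctangent is used so that the phase covers the full range from −π to π.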
Experimental Results
Some experimental results of the proposed architecture are presented in this section. The design platform is Altera Quartus II [11, 12] with Qsys. The target FPGA device is the Altera Stratix IV EP4SGX230. All computations in the proposed architecture are single-precision floating-point computations, so numbers are represented in the IEEE 754 single-precision format with a length of 32 bits. The off-chip memory is 800 MHz DDR II memory with a size of 1 Gbyte. The circuit operates at 200 MHz. Table 1 shows the hardware resource consumption of the proposed architecture for holograms with dimensions 256 × 256 and 512 × 512. Four types of area cost are considered in the experiment: adaptive look-up tables (ALUTs), dedicated logic registers, embedded memory bits, and DSP blocks. To reduce the consumption of general-purpose hardware resources such as ALUTs and logic registers, the embedded memory blocks and DSP blocks are used to implement the on-chip memory and the arithmetic operators, respectively. Because the number of arithmetic operators is independent of the hologram dimensions, Table 1 shows that the number of DSP blocks is independent of the hologram size. In addition, because only the read buffer and write buffer in each unit are implemented with embedded memory blocks, the consumption of embedded memory bits is small.
The area costs of the entire NOC system are summarized in Table 2. The table shows that the hardware utilization of the entire NOC system is small compared with the capacity of the target FPGA device. In fact, the utilization of ALUTs, dedicated logic registers, embedded memory bits, and DSP blocks is 26 %, 32 %, 11 %, and 15 %, respectively, of the resources provided by the Altera Stratix IV EP4SGX230. To further evaluate the performance of the proposed architecture, it is compared with the work in [5], as shown in Table 3. Compared with its counterpart, the proposed architecture consumes significantly fewer embedded memory bits at the expense of higher latency. In fact, its embedded memory consumption is only 2 % (352584 vs. 16936864 bits) of that of the architecture in [5]. This significant reduction in embedded memory bits allows fast diffraction computation to be implemented in smaller, lower-cost FPGA devices.
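The arithmetic behind the memory comparison can be checked with a short sketch. The ratio uses the two figures quoted above. The buffer sizes are an illustration that assumes each complex sample takes two 32-bit words and that a buffer holds exactly one row or column; the full-frame figure is shown only for scale and is not a description of the design in [5]. The function names are hypothetical.

```python
BITS_PER_COMPLEX = 2 * 32  # real and imaginary parts, IEEE 754 single precision

def line_buffer_bits(n):
    """Bits needed to hold one row or column of n complex samples."""
    return n * BITS_PER_COMPLEX

def frame_buffer_bits(n):
    """Bits needed to hold an entire n x n array of complex samples."""
    return n * n * BITS_PER_COMPLEX

# Ratio of the reported embedded memory consumption to that of [5].
ratio = 352584 / 16936864

# Usage for a 512 x 512 hologram: a line buffer is N times smaller than a frame.
line_bits = line_buffer_bits(512)    # 32768
frame_bits = frame_buffer_bits(512)  # 16777216
```

Buffering single lines instead of whole frames is what keeps the on-chip memory growth linear in N rather than quadratic.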
Finally, Figure 5 and Figure 6 show the 3D image reconstruction results of the proposed architecture. The image considered in the experiment was produced by digital holographic microscopy (DHM). The object is a microlens with a radius of curvature of 120 microns. The error of the 3D reconstruction is only 0.1 micron. Therefore, in addition to high-speed computation, the architecture achieves high accuracy for 3D reconstruction.
Conclusion
The experimental results reveal that the proposed architecture is well suited to low-cost FPGA implementation of 3D image reconstruction for digital holograms. The architecture has low consumption of ALUTs, dedicated logic registers, embedded memory bits, and DSP blocks. In particular, it consumes only 352584 bits of embedded memory. The architecture is therefore beneficial for applications requiring both low hardware resource utilization and high-speed computation.
