An area-efficient high-throughput architecture based on distributed arithmetic is proposed for the 3D discrete wavelet transform (DWT). The 3D DWT processor was designed in VHDL and mapped to a Xilinx Virtex-E FPGA. The processor runs at up to 85 MHz and can complete the five-level DWT analysis of a 128 × 128 × 128 fMRI volume image in about 20 ms.
Introduction: 3D discrete wavelet transform (DWT) processing is widely applied in image and video systems, such as digital television broadcasting, seismic data collection, 3D/4D medical imaging, and telemedicine, because of its potential for perfect reconstruction and its lack of blocking artefacts [1, 2]. FPGA technology has been proposed as a practical hardware solution for this task because of its low cost, highly parallel processing ability, and reconfigurability [3, 4].
There are three basic kinds of structures [5] to implement wavelet transforms: convolution-based filter bank structures, lifting factorisation-based structures, and B-spline-based structures. While the lifting structure can significantly reduce the number of multiplications and accumulations, the convolution architectures can exploit constant multiplication algorithms, such as canonical signed digit (CSD) arithmetic [5], residue number systems [6], or distributed arithmetic (DA) [7, 8].
Reference [5] shows that the convolution-based architecture for the biorthogonal 9/7 wavelet transform can have a lower area cost, higher throughput, and lower power consumption than lifting-based structures.
The 3D wavelet transform is a challenge for FPGA implementation because of its heavy demands on hardware area, memory management and computing speed. Some 4D medical imaging systems even need to perform 3D wavelet transforms on multiple 3D volume images. While several 1D and 2D DWT architectures have been introduced and evaluated [4-8], very few 3D architectures have been reported. References [9, 10] report several CSD-based and Booth multiplier-based 3D DWT architectures.
Among various constant multiplication algorithms, DA is reported to have advantages over other algorithms in its low area cost and low circuit complexity, and has been used for 1D and 2D DWT processors [6, 7]. In this Letter, the Daubechies 9/7 coefficients were chosen as the basis for the system because they are suitable for high-ratio compression, denoising and restoration of image and video, and have been applied in the Motion JPEG standard [5]. In the following Sections, a DA-based area-efficient high-speed 3D DWT architecture is presented that exploits the symmetry of the biorthogonal 9/7 coefficients to reduce area cost further.
3D wavelet transform:
According to Mallat's pyramid algorithm, the iterative process of 1D DWT decomposition can be performed by highpass and lowpass filters:
where the octave average signals c(m, l − 1) of the (l − 1)th level decomposition are fed back into successive filters. Similarly, 3D DWT decomposition can also use the pyramid architecture, as shown in Fig. 1a. 3D DWT can be considered as a combination of three 1D DWTs in the x-, y- and z-directions, as shown in Fig. 1b. Hence, the preliminary work in DWT processor design is the design of highpass and lowpass polyphase filters, which are convolutions of filter coefficients and input pixels.
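As an illustration of the pyramid recursion described above, the following is a minimal software sketch of one filter-and-downsample level and of the feedback of the average signal into the next level. It assumes the usual Mallat form in which both outputs are computed at every second input position; the function names, the symmetric boundary extension, and the 3-tap demonstration filters are assumptions made for this example and are not the Daubechies 9/7 filters or the hardware design of this Letter.

```python
def analysis_level(c_prev, h, g):
    """One pyramid level: filter c(m, l-1) with h and g, keep every second output.

    h and g are odd-length tap lists centred on their middle tap.
    """
    n_in = len(c_prev)

    def sample(i):
        # Whole-sample symmetric extension at both ends (an assumption;
        # valid while the filter half-length is smaller than the signal).
        if i < 0:
            i = -i
        if i >= n_in:
            i = 2 * (n_in - 1) - i
        return c_prev[i]

    def filt(taps, centre):
        half = len(taps) // 2
        return sum(t * sample(centre + k - half) for k, t in enumerate(taps))

    half_len = n_in // 2
    c = [filt(h, 2 * n) for n in range(half_len)]  # octave average signal c(n, l)
    d = [filt(g, 2 * n) for n in range(half_len)]  # detail signal d(n, l)
    return c, d


def pyramid(signal, h, g, levels):
    """Feed the average signal back into the same filters `levels` times."""
    c, details = list(signal), []
    for _ in range(levels):
        c, d = analysis_level(c, h, g)
        details.append(d)
    return c, details


# Placeholder filters for demonstration only (not the 9/7 coefficients).
h_demo = [0.25, 0.5, 0.25]
g_demo = [-0.5, 1.0, -0.5]
c_out, d_out = pyramid([1.0] * 16, h_demo, g_demo, levels=3)
```

A 3D level would apply the same routine along the x-, y- and z-directions in turn; that step is omitted here to keep the sketch short.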
Tap-merged DA implementation: In this Letter, we use Daubechies 9/7 coefficients for the 3D DWT processor. The LUT size used in DA architectures can be reduced by exploiting the symmetry of the Daubechies 9/7 coefficients with the following equations:
$$c(n, l) = h(0)\,c(n, l-1) + \cdots$$
This tap-merging strategy reduces the number of taps from 9 and 7 to 5 and 4, and shrinks the DA LUTs from 2^9 words to 2^5 words for the lowpass filter and from 2^7 words to 2^4 words for the highpass filter, as shown in Fig. 2. After the tap-merging additions, the word length of c(m, l − 1) increases to 10 bits, so the full bit-parallel implementation of the subfilter needs ten 2^5-word DA LUTs and ten 2^4-word DA LUTs. The outputs of the ten highpass and ten lowpass LUTs in the bit-parallel DA implementation are summed by two 10-input carry-save adder (CSA) trees, one for the highpass and one for the lowpass polyphase subfilter.

3D DWT architecture: The area reduction achieved in the polyphase subfilters by the tap-merged DA described above makes it possible to map three pipelined subfilters onto one FPGA chip for x-, y- and z-direction analysis. The overall topology of the 3D DWT processor is shown in Fig. 3.
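The tap-merged DA computation of the lowpass subfilter can be modelled in software as a sketch. The code below folds a 9-sample window into 5 merged inputs, builds a 2^5-word LUT of coefficient partial sums, and then performs one LUT lookup per bit plane of the 10-bit merged inputs, which mirrors the ten LUTs feeding an adder tree. The integer coefficients are hypothetical placeholders rather than quantised 9/7 values, and the function names and the two's complement input format are assumptions of this example.

```python
def build_lut(coeffs):
    """Precompute every partial sum of the coefficients: 2**len(coeffs) words."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def merge_taps(window):
    """Fold a symmetric odd-length window: centre sample, then mirrored pairs."""
    mid = len(window) // 2
    return [window[mid]] + [window[mid - k] + window[mid + k]
                            for k in range(1, mid + 1)]


def da_dot(merged, lut, bits=10):
    """Bit-parallel DA: one LUT lookup per bit plane, summed with bit weights.

    Inputs are treated as `bits`-wide two's complement values (assumption).
    """
    acc = 0
    for j in range(bits):
        addr = 0
        for k, x in enumerate(merged):
            addr |= ((x >> j) & 1) << k          # bit j of every merged input
        # The most significant bit carries a negative weight in two's complement.
        weight = -(1 << j) if j == bits - 1 else (1 << j)
        acc += weight * lut[addr]
    return acc


# Hypothetical integer coefficients h(0)..h(4); NOT the real 9/7 values.
half_taps = [10, 7, -3, 2, 1]
window = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lut = build_lut(half_taps)
y = da_dot(merge_taps(window), lut)
```

In hardware the ten weighted lookups happen in parallel and are summed by the CSA tree; the loop here only reproduces the arithmetic, not the timing.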
To reduce the memory required for intermediate results, the 128 × 128 × 128 volume image is split into eight blocks of 68 × 68 × 68 pixels, with a four-pixel overlap between adjacent blocks because of the edge extension in the wavelet transform. The intermediate RAMs hold 10 lines of 68 pixels and 10 blocks of 68 × 68 pixels. The block arrow in Fig. 3 denotes a parallel input of nine pixels for the y-direction or z-direction subfilters. All polyphase subfilters produce two outputs in every cycle: an octave average signal c(n, l) and a detail signal d(n, l).
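One way to read the block partition above is that each axis is cut in half and every block is extended by four pixels into its interior neighbour, which gives 64 + 4 = 68 pixels per axis. The sketch below computes the index ranges under that reading; the function name and the choice to add no extension at the outer faces of the volume are assumptions of this example, not details taken from the design.

```python
from itertools import product


def block_ranges(size=128, parts=2, overlap=4):
    """Start/stop indices of each block along one axis.

    Each block covers its own size // parts samples plus `overlap` samples
    borrowed from each interior neighbour (assumed interpretation).
    """
    core = size // parts
    ranges = []
    for p in range(parts):
        start = max(p * core - overlap, 0)
        stop = min((p + 1) * core + overlap, size)
        ranges.append((start, stop))
    return ranges


axis = block_ranges()                    # [(0, 68), (60, 128)]
blocks = list(product(axis, repeat=3))   # one (x, y, z) range triple per block
pixels_per_block = [
    (x1 - x0) * (y1 - y0) * (z1 - z0)
    for (x0, x1), (y0, y1), (z0, z1) in blocks
]
```

Under this reading there are eight blocks of 68 × 68 × 68 = 314432 pixels each, matching the figures quoted in the text.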
Fig. 3 3D DWT processor architecture
The proposed 3D DWT architecture was implemented in VHDL and synthesised for Xilinx Virtex-E FPGAs. Table 1 gives the results compared with previously published architectures. The average area cost per subfilter (about 423 slices) of the proposed architecture is much lower than in existing reports [6, 10]. This low area cost makes it possible to put three pipelined subfilters together to attain a high sampling rate, which may be very useful for many compute-intensive real-time applications, such as digital video and 3D/4D medical imaging. Multiple-level DWT decompositions can be done by feeding the 3D octave average signal back to the processor. Because of the downsampling in each pass, the computation of a five-level decomposition is (1 + 1/8 + 1/64 + 1/512 + 1/4096) ≈ 1.15 times that of the first-level decomposition. Therefore, a five-level decomposition of a standard 128 × 128 × 128 3D fMRI image can be finished in (8 × 68 × 68 × 68) × 1.15/2 ≈ 1.45 million cycles. After being mapped onto the Xilinx Virtex-E FPGA, the above 3D DWT processor architecture can run at up to 85 MHz (11.2 ns), which corresponds to about 50 such volume images per second.
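The cycle and throughput estimate above can be re-derived with a few lines of arithmetic. The sketch below uses the exact geometric sum rather than the rounded factor of 1.15, so its results come out slightly below the rounded figures in the text; the variable names are illustrative, and the division by two is taken directly from the expression in the text.

```python
levels = 5
# Each extra level works on 1/8 of the previous volume: 1 + 1/8 + ... + 1/4096.
work_factor = sum(8 ** -k for k in range(levels))

first_level_pixels = 8 * 68 ** 3          # eight blocks of 68 x 68 x 68 pixels
cycles = first_level_pixels * work_factor / 2

clock_hz = 85e6                           # reported maximum clock frequency
seconds_per_volume = cycles / clock_hz
volumes_per_second = 1 / seconds_per_volume
```

With the exact sum the work factor is about 1.14, the cycle count is about 1.44 million, and one volume takes roughly 17 ms, which is consistent with the rounded values of 1.45 million cycles, about 20 ms, and about 50 volumes per second.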
Conclusion:
The proposed DA-based architecture can save hardware costs while retaining the capability of high throughput. With the processor running at 85 MHz, it can process a five-level DWT analysis of the 128 × 128 × 128 fMRI volume image in about 20 ms. Such a high-speed processing ability should be attractive for many real-time digital video and 3D/4D medical imaging technologies.
