Abstract: A Lifting Scheme based 2-D Inverse Discrete Wavelet Transform (2-D IDWT) core for JPEG 2000 is implemented on an FPGA following a new approach of reusing hardware components. The approach leads to higher area efficiency and speed optimization. The design, realized with the Le Gall 5/3 filter, achieves significant acceleration: it executes at over 300 MHz with a throughput of 7.13 Msamples/sec while using less than 1% of the logic elements in an Altera Stratix II FPGA. High-quality reconstructed images are extracted from Matlab and VHDL simulations. Implementation details of the individual hardware blocks, synthesis results, and performance analysis are presented.
Introduction
The use of the Lifting Scheme (LS) based 2-D IDWT in JPEG 2000 [1] has sparked interest in hardware designs with optimized area, speed, very low power consumption, and high throughput. Software-based implementations provide the greatest flexibility in terms of the selection of wavelet, bits per pixel, number of transform levels, image size, and so on, but they may not meet the timing constraints of many real-time applications. Hardware implementation allows a considerable speed-up to be obtained. ASIC designs provide the fastest operation since they use dedicated hardware; the tradeoff for this speed, however, is a lack of support for parameterization. Moreover, their long-term availability is a concern in several of the end markets in which this technology is of interest. FPGAs become the most logical choice, as they have become fairly cheap, offer good performance, and ease development immensely compared to ASIC design.
A low-complexity lifting architecture with a 54.3 MHz clock speed is achieved in [2]. The reconfigurable architecture in [3] achieved 112 MHz with a tile size of 128 × 128 using 473 lookup tables (LUTs) in a Xilinx FPGA, with a throughput of 400 pixels/sec. The 2-D DWT in [4] claimed 100% hardware utilization with regular data flow and low control complexity, but reached a clock speed of only 66.89 MHz with a tile size of 64 × 64 pixels. That architecture consumed 7726 LEs in a Xilinx FPGA, and its reported throughput was 66.8 Mpixels/sec with a latency of 332 clock cycles. Barco-Silex reported [5] an IDWT core with a 129 MHz clock speed that supports a 128 × 128 tile size and uses 4544 LEs in an Altera FPGA, with a throughput of 96% LL.
To achieve higher processing speed and area efficiency, a 2-D IDWT hardware module is designed with acceptable parameter values for VLSI applications. The decomposed coefficients of the forward-transformed image are reconstructed using the developed module. The VHDL code is developed as smaller sub-modules in Altera Quartus II.
The controller sub-modules are implemented as Finite State Machines (FSMs), which update (or transform) each raw input sample progressively via intermediate states. By arranging two consecutive memories, data-in and data-out, in parallel, one pixel is processed per clock cycle, which yields a higher operating frequency with a lower latency (50 ns). The design is further optimized using inter-module pipeline stages. The performance evaluation of the 2-D IDWT module gives an operating frequency of 300 MHz, a power dissipation of 666 mW, and a usage of 57 ALUTs, with a throughput of 7.13 Msamples/sec.
Lifting Scheme realization for the IDWT
The IDWT has traditionally been implemented using convolution or filter banks, which require both a large number of arithmetic computations and a large memory for storage. Neither feature is desirable for high-speed or low-power image processing applications. The LS based IDWT has many advantages over the traditional convolution-based approach: it reduces the computational and memory cost and allows "in-place" computation [2].
The main feature of the LS is to break up the high-pass and low-pass wavelet filters into a sequence of upper and lower triangular matrices. The factorization is obtained by using an extension of the Euclidean algorithm, resulting in banded matrix multiplications.
In our implementation, the Le Gall (5, 3) filter has been used, with
So the polyphase matrix of the filter bank is
Based on the LS, the possible factorization leads to a band matrix multiplication
We consider the even terms of the output stream as the low-pass sub-band and the odd terms as the high-pass sub-band.
The above matrices in the time domain can be represented as
Here x is an input stream of length N and y denotes the transformed signal values. The IDWT is derived by first scaling the low-pass and high-pass sub-bands with their respective coefficients, then applying the dual and primal lifting steps with the signs of the coefficients reversed, and finally applying the inverse lazy transform, which upsamples the outputs before merging them into a single reconstructed stream.
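The order of operations described above (undo the update step, undo the predict step, then interleave the even and odd samples) can be illustrated with a minimal software sketch. It assumes the integer-to-integer (reversible) form of the 5/3 lifting steps used in JPEG 2000, an even-length signal, and symmetric boundary extension. In this reversible form no sub-band scaling is applied, so the scaling stage mentioned above is omitted. The function names `forward_53` and `inverse_53` are hypothetical and are not taken from the VHDL design.

```python
def forward_53(x):
    """Forward reversible 5/3 lifting on an even-length list of integers.

    Returns (low, high): the low-pass and high-pass sub-bands.
    Assumption: symmetric extension at both signal boundaries.
    """
    half = len(x) // 2
    even, odd = x[0::2], x[1::2]
    # Predict (dual lifting): high-pass from odd samples and even neighbours.
    high = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
            for i in range(half)]
    # Update (primal lifting): low-pass from even samples and high-pass neighbours.
    low = [even[i] + ((high[max(i - 1, 0)] + high[i] + 2) >> 2)
           for i in range(half)]
    return low, high


def inverse_53(low, high):
    """Inverse reversible 5/3 lifting: same steps, reversed order and signs."""
    half = len(low)
    # Undo the update step to recover the even samples.
    even = [low[i] - ((high[max(i - 1, 0)] + high[i] + 2) >> 2)
            for i in range(half)]
    # Undo the predict step to recover the odd samples.
    odd = [high[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    # Inverse lazy transform: interleave even and odd samples.
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


samples = [10, 12, 14, 200, 3, 7, 9, 1]
low_band, high_band = forward_53(samples)
reconstructed = inverse_53(low_band, high_band)
```

Because each inverse step subtracts exactly the quantity that the matching forward step added, `reconstructed` equals `samples` for any integer input, which is the perfect-reconstruction property the lifting factorization is meant to provide.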
Matlab implementation of the 2-D lifting IDWT algorithm
Results for the 3-level IDWT reconstruction as it appears in memory, using the "Lena" test image, are shown in Fig. 2 (a); the approach can be readily extended to color images. Scalability is verified by simulating the IDWT with block sizes of 256, 128, 64, 32, and 16. We evaluated the performance of the parallel implementations in terms of memory usage and the total number of computations. Our results showed lower memory requirements due to the "in-place" lifting computations and significantly less computational overhead due to the minimized inter-block communication.
The control units coordinate the steps needed to process the whole image and are responsible for generating enable signals, address lines, and so on. At the end, the inverse transformed image coefficients are available in the internal memory. All necessary boundary information is included in the computation. The level of the transform is controlled by the 3-bit "Level" signal (a maximum of 7 transform levels). Fig. 1 (b) shows the top-level view of the 2-D IDWT architecture in the prototyping hardware. It works on a single-phase clock. The IDWT implementation requires 2 data inputs at a time in order to calculate an output coefficient. The proposed architecture accommodates this by introducing appropriate pipeline/delay stages for the data inputs within the IDWT core.
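The multi-level, in-place reconstruction described in this section can be sketched in software as follows. This is an illustrative model only, not the Matlab or VHDL code of the design. It assumes a square image whose side is divisible by 2 for every level, the reversible integer 5/3 lifting steps with symmetric extension, a sub-band layout in which each 1-D pass stores the low-pass half followed by the high-pass half, and a forward transform that processes rows before columns. All function names are hypothetical.

```python
def _fwd53(x):
    """1-D forward reversible 5/3 lifting; returns low half then high half."""
    h = len(x) // 2
    even, odd = x[0::2], x[1::2]
    high = [odd[i] - ((even[i] + even[min(i + 1, h - 1)]) >> 1) for i in range(h)]
    low = [even[i] + ((high[max(i - 1, 0)] + high[i] + 2) >> 2) for i in range(h)]
    return low + high


def _inv53(c):
    """1-D inverse of _fwd53; c holds the low half followed by the high half."""
    h = len(c) // 2
    low, high = c[:h], c[h:]
    even = [low[i] - ((high[max(i - 1, 0)] + high[i] + 2) >> 2) for i in range(h)]
    odd = [high[i] + ((even[i] + even[min(i + 1, h - 1)]) >> 1) for i in range(h)]
    out = [0] * (2 * h)
    out[0::2], out[1::2] = even, odd
    return out


def _apply(a, n, func, vertical):
    """Apply a 1-D transform to every row (or column) of the top-left n x n block."""
    for k in range(n):
        if vertical:
            col = func([a[r][k] for r in range(n)])
            for r in range(n):
                a[r][k] = col[r]
        else:
            a[k][:n] = func(a[k][:n])


def dwt2d(img, levels):
    """Forward transform: rows then columns, repeated on the LL quadrant."""
    a = [row[:] for row in img]
    n = len(a)
    for _ in range(levels):
        _apply(a, n, _fwd53, vertical=False)
        _apply(a, n, _fwd53, vertical=True)
        n //= 2
    return a


def idwt2d(coef, levels):
    """Inverse transform: start at the coarsest block, columns then rows."""
    a = [row[:] for row in coef]
    n = len(a) >> (levels - 1)
    for _ in range(levels):
        _apply(a, n, _inv53, vertical=True)
        _apply(a, n, _inv53, vertical=False)
        n *= 2
    return a


# Deterministic 16 x 16 test pattern with 8-bit values.
image = [[(31 * r + 17 * c) % 256 for c in range(16)] for r in range(16)]
restored = idwt2d(dwt2d(image, 3), 3)
```

The `levels` argument plays the role of the "Level" signal: each additional level makes the inverse start from a block half the size, and the block doubles after every pass until the full image is rebuilt.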
The DWT 1D Control module performs data multiplexing and also generates the addresses for memory reads and writes. The image height and width, along with the filter coefficients, are passed as parameters to the DWT 2D control module. The IDWTCore module computes the transform coefficients of the input image pixels obtained by the memory read operation. After the computation, the high-pass and low-pass coefficients are passed to the correct memory locations. The inferRAM provides the necessary data input to the IDWTCore. After each level of transformation, the roles of the memory banks are swapped. During vertical operation, the input and output addresses are generated by the DWT 2D control module. These generated addresses are supplied to the memory address input and output buses of the inferRAM. The inferRAM module accepts the write addresses generated by the DWT 1D Control module and the coefficients generated by the IDWTCore module. A single execution of this module writes the two coefficients to the correct memory bank. This module generates the timing for the ram rd (read) signal and the ram wr (write) signal. The data is then passed to the IDWTCore module.
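As a rough behavioural illustration of the address generation and bank swapping described above, the sketch below models a row-major memory in which each 1-D pass reads one low-pass and one high-pass coefficient per step, and two banks that exchange read and write roles after each level. The memory layout, the helper `pair_addresses`, and the class `PingPongMemory` are assumptions made for illustration. They do not describe the actual DWT 1D Control, DWT 2D control, or inferRAM implementations.

```python
def pair_addresses(width, n, index, vertical):
    """Yield (low_addr, high_addr) read pairs for one row or column pass.

    width    -- number of samples per memory row (row-major layout, assumed)
    n        -- side of the square block being reconstructed at this level
    index    -- row number (horizontal pass) or column number (vertical pass)
    vertical -- True to walk down a column, False to walk along a row
    """
    half = n // 2
    for i in range(half):
        if vertical:
            yield i * width + index, (i + half) * width + index
        else:
            yield index * width + i, index * width + i + half


class PingPongMemory:
    """Two banks; one is read while the other is written, then roles swap."""

    def __init__(self, size):
        self.banks = [[0] * size, [0] * size]
        self.read_index = 0

    @property
    def read_bank(self):
        return self.banks[self.read_index]

    @property
    def write_bank(self):
        return self.banks[1 - self.read_index]

    def swap(self):
        # Called once per transform level in this model.
        self.read_index ^= 1


# Address pairs for column 1 of a 4 x 4 block stored in an 8-wide memory.
column_pairs = list(pair_addresses(width=8, n=4, index=1, vertical=True))

memory = PingPongMemory(size=64)
memory.write_bank[0] = 7   # result of one level lands in the write bank
memory.swap()              # the next level reads what was just written
```

Reading both coefficients of a pair in the same step mirrors the requirement that the core needs 2 data inputs to produce an output coefficient.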
Comparisons of simulation results
The majority of the verification of the design was done through post-place-and-route simulation models and comparison with the results obtained in Matlab. We were confident that if the Matlab results were identical to the hardware results, we had correctly implemented the algorithms. For functional and timing simulation, ModelSim-Altera is used with a VHDL test bench and appropriate stimuli to validate the design. The resulting coefficients for 3 different pixel data sets (Test 1, Test 2, and Test 3) are shown in Fig. 2 (c); note that the Matlab and the VHDL coefficients are almost the same. Then, to exercise the complete 2-D IDWT module, the input test image (256 × 256) was fed into the Matlab and VHDL modules. Fig. 2 (a) and (b) show the 1st, 2nd, and 3rd levels of reconstruction in Matlab and VHDL, respectively. Fig. 2 (c) shows some sample inverse transformed coefficients extracted from Matlab and VHDL. Fig. 3, captured from ModelSim-Altera, shows the timing of the internal memory control signals, the outputs produced at each stage, and the loading of the coefficients into the appropriate locations of the internal memory.
After some initial latency, the first coefficient is available at 241 ns. Two consecutive memory reads are performed at a time, which takes 1 clock cycle. The "ram addr" signal holds the address of the memory location to be read from.
In parallel, the "ram wr" signal goes high, and at 291 ns the first inverse transformed coefficient is ready to be stored in memory. The "idwtcore addr" signal provides the address of the memory location to be written. At clock cycle 438157, the "ready" signal goes high, indicating the completion of the transform process.
Fig. 3. Simulation in ModelSim-Altera (timing diagram).
Conclusion
Quartus II Integrated Synthesis (QIS) is used to synthesize the 2-D IDWT design code into the gate-level schematic shown in Fig. 1 (b). QIS reports 57 ALUTs and 60 dedicated logic registers, which is less than 1% logic utilization for the design. The PowerPlay Early Power Estimator reported 0.662 W, and the total thermal power dissipation is 0.661 W. The maximum operating frequency is 300 MHz, meeting all timing requirements without any negative slack or failed path, and the latency of the system is 50 ns. The data transfer time is 438157 clock cycles / 300 MHz, or approximately 1.46 ms. The throughput is 7.13 Msamples/sec at 114 frames/sec.
The hardware development presented in this work provides a novel processing architecture capable of executing the 2-D IDWT algorithm at over 300 MHz to reconstruct images in real time. The system is secure because no external memory is used and the data flow is protected during the whole decoding process. Through parallelization and line-based data processing, high output rates are achieved. The preliminary results are very promising; however, extensive further work is needed to extend the system to handle different arithmetic representations, different wavelet analysis and synthesis schemes, and different architectures.
