In this paper, a highly efficient line-based architecture for the 9/7 discrete wavelet transform (DWT) based on the lifting scheme is proposed. The proposed parallel and pipelined architecture consists of a horizontal filter (HF) and a vertical filter (VF). The critical path of the proposed architecture is reduced. The filter coefficients of the biorthogonal 9/7 wavelet low-pass filter are quantized before implementation in the high-speed computation hardware. In the proposed architecture, all multiplications are performed using fewer shifts and additions.
INTRODUCTION
In the field of digital image processing, the JPEG-2000 standard uses the scalar wavelet transform for image compression [1]; hence, the two-dimensional (2-D) discrete wavelet transform (DWT) has recently been used as a powerful tool for image coding/decoding systems. The 2-D DWT demands massive computation; hence, it requires a parallel and pipelined architecture to perform real-time or on-line video and image coding and decoding, and to be implemented efficiently in application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs). The DWT is the kernel of the compression stage of such a system.
Sweldens proposed using the biorthogonal 9/7 wavelet based on the lifting scheme for lossy compression [2]. The symmetry of the biorthogonal 9/7 filters and the fact that they are almost orthogonal [2] make them good candidates for image compression applications. The coefficients of the filter are quantized before hardware implementation; hence, the multiplier can be replaced by a limited number of shift registers and adders. Thus, hardware cost is reduced and system throughput is improved significantly.
In this paper, we propose a highly efficient architecture for the even and odd parts of the 1-D DWT based on the lifting scheme. The advantages of the proposed architecture are 100% hardware utilization, a multiplierless design, a regular structure, simple control flow and high scalability.
The remainder of the paper is organized as follows. Section 2 presents the lifting-based 2-D discrete wavelet transform algorithm and derives new mathematical formulas. In Section 3, the highly efficient architecture for the lifting-based 2-D DWT is proposed. Finally, the performance of the proposed architecture is compared with previous works, and conclusions are given, in Section 4.
The Lifting-Based 2-D DWT Algorithm
The lifting-based DWT usually requires less computation than the convolution-based approach; however, the savings depend on the length of the filters. Because lifting computes in place, no extra memory buffer is required, which makes it particularly suitable for hardware implementations with limited on-chip memory. Many papers have proposed algorithms and architectures for the DWT [3], [4], [5], [6], [7], [8], [9], but they require massive computation. In 1996, Sweldens proposed a new lifting-based DWT architecture, which requires half the hardware of conventional approaches [2]. The factorization of the discrete wavelet transform into lifting steps is represented as [10]:
where $\alpha$, $\beta$, $\gamma$ and $\delta$ are the coefficients of the lifting scheme, and $\zeta$ and $1/\zeta$ are the scale normalization factors. The architecture based on the lifting scheme consists of a splitting module, two lifting modules and a scaling module. The architecture of the 9/7 1-D DWT based on the lifting scheme is shown in Figure 1.
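As a software illustration of the splitting, lifting and scaling modules described above, the following minimal sketch performs one level of the 1-D 9/7 lifting transform and its inverse. The numeric constants are the widely quoted values for the 9/7 lifting factorization and are assumed here, since this text does not list them. The function names, the symmetric boundary handling, and the convention of scaling the low-pass branch by ζ and the high-pass branch by 1/ζ are likewise assumptions of the sketch, not details of the proposed hardware.

```python
# Widely quoted 9/7 lifting constants (assumed; not listed in this text).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398


def _neighbour_sum(seq, i, offset):
    """Return seq[i] + seq[i + offset], clamping the second index.

    For even-length signals, clamping reproduces whole-sample symmetric
    extension at the borders (an assumption of this sketch).
    """
    j = min(max(i + offset, 0), len(seq) - 1)
    return seq[i] + seq[j]


def dwt97_1d(x):
    """One level of the forward 9/7 lifting DWT; returns (low, high)."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("signal length must be even and at least 2")
    # Splitting module: even samples and odd samples.
    s = [float(v) for v in x[0::2]]
    d = [float(v) for v in x[1::2]]
    n = len(s)
    # First lifting module: predict with ALPHA, update with BETA.
    d = [d[i] + ALPHA * _neighbour_sum(s, i, +1) for i in range(n)]
    s = [s[i] + BETA * _neighbour_sum(d, i, -1) for i in range(n)]
    # Second lifting module: predict with GAMMA, update with DELTA.
    d = [d[i] + GAMMA * _neighbour_sum(s, i, +1) for i in range(n)]
    s = [s[i] + DELTA * _neighbour_sum(d, i, -1) for i in range(n)]
    # Scaling module (assumed convention: low * ZETA, high / ZETA).
    return [ZETA * v for v in s], [v / ZETA for v in d]


def idwt97_1d(low, high):
    """Inverse of dwt97_1d: undo the steps in reverse order."""
    s = [v / ZETA for v in low]
    d = [v * ZETA for v in high]
    n = len(s)
    s = [s[i] - DELTA * _neighbour_sum(d, i, -1) for i in range(n)]
    d = [d[i] - GAMMA * _neighbour_sum(s, i, +1) for i in range(n)]
    s = [s[i] - BETA * _neighbour_sum(d, i, -1) for i in range(n)]
    d = [d[i] - ALPHA * _neighbour_sum(s, i, +1) for i in range(n)]
    # Merge the even and odd samples back into one sequence.
    x = [0.0] * (2 * n)
    x[0::2] = s
    x[1::2] = d
    return x
```

Because every lifting step only adds a function of the other branch, the inverse is obtained by subtracting the same quantities in reverse order, which is why `idwt97_1d(*dwt97_1d(x))` recovers `x` up to floating-point rounding.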
The 9/7 2-D DWT Algorithm
Following the architecture of the 9/7 1-D DWT based on the lifting scheme, the architecture of the modified 9/7 2-D DWT based on the lifting scheme can be derived; it is shown in Figure 2. The equations of the 2-D DWT based on the lifting scheme are represented as follows.
The horizontal filter (HF) is represented as:
The vertical filter (VF) is represented as:
High frequency part:
Low frequency part:
The 23rd Workshop on Combinatorial Mathematics and Computation Theory
The Modified 9/7 2-D DWT Algorithm
According to Eq. (1), the transform matrix of the 9/7 DWT based on the lifting scheme is modified as
The modified horizontal filter (HF) is represented as:
The modified vertical filter (VF) is represented as:
High frequency part:
Finally, the four subbands $HH$, $HL$, $LH$ and $LL$ are obtained from $HH_2$, $HL_2$, $LH_2$ and $LL_2$. The equations of the four subbands are represented as follows:
Based on the equations of the modified horizontal filter (HF), the architecture for the modified HF is proposed and shown in Figure 3. It consists of an input-delay unit, a middle-delay unit, a back-delay unit, five multiplexers and two processing elements (PEs). PE(A/B) performs O1 and PE(C/D) performs O2.
Similarly, the proposed architecture for the modified vertical filter (VF) is shown in Figure 4. It consists of two delay units (Ds), seven long-delay (8 Ds) units, eight multiplexers and two processing elements (PE(A/B) and PE(C/D)). The architecture of the scaling normalization (SN) is shown in Figure 5. The new architecture of the PE is shown in Figure 6. The proposed PE architecture reduces the critical path [11] [12]. The 2-D DWT system is shown in Figure 7.
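The row-then-column structure of the HF and VF stages can be illustrated in software. The sketch below is a behavioural model only: it applies a 1-D 9/7 lifting step to every row (the role of the HF) and then to every column of the two resulting half-images (the role of the VF). It does not model the delay units, multiplexers or pipeline timing of Figures 3 and 4. The lifting constants are the widely quoted 9/7 values and, like the function names, the clamped boundary handling and the naming rule (first letter is the horizontal band, second letter the vertical band), are assumptions of this sketch.

```python
# Widely quoted 9/7 lifting constants (assumed; not listed in this text).
ALPHA, BETA, GAMMA, DELTA = -1.586134342, -0.05298011854, 0.8829110762, 0.4435068522
ZETA = 1.149604398


def lift97(x):
    """1-D 9/7 lifting of an even-length sequence; returns (low, high)."""
    s = [float(v) for v in x[0::2]]  # even samples
    d = [float(v) for v in x[1::2]]  # odd samples
    last = len(s) - 1
    # Two lifting modules, each a predict step followed by an update step.
    for predict, update in ((ALPHA, BETA), (GAMMA, DELTA)):
        d = [d[i] + predict * (s[i] + s[min(i + 1, last)]) for i in range(len(s))]
        s = [s[i] + update * (d[max(i - 1, 0)] + d[i]) for i in range(len(s))]
    # Scaling normalization (assumed convention: low * ZETA, high / ZETA).
    return [ZETA * v for v in s], [v / ZETA for v in d]


def _filter_columns(band):
    """Apply lift97 to every column of `band`; returns (low_rows, high_rows)."""
    columns = [lift97(col) for col in zip(*band)]
    low = [list(row) for row in zip(*(c[0] for c in columns))]
    high = [list(row) for row in zip(*(c[1] for c in columns))]
    return low, high


def dwt97_2d(image):
    """One level of a separable 2-D 9/7 DWT on an N x N list of lists (N even)."""
    # Horizontal filter: process every row.
    rows = [lift97(row) for row in image]
    low_half = [r[0] for r in rows]
    high_half = [r[1] for r in rows]
    # Vertical filter: process every column of each half-image.
    ll, lh = _filter_columns(low_half)
    hl, hh = _filter_columns(high_half)
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}
```

For an $8 \times 8$ input, each returned subband is $4 \times 4$; a constant image puts all of its energy in `LL`, which is a quick sanity check on the lifting constants.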
The High-Efficiency Architecture for Lifting-Based 2-D DWT
For an $8 \times 8$ 2-D DWT, 106 clock cycles are required. Clock cycles 2 to 66 perform O1, clock cycles 7 to 70 perform O2, clock cycles 24 to 87 perform O3, and clock cycles 42 to 106 perform O4. The data flow for the HF is shown in Table 1, and the data flow for the VF is shown in Table 2. Every PE requires $N \times N$ clock cycles to produce its output.
The filter coefficients of the biorthogonal 9/7 wavelet low-pass filter are quantized before implementation in the high-speed computation hardware. In the proposed architecture, all multiplications are performed using shifts and additions after the coefficients are approximated in a Booth binary recoded format. The multiplier is replaced by a carry-save adder (CSA) and three hardwired shifters in the processing element (PE) [13].
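The idea of replacing a multiplier with shifts and additions can be illustrated with a small sketch. It quantizes a coefficient to a fixed number of fractional bits, recodes the result into signed power-of-two terms, and multiplies using only shifts, additions and subtractions. Note that the recoding used here is the canonical signed-digit (non-adjacent) form, a related signed-digit recoding chosen for brevity rather than the Booth recoding of the text. The 8-bit fractional word length, the example coefficient value and the function names are assumptions, and the sketch does not limit the number of terms to the three shifters of the PE.

```python
def csd_terms(coefficient, frac_bits):
    """Quantize `coefficient` to `frac_bits` fractional bits and recode it.

    Returns (shift, sign) pairs, sign in {+1, -1}, such that
    sum(sign * 2**shift) == round(coefficient * 2**frac_bits).
    """
    quantized = round(coefficient * (1 << frac_bits))
    overall_sign = -1 if quantized < 0 else 1
    n = abs(quantized)
    terms = []
    shift = 0
    while n:
        if n & 1:
            # Digit is +1 when n % 4 == 1 and -1 when n % 4 == 3, which
            # guarantees that the next bit position holds a zero digit.
            digit = 2 - (n & 3)
            terms.append((shift, overall_sign * digit))
            n -= digit
        n >>= 1
        shift += 1
    return terms


def shift_add_multiply(x, terms, frac_bits):
    """Multiply integer `x` by the recoded coefficient without a multiplier."""
    acc = 0
    for shift, sign in terms:
        if sign > 0:
            acc += x << shift  # hardwired left shift, then add
        else:
            acc -= x << shift  # hardwired left shift, then subtract
    # Drop the fractional bits (arithmetic shift, i.e. rounding toward -inf).
    return acc >> frac_bits


# Example with an assumed coefficient value and 8 fractional bits.
ALPHA = -1.586134342
terms = csd_terms(ALPHA, frac_bits=8)
product = shift_add_multiply(100, terms, frac_bits=8)
```

The fewer nonzero digits the recoded coefficient has, the fewer adders are needed, which is the reason signed-digit recodings are attractive for fixed-coefficient filters.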
Conclusions and Discussions
The filter coefficients of the biorthogonal 9/7 wavelet are quantized before implementation. The hardware is cost-effective and the system is high-speed. The architecture reduces power dissipation by m compared with conventional architectures for m-bit operands (low-power utilization).
In this paper, a high-efficiency, low-power architecture for the 2-D DWT has been proposed. The CSA (carry-save adder) replaces the multiplier in the proposed architecture. Hence, the architecture performs compression in . It requires $N^2/4$ memories to store $LL(i,j)$ in every stage. The architectures for vertical analysis and horizontal analysis require $7N + 11$ memories to store temporary data. Hence, the buffer size is $N^2/4 + 7N + 11$. The control complexity is simple. The comparison between previous works and this work is shown in Table 3.
The proposed architecture has been verified in Verilog HDL and implemented on an FPGA. The advantages of the proposed architecture are 100% hardware utilization and ultra-low power. The architecture has a regular structure, simple control flow, high throughput and high scalability. Thus, it is very suitable for new-generation image compression systems, such as JPEG-2000.
x(0,0) 1
x(0,1) 2
x(0,2) 3
x(0,3) H_1(0,0) 4
x(0,4) L_1(0,0) 5
x(0,5) H_1(0,1) 6
x(0,6) L_1(0,1) 7
x(0,7) H_1(0,2) H_2(0,0) 8
x(1,0) L_1(0,2) L_2(0,0) 9
x(1,1) H_1(0,3) H_2(0,1) 10
x(1,2) L_1(0,3) L_2(0,1) 11
x(1,3) H_1(1,0) H_2(0,2) 12
x(1,4) L_1(1,
x(3,0) HH_1(0,0) 25
x(3,1) HL_1(0,0) 26
x(3,2) HH_1(0,1) 27
x(3,3) HL_1(0,1) 28
x(3,4) HH_1(0,2) 29
x(3,5) HL_1(0,2) 30
x(3,6) HH_1(0,3) 31
x(3,7) HL_1(0,3) 32
x(4,0) LH_1(0,0) 33
x(4,1) LL_1(0,0) 34
x(4,2) LH_1(0,1) 35
x(4,3) LL_1(0,1) 36
x(4,4) LH_1(0,2) 37
x(4,5) LL_1(0,2) 38
x(4,6) LH_1(0,3) 39
x(4,7) LL_1(0,3) 40
x(5,
