I. INTRODUCTION

WITH the rapid progress of VLSI design technologies, many processors based on audio and image signal processing have been developed recently. The two-dimensional discrete wavelet transform (2-D DWT) plays a major role in the JPEG-2000 image compression standard [1]. Presently, research on the DWT is attracting a great deal of attention [2]-[6]. In addition to audio and image compression [7]-[10], the DWT has important applications in many areas, such as computer graphics, numerical analysis, and radar target discrimination. The architecture of the 2-D DWT is mainly composed of multirate filters. Because extensive computation is involved in practical applications, e.g., digital cameras, high-efficiency and low-cost hardware is indispensable.
At present, many VLSI architectures for the 2-D DWT have been proposed to meet the requirements of real-time processing. However, because the filtering operations are required in both the horizontal and vertical directions, designing a highly efficient architecture at a low cost is difficult. Lewis and Knowles [11] used the four-tap Daubechies filter to design a 2-D DWT architecture. Parhi and Nishitani [12] proposed two architectures that combine the word-parallel and digital-serial methodologies. Chakrabarti and Vishwanath [13] presented the nonseparable architecture and the SIMD array architecture. Vishwanath et al. [14] employed two systolic array filters and two parallel filters to implement the 2-D DWT. The modified version uses four parallel filters as reported in [15] and [16]. Chuang and Chen [17] proposed a parallel pipelined VLSI array architecture for the 2-D DWT. Chen and Bayoumi [18] presented a scalable systolic array architecture. Other 2-D DWT architectures have been reported in [19]-[23].
Among the various architectures, the best-known design for the 2-D DWT is the parallel filter architecture [15], [16]. The design of the parallel filter architecture is based on the modified recursive pyramid algorithm (MRPA) [13], which intersperses the computation of the second and following levels among the computation of the first level. The MRPA is feasible for the 1-D DWT architecture, but is not suitable for the 2-D DWT, because the hardware utilization is inefficient and a complicated control circuit results from the interleaving data flow. Therefore, in this paper, we propose a new VLSI architecture for the separable 2-D DWT. The advantages of the proposed architecture are the 100% hardware utilization, fast computing time, regular data flow, and low control complexity. Additionally, because of the regular structure, the proposed architecture can easily be scaled with the filter length and the 2-D DWT level.
Manuscript received June 3, 1999; revised August 11, 2000. This paper was recommended by Associate Editor J.-N. Hwang.
P.-C. Wu is with the Information Technology Division, Institute for Information Industry, Taipei, Taiwan, R.O.C. (e-mail: pcwu@netrd.iii.org.tw).
L.-G. Chen is with the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, R.O.C. (e-mail: lgchen@cc.ee.ntu.edu.tw).
Publisher Item Identifier S 1051-8215(01)03013-0.
This paper is organized as follows. Section II introduces the 2-D DWT algorithm. Section III discusses the previous design techniques. In Section IV, an efficient architecture for the 2-D DWT is proposed. Section V compares the performance of various 2-D DWT architectures. Finally, we state our conclusions in Section VI.
II. 2-D DWT ALGORITHM
The proposed architecture deals with the separable 2-D DWT, in which each decomposition level applies low-pass and high-pass decimation filters first along the rows and then along the columns of its input. In the first-level decomposition, the input is the original image, and the outputs are the LL band and the three subbands LH, HL, and HH, each with half the width and half the height of the input. In the second-level decomposition, the input is the LL band and the outputs are the LLLL band and the three subbands LLLH, LLHL, and LLHH, again halved in each dimension. In the third-level decomposition, the input is the LLLL band and the outputs are the four subbands LLLLLL, LLLLLH, LLLLHL, and LLLLHH. The multilevel 2-D DWT can be extended in an analogous manner. Fig. 2 shows the result of the "Lena" image after a three-level 2-D DWT.
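As a software illustration of one separable decomposition level (not of the hardware architecture), the following Python sketch filters and decimates each row and then each column. The Haar taps, the zero boundary rule, the output phase, the function names, and the convention that the first letter of a subband name refers to the horizontal filter are all assumptions made for this example.

```python
HAAR_LOW = [0.5, 0.5]    # assumed low-pass taps (Haar), for illustration only
HAAR_HIGH = [0.5, -0.5]  # assumed high-pass taps (Haar), for illustration only


def decimating_filter(samples, taps):
    """FIR-filter a 1-D sequence and keep one output for every two inputs.

    y[n] = sum_k taps[k] * samples[2n + 1 - k]; samples before the start
    are treated as zero (an assumed boundary rule).
    """
    out = []
    for n in range(len(samples) // 2):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = 2 * n + 1 - k
            if idx >= 0:
                acc += tap * samples[idx]
        out.append(acc)
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def dwt2_one_level(band, low, high):
    """One separable level: horizontal filtering, then vertical filtering."""
    # Stage 1: filter every row with the low-pass and high-pass filters.
    rows_l = [decimating_filter(row, low) for row in band]
    rows_h = [decimating_filter(row, high) for row in band]

    def vertical(half_band, taps):
        # Stage 2: filter every column of a stage-1 output.
        cols = [decimating_filter(col, taps) for col in transpose(half_band)]
        return transpose(cols)

    # Assumed naming: first letter = horizontal filter, second = vertical.
    return {
        "LL": vertical(rows_l, low), "LH": vertical(rows_l, high),
        "HL": vertical(rows_h, low), "HH": vertical(rows_h, high),
    }


image = [[float(8 * r + c) for c in range(8)] for r in range(8)]
band, sizes = image, []
for _ in range(3):  # three-level decomposition: feed LL back in each time
    band = dwt2_one_level(band, HAAR_LOW, HAAR_HIGH)["LL"]
    sizes.append((len(band), len(band[0])))
print(sizes)
```

Each level halves both dimensions of the band that is fed back, which is why the multilevel transform only ever needs the previous LL band as input.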
III. PREVIOUS TECHNIQUES
At present, the best-known architecture for the 2-D DWT is the parallel filter architecture [15], [16]. The design of the parallel filter architecture is based on the MRPA [13], which was initially proposed for 1-D DWT architectures. As illustrated in Fig. 3, the MRPA intersperses the computation of the second and following levels among the computation of the first level. Because of the decimation operation, the quantity of processing data in each level is half of that in the previous level. Writing $N$ for the quantity of processing data in the first level and $J$ for the DWT level, the total quantity of processing data can be counted as follows:

$$N + \frac{N}{2} + \frac{N}{4} + \cdots + \frac{N}{2^{J-1}} = 2N\left(1 - 2^{-J}\right)$$

which approaches $N + N = 2N$ if the DWT level is large enough. Because the quantity of processing data in the second and following levels (i.e., $N$) is the same as that in the first level (i.e., $N$), the computing time of the first level can be filled as shown in Fig. 3. The hardware utilization is efficient. Hence, the MRPA is feasible for 1-D DWT architectures.
However, as illustrated in Fig. 4, we find that the MRPA is not suitable for 2-D DWT architectures. Since the quantity of processing data in each level is a quarter of that in the previous level, the total quantity of processing data is counted as follows:

$$N^2 + \frac{N^2}{4} + \frac{N^2}{16} + \cdots + \frac{N^2}{4^{J-1}} = \frac{4}{3} N^2 \left(1 - 4^{-J}\right) \tag{7}$$

where $J$ is the 2-D DWT level, $N^2$ is the quantity of processing data in the first level, $N^2/4$ is that of the second level, and so on up to $N^2/4^{J-1}$ for the $J$th level. If the 2-D DWT level is large enough, (7) will become

$$N^2 + \frac{1}{3} N^2 = \frac{4}{3} N^2. \tag{8}$$

Because the quantity of processing data in the second and following levels (i.e., $N^2/3$) is only one third of that in the first level (i.e., $N^2$), the computing time of the first level cannot be filled, as shown in Fig. 4. Hence, the hardware utilization is inefficient, and a complicated control circuit results from the interleaving data flow.
Table I lists the hardware utilization of the parallel filter architecture for different 2-D DWT levels. In a one-level 2-D DWT, the hardware utilization of the parallel filter architecture is only 50%. As the 2-D DWT level increases, the utilization converges to 66.67%. Because of this inefficient hardware utilization, the parallel filter architecture requires a longer computing time, which is the main problem of the parallel filter architecture as well as of present 2-D DWT architecture designs.
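The counting argument of this section can be checked numerically. The Python sketch below sums the per-level data quantities, normalized to the first level, for the 1-D case (each level half the previous one) and the 2-D case (each level a quarter of the previous one). The utilization model is an assumption introduced here, not a formula taken from the text: it supposes that an MRPA-style schedule reserves as many time slots for the later levels as for the first level, which reproduces the 50% and 66.67% figures quoted for Table I. All function names are hypothetical.

```python
from fractions import Fraction


def later_to_first(shrink, levels):
    """Data processed in levels 2..J, relative to the data in level 1.

    shrink is 2 for the 1-D DWT and 4 for the 2-D DWT.
    """
    return sum(Fraction(1, shrink) ** j for j in range(1, levels))


def utilization_model(levels):
    """Assumed model: slots used / slots reserved by an MRPA-style schedule.

    Reserved slots: one share for level 1 plus an equal share for all
    later levels. Used slots: level 1 plus the actual later-level data.
    """
    return (1 + later_to_first(4, levels)) / 2


for levels in (1, 2, 3, 8):
    print(levels,
          float(later_to_first(2, levels)),   # 1-D: tends to 1
          float(later_to_first(4, levels)),   # 2-D: tends to 1/3
          f"{float(utilization_model(levels)):.2%}")
```

Under this model the 1-D ratio tends to 1, so the interleaved slots are filled, while the 2-D ratio tends to 1/3, so about a third of the hardware time stays unused even for deep decompositions.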
IV. PROPOSED 2-D DWT ARCHITECTURE
The block diagram of the proposed 2-D DWT architecture is shown in Fig. 7, which includes a transform module, a RAM module, and a multiplexer. The RAM module is sized to hold the largest band it stores, the first-level LL band, which is a quarter of the input image. The decomposition scheme proceeds level by level and is described as follows. In the first-level decomposition, the multiplexer selects data from the input image. The transform module decomposes the input image into the four subbands LL, LH, HL, and HH, and saves the LL band to the RAM module. After finishing the first-level decomposition, the multiplexer selects data from the RAM module. The LL band is then sent into the transform module to perform the second-level decomposition. The transform module decomposes the LL band into the four subbands LLLL, LLLH, LLHL, and LLHH, and saves the LLLL band to the RAM module. After finishing the second-level decomposition, the multiplexer again selects data from the RAM module. The LLLL band is then sent into the transform module to perform the third-level decomposition. The transform module decomposes the LLLL band into the four subbands LLLLLL, LLLLLH, LLLLHL, and LLLLHH, and saves the LLLLLL band to the RAM module. This procedure repeats until the desired level (i.e., the last level) is finished.
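The level-by-level control flow described above is simple enough to tabulate. The following Python sketch lists, for each level, the multiplexer source, the band fed to the transform module, the four output subbands, and the band written back to the RAM module. The function name and the dictionary layout are hypothetical; only the sequence of operations follows the description.

```python
def decomposition_schedule(levels):
    """Per-level data flow: multiplexer source, input band, outputs, RAM write."""
    steps = []
    prefix = ""  # empty prefix stands for the input image
    for level in range(1, levels + 1):
        steps.append({
            "level": level,
            # Level 1 reads the image; every later level reads the RAM module.
            "mux": "input image" if level == 1 else "RAM module",
            "input": prefix or "image",
            "outputs": [prefix + s for s in ("LL", "LH", "HL", "HH")],
            # Only the low-low band is kept for the next level.
            "saved_to_ram": prefix + "LL",
        })
        prefix += "LL"
    return steps


for step in decomposition_schedule(3):
    print(step)
```

Because every level repeats the same read, transform, and write-back pattern, the control reduces to switching the multiplexer once after the first level.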
The advantage of such a scheme is that the data flow is very regular, so we can concentrate our effort on designing the transform module efficiently. As shown in Fig. 8, the transform module is tree-structured and comprises two stages. Stage 1 performs horizontal filtering, and stage 2 performs vertical filtering. To design the transform module efficiently, we assume "$A$" to be the area cost and "$T$" to be the time cost required in stage 1. According to the original design shown in Fig. 8, the number of filters required in stage 2 is double that of stage 1. That is, $2A$ is the area cost required in stage 2. On the other hand, because of the decimation operation in stage 1, the quantity of data for filtering in each filter of stage 2 is half that of stage 1. Hence, the computing time required in stage 2 is half of $T$, i.e., $T/2$. Since stage 2 is cascaded after stage 1, stage 2 cannot work until stage 1 finishes its job. Therefore, part of the hardware in stage 2 will be idle. In other words, the hardware utilization of the original design is inefficient.
To solve this problem, we consider a single decimation filter as shown in Fig. 9. The frequency labels "$f$" and "$f/2$" imply that the output frequency is half the input frequency. The decimation filter can be implemented directly by a filter followed by a twofold decimator. However, the decimator discards one out of every two samples at the filter output, causing poor hardware utilization. Hence, we employ two different design techniques to enhance its performance. The first technique is the polyphase decomposition technique illustrated in Fig. 10, which decomposes the filter coefficients into even-ordered and odd-ordered parts. In the even clock cycles, the input data are fed to the odd part and multiplied by the odd-ordered coefficients. In the odd clock cycles, the input data are fed to the even part and multiplied by the even-ordered coefficients. The output data are the sum of the odd and even parts. After the polyphase decomposition technique is applied, the internal clock rate is half the input clock rate. Therefore, we can double the input clock rate to increase the throughput. For the same quantity of processing data, the computing time is halved. Thus, this technique reduces the time cost to a half. We use the symbol " " to represent the polyphase decomposition technique. Table II shows the data flow of the decimation filter employing the polyphase decomposition technique.
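The equivalence between the direct form (filter, then discard every second output) and the polyphase form can be checked in software. The Python sketch below is a behavioral model under stated assumptions, not a description of Fig. 10: it assumes zero initial state, assumes that the retained outputs are the odd-indexed ones (so that even-cycle inputs meet the odd-ordered coefficients, as in the description above), and uses hypothetical function names.

```python
def filter_then_decimate(x, h):
    """Direct form: full-rate FIR filter, then keep every second output."""
    full = [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]
    return full[1::2]  # assumed phase: odd-indexed outputs are kept


def polyphase_decimate(x, h):
    """Polyphase form: each retained output is computed once, at half rate."""
    h_even, h_odd = h[0::2], h[1::2]    # even- and odd-ordered coefficients
    x_even, x_odd = x[0::2], x[1::2]    # inputs of even and odd clock cycles
    out = []
    for n in range(len(x) // 2):
        acc = 0
        # Odd part: even-cycle inputs times odd-ordered coefficients.
        for m, c in enumerate(h_odd):
            if n - m >= 0:
                acc += c * x_even[n - m]
        # Even part: odd-cycle inputs times even-ordered coefficients.
        for m, c in enumerate(h_even):
            if n - m >= 0:
                acc += c * x_odd[n - m]
        out.append(acc)
    return out


x = list(range(1, 17))
h = [1, 3, 3, 1]  # arbitrary integer taps chosen for exact comparison
print(filter_then_decimate(x, h) == polyphase_decimate(x, h))
```

The polyphase form performs no multiplication whose result is later discarded, which is the source of the halved internal clock rate.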
The second technique is the coefficient folding technique. As illustrated in Fig. 11, every two coefficients share one set of a multiplier, adder, and register, and the switches control the data path. The operation of Fig. 11 is described as follows. Consider PE0 first. In clock cycle 0, the input data is multiplied by one of the two coefficients assigned to PE0 and added to the content of R1 (initially zero). The result is then stored in R0. In clock cycle 1, the input data is multiplied by the other coefficient of PE0 and added to the content of R0. The result is then output. In clock cycle 2, the input data is again multiplied by the first coefficient and added to the content of R1. The result is then stored in R0. In clock cycle 3, the input data is multiplied by the second coefficient and added to the content of R0. The result is then output. The following clock cycles are arranged in an analogous manner. The operation of PE1 is similar to that of PE0. Because every two coefficients share one set of a multiplier, adder, and register, this technique reduces the area cost to approximately a half. We use the symbol " " to represent the coefficient folding technique. Table III shows the data flow of the decimation filter employing the coefficient folding technique.
Now, we apply these two design techniques to the decimation filters of stages 1 and 2, respectively. Hence, four different design methods are derived for the transform module. The design strategy (including the original design) is listed in Table IV. Let $A$ and $T$ denote the area cost and the time cost of stage 1 in the original design. From Table IV, we find that if we apply the polyphase decomposition technique to stage 1 and the coefficient folding technique to stage 2, the area cost and the time cost are the same in both stages, namely $A$ and $T/2$. Thus, the total area cost is $2A$ and the total time cost is $T/2$. The AT product is reduced from $3AT$ to $AT$, and no hardware is idle in stage 2. Therefore, the new design method is three times as efficient as the original design.
In contrast, the other design methods, as listed in Table IV, cause the hardware to be idle in stage 2. Hence, they are not efficient design schemes.
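The comparison of design methods can be reproduced with a small cost model. The Python sketch below is an assumed model, not a copy of Table IV: the original stage 1 has area cost $A$ and time cost $T$, the original stage 2 has area cost $2A$ and time cost $T/2$, the polyphase decomposition halves a stage's time, the coefficient folding halves a stage's area, and the slower stage sets the overall time. The names `apply_technique` and `evaluate` are hypothetical.

```python
from fractions import Fraction

A = Fraction(1)  # area cost of stage 1 in the original design (normalized)
T = Fraction(1)  # time cost of stage 1 in the original design (normalized)


def apply_technique(area, time, technique):
    """Return (area, time) of a stage after applying one technique."""
    if technique == "polyphase":   # halves the time cost
        return area, time / 2
    if technique == "folding":     # roughly halves the area cost
        return area / 2, time
    return area, time              # "none": original decimation filters


def evaluate(stage1, stage2):
    a1, t1 = apply_technique(A, T, stage1)
    a2, t2 = apply_technique(2 * A, T / 2, stage2)
    area = a1 + a2
    time = max(t1, t2)             # assumption: the slower stage sets the pace
    return {"area": area, "time": time, "AT": area * time,
            "stage2_idle": t2 < t1}


for s1, s2 in [("none", "none"), ("polyphase", "polyphase"),
               ("polyphase", "folding"), ("folding", "polyphase"),
               ("folding", "folding")]:
    print(s1, s2, evaluate(s1, s2))
```

In this model only the combination of polyphase decomposition in stage 1 and coefficient folding in stage 2 balances the two stages, so it is the only one that leaves stage 2 without idle time.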
In stage 2 of the transform module, because the image data are fed in raster-scan order, each coefficient requires a line delay to store the row data for vertical filtering. Therefore, the registers in Fig. 11 need to be replaced with line delays for vertical filtering. Fig. 12 shows the modified result. The data flow is shown in Table V, whose entries denote the row data by row index.
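A behavioral model helps to see how folded processing elements with line delays perform vertical filtering. The Python sketch below is one wiring that is consistent with the description of PE0 (even cycles accumulate into the PE's own storage using the neighbor's content, odd cycles finish the sum), but the assignment of coefficients to cycles, the zero initial state, and all names are assumptions rather than details taken from Fig. 12. Each input is a whole row, and each line delay holds one row.

```python
def folded_vertical_filter(rows, h):
    """Simulate len(h)/2 PEs, each sharing one multiplier, adder and line delay."""
    assert len(h) % 2 == 0
    n_pe = len(h) // 2
    width = len(rows[0])
    # delay[k] is the line delay of PE k; delay[n_pe] is a constant zero row.
    delay = [[0] * width for _ in range(n_pe + 1)]
    outputs = []
    for cycle, row in enumerate(rows):
        if cycle % 2 == 0:
            # Even row: odd-ordered coefficient plus the neighbor's old content.
            new = [[h[2 * k + 1] * v + d for v, d in zip(row, delay[k + 1])]
                   for k in range(n_pe)]
            delay[:n_pe] = new  # all line delays update together
        else:
            # Odd row: even-ordered coefficient plus the PE's own content.
            for k in range(1, n_pe):
                delay[k] = [h[2 * k] * v + d for v, d in zip(row, delay[k])]
            outputs.append([h[0] * v + d for v, d in zip(row, delay[0])])
    return outputs


def direct_vertical_filter(rows, h):
    """Reference: out[m] = sum_k h[k] * rows[2m + 1 - k], zero before row 0."""
    out = []
    for m in range(len(rows) // 2):
        acc = [0] * len(rows[0])
        for k, c in enumerate(h):
            r = 2 * m + 1 - k
            if r >= 0:
                acc = [a + c * v for a, v in zip(acc, rows[r])]
        out.append(acc)
    return out


rows = [[3 * r + c for c in range(3)] for r in range(8)]
taps = [1, 2, 3, 4]  # arbitrary integer taps chosen for exact comparison
print(folded_vertical_filter(rows, taps) == direct_vertical_filter(rows, taps))
```

The model uses two multiplications per input row for a four-tap filter, and it emits one output row for every two input rows.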
Every two input rows generate one output row. Fig. 13 shows the structure of the line delay, which is composed of select signals and storage blocks. The size of the line delay in the different decomposition levels is described below.
In the first-level decomposition, the corresponding select signal is enabled, and the others are disabled. The size of the line delay is the sum of all storage blocks, which is $N/2$ for an $N \times N$ input image (14). Thus, it can store the row data output from the decimation filter of stage 1. In the second-level decomposition, the next select signal is enabled, and the others are disabled. The size of the line delay is the sum of the previous storage blocks, which is $N/4$.
In the following decomposition levels, the select signals change the size of the line delay, for an $N \times N$ input image, to $N/8$ in the third level, $N/16$ in the fourth level, and in general $N/2^j$ in the $j$th level, down to $N/2^J$ in the $J$th (last) level.
Assume that the low-pass filter and the high-pass filter each have four taps. Fig. 14 illustrates the transform module employing both the polyphase decomposition and the coefficient folding techniques. The frequency labels "$f$," "$f/2$," and "$f/4$" imply that the output frequency is a quarter of the input frequency. In stage 1, because we use the FIR direct form to implement the polyphase decomposition technique, the low-pass and high-pass decimation filters can share the same registers. Here, we have assumed that the filters in stages 1 and 2 have the same length, but in practice, this condition is not necessary for correct operation. In addition, because we employ the polyphase decomposition technique in the decimation filters of stage 1, the internal clock rate of the transform module is half the input clock rate. Fig. 15 illustrates the three-level 2-D DWT in the proposed architecture. The decomposition begins with an $8 \times 8$ block in the first level, and ends with four pixels in the third level. Table VI shows the data flow according to the ports of the transform module. Clock cycles 0-63 perform the first-level decomposition, clock cycles 64-79 perform the second-level decomposition, and clock cycles 80-83 perform the third-level decomposition. Because of the regular structure, the proposed architecture can be easily scaled with the filter length and the 2-D DWT level.
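The clock-cycle ranges quoted for the three-level example follow from the band sizes. The Python sketch below assumes a square block whose side halves at every level and assumes that one clock cycle is counted per input sample of each level; with an 8 x 8 block (an assumption that matches the 64, 16, and 4 cycles of the three levels) it reproduces the ranges given above. The function name is hypothetical.

```python
def level_schedule(block_side, levels):
    """Return (level, first_cycle, last_cycle), one cycle per input sample."""
    schedule = []
    start = 0
    side = block_side
    for level in range(1, levels + 1):
        cycles = side * side          # samples fed to the transform module
        schedule.append((level, start, start + cycles - 1))
        start += cycles
        side //= 2                    # the next level processes the LL band
    return schedule


for level, first, last in level_schedule(8, 3):
    print(f"level {level}: clock cycles {first}-{last}")
```

Extending the transform by one level only appends a range a quarter as long as the previous one, which is why the architecture scales easily with the 2-D DWT level.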
V. PERFORMANCE COMPARISONS
The typical 2-D DWT architectures include the parallel filter architecture [16], direct architecture [14], nonseparable architecture [13], SIMD architecture [13], and systolic-parallel architecture [14]. In Table VII, we compare the performance of our architecture and these 2-D DWT architectures in terms of the number of multipliers, the number of adders, storage size, computing time, control complexity, and hardware utilization. The computing time has been normalized to the same internal clock rate. The parameters used in Table VII are the filter length, the image size ($N^2$), and the 2-D DWT level ($J$). The computing time of our architecture is derived as follows:

$$T_{\mathrm{comp}} = \frac{1}{2}\left(N^2 + \frac{N^2}{4} + \cdots + \frac{N^2}{4^{J-1}}\right) = \frac{2}{3} N^2 \left(1 - 4^{-J}\right) \tag{16}$$

where the factor $1/2$ arises because the internal clock rate of our architecture is half the input clock rate. Therefore, if our architecture and the other architectures have the same internal clock rate, the throughput of our architecture is twice that of the other architectures. This is achieved by doubling the input clock rate for the pixel input. The outcome of the comparisons shows that our design outperforms the other architectures, especially in computing time, control complexity, and hardware utilization.
We also compare the computing time and the hardware utilization between our architecture and the parallel filter architecture [16] for different 2-D DWT levels. The design of the parallel filter architecture is based on the MRPA, which intersperses the computation of the second and following levels among the computation of the first level. Fig. 16 plots the computing time for different 2-D DWT levels, where $N^2$ is the image size and $J$ is the 2-D DWT level. In a one-level 2-D DWT ($J = 1$), the computing time of our architecture is $0.5N^2$ clock cycles. As the 2-D DWT level increases ($J \to \infty$), the computing time converges to $(2/3)N^2$ clock cycles. In contrast, the parallel filter architecture always requires a computing time of $N^2$ clock cycles. On the other hand, as shown in Table I, the hardware utilization of the parallel filter architecture is only 50% in a one-level 2-D DWT ($J = 1$). As the 2-D DWT level increases ($J \to \infty$), the utilization converges to 66.67%. However, our architecture consistently maintains 100% hardware utilization.
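The trend of the computing time with the number of levels can be tabulated directly. The Python sketch below works in units of $N^2$ internal clock cycles and rests on two assumptions: the proposed architecture spends half an internal clock cycle per processed sample, with each level processing a quarter of the samples of the previous level, and the parallel filter architecture needs one full unit regardless of the level. The names `proposed_time` and `PARALLEL_TIME` are hypothetical.

```python
from fractions import Fraction

PARALLEL_TIME = Fraction(1)  # assumed: N^2 clock cycles at every level count


def proposed_time(levels):
    """Computing time of the level-by-level scheme, in units of N^2 cycles."""
    samples = sum(Fraction(1, 4) ** j for j in range(levels))
    return Fraction(1, 2) * samples  # half an internal cycle per sample


for levels in (1, 2, 3, 6):
    t = proposed_time(levels)
    print(levels, float(t), float(t / PARALLEL_TIME))
```

In this model the ratio of the two computing times equals the utilization quoted for the parallel filter architecture, starting at one half and approaching two thirds.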
Concerning the storage size, the proposed architecture requires a RAM module large enough to hold the first-level LL band (a quarter of the image) to save the intermediate data. However, because the proposed architecture is mainly applied in image compression systems, it can use the memory already existing in such systems to save the intermediate data. Hence, in this condition, the proposed architecture does not require the RAM module. The remaining storage consists of the line delays required in stage 2 for vertical filtering and the registers required in stage 1 for horizontal filtering.
VI. CONCLUSION
In recent years, many 2-D DWT architectures have been proposed to meet the requirements of real-time processing. However, the hardware utilization of these architectures needs to be further improved. Therefore, in this paper, we have proposed an efficient architecture for the 2-D DWT. The proposed architecture has been verified using the Verilog Hardware Description Language (Verilog HDL). The advantages of the proposed architecture are the 100% hardware utilization, fast computing time, regular data flow, and low control complexity, making this design suitable for next-generation image compression systems, e.g., JPEG-2000.
