This paper proposes an efficient architecture for the two- 
I. Introduction
With the rapid progress of VLSI design technologies, many processors for audio and image signal processing have been developed recently. The two-dimensional Discrete Wavelet Transform (2-D DWT) is the most important technique of the JPEG-2000 image compression standard [1].
Presently, research on the DWT is attracting a great deal of attention. In addition to audio and image compression, the DWT has important applications in many areas, such as computer graphics, numerical analysis, and radar target discrimination. The architecture of the 2-D DWT is mainly composed of multirate filters. Because practical applications such as digital cameras involve extensive computation, high-efficiency, low-cost hardware is indispensable. Among the various architectures, the most prevalent design for the 2-D DWT is the parallel filter architecture [6], [7]. The design of the parallel filter architecture is based on [...] The advantage of such a scheme is that the data flow is very regular, so we can concentrate our effort on designing the transform module efficiently. As shown in Fig. 2, the transform module is tree-structured and comprises two stages.
Stage 1 performs horizontal filtering, and stage 2 performs vertical filtering. To design the transform module efficiently, let a be the area cost and t be the time cost required in stage 1. According to the original design shown in Fig. 2, the number of filters required in stage 2 is double that in stage 1. That is, the area cost required in stage 2 is 2a. On the other hand, due to the decimation operation in stage 1, the quantity of data to be filtered in each branch of stage 2 is half that of stage 1. Hence, the processing time required in stage 2 is half of t, i.e., t/2. Because stage 2 is cascaded after stage 1, stage 2 cannot work until stage 1 finishes its job. Therefore, from the above discussion we find that an area-time amount of 2a × (t - t/2) = at of hardware is idle in stage 2.
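The idle-hardware estimate above can be checked with a small calculation. The following is a minimal Python sketch, not part of the original design; the function name `stage2_idle_area_time` and the use of arbitrary normalized units for a and t are assumptions introduced only for illustration.

```python
from fractions import Fraction


def stage2_idle_area_time(a, t):
    """Idle area-time of stage 2 in the original tree-structured module.

    a: area cost of stage 1 (hypothetical normalized units)
    t: time cost of stage 1 (hypothetical normalized units)
    """
    area_stage2 = 2 * a              # stage 2 needs twice as many filters
    busy_stage2 = Fraction(t) / 2    # half the data per branch after decimation
    # Stage 2 occupies its area for the whole period t but is busy for only t/2.
    return area_stage2 * (Fraction(t) - busy_stage2)


if __name__ == "__main__":
    # With a = 3 and t = 4 the idle area-time equals a * t.
    print(stage2_idle_area_time(3, 4))
```

Under this simple model, stage 2 is busy for only half of each period t, which is the inefficiency the text goes on to address.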
In other words, the hardware utilization in the original design of the transform module is inefficient. In order to solve this problem, we consider a single decimation filter. The decimation filter can be implemented directly by a filter followed by a two-folded decimator.
However, the decimator discards one sample out of every two samples at the filter output, causing poor hardware utilization. [...] Table 1. From Table 1, we find that if we apply the polyphase decomposition technique to stage 1 and the coefficient folding technique to stage 2, the area and time costs are the same in both stages, namely a and t/2. Thus, the total area cost is 2a and the total time cost is t/2. The AT product is reduced from 3at to at, and no hardware is idle in stage 2. In terms of the AT product, the new design method is therefore three times as efficient as the original design.
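The difference between a direct decimation filter and its polyphase form can be illustrated in software. The following is a minimal Python sketch under stated assumptions: the function names, the zero initial condition (samples before the start of the input are taken as 0), and the integer 4-tap coefficients are placeholders chosen for illustration and are not the filters or the hardware structure of the original design. It shows only that the polyphase form computes the retained outputs without ever producing the discarded ones.

```python
def fir_filter(x, h):
    """Full-rate causal FIR: y[n] = sum_k h[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def decimate_direct(x, h):
    """Filter at full rate, then keep every second output (half the work is discarded)."""
    return fir_filter(x, h)[::2]


def decimate_polyphase(x, h):
    """Compute only the retained outputs y[2m] from even/odd polyphase components."""
    h_even, h_odd = h[0::2], h[1::2]   # h[2j] and h[2j + 1]
    x_even = x[0::2]                   # x_even[i] = x[2i]
    x_odd = [0] + x[1::2]              # x_odd[i] = x[2i - 1], with x[-1] = 0
    y = []
    for m in range((len(x) + 1) // 2):
        acc = 0
        for j, c in enumerate(h_even):     # terms h[2j] * x[2(m - j)]
            if m - j >= 0:
                acc += c * x_even[m - j]
        for j, c in enumerate(h_odd):      # terms h[2j + 1] * x[2(m - j) - 1]
            if m - j >= 0:
                acc += c * x_odd[m - j]
        y.append(acc)
    return y


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]   # placeholder input samples
    h = [1, 2, 3, 4]               # placeholder 4-tap filter
    print(decimate_direct(x, h))
    print(decimate_polyphase(x, h))
```

Both functions return the same sequence, but the polyphase version evaluates the filter sum only once per retained output, which is the software analogue of avoiding the wasted computation described above.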
In contrast, the other design methods, as listed in Table 1 [...] Assume that the low-pass filter has four taps: a0, a1, a2, and a3, and the high-pass filter has four taps: b0, b1, b2, and b3.
[6] C. Chakrabarti and C. Mumford, "Efficient realizations of analysis and synthesis filters based on the 2-D discrete 
