Two Fast Architectures for the Direct 2-D Discrete Wavelet Transform
Francescomaria Marino
Abstract-In this paper, we propose two architectures for the direct two-dimensional (2-D) discrete wavelet transform (DWT). The first one is based on a modified recursive pyramid algorithm (MRPA) and performs a "nonstandard" decomposition (i.e., Mallat's tree) of an $N \times N$ image in approximately $2N^2/3$ clock cycles (ccs). This is a consistent speed-up over other known architectures, which commonly need approximately $N^2$ ccs. Furthermore, the proposed architecture is simpler than others in terms of hardware complexity.
Subsequently, we show how the "symmetric"/"anti-symmetric" properties of linear-phase wavelet filter bases can be exploited in order to further reduce the VLSI area. This is used to design a second architecture that provides one processing unit for each level of decomposition (pipelined approach) and performs a decomposition in approximately $N^2/2$ ccs. In many practical cases, even this architecture is simpler than general MRPA-based devices (which have only one processing unit).
Index Terms-Discrete wavelet transform, nonseparable 2-D DWT, VLSI.

I. INTRODUCTION
THE two-dimensional (2-D) DWT [1], [2] has gained popularity in the field of image and video coding (e.g., [3]-[5]) since it allows good complexity-performance tradeoffs [6], [7] and outperforms the discrete cosine transform (DCT) at very low bit rates [7]-[9]. Nevertheless, the 2-D DWT demands massive computations, and in applications requiring real-time performance, the use of either parallel implementations (e.g., [10]) or efficient VLSI ASICs is strategic.
When the 2-D wavelet basis functions are separable, the 2-D DWT can be split into two 1-D operations (i.e., row-wise and column-wise filtering). Several DWT architectures that exploit the "separable approach" and are effective for bidimensional applications have already been proposed (e.g., [11]-[16]).
All of them employ one-dimensional (1-D) filter units but require the processing of a certain number of rows of the input image before collecting, by means of transposition, the data to be processed by column-wise filtering. Such a transposition is a VLSI-area consuming task (especially when the sizes of the images are considerable) and represents an additional latency that strongly reduces the possibility of real-time processing.
Manuscript received July 7, 1999; revised February 23, 2001. The associate editor coordinating the review of this paper and approving it for publication was Prof. Chaitali Chakrabarti.
The author is with the Dipartimento di Elettrotecnica ed Elettronica, Facoltà di Ingegneria, Politecnico di Bari, Bari, Italy (e-mail: marino@deemail.poliba.it).
Publisher Item Identifier S 1053-587X(01)03892-2.
Another approach for computing the 2-D DWT consists of directly decomposing the input signal into two dimensions ("nonseparable" or "direct" approach). Only a few architectures have been proposed that "directly" perform the 2-D DWT: the parallel filter proposed in [18] and the devices described in [19] and [20]. This lack of architectures is due to the fact that "direct" architectures require more multipliers and accumulators (MACs) than "separable" architectures do [i.e., $L^2$ versus $2L$, where $L$ is the size of the filter]. Nevertheless, we should note the following.
• Both approaches need storage units (which typically require substantial VLSI area). This means that for most applications (i.e., when the image size $N$ is much greater than the filter size $L$), the area of these resources is dominant, and therefore, the need for more MACs in the direct approach is not so drastic.
• The direct approach does not require any transposition.
• Only the direct approach is possible when the wavelet basis functions are not separable (using nonseparable filters leads to more degrees of freedom in design and, consequently, better filters).
Therefore, even though the separable approach allows "recycling" of devices designed for 1-D applications, when a device has to be explicitly designed for 2-D application, or when real-time performance is mandatory, we think that the direct approach needs to be explored since it avoids the latency introduced by the transposer.
In this paper, we propose an innovative processing unit ("quadri-filter") based on the direct approach that performs one level of decomposition of an $N \times N$ data set in approximately $N^2/2$ clock cycles (ccs). Afterwards, we show how an MRPA-like scheduling can be used in order to derive, from the "quadri-filter," an architecture [say, general MRPA (GMRPA)] that performs a "nonstandard" decomposition (i.e., Mallat's tree [2]) in approximately $2N^2/3$ ccs. Despite its performance (an interesting speed-up with respect to the above referenced architectures, which need approximately $N^2$ ccs to perform a decomposition), the GMRPA-based architecture is simpler than classical devices in terms of hardware complexity.
In addition, we focus our attention on the linear-phase filter bases, which are very attractive for implementing pyramidal structures, since they may be easily cascaded without requiring phase compensation between two adjacent levels. We study how the symmetry/anti-symmetry of the linear-phase filters can be exploited in order to further reduce the hardware complexity of the "quadri-filter" and to make feasible another, even faster, architecture [say, symmetric pipeline (SP)], based on the pipelined approach. We shall show that in many practical cases, SP, which has one "symmetric"/"anti-symmetric" processing unit for each level of decomposition and performs a decomposition in approximately $N^2/2$ ccs, is less complex than more general MRPA-based devices.
The paper is structured as follows. The "quadri-filter" is described in Section II. The GMRPA architecture is introduced in Section III. Optimizations achievable for linear-phase filter bases are studied in Section IV, and the SP architecture is derived in Section V. Comparative evaluations are provided in Section VI, and concluding remarks are summarized in Section VII.
II. "QUADRI-FILTER" BLOCK
The "quadri-filter" block is the cornerstone of the architectures proposed in this paper. Such a block performs one level of 2-D DWT decomposition, according to the direct-approach, i.e., computing the four downsampled 2-D convolutions:
where the filter coefficients are, respectively, those of the low-low, low-high, high-low, and high-high $L \times L$-tap filter bases, 3 and the outputs are, respectively, the coefficients of the low-low, low-high, high-low, and high-high subbands produced at decomposition level $j$ (each subband having $N/2^j \times N/2^j$ coefficients). Note that the low-low subband for $j = 0$ represents the input into the first level of decomposition, i.e., the input image. In (1a)-(1d), the right side terms have been introduced only in order to make clearer the functionality of the "quadri-filter," which, as explained below, splits the 2-D computation into several 1-D computations (one for each row of the 2-D kernel).
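As a reading aid, the four downsampled 2-D convolutions corresponding to (1a)-(1d) can be sketched as follows. The notation is an assumption made here for illustration only: $w_{ll}$, $w_{lh}$, $w_{hl}$, $w_{hh}$ stand for the four $L \times L$-tap filter bases, $ll^{j}(m,n)$ and its companions for the subband coefficients at level $j$, and the index convention $(2m+p,\,2n+q)$ is one common choice among several. The braces mark the inner 1-D sums (one per kernel row), and the right side terms collect them into the outer sum.

```latex
\begin{align}
ll^{j}(m,n) &= \sum_{p=0}^{L-1}\underbrace{\sum_{q=0}^{L-1} w_{ll}(p,q)\,ll^{j-1}(2m+p,\,2n+q)}_{ll^{j}_{p}(m,n)}
             = \sum_{p=0}^{L-1} ll^{j}_{p}(m,n) \\
lh^{j}(m,n) &= \sum_{p=0}^{L-1}\underbrace{\sum_{q=0}^{L-1} w_{lh}(p,q)\,ll^{j-1}(2m+p,\,2n+q)}_{lh^{j}_{p}(m,n)}
             = \sum_{p=0}^{L-1} lh^{j}_{p}(m,n) \\
hl^{j}(m,n) &= \sum_{p=0}^{L-1}\underbrace{\sum_{q=0}^{L-1} w_{hl}(p,q)\,ll^{j-1}(2m+p,\,2n+q)}_{hl^{j}_{p}(m,n)}
             = \sum_{p=0}^{L-1} hl^{j}_{p}(m,n) \\
hh^{j}(m,n) &= \sum_{p=0}^{L-1}\underbrace{\sum_{q=0}^{L-1} w_{hh}(p,q)\,ll^{j-1}(2m+p,\,2n+q)}_{hh^{j}_{p}(m,n)}
             = \sum_{p=0}^{L-1} hh^{j}_{p}(m,n)
\end{align}
```

In this form, each inner sum depends on a single input row, which is what allows one 1-D processor per kernel row.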
The design of the "quadri-filter" has been derived from the well-known systolic convolver shown in Fig. 1 [22]. Such a device realizes a 2-D convolution of a word-serial/row-wise input data set using an $L \times L$ filter. It employs a pipe of row-delay circuits, each having as many cells as the number of pixels in a row of the input data set, 4 and $L$ 1-D convolvers operating in parallel on $L$ consecutive rows of the input. Each 1-D convolver has $L$ taps corresponding to one row of the $L \times L$-tap 2-D filter. Since each DWT subband is obtained by a downsampled 2-D convolution (1a)-(1d), in the design of a DWT architecture, we can split the pipe of row-delay circuits of Fig. 1 into two distinct pipes working in parallel, as shown in Fig. 2. In this way, the even-numbered and the odd-numbered rows of the input data set can be simultaneously fed (in a word-serial fashion) into two groups of processors in order to directly achieve the downsampling along the rows. 5 The processors replace the 1-D convolvers of Fig. 1 and implement four different downsampled 1-D filter operations by using, respectively, the corresponding row of each of the four 2-D wavelet filters. In other words, each processor computes the four inner sums that are underlined in (1a)-(1d).
3 The condition of equal size for each filter basis is not a restriction since it can be simply matched by suitably zero padding the filters having shorter sizes.
Each processor (see Fig. 4; the following description can be easily extended to the other processors) is composed of processing elements PE (see Fig. 5) and two coefficient adders (ADD A/C and ADD B/D). Its operating mode is described in Table I; in every cc, each multiplier uses, in an interleaved fashion (controlled by a select signal), one of two filter coefficients and computes one product during the even-numbered ccs and the other during the odd-numbered ccs. Once computed, these terms are fed into the coefficient adder ADD A/C [ADD B/D], which produces either the ll [lh] term (in the even-numbered ccs) or the hl [hh] term (in the odd-numbered ccs), as summarized in the legend of Fig. 4. In this way, within each of the pairs (1a), (1c) and (1b), (1d), the first equation is computed in the even-numbered ccs and the second in the odd-numbered ccs, and the downsampling by column is directly achieved. Because of the interspersed computation of ll with hl (as well as of lh with hh), the "quadri-filter" requires only two adders, ADD LL/HL and ADD LH/HH (see Fig. 2), which perform the outer sums of (1a), (1c) and (1b), (1d), respectively, and produce the final results.
4 The row-delay circuits can be realized in different ways. One possibility is a conventional shift register implemented by a pipe of N D-flip-flops; another choice consists of pushing them into SRAM modules. The first solution allows a faster working frequency than the second, which also requires an address generator (basically a counter modulo N). Nevertheless, depending both on the technology and on the value of N, the second strategy might be more suitable for implementation than the first one.
5 The simultaneous input of two adjacent rows has also been employed by other devices (e.g., the serial-parallel architecture proposed in [14] having a latency of N + N ccs). In addition, some "separable approach"-based devices exploit parallel I/O (e.g., [23], [24]). We would like to remark that neither a parallel access to the input data set nor large buffers are strictly required in order to achieve the desired "double channel" input strategy. In fact, it is sufficient to scan the input image according to the zig-zag path proposed in Fig. 3 by using a sampling frequency that is twice as high as the internal working frequency of the device. Nonstandard scanning paths have also been utilized by the "direct approach"-based device proposed in [20], having a latency of 2N ccs, and by the "separable approach"-based architecture described in [25].
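The data flow described above (one downsampled 1-D filtering per kernel row, followed by an outer sum across rows) can be illustrated with a small software model. This is only a behavioral sketch under stated assumptions, not the hardware itself: the names `inner_sums`, `quadri_filter`, and `haar_kernels` are hypothetical, periodic extension at the image borders is assumed, and the Haar kernels are used merely as a convenient separable test case.

```python
import numpy as np

def inner_sums(row, taps):
    """Downsampled 1-D filtering of one row: s[n] = sum_q taps[q] * row[(2n+q) mod N].

    Models the job of one 1-D processor (periodic border handling is an assumption).
    """
    n_in = len(row)
    cols = (2 * np.arange(n_in // 2)[:, None] + np.arange(len(taps))[None, :]) % n_in
    return row[cols] @ taps

def quadri_filter(x, kernels):
    """One level of direct (non-separable) 2-D DWT of an N x N array.

    kernels maps a subband name to its L x L kernel; returns N/2 x N/2 subbands.
    """
    n = x.shape[0]
    size = next(iter(kernels.values())).shape[0]
    out = {name: np.zeros((n // 2, n // 2)) for name in kernels}
    for m in range(n // 2):              # downsampling along the rows
        for p in range(size):            # one 1-D processor per kernel row
            row = x[(2 * m + p) % n]
            for name, w in kernels.items():
                out[name][m] += inner_sums(row, w[p])   # outer sum over p
    return out

def haar_kernels():
    """Illustrative 2 x 2 kernels built from the orthonormal Haar pair."""
    lo = np.array([1.0, 1.0]) / np.sqrt(2)
    hi = np.array([1.0, -1.0]) / np.sqrt(2)
    return {"ll": np.outer(lo, lo), "lh": np.outer(lo, hi),
            "hl": np.outer(hi, lo), "hh": np.outer(hi, hi)}
```

For instance, `quadri_filter(np.arange(16, dtype=float).reshape(4, 4), haar_kernels())` returns four 2 x 2 subbands; the model makes no attempt to reproduce the cycle-level interleaving of ll with hl and of lh with hh.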
III. MRPA-LIKE-BASED ARCHITECTURE
MRPA-based architectures exploit the downsampling of the output subbands and perform the first decomposition level interspersed with all the other levels by means of only one processing unit. An MRPA-like-based architecture built on the "quadri-filter" block is represented in Fig. 6. In this example, a specific filter size and a specific number of decomposition levels have been considered (the extension to different values is straightforward). In addition, the even-numbered processors have been grouped separately from the odd-numbered processors for the sake of clarity.
A. Design
The architecture is characterized by serial-in/serial-out sets of row-delay circuits, one set for each decomposition level. Specifically, the row-delay circuits of the first set are composed of $N$ cells and are needed to store the rows of the input image, whereas the row-delay circuits of the set serving level $j$ are composed of $N/2^{j-1}$ cells and are employed to store the rows of the low-low subband of level $j-1$, which are used as input for computing decomposition level $j$. The production of the different decomposition levels is scheduled according to an algorithm that differs from the MRPA since our "quadri-filter" needs the parallel input of two rows; here, a row of each subband at decomposition level $j$ can be computed only when (and as soon as) the two adjacent rows of the low-low subband of level $j-1$ that it requires have been produced and are available in the corresponding row-delay circuits. To clarify, two periods of such MRPA-like scheduling are represented in Table II. Each period is composed of several subperiods, during which a select signal identifies the level being computed. Each subperiod devoted to level $j$ takes $N/2^{j-1}$ ccs and produces one row of each subband of that level. The rows of lh, hl, and hh are immediately output; conversely, the row of ll is fed back and stored in one of two row-delay pipes, depending on whether the row is even-numbered or odd-numbered.
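The scheduling rule described above can be mimicked by a short simulation. The sketch below is an assumption-laden illustration rather than a transcription of Table II: the function name `mrpa_like_schedule` is hypothetical, the tie-breaking policy (serve the deepest level that has two adjacent input rows ready, otherwise produce the next level-1 row) is a choice made here, and the input image rows are assumed to be available whenever level 1 is served.

```python
def mrpa_like_schedule(n, levels):
    """Return a list of (level, ccs) subperiods for an n x n image.

    Assumed policy (illustrative): run the deepest level that has two
    adjacent input rows ready; otherwise produce the next level-1 row.
    """
    ready = [0] * (levels + 1)      # ready[j]: ll rows of level j not yet consumed by level j+1
    produced = [0] * (levels + 1)   # produced[j]: rows of level j computed so far
    schedule = []
    while True:
        level = None
        for j in range(levels, 1, -1):
            if ready[j - 1] >= 2:
                level = j
                break
        if level is None:
            if produced[1] == n // 2:
                break               # every level-1 row is done and nothing is pending
            level = 1
        else:
            ready[level - 1] -= 2   # the two buffered rows are consumed
        produced[level] += 1
        if level < levels:
            ready[level] += 1       # the new ll row is fed back for the next level
        schedule.append((level, n >> (level - 1)))  # one row of level j takes n / 2**(j-1) ccs
    return schedule
```

With `n = 8` and three levels, this policy serves the levels in the order 1, 1, 2, 1, 1, 2, 3, for a total of 42 ccs, which stays below the $2N^2/3$ bound.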
B. Timing
It should be noted that during a subperiod devoted to a given level, only the data stored in the row-delay circuits of that level are needed as input to the quadri-filter block. Therefore, only the cells of those circuits have to be triggered. To this end, in a proper timing unit, the clock signal CK is demultiplexed onto different lines, each triggering the row-delay circuits of one level. Conversely, the row-delay circuit that buffers a downsampled row as soon as it is produced needs to be triggered by a clock signal (say, ck) twice as slow as CK because of the downsampling. As a consequence, ck is generated and demultiplexed onto two lines, which trigger the buffers of the even-numbered and of the odd-numbered rows, respectively. Two other select signals control the data flow in the device: one is set to 1 [0] during the even-numbered [odd-numbered] ccs, and the other is set to 1 [0] during the production of even-numbered [odd-numbered] rows.
It should be remarked that timing unit and control system represent a cost that is typical of MRPA-based devices. Nevertheless, because of the serial input of the "quadri-filter," the control system of the proposed architecture needs an routing network, two -output demultiplexers, and -input multiplexers, and therefore, it appears simpler than control systems of other devices [for instance, the MRPA-based parallel filter described in [18] needs an routing network, -output demultiplexers, and -input multiplexers]. More detailed comparative evaluation of the architecture will be given in Section VI.
IV. SYMMETRY EXPLOITATION
DWT architectures have been designed based on the properties of the DWT decomposition (e.g., MRPA-based architectures). In some cases, the properties of specific wavelet filter bases (e.g., the four-tap Daubechies wavelet [1]) have been exploited to design compact processors that do not require multipliers; however, the derived architectures (e.g., [26] and [27]), even though they are efficient for the filters for which they have been designed, are not flexible. In this section, we focus our attention not on a specific filter basis but on a widely used class of filters, namely, linear-phase filters. These filters are very attractive for implementing pyramidal structures (such as the DWT or, more generally, filterbanks) since they do not require phase compensation between two adjacent levels.
We will show that the "quadri-filter" can be optimized when the DWT has linear-phase bases since linear-phase filters have a symmetric (or anti-symmetric) impulse response. 6 Such symmetric properties can be formalized in (2a) and (2b), which express relations among the coefficients of a -tap filter for the case of symmetry and anti-symmetry, respectively:
Here, and denote the highest integers not greater than and , respectively . In the case of 2-D DWT having linear-phase bases, can be read without distinction as or . In order to consider a practical example, let us examine the case of a -tap symmetric filter . Anyway, the following analysis can be extended to any other case of evenlength/odd-length symmetric/anti-symmetric filter 7 .
For the considered case (a $7 \times 7$-tap symmetric kernel), because of the symmetry with respect to the fourth row of the filter kernel, from (2a) we derive (3), which means that processors placed symmetrically with respect to the central one, although they process different rows, adopt the same filter coefficients. This circumstance is marked by the white arrows in Fig. 2. Therefore, we can employ only one processor instead of two, on condition that this processor receives as input the point-wise sum of the two corresponding rows. As a result, the architecture can be reduced to the scheme shown in Fig. 7.
Similarly, because of the symmetry with respect to the fourth column, we have (4) which can be exploited in order to optimize the design at the level of the 1-D processors .
In fact, let us consider the functionality of one of these processors (extensions to the other processors are straightforward). Because of the row symmetry, its output terms (5a)-(5d) correspond to the sum of the terms that, in (1a)-(1d), are associated with two symmetric rows of the kernel, and they have to be added to the similar terms produced by the other processors in order to generate a specific subband coefficient. Since the underlined additions in (5a)-(5d) are performed in the pipe of row-delay circuits before the data are input into the processor (see Fig. 7), (5a)-(5d) can be rewritten in a form that can be computed by the systolic array shown in Fig. 8. Such an array represents a reduction of the scheme of Fig. 4 since it requires only $(L+1)/2$ PEs (only $L/2$ PEs for even values of $L$). The functionality of the processor (when it implements a "symmetric" filter) is clarified by Table III. During any cc, the adders compute the sum (the difference, in the case of an "anti-symmetric" filter) of the two samples that share the same filter coefficient. Afterwards, this result, which constitutes the underlined term in (7a), is fed into the processing element PE, which works in the same way that has been described in Section II.
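The saving obtained from the column symmetry can be checked numerically with a folded 1-D filter, in which the two samples sharing a coefficient are pre-added (or subtracted, in the anti-symmetric case) before a single multiplication. The sketch below is illustrative only: the function names are hypothetical, and it models the arithmetic of one output sample, not the systolic timing of Fig. 8.

```python
import numpy as np

def direct_fir(window, taps):
    """Reference: one multiplication per tap."""
    return float(np.dot(window, taps))

def folded_fir(window, taps, antisymmetric=False):
    """Pre-add the samples that share a coefficient, then multiply once.

    Assumes taps[k] == taps[L-1-k] (symmetric) or
    taps[k] == -taps[L-1-k] (anti-symmetric).
    """
    length = len(taps)
    half = length // 2
    sign = -1.0 if antisymmetric else 1.0
    acc = 0.0
    for k in range(half):
        acc += taps[k] * (window[k] + sign * window[length - 1 - k])
    if length % 2 == 1:
        acc += taps[half] * window[half]   # the central tap has no partner
    return acc

def multipliers_needed(length):
    """Multiplications per output sample after folding: (L+1)/2 for odd L, L/2 for even L."""
    return (length + 1) // 2
```

For a 7-tap symmetric filter, the folded form uses four multiplications per output sample instead of seven; for a 6-tap filter, it uses three.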
V. PIPELINED ARCHITECTURE
The above optimizations allow the design of a "symmetric"/"anti-symmetric" MRPA-like-based architecture (say, SMRPA) for linear-phase wavelet bases that needs less hardware than a general MRPA-like-based architecture (say, GMRPA). The structure of SMRPA can be easily derived from Sections III and IV, and we will not discuss its implementation in detail; an analysis of its hardware complexity and latency will be provided in Section VI. We devote this section to a second architecture that exploits the pipelined approach in order to speed up the processing.
A pipelined 2-D DWT architecture has as many processing units as the levels of decomposition to be implemented. Since, in our case, each processing unit basically consists of a "quadri-filter," a "symmetric"/"anti-symmetric" pipelined architecture (say, SP) can be reasonably implemented in the case of linear-phase wavelet bases. In the next section, we will show that in many practical cases, SP is simpler than conventional (but filter-independent and more general) MRPA-based devices.
A. Design
In a pipelined 2-D DWT (see Fig. 9), the subband coefficients lh, hl, and hh of level $j$ are generated in the $j$th slice of the pipeline and directly output. Conversely, once the coefficients ll of level $j$ are produced, they are fed into slice $j+1$ to be used as input at decomposition level $j+1$. Obviously, because of the downsampling, the quadri-filter block at level $j$ will be designed considering an $N/2^{j-1} \times N/2^{j-1}$ input data set.
Fig. 7. ... has as many delay elements as the pixels in a row of the input image (e.g., N). "{-}" denotes the possibility of using the proposed scheme in case of anti-symmetric filter bases simply by complementing an input of the adders. In this example, L = 7. Dotted/dashed lines indicate the possibility of pipelining.
Fig. 8. One-dimensional processor P (odd-length symmetric filter basis). In case of even-length symmetric filters (e.g., L = 6), PE and the third delay are not needed, and p is input to the former fourth delay. "{0}" denotes the possibility of using the proposed scheme in case of anti-symmetric filter bases simply by complementing one of the inputs of the adders. Dotted/dashed lines indicate the possibility of pipelining.
Note that, since the "quadri-filter" block outputs the subband ll in an interspersed fashion with the subband hl via the output line LL/HL (Fig. 7), an adapter like the one shown in Fig. 10 has to be inserted between two consecutive blocks. Such an adapter separates the subband ll from the subband hl and provides the "quadri-filter" at level $j+1$ with the parallel input of two rows of ll, even though ll is produced row-wise by stage $j$ in a serial fashion.
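The role of the adapter can be sketched in software as a de-interleaving step followed by a pairing of adjacent rows. The function `adapt` and its arguments are hypothetical names introduced for illustration; the sketch assumes that ll occupies the even-numbered ccs and hl the odd-numbered ccs on the LL/HL line, and it ignores the clock gating performed by the actual controller.

```python
def adapt(interleaved, row_len):
    """Split an interleaved ll/hl stream and pair adjacent ll rows.

    interleaved: samples in arrival order on the LL/HL line (ll, hl, ll, hl, ...).
    row_len: number of ll coefficients per row.
    Returns (pairs, hl_rows): pairs is a list of (even_row, odd_row) ll rows
    ready for the two-row parallel input of the next stage; hl_rows is output as is.
    """
    ll = interleaved[0::2]          # even-numbered ccs carry ll (assumed)
    hl = interleaved[1::2]          # odd-numbered ccs carry hl (assumed)
    ll_rows = [ll[i:i + row_len] for i in range(0, len(ll), row_len)]
    hl_rows = [hl[i:i + row_len] for i in range(0, len(hl), row_len)]
    # An even-numbered row is buffered until the adjacent odd-numbered row is produced.
    pairs = list(zip(ll_rows[0::2], ll_rows[1::2]))
    return pairs, hl_rows
```

For example, `adapt([0, 100, 1, 101, 2, 102, 3, 103], 2)` yields one pair of ll rows, `([0, 1], [2, 3])`, together with the two hl rows.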
B. Timing
The above policy is assured by a select signal, which is set to 1 [0] while the even-numbered [odd-numbered] rows of ll are produced. As a consequence, the even-numbered rows of ll (as soon as they are produced) are buffered in a shift register having as many cells as the coefficients in a row, in order to be fed into the next level simultaneously with the production of the adjacent odd-numbered row. Note that the clock signals used in the next slice have to be "frozen" while the even-numbered rows are produced by the current slice since these rows are buffered and not immediately employed as input to the next level (see the AND gates in the controller of Fig. 10).
A further remark concerns the data flow of the ll coefficients; their production is interspersed with gaps of 1 cc on the output line LL because of the downsampling by columns performed in the level . These gaps can be efficiently exploited. In fact, without deteriorating performance, a single multiplier (e.g., ) can also be used to perform the functionality of in PE on the condition that each coefficient ll is "replicated" during the gap that follows itself on the line LL. 9 Note that such a "replication" can be achieved in a very simple way, triggering (at the level ) the row-delay circuits and the D-registers by means of a clock signal CK , which is two times slower than the clock signal CK , which triggers the row-adder, the coefficient adder, the multiplier , and the adder (see the timing diagram in Fig. 9 ). The correct computation of (8a)-(8d) by means of a single multiplier is allowed by three two-input multiplexers:
(controlled by the select signal ), and (controlled by the select signal ), as shown in Fig. 11 . In each PE , the combined actions of and [ and ] exploit the gaps in order to allow to perform the job of and during the even-numbered ccs [the odd-numbered ccs ]. can be obtained by doubling the period of the clock signal has the same period of the signal ( has a period equal to ). and are obtained by the signal in the same way the clock signals and are obtained by the clock signal . These signals are recursively derived by the adapters inserted between level and level (see Fig. 10 ).
Because of the use of a single multiplier, for the levels following the first one, only one coefficient-adder (instead of two) is really necessary for each processor, and only one row-adder 10 (instead of two) is really necessary for each "quadri-filter." Note that because of the downsampling, starting from level 2, the hardware is underutilized (i.e., the throughput is below 100%). Higher hardware utilization can be achieved by introducing an MRPA-like-based block at level 2 (i.e., designing a pipeline composed of only two processing units, where the second unit performs all the decomposition levels from the second onward), but we will not discuss this hybrid solution in detail.
VI. COMPARATIVE EVALUATION
In order to evaluate GMRPA, SMRPA, and SP, we compare them with the "parallel filter" described in [18] (say, PF), which is (to the best of our knowledge) the fastest 2-D DWT "direct approach"-based processor. To this end, we will evaluate the hardware complexity and the latency of each architecture as functions of the number of decomposition levels, of the size of the filter bases, and of the dimension of the input image. Since we aim to have measures independent of the specific integration technology, hardware complexity and latency will be evaluated, respectively, in terms of number of transistors and in terms of number of ccs. These figures of merit can be converted into "abstract values" of silicon area and time for a specific implementation when multiplied by the average area needed by a transistor and by the period of one cc, respectively. Note that due to the silicon area needed by the routing and because of some physical phenomena (e.g., load capacitances, fan-in, and fan-out), "actual values" for a specific integration technology may be different from the derived "abstract values," and therefore, they can be known only by means of a detailed analysis of the particular implementation.
10 The row-adder, as well as the coefficient-adders, has to be triggered by the faster clock signal (the one that also triggers the multiplier).
A. Hardware Complexity
The hardware complexities of the compared architectures are plotted in Fig. 12. Another remark concerns the precision. In DWT applications, the required precision grows with the levels. As a consequence, whereas the only processing unit of GMRPA, SMRPA, and PF must achieve the precision required by the last level (i.e., the highest one), SP might be realized by slices that achieve lower precision in the lower levels. This possibility could provide a further reduction of the hardware complexity of SP. In the above evaluations, the routing has not been considered. Nevertheless, we can reasonably state that due to the serial and semi-systolic structure of the "quadri-filter," GMRPA, SMRPA, and SP require a routing network that is simpler than the routing network needed by PF.
B. Computing Performance
A relevant remark concerns the computing performance of GMRPA, SMRPA, and SP. In the MRPA-like-based architectures, each level of decomposition requires $n/2$ ccs, $n$ being the number of samples input into that level. Therefore, for $J$ levels of decomposition of an $N \times N$ image, we have
$$T_{\mathrm{GMRPA}} = T_{\mathrm{SMRPA}} = \sum_{j=1}^{J} \frac{N^2}{2 \cdot 4^{j-1}} \qquad (8)$$
which means $T_{\mathrm{GMRPA}} = T_{\mathrm{SMRPA}} \approx 2N^2/3$ ccs since the right-side term of (8) is upper bounded by $2N^2/3$. Because of the pipelined structure, $T_{\mathrm{SP}} \approx N^2/2$ ccs, which is the latency of the first and slowest slice of SP. These latencies are significantly shorter than $T_{\mathrm{PF}}$, which is approximately $N^2$ ccs [18]. In order to evaluate these latencies in time units, we can assume that the period of the working clock cc is lower bounded by the latency introduced by all the cells met in the critical path. The clock periods evaluated in such a way are reported in Table IV. In particular, we have that (9) and we can conclude that 1) (in time units); 2) (in time units). It should be remarked that better-performing devices (in terms of time units) can be obtained by pipelining the processors internally. For instance, latches inserted as denoted by the dotted/dashed lines in Figs. 2, 4, and 6-8 could significantly speed up the clock frequency. Anyhow, PF can be similarly pipelined as well, and therefore, the above comparative evaluation does not change its sense in the presence (or absence) of pipelining.
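The cycle counts discussed above can be checked with a few lines of exact arithmetic. The function names below are hypothetical, and the per-level cost of half the number of input samples is taken from the text; the sketch counts ccs only and says nothing about the clock period.

```python
from fractions import Fraction

def latency_mrpa_like(n, levels):
    """Total ccs when each level costs half the number of its input samples."""
    return sum(Fraction(n * n, 2 * 4 ** (j - 1)) for j in range(1, levels + 1))

def latency_pipelined(n):
    """ccs of the first (slowest) slice, which bounds the pipelined architecture."""
    return Fraction(n * n, 2)

def upper_bound_mrpa_like(n):
    """Limit of the geometric sum for an unbounded number of levels: 2 * n**2 / 3."""
    return Fraction(2 * n * n, 3)
```

For a 512 x 512 image and three levels, the MRPA-like count is 172032 ccs, the pipelined count is 131072 ccs, and a device needing about $N^2$ ccs would take 262144 ccs.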
VII. CONCLUSION
In this paper, we have proposed two fast architectures for the direct 2-D DWT. The first one is an MRPA-like-based architecture performing a "nonstandard" decomposition in approximately $2N^2/3$ ccs. It is significantly faster than classical MRPA-based devices, which commonly need approximately $N^2$ ccs. A comparative analysis of hardware complexity has also shown the effectiveness of the proposed device.
The second architecture exploits the pipelined approach and is even faster since it performs a decomposition in approximately $N^2/2$ ccs. "Symmetric"/"anti-symmetric" properties of linear-phase filter bases have been studied and exploited in order to reduce the hardware complexity. Obviously, the global complexity grows with the number of decomposition levels, and the effectiveness of the architecture, in terms of hardware complexity, also depends on the parameters of the specific application. We have shown that in many applications, a "symmetric"/"anti-symmetric" pipelined architecture is simpler (in terms of hardware complexity) than other known (but filter-independent) MRPA-based devices.
ACKNOWLEDGMENT
The author acknowledges the Associate Editor and the anonymous reviewers for some constructive comments that have improved this paper.
