Abstract-In this paper, we present our proposed architecture (PA) for the direct two-dimensional discrete wavelet transform (DWT), which performs a complete dyadic (i.e., nonstandard) decomposition of an N_0 × N_0 image in approximately N_0^2/4 clock cycles (ccs).
I. INTRODUCTION
THE two-dimensional (2-D) discrete wavelet transform (DWT) [1]-[4] is a mathematical technique that decomposes a 2-D discrete signal in a multiresolution space domain by using dilated/contracted and translated versions of a single finite-duration basis function, named the prototype wavelet. In the field of image processing, the 2-D DWT has recently been used as a powerful tool for texture discrimination (e.g., [5], [6]), fractal analysis (e.g., [7], [8]), and image compression (e.g., [9]-[18]). Nevertheless, the 2-D DWT demands massive computations. As a consequence, in applications requiring real-time performance (e.g., on-line video and image coding), the use either of parallel implementations (e.g., [19]-[21]) or of efficient VLSI application-specific integrated circuits (e.g., [22]-[35]) is strategic. Therefore, a large number of VLSI architectures have been realized or proposed that implement the 2-D DWT. Some of them simply consist of a one-dimensional (1-D) DWT device and a transposer and are suitable for those applications employing separable 2-D wavelet bases ("separable" approach). They process rows of the input image before obtaining, by means of transposition, the data to be processed in the column-wise step. Even though can be reduced up to (where is the size of the filter) by means of recursive pyramid algorithm (RPA)/modified recursive pyramid algorithm (MRPA) scheduling techniques [26], such a transposition remains a VLSI-area-consuming task and, above all, represents an additional latency that strongly reduces the possibility of real-time processing.

Manuscript received January 2000; revised October 2000. This paper was recommended by Associate Editor A. Skodras.
The author is with the Dipartimento di Elettrotecnica ed Elettronica, Facoltà di Ingegneria, Politecnico di Bari, Bari 70125, Italy (e-mail: marino@deemail.poliba.it).
Publisher Item Identifier S 1057-7130(00)11669-6.
In order to avoid this bottleneck, the 2-D DWT can also be performed directly, decomposing the input signal in two dimensions ("nonseparable" or "direct" approach). Only a few architectures have been proposed that "directly" perform the 2-D DWT: the parallel filter proposed in [32] and the devices described in [33]-[35]. This scarcity is due to the fact that "direct" architectures require a higher number of multipliers and accumulators (MACs) than separable architectures [i.e., versus ]. Nevertheless, the following should be noted.
1) Both approaches employ storage units, which typically require VLSI area (this means that for most applications, i.e., when , these resources become preponderant in the global VLSI-area count, and therefore the need for more MACs in the direct approach is not so drastic).
2) The direct approach does not require any transposition (this means that VLSI area and additional latency are saved with respect to the nondirect approach).
3) The direct approach is the only possible one when the wavelet basis functions are not separable.

As far as performance is concerned, all the above-referenced architectures require at least clock cycles (ccs) to decompose an input image in decomposition levels (assuming a latency of 1 cc for one MAC). Now, we observe that the demand for low-power VLSI circuits in modern mobile/visual communication systems is growing every day, and the progress of VLSI technology has strongly reduced the cost of the hardware. Therefore, we think that reducing the latency, even at the cost of increasing the hardware, 1 is a possibility that merits consideration. In fact, a device having a latency ccs could be employed not only for performing a times faster processing than that allowed by another device having a latency of ccs, but also (if is clocked at a frequency ) for reaching the same performance reached by when this one is clocked at a frequency . This circumstance allows reducing the supply voltage (linear in ) and the power dissipation (linear in ), respectively, by and by with respect to [36]. The above considerations motivate the object of this paper, which proposes a direct approach-based pipelined architecture performing the 2-D DWT of an input image in only ccs. Moreover, even though our architecture is a pipeline, 2 it has a high hardware utilization (i.e., efficiency): 100% for and 87.5% for . The joint effect of these characteristics provides our architecture with an AT^2 complexity 3 notably lower than that of other architectures.
The outline of this paper is as follows. In the next section, we briefly recall the direct 2-D DWT and highlight two properties that will be exploited in the design of the architecture. Motivations and a top-level scheme of the architecture are introduced in Section III. The block implementing the first stage of the pipe (say, ) is described in Section IV. In Section V, two different approaches (say, and ) to designing the second stage are proposed: they depend on the number of decomposition levels (specifically, and have to be used if the particular application requires, respectively, and decomposition levels). Evaluations of the proposed architecture are given in Section VI, and conclusions are summarized in Section VII.
II. THE DIRECT 2-D DISCRETE WAVELET TRANSFORM
The 2-D DWT [1]-[4] is a multilevel decomposition technique. In it, each decomposition level can be seen as the further decomposition of a 2-D data set (having samples) into four subbands , , , and [having samples]. In the direct approach, such a decomposition is produced by four 2-D convolutions followed by a decimation by two both in the horizontal and in the vertical dimension, as is depicted in Fig. 1.

[...] devoted to perform all the four subbands (e.g., [21], [30], [32], [33], and [35]). In such devices, the downsampling occurring at the generic level ( ) is generally exploited by "feeding back" the coefficients of the subband as soon as they can be used as input for computing the decomposition level (RPA-based approach [26]): this folding minimizes the memory requirement and allows "recycling" of the same filter units for all the decomposition levels.

2 Typically, in pipelined DWT architectures, modules implementing levels higher than the first one are strongly underutilized because of the down-sampling (e.g., see the 1-D DWT architectures in [37]-[40]).
3 The AT^2 parameter is an index of complexity for VLSI architectures introduced by Thompson [41]. This index is given by the measure of the VLSI area (A) times the square of the measure of the latency (T). Generally, it is sufficient to estimate A and T in O(·) notation. Nevertheless, in this paper, we will measure them in terms of number of transistors and number of ccs in order to get a more accurate comparative evaluation.
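As an illustration of the decomposition step recalled in Section II (four 2-D convolutions, each followed by a decimation by two in both dimensions), the following is a minimal software sketch, not a model of any architecture discussed here. The function name `direct_dwt2_level`, the subband labels "a", "h", "v", "d", the periodic boundary extension, and the Haar kernels are all assumptions introduced for the example.

```python
import numpy as np

def direct_dwt2_level(x, kernels):
    """One level of a direct 2-D DWT (illustrative sketch).

    x       : 2-D array with even side lengths; periodic extension is assumed.
    kernels : dict mapping a (hypothetical) subband label to a 2-D filter kernel.
    Returns a dict of subbands, each holding a quarter of the input samples.
    """
    rows, cols = x.shape
    out = {}
    for name, h in kernels.items():
        L, M = h.shape
        y = np.zeros((rows // 2, cols // 2))
        for n1 in range(rows // 2):
            for n2 in range(cols // 2):
                acc = 0.0
                for i in range(L):
                    for k in range(M):
                        # 2-D convolution evaluated only at even positions,
                        # i.e., decimation by two in both dimensions.
                        acc += h[i, k] * x[(2 * n1 - i) % rows, (2 * n2 - k) % cols]
                y[n1, n2] = acc
        out[name] = y
    return out

# Example kernels (assumption): 2 x 2 Haar kernels built as outer products.
s = 1.0 / np.sqrt(2.0)
lo = np.array([s, s])
hi = np.array([s, -s])
haar = {
    "a": np.outer(lo, lo),
    "h": np.outer(lo, hi),
    "v": np.outer(hi, lo),
    "d": np.outer(hi, hi),
}
```

Calling `direct_dwt2_level(x, haar)` on an 8 × 8 array returns four 4 × 4 subbands; feeding the "a" subband back into the same function would produce the next level, which is the recursion that RPA-based devices fold onto a single set of filter units.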
Nevertheless, as we have already said, reducing the latency, even at the cost of increasing the hardware, is a strategy that merits analysis in order to match the demand for low-power VLSI circuits in modern mobile/visual communication systems.
Our idea consists of:
1) considering the intrinsic horizontal parallelism of the multisubband decomposition (i.e., employing four filter units, one for each subband to be computed at the first level);
2) considering the intrinsic vertical parallelism of the multilevel decomposition (i.e., pipelining the decomposition levels);
3) exploiting the downsampling in order to speed up the throughput in the architecture;
4) adopting an routing network, as suggested by (2), much simpler than the scheme employed in parallel devices.

The architecture derived from this strategy is a "two-stage" serial pipeline that is four times faster than already known devices. In it, the first stage (block ) is exclusively devoted to performing decomposition level 1, while all the other decomposition levels are performed by the second stage. Such a second stage can be realized either as a "conventional" block [i.e., computing only the second decomposition level, supposed ; see in Fig [...]

This approach can be implemented by 1-D processors working in parallel [say, ; ) and by a common Row-Adder block: each processor being used to compute the four inner sums of (2a)-(2d) (i.e., the 1-D decimated convolutions , , , ) and the Row-Adder block performing the outer sums. A top-level scheme of the block is provided in Fig. 3. Such a scheme takes into account Property A. In fact, because of Property A, the block can be split into two independent and parallel subblocks (say, and ), which process exclusively either even-numbered or odd-numbered rows of and are composed, respectively, of the even-numbered processors ( ) and of the odd-numbered processors ( ). In the design of the processors (see Fig. 4 [...] while is used to feed into , , , . Note that because of the independence of the computations, and can be fed in parallel into the subblocks. The filter units in Fig. 4 have been designed according to a well-known scheme (e.g., [42]), but any scheme for serial input/output convolution could be used.
The serial input processing mode has been chosen since it provides an connecting scheme much simpler than the scheme needed by the parallel processing mode. The four gray latches in Fig. 4 serve no functional purpose. They have been inserted only to limit the critical paths to one multiplier and one adder. In this way, we can satisfy (in a first approximation) the assumption that the clock period (i.e., the duration in time of 1 cc) is bounded by the latency of one multiplier and one adder, which is the clock period generally adopted in DWT architectures [22]-[34]. Additional latches inserted at the output of each multiplier could further decrease the minimum allowed clock period.
DDG 1 is the data dependence graph of computing ( ). From this one, the DDGs of , , and can be straightforwardly derived. In this DDG, and in the other DDGs that will be considered in the following, the columns related to the I/O data lines (e.g., , , , , etc.) show the data present on that line in a specific cc [identified by the clock cycles counter ]. The products computed at any cc in the filters are shown in the related columns. Arrows are used to denote how these products are delayed and added in order to achieve the final results.
From this DDG, we note that the term (i.e., the point-wise sum of the outputs from ) is generated by the adder and output by means of the line at the rate of one sample per cc, starting on . Similarly, in parallel and at the same rate, the terms , , and (i.e., the point-wise sum of the outputs from ; ; and ) are generated by the adders , , and and output by means of the line , , and . According to the above discussion, the block can be achieved by assembling the processors as shown in the architectural scheme of Fig. 6 . Note that we can adopt four distinct input lines since, as a consequence of the joint exploitation of Properties A and B, the 2-D input data set can be sampled at a frequency four times higher than the working frequency of the device. In this way, , , and at the global rate of four samples per cc (see column " "). These terms are added by the Row-Adder block in order to achieve the desired subband coefficients , , , (see columns , , , and ). Such a Row-Adder block can be simply implemented by four -input adder circuits (namely, , , , and ), each one performing the outer sum of (2a)-(2d).
The proposed design of is fully scalable with any value of , , and and has a 100% hardware utilization since at any cc, every processing unit is employed in effective computations.
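The split of the first-stage computation into inner sums (1-D decimated convolutions, one per row processor) and outer sums (the Row-Adder block) can be checked numerically with the following sketch. It is only a software analogy under assumed conventions: periodic boundary extension, the index form `x[2*n1 - i, 2*n2 - k]`, and the names `decimated_conv1d` and `first_stage_subband` are ours. With these conventions, kernel row `i` only ever meets input rows having the same parity as `i`, which is consistent with the even/odd subblock split described above.

```python
import numpy as np

def decimated_conv1d(row, taps):
    """1-D convolution of one image row with one kernel row, kept at even positions only."""
    n = len(row)
    return np.array([
        sum(taps[k] * row[(2 * n2 - k) % n] for k in range(len(taps)))
        for n2 in range(n // 2)
    ])

def first_stage_subband(x, h):
    """One subband computed as inner 1-D decimated convolutions plus outer row sums.

    x : 2-D array with even side lengths (periodic extension assumed).
    h : 2-D kernel; row i of h plays the role of the i-th 1-D processor.
    """
    rows, _ = x.shape
    y = np.zeros((rows // 2, x.shape[1] // 2))
    for n1 in range(rows // 2):
        for i in range(h.shape[0]):
            src = (2 * n1 - i) % rows
            # Even-numbered kernel rows see only even-numbered image rows,
            # odd-numbered kernel rows only odd-numbered ones.
            assert src % 2 == i % 2
            # Accumulating over i plays the role of the Row-Adder block.
            y[n1] += decimated_conv1d(x[src], h[i])
    return y
```

The result matches a direct evaluation of the decimated 2-D convolution, so the two halves (even-numbered and odd-numbered kernel rows) can be computed independently and merged only in the final addition.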
V. SECOND STAGE
When the particular application requires decomposition levels (as in most cases), the subband produced by has to be further transformed. In our approach, this requires an additional stage to be pipelined at the output line coming from . In this section, we consider two different designs, respectively, for and . Specifically, we show that if only two decomposition levels are needed, a block (say, ) devoted to computing the second decomposition level and having 100% efficiency for any value of , , and can be designed. Conversely, for those cases needing decomposition levels, we propose an RPA-like block (say, ) performing any decomposition level ( ). 6 Such a block has an efficiency growing with but lower than 100%.
A. Case of : The Block
First, let us consider that the throughput of the block is four times lower than that of because of the decimation. Consequently, a balanced pipeline can be achieved by providing with an amount of MACs four times smaller than that employed by . Moreover, we observe that receives the samples of in a row-by-row fashion and as a serial data stream at the rate of one sample per cc via the line coming from . This means that cannot access in parallel even-numbered and odd-numbered rows of the input as can. In order to efficiently take into account this circumstance, we subdivide each computation period of 2 cc into two phases identified by a binary select signal : phase (i.e., when receives from the even-numbered rows of , say, , ), and phase (i.e., when receives the odd-numbered rows of , say, ). Our idea consists of buffering each row that is input during the phases in order to feed it into in parallel with during the immediately following phase . As we shall show, the processors of will be effectively employed in both the phases, achieving full utilization.

5 The simultaneous input of adjacent rows has also been employed by other devices (e.g., the serial-parallel architecture proposed in [30], which, even though it processes in parallel two rows, has a latency of N + N cc). Moreover, some "separable approach"-based devices have also exploited parallel I/O (e.g., [43] and [44]). We would like to remark that neither a parallel access to the input dataset nor large buffers are strictly required in order to achieve the desired "four channels" input strategy. In fact, it is sufficient to scan the input image according to a zig-zag path as described in Fig. 5, using a sampling frequency f four times higher than the internal working frequency of the device f . Nonstandard scanning paths have also been utilized by other devices (e.g., [34] and [45]).
6 A similar approach has already been adopted by the separable architecture in [28], having a latency of N + N cc.
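The row-buffering idea described above (hold the row received during one phase so that it is available in parallel with the row received during the next phase) can be sketched in software as follows. This is only an analogy: the generator `pair_rows` and the use of a one-slot `deque` to stand in for the row-delay circuit are assumptions made for the example, and the sketch does not model the cycle-level timing of the block.

```python
from collections import deque

def pair_rows(serial_rows):
    """Yield (even_row, odd_row) pairs from a row-by-row serial stream.

    A one-slot buffer stands in for the row-delay circuit (assumption).
    """
    row_delay = deque(maxlen=1)
    for index, row in enumerate(serial_rows):
        phase = index % 2  # 0: an even-numbered row arrives, 1: an odd-numbered row arrives
        if phase == 0:
            # Hold the even-numbered row for one row period.
            row_delay.append(row)
        else:
            # The buffered even-numbered row and the incoming odd-numbered
            # row are now available side by side.
            yield row_delay[0], row
```

For instance, `list(pair_rows(["r0", "r1", "r2", "r3"]))` evaluates to `[("r0", "r1"), ("r2", "r3")]`.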
Let us consider the detailed scheme of , which is provided in Fig. 7 
where and denote, respectively, the value held at the cc by the first and by the last cell of the row-delay circuit . 7 In this way, each row can be input twice into the same processor within a period (once per phase; see column in Table II), and can be devoted to produce the terms and and the terms and , respectively, during phases and . We observe that each row is received by in a serial fashion. This means that the even-numbered and the odd-numbered samples and cannot be input in parallel, as and can at level 1. Therefore, in order to exploit Property B, the input line of (say, ) is split into two input lines (say, and ; see Fig. 8), using the filter coefficients in an interleaved fashion, as shown in Fig. 8 and clarified by DDG 2. DDG 2 is easily understandable because of its similarity with DDG 1. The reader should only take into account that during the phases , operates using the filter coefficients of in the even-numbered ccs ( ) and the filter coefficients of in the odd-numbered ccs ( ). Conversely, during the phases , operates using the filter coefficients of in the even-numbered ccs ( ) and the filter coefficients of in the odd-numbered ccs ( ).

The described block is completely scalable with , , and . Since it "consumes" its input data at the same rate as they are produced by , it is fully balanced with . Moreover, any processing unit of is employed in effective computations at any cc [except the few ccs ]; therefore, the pipe constitutes an architecture performing two levels of 2-D DWT ( ) in ccs with a 100% hardware utilization.

7 The clock cycles counter is introduced only in order to keep this context independent from the propagation delay introduced by B [we assumed = 0 when a(0) is output by B ].
B. Case of : The Block
In conventional pipelined DWT architectures, each decomposition level is performed by a single computing block (e.g., [37]-[39]) [...] On the other hand, high efficiency could be achieved by paying particular attention in designing each block (e.g., see the architecture described in [34] or Arc proposed in [46] for the 1-D case). Such an approach requires a cost (in terms of design time) growing with . An alternate strategy consists of designing a pipeline having computing blocks for any . In such a device, the th block (say, , ) will compute the decomposition level , while the last and th block (say, ) will generate all the other decomposition levels ( ). In order to get a low design cost for the whole architecture, we consider . Then, we observe that the samples of are output by in a serial data stream at the rate of one sample per cc. This means that can simply be an RPA-based architecture able to decompose in levels a serial input stream of samples in ccs. 9 Even though architectures with these characteristics already exist (see, for instance, the parallel filter proposed in [32]), in this section we provide a detailed description of a block having the same computing capabilities but requiring simpler routing and control units because of its serial input/output processing mode.

8 An alternate way of achieving the desired data flow onto I and I is to replace the two multiplexers in the Adapter with two latches triggered at a rate twice as slow as the clock signal.
Let us consider the following relation that straightforwardly derives from (1a)-(1d): [...] (6)
where the term denotes the number of multiplications/accumulations required to perform the decomposition level . Also, let each computing period of be subdivided into two phases of equal duration: phase (i.e., when receives from the even-numbered rows of ) and phase (i.e., when receives the odd-numbered rows of ). Then, since (6), evaluated in , implies that the same computing power required for producing the subbands , , , and is also sufficient to produce all the other subbands , , , and ( ), we schedule the subbands , , , and to be computed during the phases ; and all the other subbands , , , and ( ) to be computed during the phases . Consequently, we provide with a number of MACs twice that employed by , since should require half the time with respect to for computing , , , and . Therefore, two pairs of filter units (say, and ) are designed in the generic 1-D processor (see Fig. 9 [...]

A block computing the required decomposition levels can be implemented by assembling the above-described processors as shown in the architectural scheme of Fig. 10. In such a block, two -input adder circuits (say, and ) and serial-in/serial-out pipes of row delay circuits ( ) are also employed. Specifically, and perform, respectively, the outer sums of (2a)-(2c) and (2b)-(2d); the row delay circuits ( ) are shift registers composed of cells and are used to store the rows of generated by , while the row delay circuits ( ; ) are composed of cells and are employed to hold the rows of that are fed back by . This allows them to be used as input for computing the decomposition level .
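The scheduling argument above rests on the fact that, because of the decimation, each decomposition level needs a quarter of the multiply-accumulate operations of the previous one, so all levels beyond the second together need less work than the second level alone. The sketch below checks this arithmetic; it is not a reconstruction of relation (6). The function names, the assumption of a square N × N image with N divisible by the required power of two, and the assumption of square L × L kernels are ours.

```python
def macs_per_level(n, taps, level):
    """Multiply-accumulate count for one decomposition level (level >= 1).

    Assumes an n x n input image, taps x taps kernels, and four subbands
    of (n / 2**level) x (n / 2**level) coefficients each.
    """
    side = n // 2 ** level
    return 4 * side * side * taps * taps

def level2_versus_rest(n, taps, levels):
    """Return (work of level 2, total work of levels 3..levels)."""
    level2 = macs_per_level(n, taps, 2)
    rest = sum(macs_per_level(n, taps, j) for j in range(3, levels + 1))
    return level2, rest
```

For a 512 × 512 image, 4 × 4 kernels, and five levels, `level2_versus_rest(512, 4, 5)` returns `(1048576, 344064)`: the higher levels fit comfortably in a time slot as long as the one given to level 2, which is why they can share the second phase on the same filter units.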
The production of different decomposition levels is scheduled according to an RPA-like algorithm, which is proposed in the functional table of Table III . Each period is composed by several subperiods where a select signal assumes a value: , 2, 3, . When , is disabled: in these moments, even though is producing an odd-numbered row of , there are no data ready to be processed by . Conversely, a numerical value denotes which decomposition level is currently produced by . During the phases , assumes only the value 2, while assumes the values , 3, during the phases . Any subperiod ( ) takes ccs to produce one row of the four subbands of the decomposition level . During these subperiods, the rows , , and are immediately output; while the row is fed back either to (if is even, i.e., ) or to (if is odd, i.e., ), because of Property A.
Note that during a subperiod ( ), the data stored in (and only them) are employed as input by the processor . Therefore, only those registers have to be triggered at the same frequency of the working clock . By this way, clock signals ( ) broadcast to are generated in a proper Timing Unit (bottom side of Fig. 10 ) by Fig. 10 ). Besides these control signals, two other select signals are needed to control the data flow in the device: the same signal that was used in ( [ ] during the even-numbered [odd-numbered] ccs) and the signal , which is set to zero [1] during the production of even-numbered [odd-numbered] rows. The signal is derived from in the Timing Unit.
It should be remarked that timing units and control systems represent a typical cost for RPA/MRPA-based devices. Nevertheless, because of the serial input strategy of our processors, the control system of our architecture needs an routing network, three ( )-output demultiplexers, one ( )-output demultiplexers, six two-output demultiplexers, and ( )-input multiplexers. Therefore, it appears simpler than control systems of other devices: for instance, the MRPA-based parallel filter described in [32] needs an routing network, -output demultiplexers, and -input multiplexers. Comparative evaluation of the architecture will be provided in the next section.
VI. COMPARATIVE EVALUATION
In order to evaluate the proposed pipelined architecture, 10 we compare it with the parallel filter (PF) described in [32], which is (to the best of our knowledge) the fastest 2-D DWT "direct approach"-based processor. Therefore, we will consider the latency T, the hardware complexity A, the AT^2 complexity, and the efficiency (or hardware utilization) eff, assuming levels of decomposition and ( )-taps filter bases, being the dimension of the input image. 11 Since we aim to have measures independent from the specific integration technology, T and A will be evaluated, respectively, in terms of number of ccs and number of transistors. These figures of merit can be converted into "abstract values" of silicon area and time for a specific implementation when multiplied by the average area needed by a transistor and by the period of one cc, respectively. Note that due to the silicon area needed by the routing and because of some physical phenomena (e.g., load capacitances, fan-in and fan-out, etc.), "actual values" for a specific integration technology may be different from the derived "abstract values." Therefore, they can be known only by means of a detailed analysis performed on the particular implementation.
A. Computing Performance
A relevant remark concerns the computing performance of PA. Because of its particular structure allowing a throughput of 4 data/cc, PA performs a 2-D DWT in ccs, where takes into account the "startup" delay due to the propagation through the pipe. On the other hand, (PF) is approximately ccs [32]. Measures of these performances in time units can be known only by taking into account the particular adopted technology. However, in a very first approximation (considering a latch inserted between the two blocks of PA), we can assume the cc period as the latency of one multiplier plus the latency of one adder, which is the same as for PF and other architectures. Therefore, as we claimed in the introduction, PA is approximately four times faster than PF and other known architectures.

11 Here we consider square filter kernels (M = L) only in order to reduce the number of parameters. Evaluations for more general cases can be easily carried out.
As a result, PA can be employed either for performing a four times faster processing than that allowed by PF (or similar architectures) working at the same clock frequency (high-speed utilization), or for reaching the same performance of other architectures, even using a four times lower clock frequency. This second possibility allows for reducing the supply voltage and the power dissipation, respectively, by a factor of four and 16 with respect to conventional architectures (low-power utilization).
B. Complexity
A comparison between PA and PF in terms of complexity is shown in Fig. 11 for a subset [...] Fig. 11 have been estimated measuring A(PF) and A(PA) in terms of number of transistors and considering Booth's encoded multipliers and typical implementations of FULL-ADDER, of two-input MULTIPLEXER, of two-output DEMULTIPLEXER, and of D-REGISTER (e.g., by means of 26, 12, 12, and 20 transistors, respectively). As was expected, because of the pipelined structure of PA, in the worst cases, A(PA) is approximately 50% higher than A(PF) (see Fig. 12). Nevertheless, it should be remarked that in DWT applications, the required precision grows with the levels. Therefore, the only processing block of PF must achieve the precision required by the level (i.e., the highest one), while in PA might be realized with lower precision: this possibility could provide a further reduction of A(PA). Moreover, we can reasonably state that due to its serial and semisystolic structure, PA requires a routing network much simpler than that needed by PF [i.e., complexity versus complexity]: these circumstances have not been taken into account.
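To see how a larger area can still yield a lower AT^2 figure, the following sketch combines two numbers quoted in this section: a hardware complexity about 50% higher in the worst cases and a latency about four times lower. It is a back-of-the-envelope illustration only; the function names are ours, and reading the 50% figure as an area ratio of 1.5 is an assumption.

```python
def at2(area, latency):
    """Thompson's AT^2 index: area times the square of the latency."""
    return area * latency ** 2

def at2_gain(area_ratio, speedup):
    """How many times lower the AT^2 of the faster design is.

    area_ratio : area of the faster design divided by area of the slower one.
    speedup    : latency of the slower design divided by latency of the faster one.
    """
    return speedup ** 2 / area_ratio

# Worst-case figures quoted in the text (interpretation assumed):
# 1.5 times the area, 4 times lower latency.
worst_case_gain = at2_gain(1.5, 4)  # 16 / 1.5, roughly 10.7
```

Under these assumptions the AT^2 index remains roughly an order of magnitude lower even in the worst case, because the latency enters squared while the area enters linearly.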
C. Efficiency (Hardware Utilization)
Another important issue in parallel architectures is the effective hardware utilization, which can be measured by the efficiency. We can define the efficiency of a parallel architecture having computing units as the ratio between the latency of a nonparallel architecture [i.e., ] and times the latency of the parallel architecture . From this parameter, the speedup can be easily derived as the product eff . In order to simplify our analysis, we define one multiplier and one adder and one product and one sum, respectively, as one "computing unit" and one "basic computation" occurring in the 2-D DWT. Since we assumed 1 cc as the time needed by a single computing unit to perform one basic computation, the latency , measured in ccs, is simply the number of basic computations required by the direct 2-D DWT. Because the production of each coefficient of a given subband requires basic computations, 12 we have ccs
On the other hand, the global number of computing units employed by PA and PF is, respectively,
Therefore, considering PA ccs and PF ccs, eff(PA) and eff(PF) result, respectively eff PA (13a) eff PF (13b) Fig. 13 shows eff(PA) and eff(PF) for different values of . Note that because of the design of , derived from (6), eff( ) is 100% during the phases and lower during the phases . Higher values of eff(PA) can be reached by choosing higher values of . For instance, an architecture composed of the pipeline ( ) could be designed having a block employing "folded 1-D processors" (i.e., having fewer than MACs) such as those described in [46] for the 1-D DWT.
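The definitions of efficiency and speedup used in this subsection can be written as two one-line functions. This is a generic sketch of the stated definitions, not a reconstruction of (13a) and (13b); the names `t_serial`, `units`, and `t_parallel` are ours.

```python
def efficiency(t_serial, units, t_parallel):
    """Efficiency of a parallel architecture.

    t_serial   : latency (in ccs) of a nonparallel architecture.
    units      : number of computing units of the parallel architecture.
    t_parallel : latency (in ccs) of the parallel architecture.
    """
    return t_serial / (units * t_parallel)

def speedup(units, eff):
    """Speedup as the product of the number of computing units and the efficiency."""
    return units * eff
```

For example, a design with 4 computing units that finishes in 250 ccs a job needing 1000 ccs on a single unit has `efficiency(1000, 4, 250) == 1.0` and `speedup(4, 1.0) == 4`; doubling the units without reducing the latency halves the efficiency.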
VII. CONCLUSION
In this paper, we have proposed a fast pipelined architecture (say, PA) for the direct 2-D DWT. Even though, for most of the considered configurations, PA has a hardware complexity higher than that of nonpipelined architectures, it presents an AT^2 complexity (VLSI area times the square of the latency), which is considerably lower than that of other devices (up to 25 times, for , , and ). This characteristic derives from the fact that PA computes a dyadic (nonstandard) decomposition of an image in approximately ccs, while classical MRPA-based devices commonly need approximately ccs. This result has a twofold consequence: 1) PA can be employed for performing a four times faster processing than that achievable by other architectures when they work at the same clock frequency (high-speed utilization); 2) PA can be employed for reaching the same performance of conventional architectures, even using a four times lower clock frequency. This second possibility allows for reducing the supply voltage and the power dissipation, respectively, by a factor of 4 and 16 with respect to the conventional architectures (low-power utilization).

12 Actually, any subband coefficient requires L products and L - 1 sums. Nevertheless, we can reasonably approximate these operations to L basic computations.
Moreover, even though pipelined, the proposed architecture is highly efficient. Specifically, we have shown that applications needing two decomposition levels can be served by a 100% efficient approach, and that more than two decomposition levels can be realized by a hybrid pipelined/MRPA-like architecture having an efficiency of 87.5%.
