Abstract-In this paper, we propose two scalable architectures (say, Arc1 and Arc2) that perform the discrete wavelet transform (DWT) of an $N_0$-sample sequence in only $N_0/2$ clock cycles. Therefore, they are at least twice as fast as the other known architectures. Also, they have an $AT^2$ parameter that is approximately 1/2 that of already existing devices.
I. INTRODUCTION
THE DISCRETE wavelet transform (DWT) [1]-[4] is a mathematical technique that decomposes a signal in the time domain by using dilated/contracted and translated versions of a single basis function, named the prototype wavelet. In the last decade, the DWT has often been found preferable to other traditional signal-processing techniques since it offers useful features such as inherent scalability, computational complexity of $O(N)$ (where $N$ is the number of samples of the processed sequence), low aliasing distortion for signal-processing applications, and adaptive time-frequency windows. Hence, the DWT has been studied and applied to a wide range of applications: numerical analysis [5], [6], biomedicine [7], different branches of image and video processing [1], [8], [9], signal-processing techniques [10], speech compression/decompression [11], etc.
In many of these applications, real-time performance is required in order to achieve attractive results. Therefore, the implementation of the DWT by means of dedicated VLSI application-specific integrated circuits has recently attracted the attention of a number of researchers, and many DWT architectures have already been proposed [12]-[25]. Some of these devices have been targeted to have a low hardware complexity, but they require at least $2N$ clock cycles (ccs) to compute the DWT of a sequence having $N$ samples (e.g., the devices proposed in [12]-[14], the architecture A2 in [15], etc.). Nevertheless, a large number of devices having a period of approximately $N$ ccs have also been designed (e.g., the three architectures in [14] when they are provided with doubled hardware, the architecture A1 in [15], the architectures in [16]-[18], the parallel filter in [19], etc.). Most of these architectures exploit the recursive pyramid algorithm (RPA) [26] or similar scheduling techniques in order both to reduce the memory requirement and to employ only one or two filter units, independently of the number of decomposition levels to be computed. This is done by producing each output at the "earliest" instance that it can be produced [26].
The demand for low-power VLSI circuits in modern mobile/visual communication systems is growing all the time. On the other hand, the ongoing progress of VLSI technology has strongly reduced the cost of the hardware. Therefore, the possibility of reducing the period, even at the cost of additional hardware, is becoming an important issue. In fact, low-period devices are also important for low-power utilization. For instance, a device having a period of $N/2$ ccs could be employed not only to process two times faster than a device having a period of $N$ ccs but also, clocked at half the frequency, to reach the same performance as the slower device clocked at the full frequency. This circumstance allows the supply voltage and the power dissipation to be reduced, respectively, by a factor of two and four with respect to the slower device [27].
The above considerations have motivated the work of this paper, which proposes two scalable architectures having an $AT^2$ parameter that is approximately 1/2 that of already existing devices and performing the DWT of an $N$-sample sequence in only $N/2$ ccs, since they allow the input sequence to be sampled at a frequency two times higher than the working clock frequency. The proposed devices are a pipeline (namely, Arc1) and a "hybrid pipeline-RPA" architecture (namely, Arc2). Even though the pipeline paradigm has already been exploited by existing DWT architectures (e.g., those in [12] and [23]-[25]), to the best of our knowledge, none of them reaches the performance claimed in this paper. Also, it is worth noting that the existing pipelined devices are underutilized. In fact, the downsampling occurring in each DWT decomposition level greatly helps in designing RPA-based architectures (e.g., those in [14]-[17], [19], [20], [22], etc.) but makes the pipelined devices heavily underutilized, since the stage implementing decomposition level $j$ is usually clocked at a frequency $2^{j-1}$ times lower than the clock frequency used in the first level [24]. This underutilization comes from poor balancing of the pipelines when they implement the DWT and leads to low efficiency. On the other hand, the architectures Arc1 and Arc2 are highly balanced.
In addition, because the designs of the proposed architectures depend only on the number of decomposition levels $J$ and on the length of the filter $L$, we characterize the efficiencies (in other words, the hardware utilization) of Arc1 and Arc2 within a 2-D space of coordinates $(J, L)$. Such characterization shows that the efficiency of Arc1 decreases with $J$, while the efficiency of Arc2 has the opposite behavior. This study suggests defining a third architecture, which is highly efficient for any specific application identified by a point $(J, L)$: it is simply the architecture (between Arc1 and Arc2) having the highest efficiency in $(J, L)$. The efficiency of this third architecture, evaluated for a very wide subset of points in the $(J, L)$ space, is excellent, 99.1% being the average value.
The outline of this paper is as follows. In the next section, the one-dimensional (1-D) DWT is briefly recalled. The strategy leading to the design of Arc1 is illustrated in Section III. The computing blocks of Arc1 implementing the decomposition levels 1 and 2 (namely, $B_1$ and $B_2$) are described, respectively, in Sections III-A and III-B. Designs of blocks implementing decomposition levels higher than the second one are described in Section III-C. The "hybrid pipeline-RPA" architecture Arc2 is motivated and described in Section IV. Evaluations of the proposed architectures in terms of computing performance and efficiency, as well as the definition of the third, combined architecture, are given in Section V. Conclusions are summarized in Section VI.
II. THE 1-D DISCRETE WAVELET TRANSFORM
The 1-D DWT [1]-[4] is a multilevel decomposition technique. In it, each decomposition level $j$ can be seen as the further decomposition of the low-pass sequence produced by the previous level (the input sequence itself, for $j = 1$) into two subbands, both having half as many samples as the decomposed sequence. Such a decomposition is produced by two convolutions followed by a decimation by two, as depicted in Fig. 1 and formalized by (1a) and (1b), whose coefficients are those of a low-pass and of a high-pass $L$-tap filter, 2 with the samples whose index falls outside the sequence assumed to be zero. A direct consequence of the decimation by two in (1a) and (1b) is the following Property P, which assumes particular importance in the body of this paper.
Property P: Each subband produced by a decomposition level can be computed as the point-wise sum of two independent convolutions, one operating on the even-indexed samples of the decomposed sequence and the other on its odd-indexed samples, each using the correspondingly indexed coefficients of the filter.
2 The two filters may also have different numbers of taps. Nevertheless, in the literature on DWT architectures, they are considered to have an equal number of taps, since, in any case, the shorter filter can be suitably zero padded.
Because of its structure, the 1-D DWT can be straightforwardly pipelined into $J$ blocks $B_1, B_2, \ldots, B_J$, each block $B_j$ being devoted to computing decomposition level $j$. Nevertheless, from (1a) and (1b), we observe that the complexity of decomposition level $j$ is linear in the number of samples it processes, with a factor depending on $L$. Therefore, because of the decimation by two performed in each decomposition level, the complexity of level $j$ is half that of level $j-1$ (2). As a clear consequence of (2), in order to balance a pipelined DWT architecture, each block $B_j$ should employ half as many processing elements (PEs) as block $B_{j-1}$ (3), each PE being basically constituted by one multiplier and one adder. Therefore, our idea is to build a balanced architecture Arc1 constituted by a pipeline $B_1, B_2, \ldots, B_J$. Each block $B_j$ is designed carefully taking into account (3). A top-level scheme of Arc1 is given in Fig. 2.
We assume that $B_1$ employs $2L$ PEs since, as we shall see in Section III, this choice leads to the design of a block having a period of only $N/2$ ccs that is 100% efficient and scalable with any value of $L$. Because of this choice, in order to balance Arc1, from (3), $B_2$ must employ $L$ PEs and, in general, $B_j$ must employ
$$\left\lceil \frac{L}{2^{j-2}} \right\rceil \ \text{PEs} \qquad (4)$$
where $\lceil x \rceil$ denotes the rounding to the smallest integer not smaller than $x$, and takes into account the discrete nature of the PEs.
Designs of blocks $B_1$ and $B_2$ employing, respectively, $2L$ and $L$ PEs, are introduced in Sections III-A and III-B, respectively. Blocks $B_j$ having the number of PEs defined in (4) are described in Section III-C.
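The balancing rule can be illustrated with a short Python sketch. It assumes, as the surrounding text suggests, that the first block has twice as many PEs as the filter has taps and that every later block gets half the PEs of the previous one, rounded up. The function name `pes_per_block` and its parameters are hypothetical and not taken from the paper.

```python
def pes_per_block(num_levels, filter_len):
    """PEs assigned to blocks 1..num_levels of the balanced pipeline.

    Assumption: block 1 has 2 * filter_len PEs and each following block
    has half as many, rounded up to the next integer.
    """
    counts = []
    for j in range(1, num_levels + 1):
        divisor = 2 ** (j - 1)
        # Integer ceiling of (2 * filter_len) / divisor.
        counts.append(-(-2 * filter_len // divisor))
    return counts


if __name__ == "__main__":
    for taps in (4, 6):
        print(taps, pes_per_block(4, taps))
```

For a filter length that is not a multiple of the relevant power of two, the rounding assigns more PEs than the exact halving would, which is the source of the underutilization discussed later in the text.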
A. First Level of Decomposition: $B_1$
In order to design a high-speed block $B_1$ performing the first level of DWT decomposition, we consider Property P introduced in Section II. For $j = 1$, Property P means that each first-level subband can be computed by the point-wise sum of two different and independent convolutions having, as filters, the even-indexed and the odd-indexed coefficients of the corresponding filter and, as input, the even-indexed and the odd-indexed samples of the input sequence, respectively.
Therefore, four independent filter units, globally employing $2L$ PEs, can be designed and arranged in order to process the even-indexed and the odd-indexed input samples in parallel and, therefore, perform the first level of DWT at the rate of 2 samples per cc, supposing that 1 cc is the time needed by one PE to compute one product and one sum. Block $B_1$ is shown in Fig. 3 for a specific filter length. The data dependence graph (DDG) of the two filter units computing one of the two subbands is provided in Table I. From this, the DDG of the two filter units computing the other subband can be straightforwardly derived. In this DDG, and in the other DDGs that will be considered in the following, the columns related to the I/O data lines, as well as the columns related to the select signals, show the data present on that line in a specific cc (identified by the clock cycle counter). The products computed in each cc by the multipliers inside each PE are shown in the related columns. Arrows are used to denote how these products have to be added in order to achieve the final results.
The block $B_1$ has two input lines. The first feeds the even-indexed input samples into two of the filter units, while the second feeds the odd-indexed input samples into the other two. Because of the independence of the computations, the even-indexed and the odd-indexed samples are fed in parallel during the same cc at the rate of 1 sample per cc on each line. Therefore, the input sequence can be sampled at a frequency twice as high as the working frequency of the device and can be "consumed" by $B_1$ in only $N/2 + 1$ ccs (this additional cc is due to the fact that the input of the units processing the odd-indexed samples is delayed by 1 cc). One subband (i.e., the point-wise sum of the outputs from the two filter units devoted to it) is generated by an adder and output on a dedicated line at the rate of 1 sample per cc. Similarly, in parallel and at the same rate, the other subband (i.e., the point-wise sum of the outputs from the remaining two filter units) is generated by a second adder and output on a second line. The filter units have been designed according to a well-known scheme [28], but any scheme for serial input/output convolution could be used. The four latches (shown in gray in Fig. 3) at the input of the two adders have no functional reason but have been inserted in order to limit the critical paths to one multiplier and one adder. In this way, we can satisfy (to a very first approximation) the assumption that the clock period (i.e., the duration in time of 1 cc) is bounded by the latency of one multiplier and one adder, which is the standard clock period for DWT architectures [13]-[24]. Additional latches inserted at the output of each multiplier could further decrease the minimum allowed clock period.
The proposed design of $B_1$ is fully scalable with any value of $L$ and has 100% efficiency since, during each cc, all the PEs are employed in effective computations.
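The even/odd splitting that this block relies on can be checked numerically. The following Python sketch is only an illustration under an assumed causal convention, in which output sample n of a decimated filter is the sum over k of h[k] * x[2n - k] and samples outside the sequence are zero; the paper's own indexing in (1a) and (1b) may differ. All function names are hypothetical.

```python
def causal_conv(x, h, n_out):
    """First n_out samples of the causal convolution sum_i h[i] * x[n - i]."""
    out = []
    for n in range(n_out):
        acc = 0
        for i, coeff in enumerate(h):
            m = n - i
            if 0 <= m < len(x):
                acc += coeff * x[m]
        out.append(acc)
    return out


def level_direct(x, h):
    """Filter with h, then keep every second output (decimation by two)."""
    out = []
    for n in range(len(x) // 2):
        acc = 0
        for k, coeff in enumerate(h):
            m = 2 * n - k
            if 0 <= m < len(x):
                acc += coeff * x[m]
        out.append(acc)
    return out


def level_even_odd(x, h):
    """Same result as level_direct, as a sum of two half-rate convolutions."""
    n_out = len(x) // 2
    # Even-indexed samples meet the even-indexed coefficients.
    even_part = causal_conv(x[0::2], h[0::2], n_out)
    # Odd-indexed samples meet the odd-indexed coefficients, one sample late.
    odd_part = causal_conv([0] + x[1::2], h[1::2], n_out)
    return [e + o for e, o in zip(even_part, odd_part)]


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7, 8]
    taps = [1, -2, 3, 4]
    print(level_direct(samples, taps))
    print(level_even_odd(samples, taps))
```

The one-sample delay on the odd branch in this sketch plays the same role as the 1-cc delay on the odd-indexed input mentioned in the text, and each half-rate convolution only needs half of the taps, so two such units per subband can run in parallel.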
B. Second Level of Decomposition: $B_2$
When more than one level of decomposition is needed, as in most applications, the low-pass subband produced by $B_1$ has to be further transformed. This requires a second block $B_2$, which has to be pipelined to the corresponding output line coming from $B_1$. In this section, we describe how such a block $B_2$, having 100% efficiency for any value of $L$, can be designed.
As previously stated, $B_2$ has to be provided with $L$ PEs in order to constitute a balanced team with $B_1$. Therefore, a possible way of employing these PEs is to assign to each of them the computations related to both second-level subbands. This strategy is made possible by Property P introduced in Section II.
In fact, as shown in the DDG of (Table I) [ ] column of the DDG. The multipliers in the even-numbered PEs are provided with two-cell circular shift registers (CSRs) preloaded in such a way that the coefficients of one filter are used in the even-numbered ccs and the coefficients of the other filter are used in the odd-numbered ccs. Conversely, the CSRs connected to the inputs of the multipliers of the odd-numbered PEs operate in the opposite way, i.e., the coefficients of the first filter are used in the odd-numbered ccs, and the coefficients of the second filter are used in the even-numbered ccs. 5 In this way, the two second-level subbands can be produced and output via two separate lines, respectively, in the even-numbered and in the odd-numbered ccs.
4 An alternative way of achieving the desired data flow onto the two input lines is to replace the two multiplexers in the Adapter with two latches triggered at a rate twice as slow as the clock signal.
5 In practice, these circular shift registers work as multiplexers, but they are more compact. Their use, instead of the use of multiplexers, will be even more convenient in the other levels of the pipe, since, as we shall show in Section III-C, each multiplier in $B_j$ will need a $2^{j-1}$-input multiplexer, suitably replaced by a $2^{j-1}$-cell CSR.
Block $B_2$ is balanced with $B_1$ since its design has been dimensioned on $L$ PEs. Also, it is fully scalable with $L$ and has 100% efficiency.
C. Higher Levels of Decomposition: $B_j$ Having Fewer Than $L$ PEs
In this section, we describe the blocks $B_j$, which have to be implemented in Arc1 for any $j \geq 3$. In order to be "at the best" balanced with the above-described $B_1$ and $B_2$, a block $B_j$ should employ the number of PEs given by (4). Since $L$ is at least two, for $j \geq 3$ this number is smaller than $L$. Therefore, in this section, we deal with the designs of $L$-tap filter structures employing fewer than $L$ PEs.
To solve this problem, we will consider a semisystolic approach based on arrays similar to those used by $B_1$ and $B_2$, where, at a given cc, each PE processes the same input sample. An alternative approach, based on arrays in which, at a given cc, the PEs process different input samples, can be found in [29]. Independently of the approach, a direct consequence of (4) is that in $B_j$, each single PE will perform the computations related both to $2^{j-2}$ coefficients of the low-pass filter and to $2^{j-2}$ coefficients of the high-pass filter. This "folded-like computation" is made possible because $B_j$ receives a new input sample only every $2^{j-2}$ ccs. Therefore, such input data can be suitably replicated, without increasing the period, in order to allow each processor to perform the needed number of operations before new data are input from $B_{j-1}$. Also, because of this folded-like computation, the communications among the PEs will not be exclusively systolic, since a feedback of the partial results will also be required. In the following, we will explicitly refer to specific cases, but we will also describe how other cases can be derived.
Table III shows the DDG of block $B_3$. 6 The computation is periodic with a period of $2^{j-1}$ ccs (in this case, 4 ccs). The periods are subdivided into two semiperiods identified by a binary select signal, which is zero [one] when an even-numbered [odd-numbered] input sample is input into $B_j$. In the first semiperiod, each PE multiplies the input sample by $2^{j-2}$ of its filter coefficients, one per cc and alternating between the two filters, and adds these products to the data produced by itself during the previous semiperiod (feedback). In the second semiperiod, it multiplies the input sample by its remaining $2^{j-2}$ coefficients, in the same order, and adds these products to the data produced by an adjacent PE during the previous semiperiod (systolic communication). The correct use of the filter coefficients by each multiplier is simply achieved by storing the coefficients in $2^{j-1}$-cell CSRs (one CSR for each PE) in the above-specified order. For $j = 3$, each PE therefore handles two coefficients in each semiperiod. This scheduling takes into account Property P. The final results are produced at the same rate as the input, i.e., two samples per period, which means two samples every $2^{j-1}$ ccs (in this case, 4 ccs). The architecture implementing $B_3$ according to this DDG is shown in Fig. 5. The Adapter duplicates the input samples onto its output line for one semiperiod, i.e., for $2^{j-2}$ ccs (for $j = 3$, 2 ccs), as soon as they are produced by $B_{j-1}$. The multiplexers at the input of the adders select (by means of the select signal) the feedback or the systolic communication among the processors.
6 When a particular application requires a filter length $L$ smaller than the number of coefficients that the array can accommodate, the exceeding coefficients have to be replaced by zeroes. As we shall show in Section VI, such a situation will imply underutilization, leading to an efficiency lower than 100%.
The proposed scheme assumes a substantial difference for . In fact, since has to be shared by 2 coefficients (which are at least two, for ) for each one of the filters , , and , the same input data will be multiplied (in each PE) at least by two low-pass and at least by two high-pass filter coefficients. Therefore, the composition of the final result needs an external feedback between the first and the last PE of the systolic array. These concepts are clearer from an exam of the DDG of block , which is provided in Table IV . The above-mentioned external feedback (denoted by gray arrows in the graphs) concerns (2 2) every 2 data accumulated by the adder in in the first semiperiod (the remaining two data constitute a sample of and a sample of and do not need further processing). Specifically, these data are delayed by (2 2) ccs and sent in input to the adder of the last PE of the array (i.e., ). Therefore, in the semiperiod , the adder in receives the data that were generated in during the previous semiperiod. This external feedback is enabled by the combination and does not involve the last 2 ccs of the semiperiod , since in those ccs is employing the filter coefficients having index . The architecture that realizes such a data graph is represented in Fig. 6 . From the presented examples and from the above de- for any . This restriction is due to the rounding in (4) that leads to implementation in two filters having taps, where may also be greater than . As we have seen in Section III, when (6) is not verified, 2 filter coefficients, among those preloaded in , have to be zeroes and therefore, will be underutilized.
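Another consequence of the rounding in (4) is that some $J$ and some $L$ may exist such that 7
$$\sum_{j=2}^{J} \left\lceil \frac{L}{2^{j-2}} \right\rceil > 2L \qquad (7)$$
which means that the blocks $B_2, B_3, \ldots, B_J$ globally require more than $2L$ PEs.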
On the other hand, from (2) in Section III, we observe that (8a) for any . In particular, for (8b)
Therefore, it seems reasonable to us to design also a second architecture (say, Arc2), which is constituted by a pipeline having only two blocks for any value of $J$. In Arc2, the decomposition level 1 is produced by $B_1$ exactly as in Arc1, while all the levels $2, \ldots, J$ are performed by a new RPA-based block. Because of (8b), a number of PEs not bigger than $2L$ is needed by this block in order not to deteriorate the performance allowed by $B_1$. Therefore, we provide it with exactly $2L$ PEs. As an effect of this choice, Arc2 will be more efficient than Arc1 in any application requiring $J$ decomposition levels and employing $L$-tap filter bases when $J$ and $L$ satisfy (7).
7 For instance, (7) is trivially verified for any $J > L + 2$.
The design of Arc2 does not require additional descriptive details, since we observe that the samples of the first-level low-pass subband are fed into the new block by the output line of $B_1$ in a sequential data stream and at the rate of 1 sample per cc. This means that the new block can simply be an RPA-based architecture able to decompose in $J - 1$ levels a sequence of $N/2$ samples in $N/2$ ccs, employing $2L$ PEs. Many of these devices have already appeared in the literature: for instance, the architectures described in [16]-[18], the parallel filter proposed in [19], the architecture "A1" proposed in [15], the three devices described in [14] when, as the same authors suggest, they are provided with a double number of processors, etc.
Therefore, for our purposes, a scheme of Arc2 can be simply proposed in the form of Fig. 7, where the first block is the $B_1$ described in Section III-A and the second is any RPA-based architecture employing $2L$ PEs and transforming the $N/2$ samples of the first-level low-pass subband in $N/2$ ccs. It is worth noting that Arc2 (i.e., the above-described "coupled use" of $B_1$ with any already known RPA device) allows two times faster processing than does the classical RPA-based device alone.
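The trade-off between the two architectures can be explored with a small Python sketch. It rests on assumptions that are reconstructions rather than statements of the paper: the pipeline blocks for levels 2 and above are taken to have ceil(L / 2^(j-2)) PEs each, and the single RPA block that replaces them is taken to have 2L PEs. The function names are hypothetical.

```python
def pes_levels_2_to_J(num_levels, filter_len):
    """Total PEs of the pipeline blocks for levels 2..num_levels.

    Assumption: block j uses ceil(filter_len / 2**(j - 2)) PEs.
    """
    return sum(-(-filter_len // 2 ** (j - 2)) for j in range(2, num_levels + 1))


def pipeline_needs_more_pes(num_levels, filter_len):
    """True when the pipeline tail needs more PEs than an assumed
    2 * filter_len PE RPA block covering the same levels."""
    return pes_levels_2_to_J(num_levels, filter_len) > 2 * filter_len


if __name__ == "__main__":
    for levels, taps in [(3, 4), (5, 2), (6, 4)]:
        print(levels, taps, pipeline_needs_more_pes(levels, taps))
```

Under these assumptions, the comparison always comes out true once the number of levels exceeds the filter length by more than two, because every extra block costs at least one PE.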
V. PERFORMANCE EVALUATIONS AND DEFINITION OF ARC

A. Period and $AT^2$ Parameter
In order to evaluate the global time required by the proposed architectures to perform the DWT of a sequence having $N$ samples, we observe that each block $B_j$ has a period of $N/2$ ccs, since it produces a low-pass sample every $2^{j-1}$ ccs (immediately followed by the corresponding high-pass sample). As a consequence, each block needs $N/2$ ccs to perform its task (balanced computation). Therefore, the period of Arc1 is $N/2$ ccs plus a "startup" term (9), which takes into account the delay due to the propagation through the pipe (i.e., the number of ccs needed by the last block to output its first result). Nevertheless, this term can be neglected since, in practical applications, it is very much smaller than $N$. Approximately $N/2$ ccs is also the period of Arc2, when its second block is implemented by devices such as those referenced in Section IV.
Measures of the periods of Arc1 and Arc2 in absolute time can be known only by taking into account the particular adopted technology. However, to be independent of the technology, to a very first approximation (considering a latch inserted between each pair of adjacent blocks), we can assume the cc period to be the latency of one multiplier plus the latency of one adder, which is the same assumption made by other authors. Therefore, as we have already claimed in the introduction, Arc1 and Arc2 are approximately two times faster than all the other "single chip" known architectures. 8 Our architectures allow approximately the same improvement (i.e., by a factor of two) in terms of the $AT^2$ parameter, as summarized in Table V. In this table, the VLSI area has been characterized only by means of the number of multipliers and adders: such a measure is therefore only indicative. Nevertheless, it should be remarked that in DWT applications, the required precision grows with the levels. Therefore, while the only processing unit in classical RPA-based devices must achieve the precision required by level $J$ (i.e., the highest one), pipelines might be realized by blocks that in lower levels achieve lower precision. This possibility could provide a further reduction of the VLSI area of Arc1 and Arc2. Moreover, Arc1 (and Arc2, in part) are semisystolic (i.e., they do not require complex routing, as many RPA devices do) and use a controller simpler than that needed by fully RPA architectures. In fact, the multiplexers and demultiplexers employed in any block of Arc1 need a set of select signals, which are the least significant outputs of a single counter. Therefore, this counter is the only control unit needed by Arc1, since it provides all the needed select signals.
B. Efficiency
Another important issue in parallel architectures is the effective utilization of the processors, which is characterized by the efficiency. We can define the efficiency of a parallel architecture as the ratio between the period of a serial architecture (i.e., one having a single PE) and the number of PEs of the parallel architecture times its period. From the efficiency, the speedup of an architecture can be straightforwardly derived as the number of its PEs times its efficiency. In order to simplify the following analysis, we define one product and one sum as one "basic computation" occurring in the 1-D DWT. Since we assumed 1 cc as the time needed by a single PE to perform one "basic computation," the period of the serial architecture measured in ccs is simply the number of basic computations required by the 1-D DWT. Because the production of each coefficient of a given subband requires $L$ basic computations, the efficiencies of Arc1 and Arc2 follow from their numbers of PEs and from their periods.
8 Even though single-input multiple data realizations have been shown in [19], performing decompositions either in LJ or in L ccs, these devices require, respectively, 2N and N PEs, and in most applications cannot be implemented in a single chip.
As was expected, the only parameters that affect the efficiencies are $J$ and $L$ (in particular, eff[Arc2] depends only on $J$), since the designs of both architectures are independent of $N$. Therefore, we can characterize any specific problem or application by the coordinates of a 2-D space $(J, L)$, and plot eff[Arc1] and eff[Arc2] for any point of this space. Fig. 8(a) and (b) show such a study for a wide subset of points. Eff[Arc1] is 100% in all the points in which $L$ is a multiple of $2^{J-2}$, which are marked in Fig. 8(a). Nevertheless, it is worthwhile to note the following.
1) For a given value of $L$, eff[Arc1] decreases as $J$ increases.
2) eff[Arc2] is independent of $L$ and increases with $J$.
In other words, Arc1 and Arc2 compensate each other in terms of efficiency. This circumstance is visually evident in Fig. 8(a) and (b). In fact, in Fig. 8(a), the white symbols label lower efficiencies than those labeled by the black symbols, which is the opposite of Fig. 8(b).
VI. CONCLUSIONS
This result is twofold. First, Arc1 and Arc2 can be employed for performing two times faster processing than that allowed by other architectures working at the same clock frequency (high-speed utilization). Second, they can be employed, even using a two times lower clock frequency, but reaching the same performance as other architectures. This second possibility allows for reducing the supply voltage and the power dissipation, respectively, by a factor of two and four with respect to the other architectures (low-power utilization). These results have been achieved by means of pipelining. Even though other architectures have already exploited the pipelined approach, they do not reach the performance of the proposed architectures, and they result in heavy underutilization. In fact, the stage implementing decomposition level $j$ in pipelined architectures is usually clocked at a frequency $2^{j-1}$ times lower than the clock used in the stage performing decomposition level 1. Conversely, our architectures have been designed taking into account the balancing of the pipeline. Therefore, they are highly efficient. Specifically, we have shown that the efficiency of Arc1 decreases with the number of levels $J$, while the efficiency of Arc2 grows with $J$. Therefore, as a conclusive result, an impressively efficient architecture has been defined, which is simply the architecture between Arc1 and Arc2 having the highest efficiency when it implements an $L$-tap filter based DWT in $J$ decomposition levels.
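The efficiency comparison of Section V-B can be reproduced with a short Python sketch. The closed forms below are a reconstruction, not the paper's own equations: they assume a period of N/2 ccs for both architectures (startup delay neglected), 2L basic computations per pair of output samples, ceil(2L / 2^(j-1)) PEs in pipeline block j, and 4L PEs in total for the hybrid architecture. All function names are hypothetical.

```python
from fractions import Fraction


def work_per_input_sample(num_levels, filter_len):
    """Basic computations per input sample over all levels (N cancels out)."""
    return 2 * filter_len * (1 - Fraction(1, 2 ** num_levels))


def eff_pipeline(num_levels, filter_len):
    """Assumed efficiency of the fully pipelined architecture."""
    pes = sum(-(-2 * filter_len // 2 ** (j - 1)) for j in range(1, num_levels + 1))
    # Serial period divided by (PEs times a parallel period of N/2 ccs).
    return work_per_input_sample(num_levels, filter_len) / (pes * Fraction(1, 2))


def eff_hybrid(num_levels, filter_len):
    """Assumed efficiency of the hybrid architecture with 4 * filter_len PEs."""
    pes = 4 * filter_len
    return work_per_input_sample(num_levels, filter_len) / (pes * Fraction(1, 2))


def best_efficiency(num_levels, filter_len):
    """Efficiency of whichever architecture is better at this point."""
    return max(eff_pipeline(num_levels, filter_len), eff_hybrid(num_levels, filter_len))


if __name__ == "__main__":
    for levels in range(2, 8):
        print(levels, float(eff_pipeline(levels, 4)), float(eff_hybrid(levels, 4)))
```

Under these assumptions the hybrid efficiency reduces to 1 - 2^(-J), which does not depend on the filter length and grows with the number of levels, while the pipeline stays at 100% as long as no rounding occurs in the PE counts, in line with the two observations listed in the text.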
