This paper presents a wide range of algorithms and architectures for computing the 1-D and 2-D Discrete Wavelet Transform (DWT), and the 1-D and 2-D Continuous Wavelet Transform (CWT). The algorithms and architectures presented here are independent of the size and nature of the wavelet function. New on-line algorithms are proposed for the DWT and the CWT which require very little storage. The proposed systolic array and parallel filter architectures implement these on-line algorithms, and are optimal both with respect to area and time (under the word-serial model). Moreover, these architectures are very regular, and support single chip implementations in VLSI. The proposed SIMD architectures implement the existing pyramid and à trous algorithms, and are optimal with respect to time.
Introduction
In the last few years there has been a great amount of interest in wavelet transforms, and the interest has been highly inter-disciplinary. The wavelet transform can be viewed as a decomposition of a signal in the time-scale plane. There are several types of wavelet transforms depending on the nature of the signal (continuous or discrete) and the nature of the time and scale parameters (continuous or discrete). In this paper we focus on the realizations of the Discrete Wavelet Transform (DWT) and the Continuous Wavelet Transform (CWT) as defined in [11]. According to this definition, the time and scale parameters as well as the input and output signals of both the transforms are discrete; the two transforms differ mainly in the manner in which the time-scale plane is tiled. The CWT and the DWT have a very wide range of applications: from signal analysis and signal coding to numerical analysis [4], [6], [8], [14]. The large application domain of these transforms makes the study of their implementations in VLSI very important. In this paper we develop efficient algorithms and architectures for computing the 1-D and 2-D DWT, and the 1-D and 2-D CWT. The proposed implementations are independent of the size and nature of the wavelet. This feature makes them very attractive, since the wavelet function can be varied with the application requirements.
There exist only a handful of VLSI architectures for wavelet transforms. The first architecture for computing the DWT was designed by Knowles [5]. This architecture was not well suited for VLSI since it used large multiplexers for routing the intermediate results. Later, Lewis and Knowles [7] designed an architecture for computing the 2-D DWT with the Daubechies 4-tap filter as the wavelet function. A drawback of their architecture was that it was heavily dependent on the properties of the specific wavelet, and did not work efficiently for other wavelets. Aware Inc. has developed a chip for the DWT called the Wavelet Transform Processor (WTP) [1]. The WTP essentially consists of a 4-tap filter and some external memory, and relies on software for computing the DWT. Recently, Parhi and Nishitani have proposed folded architectures and digit-serial architectures for the 1-D DWT [10]. Their architectures require a minimum number of registers, but do not scale easily with the filter size and the number of octaves computed. Earlier, we proposed systolic architectures for computing the DWT which are optimal (asymptotically) in the word-serial model [17]. These architectures require 2N cycles (versus N cycles in the architectures proposed here) to compute N outputs. As far as the CWT is concerned, we are not aware of any other work on mapping the CWT into VLSI.
the parallel filter, and the SIMD array architectures for 1-D and 2-D DWT, while Sections 6 and 7 describe the same set of architectures for 1-D and 2-D CWT.
Preliminaries
The Wavelet Transform (WT) of a signal $\hat{x}(t)$ is given by

$$w(u, s) = \int \hat{x}(t)\, \hat{h}\!\left(\frac{t - u}{s}\right) dt$$

where $\hat{h}\!\left(\frac{t-u}{s}\right)$ is the wavelet corresponding to the 'prototype' wavelet $\hat{h}(t)$ at scale $s$ and time $u$. The Wavelet Transform of a sequence $x(i)$ (a sampled version of the continuous signal $\hat{x}(t)$), discretized on a grid whose samples are arbitrarily spaced both in time $b$ and scale $a$, is given by [11]

$$W(b, a) = \frac{1}{|a|^{1/2}} \sum_{i=b}^{aL+b-1} x(i)\, h\!\left(\frac{i - b}{a}\right) \qquad (1)$$

where $L$ is the size of the support of the basic wavelet $h$, and $h$ is obtained by sampling $\hat{h}(t)$. In addition, $N$ is the number of input samples and $J$ is the number of scales.
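As an illustration of equation (1), the following minimal Python sketch evaluates W(b, a) by direct summation. The function name `wavelet_coefficient`, the zero-padding of x beyond its last sample, and the box-shaped prototype in the usage lines are assumptions made for illustration only; they are not part of the definition in [11].

```python
import math

def wavelet_coefficient(x, h_hat, L, b, a):
    """Evaluate W(b, a) of equation (1) by direct summation.

    x     : list of input samples x(0), ..., x(N-1)
    h_hat : callable prototype wavelet, sampled here at (i - b) / a
    L     : size of the support of the basic wavelet
    """
    total = 0.0
    for i in range(b, a * L + b):      # i = b, ..., aL + b - 1
        if i < len(x):                 # assumption: x is zero beyond its end
            total += x[i] * h_hat((i - b) / a)
    return total / math.sqrt(abs(a))

# Hypothetical box-shaped prototype with support [0, 2), used only as a demo.
box = lambda t: 1.0 if 0 <= t < 2 else 0.0
w = wavelet_coefficient([1.0] * 8, box, L=2, b=0, a=2)
```

With a = 2 and L = 2 the sum runs over four input samples, which shows how the support of the dilated wavelet grows linearly with the scale a.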
In this paper we focus on two special cases of the WT, the Discrete Wavelet Transform (DWT) and the Continuous Wavelet Transform (CWT).
DWT: The DWT can be viewed as the multiresolution representation of a sequence x(n) with N/2 values at the highest resolution, N/4 values at the next resolution, and so on [8]. The structure of the DWT is due to the dyadic nature of its time-scale grid. Here $a = 2^k$, where k is an integer, and b is a multiple of a. If N is a power of 2, then $J \le \log N$, and $1 \le k \le J$. At each scale k, the number of samples in the time dimension is $N/2^k$.
CWT: The CWT is defined as a WT with no decimation in time and scale (see equation 1).
However, in this paper we consider a commonly used version of the CWT where the scale resolution is logarithmic, as in the case of the DWT [11]. In other words, $a = 2^k$ for $1 \le k \le J$, and $J \le \log N$. At each scale k, the number of samples in the time dimension is N.
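The difference between the two tilings can be made concrete by counting the outputs at each scale. The short sketch below (function names are illustrative, and N is assumed to be a power of 2) lists the number of time samples per scale for both transforms.

```python
def dwt_samples_per_scale(N, J):
    """DWT: N / 2^k samples at scale k, for k = 1, ..., J."""
    return [N // 2**k for k in range(1, J + 1)]

def cwt_samples_per_scale(N, J):
    """CWT (logarithmic scale resolution): N samples at every scale."""
    return [N] * J

print(dwt_samples_per_scale(16, 4))
print(cwt_samples_per_scale(16, 4))
```

For N = 16 and J = 4 the DWT produces 15 values in total across the four scales, whereas the CWT produces NJ = 64.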
Note that the properties of the wavelet transform are heavily dependent on the properties of the basic wavelet. The architectures developed in this paper are independent of the wavelet function and are hence flexible.
Lower bounds
In this section we present our earlier results on lower bounds for computing the Wavelet Transform.
Note that under the word-serial model, the period for an N-input N-output function is lower bounded by N. Also, under this model, the computation delay is lower bounded by the period. Lower bounds for the 2-D case can be obtained by simply replacing N by $N^2$ and L by $L^2$ in the bounds for the 1-D case. (This is in contrast to the 1-D and 2-D DFT case, where a different technique had to be used to obtain the bounds for the 2-D DFT.) A point to be noted is that for the 2-D case, the word-serial model does not place any restriction on the order of the inputs as long as they are input in a word-serial manner. However, in practical imaging systems, where the digital image is available in a raster-scan format, an additional constraint has to be introduced in the derivation of the lower bounds. It is conjectured that for the 2-D DWT, when the image is available in the raster-scan format, the area is bounded by $A = \Omega(NLk)$, and the time is bounded by $T = \Omega(N^2)$.
2 The period is defined as the time between the initiation of two consecutive sets of computations.
3 The word-serial model is one in which at any time instant, at most one input (output) word has some, but not all, of its bits already read (written).
The proposed systolic array and parallel filter architectures for 1-D and 2-D DWT satisfy the word-serial model, and are optimal both in terms of area and time. The proposed systolic and parallel filter architectures for 1-D CWT are also optimal under the word-serial model for the case when only 1 output is computed every cycle.
3 On-line Algorithms
Recursive Pyramid Algorithm
The Recursive Pyramid Algorithm (RPA) is a running algorithm for computing the 1-D DWT [16]. The RPA is not only efficient from a VLSI (single chip) point of view, but it can also be implemented very efficiently on existing programmable DSP chips without any buffering. We give a brief description here for completeness. The DWT can be computed in an efficient manner on general purpose computers and SIMD machines using the pyramid algorithm (PA) developed by Mallat [8], [9] (see Figure 1). However, there are a number of applications of the DWT where a 'running' (or real time/on-line) implementation is desirable. Running implementations of the PA for an N-point DWT require either O(N) storage or a cascade of log N computation units.
Both these alternatives are expensive, especially since the input sequence in real-time applications is quasi-infinite. The Recursive Pyramid Algorithm (RPA) computes the N-point DWT in real time using a single computation unit and $L(\log N - 1)$ cells of storage, where L is the length of the wavelet filter (QMF) and generally $L \ll N$. The latency of the computation unit is assumed to be one time unit. The RPA consists of rearranging the order of the N outputs such that an output is scheduled at the 'earliest' instance that it can be scheduled. The earliest instance is based on the following precedence relation: if the earliest 'instance' of the ith octave clashes with that of the (i + 1)th octave, then the ith octave gets scheduled first. In the resulting schedule, the first octave computations take place every other cycle, and the higher octave computations are scheduled between the first octave computations. Let $y_i(n)$ be the nth lowpass output of the ith octave. Then the output schedule generated by the RPA for N = 16 and J = 4 is as follows: $y_1(1), -, y_2(1), -, y_1(2), -, y_3(1), -, y_1(3), -, y_2(2), -, y_1(4), -, y_4(1), -, y_1(5), -, y_2(3), -, y_1(6), \ldots$. A '$-$' signifies that an output is not generated in that clock cycle.
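The order in which the lowpass outputs appear in the schedule above can be reproduced with a few lines of Python. This is a sketch under an assumption that is not stated in the text: ignoring the '−' entries, the tth output slot is taken to host octave 1 + (number of trailing zero bits of t), which matches every entry of the N = 16, J = 4 listing. The function name `rpa_schedule` and the (octave, index) tuple format are illustrative.

```python
def rpa_schedule(N, J):
    """Return the RPA output order for slots t = 1..N.

    Each entry is (octave, n), meaning y_octave(n), or None for an idle slot.
    Idle '-' cycles of the listed schedule are not modelled here.
    """
    schedule = []
    for t in range(1, N + 1):
        # (t & -t) isolates the lowest set bit; its bit length is
        # 1 + the number of trailing zero bits of t.
        octave = (t & -t).bit_length()
        if octave > J:
            schedule.append(None)          # no octave claims this slot
        else:
            n = ((t >> (octave - 1)) + 1) // 2
            schedule.append((octave, n))
    return schedule

print(rpa_schedule(16, 4))
```

Octave 1 occupies every odd slot, octave 2 every fourth slot, and so on, so lower octaves always win a clash, as required by the precedence relation.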
Modified Recursive Pyramid Algorithm
The Modified Recursive Pyramid Algorithm (MRPA) is a version of the RPA that is suitable for parallel filter and other implementations which have a latency that is greater than 1. The main difference from the RPA is that the earliest instance at which an octave output can be scheduled now depends not only on the scheduling times of the lower octaves, but also on the latency of the computation unit.
In the MRPA scheduling, the lower octaves are scheduled before the higher octaves, in order to avoid possible clashes. If the first output of any octave is scheduled such that there is no conflict with any of the lower octave outputs, then it is guaranteed that there will be no conflicts for all the outputs of that octave. We next derive a general formula for the time when the inputs for the ith octave computation are fed into the computation unit.
Let off(k) be the additional delay in the scheduling of the first output of the kth octave that is caused by scheduling conflicts with octaves smaller than k, and let T be the latency of the computation unit. Then the times at which the inputs for the ith octave computation are fed depend on T and on off(2), ..., off(i), and successive inputs of the ith octave are spaced $2^i$ cycles apart. We explain this procedure with the help of the following example. Let the latency of the computation unit be T = 4. Then off(2) is the smallest positive integer such that $\mathrm{off}(2) \ne 2m_1 - 4$. Thus off(2) = 1. Similarly, off(3) is the smallest positive integer such that (i) $\mathrm{off}(3) \ne 2m_1 - 9$ and (ii) $\mathrm{off}(3) \ne 4m_2 - 4$. Thus off(3) = 2. In this example, the inputs to the first octave are fed at times $1 + 2m_1$, $0 \le m_1 < N/2$, the inputs to the second octave are fed at times $6 + 4m_2$, $0 \le m_2 < N/4$, the inputs to the third octave are fed at times $12 + 8m_3$, $0 \le m_3 < N/8$, and so on.
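The example can be checked numerically. The sketch below assumes, as a reading of the example rather than a formula taken from the text, that the first input of octave i is fed T + off(i) cycles after the first input of octave i − 1, and that off(i) is the smallest positive integer for which this time does not coincide with any feed time of a lower octave. The function name `mrpa_start_times` is illustrative.

```python
def mrpa_start_times(T, J):
    """Return (starts, offs) for octaves 1..J with latency T.

    starts[i-1] is the time of the first input of octave i; subsequent
    inputs of octave i follow every 2**i cycles.
    offs lists off(2), ..., off(J).
    """
    starts = [1]                 # octave 1 is fed at times 1 + 2*m
    offs = []
    for i in range(2, J + 1):
        off = 1                  # off(i) must be a positive integer
        while True:
            t = starts[-1] + T + off
            # Clash if t lands on a feed time start_k + 2**k * m of octave k < i.
            clash = any((t - starts[k - 1]) % 2**k == 0 for k in range(1, i))
            if not clash:
                break
            off += 1
        starts.append(t)
        offs.append(off)
    return starts, offs

print(mrpa_start_times(4, 3))
```

For T = 4 this reproduces off(2) = 1, off(3) = 2 and the feed times 1 + 2m, 6 + 4m and 12 + 8m quoted in the example.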
On-line CWT Algorithm
The On-line CWT Algorithm (OCA) is a running algorithm for computing the 1-D CWT. Here the NJ outputs (N outputs per octave) are scheduled in such a way that the output of a particular octave is produced once every J cycles. Such a scheduling scheme requires that the input be fed at the rate of one every J cycles. As in the case of the RPA and the MRPA, the outputs are scheduled at the earliest possible instance. If the latency of the computation unit is 1, then there are no clashes, and the outputs of the various octaves are scheduled one after the other. For instance, if J = 4, then the outputs are scheduled as follows: $y_1(1), y_2(1), y_3(1), y_4(1), y_1(2), y_2(2), y_3(2), y_4(2), y_1(3), \ldots$ and so on. If the latency of the computation unit is greater than 1, then the outputs have to be scheduled in such a way that, in case of a clash, the lower octave outputs have precedence over the higher octave outputs. Once the first output of any octave is scheduled such that there is no conflict with any of the lower octaves, the remaining outputs of that octave are scheduled every J cycles from then on. In an example with J = 4 and latency T = 4, the outputs of the various octaves are scheduled as follows: $y_1(1), -, -, -, y_1(2), y_2(1), -, -, y_1(3), y_2(2), y_3(1), -, y_1(4), y_2(3), y_3(2), y_4(1), y_1(5), y_2(4), y_3(3), y_4(2), y_1(6), \ldots$.
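Both schedules above can be generated by a small simulation. The sketch assumes that the first output of octave k can be scheduled no earlier than T cycles after the first output of octave k − 1, and that it is then pushed to the first slot not already claimed (modulo J) by a lower octave; this rule is inferred from the two examples and the names are illustrative.

```python
def oca_schedule(J, T, slots):
    """Return the first `slots` entries of the OCA output schedule.

    Entries are strings such as 'y2(3)', or '-' for an idle cycle.
    """
    starts = [1]                             # y1(1) is produced in slot 1
    for k in range(2, J + 1):
        t = starts[-1] + T                   # earliest slot allowed by the latency
        while any((t - s) % J == 0 for s in starts):
            t += 1                           # lower octaves have precedence
        starts.append(t)

    schedule = []
    for t in range(1, slots + 1):
        entry = "-"
        for k, s in enumerate(starts, start=1):
            if t >= s and (t - s) % J == 0:  # octave k repeats every J cycles
                entry = f"y{k}({(t - s) // J + 1})"
        schedule.append(entry)
    return schedule

print(", ".join(oca_schedule(J=4, T=4, slots=21)))
```

With T = 1 the same function yields the clash-free round-robin order of the first example.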
Like the RPA, the OCA is not only efficient from a VLSI (single chip) point of view, but it can also be implemented very efficiently on existing programmable DSP chips without any buffering. Pseudo-code for the OCA, when the latency of the computation unit is 1, is included in Appendix B. In that implementation, the last dc term is not computed.
1-D Discrete Wavelet Transform
In this section we describe parallel filter, systolic array, SIMD array and SIMD multigrid architectures for the 1-D DWT. The parallel filter and the systolic array architectures are optimal with respect to area and time under the word-serial model. We consider both these architectures, since this allows the designer to trade off (within a constant factor) between period, chip area and clocking frequency.
Parallel Filter Architecture
The parallel filter architecture for the 1-D DWT implements the MRPA (see Section 3.2). The first octave outputs are computed every other cycle, and all higher octave outputs are computed between two first octave output computations. The main components of this architecture are two parallel filters to compute the low pass and the high pass outputs, and a storage unit of size LJ to store the inputs that are required for the computation of the J octave outputs. Figure 3 gives the block diagram of an architecture for computing J octaves.
Parallel filters: Each parallel filter consists of L fixed multipliers and a tree of (L − 1) adders to add the products. The latency of the parallel filter is $T_m + T_a \log L$, where $T_m$ is the time taken to do a multiplication, and $T_a$ is the time taken to do an addition. The latency plays an important role in scheduling the octave outputs (see Section 3.2).
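As a quick numerical check of the latency expression, the sketch below evaluates $T_m + T_a \log L$ with the logarithm taken to base 2 and rounded up to the depth of the adder tree. The rounding for filter lengths that are not powers of 2, and the function name, are assumptions made for illustration.

```python
import math

def parallel_filter_latency(L, T_m, T_a):
    """Latency of one L-tap parallel filter: one multiplication followed by
    an adder tree of depth ceil(log2(L)) (assumed for non-power-of-2 L)."""
    return T_m + T_a * math.ceil(math.log2(L))

# Hypothetical timings: multiplication takes 2 cycles, addition takes 1 cycle.
print(parallel_filter_latency(L=4, T_m=2, T_a=1))
```

The value returned here is the latency T that enters the MRPA scheduling.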
Storage unit: The storage unit consists of J serial-in parallel-out shift registers, each of length L. The ith shift register stores the inputs required for the ith octave computation, $1 \le i \le J$. An output is written into the ith shift register every $2^{i-1}$ cycles, and all L elements of the shift register are fed to the parallel filters every $2^i$ cycles. Since the number of outputs of the ith octave is half the number of outputs of the (i − 1)th octave, the ith shift register is clocked at half the rate of the (i − 1)th shift register. The time when the first output is written into or read out of a shift register is determined by the MRPA scheduling. For instance, in the example when L = 4, the first octave outputs are written into the first shift register at time instants 5, 7, 9, 11, ..., and fed to the parallel filter at time instants 6, 10, 14, 18, .... Control signals $\{c_i\}$ and $\{a_i\}$ are generated at appropriate times to AND the signals into or out of the shift register.
The hardware components of this architecture include 2L multipliers, 2(L − 1) adders, JL storage units and a control unit to generate the appropriate control signals. The computation time as well as the period is N. Note that the parallel filter architecture can be clocked at high sample rates by pipelining the multipliers and the adders. In such a case, the latency increases, resulting in a different MRPA scheduling. The size of the storage unit does not increase, since the value of off(k) is bounded by $2^{k-1}$.
Systolic Architecture
Systolic architectures for computing the 1-D DWT which implement the RPA have been presented in [17]. We give a brief description of the architecture in [17] for the sake of comparison. The architecture consists of a linear systolic array to compute both the low pass and the high pass outputs, and a storage unit to store the inputs for higher octave computations. The x-inputs that are required for the first octave computation are fed in alternate cycles to one end of the array, while the inputs that are required for the higher octave computations are fed in parallel from the storage unit. This architecture requires an area O(LJk), and computes the DWT with a delay (and period) of 2N cycles, the first output appearing only 1 cycle after the first input. The architecture satisfies the word-serial model and is optimal both with respect to area and time.
We next propose a modification of the systolic architecture of [17] which computes the DWT with a delay of only N cycles. The modified architecture consists of two linear arrays, one to compute the low pass outputs and the other to compute the high pass outputs. The x-inputs are first fed into the storage unit, and then loaded in parallel to the two systolic arrays. The order in which the outputs are computed here is the same as in the previous architecture. The area is also O(LJk).
SIMD Linear Array
We propose an implementation of the 1-D DWT on a reconfigurable SIMD architecture [3]. The architecture consists of a linear array of N processors where each processor contains a reconfigurable switch. If the switch is set to 1, the data passes through it without any delay. We refer to the processors which do multiply-add or which pass data to a neighboring processor with a delay of one cycle as 'active' processors. The reconfigurable switches can be used to reconfigure the N processor array to a smaller array of active processors. Since the configuration patterns are known a priori, they can be stored in each processor. A similar implementation has been proposed in [12].
In the proposed scheme, for the mth octave computation, $1 \le m \le J$, the N processor array is reconfigured to an array of size $N/2^{m-1}$ with processors $P(2^{m-1} j)$ being active, $0 \le j < \lfloor N/2^{m-1} \rfloor$. Figure 4 describes the interconnection for the computation of 3 octaves on an 8-processor array. In this scheme, all the outputs of any particular octave are computed at the same time. The high pass and the low pass filter coefficients are broadcast to each processor. Each processor computes the products of the data elements and the filter coefficients, and updates the partial results of the low pass and the high pass outputs. The data element is then sent to its right neighbor.
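The reconfiguration pattern can be listed explicitly. The sketch below (with an illustrative function name and processors indexed from 0) returns the indices of the active processors $P(2^{m-1} j)$ for the mth octave.

```python
def active_processors(N, m):
    """Indices of the active processors for the mth octave (1 <= m <= J)."""
    step = 2 ** (m - 1)
    return [step * j for j in range(N // step)]   # 0 <= j < floor(N / 2^(m-1))

for m in (1, 2, 3):
    print(m, active_processors(8, m))
```

On an 8-processor array the three octaves use 8, 4 and 2 active processors respectively, the inactive ones acting as pass-through switches.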
The computation time for computing J octaves in this architecture is LJ and the period is also LJ. The area of each processor is O(k) and the overall area complexity is O(Nk).
SIMD multigrid
The 1-dimensional DWT of N points can be efficiently mapped onto a multigrid architecture of size N. The multigrid architecture of size N consists of (log N + 1) levels, with $N/2^i$ processors in level i, $0 \le i \le \log N$. Figure 5 describes the multigrid architecture for N = 8. The mapping is as follows: the ith level of the multigrid computes the (i + 1)th octave outputs (high pass and low pass) and sends the low pass outputs to level (i + 1), $0 \le i < \log N$.
The procedure to compute the ith octave outputs is the same as that of the SIMD linear array.
The high pass and the low pass filter coefficients are broadcast to each processor, and each processor updates the partial results of the low pass and the high pass outputs by the products of the data element and the corresponding filter coefficients. The computation time for computing J octaves in this architecture is O(LJ), and the area is O(Nk log N). Since each level of the multigrid operates on a different set of inputs, the period of this design is only L, compared to LJ for the linear array architecture.
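The level structure of the multigrid can be enumerated directly. The following sketch assumes N is a power of 2 and uses an illustrative function name; it lists the number of processors in levels 0 through log N.

```python
def multigrid_levels(N):
    """Processors per level of a size-N multigrid: N / 2^i for i = 0..log N."""
    levels = []
    size = N
    while size >= 1:
        levels.append(size)
        size //= 2
    return levels

print(multigrid_levels(8))
```

For N = 8 this gives log N + 1 = 4 levels, and level i feeds its low pass outputs to the half-sized level i + 1.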
The period of the multigrid architecture can be reduced to 1 cycle by increasing the number of processors in each level by a factor of L. In the modified multigrid, level i consists of L sublevels with $N/2^i$ processors in each sublevel, $0 \le i < \log N$. The interconnection among the processors between 2 levels as well as between sublevels is described in Figure 6. In each level, there are two types of processors: active processors, described by label A, and inactive processors. Each active processor has a filter coefficient assigned to it. It computes the products of its input data with the assigned filter coefficients and updates the low-pass and the high-pass partial sums. It sends the partial sums to its active south neighbor and the input data to its inactive south-east neighbor. Each inactive processor passes the input data to its active south-east neighbor. This architecture has an area of A = O(NLk). The computation time is LJ, and the period is 1.
Table 1 compares the number of multipliers, the asymptotic area and period for the various architectures for 1-D DWT. Note that both the systolic array and the parallel filter match the area and time lower bounds for 1-D DWT (in the word-serial model), and are thus optimal with respect to both area and time.
New data are shifted into the last row of each unit, and when the last row is filled, the data shift up by one row in the array. Each shift register unit has two clocks associated with it: a row clock and a column clock. The ith unit row clock has a period $t_r(i)$ that is $2^{i-1}$ times the sample period $t_s$, $t_r(i) = 2^{i-1} t_s$, and has a duty cycle that is $2^{1-i}$ times that of the sample clock. The ith unit column clock has a period $t_c(i)$ that is N times $t_r(i)$, $t_c(i) = N t_r(i)$, and a duty cycle of $2^{1-i}$. Data is written into a shift register unit when both the row clock and the column clock are high. $L^2$ data are read out of the unit at time instants when every second row clock and every second column clock are high.
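The two clock periods of the ith shift register unit follow directly from $t_r(i) = 2^{i-1} t_s$ and $t_c(i) = N t_r(i)$. A minimal sketch, with an illustrative function name and the sample period defaulting to one time unit:

```python
def unit_clock_periods(i, N, t_s=1):
    """Return (t_r, t_c), the row and column clock periods of the ith unit."""
    t_r = 2 ** (i - 1) * t_s    # row clock: 2^(i-1) sample periods
    t_c = N * t_r               # column clock: N row-clock periods
    return t_r, t_c

for i in (1, 2, 3):
    print(i, unit_clock_periods(i, N=8))
```

Each successive unit is therefore clocked at half the rate of the previous one in both the row and the column direction.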
The time when the first output of any octave is written into or read out of the shift register unit is determined by the MRPA in a way similar to that of the 1-D DWT. Control signals are generated at appropriate times to AND the data into and out of the shift register unit.
Comparisons
The hardware components of this architecture include $2L^2$ multipliers, $2(L^2 - 1)$ adders, NL storage cells, and a control unit. The computation time as well as the period is $N^2$.
Systolic-Parallel Architecture
A systolic-parallel architecture for computing the 2-D separable DWT has been presented in [17]. We give a brief description here for the sake of comparison. This architecture computes the 2-D pyramid algorithm using the RPA. It consists of a systolic filter, a parallel filter, and a bank of registers. It requires area proportional to NLk and computes the 2-D DWT with a delay (and period) of $N^2 + N$ cycles, the first output appearing 1 cycle after the first input.
SIMD 2-D array
The proposed implementation consists of a 2-D array of $(N \times N)$ SIMD processors which are reconfigured to form smaller arrays of active processors for higher octave computations. Specifically, the array is reconfigured to form an array of size (N
In this scheme, all the outputs of a particular octave are computed at the same time. One possible way of computing the outputs of a particular octave is to have the pixels move and the partial results stay. Then every alternate active processor needs to participate in updating the partial results. The other possible way is to have the pixels stay and the partial results move. In this case computation should be initiated in every other active processor. The area of this architecture is $O(N^2 k)$. The computation time as well as the period is $L^2 J$.
Table 2 compares the number of multipliers, the asymptotic area and period for the various architectures for 2-D DWT. Note that both the systolic-parallel architecture and the parallel filter match the area and time lower bounds for 2-D DWT (in the word-serial model with raster-scan input), and are thus optimal with respect to both area and time. Note that the systolic-parallel architecture handles the separable case, while the parallel filter architecture handles the non-separable case.
Comparisons

1-D Continuous Wavelet Transform
In this section we describe architectures for computing the 1-D CWT which are based on the reorganized "à trous" computational structure of [11]. Figure 9 describes such an implementation. In this implementation there are $2^{j-1}$ cells at the jth octave, with each cell working at $1/2^{j-1}$ times the rate of the cell of the first octave. This is because the inputs to the jth octave cell are subsampled by $2^{j-1}$ compared to the original input.
Parallel Filter and Systolic Array Architectures
In this subsection we map the algorithm of [11] into systolic and parallel filter architectures. We first describe architectures where all the elementary cells at the jth octave are combined to form a single computation unit. In such a realization, J outputs are computed simultaneously. Next, we describe word-serial architectures which have only 1 computation unit, and compute only 1 output at a time.
J outputs are computed every cycle:
In this case, both the systolic array and the parallel filter architectures consist of J computation units; each unit consists of one low pass and one high pass filter. The output of the low pass filter of the jth computation unit is sent to the (j + 1)th computation unit.
Parallel Filter: In the parallel filter architecture, the inputs to the jth computation unit have to be subsampled by a factor of $2^{j-1}$. This is achieved by storing the data in a delay line of size $2^{j-1}L$, and tapping the delay line every $2^{j-1}$ delay units. Figure 10
The block diagram of this architecture is described in Figure 12a. Both the parallel filter and the systolic array architectures consist of a computation unit to compute the low pass and the high pass filter outputs, and a storage unit to store the inputs required for the computation of the octave outputs. The storage unit consists of J subunits of storage cells; the first subunit stores L x-inputs, the second subunit stores 2L $y_1$-outputs, the third subunit stores 4L $y_2$-outputs, and so on. The number of cells in the jth subunit is $2^{j-1}L$, since the inputs for the computation of the jth octave are subsampled by a factor of $2^{j-1}$. All the cells of the storage unit are clocked uniformly (in contrast to the storage unit for the 1-D DWT). One data item is read into the storage unit every cycle; each of the J subunits reads in a data item every J cycles. L data items are read out from the storage unit every cycle. The way the data are read out is different for the parallel filter and the systolic array architectures. Figure 12b describes the storage unit for the case when L = 4. Control signals $\{a_k\}$ and $\{c_{k,l}\}$ are generated at appropriate times by a finite state machine to read data into (out of) the storage unit.
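The storage requirement of the word-serial architectures follows from the subunit sizes L, 2L, 4L, ..., $2^{J-1}L$. The sketch below (illustrative function name) lists the subunit sizes and their sum, which equals $(2^J - 1)L$.

```python
def cwt_storage_subunits(L, J):
    """Return (sizes, total): the jth subunit holds 2^(j-1) * L cells."""
    sizes = [2 ** (j - 1) * L for j in range(1, J + 1)]
    return sizes, sum(sizes)

sizes, total = cwt_storage_subunits(L=4, J=4)
print(sizes, total)
```

Since $J \le \log N$, the total is at most (N − 1)L cells, in contrast to the JL cells of the 1-D DWT storage unit.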
Parallel filter: In this architecture, all L elements of the same subunit are fed to the low pass and the high pass filters at the same time. Thus for the computation of a jth octave output, control signals $c_{j,4}$, $c_{j,3}$, $c_{j,2}$ and $c_{j,1}$ are set to 1, and L elements of the jth subunit are fed to the filters. Note that these L elements are $2^{j-1}$ apart in the jth subunit.
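The effect of tapping elements that are $2^{j-1}$ apart can be illustrated in software. The sketch below filters a sequence with taps spaced $2^{j-1}$ samples apart, which is the usual form of the à trous computation; the causal indexing, the zero initial conditions and the averaging filter in the usage lines are assumptions made for illustration, not details taken from [11].

```python
def a_trous_octave(x, g, j):
    """Filter x with the L-tap filter g, using taps 2^(j-1) samples apart.

    out[n] = sum_l g[l] * x[n - 2^(j-1) * l], with x taken as 0 for n < 0.
    """
    step = 1 << (j - 1)
    out = []
    for n in range(len(x)):
        acc = 0.0
        for l, coeff in enumerate(g):
            idx = n - step * l          # elements are 2^(j-1) apart
            if idx >= 0:                # assumption: zero initial conditions
                acc += coeff * x[idx]
        out.append(acc)
    return out

# Hypothetical 2-tap averaging filter applied at the second octave.
print(a_trous_octave([1, 2, 3, 4, 5, 6, 7, 8], [0.5, 0.5], j=2))
```

No samples are discarded, so every octave produces as many outputs as there are inputs, consistent with the N outputs per octave of the CWT.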
Systolic array: In this architecture, the weights stay in the processors, the inputs are fed in parallel, and the partial results get updated as they move from one processor to another in the array. This computation scheme demands that the inputs for the computation of a particular output are fed to the array in a skewed fashion. Also, L different outputs are computed at the same time in the array, and the L elements that are loaded in parallel from the storage unit come
In an example where L = J = 4, if at time t = T, processors P(4), P(3), P(2), and P(1)
SIMD linear array
The proposed SIMD architecture consists of a linear array of N processors which can be reconfigured to form arrays of sizes N/2, N/4, and so on. The reconfiguration pattern is different from the one described in Section 3.3 for the 1-dimensional DWT. For the 1st octave computation, the linear array interconnection is used, and the computation time is L. For the 2nd octave computation, the N processor array is reconfigured to form two N/2 processor arrays, one with processors P(2j), and one with processors P(2j + 1), $0 \le j < \lfloor N/2 \rfloor$. The computation time is now 2L. In general, for the (k + 1)th octave computation, the N processor array is reconfigured to form a set of $2^k$ arrays, each of size $N/2^k$. The mth processor array in this set is formed with processors $P(2^k j + m - 1)$, where $0 \le k < \log N$, $0 \le j < \lfloor N/2^k \rfloor$, and $1 \le m \le 2^k$. Figure 13
Table 3 compares the number of multipliers, the asymptotic area and period for the various architectures for 1-D CWT. Note that both the systolic array and the parallel filter, for the case when 1 output is computed per cycle, match the area and time lower bounds for 1-D CWT (in the word-serial model), and are thus optimal with respect to both area and time.
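The grouping of processors for the (k + 1)th octave can be enumerated as follows. The sketch takes the array index m to run from 1 to $2^k$, which is the range that recovers the P(2j) and P(2j + 1) arrays of the second octave; the function name and the 0-based processor indices are illustrative.

```python
def cwt_processor_groups(N, k):
    """Processor arrays used for the (k+1)th octave of the 1-D CWT.

    Returns 2^k lists; the mth list holds P(2^k * j + m - 1) for
    0 <= j < floor(N / 2^k).
    """
    stride = 2 ** k
    return [[stride * j + m - 1 for j in range(N // stride)]
            for m in range(1, stride + 1)]

for k in (0, 1, 2):
    print(k + 1, cwt_processor_groups(8, k))
```

Unlike the DWT reconfiguration, every processor stays active at every octave; only the neighbor relation changes, which is why all N outputs of each octave are produced.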
Comparisons
2-D Continuous Wavelet Transform
In this section we describe systolic array and parallel filter architectures for the 2-D CWT for the case when J outputs are computed every cycle. Both the architectures are extensions of the ones that we proposed for the 1-D CWT. We assume that the input is fed in the line scan mode, and that the output is also produced in the line scan mode. The proposed architecture for computing J octaves consists of J computation units; each unit consists of one low pass and one high pass filter. The output of the low pass filter of the jth unit is sent to the (j + 1)th unit, $1 \le j < J$.
Parallel Filter Architecture
In the parallel filter architecture, each parallel filter consists of $L^2$ multipliers grouped in L subunits, with L multipliers and L − 1 adders per subunit. The outputs of the subunits are added by a tree of adders. The inputs to the jth filter unit have to be subsampled along the rows as well as along the columns by a factor of $2^{j-1}$. Subsampling along the rows is achieved by storing the data in a delay line of size $2^{j-1}L$, and tapping the delay line every $2^{j-1}$ delay units. Since the inputs are fed in the line scan mode, subsampling along the columns is obtained by storing the data in a delay line of size $2^{j-1}LN$, and tapping the delay line every $2^{j-1}N$ delay units. Figure 14
Table 4 compares the number of multipliers, and the asymptotic area and time complexities for the various 2-D CWT architectures.
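The row and column tap spacings can be combined into a single list of delays in the line-scan stream. The sketch below is one reading of the description above, in which the tap for filter row r and column c sits $2^{j-1}(rN + c)$ delay units behind the newest sample; the function name and this combined indexing are assumptions.

```python
def tap_offsets(N, L, j):
    """Delays (in samples of the line-scan stream) of the L*L taps feeding
    the jth filter unit of an N-pixel-wide image."""
    step = 2 ** (j - 1)
    # Row taps are `step` apart; column taps are `step * N` apart.
    return [step * (r * N + c) for r in range(L) for c in range(L)]

print(tap_offsets(N=8, L=2, j=2))
```

The largest offset grows as $2^{j-1}LN$, which is why the column delay line dominates the storage of the 2-D architecture.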
Comparisons
Conclusion
In this paper, we have presented a wide range of algorithms and architectures for computing the 1-D and 2-D DWT and the 1-D and 2-D CWT. We have presented two on-line algorithms, the MRPA and the OCA, for implementing the 1-D DWT and the 1-D CWT respectively. The systolic and parallel filter architectures implement these on-line algorithms, and are optimal both with respect to area and time. The 1-D DWT architectures are highly amenable to single chip design, since they require area that is independent of the length of the 1-D input sequence. The systolic and the parallel filter architectures for the 2-D DWT can also fit on one chip (even though the storage required depends on the square root of the input size). We have also presented SIMD array architectures for computing the DWT and CWT. These architectures have an area that is proportional to the size of the input sequence, but they are very fast (T = O(LJ) or O(L)).
The architectures proposed in this paper are independent of the size and nature of the wavelet. Further reductions in the hardware complexity can be achieved by exploiting the properties of the specific wavelet. For instance, for orthonormal wavelets, the number of multipliers in these implementations can be reduced by a factor of 2 by using the polyphase decomposition techniques of [13].
[Figure labels: "x(n)"; "Retain every other sample"; "Interconnection for 3rd octave computation"]