Abstract -This paper presents a VLSI implementation of discrete wavelet transform (DWT). The architecture is simple, modular, and cascadable for computation of one- or multi-dimensional DWT. It comprises four basic units: input delay, filter, register bank, and control unit. The proposed architecture is systolic in nature and performs both high-pass and low-pass coefficient calculations with only one set of multipliers. In addition, it requires only a small on-chip interface circuit for interconnection to a standard communication bus. A detailed analysis of the effect of finite precision of data and wavelet filter coefficients on the accuracy of the DWT coefficients is presented. The architecture has been simulated in VLSI and has a hardware utilization efficiency of 87.5%.
Introduction
In the last decade, there has been an enormous increase in the applications of wavelets in various scientific disciplines [1] - [5] . One of the main contributions of wavelet theory is to relate the discrete time filterbank with the theory of continuous time function space. Typical applications of wavelets include signal processing [6] , [7] , image processing [8] - [10] , numerical analysis [11] , statistics [12] , bio-medicine [13] , etc. Wavelet transform offers a wide variety of useful features, in contrast to other transforms, such as Fourier transform or cosine transform. Some of these are: Since DWT requires intensive computations, several architectural solutions using special purpose parallel processor have been proposed [15] - [20] in order to meet the real time requirement in many applications. The solutions include parallel filter architecture, SIMD linear array architecture, SIMD multigrid architecture [17] , [19] , 2-D block based architecture [20] , and the AWARE's wavelet transform processor (WTP) [16] .
The first three architectures, namely the parallel filter architecture, SIMD linear array architecture and the SIMD multigrid architecture, are special purpose parallel processors that implement the high level abstraction of the pyramid algorithm. The 2-D block based architecture is a VLSI implementation that uses four multiply and accumulate (MAC) units to execute the forward and inverse transforms. It requires a small on-chip memory and implements 2-D wavelet transform directly without data transposition. However, this feature can be a drawback in certain applications. In addition, the block based architecture may introduce block boundary effects degrading the visual quality.
The AWARE's WTP is capable of computing forward and inverse wavelet transforms for 1-D input data using a maximum of six filter coefficients. It can be cascaded to execute transforms using higher order filters. The WTP has been clocked at speeds of 30 MHz and offers 16-bit precision on input and output data. The DWT computation is executed in a synchronous pipeline fashion and is under complete user control. However, the AWARE's WTP is a complex design requiring extensive user control. Programming such a device is therefore tedious, difficult, and time consuming.
There is a clear need for designing and implementing a DWT chipset that explores the potential of DWT particularly in the area of decomposition algorithm and hardware implementation and which operates in a turnkey fashion. Here, the user is required to input only the data stream and the high-pass and low-pass filter coefficients. This paper presents a design and VLSI implementation of an efficient systolic array architecture for computing DWT [21] . The proposed VLSI architecture computes both highpass and lowpass frequency coefficients in the same clock cycle and thus has an efficient hardware utilization. The design is simple, modular, and cascadable for computation of 1-D or 2-D data streams of fairly arbitrary size. The proposed architecture requires a small on-chip interface circuitry for purposes of interconnection to a standard communication bus.
The paper is organized as follows. A brief introduction to discrete wavelet transform is presented in section 2. The effect of finite precision in DWT computation is presented in section 3. The VLSI systolic array architecture for computing DWT is presented in section 4, followed by the conclusions in section 5.
Discrete Wavelet Transform
This section presents a brief introduction to discrete wavelet transform (DWT). DWT represents an arbitrary square integrable function as a superposition of a family of basis functions called wavelets. A family of wavelet basis functions can be generated by translating and dilating the mother wavelet corresponding to the family. The DWT coefficients can be obtained by taking the inner product between the input signal and the wavelet functions. Since the basis functions are translated and dilated versions of each other, a simpler algorithm, known as Mallat's tree algorithm or pyramid algorithm, has been proposed in [2]. In this algorithm, the DWT coefficients of one stage can be calculated from the DWT coefficients of the previous stage, which is expressed as follows:
where $W_L(p, q)$ is the $p$-th scaling coefficient at the $q$-th stage, $W_H(p, q)$ is the $p$-th wavelet coefficient at the $q$-th stage, and $h(n)$ and $g(n)$ are the dilation coefficients corresponding to the scaling and wavelet functions, respectively.
For computing the DWT coefficients of the discrete-time data, it is assumed that the input data represents the DWT coefficients of a high resolution stage. Eq. 1 can then be used for obtaining DWT coefficients of subsequent stages. In practice, this decomposition is performed only for a few stages. We note that the dilation coefficients $h(n)$ represent a lowpass filter (LPF), whereas the corresponding $g(n)$ represent a highpass filter (HPF).
Hence, DWT extracts information from the signal at different scales. The first level of wavelet decomposition extracts the details of the signal (high frequency components) while the second and all subsequent wavelet decompositions extract progressively coarser information (lower frequency components). A schematic of three stage DWT decomposition is shown in Fig. 1 .
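The filter-and-decimate recursion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the hardware schedule of the paper: the function names `dwt_stage` and `pyramid`, the periodic boundary extension, and the indexing convention `x[(2p + n) mod N]` are assumptions made for the example, and the two-tap Haar pair is used only because its output is easy to check by hand.

```python
import math

def dwt_stage(x, h, g):
    """One analysis stage: filter with h (lowpass) and g (highpass), keep every 2nd output.

    Assumes len(x) is even and uses periodic extension at the boundary.
    """
    n_in = len(x)
    low, high = [], []
    for p in range(n_in // 2):
        low.append(sum(h[n] * x[(2 * p + n) % n_in] for n in range(len(h))))
        high.append(sum(g[n] * x[(2 * p + n) % n_in] for n in range(len(g))))
    return low, high

def pyramid(x, h, g, stages):
    """Dyadic decomposition: only the lowpass branch is decomposed again."""
    details = []
    approx = list(x)
    for _ in range(stages):
        approx, high = dwt_stage(approx, h, g)
        details.append(high)
    return approx, details

s = 1 / math.sqrt(2)
h_haar, g_haar = [s, s], [s, -s]  # Haar pair, an assumption for this example
approx, details = pyramid([4, 4, 2, 2, 6, 6, 8, 8], h_haar, g_haar, stages=3)
```

With this input the first-stage details vanish (neighbouring samples are equal), and each later stage halves the number of coefficients, which mirrors the three stage tree of Fig. 1.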
Approximate Location of Figure 1
In order to reconstruct the original data, the DWT coefficients are upsampled and passed through another set of lowpass and highpass filters, which is expressed as:
where $h'(n)$ and $g'(n)$ are, respectively, the lowpass and highpass synthesis filters corresponding to the mother wavelet. It is observed from Eq. 2 that the $j$-th level DWT coefficients can be obtained from the $(j+1)$-th level DWT coefficients.
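As a rough check of the reconstruction step, the sketch below pairs one analysis stage with the matching synthesis stage and verifies that the input is recovered. It assumes an orthonormal filter pair with periodic extension, for which the synthesis step reduces to scattering each coefficient back through the same taps; the function names and the choice of the four-tap Daubechies filter with $g(n) = (-1)^n h(L-1-n)$ are illustrative assumptions, not the paper's notation.

```python
import math

def analysis(x, h, g):
    """Filter with h and g, then decimate by two (periodic extension)."""
    n = len(x)
    low = [sum(h[k] * x[(2 * p + k) % n] for k in range(len(h))) for p in range(n // 2)]
    high = [sum(g[k] * x[(2 * p + k) % n] for k in range(len(g))) for p in range(n // 2)]
    return low, high

def synthesis(low, high, h, g):
    """Upsample by two and filter: each coefficient scatters back through h or g.

    Assumes h and g have the same length.
    """
    n = 2 * len(low)
    x = [0.0] * n
    for p in range(len(low)):
        for k in range(len(h)):
            x[(2 * p + k) % n] += h[k] * low[p] + g[k] * high[p]
    return x

# Daubechies four-tap filter in closed form, with g[n] = (-1)^n h[L-1-n].
r3, r2 = math.sqrt(3), math.sqrt(2)
h = [(1 + r3) / (4 * r2), (3 + r3) / (4 * r2), (3 - r3) / (4 * r2), (1 - r3) / (4 * r2)]
g = [(-1) ** n * h[len(h) - 1 - n] for n in range(len(h))]

signal = [3.0, 1.0, -2.0, 5.0, 0.0, 4.0, 7.0, -1.0]
low, high = analysis(signal, h, g)
rebuilt = synthesis(low, high, h, g)
max_err = max(abs(a - b) for a, b in zip(signal, rebuilt))
```

The residual `max_err` should be at the level of floating point rounding, which is the perfect reconstruction property in miniature.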
Compactly supported wavelets are generally used in various applications. Table 1 lists a few orthonormal wavelet filter coefficients ($h(n)$) popular in compression applications [6]. These wavelets have the property of having the maximum number of vanishing moments for a given order, and are known as Daubechies wavelets. The entries in columns 2, 3, and 5 provide the filter coefficients with the minimum phase, whereas the entries in columns 4 and 6 provide the filter coefficients with the least asymmetric phase.
Approximate Location of Table 1
The 2-D DWT is usually calculated using a separable approach [2]. To start with, the 1-D DWT computation is performed on each row. This is followed by a matrix transposition operation [28]. Next, the DWT operation is executed on each row of the transposed data, i.e., on each column of the row-transformed data. Hence, 2-D DWT can be implemented in a straightforward manner by inserting a matrix transposer between two 1-D DWT modules. Fig. 2 shows a 3-level wavelet decomposition of an image. In the first level of decomposition, one lowpass subimage ($LL_2$) and three orientation selective highpass subimages ($LH_2$, $HL_2$, $HH_2$) are created. In the second level of decomposition, the lowpass subimage is further decomposed into one lowpass and three highpass subimages ($LH_3$, $HL_3$, $HH_3$). This process is repeated on the lowpass subimage to derive the higher level decompositions. In other words, DWT decomposes an image into a pyramid structure of subimages with various resolutions corresponding to the different scales. The inverse wavelet transform is calculated in the reverse manner, i.e., starting from the lowest resolution subimages, the higher resolution images are calculated recursively. We note that nonseparable wavelets have also been proposed in the literature. However, they are not widely used because of their complexity.
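The row-transform, transpose, row-transform sequence can be sketched as follows. The helper names (`dwt_rows`, `transpose`, `dwt2_level`), the Haar filter pair, the periodic extension, and the final transpose back to the original orientation are assumptions made to keep the example small and checkable; they are not taken from the paper.

```python
import math

S = 1 / math.sqrt(2)
H, G = [S, S], [S, -S]  # Haar pair, chosen only to keep the example checkable

def dwt_rows(image):
    """Single-stage 1-D DWT of every row; output row = lowpass half then highpass half."""
    out = []
    for row in image:
        n = len(row)
        low = [sum(H[k] * row[(2 * p + k) % n] for k in range(len(H))) for p in range(n // 2)]
        high = [sum(G[k] * row[(2 * p + k) % n] for k in range(len(G))) for p in range(n // 2)]
        out.append(low + high)
    return out

def transpose(m):
    """Matrix transposition, standing in for the hardware transposer."""
    return [list(col) for col in zip(*m)]

def dwt2_level(image):
    """Row transform, transpose, row transform again, transpose back."""
    return transpose(dwt_rows(transpose(dwt_rows(image))))

image = [[1, 1, 1, 1],
         [1, 1, 1, 1],
         [5, 5, 5, 5],
         [5, 5, 5, 5]]
coeffs = dwt2_level(image)
```

For this piecewise constant image all of the energy ends up in the top-left (lowpass) quadrant of `coeffs`, and the three highpass quadrants are zero.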
Approximate Location of Figure 2 
Computational Complexity of the DWT
It is observed from Eq. 1a-1b that the complexity of each stage of wavelet decomposition is linear in the number of input samples, where the constant factor depends on the length of the filter. We note that for a dyadic wavelet decomposition, the number of input samples decreases by 50% at subsequent stages of decomposition. For wavelet order L , number of decomposition stages J , the
where FLOP corresponds to floating point operations and usually refers to multiplications and additions. We note that a simple polyphase decomposition has been assumed in the above calculation. The complexity can be further reduced using more sophisticated algorithms, such as FFT or fast running FIR filtering [14]. However, these algorithms need complex control circuitry for hardware implementation and have hence not been considered in the proposed architecture.
In many applications, a regular tree, instead of a dyadic tree, might be more appropriate. The computational complexity at each stage of a regular tree is 2NL FLOP. Hence, the total complexity for a J level decomposition is:
The complexity of an irregular tree, or a wavelet packet algorithm is upper bounded by C regular .
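The two operation counts can be compared numerically. The sketch below assumes, in line with the text, that a stage operating on $N$ samples costs $2NL$ FLOP, that a dyadic tree halves the number of samples at each stage, and that a regular tree processes all $N$ samples at every stage; the function names and the example values are hypothetical.

```python
def flops_dyadic(n, filt_len, stages):
    """Dyadic tree: stage k sees n / 2**k samples (k = 0 .. stages-1), 2*L FLOP per sample."""
    return sum(2 * (n // 2 ** k) * filt_len for k in range(stages))

def flops_regular(n, filt_len, stages):
    """Regular tree: every stage processes all n samples, 2*n*L FLOP per stage."""
    return 2 * n * filt_len * stages

n, filt_len, stages = 512, 6, 3
dyadic = flops_dyadic(n, filt_len, stages)
regular = flops_regular(n, filt_len, stages)
```

Under these assumptions the dyadic count is a geometric series that stays below $4NL$ however many stages are used, whereas the regular tree grows linearly with the number of levels.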
Data Dependencies within DWT
The wavelet decomposition of a 1-D input signal for three stages is shown in Fig. 1. The transfer functions of the sixth order highpass ($g(n)$) and lowpass ($h(n)$) FIR filters can be expressed as follows
As shown in Fig. 1 
Finite Precision Effect
The accuracy of DWT coefficients depends on the precision of both the input data and the DWT coefficients. In this section, the performance of a finite precision wavelet transformer is evaluated. We note that a multistage DWT is calculated recursively. Therefore, in addition to the wavelet decomposition stage, extra space is required to store the intermediate coefficients. Let $i$ and $j$ denote the precision, in bits, of the input data and of the stored DWT coefficients, respectively. Generally, the dynamic range of the DWT coefficients is greater than the dynamic range of the input data, and hence $j$ should be greater than $i$. In our implementation, we multiply the input data by $2^{j-i}$ to bring them to $j$-bit precision. We then calculate the first stage DWT and scale down the DWT coefficients to $j$ bits. Subsequent stages of DWT decomposition are executed in the same manner.
We note from Table 1 that the maximum absolute value of a filter coefficient is between 0.5 and 1. Therefore, to represent the filter coefficients in $m$-bit precision, we multiply all the coefficients by $2^{m-1}$ and round to the nearest integer value. Instead of rounding, sometimes, we may have Table 1 for a few wavelets. Table 2 shows the maximum dynamic range of DWT coefficients at various stages. We note that in most cases, the value of $z$ in Table 2 is less than 2. Hence, for a 1-D transform, the dynamic range will at most increase by 2. In other words, the accumulator should have at least $m + j$ bits of precision. Since the precision of the intermediate storage is $j$ bits, all the coefficients are scaled down to $j$ bits by dividing them by $2^m$. We note that the input data was scaled up initially by $2^{j-i}$.
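After the first stage of decomposition, the coefficients will be effectively $2^{j-i-1}$ times greater than the ideal coefficients, since the input scaling of $2^{j-i}$ and the filter scaling of $2^{m-1}$ are followed by a division by $2^m$.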
To derive the ideal DWT coefficients, the resulting DWT coefficients should be scaled down by the scaling factors provided in Table 3 .
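The scaling steps of this section can be traced on a single filter output. The sketch below is an illustration under stated assumptions: input samples are promoted from $i$ to $j$ bits by a factor $2^{j-i}$, filter coefficients are scaled by $2^{m-1}$ and rounded, and the accumulator is divided by $2^m$ (implemented here as a flooring right shift, which is a choice of this example). The function names, the two-tap Haar coefficients, and the sample values are hypothetical.

```python
import math

def quantize_filter(coeffs, m):
    """Scale by 2**(m-1) and round to the nearest integer (m-bit signed coefficients)."""
    return [round(c * 2 ** (m - 1)) for c in coeffs]

def fixed_point_output(samples, coeffs, i, j, m):
    """One filter output: inputs scaled by 2**(j-i), integer MAC, then divide by 2**m."""
    q = quantize_filter(coeffs, m)
    scaled = [s * 2 ** (j - i) for s in samples]  # i-bit data promoted to j bits
    acc = sum(c * s for c, s in zip(q, scaled))   # needs about m + j bits
    return acc >> m                               # floor division by 2**m

i_bits, j_bits, m_bits = 8, 12, 12
s = 1 / math.sqrt(2)
samples = [100, 50]
ideal = s * samples[0] + s * samples[1]
fixed = fixed_point_output(samples, [s, s], i_bits, j_bits, m_bits)
recovered = fixed / 2 ** (j_bits - i_bits - 1)  # undo the net scaling 2**(j-i) * 2**(m-1) / 2**m
```

Dividing the integer result by the net factor $2^{j-i-1}$ brings it back to the scale of the floating point value, and the remaining difference is the finite precision error studied in this section.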
Approximate Location of Table 2
Approximate Location of Table 3
The inverse transform is calculated using Eq. 2. The wavelet coefficients are upsampled and passed through a set of LPF and HPF synthesis filters (which are different from, but related to, the analysis filters) and the filter outputs are added for reconstruction. We note that there is a difference between the forward and inverse DWT with respect to the precision of the coefficients. In addition, the dynamic range of the forward wavelet coefficients increases with the tree depth because of the summation operation in Eq. 1a-1b. However, in the case of inverse transform, the dynamic range of the coefficients will decrease due to the perfect reconstruction property of
To determine the accuracy of the finite precision wavelet transformer, 1-D step (shown in Fig. 3a) and sinusoidal (shown in Fig. 4a) input data with 8-bit precision were decomposed for 3 stages using the Daubechies 8 tap wavelet. The differences between the DWT coefficients obtained with 12-bit precision and the ideal coefficients are shown in Figs. 3 and 4. For comparison purposes, the ideal coefficients are also shown along with the error coefficients. It is observed that the error coefficients due to finite precision are small when the data is uniform in nature (Fig. 3). However, when the input data changes rapidly, the value of the error coefficients becomes high (near the step discontinuity and throughout the sinusoidal signal).
Approximate Location of Figure 3
Approximate Location of Figure 4
A useful measure of the accuracy of DWT coefficients is the signal to noise ratio (SNR). Here, the signal is the floating point DWT coefficient and the noise is the difference between the floating point and finite precision coefficients. Fig. 5 provides the performance variation for a 1-D signal with respect to the precision of the filter coefficients, with the DWT coefficients fixed at 12 bits. The data were decomposed with the Daubechies 8 tap wavelet for 3 stages. It is observed that a performance of 50-70 dB SNR can be obtained with 12-bit precision of both DWT coefficients and filter coefficients. Two test images, Lena and Mandrill (2-D), were decomposed using the Daubechies 8 tap wavelet for 3 stages. The performance with respect to various precisions of DWT and filter coefficients is detailed in Fig. 6. It is observed that with 12-bit precision, an SNR of 40-50 dB can be achieved. We note that in the 2-D case more stages (row and column) of DWT are involved, resulting in a decrease in the SNR (compared to the 1-D case).
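The SNR measure described here can be computed as below. The text does not spell out the formula, so the usual energy ratio in decibels is assumed; the function name `snr_db` and the sample values are hypothetical.

```python
import math

def snr_db(reference, approximate):
    """SNR in dB: reference energy over the energy of (reference - approximate)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approximate))
    if noise == 0:
        return math.inf  # identical sequences: no noise at all
    return 10 * math.log10(signal / noise)

reference = [10.0, -20.0, 30.0, -40.0]    # stands in for floating point coefficients
approximate = [10.1, -20.1, 29.9, -40.1]  # stands in for finite precision coefficients
value = snr_db(reference, approximate)
```

Here the signal energy is 3000 and the noise energy is about 0.04, so `value` is close to 48.75 dB.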
Approximate Location of Figure 5
Approximate Location of Figure 6
The Proposed Systolic Array Architecture
A digit serial wavelet architecture using the pyramid algorithm was outlined in [15]. The proposed systolic array (DWT-SA) architecture is an improvement over the above architecture. Here, only one set of multipliers and adders has been employed, in contrast to the two parallel sets of computational hardware employed in [15]. The multiplier and adder set performs all necessary computations to generate all highpass and lowpass coefficients. The DWT-SA architecture does not use any external or internal memory modules to store the intermediate results and therefore avoids the delays caused by memory access and memory refresh timing. In addition, since a set of registers controlled by a global clock is employed, the control circuitry does not need to take the intermediate products in and out of the memory. This results in a simple and efficient systolic implementation for 1-D DWT computation.
In order to compute separable 2-D DWT, two modules (one for row transform and another for column transform) of the proposed architecture can be used along with a transposer. The details of the schematic and the design of the transposer can be found in [21], [28]. We note that the proposed architecture is a complete setup for computing forward DWT. The inverse DWT can be calculated by replacing the decimator with an interpolator.
DWT-SA Architecture
The design of DWT-SA is based on a computation schedule derived from Eq. 6a -6n which are the result of applying the pyramid algorithm for eight data points (N = 8) to the six tap filter. We note that Eq. 1a and 1b represent the highpass and lowpass components of the six tap FIR filter.
The proposed DWT-SA architecture is shown in Fig. 7. It comprises four basic units: Input Delay, Filter, Register Bank, and Control. The following sections present the design of each unit. First, we present the design of the Filter Unit and its subcomponent, the Filter Cell. The design of the Storage Units is then discussed, followed by the description of the Control Unit.
Approximate Location of Figure 7 
Filter Unit (FU)
The Filter Unit (FU) proposed for this architecture is a six-tap non-recursive FIR digital filter whose transfer functions for the highpass and lowpass components are shown in Eq. Computation of any DWT coefficient can be executed by employing a multiply and accumulate method where partial products are computed separately and subsequently added.
This feature makes a systolic implementation of DWT possible. The latency of each filter stage is 1 time unit (TU). Since partial components of more than one DWT coefficient are being computed at any given time, the latency of the filter once the pipeline has been filled is also 1 TU. The systolic architecture of a six tap filter is shown in Fig. 8. Here, partial results (one per cell) are computed and subsequently passed in a systolic manner from one cell to the adjacent cell.
Approximate Location of Figure 8
Filter Cell (FC)
Eq. 1a-1b show that computations of the highpass and lowpass DWT coefficients at specific time instants are identical except for different values of the LPF and HPF filter coefficients. By introducing additional control circuitry, computations of both highpass and lowpass DWT coefficients can be executed using the same hardware in one clock cycle.
The highpass coefficient calculation is performed during the first half of the clock cycle whereas the lowpass coefficient calculation is performed during the second half. Subsequently, the partial results are passed synchronously in a systolic manner from one cell to the adjacent cell. The proposed filter cell therefore consists of only one multiplier, one adder, and two registers to store the high-pass and the low-pass coefficients, respectively, as shown in Fig. 9 .
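The sharing of one multiplier and one adder between the two filter paths can be modelled behaviourally. The sketch below is a functional model only, not a cycle-accurate description of the cell in Fig. 9: the class name `FilterCell`, the method names, and the multiplication counter are assumptions introduced for illustration.

```python
class FilterCell:
    """One tap: a single multiplier/adder reused for the high-pass and low-pass paths."""

    def __init__(self, h_coeff, g_coeff):
        self.h_coeff = h_coeff
        self.g_coeff = g_coeff
        self.multiplications = 0

    def _mac(self, coeff, sample, partial):
        """The only multiply-accumulate resource of the cell."""
        self.multiplications += 1
        return partial + coeff * sample

    def clock(self, sample, high_in, low_in):
        high_out = self._mac(self.g_coeff, sample, high_in)  # first half of the cycle
        low_out = self._mac(self.h_coeff, sample, low_in)    # second half of the cycle
        return high_out, low_out

def filter_outputs(window, h, g):
    """Pass partial sums through a chain of cells, one cell per tap."""
    cells = [FilterCell(hc, gc) for hc, gc in zip(h, g)]
    high, low = 0, 0
    for cell, sample in zip(cells, window):
        high, low = cell.clock(sample, high, low)
    return high, low, sum(c.multiplications for c in cells)

high, low, mults = filter_outputs([4, 5, 6], h=[1, 2, 3], g=[1, -1, 1])
```

Each cell is used twice per sample, once per filter, so a chain of three taps performs six multiplications while holding only three multipliers.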
Approximate Location of Figure 9
In order to meet the real time requirements of applications such as video compression, a fast multiplier design is required. For this purpose, a high speed Booth multiplier [24] is used in the filter cell. The Booth multiplier uses a powerful algorithm for signed-number multiplication which treats both positive and negative numbers uniformly and is significantly faster than an array multiplier due to the reduced number of add stages required [24], [25]. The additional speedup is a result of using only half the number of adder stages. The full adders and half adders employed inside the multiplier are variants of those described in [26]. Both designs have been modified to reduce the carry out delay, which is critical in achieving the fastest possible multiplication.
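The halving of the number of adder stages is characteristic of radix-4 (modified) Booth recoding, in which the multiplier is scanned two bits at a time. The sketch below illustrates that recoding in software; it is an assumption that this is the variant used in the filter cell, and the function name `booth_radix4` and the 8-bit word length are chosen only for the example.

```python
def booth_radix4(x, y, n=8):
    """Multiply x by the n-bit two's complement integer y using radix-4 Booth recoding.

    The multiplier is recoded into n/2 digits in {-2, -1, 0, 1, 2}, so only n/2
    partial products are accumulated instead of n.
    """
    assert n % 2 == 0 and -(1 << (n - 1)) <= y < (1 << (n - 1))
    bits = y & ((1 << n) - 1)  # two's complement bit pattern of y
    prev = 0                   # implicit bit to the right of the least significant bit
    acc = 0
    for i in range(n // 2):
        b0 = (bits >> (2 * i)) & 1
        b1 = (bits >> (2 * i + 1)) & 1
        digit = prev + b0 - 2 * b1      # recoded digit for this bit pair
        acc += (digit * x) << (2 * i)   # one partial product per pair of bits
        prev = b1
    return acc

product = booth_radix4(-37, 91)
```

Positive and negative multipliers go through the same loop, with no separate sign handling, which is the uniform treatment of signed numbers mentioned above.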
Storage Units
Two storage units are used in the proposed architecture: Input Delay and Register Bank. The data registers used in these storage units have been constructed from standard Master-Slave, Edge Triggered, D -type flip-flops [26] , [27] . The following presents the structure of each storage unit.
Input Delay Unit (ID)
Eq. 1a and 1b show that the value of computed filter coefficient depends on the present as well as the five previous data samples (the negative time indexes in Eq. 1a-1b correspond to the reference starting time unit 0). It is therefore required that the present and the past five input data values be held in registers and be retrievable by the FU and the CU. Therefore, five data registers are connected serially in a chain, as shown in Fig. 7 , and at any clock cycle each register passes its contents to its right neighbor which results in only five past values being retained.
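The behaviour of the Input Delay chain can be sketched with a fixed-length queue. The class name `InputDelay` and the zero initial contents of the registers are assumptions made for the example.

```python
from collections import deque

class InputDelay:
    """Chain of five registers; with the current input this exposes six samples."""

    def __init__(self, depth=5):
        self.registers = deque([0] * depth, maxlen=depth)

    def clock(self, sample):
        """Return [current, previous 1, ..., previous 5], then shift the chain."""
        window = [sample] + list(self.registers)
        self.registers.appendleft(sample)  # oldest value falls off the right end
        return window

delay = InputDelay()
windows = [delay.clock(x) for x in range(1, 8)]
```

After seven input samples the last window is `[7, 6, 5, 4, 3, 2]`: the present sample plus the five most recent past values, which is what a six tap filter needs.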
Register Bank Unit (RB)
Several registers are required for storage of the intermediate partial results. Analysis of Eq. 1a and 1b justifies the requirement for the register bank; however, it does not explain its size. It will be shown in the next section that 26 data registers connected serially are required to implement RB.
Control Unit (CU)
One of the most important aspects of the DWT-SA architecture is its potential for real time operation. The proposed DWT-SA architecture computes N coefficients in N clock cycles and achieves real time operation by executing computations of higher octave coefficients in between the first octave coefficient computations. The first octave computations are scheduled every N/4 clock cycles, while the second and third octaves are scheduled every N/2 and every N clock cycles, respectively.
There are several approaches for scheduling the octave computations. In the DWT-SA architecture, a schedule based on filter latency of 1 TU is proposed to meet the real time requirements in some applications. The computations are scheduled at the earliest possible clock cycle, and computed output samples are available one clock cycle after they have been scheduled as shown in Table 4 . The delay is minimized through the pipeline facilitating real time operation. For example, the computation of d(0) can only be executed after the calculation of c(0) has been completed. The calculation of c(0) scheduled for cycle 1 is completed in cycle 2 due to the filter latency of 1 TU. d(0) is therefore scheduled for computations in a later cycle, i.e. cycle 4.
Approximate Location of Table 4
The schedule presented in Table 4 is periodic with period N, and the hardware is not utilized in cycle kN+2, where k is a non-negative integer. The computation schedule in Table 4 corresponds to a high hardware utilization of 87.5% (i.e., 7/8).
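The utilization figure follows directly from the slot assignment within one period. The sketch below encodes the assignment as described in the text for N = 8 (odd cycles for the first octave, cycles 4 and 8 for the second, cycle 6 for the third, cycle 2 idle); the function name `octave_for_cycle` is hypothetical.

```python
N = 8  # samples per period, as in the text

def octave_for_cycle(cycle, n=N):
    """Octave computed in a 1-based cycle, or None when the filter is idle.

    Assumed slots per period: odd cycles -> octave 1, cycles 4 and 8 -> octave 2,
    cycle 6 -> octave 3, cycle 2 -> idle.
    """
    slot = (cycle - 1) % n + 1
    if slot % 2 == 1:
        return 1
    if slot in (4, 8):
        return 2
    if slot == 6:
        return 3
    return None

period = [octave_for_cycle(c) for c in range(1, N + 1)]
utilization = sum(o is not None for o in period) / N
```

Seven of the eight slots are occupied, which gives the 7/8 utilization quoted above, and the pattern repeats unchanged in every later period.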
Register Allocation
The next step in designing the DWT-SA architecture is the design of the Control Unit (CU) and the Register Bank (RB). The two components synchronize the availability of operands. There are two schemes which can be employed for this purpose, namely the Forward Register Allocation (FRA), or the Forward-Backward Register Allocation (FBRA). The FRA method uses a set of registers which are allocated to intermediate data on the first come first served basis. It does not reassign any registers to other operands once its contents have been accessed. The FBRA scheme is similar, except that once the operand stored has been used, the register is reallocated to another operand. The FRA method is simpler, requires less control circuitry and permits easy adaptation of the architecture for coefficient calculation of more than 3 octaves. It results however, in less efficient register utilization.
In either scheme, the coefficient computations are periodic and hence, each register containing a specific variable will be reserved for the same variable in the next period. The construction of the register allocation tables for both FRA and FBRA are presented below.
FRA Register Allocation
To demonstrate the construction of the register allocation table using the FRA approach, we will consider the case of computing the coefficients c(0) and c(2).
As shown in Table 4 , coefficient c(0) is computed in cycle one, whereas the coefficient c(2) is scheduled for computation in cycle 3. These two computed coefficients along with other four; c(4), c(6), c(8) and c(10) are needed for the computation of coefficient e(0) (see Eq. 6k). According to Table 4 , e(0) computation is scheduled for cycle 4. However, the six coefficients c(0), c(2), c(4), c(6), c(8), c(10) will not be available until cycle 12, and therefore the calculation of e(0) has to be scheduled for cycle 12 (i.e., 4 + N). Similarly, coefficient e(4) is computed not in cycle 8, but in cycle 16 (i.e.,. 8 + N). The number of registers needed becomes apparent once one complete frame of computations has been scheduled. Systematic application of the described method, yields a complete schedule of computation for all intermediate and final coefficients as shown in Table 5 . Due to its size, the table has been divided into two parts.
Approximate Location of Table 5
In the FRA register allocation approach, where data moves systolically in one direction only, it is possible to increase the number of DWT decomposition octaves by placing additional registers in series after register R26. The new registers hold the intermediate coefficients needed for the computation of the next octave decomposition. Hardware utilization of the higher octave decomposition registers is inversely proportional to the order of computed coefficients. Table 5 shows that not all registers in the FRA register allocation scheme are used at every time instant. In fact, close examination of Table 5 suggests that for the first to second octave calculations, registers R1 to R11 are used 87.5% of the time, whereas for the second to third octave calculation, registers R12 to R26 are used 25% of the time.
FBRA Register Allocation
Table 6 shows a complete register allocation table for the DWT-SA using the FBRA approach. It is observed that a higher register utilization could be achieved by applying the reallocation scheme, where empty registers are reallocated to other variables once the original variables held in them are no longer valid. This approach, which in the ideal case would make use of every register at each time instance, has a negative impact on the complexity of the architecture: it requires complex control circuitry for data reallocation, and results in a less modular architecture compared to the FRA approach. Therefore, the FRA scheme, instead of FBRA, has been employed in the proposed architecture.
Approximate Location of Table 6
Tables 5 and 6 show that the last pair of coefficients, i.e., f(0) and g(0), for the first group of eight samples is available at time instance 38. This implies that coefficient computations are overlapped for 5 periods. Table 5 also shows the time periods when the intermediate coefficients must remain valid in order to produce the subsequent higher octaves of coefficients. For example, consider the case of a first octave coefficient c(0). It remains active from the time instant it has been computed to the time of calculation of the next higher octave of coefficients, i.e., from time instance 1 to time instance 12 in the present configuration. Similarly, e(0), a second octave coefficient, remains active until the third octave coefficients (in which it is used) are computed, i.e., from time instance 12 to time instance 38.
Activity Periods
All the intermediate results, and the associated periods of activity are listed in Table 7 .
Approximate Location of Table 7
The number of registers required in this architecture is directly proportional to the number of levels of DWT decomposition, and is calculated during the construction of the timetable of computations. For the DWT-SA architecture, which computes three octaves of DWT decomposition and employs the FRA register allocation method, the top row of Table 5 indicates that the number of registers is 26.
We note that since no variable in Table 5 has a negative time index, a periodic interpretation of that table is required. Consider for example, the variable c(-2) in the computation of d(0) and e(0) in cycle 4. The periodic interpretation of Table 5 implies that the register which holds the variable c(-2) in cycle 4, also holds the variable c(-2+8) in clock cycle 12 (4+8). Table 5 shows that c(6) is held in register R5.
Complete Design of CU
The complete design of the Control Unit for DWT-SA architecture is shown in Fig. 10 .
It schedules the computation of each DWT coefficient as shown in Table 4 .
Approximate Location of Figure 10
CU is a switch that directs data from the Input Delay (ID) or the Register Bank (RB) to the Filter Unit (FU). CU is a modular switch with a number of subcomponents equal to the number of taps in the FU. The CU multiplexes data from the ID every second cycle, and from the RB in cycles 4, 6, and 8. In cycle 2, CU remains idle, i.e., it does not allow any passage of data. Proper timing, synchronization, as well as enabling and disabling of the CU are ensured by the global CLK signal.
Timing Considerations
We have examined the design of each component in the proposed DWT-SA architecture. The timing considerations of the architecture are discussed below with respect to Fig. 7.
The number of switching inputs in each control subcell is equal to the number of octave computations. The first octave computations are scheduled every second clock cycle and hence the corresponding switch input is labeled 2k, where k is any non-negative integer. Moreover, its inputs are supplied directly by the ID. Second octave computations are executed in clock cycles 4 and 8, which is reflected by the label 4k+4. The third octave computations are scheduled in clock cycle 6, or 8k+6. Both second and third octave computations use partial results from previous octave computations and therefore use inputs from RB. Table 5 determines which register is used as output.
The delay of the DWT-SA architecture consists of the latency period necessary to fill up the filter for the first time, in addition to the number of clock cycles through the registers as described in Table 4. First results are thus produced 43 (i.e., 5+38) clock cycles after the first input sample has entered the pipeline. Subsequent coefficients are available at the output of the pipeline every 8 clock cycles. The DWT coefficients are output from the final filter stage.
Simulation Results

IEEE Transactions on VLSI Systems, Dec 1996
The proposed DWT-SA architecture has been fully simulated in order to validate its functionality [21]. First, analog simulations on each small cell, and later digital simulations on the larger blocks and the final chip, have been performed. The analog simulations were executed using the Hspice simulation tool, running under the Opus 4.2 design platform.
Once the gate level analog simulations of all subcircuits were completed, digital simulations were performed on groups of subcircuits forming a more complex functional circuit. Larger blocks of circuits were assembled progressively and verified until the entire DWT-SA architecture was simulated. The digital simulator used was Verilog logic simulator running under Opus 4.2. Process parameters used were those of 1.2 µm technology.
The dimension of a single DWT-SA module with 8-bit wide data and filter coefficients will be approximately 10 mm × 7 mm, with approximately 300,000 transistors on the chip. The power dissipation of the architecture using a 5 volt CMOS process at 20 MHz will be approximately 500 mW to 1 W. The 16-bit architecture would require 4 times more transistors, 4 times more silicon area, and will consume about 4 times more power. However, the simulation results in Section 3 show that it is sufficient to have an architecture with 12-bit precision. In this case, the required number of transistors and power dissipation will be approximately 2.25 times those required by the 8-bit architecture.
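The quoted factors of 4 and 2.25 are both consistent with a cost that grows with the square of the word length. The sketch below states that quadratic model explicitly; the model itself is an inference from the two quoted figures rather than something the text derives, and the function name `relative_cost` is hypothetical.

```python
def relative_cost(bits, base_bits=8):
    """Cost relative to the 8-bit design, assuming growth with the square of word length."""
    return (bits / base_bits) ** 2

cost_16 = relative_cost(16)
cost_12 = relative_cost(12)
transistors_12 = 300_000 * cost_12  # scaled from the 8-bit transistor estimate
```

Under this assumption the 12-bit design would need roughly 675,000 transistors.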
One of the most popular applications of DWT is video processing. With a frame rate of 30 frames/sec, the video processor should process a complete frame in less than 33 ms. It has been found that the proposed architecture can execute the DWT computations on a monochrome 512 × 512 frame in 13 ms, with 1.2 µm technology and a 20 MHz clock rate. The computation on a color frame will take about 39 ms, and hence cannot be executed in real time. However, an additional 35% speedup can be expected if 0.8 µm technology or below is employed. This speedup will enable the architecture to perform real time DWT computations for color video sequences.
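The frame times quoted above can be checked with a back-of-the-envelope calculation. The sketch assumes one sample is consumed per clock cycle (in line with N coefficients in N clock cycles), three components per color frame, and that a 35% speedup divides the processing time by 1.35; the function name and these interpretations are assumptions.

```python
def frame_time_ms(width, height, clock_hz, components=1):
    """Time to stream one frame at one sample per clock cycle, in milliseconds."""
    return 1000.0 * width * height * components / clock_hz

budget_ms = 1000.0 / 30                                   # 30 frames per second
mono = frame_time_ms(512, 512, 20e6)                      # monochrome frame
color = frame_time_ms(512, 512, 20e6, components=3)       # three color components
color_faster = color / 1.35                               # with the expected 35% speedup
```

The monochrome frame takes about 13.1 ms and the color frame about 39.3 ms, which exceeds the 33 ms budget; with the assumed speedup the color frame drops to roughly 29 ms and fits.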
Conclusion
A systolic VLSI architecture for computing one dimensional DWT in real time has been presented. The architecture is simple, modular, cascadable, and has been implemented in VLSI. The implementation employs only one multiplier per filter cell, and hence results in a considerably smaller chip area. As implemented in a 1.2 µm technology, and running at 20 MHz, this architecture achieves real-time DWT computation for a monochrome 512 × 512 video input.
Step Input
Table 4. Schedule for one complete set of computations.