resulting in poor subjective quality of reconstructed images at high compression (low transmission bit rates).
In this paper, we present a new systolic array architecture for implementing the DWT. The architecture uses a frequency doubler at the input and computes both high and low frequency coefficients in the same clock cycle. This results in a reduced silicon area due to the minimum number of multipliers.
Discrete Wavelet Transform
DWT is popular in image processing applications as the basis functions correspond to the human visual system characteristics.
The two dimensional DWT decomposes the original image into a "low pass" sub-image, which looks like a subsampled version of the original image (i.e. it retains spatial information), and "high pass" sub-images, which contain only the edges and fine texture information of the original image. This recursive decomposition process is presented graphically in Figure 1. Since only pixels in the closest spatial vicinity contribute to the corresponding "high" or "low" pass coefficient, all geometric information present in the original image is preserved in the sub-images [3]. This feature of DWT makes it suitable for use with advanced image processing techniques.
where W(n,0) = x(n), and w(n) and h(n) are Quadrature Mirror Filters derived from the wavelet [6]. In other words, it is a multiresolution decomposition of a sequence of length N. It generates two output sequences, the "low pass" and "high pass" series, of total length N.
Since the outputs are decimated by two at each resolution level of the DWT, only those coefficients that are needed are computed. In other words, the frequency of computed samples decreases exponentially (Pyramid Algorithm). N/2 samples are generated at the highest resolution, N/4 samples at the next highest resolution, and so on until two samples remain at the lowest resolution level: one "low pass" and one "high pass". filtering operations or convolutions must be performed.
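The sample counts described above can be checked with a minimal sketch of the pyramid recursion. The two-tap Haar filter pair used here is an assumption chosen only to keep the example short; it is not the filter pair of this paper, and the function names are hypothetical.

```python
import math

def haar_step(x):
    """One resolution level: filter, then keep every second output."""
    s = 1 / math.sqrt(2)
    # Assumed Haar pair: sum for the low pass, difference for the high pass.
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return low, high

def pyramid(x):
    """Recursively split the low pass stream until one sample remains."""
    levels = []
    low = list(x)
    while len(low) > 1:
        low, high = haar_step(low)
        levels.append(high)  # high pass outputs of this level are kept
    return low, levels

final_low, high_levels = pyramid([float(v) for v in range(8)])
sizes = [len(h) for h in high_levels]
total = sum(sizes) + len(final_low)
print(sizes, len(final_low), total)
```

For a sequence of length N = 8, the high pass levels hold N/2, N/4 and N/8 samples, and the last low pass sample brings the total back to N.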
We note that for image and video applications, N is chosen to be small: typically 4 or 8, as shown in equation 2.
The computational complexity of DWT is of O(N). This contrasts with the N log N order of complexity for DCT.
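The linear order can be made plausible with a short operation count. This is a sketch under stated assumptions: N is taken to be the length of the input sequence, both filters have the same length L, and J is the number of resolution levels. The symbols L and J are introduced here for illustration only. Level j produces N/2^j outputs per filter, and each output costs L multiplications, so

$$\sum_{j=1}^{J} 2L\,\frac{N}{2^{j}} = 2LN\left(1 - 2^{-J}\right) < 2LN .$$

For a fixed filter length L, the total number of multiplications is therefore bounded by a constant times N, regardless of the number of levels.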
[Figure: "High Pass" and "Low Pass" panels]
Since identical computations take place every N cycles, we need to look at a complete set of calculations for one period N. Here, simple, sixth order, non-recursive FIR digital filters are considered, with the following transfer functions for the high and low pass components.
$$\mathrm{High}(z) = g_0 + g_1 z^{-1} + g_2 z^{-2} + g_3 z^{-3} + g_4 z^{-4} + g_5 z^{-5}$$
. . . 
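A transfer function of this form corresponds to the difference equation y(n) = g_0 x(n) + g_1 x(n-1) + ... + g_5 x(n-5). The following minimal sketch evaluates it only at every second sample, as decimation by two requires. The tap values are placeholders, not the coefficients of this paper, and keeping the even-indexed outputs is an assumption about the decimation phase.

```python
def fir_decimate(x, taps):
    """Evaluate y[n] = sum_k taps[k] * x[n - k] at even n only."""
    out = []
    for n in range(0, len(x), 2):
        acc = 0.0
        for k, c in enumerate(taps):
            if n - k >= 0:  # samples before the start are taken as zero
                acc += c * x[n - k]
        out.append(acc)
    return out

# Placeholder six-tap filter (hypothetical values g_0 .. g_5).
g = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
response = fir_decimate(impulse, g)
print(response)
```

With a unit impulse as input, the decimated output picks out the even-indexed taps, which makes it easy to see that half of the full-rate outputs are never computed.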
Proposed filter cell
We now present the design of the basic cell for DWT. The proposed filter cell contains a multiplier, adder and two registers, one for each high pass and low pass coefficient, as shown in Figure 3 .
In Figure 2, as well as in the equations of section 2, it is shown that the high pass and low pass computations are identical at specific time instances. Therefore, by using a frequency doubler inside the filter cells, both computations can be performed by the same multiplier. However, care has to be taken to ensure that each multiplication can be executed in half a clock cycle. In other words, the low pass coefficient multiplication is performed during the first half of the clock cycle, and the high pass multiplication is performed during the second half of the clock cycle. Subsequently, the partial results are passed in a systolic manner from one cell to the adjacent cell.
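The sharing of one multiplier between the two half cycles can be illustrated with a small behavioural model. This is a sketch, not the circuit of Figure 3: it assumes a transposed-form FIR arrangement in which the input sample is presented to every cell and partial sums move from cell k+1 to cell k each clock. Decimation and scheduling are not modelled, and the class and function names are hypothetical.

```python
class FilterCell:
    """One cell: a single multiplier, an adder and two result registers."""

    def __init__(self, low_coeff, high_coeff):
        self.low_coeff = low_coeff
        self.high_coeff = high_coeff
        self.low_reg = 0.0
        self.high_reg = 0.0
        self.multiplies = 0  # counts uses of the one shared multiplier

    def _multiply(self, a, b):
        self.multiplies += 1
        return a * b

    def clock(self, x, low_in, high_in):
        # First half of the clock cycle: low pass product.
        self.low_reg = low_in + self._multiply(self.low_coeff, x)
        # Second half of the clock cycle: high pass product, same multiplier.
        self.high_reg = high_in + self._multiply(self.high_coeff, x)


def run_array(x, low_taps, high_taps):
    cells = [FilterCell(l, h) for l, h in zip(low_taps, high_taps)]
    low_out, high_out = [], []
    for sample in x:
        # Snapshot the registers so each cell sees its neighbour's
        # value from the previous clock cycle.
        prev = [(c.low_reg, c.high_reg) for c in cells] + [(0.0, 0.0)]
        for k, cell in enumerate(cells):
            cell.clock(sample, *prev[k + 1])
        low_out.append(cells[0].low_reg)
        high_out.append(cells[0].high_reg)
    return low_out, high_out, cells


low_out, high_out, cells = run_array(
    [1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
)
print(low_out, high_out, [c.multiplies for c in cells])
```

Each cell performs two multiplications per clock cycle with a single multiplier, which is the property the frequency doubler is meant to provide.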
Modified Pyramid Algorithm
In order to achieve the highest hardware utilization and real time execution, a modification to the standard pyramid algorithm of Figure 1 [7] is required. Thus, first octave computations are scheduled every other sample period, and the higher octave computations are scheduled between the first octave computations, as shown in Table I. Also, scheduling of a computation is done at the earliest possible clock cycle. This prevents conflicts between the outputs of the first octave and any other octave output. The schedule for the entire set of computations is given in Table I.
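The interleaving idea can be sketched as follows. This is not a reproduction of Table I, whose exact cycle assignments are not restated here. It is one simple assignment that places the first octave on every other cycle and fits the higher octaves into the remaining slots, assuming three octaves and a period of eight cycles. The function name and the choice of slot 0 as the idle slot are assumptions.

```python
def octave_for_cycle(t, octaves=3):
    """Return the octave scheduled at cycle t, or None for an idle slot."""
    period = 1 << octaves  # 8 cycles for three octaves
    slot = t % period
    if slot == 0:
        return None  # one slot per period is left free in this sketch
    octave = 1
    # Odd slots go to octave 1; each factor of two moves up one octave.
    while slot % 2 == 0:
        slot //= 2
        octave += 1
    return octave

schedule = [octave_for_cycle(t) for t in range(8)]
print(schedule)
```

In this assignment, octave 1 occupies four slots per period, octave 2 two slots and octave 3 one slot, so no two octaves ever compete for the same clock cycle.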
Time instances at which computed variables are available can be determined depending on the latency of the filter. In this example architecture, the latency of the filter is assumed to be one. Computed output samples are therefore available one clock cycle after they are scheduled. Figure 4.
Numbers next to the switches signify the clock cycles at which inputs are to be taken from that location. Note that "k" is any integer.
One output must be taken from the low pass stream at a specific time instant (for this architecture it is at 8k+3). The other outputs are taken from the high pass filter stream.
A systolic VLSI architecture for computing the one dimensional DWT in real time has been presented. It can be seen from Figure 4 that the proposed architecture is simple, modular and cascadable, and is hence suitable for VLSI implementation. It employs only one multiplier per basic cell, and hence results in a compact implementation. We note that this architecture can easily be extended to the two dimensional DWT (Figure 1).
Acknowledgment

