In this note, optimal hardware architectures for the orthogonal and biorthogonal wavelet transforms are presented. The approach used here is not the standard lifting method, but takes advantage of the symmetries inherent in the coefficients of the transforms and the decimation/interpolation operators. The design is based on a highly optimized datapath, which seamlessly integrates both orthogonal and biorthogonal transforms, data extension at the edges and the forward and inverse transforms. The datapath design could be further optimized for speed or low power. The datapath is controlled by a small fast control unit which is hard programmed according to the wavelet or wavelets required by the application.
INTRODUCTION
The use of the wavelet transform in image and video processing is well known [1, 2]. One of its main advantages is that there are very efficient software implementations, such as lifting [3]. Lifting has also been used for hardware implementations; however, it is well known that it is optimal only when the filter lengths are large [3].
Here we shall show that there are also highly efficient hardware architectures for the orthogonal and biorthogonal wavelet transforms. Architectures for the wavelet transform designed to minimize the number of low and high pass convolvers have been given for the one-dimensional case in Ref. [4], with an extended version in Ref. [5]. In Refs. [5, 6], architectures for the multi-octave two-dimensional wavelet transform using only three convolvers are presented. Although the approach of Ref. [5] reduces the number of convolvers needed, it has major disadvantages for practical implementation: the forward and inverse transforms use different architectures, so doubling the circuit area, and the problem of boundary conditions for finite data sets has not been considered.
The basis of our approach is a new convolver circuit that generates low and high pass values simultaneously in the forward transform, and combines low and high pass values in the inverse transform to produce even and odd data values. This is possible because of the symmetry of the orthogonal and biorthogonal wavelet coefficients and of the decimation and interpolation by two operators. The results extend those of Ref. [7] by including boundary conditions and biorthogonal wavelets, and are optimal in the sense of the number of multipliers and adders used. The architectures given here are more efficient than those from lifting: for example, in the Daubechies 4 case, lifting requires [3, Table I] 6 multiplications and 8 additions per transformed (H, G) pair, whereas the method here uses 5 multiplications and 8 additions. Note that the designs given here are fully pipelined and so are suitable for high speed implementation. This will suit future video compression standards, such as Motion JPEG2000.
A prototype circuit of a 4 tap Daubechies 2D wavelet transform has been fabricated and tested successfully as part of a wavelet zero-tree video codec project [1, 2, 8, 9], and a single circuit implementing all the wavelets used in the JPEG2000 and MPEG4 image compression standards has been designed and simulated in VHDL.
ORTHOGONAL WAVELETS
As an example to illustrate the concepts introduced here, we first consider a circuit for the Daubechies 4 tap orthogonal wavelet from Refs. [7, 10], with (signless) coefficients a, b, c, d. It was shown there that the multiplications between the filter coefficients and the input data for this wavelet can be performed with a small number of shifts and adds. The filter equations are, for $1 < i < (n/2) - 1$:
In order to transform a line of data of length n (n even) we need an extra data value at the start and end of the line. Here we use an even symmetric extension at the ends of the line; however, in hardware it is much simpler to change the filters at the start and end of the line than to extend the input data. Accordingly, we define the forward start and end filters,
Hn
Gn
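As a rough illustration of why changing the filters at the ends of a line is equivalent to extending the data, the following Python sketch compares the two for a 4 tap filter pair. The sign pattern used here (low pass a, b, c, -d and high pass d, c, -b, a acting on $d_{2i-1}, \ldots, d_{2i+2}$), the numerical Daubechies 4 magnitudes and all function names are assumptions made for this sketch; they are not taken from the equations of this note.

```python
import math

def signless_db4():
    # Assumed magnitudes (a, b, c, d) of the Daubechies 4 tap coefficients.
    s3, k = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
    return (1 + s3) / k, (3 + s3) / k, (3 - s3) / k, (s3 - 1) / k

def forward_extended(x):
    """Reference: repeat the first and last samples, then use one filter everywhere."""
    a, b, c, d = signless_db4()
    e = [x[0]] + list(x) + [x[-1]]            # e[j + 1] holds x[j]
    H, G = [], []
    for i in range(len(x) // 2):
        w0, w1, w2, w3 = e[2 * i: 2 * i + 4]  # x[2i-1], ..., x[2i+2]
        H.append(a * w0 + b * w1 + c * w2 - d * w3)
        G.append(d * w0 + c * w1 - b * w2 + a * w3)
    return H, G

def forward_start_end(x):
    """Same transform with the extension folded into start and end filters."""
    a, b, c, d = signless_db4()
    n, m = len(x), len(x) // 2                # n even, n >= 4
    H, G = [0.0] * m, [0.0] * m
    # Start filters: the repeated sample merges two coefficients.
    H[0] = (a + b) * x[0] + c * x[1] - d * x[2]
    G[0] = (c + d) * x[0] - b * x[1] + a * x[2]
    # Interior filters need no extension.
    for i in range(1, m - 1):
        w0, w1, w2, w3 = x[2 * i - 1: 2 * i + 3]
        H[i] = a * w0 + b * w1 + c * w2 - d * w3
        G[i] = d * w0 + c * w1 - b * w2 + a * w3
    # End filters: the last sample is used twice.
    H[m - 1] = a * x[n - 3] + b * x[n - 2] + (c - d) * x[n - 1]
    G[m - 1] = d * x[n - 3] + c * x[n - 2] + (a - b) * x[n - 1]
    return H, G
```

Under these assumptions the two functions agree to rounding error on any even-length row, so no extended samples ever need to be stored or fetched.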
The row convolver circuit is shown in Fig. 1, the operation of the convolver in the forward transform on a row of length 8 is shown in Table I, and in the inverse transform in Table II. In the MULTIPLY block (not shown) all the filter coefficients are multiplied by the input data value, and the products are then combined in the adder-subtracter units (AS0-AS3). The ZERO block selectively zeros or passes the input value and the DEL block delays its input value by one cycle. We will explain the sequence of operations in Table I and then indicate how these are mapped to the datapath (Fig. 1). Each line in the table represents one clock cycle. For example, in Table I, line 4, the input data value $d_3$ is multiplied by the filter coefficients a, b, c, d in the MULTIPLY unit, so at the end of the first cycle the output of AS0 is $a d_3$, the first term of $H_2$. Simultaneously, $d\,d_3$ is the output of AS3. On the next cycle $a d_3$ from AS0 is added to $b d_4$ in AS1 and $d\,d_3$ from AS3 is added to $c d_4$ in AS2. In this way the low pass value $H_2$ is evaluated in the sequence: AS0 ($a d_3$
The start filters $G_0$ and $H_0$ are evaluated in three cycles; for $G_0$ the order is: AS2
The end filters are evaluated in the same way. Note that since all the multiplications are performed in the same cycle, the computation of $(c + d)d_0$ and $(a + b)d_0$ requires only two extra adders.
The movement of data described above is mapped to the datapath in the following way. Consider again the calculation of the transformed values $G_2$ and $H_2$. As before, the filter coefficients are simultaneously multiplied by the input data $d_3$ in the MULTIPLY block. Thus $a d_3$, $b d_3$, $c d_3$, $d\,d_3$ are passed, in the same cycle, to the AS0, AS1, AS2 and AS3 units. Note that the ZERO block cancels the output from AS1 (AS2) so that the output of the adder in AS0 (AS3) is $a d_3$ ($d\,d_3$). On the next cycle, the control signal on MUX1 is switched so that the multiplexer takes its input from the left side, and MUX2 takes its input from the right side. Consequently, at the end of the second cycle the output of AS1 is $a d_3 + b d_4$, and that of AS2 is $d\,d_3 + c d_4$. For the third cycle, these multiplexers swap their inputs, so that MUX1 takes its input from the right side and MUX2 takes its input from the left side. So now AS1 sums $d\,d_3 + c d_4$ with $-b d_5$ and AS2 sums $a d_3 + b d_4$ with $c d_5$, and so on. By changing the control signals on the multiplexers on each cycle, they act like a swinging door, pushing the output of the AS units to the right on one cycle and to the left on the next cycle, and so the low pass and high pass values (H, G) are calculated simultaneously. In this way a high pass coefficient (G) and a low pass coefficient (H) are output on every second cycle. The high pass value is delayed one cycle and the row convolver outputs one filtered value per cycle in the sequence $H_0, G_0, H_1, G_1, \ldots$, the same order as in the lifting algorithm.
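The data movement described above can be pictured in software as forming the four products of each input sample once and routing them into the running sums of two neighbouring (H, G) pairs. The sketch below is behavioural only: it does not model the AS, MUX, ZERO or DEL blocks cycle by cycle, it feeds the repeated boundary sample through the loop instead of using start and end filters, and the sign pattern, coefficient values and function names are assumptions made for illustration.

```python
import math

S3, K = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
# Assumed signless Daubechies 4 magnitudes.
A, B, C, D = (1 + S3) / K, (3 + S3) / K, (3 - S3) / K, (S3 - 1) / K

def forward_streaming(x):
    """One sample per step; each product is formed once and used once."""
    n, m = len(x), len(x) // 2
    H, G = [0.0] * m, [0.0] * m
    for j in range(-1, n + 1):                 # includes the two extended samples
        v = x[min(max(j, 0), n - 1)]           # even symmetric (repeated) extension
        pa, pb, pc, pd = A * v, B * v, C * v, D * v   # the MULTIPLY step
        if j % 2:                              # j = 2i - 1
            i = (j + 1) // 2
            if i < m:                          # first term of pair i
                H[i] += pa
                G[i] += pd
            if i >= 1:                         # third term of pair i - 1
                H[i - 1] += pc
                G[i - 1] -= pb
        else:                                  # j = 2i
            i = j // 2
            if i < m:                          # second term of pair i
                H[i] += pb
                G[i] += pc
            if i >= 1:                         # fourth term of pair i - 1
                H[i - 1] -= pd
                G[i - 1] += pa
    return H, G

def forward_reference(x):
    """Direct evaluation of the same filters, one output at a time."""
    n = len(x)
    s = lambda j: x[min(max(j, 0), n - 1)]
    H = [A * s(2 * i - 1) + B * s(2 * i) + C * s(2 * i + 1) - D * s(2 * i + 2)
         for i in range(n // 2)]
    G = [D * s(2 * i - 1) + C * s(2 * i) - B * s(2 * i + 1) + A * s(2 * i + 2)
         for i in range(n // 2)]
    return H, G
```

In this picture every product contributes to exactly one low pass and one high pass accumulation in flight, which is why the two outputs can share a single multiplier block.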
The inverse row transform is shown in Table II. The usual formula for calculating the inverse wavelet transform is to interpolate each of the H, G values by 2 and sum the results. However, in hardware it is simpler to combine these operations in two filters, one to give the even data values and another to give the odd data values. For the Daubechies 4 wavelet the even and odd reconstruction filters are, for $0 < i < n/2 - 2$, $i$ even:
The same hardware (Fig. 1) is used for both the forward and the inverse transforms. In this case the H, G filtered pairs from the inverse column convolver are multiplied by the respective coefficients in the MULTIPLY unit and input into the pipelined adder-subtracter unit. It can be shown that perfect reconstruction is obtained with the inverse start and end filters
The start reconstruction filter [Eq. (9)] is calculated in the order AS1 ($(b - a)H_0$) and AS0 ($(b - a)H_0 + (c - d)G_0$), and the end reconstruction filter in the order AS0 ($(c + d)H_3$) and AS1 ($(c + d)H_3 - (a + b)G_3$).
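The idea of replacing "interpolate by 2 and sum" with one filter for the even samples and one for the odd samples can be checked numerically. The sketch below does this for a Daubechies 4 tap transform with periodic extension, which is swapped in here because its inverse is exactly the transpose; it is not the symmetric extension with start and end filters used in this note, and its indexing ($H_i$ acting on $x_{2i}, \ldots, x_{2i+3}$) and function names are assumptions.

```python
import math

def db4_filters():
    # Standard signed Daubechies 4 low pass h and its quadrature mirror g.
    s3, k = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
    h = [(1 + s3) / k, (3 + s3) / k, (3 - s3) / k, (1 - s3) / k]
    g = [h[3], -h[2], h[1], -h[0]]
    return h, g

def forward_periodic(x):
    h, g = db4_filters()
    n = len(x)                                  # n even, n >= 4
    H = [sum(h[k] * x[(2 * i + k) % n] for k in range(4)) for i in range(n // 2)]
    G = [sum(g[k] * x[(2 * i + k) % n] for k in range(4)) for i in range(n // 2)]
    return H, G

def inverse_even_odd(H, G):
    """Even and odd samples each come from one short filter over (H, G) pairs."""
    h, g = db4_filters()
    m = len(H)
    x = [0.0] * (2 * m)
    for j in range(m):
        p = (j - 1) % m                         # previous pair, wrapping around
        x[2 * j] = h[0] * H[j] + g[0] * G[j] + h[2] * H[p] + g[2] * G[p]
        x[2 * j + 1] = h[1] * H[j] + g[1] * G[j] + h[3] * H[p] + g[3] * G[p]
    return x
```

Each reconstructed sample uses two H values and two G values, so the even and odd filters have the same shape as the forward pair and can plausibly share the same adder pipeline, as the text describes.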
It can be seen that this architecture extends straightforwardly to any even length orthogonal wavelet: once the type of extension at the boundary has been decided, the corresponding start and end forward and reconstruction filters can be used, and a suitable MULTIPLY block designed.
A prototype of this design was simulated in VHDL and synthesized with Synopsys. The gate count was 12 K gates, and the design easily achieved the target rate of 60 frames per second for an image size of 320 × 240 for a three octave 2D transform with YUV 4:1:1 video input. The chip was fabricated and successfully tested [9].
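To give a feel for the target rate quoted above, the following back-of-envelope estimate counts how many filtered values per second a single time-shared convolver would have to produce. It assumes that YUV 4:1:1 carries 1.5 samples per pixel, that each octave makes one row pass and one column pass over the current low pass band, and that one filtered value is produced per clock; pipeline latency and memory access are ignored. These are assumptions for illustration, not figures reported for the fabricated chip.

```python
WIDTH, HEIGHT, FPS = 320, 240, 60
SAMPLES_PER_PIXEL = 1.5      # assumed: Y at full rate, U and V at quarter rate
OCTAVES = 3

def values_per_second():
    samples_per_frame = WIDTH * HEIGHT * SAMPLES_PER_PIXEL
    # Octave k works on a band 1/4**k the size of the frame, and needs
    # one row pass plus one column pass over that band.
    passes = sum(2 * 0.25 ** k for k in range(OCTAVES))
    return samples_per_frame * passes * FPS

print(f"{values_per_second() / 1e6:.3f} million filtered values per second")
```

Under these assumptions the requirement is of the order of 18 million values per second, that is, a clock of roughly that many MHz for a one-value-per-cycle convolver.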
BIORTHOGONAL WAVELETS
In this section we consider the extension of the basic hardware architecture to linear phase, odd symmetric biorthogonal wavelets, such as the popular binomial [11] or spline wavelets [12] . These are the most commonly used biorthogonal wavelets, for example, the wavelets used in the standards JPEG2000 and MPEG4 are of this type.
As an example of this type of wavelet we have chosen the well known 9-7 biorthogonal wavelet [12] from the JPEG2000 image compression standard. Denote $h = (h_4, h_3, h_2, h_1, h_0, h_1, h_2, h_3, h_4)$ as the analysis low pass filter and $g = (g_3, g_2, g_1, g_0, g_1, g_2, g_3)$ as the analysis high pass filter. Then, using symmetric extension at the edges, e.g. Refs. [13, 14], the forward start and end filters for the first row of the row convolution become:
The even reconstruction filter is $(0, h_3, -g_2, h_1, -g_0, h_1, -g_2, h_3)$ and the odd reconstruction filter is $(0, -h_4, g_3, -h_2, g_1, -h_0, g_1, -h_2, g_3, -h_4)$. Perfect reconstruction at the start and end of a line will be obtained with the inverse start and end filters:
For this wavelet the hardware architecture is given in Figs. 2 and 3. The pipelined convolver is just Fig. 1 increased to 9 AS units to accommodate the 9 tap filter. The multiplier unit (Fig. 3) is where the multiplication of the input data with the filter coefficients is performed. It can be seen that only 5 multipliers and 8 adders are needed, the minimum for a 9 tap symmetric filter. In Fig. 2 the extra adders and muxes are used for the generation of the terms required by the start and end filters. The ZPS block either zeroes, passes, or shifts by 2 the input value. For the forward row convolution the data flow through the pipelined convolver is shown in Table III and the inverse in Table IV. The data move through the array as before: in the forward transform the high pass filtered values move diagonally from right to left and the low pass values diagonally from left to right. In the inverse transform, the odd data values move diagonally from right to left and the even values diagonally from left to right. For example, $H_0$ is computed in the order
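One way to see why five multipliers suffice is to note that an even-indexed input sample only ever meets $h_0, h_2, h_4, g_1, g_3$, while an odd-indexed sample only meets $h_1, h_3, g_0, g_2$. The behavioural sketch below forms at most five products per input sample and compares the result with direct convolution. The coefficient values are the commonly quoted 9-7 values and are an assumption here (the comparison does not depend on them), as are the whole-sample symmetric extension, the alignment of $G_i$ on sample $2i + 1$, and all names; the extension is fed through the loop instead of using start and end filters.

```python
# Assumed 9-7 analysis half-filters: H97[k] = h_k, G97[k] = g_k.
H97 = [0.6029490182363579, 0.2668641184428723, -0.07822326652898785,
       -0.01686411844287495, 0.02674875741080976]
G97 = [1.115087052456994, -0.5912717631142470, -0.05754352622849957,
       0.09127176311424948]

def reflect(j, n):
    # Whole-sample symmetric extension: x[-k] = x[k], x[n-1+k] = x[n-1-k].
    if j < 0:
        return -j
    if j >= n:
        return 2 * (n - 1) - j
    return j

def forward_direct(x):
    n, m = len(x), len(x) // 2
    H = [sum(H97[abs(k)] * x[reflect(2 * i + k, n)] for k in range(-4, 5))
         for i in range(m)]
    G = [sum(G97[abs(k)] * x[reflect(2 * i + 1 + k, n)] for k in range(-3, 4))
         for i in range(m)]
    return H, G

def forward_shared(x):
    """Form each product once per input sample, then route it to its sums."""
    n, m = len(x), len(x) // 2                  # n even, n >= 6
    H, G = [0.0] * m, [0.0] * m
    max_products = 0
    for j in range(-4, n + 3):
        v = x[reflect(j, n)]
        if j % 2 == 0:                          # coefficients switch with parity
            ph = {k: H97[k] * v for k in (0, 2, 4)}
            pg = {k: G97[k] * v for k in (1, 3)}
        else:
            ph = {k: H97[k] * v for k in (1, 3)}
            pg = {k: G97[k] * v for k in (0, 2)}
        max_products = max(max_products, len(ph) + len(pg))
        for i in range(max(0, j // 2 - 2), min(m, j // 2 + 3)):
            kh, kg = abs(j - 2 * i), abs(j - 2 * i - 1)
            if kh in ph:
                H[i] += ph[kh]
            if kg in pg:
                G[i] += pg[kg]
    return H, G, max_products
```

The parity-dependent choice of coefficients in `forward_shared` mirrors the remark in the text that each cell switches between two coefficient values on alternate cycles.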
then the zero in AS0 passes this result to the output. Apart from the start and end filters (which only affect the first and last five values of the line) the filter coefficients in each of the AS cells switch between two values on each cycle. We can see from this that the control unit will be small.
An important example of an odd symmetric biorthogonal filter has been specified in the MPEG4 image compression standard. In this case the high pass analysis filter is $2^{-13/2}(3, 6, -16, -38, 90, -38, -16, 6, 3)$ and the low pass synthesis filter is $2^{-13/2}(32, 64, 32)$. Although now the high pass is longer than the low pass, we can still use the same architecture by swapping the role of h and g.
In the forward transform, the low pass values appear from AS1, and the high pass from AS8. By changing the settings of the output mux, and by using the optional delay on AS8 we can still write the values to memory consistent with the order output from the lifting algorithm. In the inverse transform the even data values are output from AS0, and the odd from AS7.
To illustrate these designs, one-dimensional biorthogonal wavelet transforms using the MPEG4 wavelet and the LeGall 5-3 and Daubechies 9-7 wavelets from JPEG2000 were designed in VHDL and fully simulated. An important point for the efficient operation of this type of design is the size of the control unit. In this case it was very small, using only two hundred gates.
Finally, note that for the case of odd symmetry, the same architecture applies for an anti-symmetric extension of the data at the edges. Only in the case that h and g are odd symmetric and have the same number of coefficients would it be necessary, in order to use this type of architecture, to zero extend the h filter, increasing its length by 2.
CONCLUSION
We have shown here that the criss-cross pipelined convolver unit of Figs. 1 and 2 provides an extremely efficient, regular architecture for the convolver part of the orthogonal and biorthogonal wavelet transform. Its principal advantage over the traditional convolver architectures is that it combines the decimation/interpolation by two of the filtered coefficients, the simultaneous generation of the high and low pass filtered values, the start and end filters and the symmetry between the forward and inverse transforms in a single simple architecture. A sample orthogonal wavelet transform chip has been fabricated and tested successfully [9], and a design of the JPEG2000 and MPEG4 biorthogonal wavelets has been implemented in VHDL and fully simulated.
In this work, we have not addressed possible memory architectures for the transform, since this problem has received considerable attention in the literature.
The designs given here fit the memory architectures of Refs. [8, 9] without modification, or they could use those of Ref. [15] .
As mentioned in the introduction, the datapaths used in this note (e.g. Figs. 1 and 2) can be optimized for low power or high speed by appropriately sizing the transistors in the datapath. This research is currently being undertaken.
Acknowledgements
The author would like to thank the referee for his invaluable comments. 
