The wavelet transform appears to be an efficient tool for image compression. Many works propose an implementation of the pyramid algorithm with improvements that reduce its processing time or increase its performance. However, the pyramid algorithm remains costly in silicon area, essentially because of its memory needs, which depend on the size of the filters used. This paper proposes a new implementation of the wavelet transform using the lifting scheme. This method offers many improvements, such as in-place calculation, small memory needs, and an easy inverse transform.
Next, we present the pyramid algorithm's implementation based on filter banks, and our architecture based on the lifting scheme.
WAVELET TRANSFORM AND MULTI-RESOLUTION ANALYSIS
The multi-resolution analysis [1] uses two functions to project a signal onto two spaces: the wavelet function extracts the details (high-frequency signal), while the scaling function keeps the approximation (low-frequency signal). By translating and dilating the wavelet, we can analyse the signal over its whole duration and at different resolution levels. Different wavelet techniques exist for image processing [18, 19], depending on the signal to be analysed. We concentrate our efforts on the pyramid and the lifting scheme algorithms.
Filter banks and pyramid algorithm (PA)
The relationship between the wavelet multi-resolution analysis and filter banks was first shown by Mallat [1]. The image is filtered by both low-pass and high-pass filters along the horizontal direction, giving, respectively, an approximation of the original image and its horizontal details. This scheme is re-applied to the two sub-images along the vertical direction, giving the three horizontal, vertical and diagonal detail sub-images, and the 2-D approximated sub-image (Figure 1, left). The inverse transform is obtained by inverse filtering with the corresponding filters (Figure 1, right).
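The row-then-column filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the two-tap averaging and differencing filters are placeholders (not the filters used in this paper), borders are treated periodically, and the function names `analyze_1d` and `analyze_2d` are hypothetical.

```python
def analyze_1d(signal, low, high):
    """Filter a 1-D signal with both filters and keep every second output.

    Borders are handled periodically (an assumption made for brevity).
    """
    n = len(signal)

    def tap_sum(taps, start):
        return sum(t * signal[(start + i) % n] for i, t in enumerate(taps))

    approx = [tap_sum(low, k) for k in range(0, n, 2)]
    detail = [tap_sum(high, k) for k in range(0, n, 2)]
    return approx, detail


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def analyze_2d(image, low, high):
    """One pyramid level: filter the rows, then the columns of both outputs."""
    row_results = [analyze_1d(row, low, high) for row in image]
    row_low = [a for a, _ in row_results]
    row_high = [d for _, d in row_results]

    def along_columns(sub_image):
        col_results = [analyze_1d(col, low, high) for col in transpose(sub_image)]
        return (transpose([a for a, _ in col_results]),
                transpose([d for _, d in col_results]))

    approx, detail_a = along_columns(row_low)     # low-pass rows: low / high columns
    detail_b, detail_c = along_columns(row_high)  # high-pass rows: low / high columns
    return approx, detail_a, detail_b, detail_c


# Placeholder two-tap filters (averaging and differencing), not the paper's filters.
LOW = [0.5, 0.5]
HIGH = [0.5, -0.5]
```

Each call produces four sub-images with half the width and half the height of the input; re-applying `analyze_2d` to the approximation gives the next resolution level.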
The lifting scheme (LS)
The lifting scheme algorithm [7, 8, 9, 10] presents many advantages compared to the pyramid algorithm [9]:
calculations are performed in-place, which saves a significant amount of memory;
the algorithm shows an inherent SIMD parallelism at all scales;
the inverse transform is easily obtained from the direct transform;
the scheme can be applied even when Fourier techniques are no longer suitable (for example when the samples are not evenly spaced, which causes problems with filter banks because of the sub-sampling).
Transform
It is composed of the following three steps (see Figure 2 for details):
split: this step separates the signal into two parts. The more correlated the two sub-signals are, the better the prediction; in practice, the signal is divided into a set of even-indexed samples (S) and a set of odd-indexed samples (D);
predict: the first set S is used to predict the second set D, according to a defined function; the difference between the original set and its prediction is kept as the detail signal;
update: the detail signal D is used to update the unmodified set S, in order to preserve the average value of the signal.
This can be summarized by the following implementation:
(odd_{j-1}, even_{j-1}) := Split(s_j)
odd_{j-1} -= Predict(even_{j-1})
even_{j-1} += Update(odd_{j-1})
Figure 2. Signal decomposition (left) and recomposition (right) with the lifting scheme
Figure 2 gives the schemes for 1-D decomposition and recomposition. For a 2-D transform, we simply apply the same method to each output, processing the scheme along the vertical direction.
The implemented wavelet depends on the prediction function. The higher the degree of the prediction function, the smoother the wavelet. If the prediction is very close to the signal, the detail coefficients (i.e. the wavelet coefficients) will be very small.
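The influence of the prediction function can be illustrated with a small sketch. It assumes an even-length signal, a border handled by repeating the last available sample, and two illustrative predictor and update pairs (constant and linear); all function names are hypothetical and the coefficients 1/2 and 1/4 are the usual choices for these two cases, not values taken from this paper.

```python
def lifting_forward(x, predict, update):
    """One lifting level: split, predict, update.

    predict(even, k) estimates odd sample k from the even samples;
    update(detail, k) returns the correction added to even sample k.
    """
    even = list(x[0::2])   # split: even-indexed samples (S)
    odd = list(x[1::2])    # split: odd-indexed samples (D)
    detail = [odd[k] - predict(even, k) for k in range(len(odd))]
    smooth = [even[k] + update(detail, k) for k in range(len(even))]
    return smooth, detail


def predict_constant(even, k):
    return even[k]                              # degree 0: copy the left neighbour


def update_constant(detail, k):
    return detail[k] / 2


def predict_linear(even, k):
    right = even[min(k + 1, len(even) - 1)]     # repeat the last sample at the border
    return (even[k] + right) / 2                # degree 1: mean of both neighbours


def update_linear(detail, k):
    left = detail[max(k - 1, 0)]                # repeat the first detail at the border
    return (left + detail[k]) / 4
```

On a ramp such as `list(range(8))`, the linear predictor gives zero detail coefficients away from the border, whereas the constant predictor leaves a detail of 1 at every position: the closer the prediction is to the signal, the smaller the details.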
Camille DIOU, Lionel TORRES, Michel ROBERT
Let us take an example with a linear prediction. If the original signal x is split into the detail signal d and the smooth signal s, we have:
Before the prediction:
After the linear prediction:
And after the update:
Equation 3 gives the corresponding high-pass filter; inserting equation 3 into equation 4 gives the corresponding low-pass filter:
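The expressions referred to above can be written out as follows. This is a sketch under an assumption: the prediction is taken as the mean of the two neighbouring even samples and the update as a quarter of the two neighbouring details, which is the usual lifting factorisation of the CDF (2,2) wavelet. The numbering (1) to (4) follows the references in the text; the normalisation may differ from the one originally used.

```latex
\begin{align*}
s_k^{(0)} &= x_{2k} \tag{1}\\
d_k^{(0)} &= x_{2k+1} \tag{2}\\
d_k &= d_k^{(0)} - \tfrac{1}{2}\bigl(s_k^{(0)} + s_{k+1}^{(0)}\bigr)
     = -\tfrac{1}{2}x_{2k} + x_{2k+1} - \tfrac{1}{2}x_{2k+2} \tag{3}\\
s_k &= s_k^{(0)} + \tfrac{1}{4}\bigl(d_{k-1} + d_k\bigr) \tag{4}\\
s_k &= -\tfrac{1}{8}x_{2k-2} + \tfrac{1}{4}x_{2k-1} + \tfrac{3}{4}x_{2k}
       + \tfrac{1}{4}x_{2k+1} - \tfrac{1}{8}x_{2k+2}
\end{align*}
```

The last line is obtained by substituting (3), written for indices k-1 and k, into (4): the x_{2k} terms combine as 1 - 1/8 - 1/8 = 3/4.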
Inverse transform
The inverse transform scheme contains three steps (undo update, undo predict and merge), which are obtained by reversing the order of the operations and changing the signs of the operators, as shown in Figure 2.
We summarize it with:
even_{j-1} -= Update(odd_{j-1})
odd_{j-1} += Predict(even_{j-1})
s_j := Merge(odd_{j-1}, even_{j-1})
The equivalent filter is obtained by setting one detail coefficient to 1 and all others to 0, and computing the inverse transform. This gives the high-pass filter. Doing the same with a smooth coefficient gives the low-pass filter.
Note that the filters we obtain correspond to the wavelet defined by Cohen, Daubechies and Feauveau (CDF) [2].
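The inverse transform and the impulse method for the equivalent filters can be checked with a short sketch. It assumes the linear prediction (coefficient 1/2) and update (coefficient 1/4) steps, periodic borders and an even-length signal; the function names are hypothetical and the filters are left unnormalised.

```python
def forward(x):
    """Linear-prediction lifting (assumed 1/2 and 1/4 steps), periodic borders."""
    n = len(x) // 2
    s = [x[2 * k] for k in range(n)]          # split: even samples
    d = [x[2 * k + 1] for k in range(n)]      # split: odd samples
    d = [d[k] - (s[k] + s[(k + 1) % n]) / 2 for k in range(n)]   # predict
    s = [s[k] + (d[k - 1] + d[k]) / 4 for k in range(n)]         # update
    return s, d


def inverse(s, d):
    """Same operations in reverse order with opposite signs, then merge."""
    n = len(s)
    s = [s[k] - (d[k - 1] + d[k]) / 4 for k in range(n)]         # undo update
    d = [d[k] + (s[k] + s[(k + 1) % n]) / 2 for k in range(n)]   # undo predict
    x = [0.0] * (2 * n)
    x[0::2] = s                                                  # merge
    x[1::2] = d
    return x


def equivalent_filter(kind, n=8):
    """Set one coefficient to 1, all others to 0, and run the inverse transform."""
    s = [0.0] * n
    d = [0.0] * n
    (d if kind == "detail" else s)[n // 2] = 1.0
    return [value for value in inverse(s, d) if value != 0.0]
```

Under these assumptions, `equivalent_filter("smooth")` returns the taps (1/2, 1, 1/2) and `equivalent_filter("detail")` returns (-1/8, -1/4, 3/4, -1/4, -1/8), and `inverse(*forward(x))` restores `x` exactly.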
ARCHITECTURES
There are many works on the implementation of the WT on FPGA [11, 12] or ASIC [13, 14, 15, 16, 20]. Most of them propose improvements to the pyramid algorithm [14, 17], or to the implementation (systolic or semi-systolic architecture, parallel or semi-parallel design [13, 15, 16]). But these methods rely on the same algorithm to perform the WT.
In this section, we present how this wavelet transform can be implemented, first using filter banks, then with the lifting scheme. We then compare the methods in terms of the number of operators needed, the silicon area cost and the memory cost.
Filter banks architecture
Figure 3 shows a basic implementation of the pyramid algorithm for a 1-D wavelet transform. The number of operators is chosen to correspond to the CDF wavelet (the implementation with the lifting scheme is shown in Section 3.2). An implementation of the 2-D wavelet transform using the CDF wavelet is shown in [11]. We can see in Figure 3 that the filters described above are present three times and that a large memory is necessary to perform the transform. At least the two 128x256 memories must be on-chip if we want to keep good performance during the transform. All these memories and the filters need a large silicon area. The inverse transform is not shown, but it also needs a substantial memory: 112 kB are necessary in the architecture presented in [11]. Table 1 shows the theoretical performance of this system for 5 levels of resolution on a 256x256 image (see [11]), taking into account the following values: 250 ns per pixel for the analysis, and 40 ns per pixel for the synthesis. This table does not show the times of the quantization/dequantization process, which are small compared to the transform.
The estimated number of CLBs for an implementation in an XC4005 family FPGA is about 340. Thus, the implementation of the wavelet transform needs around 5 kgates, without the memory.
Lifting scheme architecture
1-D structure
We saw in Section 2.2 that the implementation of the lifting scheme requires few operators. For each step (prediction and update), we need two adders in the decomposition and two adders in the recomposition. We can see in Figure 4 that there is no need for memory during the 1-D transform: the transformed coefficients overwrite the original ones during the decomposition. Thus the transform block can be seen from outside as a delay line. The transform can be performed in pseudo real-time, depending on the system clock. The same conclusion holds for the reconstruction step.
Unlike the 1-D case, the 2-D case requires some memory to perform the transform. Before starting the transform along the columns, we have to wait for the results of the transform along the rows. There are different ways of doing this:
we wait until the first 1-D decomposition is complete, store the image in a memory, rotate it by 90° or address the memory in a different manner, and then perform the second decomposition;
we wait only until a few rows have been processed and then start the vertical decomposition on these rows. When the following rows are complete, we continue the vertical decomposition on them.
This last case is the most interesting from a memory point of view, but we also have to define how the memory is accessed: we receive a few lines, and we want to transform them along the columns. Figure 6 shows how the vertical decomposition can start once the horizontal decomposition of the first three or four lines is complete. We start the horizontal decomposition by computing the detail coefficients, and reuse them to compute the approximation. Once the first 3 lines have been processed, we can start the vertical decomposition. When the fourth line (detail) has been processed, the line of vertical detail coefficients is complete too. Both this line and the previously processed one can then be used to compute the vertical average coefficients. While the vertical average is being computed, the next vertical detail can be processed concurrently, while the horizontal average is also being calculated. Thus, in order to perform the complete 2-D transform, we only need a memory that can hold 4 lines of the image, i.e. for a 256x256 image at 8 bits per pixel, 4 x 256 x 8 bits = 8192 bits = 1 kByte.
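The memory figure above follows from simple arithmetic, sketched here for other image sizes. The function names are hypothetical, and the full-frame value is given only for comparison with the case where the whole image is stored before the second decomposition.

```python
def line_buffer_bytes(width, bits_per_pixel, lines=4):
    """Memory needed to hold a few image lines, in bytes."""
    return lines * width * bits_per_pixel // 8


def frame_buffer_bytes(width, height, bits_per_pixel):
    """Memory needed to hold the whole image, in bytes."""
    return width * height * bits_per_pixel // 8


four_lines = line_buffer_bytes(256, 8)          # 4 x 256 x 8 bits = 1024 bytes
whole_image = frame_buffer_bytes(256, 256, 8)   # 256 x 256 x 8 bits = 65536 bytes
```

For a 256x256 image at 8 bits per pixel, holding four lines takes 64 times less memory than holding the full frame, and the requirement grows only with the image width, not with its height.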
Comparisons
Number of logic blocks.
The lifting scheme is very interesting for integrated systems because it needs few logic blocks. The table below shows the difference between the lifting scheme and the filter banks. We saw that the filter banks architecture needs around 340 logic cells. The lifting scheme only needs around 190 (24 adders x 8 blocks per adder).
Furthermore, when the multiplications are performed by bit shifting, the lifting scheme needs only 12 shifts for a 1-D transform, fewer than the filter banks. For a 2-D transform, these numbers increase to 36 for the lifting scheme and 102 for the filter banks.
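To illustrate why shifts are sufficient, the sketch below replaces the multiplications by 1/2 and 1/4 with right shifts on integer samples. It is an assumption-laden illustration: the coefficients are those of a linear prediction, the shift rounds towards minus infinity, borders are periodic, and the function names are hypothetical; the rounding actually used in the architecture is not specified here.

```python
def forward_shift(x):
    """Integer lifting using only additions, subtractions and shifts."""
    n = len(x) // 2
    s = [x[2 * k] for k in range(n)]          # even samples
    d = [x[2 * k + 1] for k in range(n)]      # odd samples
    d = [d[k] - ((s[k] + s[(k + 1) % n]) >> 1) for k in range(n)]   # predict, /2
    s = [s[k] + ((d[k - 1] + d[k]) >> 2) for k in range(n)]         # update, /4
    return s, d


def inverse_shift(s, d):
    """Undo the steps in reverse order; the rounding cancels exactly."""
    n = len(s)
    s = [s[k] - ((d[k - 1] + d[k]) >> 2) for k in range(n)]         # undo update
    d = [d[k] + ((s[k] + s[(k + 1) % n]) >> 1) for k in range(n)]   # undo predict
    x = [0] * (2 * n)
    x[0::2] = s
    x[1::2] = d
    return x
```

Because each inverse step recomputes exactly the quantity that the forward step added or subtracted, the reconstruction is exact on integers even though the shifts discard low-order bits.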
Memory.
One of the main advantages of the lifting scheme is the in-place calculation. We do not need any buffer memory to perform the transform: the wavelet coefficients overwrite the original ones (Figure 7). Figure 7 shows, on a real image, the distribution of the Mallat and lifting scheme coefficients after a 2-level decomposition. The on-chip memory needed for the filter banks design is approximately the size of an entire image whereas, for a lifting scheme implementation, this memory is reduced to a few lines. Table 3 shows the memory necessary for the filter banks method as described in [11].
                 Decomposition   Recomposition
Lifting scheme   1 kB            1 kB
Filter banks     84 kB           112 kB
Table 3. Memory needs of the lifting scheme compared to the filter banks for a 256x256 image with 256 grey levels
Thus, the lifting scheme implementation needs about half as many logic blocks and very little memory compared to the filter banks architecture. Furthermore, as the lifting scheme computes the wavelet transform in-place, we want to evaluate its performance at video rate. The next section presents a first implementation of the lifting scheme.
HARDWARE IMPLEMENTATION

Overview
We have started to validate the lifting scheme architecture in real-time with an APTIX prototyping platform [21]. This programmable platform (Figure 8) contains an Altera 10k100 FPGA, used to implement the wavelet transform with the necessary memory. A DSP core (ST D950) is used to perform the quantization and the coding of the wavelet coefficients. Figures 10 and 11 show the implementation of the 2-D lifting scheme. The input image is first decomposed horizontally. We get two sub-images at half the pixel clock frequency. Thus, we can alternately decompose them along the vertical direction, switching between the detail sub-image and the smoothed sub-image. To keep the video rate, we have to interleave the lines of the two sub-images, putting one sample of the first sub-image in the memory, then one sample of the second sub-image, and compute the resulting image along the vertical direction.
Horizontal decomposition of the input image
Because of the in-place calculation, the horizontal decomposition of the image can be performed at the video rate. The original set of data at frequency F is decomposed into two subsets (detail and average) at frequency F/2. A simulation of the 1-D horizontal lifting scheme decomposition shows that this block needs 116 logic cells of the Altera Flex 10k100, that is, 2% of the chip. Thus, there is no noticeable difficulty in implementing the first 1-D horizontal decomposition block, and we focus on the vertical decomposition below.
Vertical decomposition of the sub-images
There are different ways of computing the vertical decomposition: we can use FIFO memories or RAM. FIFOs seem to be more efficient because they need no memory management, in the strict sense of the term. However, the logic necessary to manage the four FIFOs considerably increases the complexity. We describe both methods here.
Using FIFO memories
We consider that a first computed odd line DD_{n-1} is present in the first FIFO F1. A first not-yet-computed even line S_n is in the FIFO F2. A first not-yet-computed odd line D_{n+1} is in the third FIFO F3. When the second not-yet-computed even line S_{n+2} is ready, we compute the detail coefficient line from S_n, D_{n+1} and S_{n+2}. We name these coefficients DD_{n+1}. These coefficients are used to compute the average coefficients SS_n from DD_{n-1}, S_n and DD_{n+1}. The coefficients SS_n and DD_{n-1} are written to the outputs of the FIFOs F2 and F1 respectively. DD_{n+1} replaces DD_{n-1} in F1 and S_{n+2} replaces S_n in F2. Now, we have DD_{n+1} in F1 and S_{n+2} in F2; F3 is free. When D_{n+3} is ready, we store it in F3 and wait for S_{n+4}, which is necessary for calculating DD_{n+3}. Then, the scheme can be repeated from the beginning.
Using RAM
The method described above shows the complexity of managing the FIFO memories. We have to use numerous multiplexers and demultiplexers to choose between the video input (the output of the 1-D transform block) and the output of the FIFOs. All the logic that must be developed could instead be used to implement a RAM controller. Thus, instead of using FIFO memories, we could use RAM and address the data in a standard way (Figure 11).
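The line scheduling described for the FIFO method can be modelled in software as follows. This is a behavioural sketch, not the hardware design: it assumes a linear prediction (1/2) and update (1/4), an even number of lines, mirrored borders at the top and bottom, and hypothetical function names. The three variables `f1`, `f2` and `f3` play the roles of F1, F2 and F3.

```python
def lift_lines(dd_prev, s_n, d_next, s_below):
    """Compute DD_{n+1} and SS_n from the stored lines and the arriving line."""
    dd = [d - (a + b) / 2 for d, a, b in zip(d_next, s_n, s_below)]
    prev = dd if dd_prev is None else dd_prev          # mirror at the top border
    ss = [a + (p + q) / 4 for a, p, q in zip(s_n, prev, dd)]
    return ss, dd


def vertical_lifting_stream(lines):
    """Yield (row, coefficients) pairs while storing at most three lines."""
    f1 = f2 = f3 = None   # F1: DD_{n-1}, F2: S_n, F3: D_{n+1}
    n = 0                 # row index of the even line held in F2
    for index, line in enumerate(lines):
        if index % 2 == 1:
            f3 = line                                  # odd line: wait for S_{n+2}
        elif f2 is None:
            f2 = line                                  # very first even line
        else:
            ss, dd = lift_lines(f1, f2, f3, line)
            if f1 is not None:
                yield n - 1, f1                        # DD_{n-1} leaves F1
            yield n, ss                                # SS_n leaves F2
            f1, f2, f3, n = dd, line, None, index      # F3 is free again
    if f3 is not None:                                 # last odd line: mirror S_n
        ss, dd = lift_lines(f1, f2, f3, f2)
        if f1 is not None:
            yield n - 1, f1
        yield n, ss
        yield n + 1, dd


def vertical_lifting_reference(image):
    """Same transform computed with the whole image available, for comparison."""
    h = len(image)
    out = [list(row) for row in image]
    for r in range(1, h, 2):
        below = image[r + 1] if r + 1 < h else image[r - 1]
        out[r] = [d - (a + b) / 2 for d, a, b in zip(image[r], image[r - 1], below)]
    for r in range(0, h, 2):
        prev = out[r - 1] if r >= 1 else out[r + 1]
        out[r] = [a + (p + q) / 4 for a, p, q in zip(image[r], prev, out[r + 1])]
    return out
```

Collecting the streamed pairs with `dict(vertical_lifting_stream(image))` gives the same coefficients, row by row, as `vertical_lifting_reference(image)`, while only three stored lines plus the arriving one are alive at any time.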
Implementation
A first evaluation of the system presented in Figure 10 shows that the system uses 554 Altera logic cells (i.e. 25% of the Flex 10k100 chip), and 6144 memory bits, that is 25% of the total memory of the Flex 10k100. All these results are given for an 8-bit bus width. With a clock frequency of about 30 MHz, this allows real-time processing of images of 800x800 pixels with 256 grey levels, provided the size of the FIFO or the RAM is increased. The needed memory can be implemented on the Altera 10k100 FPGA. By using pipelining techniques, we estimate that a clock frequency of 60 MHz can be obtained, allowing us to process images in HDTV format.
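The relation between clock frequency and image size can be estimated with a short calculation. It assumes that one pixel is processed per clock cycle and ignores blanking intervals; the 1920x1080 frame size used for HDTV is also an assumption, and the function name is hypothetical.

```python
def frames_per_second(clock_hz, width, height, pixels_per_clock=1):
    """Upper bound on the frame rate when pixels are processed at the clock rate."""
    return clock_hz * pixels_per_clock / (width * height)


rate_800 = frames_per_second(30e6, 800, 800)      # 30 MHz clock, 800x800 image
rate_hdtv = frames_per_second(60e6, 1920, 1080)   # 60 MHz clock, assumed HDTV size
```

Under these assumptions, a 30 MHz clock sustains about 47 frames per second on 800x800 images, and a 60 MHz clock sustains slightly under 29 frames per second on 1920x1080 frames.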
First results of using the LS for image compression show a compression ratio of about 10 for a Peak Signal to Noise Ratio (PSNR) of 30 dB, which is the minimum generally accepted for good image quality. We are currently working on improving the compression ratio to reach a value around 40, by using a well-adapted quantization and arithmetic coding [4, 5, 6].
CONCLUSION
In this paper, we have shown a new way of implementing the wavelet transform. This method combines an efficient processing rate with low area cost and memory use. It can easily be adapted to perform the inverse transform. The concept of design re-use and IP cores is becoming increasingly present in the top-down design flow; it is changing the way electronic engineers work. The impact on time-to-market can be considerable; for this reason, we intend to develop a complete architecture for the wavelet transform which can be used as a wavelet IP core for image processing applications.
