The memory required for the implementation of the 2D wavelet transform typically incurs relatively high power consumption and limits the achievable speed. In this paper we propose an optimized architecture of the 1D/2D wavelet transform that reduces the memory size cost by one order of magnitude compared to classical implementation styles. This so-called Local Wavelet Transform also minimizes the memory access cost, thanks to its spatially localized processing. Furthermore, the proposed architecture introduces concurrency in the data transfer mechanism, resulting in a speed performance that is not limited by data transfer delays to/from main (off-chip) memory. Finally, the production of parent-children trees in indivisible clusters makes easy interfacing to Zero-Tree encoder modules possible, while keeping Region-of-Interest functionalities.
Introduction
Since the introduction of the Wavelet Transform (WT) as a new signal-processing tool, many new and improved image compression algorithms have been proposed, e.g. the Embedded ZeroTree coder of Shapiro (1993) and the Wavelet Transform with Trellis-Coded Quantization of Sriram (1995). Compared to the well-known block-based JPEG compression, wavelet-based image compression schemes often offer better compression performance and have additional functionalities. Unfortunately, the transform is global, which complicates its implementation. Indeed, the transform is applied on the whole image at once, requiring an additional buffer of the same size as the image. This buffer must be randomly accessible and must provide substantial bandwidth, since every data word is accessed several times per pixel. For example, when reading a 1024x1024 pixel image 30 times per second, about 30 million pixels are transferred every second. If one pixel is represented in 24 bits and the data bus is 8 bits wide, each pixel takes three bus transfers, so a bandwidth of about 90 MHz is needed just to read this image. Hence, the needed bandwidth of the intermediate buffer will be a multiple of 90 MHz. This relatively high bandwidth makes a straightforward implementation of the wavelet transform power hungry and costly.
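The bandwidth figure above can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name and its parameters are hypothetical and not part of the proposed architecture. It assumes that a pixel wider than the bus is moved in consecutive bus-wide transfers.

```python
def bus_transfer_rate(width, height, frames_per_second, bits_per_pixel, bus_width_bits):
    """Return the number of bus transfers per second needed to read the image stream."""
    pixels_per_second = width * height * frames_per_second
    # Ceiling division: a 24-bit pixel on an 8-bit bus needs 3 transfers.
    transfers_per_pixel = -(-bits_per_pixel // bus_width_bits)
    return pixels_per_second * transfers_per_pixel


rate = bus_transfer_rate(1024, 1024, 30, 24, 8)
print(f"{rate / 1e6:.1f} million transfers per second")
```

The exact count is slightly above 94 million transfers per second; the text rounds 1024x1024 to one million pixels, which gives the quoted 90 MHz.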
In , we propose a Local Wavelet Transform (LWT) algorithm. It offers exactly the same functionality as the classical wavelet transform, but unlike existing architectures (e.g. Chakrabarti (1993, 1995), Gordon (1994), Knowles (1990), Lewis (1991, 1992), Limqueco (1998), Parhi (1993) and Vishwanath (1994)), which do not address the Zero-Tree coupling, the LWT produces and outputs each parent-children tree as an indivisible set, which can then efficiently be encoded using a Zero-Tree encoder module, similar to Shapiro's (1993). An additional advantage of our approach is that the LWT not only reduces the memory size, as for example in Vishwanath's (1994, 1995) Recursive Pyramid Algorithm, but also drastically reduces the number of memory accesses. The interested reader is referred to for a more elaborate comparative study of the LWT versus existing implementation styles, such as those of Chakrabarti (1993), Gordon (1994), Knowles (1990), Lewis (1991, 1992), Parhi (1993) and Vishwanath (1994, 1995).
In this paper, an architecture that implements the LWT is proposed, using a dedicated memory hierarchy. Since parent-children trees are output in clusters, the LWT architecture is well suited for being interfaced to the Ozone chip of Vanhoof (1999), which is an optimized Zero-Tree encoder chip consuming wavelet data in a tree-by-tree fashion.
It is important to stress that the LWT architecture still supports the attractive features of the WT, such as scalability and Region Of Interest (ROI) functionalities of Ahmadian (1996), Capodiferro (1996), Shin (1997), Topiwala (1995) and Yu (1997). They are important requirements for a number of applications and hence are addressed in upcoming multimedia compression standards (e.g. JPEG-2000, MPEG-4). In particular, Region of Interest offers the possibility of selectively exchanging highly informative portions of images over limited bandwidth networks, without consuming bandwidth for irrelevant image regions. For example, the transmission of a texture in 3D rendering can be performed by incrementally encoding portions of the visible region of the texture, as shown in figure 1 for a 3D object that moves (turns) over time. Obviously, only the parent-children trees that are involved in the decoding of the visible portions of the texture at a particular time stamp should be transmitted over the network.
The Region of Interest functionality can be partially obtained by the lapped wavelet transform schedule proposed in Denk (1994), in which the input data is subdivided into data blocks of $2^L \times 2^L$ pixels. For each new input data block, successive filtering operations are performed for creating the wavelet data in blocks of $2^{L-i} \times 2^{L-i}$ in level i of the wavelet transform. Obviously, the number of pixels produced for each new input data block corresponds to the size of a parent-children tree.
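The claim that one input block yields exactly one tree's worth of wavelet data can be verified by counting coefficients. The sketch below assumes the usual dyadic 2D decomposition, which the text does not spell out: three detail subbands per level, each contributing a block of $2^{L-i} \times 2^{L-i}$ coefficients at level i, plus a single lowpass coefficient at the last level. The function names are illustrative only.

```python
def detail_coefficients(L, i):
    """Detail coefficients produced at level i for one 2^L x 2^L input block.

    Assumes three detail subbands per level (an assumption of this sketch).
    """
    side = 2 ** (L - i)
    return 3 * side * side


def tree_size(L):
    """Coefficients in one parent-children tree, including the lowpass root."""
    return sum(detail_coefficients(L, i) for i in range(1, L + 1)) + 1


for L in range(1, 6):
    block_pixels = (2 ** L) ** 2
    print(L, block_pixels, tree_size(L))
```

Under these assumptions the two counts coincide for every L, since the sum 3(4^(L-1) + ... + 4 + 1) + 1 equals 4^L.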
However, due to internal data-dependencies, the pixels created out of one specific input data block are not all part of the same parent-children tree: as a consequence of the data-dependencies of the filtering operations between successive levels of the wavelet transform, the creation of wavelet data in level i is delayed compared to the arrival of input data. Other implementation styles (e.g. the Recursive Pyramid Algorithm in Vishwanath (1994)) can also benefit from our approach for proper extraction of parent-children trees, but are inferior in memory access performance, as shown in . This paper is therefore devoted to the analysis of an architecture that reaches the lowest potential memory access cycle and storage size cost, predicted in , enabling the selective extraction of parent-children trees in high-speed Region Of Interest based texture encoding applications. In summary, the proposed architecture supports the following characteristics:
-Parent-children trees can be selectively extracted
-Region of Interest transmission
-Reduction up to a factor 2 of the signal processing power
-The memory size for temporary storage is minimized
-The external memory access bottleneck, as described in Patterson (1996), is alleviated
All these features are obtained by matching the locality of the processing along the data-dependencies of the wavelet transform with the in-place organization of Sweldens' (1994, 1995, 1996) Lifting Scheme.
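To make the in-place organization of the Lifting Scheme concrete, the following sketch shows one level of an integer lifting transform. It uses the short 5/3 integer filter with mirrored boundaries purely as an assumption for illustration; it is not the 9/7-tap filter discussed later in this paper, and the function names are hypothetical. Odd positions are overwritten by Highpass values and even positions by Lowpass values, so no second buffer is needed, and the rounding inside each step keeps the transform exactly invertible.

```python
def lifting_53_forward(samples):
    """One level of integer 5/3 lifting, computed in place on a copy of the input."""
    y = list(samples)
    n = len(y)
    assert n >= 2 and n % 2 == 0, "sketch assumes an even number of samples"
    # Predict step: each odd sample becomes a Highpass value.
    for i in range(1, n, 2):
        right = y[i + 1] if i + 1 < n else y[i - 1]  # mirror at the right edge
        y[i] -= (y[i - 1] + right) >> 1
    # Update step: each even sample becomes a Lowpass value.
    for i in range(0, n, 2):
        left = y[i - 1] if i > 0 else y[i + 1]  # mirror at the left edge
        y[i] += (left + y[i + 1] + 2) >> 2
    return y


def lifting_53_inverse(coefficients):
    """Undo lifting_53_forward by running the steps in reverse with opposite signs."""
    x = list(coefficients)
    n = len(x)
    for i in range(0, n, 2):
        left = x[i - 1] if i > 0 else x[i + 1]
        x[i] -= (left + x[i + 1] + 2) >> 2
    for i in range(1, n, 2):
        right = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] += (x[i - 1] + right) >> 1
    return x


signal = [3, 7, 1, 8, 2, 9, 4, 6]
assert lifting_53_inverse(lifting_53_forward(signal)) == signal
```

Because each step only modifies one parity of positions using the other parity, every step can be undone exactly, whatever rounding is applied.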
The paper is organized as follows:
Section 3 relates to the one-dimensional wavelet transform. Section 4 extends this approach to the two-dimensional wavelet transform.
In section 3.1, we briefly survey the data-dependencies and calculation schedule of the Lifting Scheme. In section 3.2, we show how recalculations can be avoided and memory minimized by storing only the minimal information between the calculation and extraction of successive parent-children trees. From this discussion, different memories with a broad range of functionalities will emerge. In sections 3.3, 3.4 and 3.5, we analyze the data transfers to/from these different memories and provide solutions to reduce the number of access cycles, as well as the number of address calculations. Section 3.6 provides the final architecture of the 1D Local Wavelet Transform, which is extended to 2D in section 4.
The 1D Local Wavelet Transform

The Lifting Scheme
In Sweldens' (1994, 1995, 1996) Lifting Scheme: 1) the arithmetic complexity is reduced by up to a factor 2 for large wavelet filters, according to the method in Sweldens (1996).
2) lossless compression is enabled by a rounding operation in the intermediate stage of figure 2b.
Since the memory size optimization problem depends on the data-dependencies and not on the arithmetic processing itself, the advantage of using the structure of figure 2b instead of figure 2c is irrelevant with respect to the data transfer analysis. We therefore use, for convenience, the abstracted filtering structure of figure 2c. The input data samples required in the filtering process of figure 2c are stored in a so-called Filtering FIFO.
The Parent-children Tree extraction
in their corresponding data-dependency cones, i.e. the region between L1 1 and L1 2 for tree T1 and the region between L2 1 and L2 2 for tree T2. The difference between the input regions I1 and I2 represents the additional $2^L$ input samples to switch over from tree T1 to tree T2. The intersection of input regions I1 and I2 represents the input data that has been used for the creation of tree T1 and that is reused in the creation of tree T2. Obviously this data must be memorized (see the box
In-place organization of the Overlap and Tree Memories
Overlap and Tree Memories access schedule
During the execution of the filtering processes between the data dependency lines L1 and L2 (see figure 4), one Lowpass or Highpass value is created for each new Lowpass value in the previous level, by the filtering structure of figure 2c. For simplicity, the Highpass and Lowpass filtering processes are decoupled (as would be the case in classical wavelet filtering), in contrast with the merged filtering structure of figure 2b. This modification has no essential impact on the discussion, but simplifies the forthcoming analysis.
Before starting the first filtering process in any block at any level of the wavelet transform, the According to the FIFO memory discussion of section 3.3, the memories at L1 and L2 are physically identical, leading to the structure of figure 5b.
Dual-port memories for increasing the memory bandwidth
An obvious measure for reducing the number of memory accesses (thus increasing the memory bandwidth) is the introduction of registers (with a high VLSI area cost) or dual-port RAM (with a more acceptable VLSI area cost, especially for the 2D Local Wavelet Transform that will be discussed in section 4), enabling simultaneous read/write memory accesses in one clock cycle.
Since address calculation can also be a time consuming operation, only one address calculation per read/write operation should be favored. This is obviously not possible with the structure of figure 6a, in which not only the two addresses involved in the read-write operation are different, but also their relative difference increases with the wavelet level to be processed, leading to a possible time-consuming address calculation. For instance, to calculate Highpass sample E in level 1, the Lowpass value 4 of level 0 (input) is pushed into the Filtering FIFO. After performing the filter operation, sample E is written back, overwriting sample 2 (see the downward bold arrow from E to 2), having another address than sample 4.
Skewing the data dependency lines in such a way that the read and write addresses during one filtering operation become equal, introduces the following advantages: 
3.6 The global architecture
Figure 8 shows the architecture for the 1D Local Wavelet Transform. We recognize three main loops. The filtering operations are performed in loop A, i.e. input data that is written in the IPM is filtered through the Lowpass/Highpass filter and written back into the IPM. Each filtering on a block of input data must be prepared (respectively finalized) by downloading (uploading) data from the Overlap Memory OM to the Filtering FIFO (and vice versa), through loop B. Finally, the data corresponding to a parent-children tree is extracted from the Inter-Pass Memory IPM and Tree Memory TM, through C1, C2 and C3. As explained in section 3.2, only part of the current parent-children tree is available in the IPM; the other part has been previously stored into the Tree Memory TM. Furthermore, the IPM already contains part of the next parent-children tree(s). Thus, when tree T2 (see figure 4) has been calculated in the IPM, the IPM already contains part of tree T3 and the Tree Memory TM contains the first part of tree T2 (created during the calculation of tree T1, one time slot earlier). Figure 9 shows the processing stages through time. In order to minimize the cycle count, concurrency has been exploited between the different modules. Moreover, dual-port memories, satisfying the structure of figure 7d, are used to ensure simultaneous in-place read/write operations.
Suppose that tree T1 has been calculated and that tree T2 should now be processed, i.e. input samples I of figure 4 are loaded. In figure 9a, four main functionalities are then obtained:
-The IPM is filled with input data through R
-The filter FIFO is prepared with the Overlap Memory data through loop B
-Part of tree T2 that has already been calculated during the creation of T1 at the previous time slot is transferred to the Tree Memory TM, through C1
-Tree T1 is output through C2 (data that is overwritten by the previous action) and C3 (data that is overwritten by the new input data)
Registers D1 and D2 are used to resolve the problem of in-place read-write operations in dual-port RAM, as explained in section 3.5, while D3 is used to synchronize the data passing through C2 and C3.
Notice that all the data transfers along path R+C1..C3 are performed by a so-called memory-push mechanism, i.e. new input data pushes out data that was available in the Inter-Pass Memory IPM.
The data that is "thrown out" of the IPM pushes out data from the Tree Memory TM, which is transferred to the output.
partially to the output, using the approach of figure 9a, during which also the next input data block is prepared for processing.
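The memory-push mechanism can be pictured as a chain of fixed-size memories in which every write evicts one stored value. The sketch below is a deliberately simplified model and not the architecture of figure 8: it assumes a single chain from input through IPM and TM to the output, ignores the separate paths C2 and C3 and the registers D1 to D3, and uses hypothetical class and function names.

```python
from collections import deque


class PushMemory:
    """Fixed-size memory in which each pushed sample evicts the oldest one."""

    def __init__(self, initial_contents):
        self.cells = deque(initial_contents)

    def push(self, sample):
        evicted = self.cells.popleft()
        self.cells.append(sample)
        return evicted


def push_block(ipm, tm, new_samples):
    """Push new input into the IPM; evicted IPM data pushes TM data to the output."""
    output = []
    for sample in new_samples:
        from_ipm = ipm.push(sample)   # new input pushes data out of the IPM
        from_tm = tm.push(from_ipm)   # that data pushes data out of the TM
        output.append(from_tm)        # TM data goes to the output
    return output


ipm = PushMemory(["ipm0", "ipm1"])
tm = PushMemory(["tm0", "tm1"])
print(push_block(ipm, tm, ["in0", "in1"]))
```

In this model one input sample triggers one transfer at every stage of the chain, which is why the stages can proceed concurrently without separate read and write phases.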
The global cycle cost
To enable a continuous input-output data flow, double buffering (ping-pong buffers) should be added to the input-output ports of the architecture of figure 8, leading to the structure of figure 10.
Since each buffer contains only $2^L$ samples, a negligible memory size cost is introduced.
A sufficiently high internal clock should be provided to cope with the number of internal processing cycles, due to:
1) The preparation of the Filtering FIFO with Overlap Memory data (see figure 9c).
2) The total number of filtering operations, which is twice as high as the number of input samples.
Let R be the input-output sample rate and α.R the internal processing rate. Since the $2^L$ samples of the Inter-Pass Memory IPM have to be accessed twice during the processing and since 2M-1 cycles are needed for each Filtering FIFO update, corresponding to figure 9c, the total number C of cycles during the internal processing is given by:
$$C = 2 \cdot 2^L + L\,(2M-1)$$
These C cycles can be performed during the same time interval as required for reading $2^L$ external input values at sample rate R, as long as:
$$\frac{C}{2^L} \leq \alpha$$
yielding the constraint that the internal processing should be performed at a clock rate α times higher than the sample rate R, with α:
$$\alpha \geq 2 + \frac{L\,(2M-1)}{2^L}$$
For a 4-level (L=4), 9/7-tap wavelet transform (M=4), α should be at least 3.75. For α=4, the situation of figure 10 is reached: a 40 MHz internal clock can process data at an input-output sample rate of 10 MHz, which provides an average of 4 cycles of processing per input sample.
The interested reader should also notice that in practical implementations, the number of cycles required to initialize the Filtering FIFO in figure 9a (path B in figure 10) is not larger than the number of cycles to read the $2^L$ input samples from the input buffer to the Inter-Pass Memory IPM, at the internal clock rate (path I in figure 10), as long as $2^L \geq 2M-1$. For a 3-level wavelet transform (L=3), using a 9/7-tap wavelet filter (M=4), this constraint is barely satisfied. For a higher number of levels or smaller wavelet filters, the constraint does not present any problem.
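The cycle budget of this section can be reproduced numerically. The sketch below rests on an assumption inferred from the text rather than stated in it: the $2^L$ IPM samples are accessed twice, and one Filtering FIFO update of 2M-1 cycles is counted per level. This count reproduces the value 3.75 quoted for L=4 and M=4. The function names are illustrative.

```python
def total_cycles(L, M):
    """Internal cycles per input block: two IPM passes plus one FIFO update per level.

    The one-update-per-level count is an assumption of this sketch.
    """
    return 2 * 2 ** L + L * (2 * M - 1)


def min_alpha(L, M):
    """Smallest ratio between the internal clock and the sample rate R."""
    return total_cycles(L, M) / 2 ** L


def fifo_init_is_hidden(L, M):
    """True if the FIFO initialization fits within the time to load 2^L input samples."""
    return 2 ** L >= 2 * M - 1


for levels in range(2, 7):
    print(levels, min_alpha(levels, 4), fifo_init_is_hidden(levels, 4))
```

With M=4, the required ratio drops from 4.625 at L=3 to 3.75 at L=4 and keeps decreasing for deeper transforms, because the fixed FIFO update cost is spread over a larger input block.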
Although it seems contrary to common sense, the reduction of the internal clock with a larger number of levels can be explained as follows. When the number of levels increases, the footprint of the parent-children trees in the input image becomes larger, leading to a smaller number of larger input blocks. This automatically decreases the number of times that the Filtering FIFO registers of figure 5a must be reloaded, resulting in a smaller total number of cycles.
design, especially using dual-port RAM modules or registers. Therefore, the Overlap Memory OM and Tree Memory TM are split into (i) an on-chip dual-port cache memory, with a size within an order of magnitude of one input data block (or equivalently, one parent-children tree) and (ii) an off-chip single-port RAM module containing data that must be stored for later processing (this single-port memory has the size given in table 1). If the input blocks are read along successive horizontal bands, the positions where the Overlap and Tree Memories are used relative to the image are shown in figure 11b.
As shown in the timing evolution of figure 12, these cache memories transfer data to/from the off-chip memories during the filtering operations in loop A, i.e. during the idle cycles of the Overlap Memory OM and Tree Memory TM in the equivalent 1D Local Wavelet Transform architecture of figure 8 and figure 9. In figure 12a, new input data is read and the parent-children tree, finalized during the previous execution of figure 12e, is output. During this input-output data transfer, the Filtering FIFO is prepared for the first time for this new input data, using the Overlap Memory data that has been previously read from external memory into the module, along figure 12e. After the input data has been read and the Filtering FIFO prepared (figure 12a), the first filtering loop can be performed (figure 12b). Meanwhile, the internal Tree Memory cache is prepared for holding the data corresponding to the current tree to extract, for which the calculations have been initiated. All filtering operations are performed by repetitive convolutions (figure 12d) and Filtering FIFO preparations (figure 12c). Before actually outputting the new parent-children tree (figure 12a), the Overlap Memory cache is prepared for the next input data block (figure 12e). The full cycle can restart at figure 12a. Notice that the external memory accesses are well separated over different processing stages (figures 12a, 12b and 12e), keeping low access rates to external memories, which is also beneficial for low-power applications.
Using an approach analogous to that of section 3.7, we can show that when clustering $P^2$ parent-children trees into one data block, the internal processing clock should be α times larger than the input-output data rate, with:
For a 3-level (L=3), 9/7-tap wavelet transform and P=1, this would result in α ≥ 6.1. In Lafruit (1996), we theoretically show that an optimal data processing rate is achieved when increasing P to 4, which corresponds to reducing the constraint to α ≥ 3.54. Therefore, by creating clusters of 16 parent-children trees (P=4) instead of individual ones (P=1), the proposed architecture is able to process 10 Msamples/s with an internal clock of 40 MHz in a very modest 0.7 µm CMOS process.
Conclusion
We have proposed a memory-efficient architecture for the implementation of the 1D and 2D Local Wavelet Transform, i.e. the wavelet transform that extracts parent-children trees as soon as made possible by the data-dependencies in the wavelet transform. Due to the spatially localized processing, the Local WT supports Region Of Interest functionalities. Practical implementations of the 1D and 2D Local Wavelet Transform with up to 9/7-tap wavelet filters and a large number of levels, can process a sample rate of 10 MHz, with an internal processing clock of 40 MHz.
The dual-port memory organization of section 3.5, as well as the memory push mechanism described in section 3.6, have the disadvantage of a pipelined, continuous data transfer between separate memory modules, possibly not achieving the minimal overall power dissipation. Merging small dual-port memories into one physical internal memory module and/or splitting them into single-port memories for power optimization is thus a topic of future research.
Overlap and Tree Memories are introduced to avoid recalculations and for storing the minimal required information to switch over from tree T1 to tree T2.
Tables
