In this paper, a novel reconfigurable discrete wavelet transform architecture is proposed to meet the diverse computing requirements of advanced multimedia systems. The proposed architecture mainly consists of a reconfigurable processing element array and a reconfigurable address generator, and it is dynamically reconfigurable: the wavelet filter kernels and wavelet decomposition structures can be reconfigured at run-time with little overhead. The lifting-based reconfigurable processing element array is computationally more efficient than convolution-based architectures, and a systematic design method is provided to generate its hardware configurations for different wavelet filter kernels. The reconfigurable address generator handles flexible address generation for data I/O access in different wavelet decomposition structures. A prototype chip has been fabricated in the TSMC 0.35 μm 1P4M CMOS process; at 50 MHz it achieves a transform throughput of up to 100 Mpixel/sec, showing it to be a universal and extremely flexible computing engine for advanced multimedia systems.
INTRODUCTION
Discrete Wavelet Transform (DWT) [1] has been widely used in many multimedia applications, including multimedia coding and signal processing. Recently, emerging multimedia standards such as JPEG2000 still image coding and MPEG-4 still texture coding have also adopted DWT as their transform coder. The computations of DWT can be divided into two parts. One is the wavelet filter operation, which performs the signal analysis and subsampling. The other is the wavelet decomposition operation, which recursively decomposes the signal according to a specific decomposition structure. These two computational parts combine flexibly, so that DWT can decompose a signal into different subbands with well-defined time-frequency characteristics.
0-7803-7795-8/03/$17.00 © 2003 IEEE
In advanced multimedia systems, such as portable "multimedia everything" devices or home entertainment centers, the computing requirements are quite diverse. Many kernel tools such as DWT, Motion Estimation (ME), and Discrete Cosine Transform (DCT) should be integrated into these systems with flexible functionality to support rich multimedia applications. For instance, a universal and extremely flexible DWT computing engine that supports various wavelet filter kernels and wavelet decomposition structures would become a necessity for advanced multimedia systems.
In the literature, there have been many proposals devoted to the hardware architecture of DWT. However, these proposals are usually based on a fixed wavelet filter kernel and/or a fixed wavelet decomposition structure. No existing architecture is flexible enough to meet the diverse computing requirements of advanced multimedia systems. This motivates us to investigate a reconfigurable discrete wavelet transform architecture that can be dynamically reconfigured for various wavelet filter kernels and wavelet decomposition structures. In the remainder of this paper, the proposed architecture is overviewed in section 2. In section 3, the reconfigurable DWT processing element array is presented, and the reconfigurable address generator is detailed in section 4. The chip implementation results are given in section 5, and a brief summary in section 6 concludes this paper.
RECONFIGURABLE DWT ARCHITECTURE

Discrete Wavelet Transform
As mentioned in section 1, one of the computational parts of DWT is the wavelet filter operation, which is a two-channel filter bank as shown in Fig. 1 and Fig. 2. In the DWT analysis, the original signal is first processed by two analysis filters, low pass and high pass, and then subsampled to produce the low pass and high pass coefficients. In the DWT synthesis, the low pass and high pass coefficients are first upsampled and then processed by two synthesis filters to reconstruct the signal. This basic operation is called the one-level DWT decomposition (reconstruction). For multi-resolution analysis (synthesis), multi-level DWT decomposition (reconstruction) is performed.
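The one-level analysis and synthesis described above can be sketched in software as follows. This is only an illustration: it uses the Haar filter pair because it is the shortest two-channel filter bank, not because the paper uses it, and the function names `analyze` and `synthesize` are hypothetical.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling (illustrative choice)

def analyze(signal):
    """One-level analysis: low pass / high pass filtering fused with 2:1 subsampling."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    low = [SCALE * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    high = [SCALE * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return low, high

def synthesize(low, high):
    """One-level synthesis: upsampling fused with the two synthesis filters."""
    out = []
    for l, h in zip(low, high):
        out.append(SCALE * (l + h))  # even-indexed sample
        out.append(SCALE * (l - h))  # odd-indexed sample
    return out

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
low, high = analyze(x)
y = synthesize(low, high)
```

Each subband holds half as many coefficients as the input, and `synthesize` recovers the input up to floating-point rounding.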
The multi-level DWT decomposition, the other computational part of DWT, is very flexible: depending on the characteristics of the original signal, a specific wavelet decomposition structure can be chosen to obtain the best-suited multi-resolution analysis. Among all possible decomposition structures, the dyadic decomposition shown in Fig. 3 is the most common because of its regular and recursive structure. In the dyadic decomposition, the low pass output coefficients of the previous level are treated as the input signal of the current level, forming a recursive chain. Many other decomposition structures are possible beyond the dyadic one, although they may be more irregular. Taking 2-D image signals as examples, Fig. 4 shows the 3-level dyadic decomposition of the test image Lena, and Fig. 5 shows the wavelet packet transform of the test image Barbara, where the wavelet filter kernel and wavelet decomposition structure are chosen according to the image characteristics to achieve the best coding efficiency.
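The difference between decomposition structures can be made concrete with a small 1-D sketch. It contrasts the dyadic structure (only the low pass band is split again) with a full wavelet packet structure (every band is split again), which is just one of the many possible packet structures. The unnormalized Haar split and the names `haar_split` and `decompose` are assumptions made for illustration.

```python
def haar_split(band):
    """Split an even-length band into (low, high) using average and difference."""
    low = [(band[i] + band[i + 1]) / 2.0 for i in range(0, len(band), 2)]
    high = [(band[i] - band[i + 1]) / 2.0 for i in range(0, len(band), 2)]
    return low, high

def decompose(signal, levels, split_high):
    """Return {subband name: coefficients}.

    split_high=False gives the dyadic structure, True gives a full packet tree.
    """
    bands = {"": list(signal)}
    for _ in range(levels):
        next_bands = {}
        for name, data in bands.items():
            is_lowest = set(name) <= {"L"}  # "", "L", "LL", ... is the low pass chain
            if is_lowest or split_high:
                low, high = haar_split(data)
                next_bands[name + "L"] = low
                next_bands[name + "H"] = high
            else:
                next_bands[name] = data  # band is left untouched at this level
        bands = next_bands
    return bands

x = [float(v) for v in range(16)]
dyadic = decompose(x, 3, split_high=False)
packet = decompose(x, 3, split_high=True)
```

For a 16-sample input and three levels, the dyadic structure yields four subbands of unequal length, while the full packet structure yields eight subbands of two coefficients each.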
Proposed Reconfigurable DWT Architecture
In order to support various wavelet filter kernels and wavelet decomposition structures in a single architecture, a dynamically reconfigurable DWT architecture is proposed as shown in Fig. 6. The proposed architecture is a general and scalable computational model, and its computational resources can be scaled according to the target application specification. A virtual external frame memory is required to buffer the data signal under processing, and the Input Unit and Output Unit depicted in Fig. 6 act as the interface between the reconfigurable architecture and this frame memory. In a multimedia system-on-chip (SOC), this virtual external frame memory can be implemented by a shared system memory or by a local frame memory tightly attached to the reconfigurable architecture.
In addition to the I/O Units, the proposed architecture mainly consists of two functional blocks: the reconfigurable processing element array and the reconfigurable address generator. The reconfigurable processing element array, depicted as Reconfigurable DWT PE Array in Fig. 6, is responsible for the wavelet filter operation and is composed of a 1-D linear array of reconfigurable DWT processing elements (PE). The reconfigurable DWT PE is based on the more computationally efficient lifting scheme rather than the conventional convolution approach. In addition, a systematic design method is used to derive the reconfigurable DWT PE architecture and to generate the corresponding hardware configurations for different wavelet filter kernels. The hardware configurations of the Reconfigurable DWT PE Array are stored in the PE Context Memory, where the PLA part stores several default configurations and the RAM part stores user-programmable configurations.
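The reconfigurable address generator, depicted as Reconfigurable WPT AG in Fig. 6, handles flexible address generation for data I/O access in different wavelet decomposition structures.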
RECONFIGURABLE DWT PE ARRAY
Before detailing the reconfigurable architecture, the lifting scheme and a systematic design method for deriving an efficient hardware architecture of the 1-D lifting-based DWT are discussed in subsections 3.1 and 3.2, respectively.
Lifting Scheme
The lifting scheme is a new method for constructing wavelets entirely in the spatial domain [2]. Constructing wavelets by lifting has many advantages: it allows a faster and fully in-place implementation of the wavelet transform, the inverse transform follows immediately, boundary extension is easy to manage, and a wavelet-like transform that maps integers to integers can be defined. According to [3], any DWT with finite filters can be decomposed into a finite sequence of simple filtering steps, called lifting steps. This decomposition corresponds to a factorization of the polyphase matrix of the target wavelet filter into a sequence of alternating upper and lower triangular matrices and a constant diagonal matrix.
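The factorization itself is not reproduced in this text. As a hedged reconstruction, the form usually given in the lifting literature is shown here. The symbols $P(z)$ (polyphase matrix), $s_i(z)$ and $t_i(z)$ (Laurent polynomials of the lifting steps), $m$ (number of step pairs) and $K$ (scaling constant) follow the common convention and are assumptions rather than notation taken from this paper.

```latex
P(z) \;=\; \prod_{i=1}^{m}
\begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix}
\;\begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}
```

Each triangular factor is one lifting step, so inverting the transform only requires applying the same steps in reverse order with opposite signs and inverting the diagonal scaling.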
Systematic Design Method
In [4], a systematic design method for deriving the hardware architecture of the 1-D lifting-based DWT is presented. With this method, an efficient 1-D lifting-based DWT architecture based on a systolic array can be easily constructed. As shown in Fig. 7, the design method consists of several design stages: specific lifting factorization, dependence graph formation, systolic array mapping, and optional pipelining. Once a finite DWT filter is chosen, these four design stages can be performed to construct the corresponding hardware architecture. The resulting hardware architecture consists of several serially connected basic computing units. The three possible structures of the basic computing unit are shown in Fig. 8, and the number of basic computing units for a chosen DWT filter depends on the number of lifting steps after the specific lifting factorization. For instance, there are four basic computing units in the (9,7) odd symmetric biorthogonal filter and three basic computing units in the (9,3) odd symmetric biorthogonal filter.
Based on the possible structures of the basic computing unit in the previous subsection, the core cell of the reconfigurable DWT PE, called the MCU, is derived as shown in Fig. 9. This core cell is a three-input (A, B, C), one-output (D) datapath consisting of one adder/subtractor, one multiplier with coefficient a, and another adder. The datapath can be dynamically reconfigured as any one of the three possible structures of the basic computing unit.
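To illustrate how serially connected computing units realize a filter, the sketch below chains two lifting steps to form a (5,3) transform. Several points are assumptions, since Fig. 8 and Fig. 9 are not available here: the MCU wiring is read as D = B + a * (A ± C), the lifting coefficients -1/2 and 1/4 are the commonly used (5,3) factorization rather than values taken from this paper, symmetric boundary extension is used, and the final scaling is omitted. All function names are hypothetical.

```python
def mcu(a_in, b_in, c_in, coeff, subtract=False):
    """Assumed MCU datapath: D = B + coeff * (A +/- C)."""
    inner = a_in - c_in if subtract else a_in + c_in
    return b_in + coeff * inner

def predict(even, odd, coeff):
    """Lifting step on odd samples, using the two neighboring even samples."""
    n = len(odd)
    return [mcu(even[i], odd[i], even[min(i + 1, n - 1)], coeff) for i in range(n)]

def update(even, odd, coeff):
    """Lifting step on even samples, using the two neighboring odd samples."""
    return [mcu(odd[max(i - 1, 0)], even[i], odd[i], coeff) for i in range(len(even))]

def forward_53(x):
    """Two chained lifting steps; x must have even length."""
    even, odd = x[0::2], x[1::2]
    high = predict(even, odd, -0.5)
    low = update(even, high, 0.25)
    return low, high

def inverse_53(low, high):
    """Undo the steps in reverse order with opposite coefficient signs."""
    even = update(low, high, -0.25)
    odd = predict(even, high, 0.5)
    x = [0.0] * (len(even) + len(odd))
    x[0::2], x[1::2] = even, odd
    return x

signal = [3.0, 7.0, 1.0, 8.0, 2.0, 9.0, 4.0, 6.0]
low, high = forward_53(signal)
restored = inverse_53(low, high)
flat_low, flat_high = forward_53([5.0] * 8)
```

A filter with four lifting steps, such as the (9,7) filter mentioned above, would chain four such steps in the same way.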
The Reconfigurable DWT PE Array is composed of a 1-D linear array of several reconfigurable DWT PEs, and the number of PEs is scalable according to the target application specification. As mentioned in subsection 3.2, the number of basic computing units varies with the DWT filter, so a systolic array folding technique can be used to fold a variable number of basic computing units onto a fixed number of MCUs with variable throughput. For instance, the (9,7) filter originally requires four basic computing units; after a fold-by-2 operation, the required number of MCUs becomes two while the throughput becomes one half. The folding technique induces a feedback loop from the output to the input, so feedback registers are necessary to buffer the feedback signal. Together with the lifting registers and pipeline registers between MCUs, the reconfigurable DWT PE architecture is derived as shown in Fig. 10. In Fig. 10, delay chain 0 contains the feedback registers, delay chains 1 and 2 contain the lifting registers and pipeline registers, the MCU represents the core cell in Fig. 9, the Mux selects suitable input data from the three delay chains, and the FSM receives the configuration signal from the PE Context Memory and decodes the necessary hardware configurations for the MCU and Mux.
Owing to the regularity and modularity of the reconfigurable DWT PE architecture, several PEs can be cascaded serially to form a 1-D linear array as the Reconfigurable DWT PE Array. By adding one more design stage, folding of the systolic array, to the original systematic design method, a modified systematic design method for generating the hardware configurations of the Reconfigurable DWT PE Array is obtained, as shown in Fig. 11. With this design method, any finite DWT filter can be mapped onto a Reconfigurable DWT PE Array with a specific number of PEs through the generated hardware configurations.
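The trade-off introduced by folding can be expressed as a small calculation. This is a sketch under stated assumptions: the fold factor is taken as the number of lifting steps divided by the number of MCUs, rounded up, and the unfolded array is assumed to accept two samples (one even, one odd) per clock cycle. The second assumption is inferred from the reported figures rather than stated in the text, and the helper name `fold` is hypothetical.

```python
import math

def fold(num_lifting_steps, num_mcu, unfolded_samples_per_cycle=2.0):
    """Return (fold_factor, samples_per_cycle) for a filter mapped onto num_mcu MCUs.

    unfolded_samples_per_cycle is an assumed property of the unfolded array.
    """
    if num_lifting_steps < 1 or num_mcu < 1:
        raise ValueError("step and MCU counts must be positive")
    factor = math.ceil(num_lifting_steps / num_mcu)
    return factor, unfolded_samples_per_cycle / factor

# Four computing units on two MCUs: fold by 2, throughput is halved.
factor_97, rate_97 = fold(4, 2)
# Two computing units on two MCUs: no folding is needed.
factor_small, rate_small = fold(2, 2)
```

The first call reproduces the example in the text, where the (9,7) filter is folded by 2 and its throughput becomes one half of the unfolded case.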
RECONFIGURABLE WPT AG
Compared to the architecture of the Reconfigurable DWT PE Array, the architecture of the Reconfigurable WPT AG is much simpler and more straightforward. As shown in Fig. 12, there are two address generators in the architecture. One is the output address generator, which generates the row or column address for the Output Unit as the write address to the external frame memory. The other is the input address generator, which generates the row or column address for the Input Unit as the read address to the external frame memory. The start time slots of the four FSMs, the initial values of the four counters, and the select signals of the two Muxes are controlled by the configuration signal from the AG Context Memory for a specific wavelet packet transform.
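As a loose software analogy of this address generation, the sketch below produces linear frame-memory addresses for scanning a rectangular subband either row by row or column by column. The mapping to the hardware is an assumption: the block origin stands in for the initial counter values, and the `by_column` flag stands in for a Mux select. The function name, the parameters, and the row-major memory layout are all hypothetical.

```python
def scan_addresses(frame_width, x0, y0, width, height, by_column=False):
    """Yield row-major linear addresses of a width x height block at (x0, y0).

    by_column=False scans along rows; by_column=True scans along columns.
    """
    outer, inner = (width, height) if by_column else (height, width)
    for i in range(outer):
        for j in range(inner):
            row, col = (j, i) if by_column else (i, j)
            yield (y0 + row) * frame_width + (x0 + col)

row_scan = list(scan_addresses(8, 0, 0, 2, 2))
col_scan = list(scan_addresses(8, 0, 0, 2, 2, by_column=True))
sub_block = list(scan_addresses(8, 4, 4, 2, 2))
```

Selecting a different origin and block size per pass is what would let one generator serve both dyadic and packet structures, since each structure only differs in which subbands are revisited.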
CHIP IMPLEMENTATION RESULTS
In order to prove the feasibility of the proposed reconfigurable architecture, a prototype chip has been fabricated in the TSMC 0.35 μm 1P4M CMOS process. Two reconfigurable DWT PEs form the Reconfigurable DWT PE Array, and several useful wavelet filter kernels and wavelet decomposition structures are stored in the PLA as default configurations. The performance of this prototype architecture is listed in Table I, including the wavelet filter kernels, number of lifting steps, throughput per clock cycle, and corresponding hardware utilization. At 50 MHz, the prototype chip achieves a transform throughput of up to 100 Mpixel/sec (for the (5,3) filter), which corresponds to processing CCIR 601 (720x576) format images at 30 frame/sec with a four-level wavelet packet transform. The photograph of the prototype chip is shown in Fig. 13.
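The correspondence between the peak throughput and the CCIR 601 workload can be checked with simple arithmetic. The accounting used here is an assumption, not something the text spells out: every level of a full wavelet packet transform is taken to reprocess all samples of the frame, once in a row pass and once in a column pass, and only one 720x576 component is counted.

```python
def required_rate(width, height, fps, levels, passes_per_level=2):
    """Samples per second that must be filtered under the assumed accounting."""
    return width * height * fps * levels * passes_per_level

rate = required_rate(720, 576, 30, 4)  # four-level packet transform at 30 frame/sec
peak = 100_000_000                     # reported peak: 100 Mpixel/sec at 50 MHz
headroom = peak - rate
```

Under these assumptions the workload is 99,532,800 samples per second, just below the reported peak of 100 Mpixel/sec.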
CONCLUSION
We have proposed a novel reconfigurable DWT architecture to meet the diverse computing requirements of advanced multimedia systems. The proposed architecture is dynamically reconfigurable in terms of the wavelet filter kernels and wavelet decomposition structures. The lifting-based Reconfigurable DWT PE Array is computationally more efficient than convolution-based architectures, and a systematic design method is provided to generate its hardware configurations for different wavelet filter kernels. The Reconfigurable WPT AG handles flexible address generation for data I/O access in different wavelet decomposition structures. A prototype chip has been fabricated with high performance, showing the proposed architecture to be a universal and extremely flexible computing engine for advanced multimedia systems.
