A lifting-based 1-D Discrete Wavelet Transform (DWT) core is proposed. It is re-configurable for the 5/3 and 9/7 filters in JPEG2000. A folded architecture is adopted to reduce the hardware cost and achieve higher hardware utilization. Multiplication is realized with hardwired multipliers whose coefficients are represented in canonic signed-digit (CSD) form. It is a compact and efficient DWT core for the hardware implementation of a JPEG2000 encoder.
INTRODUCTION
There has been a long history of the development of the wavelet transform [1]. After the demonstration of the fingerprinting standard, a co-operation between the FBI in the US and NIST, the use of wavelet technology as the transform core for image processing has gained considerable interest. The discrete wavelet transform is now adopted as the transform coder in both JPEG2000 [2] still image coding and MPEG-4 [3] still texture coding. In this paper, we mainly focus on the design of the 1-D DWT core for JPEG2000.
JPEG2000 is the emerging next-generation still image compression standard. Part one (the core) of JPEG2000 is to be delivered and agreed as a full ISO International Standard by the end of the year 2000. With the inherent features of the wavelet transform, it provides multi-resolution functionality and better compression performance at very low bit-rates compared with the DCT-based JPEG [4] standard. To provide efficient lossy and lossless compression within a single coding architecture, two wavelet transform kernels are provided in part one of JPEG2000. The 5/3 reversible and 9/7 irreversible filters are chosen for lossless and lossy compression, respectively. A compact architecture for both 5/3 and 9/7 filter operation is, therefore, necessary for this unified hardware implementation.
A number of DWT architectures based on the classical implementation have been proposed in the literature [4]. As the newly proposed lifting scheme [5-7] for the computation of the DWT has lower computational complexity than the classical implementation, we propose a folded architecture of a 1-D DWT core based on the lifting scheme. It is re-configurable for the 5/3 and 9/7 filters for the efficient implementation of a JPEG2000 encoder.
Fig. 1 shows the classical implementation and the lifting-based implementation of the DWT. The classical implementation is realized by the convolution of the input signal with the low-pass filter (h0) and the high-pass filter (h1).
The convolution kernels of the 5/3 and 9/7 filters [9] in JPEG2000 are given in Table I and Table II. Both of them are linear-phase (symmetrical) filters. The lifting scheme is an alternative approach for the computation of the discrete wavelet transform. The block diagram in Fig. 1(b) depicts the three steps of the lifting scheme. It begins with a trivial wavelet, the "lazy wavelet", in the split phase, which splits the data into two smaller subsets, even and odd. Then, in the second phase, the even samples multiplied by the prediction operator are used to predict the odd samples. The difference between the odd sample and the prediction value is the detail coefficient (d_i). In the third phase, the even samples are updated with the detail coefficients to obtain the smooth coefficient (s_i). More algorithm details can be found in the original papers on the lifting scheme [5-7]. The direct mapping of the lifting scheme to the hardware architecture is depicted in Fig. 2. Fig. 2(a) is the mapping for the 5/3 filter, and Fig. 2(b) is for the 9/7 filter. There is only one stage (one predict and one update) for the 5/3 filter, but there are two stages for the 9/7 filter.
This paper is organized as follows. In Section 2, the lifting scheme algorithm is described and compared with the classical implementation. The proposed 1-D DWT architecture is depicted in Section 3, and the 2-D DWT architecture based on the 1-D DWT core is also discussed. Finally, a conclusion is given in Section 4.
LIFTING SCHEME
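The split, predict, and update phases can be illustrated with a small software model. The following is a minimal sketch, not the hardware described in this paper: the function names are hypothetical, the 9/7 constants are the commonly published lifting values rather than the finite-precision ones of Table IV, the signal length is assumed to be even, the rounding that makes the 5/3 filter reversible is left out, and the final 9/7 scaling step is omitted.

```python
# Commonly published 9/7 lifting constants (an assumption of this sketch;
# the paper's own finite-precision coefficients are not reproduced here).
ALPHA, BETA = -1.586134342, -0.05298011854
GAMMA, DELTA = 0.8829110762, 0.4435068522


def lifting_stage(even, odd, p, u):
    """One predict/update pair with symmetric extension at both ends.

    predict: d[i] = odd[i]  + p * (even[i] + even[i + 1])
    update:  s[i] = even[i] + u * (d[i - 1] + d[i])
    """
    n = len(odd)
    # At the right edge the missing even neighbour is mirrored (even[n] -> even[n-1]).
    d = [odd[i] + p * (even[i] + even[min(i + 1, n - 1)]) for i in range(n)]
    # At the left edge the missing detail neighbour is mirrored (d[-1] -> d[0]).
    s = [even[i] + u * (d[max(i - 1, 0)] + d[i]) for i in range(n)]
    return s, d


def dwt53(x):
    """5/3 filter: a single stage (rounding omitted in this sketch)."""
    return lifting_stage(x[0::2], x[1::2], -0.5, 0.25)


def dwt97(x):
    """9/7 filter: the same stage applied twice (final scaling omitted)."""
    s, d = lifting_stage(x[0::2], x[1::2], ALPHA, BETA)
    return lifting_stage(s, d, GAMMA, DELTA)


if __name__ == "__main__":
    smooth, detail = dwt53([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    print(smooth, detail)
```

Because `dwt97` simply calls `lifting_stage` twice with different coefficients, the model also shows why one physical predict/update stage can be reused for both filters.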

PROPOSED ARCHITECTURE
1-D DWT Architecture
To solve the problem of hardware inefficiency described in the preceding section, a folded re-configurable 1-D DWT core is proposed. The detailed architecture is shown in Fig. 4.
First, by using the similarities between the high-pass and low-pass filters, the lifting scheme has lower computational complexity than the traditional two-band subband transform scheme. The numbers of multiplications and additions needed for a two-point 5/3 and 9/7 1-D DWT by convolution and by the lifting scheme, respectively, are listed in Table III for comparison.
Second, the lifting scheme allows in-place computation of the wavelet transform. The original signal will not be used for further computation and, therefore, can be replaced with the calculated wavelet transform coefficients. Third, no explicit boundary extension is needed. The symmetric mirroring effect is achieved by a multiply-by-two operation at the proper boundary positions.
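The in-place property and the multiply-by-two boundary handling can be sketched for the 5/3 filter as follows. This is an illustrative software model under stated assumptions: the function name is hypothetical, the rounding follows the usual reversible 5/3 integer formulation (which this text does not spell out), and the signal length is assumed to be even.

```python
def lift53_inplace(x):
    """Reversible 5/3 lifting computed in place on a list of integers.

    After the call, even positions hold the smooth (low-pass) coefficients and
    odd positions hold the detail (high-pass) coefficients, interleaved.
    """
    n = len(x)  # assumed even and >= 2
    # Predict: every odd sample is replaced by its detail coefficient.
    for i in range(1, n, 2):
        if i + 1 < n:
            neighbours = x[i - 1] + x[i + 1]
        else:
            # Right boundary: the mirrored neighbour equals x[i - 1],
            # so the sum is simply twice that sample.
            neighbours = 2 * x[i - 1]
        x[i] -= neighbours >> 1  # floor division by 2
    # Update: every even sample is replaced by its smooth coefficient.
    for i in range(0, n, 2):
        if i > 0:
            neighbours = x[i - 1] + x[i + 1]
        else:
            # Left boundary: the mirrored detail equals x[1], doubled.
            neighbours = 2 * x[i + 1]
        x[i] += (neighbours + 2) >> 2  # floor((d_left + d_right + 2) / 4)
    return x


if __name__ == "__main__":
    print(lift53_inplace([1, 2, 3, 4, 5, 6, 7, 8]))
```

No extended copy of the signal is ever built; the only special cases are the two doubled neighbours at the ends, and the result overwrites the input buffer.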
It is feasible to calculate both the 5/3 and 9/7 filters using the architecture in Fig. 3. It was proposed in [10] and is redrawn here for illustration. The computation of the 5/3 filter can be done by alternating the coefficients needed for the 5/3 filter, and by taking the ...
It is assumed that only a memory with a single read port and a single write port is available, and that only a single-phase clock signal is used for the system. Data are read from memory one sample per cycle and written back one sample per cycle. In the split phase of the lifting scheme, the data are input into two shift registers, and two samples are read into the predict stage every other cycle. At the output, two output data are available every other cycle, and a parallel-to-serial circuit is added to satisfy the constraint of the single write port memory. This means that the input and output data rates of the DWT core are both one sample per clock cycle.
In the 9/7 filter mode, there are two stages of predict and update operations. Data after the first-stage computation are fed back (folded) to R1 in Fig. 5 for the second-stage computation.
The computations of the first stage and the second stage are interleaved, so the hardware utilization is 100%. In the 5/3 filter mode, no folded computing is necessary since there is only one stage of lifting-based operation for the 5/3 filter. Another difference is that the multiplication in the 5/3 filter is in fact only a shift-right operation. More specifically, since the filter coefficients are fixed for JPEG2000, the number of bits to be shifted right is a constant, and only hardwired shifting with sign-bit extension is necessary. The computation load in the 5/3 mode is much lower than in the 9/7 mode. Also, since no interleaved computation of two stages exists in the 5/3 mode, the computation time in the predict and update phases can be equivalently two clock periods. Therefore, the pipeline registers R2 and R3 in Fig. 3 can be bypassed in the 5/3 filter mode, with the effect that the latency is reduced without increasing the clock frequency. Fig. 6 illustrates the interleaving operation in the 9/7 filter mode. The delay registers are ignored here for ease of explanation.
Since this is a dedicated DWT core for JPEG2000, the filter coefficients are fixed, and the multiplications can therefore be further optimized. Hardwired multipliers are used instead of real multipliers to achieve a more compact design. The finite-precision coefficients are chosen to be within a reasonable error range. They are also represented in their CSD [11] form to reduce the number of nonzero digits. Fewer nonzero digits mean fewer adders. Table IV shows the four coefficients represented in 12-bit CSD form.
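How a fixed coefficient is turned into CSD digits can be sketched in a few lines. This is an illustrative sketch only: the helper names are hypothetical, the four constants are the commonly published 9/7 lifting values (assumed here to correspond to the four coefficients of Table IV, whose exact entries are not reproduced), and the choice of 10 fractional bits is an assumption made purely for the example.

```python
# Commonly published 9/7 lifting constants (assumed, not taken from Table IV).
COEFFS = {
    "alpha": -1.586134342,
    "beta": -0.05298011854,
    "gamma": 0.8829110762,
    "delta": 0.4435068522,
}


def to_csd(value, frac_bits):
    """Quantize `value` to `frac_bits` fractional bits and return its CSD
    digits, most significant first, each digit being -1, 0, or +1."""
    n = round(value * (1 << frac_bits))
    digits = []  # built least significant digit first
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1, n mod 4 == 3 -> digit -1.
            # This choice guarantees that the next digit is zero.
            digit = 2 - (n & 3)
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits[::-1]


def from_csd(digits, frac_bits):
    """Rebuild the quantized value from CSD digits (most significant first)."""
    total = sum(d << i for i, d in enumerate(reversed(digits)))
    return total / (1 << frac_bits)


if __name__ == "__main__":
    for name, value in COEFFS.items():
        digits = to_csd(value, 10)
        nonzero = sum(1 for d in digits if d != 0)
        print(name, digits, from_csd(digits, 10), "nonzero digits:", nonzero)
```

Each nonzero digit corresponds to one shifted copy of the input that must be added or subtracted, so the nonzero-digit count printed above is a rough proxy for the adder cost of a hardwired multiplier.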
2-D DWT Architecture
The computation style of the entropy coder after the DWT will affect the optimal scheduling of the 2-D DWT computation. Fig. 7 shows the simplified JPEG2000 functional block diagram. Embedded Block Coding with Optimized Truncation (EBCOT) [13] is a block-coding engine. Images after the DWT are decomposed into many sub-bands. Every sub-band is then partitioned into code-blocks. EBCOT processes these quantized wavelet coefficients code-block by code-block. After Tier-1 compression of EBCOT, every code-block generates a sub-bitstream. It is possible to start the EBCOT computation once a complete code-block of data is available.
Due to the in-place computing capability of the lifting scheme, the original samples can be replaced directly by the calculated coefficients. Hence, the original frame-size memory is enough. The advantage of this implementation is the ease of data flow control. Due to the interleaving characteristics of the output, i.e., one low-pass sample followed by one high-pass sample, the interleaved storage arrangement is illustrated by an example of a 4 x 4 image shown in Fig. 8. An address generator (AG) is needed to provide the proper access addresses to read samples for the next level of wavelet decomposition and then write them back. The block diagram of the JPEG2000 system is shown in Fig. 9. The frame memory is used for the storage of the data for the DWT, and also for the entropy-coded sub-bitstreams of each code-block after EBCOT.
Second, if a frame memory is not available or not allowed due to the constraint on the cost of the memory size, then the concept of line-based DWT [12] can be adopted. Since EBCOT is not line-based, the height of the line buffer will depend on the height of the code-block. The required buffer size for the DWT will be smaller than the frame memory. However, another memory space for the compressed sub-bitstreams of every code-block is necessary.
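The role of the address generator for the in-place, interleaved storage can be sketched as follows. This is a hypothetical model, not the AG of this design: it assumes row-major addressing, power-of-two strides, and that the low-pass sample of each output pair lands on the even position, so that after each level the samples to be decomposed next sit on a grid whose spacing doubles. The function name and parameters are illustrative.

```python
def level_addresses(width, height, level):
    """Row-major frame-memory addresses of the samples read (and written back)
    for decomposition level `level`, with level 0 being the original image."""
    step = 1 << level  # low-pass samples are 2**level apart after `level` passes
    return [
        row * width + col
        for row in range(0, height, step)
        for col in range(0, width, step)
    ]


if __name__ == "__main__":
    # 4 x 4 image: level 0 touches every sample, level 1 only the
    # low-low samples left at even rows and even columns.
    print(level_addresses(4, 4, 0))
    print(level_addresses(4, 4, 1))
```

With this arrangement no data movement is needed between levels; only the stride used to step through the frame memory changes.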
CONCLUSION
A re-configurable lifting-based 1-D DWT core is proposed in this paper. A folded architecture is adopted to reduce the hardware cost and to achieve higher hardware utilization. Multiplication is realized with hardwired multipliers whose coefficients are represented in CSD form. It is a compact and efficient DWT core for the hardware implementation of a JPEG2000 encoder. Future work will be the optimization of the scheduling and memory organization of the overall JPEG2000 system.
