Abstract-This paper proposes a VLSI architecture for a JPEG2000 encoder, which functionally consists of two parts: discrete wavelet transform (DWT) and embedded block coding with optimized truncation (EBCOT). For DWT, a spatial combinative lifting algorithm (SCLA)-based scheme with both 5/3 reversible and 9/7 irreversible filters is adopted to reduce the number of multiplications by 50% and 42%, respectively, compared with the conventional lifting-based implementation (LBI). For EBCOT, a dynamic memory control (DMC) strategy for Tier-1 encoding is adopted to reduce the on-chip wavelet coefficient storage by 60%, and a subband parallel-processing method is employed to speed up the EBCOT context formation (CF) process; an architecture for Tier-2 encoding is presented that reduces the on-chip bitstream buffering from full-tile size down to three-code-block size and largely eliminates the iterations of the rate-distortion (RD) truncation.
I. INTRODUCTION
JPEG2000 [1], [11], [12] is intended to create a new image coding system for different types of still images with different characteristics, allowing different image models, preferably within a unified system. It will provide a set of features vital to many high-end emerging applications by taking advantage of modern technologies.
JPEG2000 is composed of two major parts: wavelet transform and embedded block coding with optimized truncation (EBCOT) [2]. Fig. 1 shows the functional block diagram of JPEG2000. The wavelet transform is a subband transform that maps images from the spatial domain to the frequency domain. To achieve efficient lossy and lossless compression within a single coding architecture, two wavelet transform kernels are employed by ISO/IEC 15444-1 [1], [11], [12]. The 5/3 reversible and 9/7 irreversible filters are chosen for lossless and lossy compression, respectively. After the wavelet transform, the coefficients are scalar quantized if lossy compression is chosen. Afterwards, the coefficients are entropy encoded by EBCOT, which is a two-tier coding algorithm proposed by Taubman [2]. Each wavelet subband is divided into code blocks, and the Tier-1 coding engine encodes these code blocks into independent embedded bitstreams using context-based arithmetic encoding (AE). Finally, Tier-2 reorders the code-block bitstreams into the final JPEG2000 bitstream with the rate-distortion (RD) slope optimized property and the features specified by the user.
For the wavelet transform part, this paper proposes a novel architecture of 5-level Mallat [3] decomposition, two-dimensional (2-D) biorthogonal discrete wavelet transform (DWT) based on spatial combinative lifting algorithm (SCLA) [4] with both 5/3 and 9/7 filters. SCLA, first proposed by Meng in [4] , emerges from the lifting scheme where the combinative law of matrix multiplications on the 2-D DWT operator matrix is utilized to combine the vertical and horizontal operations. By utilizing the 5/3 reversible and 9/7 irreversible filters, SCLA substantially reduces the number of multiplications for the 2-D biorthogonal DWT by a ratio of 50% and 42% respectively compared with the conventional LBI [5] , [6] .
Since JPEG2000 recommends a push-pull model between the DWT and the EBCOT Tier-1 part, the Tier-1 engine has to preserve a large number of wavelet coefficients. To solve this problem, this paper proposes an efficient memory management strategy called dynamic memory control (DMC), which reduces the size of the on-chip wavelet coefficient memory by 60% and ensures full memory reusability. Moreover, a parallel architecture that processes each subband independently is presented to speed up the entire Tier-1 entropy encoding process.
For the EBCOT Tier-2 part, this paper proposes a novel rate control scheme to execute optimized truncation with the process of AE in parallel. A considerable reduction of computational costs can be achieved without iterative truncations compared with the popular implementations. In addition, the bitstream buffer scale can be reduced from full-tile size to code-block size simultaneously.
The rest of this paper is organized as follows. In Sections II, III, and IV, the VLSI architectures of the three parts are described and analyzed in detail. The experimental results and the performance are presented in Section V. Finally, a conclusion is given in Section VI.

The forward wavelet transform of the SCLA with decomposition level one using the 9/7 filter can be represented as (1), which relates the resulting coefficient vector to the input through the associated wavelet transform operator matrix and five constant matrices, one for each step of the SCLA. The expressions of these five matrices are given in [4].
The SCLA computing process starts from the center and extends to the two sides in (1). Fig. 2 (a) and (b) shows the corresponding SCLA processing operations on matrices and .
In Fig. 2 (a) and (b), the symbol means that the value of the current element does not change during the transform. The operations for the symbol include three steps as follows: 1) sum the four horizontal and vertical neighbors of the current element; 2) multiply the sum of step one by or ; 3) add the product of step two to the value of the current element and leave the resulting sum to the current element. The operations for the symbols and are similar to those of symbol . The only difference is that, in step one, only the two vertical or horizontal neighbors are summed. The operations on matrices and are similar to the ones for matrices and with replaced by and replaced by . Fig. 2(c) shows the SCLA operation for matrix , where the symbol means to multiply the value of the current element by and symbol means to multiply it by , where . The exact value of , , , , and can refer to [4] .
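The three-step update described above (sum the neighbors, scale the sum, accumulate into the current element) can be illustrated with a minimal Python sketch. The function name `lift_update`, its `mode` argument, and the coefficient value 0.5 used below are hypothetical and chosen only for illustration; the actual constants, and the positions to which each symbol applies, are those of Fig. 2 and [4]. The sketch assumes an interior element, so that all neighbors exist.

```python
def lift_update(x, row, col, coeff, mode="both"):
    """Apply one in-place neighbor-sum update to x[row][col].

    mode "both" sums the four horizontal and vertical neighbors,
    "vertical" sums only the two vertical neighbors, and
    "horizontal" sums only the two horizontal neighbors.
    """
    if mode not in ("both", "vertical", "horizontal"):
        raise ValueError("unknown mode: " + mode)
    total = 0.0
    # Step 1: sum the selected neighbors of the current element.
    if mode in ("both", "vertical"):
        total += x[row - 1][col] + x[row + 1][col]
    if mode in ("both", "horizontal"):
        total += x[row][col - 1] + x[row][col + 1]
    # Steps 2 and 3: scale the sum and add it to the current element.
    x[row][col] += coeff * total
    return x[row][col]


def make_block():
    return [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0],
            [7.0, 8.0, 9.0]]


# Illustrative coefficient only; it is not one of the SCLA constants.
center_both = lift_update(make_block(), 1, 1, coeff=0.5, mode="both")
center_vertical = lift_update(make_block(), 1, 1, coeff=0.5, mode="vertical")
```

Because each update touches only the current element and reads its immediate neighbors, the operation fits the small working regions used by the processing elements described later.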
2) Number of Multiplications of SCLA With 9/7 Filter: For (assume , is the natural number) image block with decomposition level J , the total number of multiplications by the LBI is and by the SCLA is . The ratio of the SCLA to LBI is 58%. The comparison is listed in Table I .
The analysis of SCLA with 5/3 filter is similar to the 9/7 case. Its multiplication comparison is listed in Table II . The ratio of the SCLA to LBI is 50%. 
B. Proposed Architecture of SCLA-Based DWT
The proposed SCLA-based DWT architecture reads in an image tile serially, performs the SCLA-based DWT, and directly outputs the LH, HL, and HH wavelet coefficients at each level of the decomposition to the following EBCOT Tier-1 part. At the same time, the LL coefficients are written back to the processor in preparation for the next level of decomposition. A line-based transform [7] is utilized to guarantee that the tile is read line by line only once. The proposed architecture is shown in Fig. 3; it consists of three major blocks: the DWT filter, the input and output register buffers, and the on-chip line-buffer memories.
1) Organization of the On-Chip Memory:
The computing process of LBI with 9/7 filter is illustrated in Fig. 4 and represented in (2)- (7), in which is the line-based raw input data, , , , and stand for the temporary data in each processing step, and and stand for the target wavelet coefficients. Because (7) can be calculated in place, it is omitted from the following analysis.
In Fig. 4, in order to calculate the target wavelet coefficients, there are two feasible methods to reserve the raw input data and temporary data in the line-buffer memories:
1) If the two target coefficient lines are processed simultaneously, only two lines of the raw input data and four lines of the temporary data need to be reserved;
2) If they are processed in sequence, (3)-(6) can each be split into two steps as depicted in (8)-(11). Thus, only one line of the raw input data and four lines of the temporary data need to be reserved.
According to the above analysis, the LBI scheme needs at least five on-chip line-buffer memories. Since SCLA is derived from LBI, its memory organization is the same as that of LBI. However, access to such a memory organization is inefficient and the corresponding computations are complicated. The numbers of multiplications for different numbers of line-buffer memories are compared in Table III (assuming that the tile size is with one decomposition level), indicating that an organization of six line-buffer memories provides the best tradeoff.
The six line-buffer memories are denoted as Line0-Line5 in Fig. 5 for a decomposition level of five, where stands for the image tile width. The Line0-Line3 buffer memories are used to store intermediate data for all five levels. The Line4 and Line5 buffer memories are each split into a pair of segments, denoted as Line4_Level4, Line4_LL and Line5_Level4, Line5_LL, respectively, to satisfy the access timing constraints. Line4_Level4 and Line5_Level4 are used for storing the raw image data of the level4 decomposition, whereas Line4_LL and Line5_LL are utilized to store the LL output of the previous level of decomposition for level0-level3. Access to these memories is line-based so that the positions of the stored data can be easily located.
2) SCLA-Based DWT Filter Control: The DWT filter consists of five processing elements, as shown in Fig. 3 . Four elements named as processing elements (PE) marked from to are used for the multiplications of the matrices to in (1), each of which has a 3 × 3 working region. These four elements are completely identical except for their locations in the filter. The other one is used for multiplications of matrix in (1), covering a 2 × 2 working region.
The numbering of the 3 × 3 working region for PE , , , or is given in Fig. 6(a) , in which "X" represents the matrices , , or and block "x" stores the old value of . Assume , , or , the operation steps on the 3 × 3 working region are depicted in (12). The numbering of the 2 × 2 working region for PE is given in Fig. 6(b) , whose operations are just to multiply the current element by a constant parameter or and leave the result in the current location; hence, it is easy to process matrix in place.
When the processor starts to work, it first reads six lines of data into the input register buffer from the six line-buffer memories at a rate of 1 column (including 6 data) per clock tick. After that, the input data is pushed into the DWT filter. The DWT filter applies the SCLA algorithm mentioned above to the data in the pipelined processing elements , , , , and , and simultaneously pops six lines of results into the output register buffer at the same rate of 1 column per clock tick. Four lines of the results are intermediate data, which are written back into the Line0-Line3 buffer memories for the next step of the current level of decomposition. The remaining two lines consist of the LL, HL and LH, HH wavelet coefficients; the LL coefficients are written back into Line4_LL or Line5_LL in preparation for the next level of decomposition, while the LH, HL, and HH coefficients are output directly. At the same time, the raw image data is pushed into Line4_Level4 or Line5_Level4 from outside. This operation is repeatedly pipelined until the last wavelet coefficient of the current tile is processed and output from the output register buffer.
3) Implementation Precision of the Wavelet Coefficient and the Constants: Several initial tests were made to determine the wavelet coefficient precision with the constants precision fixed at 32 bits. For different compression ratios, the resulting peak signal-to-noise ratios (PSNRs) are listed in Table IV using the "Lena" image bits . From the results in Table IV , the finite precision of the DWT coefficient was chosen to be 17 bits. Several similar tests were then made with this coefficient precision to determine the precision of the constants , , , , and . The corresponding PSNR results are listed in Table V which determines the finite precision of the constants to be 13 bits.
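The effect of a 13-bit constant precision can be illustrated with a short Python sketch that quantizes the standard 9/7 lifting constants to a signed fixed-point format and reports the rounding error. Several points are assumptions rather than statements from this paper: the constants below are the well-known 9/7 lifting values, whereas the exact SCLA constants are those given in [4]; the split of the 13 bits into one sign bit, one integer bit, and 11 fractional bits is hypothetical; and the helper names `to_fixed` and `from_fixed` are illustrative.

```python
# Well-known 9/7 lifting constants (assumed here as stand-ins for
# the SCLA constants, whose exact values are given in [4]).
LIFTING_97 = {
    "alpha": -1.586134342,
    "beta": -0.052980118,
    "gamma": 0.882911075,
    "delta": 0.443506852,
    "K": 1.230174105,
}

TOTAL_BITS = 13
FRAC_BITS = 11  # assumption: 1 sign bit + 1 integer bit + 11 fractional bits


def to_fixed(value, total_bits=TOTAL_BITS, frac_bits=FRAC_BITS):
    """Round value to a signed fixed-point integer of total_bits bits."""
    raw = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    if not lowest <= raw <= highest:
        raise OverflowError("value does not fit in the chosen format")
    return raw


def from_fixed(raw, frac_bits=FRAC_BITS):
    """Convert a fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)


quantized = {name: to_fixed(value) for name, value in LIFTING_97.items()}
errors = {name: abs(from_fixed(quantized[name]) - value)
          for name, value in LIFTING_97.items()}

for name in LIFTING_97:
    print(name, quantized[name], errors[name])
```

With round-to-nearest, the error of each constant is bounded by half of the least significant bit, i.e., 2^-12 under the assumed format; how this error propagates to the PSNR is what Table V measures experimentally.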
III. ARCHITECTURE OF EBCOT TIER-1 ENTROPY ENCODING
EBCOT Tier-1 uses the DWT-generated subband samples for further processing. Typically, Mallat decomposition is used as the basic decomposition rule, in which all levels contain three subbands except the coarsest decomposition level, level0. The subbands LH, HH, and HL form three series with increasing level index, each of which is called an orientation. The operations on these orientations are almost the same, so three identical encoding cores can be integrated to perform the entire encoding task in parallel, thereby improving the overall encoder performance. The LL subband of level0 can be attached to any of these three orientations without any degradation in coding efficiency. The EBCOT Tier-1 subband parallel architecture is illustrated in Fig. 7; it contains four main functional parts: cleanup pass (CL), significance propagation pass (SP), magnitude refinement pass (MR), and AE.
According to the Tier-1 algorithm, the code block is the minimum unit in which the original wavelet coefficients are compressed. Inside each code block, the key concept of the "fractional bit-plane" is employed in order to obtain a fine embedding: a given quantized bit-plane is separated into three coding passes, i.e., SP, MR, and CL. In total, four different operations are involved in these coding passes, and they form the foundation of the embedded block coding strategy. For example, if a sample is not yet significant in the current bit-plane, a combination of zero coding (ZC) and run-length coding (RLC) is used to record whether or not the sample becomes significant in this bit-plane; if it does become significant, sign coding (SC) is invoked to encode the sign of the sample. If the sample is already significant, magnitude refinement (MR) coding is used to code the current bit, refining the sample to a finer precision. During these operations, the sample bits along with their contexts are delivered to the following AE for further compression.
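How the samples of one bit-plane are distributed over the three passes can be sketched in Python as follows. This is a simplified software model under the usual EBCOT membership rules (a sample belongs to SP if it is insignificant but has a significant neighbor, to MR if it was already significant before this bit-plane, and to CL otherwise); it only assigns samples to passes and does not form contexts or drive the AE. The function names and the data layout (lists of lists of magnitudes and significance flags) are hypothetical and are not taken from the proposed hardware.

```python
def stripe_scan(height, width):
    """Yield (row, col) in stripe order: 4-row stripes, column by column."""
    for top in range(0, height, 4):
        for col in range(width):
            for row in range(top, min(top + 4, height)):
                yield row, col


def has_significant_neighbor(sig, row, col):
    """Return True if any of the eight neighbors is significant."""
    height, width = len(sig), len(sig[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr or dc) and 0 <= r < height and 0 <= c < width and sig[r][c]:
                return True
    return False


def classify_passes(magnitudes, sig, plane):
    """Assign each sample of bit-plane `plane` to SP, MR, or CL.

    sig holds the significance state before this bit-plane; the updated
    state after the bit-plane is returned as the second result.
    """
    height, width = len(magnitudes), len(magnitudes[0])
    sig_now = [row[:] for row in sig]
    passes = [[None] * width for _ in range(height)]
    # Significance propagation: significance found here spreads to
    # samples visited later in the same scan.
    for r, c in stripe_scan(height, width):
        if not sig_now[r][c] and has_significant_neighbor(sig_now, r, c):
            passes[r][c] = "SP"
            if (magnitudes[r][c] >> plane) & 1:
                sig_now[r][c] = True
    # Remaining samples: refinement if previously significant, else cleanup.
    for r, c in stripe_scan(height, width):
        if passes[r][c] is None:
            if sig[r][c]:
                passes[r][c] = "MR"
            else:
                passes[r][c] = "CL"
                if (magnitudes[r][c] >> plane) & 1:
                    sig_now[r][c] = True
    return passes, sig_now


# One row of four samples; only the first is significant before plane 1.
example_passes, example_sig = classify_passes(
    [[5, 2, 0, 0]], [[True, False, False, False]], plane=1)
```

In this small example the second sample becomes significant during SP, which in turn places the third sample in SP as well, while the fourth falls through to CL.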
A. Dynamic Memory Control
JPEG2000 recommends a line-based DWT push-pull model in the analysis filter bank, in which DWT produces wavelet coefficient lines and pushes them into the Tier-1 encoder, one line at a time. Although this model minimizes the memory needed in transform part, all the responsibilities of reserving and manipulating the coefficients have to be undertaken by the Tier-1 part.
Since the coefficients in the DWT line buffers must be transferred into an on-chip wavelet coefficient memory located between the DWT and EBCOT Tier-1 part and preserved there until the code block which they belong to is processed by the encoder, there are lots of difficulties in coefficient locating because sample lines in different wavelet decomposition levels have different lengths. Moreover, the reuse of on-chip wavelet coefficient memory is not easy. Motivated by these problems, a DMC strategy is proposed to arrange the access sequence to the coefficient memory. Under the DMC scheme, the coefficient memory is divided into blocks with fixed size, the same as the maximum code-block size. These blocks are called dynamic memory blocks (DMBs), which are the minimum units that can be reused. A DMB can only be used by one code block at a time. Even if the wavelet coefficients may not occupy a whole DMB (for instance, the code block in subband may be much smaller), the remaining area has to be reserved and cannot be occupied by others. Each DMB has several flags to indicate its current status. Fig. 8 shows the idea of the DMC scheme, in which each box represents a DMB in one of the three decomposition orientations with a specific status.
As shown in Fig. 8, a DMB can be in one of four statuses:
S1: full, under processing; occupied by EBCOT Tier-1;
S2: full, waiting for processing; in queue;
S3: not full, data buffering; occupied by DWT;
S4: empty; not in use.
It is obvious that at most three DMBs can reside in either S1 or S3 status, each of which belongs to a decomposition orientation. Since the number of the blocks in status S2 only depends on the difference between the processing capability of DWT and EBCOT Tier-1, it is easy to make a tradeoff between these two parts to minimize the scale and make a full reuse of the on-chip coefficient memory. 
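The DMB life cycle implied by the four statuses can be modeled with a small Python sketch. It is a behavioral illustration only, under the assumptions that a DMB moves through the statuses in the order S4, S3, S2, S1, S4 and that queued blocks of one orientation are encoded in first-in first-out order; the class name `DmbPool`, its method names, and the pool size used below are hypothetical.

```python
from collections import deque
from enum import Enum


class Status(Enum):
    S1_ENCODING = 1  # full, under processing by EBCOT Tier-1
    S2_QUEUED = 2    # full, waiting for processing
    S3_FILLING = 3   # not full, being filled by the DWT
    S4_EMPTY = 4     # not in use


class DmbPool:
    def __init__(self, num_blocks, orientations=("LH", "HL", "HH")):
        self.status = [Status.S4_EMPTY] * num_blocks
        self.filling = {o: None for o in orientations}
        self.encoding = {o: None for o in orientations}
        self.queued = {o: deque() for o in orientations}

    def start_fill(self, orientation):
        """The DWT claims an empty DMB for one orientation (S4 -> S3)."""
        if self.filling[orientation] is not None:
            raise RuntimeError("orientation already has a DMB in S3")
        if Status.S4_EMPTY not in self.status:
            raise RuntimeError("no empty DMB; the DWT would have to stall")
        index = self.status.index(Status.S4_EMPTY)
        self.status[index] = Status.S3_FILLING
        self.filling[orientation] = index
        return index

    def finish_fill(self, orientation):
        """The code block is complete and joins the queue (S3 -> S2)."""
        index = self.filling[orientation]
        if index is None:
            raise RuntimeError("no DMB is being filled for this orientation")
        self.status[index] = Status.S2_QUEUED
        self.filling[orientation] = None
        self.queued[orientation].append(index)
        return index

    def start_encode(self, orientation):
        """Tier-1 takes the oldest queued DMB (S2 -> S1), if it is idle."""
        if self.encoding[orientation] is not None or not self.queued[orientation]:
            return None
        index = self.queued[orientation].popleft()
        self.status[index] = Status.S1_ENCODING
        self.encoding[orientation] = index
        return index

    def finish_encode(self, orientation):
        """The DMB is released for reuse by any orientation (S1 -> S4)."""
        index = self.encoding[orientation]
        if index is None:
            raise RuntimeError("no DMB is being encoded for this orientation")
        self.status[index] = Status.S4_EMPTY
        self.encoding[orientation] = None
        return index


pool = DmbPool(num_blocks=4)
first = pool.start_fill("LH")
pool.finish_fill("LH")
second = pool.start_fill("LH")
encoded = pool.start_encode("LH")
pool.finish_encode("LH")
reused = pool.start_fill("HL")
```

In this model the number of DMBs in S2 grows only when the DWT delivers code blocks faster than Tier-1 consumes them, which is the tradeoff that determines the required memory size.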
B. Block Encoder
The block encoder, consisting of one context buffer, three coding passes, and one block-encoding controller, is the key functional unit in the Tier-1 part. Since each orientation contains only one block-encoding engine, there are in total three DMBs under processing at one time. Coefficients in DMBs are scanned during block encoding from the most significant bit-plane to the least significant one, one bit-plane at a time. Inside a bit-plane, coefficient scanning begins at the top-left corner: the first four bits of the first column are coded, as shown in Fig. 9, and then the four bits of the next column. During the scanning, the statuses of the coefficients are modified according to a certain rule. These statuses are recorded in a context buffer, which is the same size as the DMB. Each context in the buffer has 16 bits, according to the JPEG2000 standard.
When coding a coefficient, the sample itself and up to eight neighbors are needed. In Fig. 9, if the four coefficients in a sample row are coded, four coefficients and 32 contexts must be read, which requires at least 32 read cycles and possibly many write cycles, depending on the coding results. Clearly, many of these cycles are redundant because neighbor contexts may be read and written more than once.
To solve this problem, a 64-bit-wide data bus is selected for both the DMB and the context buffer based on this scanning pattern, so that a sample row can be read simultaneously and its neighbors no longer need to be read or written multiple times. Although 36 contexts are stored in registers, as shown in Fig. 10, only three groups (12 contexts) need to be read in at run time. Because this method pre-reads the contexts of the next row, it not only greatly decreases the memory access rate but also increases throughput. After all the passes in a DMB are coded, all the context buffers are cleared to zero in preparation for encoding a new code block.
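The bus width follows directly from the figures above: four 16-bit contexts occupy exactly 64 bits, so one access delivers a whole group. A minimal Python sketch of such packing is given below; the bit ordering (first context in the least significant 16 bits) and the function names are assumptions made for illustration, since the actual word layout of the hardware is not specified here.

```python
CONTEXT_BITS = 16
CONTEXTS_PER_WORD = 4  # 4 x 16 bits = one 64-bit bus word
CONTEXT_MASK = (1 << CONTEXT_BITS) - 1


def pack_contexts(contexts):
    """Pack four 16-bit contexts into one 64-bit word."""
    if len(contexts) != CONTEXTS_PER_WORD:
        raise ValueError("exactly four contexts are required")
    word = 0
    for position, context in enumerate(contexts):
        if not 0 <= context <= CONTEXT_MASK:
            raise ValueError("context does not fit in 16 bits")
        # Assumed layout: context 0 occupies the lowest 16 bits.
        word |= context << (position * CONTEXT_BITS)
    return word


def unpack_contexts(word):
    """Split a 64-bit word back into its four 16-bit contexts."""
    return [(word >> (position * CONTEXT_BITS)) & CONTEXT_MASK
            for position in range(CONTEXTS_PER_WORD)]


group = [0x0001, 0x8000, 0x1234, 0xFFFF]
bus_word = pack_contexts(group)
restored = unpack_contexts(bus_word)
```

Reading three such words per step corresponds to the three groups (12 contexts) mentioned above.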
C. Arithmetic Encoder
The final step of Tier-1 encoding is the context-based binary AE. As mentioned earlier, binary decisions and their context labels are generated during the preceding bit-plane scanning and then provided to the AE as inputs. Rather than on-chip memory, 19 registers in total, of 7 bits each, are used in this AE implementation to represent the EBCOT contexts for faster access. The lowest bit indicates the most probable symbol (MPS) and the other 6 bits represent the index into the probability estimation table. These indices are used for accessing the probability estimation ROM, which is composed of look-up tables (LUTs). The two-time table-indexing pattern is shown in Fig. 11.
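The 7-bit register layout described above (lowest bit for the MPS, upper 6 bits for the table index) can be sketched in Python as follows. The helper names are hypothetical, the register file is reset to zero only as a placeholder (the standard prescribes specific initial indices for some contexts, which are not modeled here), and the index value 12 is arbitrary.

```python
NUM_CONTEXTS = 19
INDEX_BITS = 6


def pack_state(index, mps):
    """Build a 7-bit context register: [6-bit table index | MPS bit]."""
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index must fit in 6 bits")
    if mps not in (0, 1):
        raise ValueError("mps must be 0 or 1")
    return (index << 1) | mps


def unpack_state(register):
    """Return (table index, MPS) stored in a 7-bit context register."""
    return register >> 1, register & 1


# Placeholder reset values; not the initial states of the standard.
registers = [pack_state(0, 0)] * NUM_CONTEXTS

# Arbitrary illustrative update of one context register.
registers[5] = pack_state(index=12, mps=1)
index, mps = unpack_state(registers[5])
```

Keeping the MPS in the lowest bit means that the table index is obtained with a single shift, which is convenient when the index addresses the probability estimation ROM.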
One of the fundamental advantages of EBCOT is optimized truncation; therefore, the truncated length must be computed carefully so that all symbols up to the truncation point can be decoded correctly. In this implementation, the truncated length of each pass is counted in the data output module and then provided to the RD slope converter for further rate control logic. To ensure accurate decoding using this directly counted length, the parallel termination mode [1], [11], [12] is selected, in which every coding pass is flushed in the AE; this makes the truncation length easy to obtain and adds little complexity to the hardware implementation. The termination pattern is shown in Fig. 12.
IV. ARCHITECTURE OF EBCOT TIER-2 RATE DISTORTION TRUNCATION
Although a conventional RD optimization strategy such as the one proposed by Li [9] attains fairly good performance, it suffers from high computational costs, since the coefficient bit modeling and AE process must be completed before rate control processing starts; this demands a large on-chip buffer that stores the whole image tile in order to locate the appropriate set of truncation points. Iterative computations are needed under such a scheme as well. Current popular JPEG2000 implementations employ one of two methods to meet the rate control requirement: 1) using quantization of the coefficients instead of optimized truncation for rate control; 2) leaving the optimized truncation to the micro-control unit (MCU). Both methods sacrifice flexibility, and the second one also imposes a heavy burden on system throughput and cost. Motivated by this problem, a novel rate control architecture, which executes optimized truncation in parallel with the AE process, is devised. This architecture first stores the code stream and the code-block information overheads of a code block in separate buffers, then estimates the RD slope for each truncation point and selects the monotonically decreasing subset. When all the RD slope metrics are available, the optimal truncation point for the current block can be easily determined. Referring to the information buffer to get the truncated block length, the architecture accomplishes rate control by simply shifting the buffer address to truncate the block stream. As a result, a considerable reduction of computational costs is achieved by avoiding iterative truncations. At the same time, the buffer size is reduced from full-tile size to code-block size.
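The selection of the monotonically decreasing slope subset and the subsequent truncation can be illustrated with a minimal Python sketch. It follows the textbook convex-hull formulation of RD-optimized truncation rather than the hardware datapath of this paper, which estimates the slopes in parallel with the AE. The function names, the explicit distortion values, and the slope thresholds are hypothetical; cumulative lengths are assumed to be strictly increasing, and index 0 stands for the empty bitstream.

```python
def feasible_truncation_points(lengths, distortions):
    """Return [(pass_index, slope)] with strictly decreasing slopes.

    lengths[i] and distortions[i] are the cumulative length and the
    remaining distortion after i coding passes (i = 0: nothing coded).
    """
    hull = [(0, float("inf"))]
    for k in range(1, len(lengths)):
        while True:
            j, slope_j = hull[-1]
            delta_dist = distortions[j] - distortions[k]
            if delta_dist <= 0:
                break  # pass k brings no gain over j: not a candidate
            slope = delta_dist / (lengths[k] - lengths[j])
            if slope >= slope_j:
                hull.pop()  # point j lies under the hull: discard it
                continue
            hull.append((k, slope))
            break
    return hull[1:]


def truncated_length(points, lengths, slope_threshold):
    """Length kept when all points with slope >= threshold are included."""
    kept = 0
    for index, slope in points:
        if slope >= slope_threshold:
            kept = index
    return lengths[kept]


# Illustrative numbers only: three coding passes of one code block.
lengths = [0, 10, 20, 30]
distortions = [100, 60, 50, 20]
points = feasible_truncation_points(lengths, distortions)
```

Once the feasible points are known, truncation reduces to looking up one length, which is consistent with the address-shifting approach described above: no second pass over the code stream is needed.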
The proposed architecture for the Tier-2 RD truncation is shown in Fig. 13 , which consists of four main functional parts: code truncation, info truncation, buffer arbiter, and packetization.
The data needed to construct a JPEG2000 code stream can be divided into two categories: code and info. Code means the bytes generated by the Tier-1 entropy encoder, whereas info stands for the code-block information necessary for the decoder. Info consists of three parts: the zero bit-plane, the pass number, and the cumulative length for each pass in the code block. The RD slope for each pass is also necessary for optimized truncation. It is worth noting that only the truncated block length is necessary, not all the cumulative lengths. This characteristic enables some simplification of the implementation.
In this implementation, separate handlers for code and info simplify and clarify the architecture and minimize memory-locating efforts. The proposed method can be described as follows:
1) When block coding starts, info handler gets zero_bit_plane and pass_number.
2) For each code_byte, code handler writes it into buffer; address increases by 1.
3) When a pass finishes, rd_slope gives out the pass_length and info handler writes it into the buffer; address increases by 1.

Compared with the JPEG2000 Verification Model software architecture, this method provides the same RD performance by choosing the same truncation sets. Moreover, parallel-optimized truncation needs a much smaller buffer and does not have to search the entire code stream. Table VI provides numerical results to illustrate the performance of the proposed architecture under a variety of conditions. Results are presented for the well-known USC images, "Lena" and "Barbara", as well as one popular image from the JPEG2000 test suite, "woman", which is substantially more complex and less blurred than the USC images.
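The three handler steps listed above can be modeled with a short Python sketch that keeps code bytes and pass information in separate buffers. The class name `CodeBlockRecorder` and its methods are hypothetical; in particular, the sketch derives the cumulative pass length from the code-buffer address, whereas in the proposed architecture the length is supplied by the rd_slope module, and the `truncate` method is an illustrative addition that mimics truncation by address shifting.

```python
class CodeBlockRecorder:
    def __init__(self, zero_bit_plane, pass_number):
        # Step 1: the info handler receives the block-level information.
        self.zero_bit_plane = zero_bit_plane
        self.pass_number = pass_number
        self.code_buffer = bytearray()  # bytes from the Tier-1 encoder
        self.info_buffer = []           # cumulative length per pass

    def write_code_byte(self, code_byte):
        # Step 2: one code byte is stored; the write address grows by 1.
        self.code_buffer.append(code_byte)

    def end_pass(self):
        # Step 3: the length at the end of the pass is stored as info
        # (assumption: taken from the current code-buffer address).
        self.info_buffer.append(len(self.code_buffer))

    def truncate(self, kept_passes):
        """Return the code bytes of the first kept_passes passes."""
        if not 0 <= kept_passes <= len(self.info_buffer):
            raise ValueError("invalid number of passes")
        length = self.info_buffer[kept_passes - 1] if kept_passes else 0
        return bytes(self.code_buffer[:length])


recorder = CodeBlockRecorder(zero_bit_plane=2, pass_number=3)
for pass_bytes in (b"\x10\x11\x12", b"\x20\x21", b"\x30"):
    for code_byte in pass_bytes:
        recorder.write_code_byte(code_byte)
    recorder.end_pass()

kept = recorder.truncate(kept_passes=2)
```

Because the info buffer already holds the length at every pass boundary, selecting a truncation point requires no scan of the code buffer itself.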
V. PERFORMANCE

A. Chip Implementation of SCLA-Based DWT
The SCLA-based DWT processor was fabricated in Dongbu 0.25-µm 1P4M standard CMOS technology; 25 k logic gates plus 93 kbits of on-chip SRAM were integrated in a 2.8 mm × 2.8 mm die area, with 0.8-mW/MHz power consumption and a 150-MHz maximum processing frequency. This processor is implemented with both 5/3 reversible and 9/7 irreversible filters and the maximum tile resolution supported is bits. The throughput of the DWT processor can reach Mbits MHz s , i.e., under a 100-MHz system clock, this processor can transform 60 frames per second with image resolution of bits. This chip has already passed the printed circuit board (PCB)-based verification. Fig. 14 shows the chip microphotograph.
B. The Experimental Results and FPGA Verification of EBCOT and JPEG2000 Encoder
Implemented in Dongbu 0.25-µm 1P4M standard CMOS technology, the EBCOT Tier-1 encoding part comprises 110 k logic gates plus 400 kbits of on-chip SRAM, of which 388 kbits hold wavelet coefficients and 12 kbits hold contexts. Without the DMC scheme, at least 1 Mbit of coefficients would have to be reserved; therefore, the DMC scheme reduces the size of the on-chip wavelet coefficient memory by about 60%. The Tier-2 RD truncation part comprises 20 k logic gates plus 24 kbits of on-chip SRAM. The throughput of this EBCOT processor can reach Mbits MHz s . The proposed EBCOT architecture was combined with the chip-verified SCLA-based DWT part to form a full JPEG2000 encoder, which has already passed the FPGA-based verification and will be fabricated as a chip this year. The estimated scale and other important information of this encoder are listed in Table VII.
Unlike other implementations that use the quantization method, such as [10], the proposed architecture adopts the truncation method in order to achieve three targets:
1) better compression quality, since truncation based on the RD slope is finer than traditional quantization;
2) accurate bit-rate control, whose accuracy (95%) is higher than that of Yamauchi [10] (80%);
3) support for the multilayer SNR-scalable code-stream feature, which is impossible with the quantization method.
These three targets are achieved at the cost of additional computation, which impedes high throughput; for example, the throughput of this encoder Mbits MHz s is a bit lower than that of Yamauchi [10] Mbits MHz s . However, compared with the quantization method, which reduces the entropy of the wavelet coefficients before AE, the implicit quantization adopted by this architecture leaves much more detail and room for truncation to control the quality.
VI. CONCLUSION
A VLSI architecture of a full JPEG2000 encoder is proposed, which functionally consists of two major parts: SCLA-based DWT and EBCOT. The SCLA-based 2-D biorthogonal DWT, implemented with both 5/3 reversible and 9/7 irreversible filters, reduces the number of multiplications by 50% and 42%, respectively, compared with the conventional LBI. The DWT core has already been fabricated as a chip, which is the first chip implementation of SCLA in the world. The EBCOT Tier-1 part employs two functional schemes: the DMC scheme reduces the on-chip wavelet coefficient storage by 60%, and the subband parallel-processing scheme greatly shortens the entropy encoding time. The Tier-2 part employs a novel architecture that reduces the on-chip bitstream buffering from full-tile size down to three-code-block size and largely eliminates the iterations of the truncation. This EBCOT core has already been connected to the chip-verified DWT core to form a full JPEG2000 encoder, which has passed the FPGA-based verification. The proposed JPEG2000 encoder is fully compatible with ISO/IEC 15444-1. It can be widely used in applications such as next-generation digital cameras and broadband PDAs.
