Abstract-In this paper, a scalable architecture for the Embedded Block Coding (EBC) in JPEG 2000, the bit-plane parallel EBC, is proposed. We provide an analysis and a unified design methodology for the bit-plane parallel EBC architecture. Designing the bit-plane parallel EBC involves three critical difficulties, and three algorithms are proposed to overcome them. With the proposed algorithms, the external bandwidth of the EBC is reduced by 55% on average, and the throughput of a 4-bit-plane parallel EBC is 1.5 times higher than that of a word-level EBC with 10 bit-planes in parallel at a bit rate of 1 bit per pixel.

I. INTRODUCTION

JPEG 2000 [1] is well known for its excellent coding efficiency and rich functionalities, such as scalability, region of interest, and error resilience. JPEG 2000 adopts the Discrete Wavelet Transform (DWT) as the transformation algorithm, and the Embedded Block Coding with Optimized Truncation (EBCOT) [2] as the entropy coding algorithm. With these new coding tools, the quality of JPEG 2000 outperforms JPEG by 2 dB in Peak Signal-to-Noise Ratio (PSNR) at the same bit rate.

The complexity of JPEG 2000 is much higher than that of JPEG. In a JPEG 2000 coding system, the EBC occupies 53% of the total computation [3]. Therefore, hardware implementation of the EBC is a must for real-time applications. Many EBC architectures [3]-[7] have been proposed. All of them are bit-plane sequential architectures, which encode a code-block bit-plane by bit-plane. Besides, all of them require an on-chip SRAM to store state variables. To implement a JPEG 2000 coding system with these architectures, an on-chip code-block memory is required, since the DWT is a word-level processing algorithm while the EBC is a bit-level one. The code-block memory is used to avoid loading coefficients multiple times from the external tile memory. However, the code-block memory occupies a large silicon area. To solve this problem, a word-level EBC architecture [8] was proposed to encode one DWT coefficient per cycle regardless of bit-width. Therefore, the code-block memory is eliminated and the throughput of the EBC is increased.

Although the word-level EBC architecture achieves low internal memory and high throughput, its area efficiency decreases in lossy coding. As shown in Fig. 1, the average number of effective bit-planes is less than half of the bit-width of a DWT coefficient when the bit rate is smaller than 2 (compression ratio > 4). The effective bit-planes are the bit-planes that remain un-truncated after rate control. The rate control in JPEG 2000 is a post-compression rate-distortion optimization algorithm. The computational power of the EBC is wasted since the [...] rate. Therefore, the area efficiency of the word-level architecture is decreased, since more than half of the bit-planes are finally truncated but are still encoded by more than half of the processing elements. For example, if an EBC architecture capable of processing only 4 bit-planes at the same time is used, as shown in Fig. 1, the throughput of this architecture is the same as that of the word-level architecture, while its area efficiency is higher than that of the word-level architecture.

In this paper, an analysis of the scalable architecture for the EBC is presented. We provide a unified design methodology for designing a bit-plane parallel EBC architecture. The bit-plane parallel EBC is the generalization of the EBC architectures from two bit-planes in parallel to all bit-planes of a word in parallel. Designers can choose the best-fit number of bit-planes when designing an EBC architecture for different applications, i.e., for different ranges of target bit rate. For an operational range of target bit rate, an EBC capable of processing the effective number of bit-planes not only has higher area efficiency but also increases the processing throughput. The experimental results show that the throughput of a 4-bit-plane parallel EBC is 1.5 times higher than that of a word-level EBC with 10 bit-planes in parallel at 1 bit per pixel (bpp).

The paper is organized as follows. An overview of the EBC algorithm is given in Sec. II. The difficulties of designing a bit-plane parallel architecture are discussed in Sec. III. The proposed algorithms, which overcome these difficulties, are shown in Sec. IV. The architecture of the bit-plane parallel EBC is shown in Sec. V.

[...] the EBC is called the embedded bit stream and is passed to tier-2 for rate control. Given a target bit rate, tier-2 truncates the embedded bit streams to minimize the overall distortion. The EBC algorithm is elaborated as follows.

The basic coding unit of the EBC is a code-block with typical size [...]

The rate control in JPEG 2000 is a Post-compression Rate-Distortion Optimization (Post-RDO) algorithm. All of the coding passes in a code-block must be losslessly encoded regardless of the target bit rate. A truncation point, which truncates a code-block at a certain pass of a bit-plane, cannot be obtained before coding. Therefore, the bit-planes that are finally truncated by the Post-RDO are still processed by the EBC. The wasted computation power dramatically decreases the throughput of the EBC.

[...] level algorithm. As described in the previous subsection, the redundant loading of bit-planes that are finally truncated by the Post-RDO introduces unnecessary power consumption. The most redundant access is the sign bit-plane access. The sign coding of a coefficient occurs at only one bit of this coefficient. In the conventional memory organization, the number of accesses to the sign bit of a coefficient equals the bit-width of this coefficient. However, only one of these accesses is for sign coding.

[...] average PSNR degrades by only about 0.1 dB. By use of this algorithm, the throughput of the bit-plane parallel EBC is increased since the effective bit-planes are known before coding.

B. Bit-plane Grouping Algorithm and Sign Scattering Algorithm

To solve the problem of redundant memory access, the Bit-Plane Grouping (BPG) algorithm and the Sign Scattering (SS) algorithm are proposed. The BPG algorithm eliminates the access of bit-planes truncated by the Pre-RDO, and the SS algorithm reduces the number of sign-bit accesses to one for each coefficient. These two algorithms [...]

Fig. 4. Bit-plane Grouping Memory Organization and Sign Scattering Algorithm

[...] Most Significant Bit (MSB) bit-plane is followed by the first word for the MSB-1 bit-plane. This addressing method keeps memory reads and writes as contiguous as possible.

The decoding algorithm for the EBC is simple. If a bit 1 is encountered when decoding a certain bit-plane, the next bit is the sign bit if there is no significant bit in the upper bit-planes; otherwise, the next bit is the magnitude bit of the next coefficient. With these grouping methods, the EBC can skip the words that store bit-planes truncated by the Pre-RDO. Therefore, no redundant access is required and the access power is reduced.

C. Bit-plane Parallel Context Formation Algorithm

In this section, we propose a bit-plane parallel context formation algorithm based on the parallel mode defined in the standard. In parallel mode, the arithmetic encoder is always terminated at the end of each coding pass and [...] algorithm is elaborated in subsequent paragraphs.

According to the scan order defined in the standard, h0, v0, d0 and d2 are always scanned before C, while h1, v1 and d3 are always scanned after C. For d1, the relative scan order depends on the position of C. When C is the first coefficient in a column of the stripe, d1 is scanned before C because d1 is scanned in the previous stripe, [...]

Besides the coding pass of C, which is obtained by using the above equations, the context of C can also be generated from the contributions of the neighbors according to the context table defined in the JPEG 2000 standard [1].

For the proposed algorithm, a 1 KB state memory is required to store the indicators of significant state and the indicators of refinement state. This memory is half of that in a bit-plane sequential design.

V. ARCHITECTURE

In this section, the architecture of the bit-plane parallel EBC is proposed, which is shown in Fig. 5. In this figure, SS means the decoding circuit for the SS, DP means the dispatcher, AE means the one-symbol arithmetic encoder, TSAE means the two-symbol arithmetic encoder, and BPC means the bit-plane coder.

The dataflow of the architecture is as follows. On the DWT side, the BPG & SS scatters sign bits into bit-planes, and groups bits of the same bit-plane into memory words. The Pre-RDO decides the truncation points before the EBC coding, and passes the truncation [...]

[...] the bit-plane parallel and the word-level EBC are presented.

Firstly, the total bandwidth reduction of the input of the bit-plane parallel EBC compared to the input of the word-level EBC is shown in Fig. 6. As shown in this figure, by use of BPG and SS, the external bandwidth is reduced by 55% on average.

Secondly, the performance gain is shown in Fig. 7. The performance is defined as the average number of DWT coefficients encoded per cycle by the EBC. Therefore, the performance gain is defined as the performance of the bit-plane parallel EBC divided by the performance of the word-level EBC. The experiments show that the bit-plane parallel EBC has good performance in lossy coding, since at the same performance it needs fewer bit-plane coders than the word-level EBC. In lossy coding, because the bit-plane parallel EBC allows multiple code-blocks to be encoded by the EBC concurrently, the gain can be larger than 1.
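The sign scattering and decoding rule described for the BPG and SS algorithms can be illustrated with a minimal Python sketch. It is a software model under stated assumptions, not the hardware design of this paper: the function names `scatter_signs` and `gather_signs` are hypothetical, bit-planes are kept as plain Python lists rather than packed memory words, and the sign bit is assumed to be 1 for negative coefficients. The sketch places each sign bit directly after the first 1 bit of its coefficient, so the sign is stored and read exactly once.

```python
def scatter_signs(coeffs, num_planes):
    """Return one bit list per bit-plane, MSB plane first (hypothetical helper).

    Each coefficient contributes one magnitude bit per plane. The sign bit
    (assumed 1 = negative) is inserted once, right after the first 1 bit.
    """
    significant = [False] * len(coeffs)
    planes = []
    for p in range(num_planes - 1, -1, -1):
        bits = []
        for i, c in enumerate(coeffs):
            bit = (abs(c) >> p) & 1
            bits.append(bit)
            if bit and not significant[i]:
                bits.append(1 if c < 0 else 0)  # scattered sign bit
                significant[i] = True
        planes.append(bits)
    return planes


def gather_signs(planes, count, kept_planes=None):
    """Decode `count` coefficients from the first `kept_planes` bit-planes.

    Rule from the text: after a 1 bit, the next bit is the sign bit if the
    coefficient has no significant bit in the upper planes; otherwise the
    next bit is the magnitude bit of the next coefficient.
    """
    num_planes = len(planes)
    kept = num_planes if kept_planes is None else kept_planes
    mags = [0] * count
    negative = [False] * count
    significant = [False] * count
    for idx in range(kept):  # planes beyond `kept` are never accessed
        p = num_planes - 1 - idx
        stream = iter(planes[idx])
        for i in range(count):
            if next(stream):
                mags[i] |= 1 << p
                if not significant[i]:
                    negative[i] = bool(next(stream))
                    significant[i] = True
    return [-m if n else m for m, n in zip(mags, negative)]


coeffs = [5, -3, 0, -8, 1]
planes = scatter_signs(coeffs, num_planes=4)
decoded = gather_signs(planes, len(coeffs))
```

In this model, `kept_planes` plays the role of a truncation point known before coding: the planes below it are simply never read, which mirrors how the EBC skips the words that store truncated bit-planes.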

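The performance gain plotted in Fig. 7 can be written compactly as follows. The symbols $P_{\mathrm{BP}}$, $P_{\mathrm{WL}}$, $N_{\mathrm{coef}}$ and $N_{\mathrm{cycle}}$ are notation assumed here for illustration and do not appear in the original text.

```latex
% Performance: average number of DWT coefficients encoded per cycle.
P = \frac{N_{\mathrm{coef}}}{N_{\mathrm{cycle}}}

% Performance gain of the bit-plane parallel EBC (BP) over the word-level EBC (WL).
G = \frac{P_{\mathrm{BP}}}{P_{\mathrm{WL}}}
```

With this notation, $G > 1$ means that the bit-plane parallel EBC encodes more coefficients per cycle than the word-level EBC.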