Concurrent bit-plane coding architecture for EBCOT in JPEG2000 by Chiang, Jen-shiun
Concurrent Bit-Plane Coding Architecture
for EBCOT in JPEG2000
Jen-Shiun Chiang, Chang-Yo Hsieh, Jin-Chan Liu, and Cheng-Chih Chien
Department of Electrical Engineering, VLSI Lab.
Tamkang University
Tamsui, Taipei, Taiwan
E-mail: {chiang, cyhsieh, jcliu, chien}@ee.tku.edu.tw
Abstract-This work presents a concurrent bit-plane coding architecture for
EBCOT of JPEG2000. The architecture encodes data from two bit-planes at the
same time, which reduces the internal memory requirement. Compared with the
conventional approach, the concurrent architecture saves 8K bits of internal
memory. The architecture can process data as soon as the data of the two
bit-planes are available, and the system keeps reading data from the
external memory at the same time. This increases the computation efficiency,
avoids the waiting time for reading external data, and reduces the number of
internal memory accesses. Compared with the conventional context modeling
architecture, the proposed concurrent bit-plane coding architecture reduces
the computation time by more than 50%.

I. INTRODUCTION

JPEG2000 is the latest still image compression standard developed by ISO/IEC
JTC1/SC29/WG1 [1]. It is the advanced version of JPEG and improves many
features in low bit-rate image compression. Furthermore, it has more novel
features such as lossy and lossless compression, progressive image
transmission by quality or resolution, region of interest coding, and good
error resilience [2].

JPEG2000 consists of the discrete wavelet transform (DWT), scalar
quantization, context modeling, binary arithmetic coding, and
post-compression rate allocation. The main scheme of the context modeling is
implemented by the embedded block coding with optimized truncation (EBCOT)
technique, which provides a rich set of features, such as scalable
resolution and SNR with a random access property [3]. However, EBCOT takes
more than 50% of the encoding time in a software-based JPEG2000
implementation [4] [5]. Since EBCOT consumes so much computation time, many
studies focus on this module to improve its efficiency. Among them, the
sample skipping (SS) and group-of-column skipping (GOCS) methods skip the
many unnecessary coded samples and columns [5]. Taubman et al. proposed
concurrent symbol processing at fractional bit-plane coding [9]. On the
other hand, EBCOT requires 20K bits of internal memory [8], which is quite
large for an ASIC implementation. The pass-parallel architecture was
therefore proposed to improve the computation performance, and it also
reduces the internal memory requirement by 4K bits [6] [7]. In order to
further improve the performance, we propose a concurrent bit-plane coding
architecture for EBCOT. This work adopts parallelism and can process two
bit-planes at a time. The proposed concurrent bit-plane coding architecture
not only increases the computation performance but also reduces the internal
memory requirement. The number of memory accesses is reduced, and the
computation time is reduced by more than 50%.

II. EBCOT ALGORITHM AND SPEEDUP METHODS

The block diagram of the JPEG2000 encoder is shown in Figure 1. The discrete
wavelet transform and the scalar quantization are first applied to the input
image data. The quantized transform coefficients are then entropy coded by
using context modeling and adaptive binary arithmetic coding. Finally, the
compressed data is organized into a feature-rich code-stream by using the
post-compression rate-distortion optimization algorithm [2]. The key
algorithms of the entropy coding involved in this paper are described in the
following subsections.

Figure 1. Block diagram of the JPEG2000 encoder
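The memory figures quoted above can be cross-checked with a little arithmetic. The sketch below is only a reading aid: it assumes that each state memory holds one bit per sample of a 64x64 code-block, which gives the 4K-bit unit used in the text, and all variable names are illustrative.

```python
# Assumption: each internal memory stores one bit per sample of a
# 64x64 code-block, so one memory is 4096 bits (the "4K-bit" unit).
BITS_PER_MEMORY = 64 * 64

# 20K-bit internal memory quoted for conventional EBCOT.
conventional = 20 * 1024
# The pass-parallel architecture is quoted as saving 4K bits.
pass_parallel = conventional - BITS_PER_MEMORY
# The concurrent bit-plane architecture is quoted as saving 8K bits.
concurrent = conventional - 2 * BITS_PER_MEMORY

for name, bits in [("conventional", conventional),
                   ("pass-parallel", pass_parallel),
                   ("concurrent bit-plane", concurrent)]:
    memories = bits // BITS_PER_MEMORY
    print(f"{name}: {bits} bits = {memories} memories of 4K bits")
```

Under this assumption the three designs correspond to five, four, and three 4K-bit memories respectively.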
0-7803-9390-2/06/$20.00 ©2006 IEEE 4595 ISCAS 2006
A. Context Modeling

After the transformation and quantization steps are performed, each sub-band
is partitioned into rectangular blocks (called code-blocks), typically 64x64
or 32x32 in dimension. In the context modeling module, all quantized
transform coefficients of the code-blocks are expressed in sign-magnitude
representation and divided into one sign bit-plane and several magnitude
bit-planes. Each bit-plane in a code-block can be divided into several
stripes, and each stripe is composed of four rows of samples. In order to
improve the embedding of the compressed bit-stream, each bit-plane is coded
in three coding passes. Each sample in a bit-plane is coded in only one of
the three coding passes and skipped in the other two passes. The three
coding passes are: significance propagation (Pass 1), magnitude refinement
(Pass 2), and cleanup (Pass 3). The block coding algorithm is composed of
four coding primitives: zero coding (ZC), sign coding (SC), magnitude
refinement (MR), and run-length coding (RLC). More details about the context
modeling algorithm can be found in [1] and [8].

B. Adaptive Binary Arithmetic Coding

The MQ coder is an adaptive binary arithmetic coder with
renormalization-driven probability estimation. To reduce complexity, there
are only 18 contexts used in JPEG2000, and each coding context is
represented by 5 bits of the state information. Since the core of the MQ
coder is adaptive in nature, the content of the selected context is updated
based on the probability estimation process defined in JPEG2000 whenever a
renormalization occurs. A byte of compressed data is periodically removed
and output from the high-order bits of the code register C to keep C from
overflowing. When all of the symbols have been encoded, the FLUSH procedure
is executed to terminate the encoding operations and generate the required
terminating marker. Several bytes are also generated in the FLUSH procedure.

C. Speedup Method

In the EBCOT coder, each sample in a bit-plane is coded in only one of three
coding passes and skipped in the other two passes. Therefore, clock cycles
may be wasted in processing sample locations that do not belong to the
current coding pass [5] [6]. Obviously, a large number of clock cycles may
be wasted if the straightforward method is used. Many studies address this
problem: the sample skipping (SS) and group-of-column skipping (GOCS)
approaches [5] try to skip the unnecessary samples and columns, and
concurrent symbol processing [9] uses the idea of parallel encoding of a
column in each coding pass.

The key idea of the SS method is to skip unnecessary samples in a single
column. SS is more efficient than the straightforward method, but a clock
cycle is still wasted when a stripe column is "empty", which means that none
of the samples of the stripe column belongs to the current coding pass.
Therefore, the second speedup method, GOCS, is designed to further improve
the processing speed. It skips a group of "empty columns" simultaneously at
the cost of an extra GOCS memory. The number of columns in a group is a
compromise between processing speed and area cost.

The third speedup method, concurrent symbol processing, uses stripe-column
concurrent encoding at each coding pass. This approach increases the number
of CX-D pairs produced in a clock cycle. Because more CX-D pairs are
generated per clock cycle, buffers are needed between the context modeling
and the arithmetic coder. On the other hand, the concurrent symbol
processing approach still encodes data pass by pass, and therefore it may
waste computation time on many unnecessary stripe columns.

D. Pass Parallel Context Modeling

The pass-parallel architecture was used in the PPCM and PCCM approaches [6]
[7]. Because of the inefficiency of the context modeling of EBCOT, PPCM
increases the efficiency by merging the three coding passes into a single
one. Moreover, PCCM can encode a column in each stripe concurrently. PPCM
and PCCM require four blocks of memory, and each block takes 4K bits. These
four blocks are χ (records all signs of samples in a bit-plane), vp (records
all magnitudes of samples in a bit-plane), σ0 (records the significance of
Pass 1), and σ1 (records the significance of Pass 3). The refinement memory
can be replaced by σ0 ⊕ σ1, where ⊕ is the logical exclusive-or operation.
Therefore, the memory requirement of PPCM and PCCM is 4K bits less than that
of the conventional design.

Since PPCM merges the three coding passes into a single pass, it encounters
two problems. One is that a sample belonging to Pass 3 may become
significant earlier than in Pass 1. The other is how to predict the neighbor
significances of the coded samples that belong to Pass 1, Pass 2, and Pass 3
respectively. PPCM uses two methods to solve the first problem. First, it
uses two memory blocks σ0 and σ1 to record the significances of Pass 1 and
Pass 3, and then it delays the Pass 3 coding by one stripe column. For the
second problem, it uses Table I to predict the neighbor significances.
Besides, it uses the "stripe causal" mode of JPEG2000 [1] to break the
correlation between the current stripe and the next stripe. By using these
techniques, all samples in each column can be encoded one by one
efficiently.

TABLE I. THE PREDICTED TECHNIQUE FOR THREE PASS TYPES

Pass Type   Significance Prediction
            Coded samples        Un-coded samples
Pass 1      σ0[k]                σ0[k] || σ1[k]
Pass 2      σ0[k]                σ0[k] || σ1[k] || vp[k]
Pass 3      σ0[k] || σ1[k]       σ0[k] || σ1[k]

||: OR logic operation, k: location of the coded sample

III. PROPOSED ARCHITECTURE

Based on the pass-parallel architecture, we propose a concurrent bit-plane
coding architecture that can remove two 4K-bit coding state memories (the
bit-plane relationship variable and the refinement state variable). The
block diagram
of the proposed context modeling is shown in Figure 2. The proposed
architecture requires three 4K-bit internal memories χ[k], vp[k], and σ[k].
The context modeling engine performs the coding operations (ZC, SC, MR, and
RLC) according to the information provided by the context window controller.
During the encoding process, the context window controller reads the current
magnitude bit-plane data from the internal memory and the next magnitude
bit-plane data from the external code-block memory as the double-deck
bit-plane data. In order to improve the efficiency of the whole system, we
use the pipelined pass switching arithmetic encoder (PSAE) [6] [7] in our
encoder. The parallel context modeling may generate a large number of CX-D
pairs at the same time [7]. In order to prevent overflow, a group of
parallel-in-serial-out (PISO) buffers should be included in front of the
PSAE, and the PISO buffer may send a "halt" signal to the context modeling
when overflow occurs in the system.

Figure 2. The block diagram of the proposed context modeling

A. Concurrent Bit-Plane Coding

In order to increase the computation efficiency of the system and reduce the
internal memory requirement, we propose a new context modeling architecture
that can encode two bit-planes concurrently. Of the two bit-planes, the
proposed architecture encodes the first bit-plane with the coding procedures
belonging to Pass 1 and Pass 3, and at the same time encodes the second
bit-plane with the coding procedure belonging to Pass 2. In context
modeling, when Pass 1 and Pass 3 encode ZC and SC, the next encoding of the
bit-plane must belong to the first refinement, and the first refinement
memory can be omitted. In total the new architecture needs only 12K bits of
internal memory. Compared with the conventional approach, the concurrent
bit-plane coding architecture saves 8K bits of memory.

The concurrent bit-plane coding procedure is shown in Figure 3. In Figure 3,
Pass 1 and Pass 3 are both encoding the 5th bit-plane of a code-block at
present, and at the same time Pass 2 is encoding the 4th bit-plane at the
same column. When Pass 1 and Pass 3 finish the encoding of the 5th bit-plane
and Pass 2 finishes the encoding of the 4th bit-plane, the encoding of Pass
1 and Pass 3 goes to the 4th bit-plane and the encoding of Pass 2 goes to
the 3rd bit-plane. The concurrent bit-plane encoder repeats this procedure
until the encoding of Pass 1 and Pass 3 finishes the 0th bit-plane.

Figure 3. Concurrent bit-plane coding

The proposed architecture is based on the techniques of pass-parallel and
concurrent bit-plane coding, so the three fractional bit-plane coding passes
are no longer processed one by one but are mixed instead. Therefore the
prediction method for the significance parameter must be revised to keep the
encoding result correct. In order to maintain the correctness of the code,
the pass-parallel architecture delays the operation of Pass 3 by one column
[6] [7]. For concurrent bit-plane coding the architecture of PPCM must be
revised. Here the three fractional bit-plane coding passes are encoded at
the same time. To keep the result correct, the significance state value
prediction method for the neighbor samples must be revised in the coding
procedure of Pass 3. Here we introduce the idea of "virtual significance"
σv[k]. σv[k] is the significance state for the coding procedure of Pass 3
after the encoding procedures of Pass 1 and Pass 2. Fig. 4 shows the
calculation circuit of σv[k].

Figure 4. Virtual significance prediction for location X (σN[k]: neighbor
significance for location X)

Whether a coefficient belongs to Pass 1 can be determined from the coding
conditions mentioned in [w], and can be expressed as follows:

    C_P1[X] = !σ[X] && σN[X]              (1)

where σN[X] is calculated from the significance state values of the
neighbors of location X, as shown in Figure 4. The coefficients for Pass 2
and Pass 3 can be calculated by

    CpAX] -4X] x[X]                       (2)

    C_P3[X] = !C_P1[X] && !σv[X]          (3)

Figure 5. An example of the location of the predicted sample (legend:
current sample, coded sample, un-coded sample)

After the three-coding-pass detection is finished, each coefficient can be
encoded. However, each coding pass needs the significance values of the
neighbors for ZC, SC, MR, and RLC. In order to predict the significance
state value of the neighbor samples correctly, it must divide the neighbor
samples of the context window into coded samples and un-coded samples. As
shown in Figure 5, the coded samples are coded before the current sample,
and the un-coded samples are coded after the current sample.

The three coding procedures need the significance state values of the
neighbor samples while encoding, and the prediction method differs for coded
and un-coded samples. For Pass 1 encoding, both the coded and the un-coded
samples use σ[k]. The significance state value σ[k] obtained after Pass 3
cannot be used directly, because it may corrupt the significance state value
seen by the neighbor samples of Pass 1 and generate wrong encoding results
in Pass 1 and Pass 2 coding. In order to solve this problem, a 4-bit
significance state buffer σb[k] is used. The significance state value
obtained after Pass 3 encoding must be stored in the buffer σb[k] first, as
shown in Figure 6. The value of σb[k] is ORed with the significance state
value obtained after Pass 1 coding, and the result is written back to the
significance state memory σ[k]. Therefore the significance state value of
the un-coded samples of Pass 3 is σ[k], and the significance state value of
the coded samples is σb[k].

Figure 6. Virtual significance σv and significance buffer σb for Pass 3

For Pass 2 coding, the significance state value can be calculated by
equations (4) and (5). Equation (4) is the significance state value of the
un-coded sample, and (5) is the significance state value of the coded
sample. Finally, the significance state value predictions of the three
coding passes are summarized in Table II.

    σu[k] = σ[k] || vp[k]                 (4)

    σc[k] = σ[k] || σb[k]                 (5)

TABLE II. THE PREDICTED TECHNIQUE FOR THE THREE PASS TYPES

Pass Type   Significance Prediction
            Current sample   Coded samples      Un-coded samples
Pass 1      σ[k]             σ[k]               σ[k]
Pass 2      σ[k]             σ[k] || σb[k]      σ[k] || vp[k]
Pass 3      σ[k]             σb[k]              σ[k]

||: OR logic operation, k: location of the coded sample

B. Pipelined Pass Switch Arithmetic Encoder

In order to increase the performance of the MQ coder, we use a pipelined
architecture that divides the coding procedure into four stages. This
architecture was proposed in [7]. With the pass-parallel technique, the CX-D
data streams sent into the MQ coder from our concurrent bit-plane coding
architecture are interleaved. Therefore more context registers and coding
state registers are used. For these reasons the pipelined MQ coder proposed
in [7] is adopted.

IV. EXPERIMENTAL RESULTS

We use three different images, Jet, Baboon, and Lena, all of size 512x512,
as the test images. These three images are processed by the proposed
concurrent bit-plane coding architecture, PCCM [7], PPCM [6], and the sample
based architecture [3]. The processing times in mega clock cycles are shown
in Figure 7. The proposed architecture reduces the computation time by more
than 50%.

Figure 7. Processing time comparison of the proposed and conventional
schemes (axis: Processing Time, Mega Clock Cycles)

V. CONCLUSION

In this paper we propose a concurrent bit-plane coding architecture that
saves 8K bits of internal memory for EBCOT in JPEG2000. Moreover, a
non-delay pass-parallel architecture is proposed, which increases the
computation efficiency. Compared with conventional architectures, the
proposed concurrent bit-plane coding architecture can reduce the computation
time by more than 50%.

REFERENCES

[1] JPEG 2000 Part I Final Committee Draft Version 1.0, ISO/IEC
    JTC1/SC29/WG1 N1646R, Mar. 2000.
[2] A. Skodras, C. Christopoulos, and T. Ebrahimi, "The JPEG 2000 still
    image compression standard," IEEE Signal Processing Mag., vol. 18,
    pp. 36-58, Sept. 2001.
[3] D. Taubman, "High performance scalable image compression with EBCOT,"
    IEEE Trans. Image Processing, vol. 9, no. 7, pp. 1158-1170, July 2000.
[4] M. D. Adams and F. Kossentini, "JasPer: a software-based JPEG-2000
    codec implementation," IEEE Int. Conf. Image Processing, vol. 2,
    pp. 53-56, Sep. 2000.
[5] C. J. Lian, K. F. Chen, H. H. Chen, and L. G. Chen, "Analysis and
    architecture design of block-coding engine for EBCOT in JPEG2000,"
    IEEE Trans. Circuits and Systems for Video Technology, vol. 13,
    pp. 219-230, March 2003.
[6] J. S. Chiang, Y. S. Lin, and C. Y. Hsieh, "Efficient pass-parallel
    architecture for EBCOT in JPEG2000," IEEE Int. Symp. Circuits and
    System, vol. I, pp. 773-776, May 2002.
[7] J. S. Chiang, C. H. Chang, Y. S. Lin, C. Y. Hsieh, and C. H. Hsia,
    "High-speed EBCOT with dual context-modeling coding architecture for
    JPEG2000," IEEE Int. Symp. Circuits and System, vol. III, pp. 865-868,
    May 2004.
[8] D. Taubman, E. Ordentlich, M. Weinberger, G. Seroussi, I. Ueno, and
    F. Ono, "Embedded block coding in JPEG2000," IEEE Int. Conf. Image
    Processing, vol. 2, pp. 33-36, Sept. 2000.
[9] A. K. Gupta, S. Nooshabadi, and D. Taubman, "Efficient VLSI
    architecture for buffer used in EBCOT of JPEG2000 encoder," IEEE Int.
    Symp. Circuits and System, pp. 4361-4364, May 2005.
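As a reading aid for Section II-A and equation (1), the rule that each sample is coded in exactly one of the three passes can be sketched in software. The code below follows the sequential three-pass rule of standard EBCOT, not the concurrent hardware logic of this paper: a sample that was already significant goes to Pass 2, an insignificant sample with a significant neighbor at the time it is visited goes to Pass 1, and the rest go to Pass 3. The function names, the list-of-lists layout, and the use of the default (non stripe-causal) neighborhood are assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

Grid = List[List[int]]


def scan_order(rows: int, cols: int) -> Iterator[Tuple[int, int]]:
    """Stripe-oriented scan: stripes of four rows, column by column."""
    for top in range(0, rows, 4):
        for c in range(cols):
            for r in range(top, min(top + 4, rows)):
                yield r, c


def has_significant_neighbor(sigma: Grid, r: int, c: int) -> bool:
    """True if any of the eight neighbors of (r, c) is significant."""
    rows, cols = len(sigma), len(sigma[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                if sigma[r + dr][c + dc]:
                    return True
    return False


def classify_passes(sigma_in: Grid, bits: Grid) -> Grid:
    """Return the pass number (1, 2, or 3) that codes each sample.

    sigma_in: significance state before this bit-plane (0 or 1).
    bits:     magnitude bits of this bit-plane (0 or 1).
    """
    rows, cols = len(sigma_in), len(sigma_in[0])
    sigma = [row[:] for row in sigma_in]        # updated during Pass 1
    passes = [[3] * cols for _ in range(rows)]  # default: cleanup pass
    for r, c in scan_order(rows, cols):
        if sigma_in[r][c]:
            passes[r][c] = 2                    # already significant
        elif has_significant_neighbor(sigma, r, c):
            passes[r][c] = 1                    # !sigma[X] && sigma_N[X]
            if bits[r][c]:
                sigma[r][c] = 1                 # becomes significant now
    return passes


# Small example: one significant sample and one new 1-bit below it.
sigma0 = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
plane = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
result = classify_passes(sigma0, plane)
for row in result:
    print(row)
```

In this example the sample at row 1, column 0 becomes significant during Pass 1, which is why the samples in row 2 next to it are also drawn into Pass 1. This in-scan update is the dependency that the prediction tables of the pass-parallel designs have to reproduce.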

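The bit-plane pairing described in Section III-A (Pass 1 and Pass 3 on one bit-plane while Pass 2 works on the bit-plane below it) can be written out as a simple schedule. This is a minimal sketch under the assumption that the pairing starts at the most significant coded bit-plane and that no Pass 2 work remains once Pass 1 and Pass 3 reach the 0th bit-plane; `concurrent_schedule` is a hypothetical helper, not part of the proposed hardware.

```python
from typing import Iterator, Optional, Tuple


def concurrent_schedule(top_plane: int) -> Iterator[Tuple[int, Optional[int]]]:
    """Yield (plane coded by Pass 1 and Pass 3, plane coded by Pass 2).

    The second element is None when there is no lower bit-plane left
    for Pass 2 (assumption: this happens only at bit-plane 0).
    """
    for plane in range(top_plane, -1, -1):
        refinement_plane = plane - 1 if plane > 0 else None
        yield plane, refinement_plane


steps = list(concurrent_schedule(5))
for pass13_plane, pass2_plane in steps:
    print(f"Pass 1/3 on bit-plane {pass13_plane}, Pass 2 on bit-plane {pass2_plane}")
```

With a top bit-plane of 5 the first step pairs bit-planes 5 and 4 and the next pairs 4 and 3, matching the example given for Figure 3.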