We propose a hardware accelerator IP for the Tier-1 portion of Embedded Block Coding with Optimal Truncation (EBCOT) used in the JPEG2000 next-generation image compression standard. EBCOT Tier-1 accounts for more than 70% of encoding time due to extensive bit-level processing. Our architecture consists of a 16-way parallel context formation module and a 3-stage pipelined arithmetic encoder. We reduce power consumption by selectively shutting down parts of the circuit. Compared with the best known design, we reduce the cycle count by 17% and come within 5% of the theoretical lower bound. We have implemented the design in synthesizable Verilog RTL with an AMBA-AHB interface for SOC design. FPGA prototyping has been successfully demonstrated, with substantial speedup achieved.
I. INTRODUCTION
JPEG2000 is the next-generation image compression/decompression standard. Like previous standards such as JPEG, it consists of three major phases: transformation, quantization, and entropy coding. JPEG2000 achieves better image quality and bit-rate efficiency at the expense of higher computational complexity. It employs the discrete wavelet transform (DWT) instead of the traditional discrete cosine transform (DCT) to overcome the block artifacts that appear at low bit rates. For entropy coding, a novel method called embedded block coding with optimal truncation (EBCOT) is employed. EBCOT consists of two tiers: Tier-1 codes each code-block into a sub-bit-stream, and Tier-2 assembles the sub-bit-streams into the image stream, trading off bit-rate consumption against image quality degradation.
In addition to computationally intensive coding tasks, a JPEG2000 codec needs to handle other complicated control functions such as rate-distortion control. Therefore, it is advantageous to implement it with a hardware/software co-design approach. According to our profiling report, the Tier-1 portion of EBCOT accounts for more than 70% of the total encoding time due to its extensive bit-level processing. Hence, it is the most suitable candidate for hardwired implementation.
EBCOT Tier-1 takes as its input a code-block ranging from 4x4 to 64x64 DWT-transformed and quantized pixels. Each pixel is a 9-bit sign-magnitude number. The code-block is divided into bit-planes and coded one bit-plane at a time, starting from the MSB bit-plane and moving down towards the LSB bit-plane. Each bit-plane is further divided into horizontal stripes of four rows each.
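The decomposition described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the choice of 8 magnitude bit-planes is an assumption (a 9-bit sign-magnitude value read as 1 sign bit plus 8 magnitude bits); the sign is ignored here.

```python
def bit_planes(block, magnitude_bits=8):
    """Return the magnitude bit-planes of a code-block, MSB plane first.

    block: list of rows of signed integers.
    magnitude_bits=8 is an assumption (1 sign bit + 8 magnitude bits).
    """
    planes = []
    for p in range(magnitude_bits - 1, -1, -1):
        planes.append([[(abs(v) >> p) & 1 for v in row] for row in block])
    return planes


def stripes(plane, height=4):
    """Split one bit-plane into horizontal stripes of `height` rows each."""
    return [plane[r:r + height] for r in range(0, len(plane), height)]
```

For an 8-row code-block, `stripes(bit_planes(block)[0])` yields the two four-row stripes of the MSB bit-plane.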
There are two phases in EBCOT Tier-1 coding: Context Formation (CF) and Arithmetic Encoding (AE). For each bit, the CF generates a context/decision (CX, D) pair based on information about its surrounding bits. The AE performs context-adaptive binary arithmetic coding on D according to CX. Each bit-plane is scanned by the CF three times; these scans are called passes. Every bit is coded in exactly one of these passes. The scanning order is bit-plane by bit-plane, pass by pass, stripe by stripe, column by column, and bit by bit, as illustrated in Figure 1. When a bit is scanned, whether and how it should be coded depends on its own status and that of its eight neighboring bits. After coding a bit and possibly generating its context/decision (CX, D) pair, we have to update its status information for the coding of neighboring bits that are yet to be coded. The AE takes as its input the sequence of context/decision pairs generated by the CF, adaptively performs binary arithmetic coding, and updates its probability estimation tables. The coded result is output byte by byte.
0-7803-8631-0/04/$20.00 ©2004 IEEE.
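The stripe-oriented part of this scanning order can be sketched as a small Python generator. This is only an illustration of the order within one pass of one bit-plane; the function name is hypothetical.

```python
def stripe_scan_order(height, width, stripe_height=4):
    """Yield (row, col) positions stripe by stripe, column by column, bit by bit."""
    for top in range(0, height, stripe_height):
        for col in range(width):
            # The last stripe may be shorter if height is not a multiple of 4.
            for row in range(top, min(top + stripe_height, height)):
                yield (row, col)
```

For an 8x2 bit-plane, the first eight positions cover the two columns of the top stripe before the scan moves to row 4.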
II. PROPOSED ARCHITECTURE
Our proposed architecture consists of six parts as depicted in Figure 2. The code memory supports pixel-by-pixel input of a code-block and simultaneous output of 16 bits of a bit-plane. The state memory records the status of relevant bits. The address generator supplies the correct addresses to these memory modules.
A. Context Formation
There are three factors affecting the degree of parallelism in the CF: (1) scanning order, (2) checking neighbors, and (3) changing state. An example is illustrated in Figure 3. The number annotating a bit represents its position in the scanning order. To generate the context of Bit 6, we must check the states of its 8 neighboring bits enclosed in the context window. However, Bits 1, 2, 3, and 5 have already been coded and may have had their states changed. They affect the coding of Bit 6 immediately. Similarly, the coding result of Bit 6 will be used to code Bits 7, 9, 10, and 11.
We depict the data dependency among these sixteen bits as a data flow graph (DFG), shown in Figure 4. If we process the sixteen bits with a single hardware unit, the clock period requires ten delay units. This is faster than a sample-based or column-based implementation, which requires sixteen delay units to process sixteen bits. Therefore, we can use a lower supply voltage to achieve the same level of throughput. We estimate that, with proper voltage and frequency scaling, the 16-bit parallel architecture can save up to 60% of power consumption compared with a sample-based architecture.
Figure 4. Data dependency in context formation (in delay units).
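The ten-delay-unit figure can be reproduced with a short longest-path computation. The sketch below is not taken from the paper's DFG; it assumes that bits are scanned column by column, that each bit costs one delay unit, and that a bit depends on every 8-connected neighbor inside the group that is scanned before it. The function name is hypothetical.

```python
def critical_path(rows=4, cols=4):
    """Longest dependency chain, in delay units, for a rows x cols group of bits.

    Assumptions: column-by-column scan, one delay unit per bit, and a bit
    depends on every 8-connected neighbour in the group scanned before it.
    """
    depth = {}
    for c in range(cols):
        for r in range(rows):
            preds = [
                depth[(rr, cc)]
                for cc in (c - 1, c, c + 1)
                for rr in (r - 1, r, r + 1)
                if (rr, cc) in depth  # only bits that were scanned earlier
            ]
            depth[(r, c)] = 1 + max(preds, default=0)
    return max(depth.values())
```

Under these assumptions `critical_path(4, 4)` evaluates to 10, against 16 steps when the sixteen bits are handled strictly one after another.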
If we reduce the frequency of memory access and utilize memory bandwidth efficiently, we can further reduce the power consumption. A memory-saving algorithm and an interleaved data arrangement scheme [4] are adopted, with slight modification, for 16-bit parallel processing. In the interleaved scheme, Module A holds Rows 0, 1, 6, 7, ..., Module B holds Rows 2, 3, 8, 9, ..., Module C holds Rows 4, 5, 10, 11, ..., and so on. The memory modules are byte addressable, where each byte consists of four bits from each of two adjacent rows. During memory access, the ordering of memory data depends on which stripe is being coded. For example, the data order is (C, A, B) when we code Stripe n, and the data order is (B, C, A) when we code Stripe n+1. The 3-byte by 3-byte switch meets this requirement.
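The row-to-module mapping and the rotating data order can be modeled in a few lines of Python. The mapping follows the row lists given above; the stripe alignment is an assumption made only so that the example is concrete (Row 0 is treated as a padding row, so stripe s covers Rows 4s+1 to 4s+4 and its six-row window spans Rows 4s to 4s+5). Both function names are hypothetical.

```python
MODULES = "ABC"


def module_for_row(row):
    """Memory module holding `row`: rows are paired, and pairs rotate over A, B, C."""
    return MODULES[(row // 2) % 3]


def switch_order(stripe):
    """Module order, top to bottom, for the three row pairs around `stripe`.

    Assumption: row 0 is padding, so the window of stripe s starts at row 4s.
    """
    first_row = 4 * stripe
    return tuple(module_for_row(first_row + 2 * k) for k in range(3))
```

Under this alignment, consecutive stripes 1 and 2 give the orders (C, A, B) and (B, C, A), which is the rotation the switch has to undo.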
We use 24 3-bit shift registers for local data reuse, as shown in Figure 6. During each processing step, via the switch, we get the data in the right order from the three memory modules and place them in the first 24 bits of these shift registers while the old 24 bits are shifted left. The context window encloses all data needed for parallel processing. It includes the sixteen bits currently being coded in the stripe and twenty neighboring bits. The context window slides right to the next four columns (sixteen bits) after each processing step.
, and (3) all bits have been coded in either Pass 1 or Pass 2 during Pass 3 coding. We adopt the skipping scheme and propose a stripe-skipping method that uses three 16-bit registers to record the coding condition of all stripes in all passes. This leads to about 2% saving in total coding cycles.
B. Arithmetic Encoder
The AE takes as its input the sequence of (CX, D) pairs produced by the CF and ordered by the Compress & PISO module. Since the AE data flow defined in the standard has feedback loops, we cannot employ a straightforward pipelined structure. We adopt a Modified Probability Estimation Table (MPET) [5] and the operand forwarding method commonly found in RISC CPU design to overcome this problem.
Our 3-stage pipelined AE architecture is depicted in Figure 7. In the first stage, we read in a context/decision (CX, D) pair and use CX to look up the Context Table for the probability estimation. Since the probability estimation of CX may be updated by feedback from the second pipeline stage, we need to handle the case where two identical contexts arrive consecutively. We use the MPET, in which the original PET data and two types of updated PET data are read simultaneously using one index, and we forward the result of the second stage to select the correct data. In the second stage, we calculate the updated values for both the A register and the Context Table, and dispatch the shift amount to the third stage. In the third stage, we either calculate the updated values for both the C register and the counter CT or perform the renormalization procedure. After all bits of a code-block have been coded, the AE is terminated and flushed, and a complete sub-bit-stream is generated.
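The hazard caused by two identical consecutive contexts, and how forwarding removes it, can be shown with a toy Python model. This is a sketch under simplifying assumptions, not the MPET design: `next_state` is a hypothetical update rule standing in for the real probability estimation table, and the pipeline is reduced to a read in stage 1 and a write-back one cycle later.

```python
def next_state(idx, d):
    """Hypothetical update rule standing in for the real probability table."""
    return min(idx + 1, 7) if d == 1 else max(idx - 1, 0)


def sequential(pairs):
    """Reference behaviour: every pair sees all earlier updates."""
    table, used = {}, []
    for cx, d in pairs:
        idx = table.get(cx, 0)
        used.append(idx)
        table[cx] = next_state(idx, d)
    return used


def pipelined(pairs, forwarding=True):
    """Stage 1 reads the table while stage 2 still holds the previous update."""
    table, used = {}, []
    pending = None  # (cx, new index) from the previous pair, not yet written back
    for cx, d in pairs:
        idx = table.get(cx, 0)  # read happens before the pending write commits
        if forwarding and pending is not None and pending[0] == cx:
            idx = pending[1]  # take the fresh value straight from stage 2
        if pending is not None:
            table[pending[0]] = pending[1]  # write-back at the end of the cycle
        used.append(idx)
        pending = (cx, next_state(idx, d))
    return used
```

With the input `[(3, 1), (3, 1), (3, 1), (5, 0)]`, the forwarding version uses the same state indices as the sequential reference, while the version without forwarding reads a stale index for the second pair.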
According to a previous study [3], input symbols to the AE have a highly skewed distribution. With a simple renormalization strategy [4], the BYTEOUT procedure takes more than one clock cycle to complete the C register operation. We use the operand forwarding method commonly found in RISC CPU design to overlap the actions of BYTEOUT and register updating, as shown in Figure 8. Consequently, we are able to achieve about a 10% reduction in cycle count.
III. EXPERIMENTAL RESULTS
We have implemented the proposed architecture in Verilog RTL and synthesized it with Synopsys Design Compiler under the worst-case operating environment (WCCOM). We also use the PrimePower software to analyze the power consumption. The results are shown in Table 1.
Table 2 shows the results. We reduce the cycle count by 17% compared with the best known column-based architecture. In addition, taking the number of context/decision pairs as the lower bound on the cycle count, we are within 5% of the optimum.
We integrate our design with the JasPer JPEG2000 software on Altera's Excalibur EPXA10 DDR, which is a system on a programmable chip (SOPC) consisting of a 32-bit ARM922T RISC CPU, a 1M-gate FPGA module, and a set of commonly used peripherals and memory. We synthesize our design using the Quartus II compiler and place it in the FPGA module of the SOPC platform. We also write a device driver so that the software running on the embedded CPU core can easily communicate with the EBCOT Tier-1 accelerator. Figure 9 shows the real-time measurement results. Both the CPU and the FPGA are clocked at 12.5 MHz. Compared with the pure software implementation, our EBCOT Tier-1 accelerator delivers a speedup of about 10X despite the large overhead of data transfer between SDRAM, CPU, and the AMBA-AHB bridge.
image coding standard. Our novel architecture leads to a 17% reduction in cycle count compared with the state of the art. We come within 5% of the theoretical bound. We have also demonstrated the developed IP on an SOC platform using FPGA prototyping. The proposed IP can be used for any AMBA-based SOC design for JPEG2000 image coding.
