In this paper, a field-programmable gate array (FPGA) based enhanced architecture of the arithmetic coder is proposed, which processes two symbols per clock cycle, whereas the conventional architecture processes only one symbol per clock cycle. The input to the arithmetic coder comes from the bitplane coder, which generates more than two context-decision pairs per clock cycle. Because the arithmetic coder cannot consume these pairs as fast as they are produced, the overall encoding becomes slow. To overcome this bottleneck, a two-symbol architecture is proposed which not only doubles the throughput, but can also be operated at frequencies greater than 100 MHz. This architecture achieves a throughput of 210 Msymbols/sec and the critical path is 9.457 ns.
INTRODUCTION
JPEG 2000 is an image compression standard, which provides excellent compression performance and has features like low bit-rate performance, region of interest, etc. The most computationally intensive components in JPEG 2000 are the discrete wavelet transform (DWT) and the embedded block coding with optimized truncation (EBCOT). The EBCOT engine is made up of two stages, i.e., the context formation (CF) stage and the arithmetic coding (AE) stage. After the DWT, each sub-band is divided into code-blocks, which are independently processed by the EBCOT engine. The CF part generates context (CX) and decision (D) bit pairs, also known as CX-D pairs, which are further entropy coded by the AE engine. There are 19 predefined contexts in the JPEG 2000 standard [1].
The AE engine is the major throughput bottleneck of JPEG 2000 due to its serial processing nature. The conventional implementation described in [1] processes only one CX-D pair per clock cycle. Since the CF engine generates more than two CX-D pairs per clock cycle most of the time, the AE engine must be fast enough both to reduce the computation time and to reduce the memory required to buffer CX-D pairs at its input. If the AE engine can process more than one symbol per clock cycle, the bottleneck of the system can be reduced dramatically.
Several researchers have proposed architectures to reduce computational time [2]-[6] and to use memory efficiently [7]. In these papers, the processing speed is well below 100 MHz and the throughput is below 62 Msymbols/sec. A two-symbol architecture needs not only a higher clock speed but also a higher throughput, so that the overall system performance can be increased. Furthermore, the critical path is of equal importance for hardware implementation. In this paper, a two-symbol architecture is proposed, which encodes two CX-D pairs in every clock cycle. The proposed architecture is efficient in its prediction process and in its byteout, renormalization and flush procedures. The coding speed is nearly 110 MHz and the critical path is 9.457 ns. The proposed design is implemented on an Altera Stratix FPGA.
The remainder of the paper is organized as follows. Section II gives a general description of the arithmetic coder. Section III elaborates on the proposed architecture, while Section IV presents the implementation results. Finally, concluding remarks are drawn in Section V.
ARITHMETIC CODING
The arithmetic coder in JPEG 2000 encodes streams of data consisting of a sequence of symbols. Each symbol is classified into one of two categories, the most probable symbol (MPS) or the least probable symbol (LPS), based on its probability of occurrence. In AE, an interval serves as the probability model. It is divided into subintervals, each of which corresponds to the probability of a symbol. When a symbol occurs, the subinterval associated with that symbol becomes the new interval. The recursive splitting of the current interval continues until all symbols have been received from the CF engine.
The AE stage is basically a sequential processing unit, where a series of CX-D pairs generated by the CF engine are coded, using context based probability estimation. There are 19 contexts and each of these contexts has an associated probability state that identifies the MPS and the index (I). The MPS and I point to a probability estimation table, which determines the probability estimation (Qe) for the LPS, the next index values (NMPS, NLPS) and the probable symbol change of the MPS (SWITCH). The AE algorithm mainly deals with updating a set of registers based on the MPS and LPS. These registers are A, C, Ct and B.
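The per-context state lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware table of the proposed design: the names `QE_TABLE`, `ContextState` and `update_state` are hypothetical, only the first six rows of the probability estimation table are reproduced (as given in the standard [1]; they should be checked against it before reuse), and the sketch shows the transition taken when a symbol triggers renormalization.

```python
from dataclasses import dataclass

# Excerpt of the probability estimation table: index -> (Qe, NMPS, NLPS, SWITCH).
# Only the first six rows are listed; the full table in the standard has more.
QE_TABLE = {
    0: (0x5601, 1, 1, 1),
    1: (0x3401, 2, 6, 0),
    2: (0x1801, 3, 9, 0),
    3: (0x0AC1, 4, 12, 0),
    4: (0x0521, 5, 29, 0),
    5: (0x0221, 38, 33, 0),
}

@dataclass
class ContextState:
    index: int = 0  # I: row of the probability estimation table
    mps: int = 0    # current sense of the most probable symbol

def update_state(state: ContextState, d: int) -> ContextState:
    """Return the next state of one context after coding decision bit d.

    Assumption: the symbol caused a renormalization, so the index moves.
    """
    _qe, nmps, nlps, switch = QE_TABLE[state.index]
    if d == state.mps:
        return ContextState(nmps, state.mps)
    # LPS: follow NLPS and flip the MPS sense when SWITCH is set.
    return ContextState(nlps, state.mps ^ switch)
```

Starting from index 0 with MPS = 0, a decision bit of 1 is an LPS, so the state moves to index 1 and the MPS sense flips to 1 because SWITCH is set for row 0.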
Register A is the interval register and holds the current interval, while register C is the code register and holds the partially coded bits at every stage of encoding. Register A is 16 bits wide and is initialized to 0x8000, which marks the start of the interval. Because the AE algorithm is implemented in fixed-point integer arithmetic, 0x8000 represents the value 0.75, and A is always kept in the range 0.75 to 1.5. The C register is 28 bits wide: the lower 16 bits represent the lower bound of the interval and the upper 12 bits serve as a buffer for overflow.
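To make the fixed-point convention concrete, the following Python sketch converts a register value to its real-valued equivalent and applies the conventional one-symbol update of A and C from [1], before any renormalization. The function names `to_real` and `code_decision` are illustrative assumptions and do not appear in the paper.

```python
def to_real(reg: int) -> float:
    """Map a 16-bit register value to its real value (0x8000 <-> 0.75)."""
    return reg * 0.75 / 0x8000

def code_decision(a: int, c: int, qe: int, is_mps: bool):
    """One-symbol update of A and C, without renormalization.

    Returns (a, c, needs_renorm). Assumes 0x8000 <= a < 0x10000 on entry.
    """
    a -= qe
    if is_mps:
        if a & 0x8000:          # interval still >= 0.75: no renormalization
            return a, c + qe, False
        if a < qe:              # conditional exchange: MPS takes the Qe part
            a = qe
        else:
            c += qe
        return a, c, True
    # LPS path: the LPS normally receives the Qe subinterval.
    if a < qe:                  # conditional exchange: LPS keeps A - Qe
        c += qe
    else:
        a = qe
    return a, c, True
```

For example, with A = 0x8000 and Qe = 0x5601, coding an LPS leaves A = 0x29FF and adds Qe to C, after which A must be renormalized because it is below 0x8000.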
Whenever the value of register A falls below 0.75, the renormalization procedure shifts the A and C registers left until the value of A is at least 0.75. Simultaneously, register Ct is decremented by the number of shifts performed. The initial values of the Ct and B registers are 0x0C and 0x00, respectively. During this procedure, whenever register Ct becomes zero, the previous valid value in register B, if any, is transferred to the output byte stream, which is also the final encoded stream. The byteout procedure is then performed and register B is updated with the new value. The renormalization and the byteout procedures are shown in Fig. 1 [1]. Theoretically, the renormalization procedure can take up to 15 iterations, and hence the byteout procedure can occur at most twice during one renormalization. To distinguish the byte stream from the markers, which start with 0xFF, a bit-stuffing procedure is also carried out for register B. During the bit-stuffing procedure, register Ct is set to 7 instead of 8, since the stuffed bit takes up one bit position.
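The sequential renormalization and byteout procedures can be sketched in Python as follows. This is a software rendering of the conventional flow in [1] as described above, not of the parallel hardware proposed later; the names `byteout`, `renormalize` and the list `out` are assumptions made for illustration. The sketch always emits the pending byte, so a caller would discard the initial 0x00 placeholder that B holds before the first real byte.

```python
def byteout(c: int, b: int, out: list):
    """Emit the pending byte B and refill B from C. Returns (c, b, ct)."""
    if b != 0xFF and c >= 0x8000000:
        b += 1                  # propagate the carry (bit 27 of C) into B
        c &= 0x7FFFFFF          # clear the carry bit
    out.append(b)
    if b == 0xFF:
        # Bit stuffing: the next byte carries only 7 new bits.
        return c & 0xFFFFF, (c >> 20) & 0xFF, 7
    return c & 0x7FFFF, (c >> 19) & 0xFF, 8

def renormalize(a: int, c: int, ct: int, b: int, out: list):
    """Shift A and C left until A >= 0x8000. Assumes 0 < a < 0x8000."""
    while True:
        a <<= 1
        c <<= 1
        ct -= 1
        if ct == 0:
            c, b, ct = byteout(c, b, out)
        if a & 0x8000:
            return a, c, ct, b
```

A pending byte of 0xFF, or a byte that becomes 0xFF after carry propagation, selects the 7-bit refill, which matches the bit-stuffing rule described above.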
PROPOSED ARCHITECTURE
The proposed architecture of the arithmetic coder is illustrated in Fig. 2. This is a two-symbol architecture, which processes two symbols, i.e., two CX-D pairs, per clock cycle. This architecture increases the performance of the EBCOT engine by reducing the arithmetic coding bottleneck. The arithmetic coder has probability estimation tables and a 
Interval update stage
The interval update block diagram is depicted in Fig. 3. In this figure, the A register value is predicted beforehand. Since there are two symbols, two prediction processes are performed: the first part (Fig. 3) predicts the intermediate A after the first symbol, while the second part predicts the final A. The table values used for the first part are based on the first CX-D pair, whereas those used for the second part are based on the second CX-D pair.
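The effect of the two chained predictions can be sketched behaviorally in Python. This is only an illustration of the arithmetic, under the assumption that both Qe values have already been resolved (including the case where both symbols share a context); the names `update_a` and `predict_two` are hypothetical, and the hardware would evaluate the candidate values in parallel rather than call a function twice.

```python
def update_a(a: int, qe: int, is_mps: bool):
    """Return (renormalized A, shift count) for one symbol.

    Assumes 0x8000 <= a < 0x10000 and 0 < qe < 0x8000.
    """
    diff = a - qe
    if is_mps and diff & 0x8000:
        return diff, 0                      # no renormalization needed
    # With conditional exchange, an MPS keeps the larger part and an LPS
    # the smaller part of the split interval.
    new_a = max(diff, qe) if is_mps else min(diff, qe)
    shift = 16 - new_a.bit_length()         # left shifts until bit 15 is set
    return new_a << shift, shift

def predict_two(a: int, first, second):
    """Chain two updates; first and second are (qe, is_mps) tuples."""
    a1, s1 = update_a(a, *first)            # intermediate A
    a2, s2 = update_a(a1, *second)          # final A
    return a2, s1, s2
```

For instance, `predict_two(0x8000, (0x5601, True), (0x3401, False))` yields a final A of 0xD004 with shifts of 1 and 2 for the first and second symbol.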
Code update stage
The block diagram of the code update stage is shown in Fig. 4. This architecture is similar to the one described in [6]. As can be seen from Fig. 4, along with the C register update, the byteout procedure is also carried out, which occurs whenever the Ct register becomes zero. At that point, the byte already available in register B is output and the most significant byte of the C register is moved to register B. Carry propagation and bit-stuffing are also handled during the byteout procedure. In the conventional architecture [1] [7], the renormalization and byteout procedures take place sequentially and are realized by cascading several shifters and conditional selection logic for generating bitstreams. If the C register update for the second symbol is performed after the renormalization and byteout procedures of the first, the delay becomes too long. In that case, updating register A for two symbols simultaneously while updating C sequentially would bring no benefit. Hence, to deal with the code update delay, a novel technique is proposed in which the renormalization and the byteout procedures are undertaken in parallel.
This module updates registers C, Ct, and B. It generates an output bitstream of at most 4 bytes, i.e., B0, B1, B2 and B3. The Mask Generator module generates the required mask, while the Update C module performs the required C register update for the first symbol. The C register update is determined by R, which indicates whether the required shift amount is the lzeroes value from the table or the value determined by the 3 MSBs of A - Qe. The C register update for the second symbol takes place in parallel with the byteout procedure of the first symbol, and hence this update is removed from the critical path.
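Why three MSBs suffice can be checked with a short Python sketch. It assumes that A is at least 0x8000 before the subtraction and that Qe never exceeds 0x5601, the largest value in the standard's table [1], so that A - Qe is at least 0x29FF and needs at most two left shifts. The function names are hypothetical, and R is modelled simply as a flag that is true when the new interval is Qe; the paper's exact definition of R may differ.

```python
def shift_from_msbs(a_minus_qe: int) -> int:
    """Shift amount from bits 15..13 of A - Qe (valid for values >= 0x29FF)."""
    top3 = a_minus_qe >> 13
    if top3 & 0b100:
        return 0
    if top3 & 0b010:
        return 1
    return 2        # bit 13 is guaranteed set when A - Qe >= 0x29FF

def lzeroes(qe: int) -> int:
    """Leading zeros of Qe in 16 bits; in hardware a precomputed table entry."""
    return 16 - qe.bit_length()

def select_shift(a_minus_qe: int, qe: int, r: bool) -> int:
    """Pick the renormalization shift: table value if R, else MSB decode."""
    return lzeroes(qe) if r else shift_from_msbs(a_minus_qe)
```

The same shift amount is applied to both A and C, so selecting it early lets the C update proceed without waiting for an iterative renormalization loop.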
The circuit diagram for the Update C module is shown in Fig. 5. The carry generated by this module is used to derive the conditions for mask generation. The mask generation module is shown in Fig. 6. Based on the value of B, i.e., 0xFE or 0xFF, the bit-stuffing decision is made. After all the symbols of a code block have been encoded, a flush procedure is performed. In this procedure, the C register is filled with as many ones as possible before the final bytes are output to the bitstream. During the flush, a maximum of 3 bytes can be generated simultaneously.
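The step of filling C with ones during the flush can be sketched in Python as follows, assuming the flush follows the conventional software procedure in [1]. The name `set_bits` is an assumption, and the sketch omits the subsequent shifts and byteouts that emit the final bytes.

```python
def set_bits(a: int, c: int) -> int:
    """Fill the low 16 bits of C with ones while staying inside the interval.

    Assumes a >= 0x8000, so the result r satisfies c <= r < c + a.
    """
    upper = c + a               # exclusive upper bound of the interval
    c |= 0xFFFF                 # set as many low bits as possible
    if c >= upper:
        c -= 0x8000             # step back inside the interval
    return c
```

With A = 0x8000 and C = 0x12345, the function returns 0x17FFF, which lies inside the interval that starts at 0x12345 and ends before 0x1A345.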
IMPLEMENTATION RESULTS
The proposed two-symbol architecture is implemented using the Verilog hardware description language (HDL) and synthesized on an Altera Stratix FPGA. The implementation cost is detailed in Table I. The throughput of the proposed architecture, 212 Msymbols/sec, is the highest among the existing methods compared in Table II. The critical path observed is 9.457 ns. It occurs at the C register update, when the flush procedure takes place. Our implementation is also pipeline efficient, with only 4 pipeline stages. The cost of our implementation is 1267 logic elements on an Altera FPGA. Hence, based upon the comparative results in Table II, it can be concluded that the proposed method is more efficient and cost effective than the other methods.
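The relation between the reported critical path and the throughput can be checked with a short calculation, sketched here in Python. It assumes that the clock period equals the 9.457 ns critical path and that exactly two symbols are consumed in every cycle; the function name is illustrative.

```python
def throughput_msymbols(critical_path_ns: float, symbols_per_clock: int):
    """Return (maximum clock in MHz, throughput in Msymbols/sec)."""
    f_max_mhz = 1000.0 / critical_path_ns   # 1/ns is GHz, so x1000 gives MHz
    return f_max_mhz, f_max_mhz * symbols_per_clock

f_max, rate = throughput_msymbols(9.457, 2)
print(f"{f_max:.1f} MHz, {rate:.1f} Msymbols/sec")  # 105.7 MHz, 211.5 Msymbols/sec
```

Under these assumptions the result is close to the 210 to 212 Msymbols/sec quoted in the text.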
CONCLUSION
In this paper, an FPGA-based two-symbol architecture for arithmetic coding is proposed, which encodes two symbols per clock cycle. A prediction process for the upper bound value and the index is proposed, together with efficient renormalization and byteout procedures. The architecture is optimized for timing and cost. It achieves 210 Msymbols/sec and operates above 100 MHz. The critical path is 9.457 ns. The design is synthesized on an Altera Stratix FPGA. Future work includes further improving the throughput so that the architecture can process more than two symbols per clock cycle.
