I. INTRODUCTION
The digitisation of visual information is nowadays common practice in a large number of consumer products such as Personal Digital Assistants (PDAs), digital cameras, palmtop computers and cellular phones, as well as portable video game consoles and portable DVD players [1]. The resulting exponential increase in the amount of digital visual information that must be stored, processed and transmitted efficiently has motivated a large body of research, in both industry and academia, into advanced video coding techniques. New video coding standards such as the recent H264 video codec (also known as MPEG4 part 10) [2] deliver better quality and lower bit rates, but at the expense of an almost exponential increase in the number of CPU cycles required per input frame of video data when compared to previous-generation standards [3]. The introduction of advanced entropy coding within the H264 standard, via the pioneering use of context-based arithmetic coding [4], is one of the reasons behind the increase in the computational cost of the codec. The high-speed arithmetic coder (AC) coprocessor described in this paper has been designed to achieve a significant reduction of the AC computational cost of the H264 standard at modest hardware cost.
This paper is organised as follows: Section II reviews hardware-based binary arithmetic coders. Section III presents the motivation for this work based on a study of the computational costs of AC in the H264 video codec. Section IV presents our novel MZ coder and evaluates its efficiency compared with other arithmetic coding implementations. Section V studies the applicability of the MZ coder to entropy coding within the context of the H264 video standard. Section VI presents the hardware architecture of the MZ codec core and Section VII describes the required ISA extensions. Section VIII presents the implementation results when targeting high-performance FPGA and ASIC technologies. Finally, Section IX summarizes our findings and concludes this work.
J. L. Núñez is with the Department of Electronic and Electrical Engineering, University of Bristol, UK (e-mail: j.l.nunez-yanez@bristol.ac.uk). V. A. Chouliaras is with the Department of Electronic Engineering, University of Loughborough, UK (e-mail: v.a.chouliaras@lboro.ac.uk).
II. HARDWARE-BASED BINARY ARITHMETIC CODERS
The IBM Q-coder [5] and the QM-coder [6] are the best-known examples of hardware-based binary arithmetic coders. These devices use the renormalization approximation introduced by Rissanen in [7] to avoid complex multiplications and divisions. The main difference between them is the complexity of the model, which comprises 128 contexts in the Q-coder and 1024 contexts in the QM-coder. VLSI implementations of both the Q-coder and the QM-coder are reported in [6]. Both hardware algorithms are clocked at 75 MHz with a throughput of around 64 Mbits/second and were implemented in 0.35 µm standard cell technology from IBM (CMOS 5S). The adaptive binary arithmetic coding device presented in [8] replaces the division operation by storing the probability values in a lookup table and using the coder state as a pointer to a particular probability in that table. As in the QM-coder, 1024 contexts are used, each with its own probability state. Multiplications, on the other hand, are done explicitly using an 8x8 parallel multiplier. The VLSI implementation was carried out in TSMC's 0.8 µm standard cell CMOS technology and the chip achieved a maximum frequency of 25 MHz. This device needs 8 clock cycles to complete the probability estimation and arithmetic operation phases, plus a variable number of renormalization cycles. Renormalization is typically done in a single clock cycle, but up to 7 clock cycles may be required. The resulting symbol throughput is approximately 3 Mbits/second. An improved version of that chip is presented in [9], in which a dynamic pipeline architecture is used to deliver 6 Mbits/second throughput at the same clock frequency and in the same technology as the original design, but with approximately 30% area overhead.
III. ANALYSIS OF AC IN THE H264 VIDEO CODEC
The H264 video coding standard is a state-of-the-art algorithm which delivers up to 50% reduction in bit rates when compared to previous standards such as MPEG4 [10] or H263 [11] at equivalent video quality settings. To achieve this, H264 uses a number of innovative techniques for extracting the redundancy present in the video stream. One such technique is the use of an arithmetic coding extension with context-based modeling known as CABAC [12]. The underlying architecture of the H264 standard is similar to that of previous standards such as H263 and MPEG4, and consists of four major computational stages: motion estimation and compensation (ME), discrete cosine transform (DCT), quantization (Q) and entropy coding (EC). These functions are applied to the data blocks that result from dividing the image frame into variable-size subframes. EC is applied last, after the binarization stage, to the data produced by the ME and Q functions. It further removes redundancy and reduces bit rates by using codes with fewer bits for the most probable parameters. In previous standards entropy coding was based on variable-length codes (VLC) derived from the well-known Huffman codes, owing to their simplicity and ease of implementation. H264 can use VLC codes but also permits the use of arithmetic coding as an improved alternative. In the approach used by the CABAC algorithm embedded in the H264 standard, the probability state includes the probability information, and multiplications are replaced by table look-up operations that use the probability state and the two most-significant bits (MSB) of the range as pointers. This translates into a single table of 64 probability states times 4 ranges, that is, 256 8-bit values. Arithmetic is then carried out via simple additions and subtractions. However, the renormalization loop is a sequential process with a cost of up to 6 iterations.
The modelling stage in CABAC uses 2 additional tables for the adaptation process, to calculate the new probability state of the active context depending on whether a Most-Probable-Symbol (MPS) or a Least-Probable-Symbol (LPS) has just been coded. To evaluate the complexity of AC in the H264 codec we selected 7 video sequences organized in 3 standard video formats, namely QCIF, CIF and SDTV. The video sequences and the chosen configuration of the H264 video codec are summarized in Table 1. Initial profiling of the algorithm was carried out in the SimpleScalar [13] processor simulation environment. Profiling data indicates that the complexity of AC during decoding is a fraction of the complexity during the coding process. Nevertheless, we chose to support both coding and decoding acceleration in the hardware coprocessor to offer a self-contained solution. These results are summarized in Table 2 for the QCIF format.
The number of calls to the AC routine increases substantially at higher quality settings and for larger video formats. This is illustrated in Figs. 1 and 2, which depict the number of AC calls per frame and the associated PSNR values as a function of the quantization parameter QP. One AC call is needed for each bit of data (symbol) that must be coded. The AC costs range from 1 million calls per frame to approximately 60 million calls per frame for a QP of 16, depending on the input video format. The high coding computation requirements are largely due to the Rate Distortion Optimization (RD) [14] in the H264 standard. In RD optimisation each macroblock is coded with different modes and the one that minimizes the rate-distortion curve is selected. In addition to the number of calls, algorithm profiling indicates that an average of 62 instructions for coding and 42 instructions for decoding are needed for each AC function call. When targeting an embedded, scalar, RISC CPU like the SPARC-compliant Leon2 [15] used in this work, this translates into approximately 100 CPU clock cycles per AC call (or bit) in the coding phase, given an average clocks-per-instruction (CPI) ratio of approximately 1.6. We can therefore safely conclude that AC, in the context of advanced video coding, is a very compute-intensive operation. Since traditional parallelizing techniques such as SIMD [16] extensions cannot accelerate this essentially sequential process, the introduction of dedicated hardware support in the form of a specialized coprocessor is a suitable solution.
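The cost figures above can be combined into a back-of-the-envelope estimate of the AC workload per frame. The following is only an illustrative sketch: the function names are hypothetical, and the inputs are simply the profiled values quoted in the text.

```python
def cycles_per_ac_call(instructions_per_call: float, cpi: float) -> float:
    """Estimated CPU clock cycles spent in one AC call (one coded bit)."""
    return instructions_per_call * cpi


def cycles_per_frame(calls_per_frame: float,
                     instructions_per_call: float,
                     cpi: float) -> float:
    """Estimated CPU clock cycles spent in the AC routine for one frame."""
    return calls_per_frame * cycles_per_ac_call(instructions_per_call, cpi)


# Values quoted in the text: 62 instructions per coding call, CPI of about 1.6.
per_call = cycles_per_ac_call(62, 1.6)
print(f"cycles per AC call: {per_call:.1f}")

# The two ends of the reported range of AC calls per frame.
for calls in (1e6, 60e6):
    total = cycles_per_frame(calls, 62, 1.6)
    print(f"{calls:.0e} calls/frame -> {total:.2e} cycles/frame")
```

Under these assumptions a single frame can require on the order of 10^8 to 10^9 CPU cycles for arithmetic coding alone, which is consistent with the conclusion that software-only AC is very compute-intensive.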
IV. PROPOSED ARITHMETIC CODING ALGORITHM
The original Z-coder software algorithm was developed by AT&T Labs [17] as a generalization of the Golomb/Rice coder for lossless coding of bi-level images. Golomb/Rice coding is used to code a run of r consecutive occurrences of an MPS followed by a single occurrence of an LPS, using a parameter m to control how many MPS fit in one bit of code and also how many bits of code are required to code an LPS. The code has two components: the first component is $\lfloor r/m \rfloor$ 1's followed by a single 0, while the second component is r mod m, coded as an ordinary binary number with $\log_2(m)$ bits. Although Golomb codes are easy to implement, their limitation is that the chosen parameter m is only good for a single probability distribution, whereas a general compression system has to be able to deal with arbitrary sequences of events with different probabilities. The Z-coder aims to remove this limitation. Z-coding is the same as Golomb coding, with the advantage that the parameter m can be changed for each symbol being coded. The extra complexity of the algorithm is small and more details can be found in the original paper [17]. Our work has focused on maintaining the simplicity of the Z-coding algorithm while increasing its suitability for hardware implementation, and this is where our novelty lies. The resulting MZ-coder balances the complexity of coding the MPS and LPS, simplifies the precision of the arithmetic and handles special hardware borrow conditions while maintaining coding efficiency and achieving high performance via a fully pipelined microarchitecture. To validate the efficiency of the MZ-coder in a general compression environment, a software implementation has been developed in which a sophisticated variable-order Markov model [18] has been coupled to a selection of 3 arithmetic coders: the Lei coder, the Bmult coder and our own MZ coder. The Lei coder improves the coding efficiency of the Q-coder and a detailed description is available in [19].
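The Golomb/Rice run code described above can be sketched in a few lines. This is a minimal illustration rather than the Z-coder or MZ-coder itself; it assumes m is a power of two (m = 2^k), and the function names are hypothetical.

```python
def golomb_rice_encode_run(r: int, k: int) -> str:
    """Encode a run of r MPS symbols terminated by one LPS, with m = 2**k.

    Returns the code as a string of '0'/'1' characters for readability.
    """
    if r < 0 or k < 0:
        raise ValueError("r and k must be non-negative")
    m = 1 << k
    quotient, remainder = divmod(r, m)
    unary = "1" * quotient + "0"                       # floor(r/m) ones, then a 0
    binary = format(remainder, f"0{k}b") if k > 0 else ""  # r mod m in log2(m) bits
    return unary + binary


def golomb_rice_decode_run(code: str, k: int) -> int:
    """Recover the run length r from a code produced by the encoder above."""
    quotient = code.index("0")                         # number of leading ones
    remainder = int(code[quotient + 1:quotient + 1 + k], 2) if k > 0 else 0
    return (quotient << k) + remainder


# A run of 9 MPS with m = 4: two ones, a zero, then 9 mod 4 = 1 in two bits.
print(golomb_rice_encode_run(9, 2))
```

With a fixed k, long runs are cheap only if the MPS probability matches the chosen m, which illustrates why a coder that can change m for each symbol is attractive.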
The Bmult algorithm uses the standard method proposed in [18] with full-precision integer multiplications. These two known arithmetic coders and the MZ-coder have been compared with the information content of the Markov modeler, measured using floating-point arithmetic by the equation $\text{symbol\_bits} = -\log_2(\text{symbol\_probability})$. This equation bounds the theoretical compression for the given model. Our experimentation is based on two standard data sets commonly used in the literature: the Calgary and the Canterbury [20] data sets. Figs. 3 and 4 show the percentage compression degradation (Y axis) as a function of the block size (X axis). The best performer is, as expected, the Bmult algorithm using full-precision integer multiplications. The two multiplication-free coders perform similarly, with a maximum degradation of around 1%, although the MZ coder outperforms the Lei coder at all block sizes. Additionally, the MZ coder performs very well for small block sizes, outperforming the information content of the model given by the floating-point arithmetic. The reason is that the MZ algorithm has been designed to predict symbols with a slightly higher level of confidence than that obtained from the probability data provided by the model. Extensive simulation has shown that slight over-predictions are particularly beneficial for small block sizes, where the limited amount of data available prevents model construction from entering a stable state.
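The reference measure used in this comparison can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the degradation is taken here to be the relative excess of the coder output over the information content of the model, which is one plausible reading of the Y axis of Figs. 3 and 4.

```python
import math
from typing import Iterable


def information_content_bits(symbol_probabilities: Iterable[float]) -> float:
    """Sum of -log2(p) over the probabilities the model assigned to each coded symbol."""
    return sum(-math.log2(p) for p in symbol_probabilities)


def degradation_percent(coder_bits: float, model_bits: float) -> float:
    """Relative excess of the coder output over the model's information content.

    A negative value means the coder produced fewer bits than the model bound.
    """
    return 100.0 * (coder_bits - model_bits) / model_bits


# Three symbols predicted with probabilities 1/2, 1/4 and 1/4 carry 1 + 2 + 2 bits.
bound = information_content_bits([0.5, 0.25, 0.25])
print(bound, degradation_percent(5.05, bound))
```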
Apart from offering good coding efficiency, one of the main attractions of the MZ-coder is its fast renormalization. The original AC algorithm present in CABAC uses a variable-cycle renormalization stage (from 0 to a maximum of 6 cycles) to keep the state variables in the required range. This variable renormalization latency is due to the inner dependencies between the state variables and the renormalization loop. The costs of multiple-cycle renormalization account for a performance degradation of around 15%.
On the other hand, the renormalization process in the MZ-coder does not include internal dependencies. As a result, it can be readily accomplished with a shift-left operation. This is illustrated in the pseudocode of Fig. 6, which also shows the internal dependencies of low inside the while loop in the CABAC case. This MZ-coder feature guarantees a data-independent throughput of 1 symbol per clock cycle and simplifies the control data path.
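The variable renormalization latency on the CABAC side can be illustrated with a small sketch. It assumes the 9-bit range register and the renormalization threshold of 256 used by CABAC in the H264 standard, and it deliberately omits the handling of low and of the output bits, which is where the dependencies mentioned above arise. The function name is hypothetical.

```python
CABAC_RANGE_MIN = 0x100  # assumed threshold: the 9-bit range must be at least 256


def cabac_style_renorm_iterations(range_value: int) -> tuple[int, int]:
    """Return (renormalized range, loop iterations) for a CABAC-style coder.

    Only the range is modeled; the updates to `low` and the emitted bits,
    which make each iteration depend on the previous one, are left out.
    """
    if not 0 < range_value < 0x200:
        raise ValueError("range must be a non-zero 9-bit value")
    iterations = 0
    while range_value < CABAC_RANGE_MIN:
        range_value <<= 1
        iterations += 1
    return range_value, iterations


# A very small range after coding an LPS needs the maximum number of iterations.
print(cabac_style_renorm_iterations(6))
```

The iteration count depends on the data, so a hardware implementation of this loop has a data-dependent symbol latency unless extra logic is added.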
V. VIDEO CODING EFFICIENCY OF THE MZ ALGORITHM
The MZ algorithm has been incorporated into the JM 7.3 H264 reference software [21] and its coding efficiency measured using the video sequences of Table 1 .
Figs. 6 and 7 depict the coding efficiency of the proposed MZ coder and the VLC coder versus the CABAC algorithm. Fig. 6 shows that the performance of CABAC and that of the MZ coder are virtually indistinguishable. On the other hand, the simple VLC codes increase the bit rates by around 8% for these video sequences, with the effect being more noticeable for the large SDTV format.
These results have been verified by decoding the resulting bit files using the corresponding entropy decoders for each of the 3 options tested (VLC, MZ, CABAC). The coprocessor has been coupled to a SPARC V8 [22] compliant embedded CPU [15] which includes a standard, 5-stage RISC pipeline. The Leon2 processor was selected for this work because of its open-source nature: full access to the RTL source code makes it easier to integrate the coprocessor pipeline into the Leon2 data path. The following sections describe the main modules of the MZ-coder.
A. Arithmetic Coding Coprocessor Description
1) MZ coder arithmetic
The hardware implementation reduces the arithmetic precision to 8 bits and the precision of the subend and range registers to 7 bits from the original 17 bits and 16 bits in the Z-coder [17] software algorithm respectively. This precision is sufficient to handle the minimum symbol frequency of 1/128 as fixed by the LPS table, without affecting coding efficiency. The renormalization is done in parallel for range and subend and in the same pipeline cycle as the rest of the MZ arithmetic. The number of bits that must be added to the output code depends on the amount of renormalization needed in the range value so that the range is kept between "0000000" and "1000000". Shifting must be done until the MSB of the range value is 0. The shift value ranges from 0 when no shifting is required to 7 when the input range is "1111111" and 7 shift operations are required to obtain "0000000". The code bits output from the MZ arithmetic stage are buffered in the code buffer stage.
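The single-step renormalization of the range register described above can be modeled as follows. This is a behavioral sketch only, based on the rule stated in the text (shift until the MSB of the 7-bit range is 0); it models the range register alone, and the function names are hypothetical. In hardware the loop below would correspond to a priority encoder feeding a shifter rather than to sequential iterations.

```python
RANGE_BITS = 7
RANGE_MASK = (1 << RANGE_BITS) - 1


def leading_ones(value: int, width: int = RANGE_BITS) -> int:
    """Count consecutive 1 bits starting from the MSB of a `width`-bit value."""
    count = 0
    for bit in range(width - 1, -1, -1):
        if (value >> bit) & 1:
            count += 1
        else:
            break
    return count


def renormalize_range(range_value: int) -> tuple[int, int]:
    """Return (renormalized 7-bit range, shift value) in a single step.

    The shift value is also the number of bits added to the output code.
    """
    range_value &= RANGE_MASK
    shift = leading_ones(range_value)
    return (range_value << shift) & RANGE_MASK, shift


# The extreme case quoted in the text: "1111111" needs 7 shifts and becomes "0000000".
print(renormalize_range(0b1111111))
```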
2) Code buffer
The code buffer stage is required to control possible borrow bits originating in the previous stage that could affect the value of the bits contained in the code buffer. A total of 8 bits are buffered in this stage. A number of bits, as defined by the shift value, must be inserted at the least-significant end of the code buffer. As a result of the MZ arithmetic, an overflow is possible in the subend register. As long as the value stored in the buffer is different from 0, the borrow is stopped in the code buffer register. If the value of the buffer is 0, the borrow propagates out into bits that have already been sent to the code generator stage (discussed next), and the current output is formed by as many bits set to 1 as specified by the shift signal, since borrow propagation in the code buffer swaps all the bits from 0 to 1. The code generator stage handles possible borrows originating in the code buffer by not outputting bits until a bit set to 1 has been received from the code buffer stage. The bit set to 1 acts as a barrier for any borrows being propagated out of the code buffer.
3) Code generator
The code generator takes the 0 to 7 bits produced by the code buffer and the zero run count to build a code of up to 14 bits. The zero run register counts the number of consecutive 0 bits in the input. These bits are the equivalent of the bits_to_follow variable used by software arithmetic coders [4] . The output is formed as a bit set to 1 plus zero run bits set to 0 when the first bit set to 1 from the code buffer is received after a run of consecutive 0's. Output is then possible since the bit set to 1 will block any possible borrows originating in any following coding events. In software the bits_to_follow counter is a simple integer variable but in hardware this could leave an undefined and potentially unlimited number of bits in the coder pending to be output. This is undesirable from a latency and complexity point of view so instead of using a large 32-bit register, a 3-bit counter is utilized to keep track of the zero run count. This mechanism means that only a maximum of 7 bits could be left in the coder pending to be output. The maximum length codeword that the bit packer should be able to handle is therefore given by:
Max length codeword = 7 new bits + 7 bits pending = 14 bits.
It is possible that more than 7 bits set to 0 are received in which case the zero run counter will overflow. To avoid this situation the hardware emits the pending 7 bits set to 0 preceded by a bit set to 1 in a speculative manner. The first bit of the next output codeword will indicate if the bits emitted speculatively must be inverted. The decoder will extract this bit from the data stream, negate it and subtract the result from the previous code, effectively transforming any code bit sequence of "10000000" to "01111111". This process adjusts the code to the correct value and performs a similar function to the stuffing bit suggested by IBM in their Q-coder [5] . The extra bit is part of the next codeword and will have the value 0 if and only if a borrow bit originated in the code buffer stage. Potentially a run of 7 consecutive 0s could be followed by another run of 7 consecutive 0s overflowing the zero run count again. This is not a problem and the hardware will emit codewords as normal. The potential long borrow will not cause the decoder to fail because all the coding events that were coded previous to the event that produced the borrow can be decoded without any borrow propagation. The fundamental requirement to guarantee correct decoding is that the borrow must be propagated in the decoder before the decoder tries to decode the bit that produced that borrow in the first place.
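The decoder-side correction of speculatively emitted bits can be illustrated with a short sketch. It follows the description above (negate the extra bit and subtract the result from the previous code); the function name and the 8-bit width used in the example are assumptions made for illustration.

```python
def apply_extra_bit(previous_code: int, extra_bit: int, width: int = 8) -> int:
    """Adjust a speculatively emitted code using the extra bit that follows it.

    extra_bit == 0 signals that a borrow occurred, so 1 is subtracted;
    extra_bit == 1 leaves the previous code unchanged.
    """
    if extra_bit not in (0, 1):
        raise ValueError("extra_bit must be 0 or 1")
    borrow = 1 - extra_bit  # negate the extra bit
    return (previous_code - borrow) & ((1 << width) - 1)


# A borrow turns "10000000" into "01111111"; without a borrow the code is kept.
print(format(apply_extra_bit(0b10000000, 0), "08b"))
print(format(apply_extra_bit(0b10000000, 1), "08b"))
```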
4) Code packer
The variable number of bits produced by the code generator is finally pipelined to the code packer whose function is to pack the variable length codewords into fixed-length 8-bit codewords, ready to be output. Since up to 7 bits can be left inside the code packer without generating any output and up to 14 bits can be forwarded by the code generator stage in every cycle, the width of the packer register has to be at least 21 bits to be able to store all the data bits in this particular case.
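The sizing argument for the packer register can be checked with a small behavioral model. This is a sketch, not the hardware design: the class and attribute names are hypothetical, and bits are assumed to be packed MSB first.

```python
class CodePacker:
    """Packs variable-length codewords (0 to 14 bits) into 8-bit output words."""

    def __init__(self) -> None:
        self.register = 0       # pending bits, most recent bits at the LSB end
        self.count = 0          # number of valid pending bits
        self.max_occupancy = 0  # largest number of bits ever held at once

    def push(self, codeword: int, length: int) -> bytes:
        """Append `length` bits of `codeword` and return any complete bytes."""
        if not 0 <= length <= 14:
            raise ValueError("codeword length must be between 0 and 14 bits")
        if codeword < 0 or codeword >> length:
            raise ValueError("codeword does not fit in the given length")
        self.register = (self.register << length) | codeword
        self.count += length
        self.max_occupancy = max(self.max_occupancy, self.count)
        out = bytearray()
        while self.count >= 8:
            self.count -= 8
            out.append((self.register >> self.count) & 0xFF)
        self.register &= (1 << self.count) - 1  # keep only the leftover bits
        return bytes(out)


packer = CodePacker()
packer.push(0b1010101, 7)                      # 7 bits stay pending, no output yet
emitted = packer.push(0b11111111111111, 14)    # worst case: 7 pending + 14 new bits
print(emitted.hex(), packer.count, packer.max_occupancy)
```

In this worst case the model holds 21 bits before any byte is extracted, which matches the minimum register width derived in the text.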
B. Arithmetic Decoding Coprocessor Description
1) Process run
The module to process the zero runs checks if 7 consecutive bits are set to zero with the help of the zero run register. If this condition is detected, the next bit corresponds to an extra bit added by the coder. This bit is removed from the coded data stream and a borrow_propagate signal is forwarded to the next pipeline stage to adjust the rest of the codeword bits accordingly, before they are used by the decoder. If the bit is set to 1 no borrow propagation is needed but if the bit is set to 0 a borrow propagation must take place that will be stopped by the first bit set to 1 in the code buffer register.
2) Assemble new data and Shift out old data
This block buffers the codeword bits before they are required by the MZ-decoder arithmetic. To increase the amount of parallelism between the MZ decoder arithmetic and the assembly-shift operation, data is concatenated in the assemble_new_data module before the arithmetic logic has determined how many bits must be discarded. Once this value is known, old data is shifted out and the codeword is rebuilt in the shift_out_old_data module, using the arithmetic-adjusted codeword and the new data. The codeword is then registered in the code buffer, ready to start a new decoding operation. Since new data is assembled without knowing how many bits are going to be discarded by the decoding arithmetic, enough bits must be present so that, in the case where no data is assembled but the maximum number of bits is discarded, enough valid bits remain for the next decoding cycle. It is also critical to propagate the borrow signal as far as the first bit set to 1. At least 6 bits of codeword must always be valid. If a decoding operation can consume up to 6 bits and at least 6 bits must remain valid for the next decoding operation, then at least 12 bits must be valid at the start of the cycle. This means that data must be added when fewer than 12 valid bits remain in the code buffer register. The code buffer register must therefore be at least 19 bits wide to be able to store the total of 8 new bits plus 11 bits of codeword.
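The buffer sizing argument above reduces to a few lines of arithmetic, sketched here with hypothetical constant and function names. The three input values are those stated in the text.

```python
MAX_BITS_CONSUMED = 6       # bits one decoding operation can consume
MIN_VALID_AFTER_DECODE = 6  # bits that must remain valid for the next operation
NEW_BITS_PER_REFILL = 8     # bits added each time new data is assembled

# Fewer valid bits than this at the start of a cycle means new data must be added.
REFILL_THRESHOLD = MAX_BITS_CONSUMED + MIN_VALID_AFTER_DECODE

# The fullest the buffer can be when a refill is triggered, plus the new bits.
MIN_REGISTER_WIDTH = (REFILL_THRESHOLD - 1) + NEW_BITS_PER_REFILL


def needs_refill(valid_bits: int) -> bool:
    """True when new data must be assembled before the next decoding cycle."""
    return valid_bits < REFILL_THRESHOLD


print(REFILL_THRESHOLD, MIN_REGISTER_WIDTH)
```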
3) MZ decoder arithmetic
The decoding arithmetic follows that presented in the Z-coder paper [17], with the benefit of using only 7-bit precision and having balanced MPS/LPS branches, as in the proposed coding hardware. The arithmetic circuits have been designed to maximise throughput by performing as many operations in parallel as possible. Finally, the DMA/AHB bus controller moves data between the internal coprocessor FIFOs and main memory.
VII. ISA EXTENSIONS
A total of 5 instructions have been added to the SPARC V8 ISA to support the coprocessor. The MZ_code_mps and MZ_code_lps instructions advance the MZ pipeline and are used each time the main processor enters the AC routine. The data transferred to the MZ module when either of these two instructions is executed is the 6-bit probability state. The MZ arithmetic corresponding to an MPS or LPS coding event is then executed and the results are forwarded to the next pipeline stage (code buffer). The MZ state stored in the range and subend registers is also updated. The data path from the MZ arithmetic to the execution stage in the Leon processor returns the number of bits needed by the executed coding instruction. The video codec software uses this information to calculate the current coding bit costs. The rate distortion optimisation accepts or rejects a sequence of coding events depending on the value of this cost and the current PSNR value. This means that two extra instructions are required to accept or reject previously executed coding events: MZ_comit and MZ_reset. Additionally, although not shown in the figure, there is a set of equivalent MZ state registers corresponding to the hidden state. These registers are updated by the MZ_comit instruction and are used to update the visible state by the MZ_reset instruction. The purpose of the MZ_comit and MZ_reset instructions is therefore to accept or reject previously executed coding events by updating the coprocessor state registers. The decoder requires another instruction extension called MZ_decode_s. Once the decoding process starts, the coprocessor state machine (not shown) fills up the code buffer register independently of the code running on the main CPU. Once a decode instruction is received, some of these bits are used to generate a decode MPS/LPS signal that indicates whether a most probable symbol or a least probable symbol has been decoded.
The software running on the main CPU interprets this signal as a bit set to 1 or a bit set to 0 depending on which symbol (0 or 1) is the most probable symbol. A valid signal indicates to the main CPU whether the code buffer contained enough bits for the instruction to complete. If it did not, the main CPU must reset the state of the decoder engine using MZ_reset and execute the decode instruction again. Finally, the state registers are pipelined at each level and move along the data path pipeline with the rest of the codeword data. This is necessary to handle possible exceptions originating in the main CPU data path that would cause the main pipeline to restart and the same instruction to be executed more than once. Should a software exception happen in the main processor, a restart signal originating in the exception logic unit of the Leon2 CPU will load the pipeline state information into the corresponding state registers in the MZ coprocessor.
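The interaction between the visible and hidden state registers can be illustrated with a small behavioral model. This is a sketch under stated assumptions: the class and method names are hypothetical, and the MZ arithmetic itself is not modeled, so the new range and subend values are supplied by the caller as a stand-in for the effect of MZ_code_mps or MZ_code_lps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MZState:
    """The two 7-bit state registers named in the text."""
    range_: int = 0
    subend: int = 0


class MZStateModel:
    """Models only the accept/reject behavior of MZ_comit and MZ_reset."""

    def __init__(self) -> None:
        self.visible = MZState()  # state updated by every coding instruction
        self.hidden = MZState()   # last accepted state

    def code(self, new_range: int, new_subend: int) -> None:
        """Stand-in for MZ_code_mps / MZ_code_lps: updates the visible state."""
        self.visible = MZState(new_range & 0x7F, new_subend & 0x7F)

    def commit(self) -> None:
        """MZ_comit: accept the coding events executed so far."""
        self.hidden = self.visible

    def reset(self) -> None:
        """MZ_reset: reject coding events executed since the last commit."""
        self.visible = self.hidden


model = MZStateModel()
model.code(5, 9)
model.commit()      # the rate distortion optimisation accepts this mode
model.code(17, 3)   # trial coding of another mode
model.reset()       # the trial is rejected, so the accepted state is restored
print(model.visible)
```

Keeping the accepted state in a separate register set lets the software try several coding modes per macroblock without re-running the accepted coding events.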
VIII. IMPLEMENTATION
To verify the functionality and performance of the AC coprocessor we have integrated the core in a SoPC platform implemented using an Altera APEX20KE PCI development board. The main components of the SoPC platform are illustrated in Fig. 9 .
The AMBA AHB subsystem incorporates a total of 5 masters (Debug Support Unit, Leon2 Processor, AC coprocessor, DMA Engine, PCI Bridge Interface) and 2 slaves (memory controller and AHB/APB Bridge). The Control registers module, instantiated as a slave on the AMBA APB bus, controls the execution of the H264 binary on the FPGA board. A total of 5 extra registers have been added to the standard Leon system for control purposes. The interrupt register is hardwired to the open-drain INTAN signal on the PCI bus. When one of the bits in the interrupt register is set to zero the INTAN signal goes low. It is the responsibility of the application driver running on the host computer to remove this interrupt by writing 0xFFFFFFFF to the interrupt register. The debug support unit can be used to help debug an application running on the target hardware. The MZ coprocessor can be clocked at up to 50 MHz in this technology, but the Leon2 processor and the Opencores PCI Bridge are limited to 33 MHz. The complexity and performance details of the FPGA implementation are shown in Table 3. The AC coprocessor reduces the complexity of arithmetic coding by more than an order of magnitude in this configuration.
The chosen silicon technology for the VLSI macro was the UMC 0.13 µm process. Once the routability aspect of the design was achieved, the original logical netlist was read into Physical Compiler once more, but now with real physical constraints applied. These constraints specified the utilization factor, aspect ratio and die size (derived from the previous MPC run), the power ring dimensions, the width and number of power trunks, the pin (port) locations and, finally, the characteristics of the power straps. The design was then reoptimized and passed to SoC Encounter for the final Place and Route run. The maximum operating frequency was 330 MHz worst-case (a throughput of 330 MSymbols/second) and the complexity of the coder and decoder together is 5600 standard cells. Fig. 10 depicts the final placement and layout of the arithmetic coding/decoding coprocessor.
IX. CONCLUSIONS
This paper presented an innovative hardware architecture for arithmetic coding based on the simple Golomb codes that enables a data-independent throughput of 1 symbol per clock cycle without affecting coding efficiency. The MZ-coder has been applied to the problem of accelerating the compute-intensive entropy coding functions in the state-of-the-art H264 video coding standard and has been shown to deliver equivalent bit rates while eliminating the need for multiple renormalization cycles. The hardware has been verified using low-cost FPGA technology and shown to have modest requirements in terms of silicon area while achieving good results in terms of clock rate. The SoPC platform combines the open-source Leon2 processor with the proposed accelerator and has been shown to reduce the complexity of AC by more than an order of magnitude. Subsequently, a VLSI implementation was carried out in a high-performance 0.13 µm silicon process and the resulting macrocell achieved a throughput of 330 MSymbols/second. The H264 video coding standard is expected to be the enabling technology in the near future for personal multimedia communications. Major efforts are currently under way within industry and academia to accelerate the compute-intensive motion estimation, transform and quantization functions by developing fast algorithms and exploiting the available data-level parallelism. Entropy coding based on arithmetic coding is mainly a sequential process and is not well suited to this kind of optimization. Its acceleration with the proposed hardware architecture could play a major role in bringing real-time H264 video coding within the grasp of low-power embedded devices.
