In addition to coding efficiency, the scalable extension of H.264/AVC provides good functionality for adaptation in heterogeneous environments. Fine grain scalability (FGS) is a technique to extract video at the best quality level under the available bandwidth. In this paper, an architecture of FGS encoder with low external memory bandwidth and low hardware costs is developed. Up to 92% bandwidth reduction is attained by the proposed scan bucket algorithm, early context modeling with context reduction, and first scan pre-encoding. The area-efficient architecture is implemented by layer-wise hardware reuse, and three design strategies for the enhancement layer coder are explored to allow a trade-off between external memory bandwidth and silicon area. The design can encode HDTV 1280×720 video in real time at a working frequency of 130 MHz.
INTRODUCTION
In the past, coding efficiency was the main target of video coding standards such as MPEG-2 and H.264/AVC. Recently, the scalable extension of H.264/AVC has been developed by MPEG due to the prevalence of streaming applications. The current Joint Scalable Video Model (JSVM) [1] [2] supports temporal, spatial, and signal-to-noise ratio (SNR) scalability. This means that various network streaming clients, such as mobile phones, personal computers, and high-definition TVs (HDTVs), can display the same video at different specifications from one scalable-encoded bitstream. Basically, temporal scalability is derived from the hierarchical B-frame coding structure [3]; spatial scalability is based on a pyramid coding scheme; SNR scalability is realized by an embedded quantization approach improved from the previous MPEG-4 standard [4]. In the current JSVM, SNR scalability can be provided via two strategies, coarse grain scalability (CGS) and fine grain scalability (FGS). CGS adds a layer with a smaller quantization parameter (QP), like an additional spatial layer. In general, the QP of the new CGS layer is much smaller than the original one. However, CGS can only provide several pre-defined quality points. Therefore, FGS is proposed to provide any quality point according to the user's bandwidth capability. Fig. 1 shows [5] . The CABAC encoder is composed of three main blocks. The binarizer transforms syntax elements into binary symbols. The context modeller classifies these symbols with different statistical properties into respective categories, i.e. contexts. The binary arithmetic coder uses a recursive interval-subdividing procedure to compress symbols into the bitstream.
The work is supported by National Science Council.
In this paper, we propose a high-performance architecture of FGS encoder with CABAC for the H.264/AVC scalable extension. In Section 2, the FGS coding flow is introduced. Three main design challenges are discussed in Section 3. The proposed algorithms and architecture that overcome the three challenges are described in Sections 4 and 5. Section 6 shows the implementation results, and Section 7 is the conclusion.
OVERVIEW OF FINE GRAIN SCALABILITY IN H.264/AVC SCALABLE EXTENSION
Unlike bit-plane truncation in MPEG-4 FGS and JPEG2000, layer truncation is adopted in the H.264/AVC scalable extension. There are one SNR base layer and at most three SNR enhancement layers. The residual coding of the base layer is the same as in H.264/AVC. After the base layer is transmitted, the bitstream can be truncated at any point in any enhancement layer. Fig. 2 shows the block diagram of FGS encoding in the case of three enhancement layers. First, four SNR layers are generated from four cascaded reconstruction loops with different QPs. The largest QP is used in the base layer, and the QP decreases by six from each layer to the next. In each enhancement layer, only the differences between the transformed coefficients and the accumulated inverse-quantized ones from the preceding layers are coded. Note that normalization must be done before the subtraction to adjust the level difference. Because the amount of bitstream that the decoder obtains is uncertain, I and P frames usually choose the reconstruction at the lowest layer to avoid error propagation, whereas B frames choose the reconstruction at the highest layer to maintain prediction precision.
According to the coefficients in preceding layers, the coefficients in enhancement layers are classified into two categories, i.e. new coefficients (NCs) and refinement coefficients (RCs). An NC is a coefficient that has not been significant in any preceding layer, and its whole value must be coded. An RC is a coefficient that is significant in at least one preceding layer, and its value is limited to {-1, 0, +1}. In Fig. 2, the quantized coefficients in enhancement layers are classified into NCs and RCs, and then the values of RCs are truncated to the range -1 to 1.
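The classification can be illustrated with a small numeric sketch. It assumes a plain round-to-nearest scalar quantizer whose step halves from layer to layer (a QP decrease of six halves the quantization step in H.264/AVC). The actual JSVM quantizer and the normalization step are not reproduced, and the function name `encode_layers` is hypothetical.

```python
import math


def encode_layers(coeff, base_step, num_enh_layers=3):
    """Return ([(category, level) per layer], final reconstruction)."""
    recon = 0.0          # accumulated inverse-quantized value of preceding layers
    significant = False  # has any preceding layer produced a non-zero level?
    step = float(base_step)
    layers = []
    for _ in range(num_enh_layers + 1):  # layer 0 is the base layer
        # Assumption: simplified round-to-nearest quantization of the difference.
        level = math.floor((coeff - recon) / step + 0.5)
        if significant:
            category = "RC"
            level = max(-1, min(1, level))  # refinement limited to {-1, 0, +1}
        else:
            category = "NC"                 # whole value is coded
        layers.append((category, level))
        recon += level * step
        significant = significant or level != 0
        step /= 2  # assumption: QP - 6 halves the quantization step
    return layers, recon
```

For a coefficient of 37 with a base step of 16, this sketch yields an NC level of 2 in the base layer, followed by three RCs (+1, -1, +1) and a reconstruction of 38. A coefficient of 3 stays an insignificant NC until the second enhancement layer, where it becomes significant, and is an RC afterwards.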
To achieve progressive improvement over the entire frame, the FGS coding order becomes multiple scans through every macroblock (MB) in one frame, instead of the traditional MB-by-MB order. Thus in Fig. 2, the coefficients of the entire frame are stored in frame memory, and the FGS scan and entropy coding of enhancement layers are frame-level operations. Fig. 3 shows the FGS coding order in JSVM 7 with only four blocks in one frame and eight coefficients in one 
DESIGN CHALLENGES
Compared with H.264/AVC and previous video standards, three main challenges arise in FGS encoder in H.264/AVC scalable extension.
High external memory bandwidth
In the case of HDTV 1280×720 with three enhancement layers, a direct implementation of the three frame memories in Fig. 2 takes 6.74 MBytes, which is too large to be implemented as on-chip SRAM. However, employing off-chip memory requires 404 MBytes/sec of external memory bandwidth (6.74 MBytes × 30 frames per second × 2 for read and write). This number will be even higher if more than one spatial layer is considered. Therefore, it is necessary to reduce the large external memory bandwidth so that bus congestion can be relieved and the large power consumption of external memory access can be avoided.
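The two figures above can be checked with a few lines of arithmetic. The 4:2:0 chroma format and the 13-bit coefficient width are assumptions, not statements from the text. They are chosen here because together they reproduce the quoted 6.74 MBytes.

```python
WIDTH, HEIGHT = 1280, 720   # HDTV resolution, from the text
FRAMES_PER_SECOND = 30      # from the text
ENHANCEMENT_LAYERS = 3      # one frame memory per enhancement layer (Fig. 2)
SAMPLES_PER_PIXEL = 1.5     # assumption: 4:2:0 chroma format
BITS_PER_COEFFICIENT = 13   # assumption: reproduces the quoted 6.74 MBytes

coefficients_per_frame = int(WIDTH * HEIGHT * SAMPLES_PER_PIXEL)
bytes_per_frame_memory = coefficients_per_frame * BITS_PER_COEFFICIENT / 8
total_bytes = ENHANCEMENT_LAYERS * bytes_per_frame_memory

# Each frame memory is written once and read once per frame.
bandwidth = total_bytes * FRAMES_PER_SECOND * 2

print(f"{total_bytes / 1e6:.2f} MBytes")
print(f"{bandwidth / 1e6:.0f} MBytes/sec")
```

Under these assumptions the script prints 6.74 MBytes and 404 MBytes/sec, matching the numbers in the text.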
Frame-level irregular data access
Unlike the traditional raster MB-by-MB coding order, coefficients in one MB are coded in many scans, and all MBs in the whole frame are examined in every scan. The number, categories, and positions of coefficients coded in one scan are quite irregular, as shown in Fig. 3 (b). Such irregular external data access is difficult to realize. Moreover, redundant accesses might further increase the bandwidth described in 3.1.
Fig. 4. Scan bucket algorithm corresponding to Fig. 3
High computation
The required computation in H.264/AVC includes only the base layer part in Fig. 2. Since FGS requires an individual reconstruction loop and entropy coding for each enhancement layer, the computation of FGS is several times that of H.264/AVC. Besides, the computation grows further as the number of spatial layers increases. Therefore, the FGS encoder should be carefully designed to avoid very high hardware costs.
In Section 4, three algorithms are proposed to overcome the challenges of high external memory bandwidth and frame-level irregular data access. The challenge of high computation is solved with low hardware costs by the proposed architecture in Section 5.
PROPOSED ALGORITHMS TO REDUCE EXTERNAL MEMORY BANDWIDTH
The concept of proposed algorithms is to repartition the operations in MB level and frame level shown in Fig. 2 . Then, some techniques are introduced to further reduce the amount of data transmitted to external memory.
Scan bucket algorithm
The scan bucket algorithm moves the FGS scan in Fig. 2 from frame level to MB level. Although the coding order consists of many scans through all blocks in the whole frame, it can be reorganized in a more convenient way. We decide in advance which coefficients of a block are to be coded in each scan, store partial data in internal memory, and then exploit external memory as a transpose memory to conform to the correct coding order. Fig. 4 shows the concept of this method following the example in Fig. 3. When block 0 is processed in MB level, the coefficients are analyzed based on the scan rule, and then {0,0,0,1} is put into bucket 0, NC end into bucket 4, and A into bucket 5. Then, blocks 1 to 3 are processed in turn. Whenever a bucket is full, the data in that bucket are transferred to external memory. After all blocks are processed, the external memory is accessed according to the scan-by-scan coding order, as shown in Fig. 4. Since the data are well arranged in MB level, external data access becomes regular and simple. This method can also be applied to an FGS decoder in the reverse procedure with a little change.
Early context modeling with context reduction
In addition, transferring binary symbols with contexts rather than coefficients may reduce the external memory bandwidth. The number of contexts can then be decreased to lower the amount of data transmitted. Two simple approaches are adopted. The first is luma/chroma context reuse. Most luma and chroma syntax elements use their own sets of contexts. In some cases, however, the type of a syntax element is unambiguously the same as that of the previous one and need not be distinguished by different contexts. Accordingly, some contexts can be shared by luma and chroma to reduce the number of contexts. The second approach is position auto-detection. When coding the significance map, the contexts depend on the positions of the transformed coefficients. If the positions of skipped RCs, e.g. E, F, G in Fig. 3 (a), are recorded (by an additional self-defined context), the positions of all coefficients can be inferred in frame level from the scan number and the end of block (EOB), which must be right after an RC or a significant NC. An example is shown in Fig. 6. In this way, contexts of the significance map are largely merged before being transmitted over the bus, and can be reproduced after being read from external memory. Note that context reduction only lowers external memory bandwidth; contexts must be recovered before frame-level arithmetic coding, so there is no information or quality loss.
Fig. 6. Example of position auto-detection
Fig. 7. Proposed architecture of FGS encoder
First scan pre-encoding
Among all scans, the first scan encodes the largest amount of data, because most significant transformed coefficients appear in the low-frequency region and most residual headers are coded in it. In addition, the data in scan 0 are coded first anyway, so we can pre-encode them when the block is processed in MB level; only the data in other scans are sent to external memory and entropy-coded in frame level. Although entropy coding is partitioned into two phases, the coding order remains unchanged. Due to the large amount of data in scan 0, first scan pre-encoding is very effective in reducing external memory bandwidth. However, its implementation is optional because of some hardware overhead, which is explained in 5.3.
PROPOSED ARCHITECTURE TO REDUCE HARDWARE COSTS
Fig. 7 shows the proposed architecture. The main feature is layer-wise hardware reuse to reduce silicon costs. The layer-folded reconstruction loop, scan bucket implementation, and enhancement layer arithmetic coder are discussed in the following subsections. Base layer CABAC has been developed in the literature [6] [7] [8] and is not introduced here. The coefficient RAMs and the side information RAM are all single-port. The two coefficient RAMs, operated in ping-pong mode, store the coefficients of the base layer and enhancement layers. The side information RAM stores the required data of the previous MB row, namely only the information of the above MB needed for context modeling. The information of the left MB is kept in registers from the previous MB, so that storage is only one MB in size instead of one row.
Fig. 8. (a) Folding structure (b) Q loop
Layer-folded reconstruction loop
The layer-folded reconstruction loop is depicted in Fig. 8. The base layer and three enhancement layers share one Q loop module with different QPs by time multiplexing. For the base layer, the inverse-quantized coefficient inputs are all zeros, and the quantized coefficients are all categorized as NCs. Different from Fig. 2, the reconstruction can be selected from any layer, because an arbitrary number of FGS layers is supported in our design for the largest flexibility.
Scan bucket implementation
Fig. 9 shows the scan bucket implementation, where the binarizer and context modeller in Fig. 5 are integrated into the scan bucket algorithm. First, the residual header and the coefficients in zigzag scan order are sent to early context modeling in turn. The generated contexts are reduced by luma/chroma context reuse and position auto-detection. Binary symbols along with their contexts are then assigned to the appropriate scan bucket, as determined by analysis of the coded coefficients. These buckets are realized as FIFOs. When a bucket is full, its contents are put into the output buffer. Except for the three sets of FIFO buckets for the three enhancement layers, the hardware is shared, block by block within one enhancement layer and then layer by layer.
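The behaviour of the FIFO buckets can be modelled in software. The following is a minimal sketch, not the hardware design: the scan index of every symbol is assumed to be already decided by the scan rule, the symbol names and the bucket capacity are hypothetical, and external memory is modelled as one list per scan. It checks that filling buckets in MB order and reading external memory scan by scan gives the same order as the frame-level scan-by-scan reference.

```python
from collections import deque


def frame_level_order(blocks, num_scans):
    """Reference order: every scan visits all blocks of the frame."""
    return [sym for scan in range(num_scans)
            for block in blocks
            for s, sym in block if s == scan]


def scan_bucket_order(blocks, num_scans, bucket_capacity=2):
    buckets = [deque() for _ in range(num_scans)]  # on-chip FIFOs, one per scan
    external = [[] for _ in range(num_scans)]      # one external region per scan

    def flush(scan):
        external[scan].extend(buckets[scan])       # FIFO order is preserved
        buckets[scan].clear()

    for block in blocks:                # MB-level pass, blocks in coding order
        for scan, sym in block:         # scan index decided in advance
            buckets[scan].append(sym)
            if len(buckets[scan]) == bucket_capacity:
                flush(scan)             # bucket full: transfer to external memory
    for scan in range(num_scans):       # drain partially filled buckets
        flush(scan)
    # Frame-level pass: read the regions back scan by scan.
    return [sym for region in external for sym in region]


# Hypothetical frame with four blocks; each entry is (scan index, symbol).
blocks = [
    [(0, "b0.a"), (0, "b0.b"), (2, "b0.c")],
    [(0, "b1.a"), (1, "b1.b"), (1, "b1.c")],
    [(1, "b2.a"), (2, "b2.b")],
    [(0, "b3.a"), (2, "b3.b"), (2, "b3.c")],
]
assert scan_bucket_order(blocks, 3) == frame_level_order(blocks, 3)
```

Because each bucket only ever appends to its own region, every external write is a sequential burst, which is the regular access pattern the algorithm aims for.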

Enhancement layer arithmetic coder
The input of the enhancement layer arithmetic coder is loaded scan by scan from external memory. The core is the same as that of the base layer, with different contexts. Only the issue of hardware reuse is discussed here. The extent of hardware sharing depends on the first scan pre-encoding scheme. Three different implementations are developed as follows.
Without first scan pre-encoding
If first scan pre-encoding is not implemented, only one arithmetic coder in frame level is needed. Since the cycles in frame level are not as tight as MB level, one arithmetic coder is sufficient for three enhancement layers. Because of the property of progressive refinement, the schedule is in the order of layer 1, layer 2, and then layer 3, according to the importance of bitstream. In the extreme worst case, incomplete coding of enhancement layer 3 is more tolerable than that of layer 1 or 2.
With complete first scan pre-encoding
If first scan pre-encoding is implemented, the first bucket in Fig. 9 is replaced by an additional enhancement layer arithmetic coder. After the first scan is coded in MB level, the temporary interval information and context memory of arithmetic coding must be stored so that coding can be continued by the other scans in frame level. If this is used in all three enhancement layers, the largest reduction in external memory bandwidth is achieved, but the largest number of arithmetic coders is required. Three enhancement layers need six arithmetic coders in total, three for MB level and three for frame level, which are interchanged in a ping-pong manner.
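The effect on ordering and on external traffic can be sketched for one layer as follows. This is a simplified model under stated assumptions: the arithmetic coder is replaced by a plain list that records the coding order, so the stored interval information and context memory are represented only by carrying that list from MB level to frame level. The function name and the symbol names are hypothetical.

```python
def encode_with_first_scan_pre_encoding(blocks, num_scans):
    """Return (symbols in coding order, number of symbols sent off-chip)."""
    coded = []                                  # stand-in for the coder state
    external = [[] for _ in range(num_scans)]   # one external region per scan

    for block in blocks:                        # MB-level pass
        for scan, sym in block:
            if scan == 0:
                coded.append(sym)               # pre-encoded, never leaves the chip
            else:
                external[scan].append(sym)      # deferred to frame level

    # The coder state is handed over and coding continues with scans 1, 2, ...
    for scan in range(1, num_scans):            # frame-level pass
        coded.extend(external[scan])

    external_symbols = sum(len(region) for region in external)
    return coded, external_symbols


# Hypothetical frame with three blocks; each entry is (scan index, symbol).
blocks = [
    [(0, "b0.a"), (1, "b0.b")],
    [(0, "b1.a"), (0, "b1.b"), (2, "b1.c")],
    [(1, "b2.a")],
]
coded, external_symbols = encode_with_first_scan_pre_encoding(blocks, 3)
print(coded)
print(external_symbols, "of", sum(len(b) for b in blocks), "symbols sent off-chip")
```

In this toy frame three of the six symbols belong to scan 0, so only the other three are written to external memory, while the resulting order is the same as coding all scans in frame level.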
With partial first scan pre-encoding
To avoid too large a hardware overhead, first scan pre-encoding can be implemented in only one layer rather than all three. The best layer to select is enhancement layer 3, because it contributes the highest external memory bandwidth due to its smallest QP. Only two arithmetic coders are needed in this case: one for enhancement layer 3 in MB level, and the other shared by all three layers in frame level. Different from the order in 5.3.1, the frame-level coder must first encode layer 3 to continue the unfinished coding. This violates the order of importance, which is a drawback of this scheme.
IMPLEMENTATION RESULTS
Fig. 10 shows the reduction of external memory bandwidth from simulation of HDTV 1280×720 sequences. The number beside each bar is the percentage reduction relative to the bar above it. In total, at most 92% bandwidth reduction is attained when all methods are used. This work is implemented in TSMC 0.13μm technology. It can encode the base layer and three enhancement layers for HDTV 1280×720 at 30 frames per second in real time at a working frequency of 130 MHz. The specification can be higher when parallel processing is used between the enhancement layers. Table 1 lists the gate counts (in units of two-input NAND gates) and external memory bandwidth of the three proposed schemes. Because the coefficient parallelism of the reconstruction loop depends on other tasks processed in the same MB pipeline stage, we exclude it from Table 1. The listed gate counts also include bus control and input/output buffers.
Fig. 10. Reduction of external memory bandwidth
Table 1. Implementation results of three schemes for HDTV 1280×720 at 130 MHz working frequency
Four-symbol arithmetic coders [6] are used to support the extreme case of very high bit-rates. The three schemes use one, two, and six enhancement layer arithmetic coders, respectively, which accounts for their differences in gate counts. The three first scan pre-encoding schemes thus allow a trade-off between external memory bandwidth and silicon costs.
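The real-time figure implies a per-macroblock cycle budget, which can be estimated from the stated resolution, frame rate, and clock. The even split of cycles over the four SNR layers in the last line is an assumption made for illustration only. The text does not state how the shared hardware divides its time.

```python
CLOCK_HZ = 130_000_000           # working frequency, from the text
WIDTH, HEIGHT, FPS = 1280, 720, 30
MB_SIZE = 16                     # H.264/AVC macroblocks cover 16x16 luma samples
SNR_LAYERS = 4                   # base layer plus three enhancement layers

mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)
mbs_per_second = mbs_per_frame * FPS
cycles_per_mb = CLOCK_HZ / mbs_per_second

# Assumption: the layer-shared hardware splits its time evenly over the layers.
cycles_per_mb_per_layer = cycles_per_mb / SNR_LAYERS

print(mbs_per_frame, "MBs per frame")
print(f"{cycles_per_mb:.0f} cycles per MB")
print(f"{cycles_per_mb_per_layer:.0f} cycles per MB per layer")
```

This gives 3600 MBs per frame and roughly 1204 cycles per MB, or about 301 cycles per MB for each layer under the even-split assumption.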
CONCLUSION
A high-performance architecture of FGS encoder with CABAC is presented in this paper. It achieves at most 92% reduction in external memory bandwidth, and extensively exploits layer-wise hardware reuse to reduce hardware costs. It can encode HDTV 1280×720 sequences in real time at a working frequency of 130 MHz, and the specification can be raised by more parallelism across FGS layers. The scan bucket algorithm, early context modeling with context reduction, and first scan pre-encoding are introduced to avoid irregular data access and reduce external memory bandwidth. Although some of the proposed methods are designed for CABAC, the concepts can also be applied to CAVLC designs. The proposed architecture is flexible enough to support an arbitrary number of FGS layers chosen by designers. Besides, three design strategies for the enhancement layer coder are explored to allow a trade-off between external memory bandwidth and silicon area.
