In this paper, we propose a 4x4-block level pipelining architecture with an instantaneous switching scheme and an optimal decoding ordering for the H.264/AVC decoder. Compared with conventional H.264/AVC video decoders [1] [2], which adopt macroblock level pipelines, the proposed 4x4-block level pipelining architecture achieves better hardware utilization. Moreover, the proposed decoding ordering effectively reduces memory accesses and processing cycles, which results in a throughput of 260,000 MB/s (macroblocks per second) at a 100 MHz clock frequency. By adopting these two techniques, the proposed design supports real-time decoding of 1080HD (1920x1088) video sequences at 30 fps (244,800 MB/s required) at level 4 of the baseline profile.
INTRODUCTION
H.264 is the new video coding standard developed by MPEG (Moving Picture Experts Group) and VCEG (Video Coding Experts Group) that promises to outperform the earlier MPEG-4 and H.263 standards, providing better compression of video images [1]. Based on the H.264 draft standard [3], its low data rate is achieved by several complex techniques and algorithms, such as motion vectors with up to 1/4-pixel resolution, several block sizes from 4x4 to 16x16, several modes of inter/intra prediction, and context-adaptive entropy coding (CAVLC or CABAC). The price of this extremely low data rate is high computational complexity, which demands high computational power. For example, the ARM-platform based H.264/AVC decoder proposed in [1], with a throughput of up to 10.3 fps for a slow-motion QCIF (176x144) video sequence, is not fast enough for real-time decoding. Thus, reducing the computation time to meet the higher levels of the standard and reducing power consumption are the main issues in implementing an H.264 decoder.
In this paper, we present our proposed 4x4-block level pipelining architecture with an instantaneous switching scheme for the H.264/AVC decoder. Compared with the macroblock level pipelining architecture of conventional designs [1] [2], the 4x4-block level pipelining architecture greatly reduces the decoding time, increases overall throughput, improves the utilization of many functional blocks, and eliminates the bubbles that exist in some macroblock-level pipeline stages. Additionally, for the intra prediction unit, our proposed decoding ordering reduces memory accesses by 17%; for the inter prediction unit, it reduces both memory accesses and processing cycles by 28% by matching the decoding ordering to the 4x4-block scanning sequence.
Performance analysis shows that our proposed 4x4-block level pipelining architecture with the instantaneous switching scheme reduces the bubbles in the 4x4-block pipeline stages to a minimum. Combined with the optimal decoding ordering, which reduces both memory accesses and processing cycles, the overall throughput reaches 291,817 MB/s for I-frames and 257,731 MB/s for P-frames at a working frequency of 100 MHz. This paper is organized as follows. Section 2 gives an overview of the H.264 standard and the decoding algorithm. Section 3 describes our proposed decoding schedule, algorithm, and hardware architecture. Section 4 analyzes the proposed design and presents its performance and comparison results. Simulation results and discussion are given in section 5, and conclusions are drawn in section 6.
OVERVIEW OF H.264 STANDARD
H.264 is a macroblock-based system. The size of each macroblock is 16x16. The standard defines that the decoder receives macroblocks row by row. Each macroblock is further split into 24 4x4-blocks, whose residuals are transformed and quantized and then sent in the order shown in Fig. 1. Besides the transformed and quantized residual data, the encoder also sends the prediction modes of intra4x4 predicted macroblocks and the motion vectors of inter predicted macroblocks in the order shown in Fig. 1.
Based on the H.264 standard [3], Fig. 2 shows the block diagram and the data flows of the H.264/AVC decoder [4]. In Fig. 2, each group of 4x4 residual data is first entropy decoded in the decoder. After reordering, inverse quantization, and inverse transform, the decoder adds each group of 4x4 residual data to the predicted pixel values from inter prediction or intra prediction to reconstruct the 4x4 block. The MB is reconstructed by inverse scanning the reconstructed 4x4-blocks in the order shown in Fig. 1. The un-filtered frame is reconstructed by combining the reconstructed macroblocks row by row, and the decoder obtains the reconstructed frame after performing the loop filter. Fig. 3 shows that the loop filter and the interpolation process of inter prediction (motion compensation) occupy much of the execution time in a highly optimized software simulation [5]. Thus, in our VLSI design, we adopt a parallel architecture and pipelining to balance the cycle counts and maximize hardware utilization.
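As a minimal sketch of the 4x4-block scanning sequence referred to here, the following assumes that Fig. 1 uses the Z-shaped luma block order of the H.264 standard (blocks 0, 1, 4, 5 in the top row of 4x4-blocks, and so on). The function name `luma_block_offset` is hypothetical and only illustrates how a block index maps to a pixel offset inside the 16x16 macroblock.

```python
def luma_block_offset(idx):
    """Return the (x, y) pixel offset of luma 4x4-block `idx` (0..15)
    inside a 16x16 macroblock, assuming the Z-shaped scan order."""
    if not 0 <= idx < 16:
        raise ValueError("luma 4x4-block index must be in 0..15")
    # Bits 0 and 2 of the index select the block column,
    # bits 1 and 3 select the block row.
    col = (idx & 1) | ((idx >> 1) & 2)
    row = ((idx >> 1) & 1) | ((idx >> 2) & 2)
    return 4 * col, 4 * row


if __name__ == "__main__":
    for i in range(16):
        print(i, luma_block_offset(i))
```

Under this assumption, consecutive indices alternate between moving to the right-hand neighbor and jumping to a block further down or back up, which is the property the decoding ordering in section 3 relies on.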
THE PROPOSED ARCHITECTURE AND DECODING SCHEDULE
Fig. 4 shows our proposed high-throughput system platform for the H.264/AVC decoder. In this work, the decoder takes the input bit-stream and produces the decoded frame as output. The (Nx2/4)x32 single port memory in Fig. 4 stores a row of unfiltered pixel values so that the intra predictor can fetch the upper neighboring pixels during the intra prediction process. The (Nx2)x32 single port memory in Fig. 4 stores a row of filtered pixel values so that the upper neighboring pixels can be fetched for boundary strength calculation and the filter process. Two 96x32 single port memories are used as a macroblock-sized buffer between the residual adder and the loop filter. We place the frame memory off-chip because it is quite large. By adopting the 4x4-block level pipelining architecture with the instantaneous switching scheme and the optimal decoding ordering, a high-throughput decoder can be realized.
4x4-block level pipelining architecture with instantaneous switching scheme
The 4x4 block is the smallest group of pixels that the H.264/AVC standard adopts. Compared with the conventional macroblock-level pipelining architecture [1] [2], our 4x4-block level pipelining architecture is better suited to the 4x4-block scanning sequence of the H.264/AVC standard shown in Fig. 1.
In our 4x4-block level pipelining architecture, the adder adds a group of 4x4 residual data to the predicted 4x4 pixel values directly, as soon as both the predicted 4x4 pixel values and the residual 4x4 pixel values are ready. Since the outputs of the IDCT and the inter/intra prediction unit can be added directly after computation, no additional storage is needed, in contrast to the macroblock-sized storage required by macroblock-level pipelining. Besides the saving in storage, our 4x4-block level pipelining is especially suitable for decoding intra4x4 macroblocks, because the intra4x4 prediction process and the IDCT process are iterated block by block while an intra4x4 macroblock is decoded.
An example of the pipelining schedule is shown in Fig. 5. In our design, the residual adder (see Fig. 4) adds the group of predicted 4x4-block pixel values and the residual 4x4-block pixel values as soon as both are ready, and the 4x4-block level pipeline stage switches instantaneously once the operation of this residual adder completes. This instantaneous switching scheme improves our decoding throughput.
For the deblocking filter, because the standard defines that all the pixels of an intra predicted macroblock must be predicted prior to the deblocking process, the 4x4-block level pipelining cannot be extended to the deblocking filter. Instead, conventional macroblock level pipelining is used on the deblocking filter side. Thus, the two 96x32 single port memories are needed here, as in a conventional design [6].
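The benefit of switching as soon as both inputs are ready can be illustrated with a toy two-stage model. This is only a sketch under stated assumptions, not the schedule of Fig. 5: the function names and the per-block cycle counts are hypothetical, and the "fixed slot" baseline simply assumes every pipeline slot lasts as long as the slowest stage time observed.

```python
def fixed_slot_cycles(stage_a, stage_b):
    """Two-stage pipeline in which every slot has the same length,
    sized for the worst-case stage time (a simple baseline)."""
    slot = max(max(stage_a), max(stage_b))
    # n items need n + 1 slots to pass through two stages.
    return slot * (len(stage_a) + 1)


def instantaneous_cycles(stage_a, stage_b):
    """Two-stage pipeline that advances as soon as both stages have
    finished their current item (instantaneous switching)."""
    n = len(stage_a)
    total = stage_a[0]                      # fill: only stage A is busy
    for i in range(1, n):
        # Stage A works on item i while stage B works on item i - 1.
        total += max(stage_a[i], stage_b[i - 1])
    total += stage_b[n - 1]                 # drain: only stage B is busy
    return total


if __name__ == "__main__":
    # Hypothetical per-4x4-block cycle counts, for illustration only.
    a = [4, 9, 5, 7]
    b = [6, 4, 8, 5]
    print(fixed_slot_cycles(a, b), instantaneous_cycles(a, b))
```

In this model the variable-length slots never take longer than the fixed-length ones, because each slot is bounded by the worst-case stage time.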
Optimal decoding ordering
The key data traffic between modules in the decoder passes through the adder that adds the residual data at the output of the IDCT to the pixel values predicted by the inter prediction block or the intra prediction block, as Fig. 2 shows. To minimize temporary data storage and waiting time around that adder, we synchronize the IDCT unit, the intra prediction unit, and the motion compensation unit so that the three blocks have equal throughput. We use a four-parallel architecture in these three blocks, which makes the throughput of each equal to four pixels per cycle. That is, in one cycle, 4 predicted pixel values and 4 residual pixel values can be output concurrently from the intra/inter predictor and the IDCT unit.
Based on the 4x4-block scanning sequence defined by the H.264/AVC standard and our 4x4-block level pipelining architecture, we reconstruct the 4x4-blocks one by one in the order shown in Fig. 1. Because four pixel values can be output concurrently in our design, we can choose either the 4x1 row-by-row decoding ordering or the 1x4 column-by-column decoding ordering, as Fig. 6(a) and Fig. 6(b) show.
Figure 6. (a) 4x1 row-by-row decoding ordering (b) 1x4 column-by-column decoding ordering
We choose the 1x4 column-by-column decoding ordering, which reduces memory accesses and processing cycles. An analysis and comparison of these two decoding orderings is given in section 4.
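The two orderings can be written down as a short sketch. The helper names `row_groups`, `column_groups`, and `left_neighbors` are hypothetical, and the coordinates are plain (x, y) pixel positions chosen for illustration. Each inner list is one group of four pixels produced in a single cycle.

```python
def row_groups(x0, y0):
    """4x1 row-by-row ordering: each output is one row of
    4 horizontally adjacent pixels of the 4x4-block at (x0, y0)."""
    return [[(x0 + dx, y0 + dy) for dx in range(4)] for dy in range(4)]


def column_groups(x0, y0):
    """1x4 column-by-column ordering: each output is one column of
    4 vertically adjacent pixels of the 4x4-block at (x0, y0)."""
    return [[(x0 + dx, y0 + dy) for dy in range(4)] for dx in range(4)]


def left_neighbors(x0, y0):
    """The 4 pixels immediately to the left of the 4x4-block at (x0, y0)."""
    return [(x0 - 1, y0 + dy) for dy in range(4)]


if __name__ == "__main__":
    # Last column output of the block at (0, 0) ...
    print(column_groups(0, 0)[-1])
    # ... is the left-neighbor column of the block at (4, 0).
    print(left_neighbors(4, 0))
```

With the column-by-column ordering, the last group output for a block is a complete left-neighbor column of the block to its right, whereas with the row-by-row ordering that column is spread across all four outputs.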
ANALYSIS ON PERFORMANCE AND COMPARISON

Analysis of the 4x4-block level pipelining architecture
Depending on the input data, the total number of coded bits of the residual data in a macroblock may vary from 100 bits to 600 bits, which causes the number of decoding cycles needed by the CAVLC decoder to vary from macroblock to macroblock. As Fig. 5 shows, including the additional cycles in which the CAVLC decoder decodes the coefficients of each 4x4-block for luma or 2x2-block for chroma, the total number of decoding cycles N for an un-filtered macroblock ranges from 200 to 800 cycles. For the 'Foreman' QCIF video sequence, our simulation results show that the proposed 4x4-block level pipelining architecture achieves a throughput of 350 cycles/MB on average. Compared with our loop filter, which needs about 250 cycles to filter each macroblock in the worst case, the CAVLC operation for a macroblock is usually slower but sometimes faster than the loop filter in a macroblock level pipeline stage. In our design, because the CAVLC decoder is usually the throughput-critical operation, the proposed 4x4-block level pipelining architecture, which improves the utilization of the CAVLC decoder, also greatly improves the system throughput. Moreover, because our instantaneous switching scheme applies to both the 4x4-block level pipelining and the macroblock level pipelining, the bubbles are reduced to a minimum wherever the system bottleneck lies.
Unlike some conventional designs, which may have macroblock-period bubbles [2], our 4x4-block level pipelining architecture combined with the instantaneous switching scheme effectively minimizes the latency and the bubbles in the decoding time.
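The relation between cycles per macroblock and throughput used in this analysis can be sketched as follows. The function names are hypothetical; the inputs are the figures quoted above (350 cycles/MB on average for the un-filtered path, about 250 cycles/MB worst case for the loop filter, 100 MHz clock), and the steady-state model simply assumes the slower of the two macroblock-level stages sets the pace.

```python
CLOCK_HZ = 100_000_000  # 100 MHz working frequency


def bottleneck_cycles_per_mb(*stage_cycles):
    """In a macroblock-level pipeline the slowest stage sets the pace."""
    return max(stage_cycles)


def mb_per_second(cycles_per_mb, clock_hz=CLOCK_HZ):
    """Macroblocks per second for a given cycle cost per macroblock."""
    return clock_hz / cycles_per_mb


if __name__ == "__main__":
    cycles = bottleneck_cycles_per_mb(350, 250)
    print(cycles, round(mb_per_second(cycles)))
```

With these inputs the un-filtered path is the bottleneck, which is consistent with the statement that the CAVLC decoder is usually the throughput-critical operation.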
Analysis of our optimal decoding ordering
As Fig. 6(a) and Fig. 6(b) show, the two decoding orderings differ in the number of memory accesses and in the number of processing cycles needed for the output. The analyses for the intra prediction unit and the inter prediction unit are as follows.
Analysis on the intra prediction unit
For an Intra4x4 predicted macroblock, the neighboring pixels, including the upper 8 pixels and the left 4 pixels, must be loaded before performing prediction. If we choose the 1x4 column-by-column decoding ordering shown in Fig. 6(b), the 4 pixels of the 3rd output are exactly the left neighbors of the next 4x4-block to be predicted, as Fig. 7 shows. Thus, the fetch of these 4 left-neighbor pixels from memory can be saved by feeding the 4 pixels directly from the previous output. The situation occurs at the 11 
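One accounting that is consistent with the 17% figure quoted in the introduction can be sketched as follows. It rests on assumptions not spelled out here: that the 16 luma 4x4-blocks are visited in the Z-shaped scan order of the standard, that every block loads 8 upper and 4 left neighbor pixels, and that the left-neighbor fetch is skipped whenever the next block in scan order is the right-hand neighbor of the current one. The function names are hypothetical.

```python
def block_offset(idx):
    """(x, y) pixel offset of luma 4x4-block `idx` in Z-shaped scan order."""
    col = (idx & 1) | ((idx >> 1) & 2)
    row = ((idx >> 1) & 1) | ((idx >> 2) & 2)
    return 4 * col, 4 * row


def forwarded_blocks():
    """Count blocks whose left neighbors come straight from the
    previously decoded block (next block lies directly to the right)."""
    count = 0
    for i in range(15):
        x, y = block_offset(i)
        if block_offset(i + 1) == (x + 4, y):
            count += 1
    return count


UPPER_PIXELS = 8   # upper neighbors loaded per block
LEFT_PIXELS = 4    # left neighbors loaded per block

if __name__ == "__main__":
    total = 16 * (UPPER_PIXELS + LEFT_PIXELS)
    saved = forwarded_blocks() * LEFT_PIXELS
    print(forwarded_blocks(), saved, total, round(100 * saved / total))
```

Under these assumptions, half of the block transitions forward their last column, which removes 32 of the 192 neighbor-pixel loads per macroblock, or roughly 17%.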
Analysis on the inter prediction unit
For an inter predicted macroblock, each group of 4 predicted pixel values is calculated by first loading the neighboring 9x6=54 pixel values of the reference frame in 18 cycles (9 pixels per 3 cycles), then performing 6-tap interpolation followed by 2-tap interpolation. After a group of 4 pixel values has been calculated, its neighboring 4 pixel values can be calculated by continuing to load 9 reference-frame pixels in the following 3 cycles. In this situation, if we choose 4x1 row-by-row decoding ordering as Fig. 6(a) Fig. 8(a) shows. However, if we choose 1x4 column-by-column decoding ordering as Fig. 6(b) ordering, the 1x4 column-by-column decoding ordering saves 8 content switches, which in turn reduces memory accesses from 3x6x16+3x3x16=432 to 3x6x8+3x7x8=312 (28% saved) and reduces processing cycles from (18+3x3)x16=432 cycles to (18+3x7)x8=312 cycles (28% saved).
*Arrows represent content switches
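The cycle counts for the inter prediction unit can be reproduced with a few lines of arithmetic. This sketch uses only the figures given above (18 cycles for the initial 9x6 load after a content switch, 3 cycles for each following group, 16 switches with 3 follow-up groups for the row-by-row ordering, 8 switches with 7 follow-up groups for the column-by-column ordering); the function name is hypothetical.

```python
INITIAL_LOAD_CYCLES = 18  # 9x6 = 54 reference pixels, 9 pixels per 3 cycles
STEP_CYCLES = 3           # 9 more reference pixels for each following group


def inter_cycles(switches, groups_after_switch):
    """Processing cycles per macroblock for a given number of content
    switches and follow-up groups computed after each switch."""
    per_switch = INITIAL_LOAD_CYCLES + STEP_CYCLES * groups_after_switch
    return per_switch * switches


if __name__ == "__main__":
    row_by_row = inter_cycles(16, 3)        # (18 + 3x3) x 16
    column_by_column = inter_cycles(8, 7)   # (18 + 3x7) x 8
    saving = 1 - column_by_column / row_by_row
    print(row_by_row, column_by_column, round(100 * saving))
```

The saving comes entirely from halving the number of 18-cycle initial loads: both orderings compute the same 64 groups of four pixels per macroblock.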
SIMULATION RESULTS AND DISCUSSION
Based on our proposed high-throughput 4x4-block level pipelining architecture with the instantaneous switching scheme and the optimal decoding ordering, our VLSI design and simulation show that the proposed design achieves a throughput of 260,000 MB/s at 100 MHz, which supports real-time decoding of 1080HD (1920x1088) video sequences at 30 fps (244,800 MB/s required). Table 1 compares our implementation and simulation results with available conventional designs. For other video formats, for example 720pHD, 525SD, or CIF video sequences, the proposed decoder supports real-time decoding at different clock rates, as Table 2 shows.
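The real-time requirement can be checked with a short calculation. The sketch below derives the required macroblock rate from the frame size and the frame rate, and estimates the minimum clock from the reported 260,000 MB/s at 100 MHz, assuming the cycle cost per macroblock does not change with clock frequency. The function names are hypothetical, and the frame sizes used for 720pHD (1280x720), 525SD (720x480), and CIF (352x288) are assumptions; only the 1080HD size is stated in the text. The clock rates are those quoted in the conclusion.

```python
import math

MB_SIZE = 16  # macroblock edge length in pixels
CYCLES_PER_MB = 100_000_000 / 260_000  # about 384.6 cycles per macroblock


def required_mb_per_second(width, height, fps):
    """Macroblocks per second needed for real-time decoding."""
    mbs_per_frame = math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)
    return mbs_per_frame * fps


def min_clock_hz(width, height, fps):
    """Lowest clock that sustains the required macroblock rate,
    assuming a constant cycle cost per macroblock."""
    return required_mb_per_second(width, height, fps) * CYCLES_PER_MB


if __name__ == "__main__":
    formats = {
        "1080HD": (1920, 1088, 100e6),
        "720pHD": (1280, 720, 50e6),   # assumed frame size
        "525SD": (720, 480, 20e6),     # assumed frame size
        "CIF": (352, 288, 5e6),        # assumed frame size
    }
    for name, (w, h, clock) in formats.items():
        need = min_clock_hz(w, h, 30)
        print(name, required_mb_per_second(w, h, 30), need <= clock)
```

For 1080HD this gives 120 x 68 x 30 = 244,800 MB/s, which matches the requirement quoted above and leaves a margin of about 6% at 100 MHz.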
CONCLUSION
In this paper, we propose a novel H.264/AVC decoder with a 4x4-block level pipelining technique. With the 4x4-block level pipelining and the instantaneous switching scheme, we minimize the latency and bubbles in decoding each 4x4 block and achieve a throughput of 260,000 MB/s at a 100 MHz clock rate. Further, the proposed 1x4 column-by-column decoding ordering, which matches the 4x4-block scanning sequence defined in the standard, reduces memory accesses by 17% in the intra prediction unit and reduces both memory accesses and processing cycles by 28% in the inter prediction unit. By adopting these two techniques, the proposed design supports real-time decoding at 30 fps of 1080HD (1920x1088) video sequences at 100 MHz, 720pHD at 50 MHz, 525SD at 20 MHz, and CIF at 5 MHz.
