Abstract-This paper presents a high-profile intra predictior for H.264 video decoder. To alleviate the starved bandwidth of intra compensation in high-definition video, we reuse the neighboring pixels and optimize the buffer size and access latency. In particular, a dedicated pixel buffer reuses neighboring pixel for realizing MB-adaptive frame-field (MBAFF) decoding in intra compensation. Moreover, a basemode predictor is explored to optimize the area efficiency for reference sample filtering process (RSFP) in intra 8x8 modes. Simulation results show that the proposed data-reused intra prediction module requires 14K logic gates and 688 bits SRAM, and operates on 100MHz frequency for realizing 1080HD video playback at 30fps.
INTRODUCTION
H.264/AVC [1] [2] is the latest international video coding standard from MPEG and ITU-T Video Coding Experts Group. It consists of three profiles which are defined as a subset of technologies within a standard usually created for specific applications. Baseline and main profiles are intended as the applications for video conferencing/mobile and broadcast/storage, respectively. Considering the H.264-coded video in high profile, it targets the compression of high-quality and high-resolution video and becomes the mainstream of high-definition consumer products such as Blu-ray disc. However, high-profile video is more challenging in terms of implementation cost and access bandwidth since it involves extra coding engine, such as macroblock-adaptive frame field (MBAFF) coding and 8×8 intra coding, for achieving high performance compression.
In the MBAFF-coded pictures, they can be partitioned into 16x32 macroblock pairs, and both macroblocks in each macroblock-pair are coded in either frame or field mode. As compared to purely frame-coded pictures, MBAFF coding requires two times of neighboring pixels size and therefore increases implementation cost. To cope with aforementioned problem, we propose neighboring buffer memory (include upper/left/corner) to reuse the overlapped neighboring pixels of an MB pair. Furthermore, we present memory hierarchy and pixel re-ordering process to optimize the overall memory size and external access efficiency. On the other hand, H.264 additionally adopts intra 8×8 coding tools for improving coding efficiency. It involves a reference sample filtering process (RSFP) before decoding a Luma intra_8x8 block. These filtered pixels are used to generate predicted pixels of 8×8 blocks. Hence, the additional processing latency and cost are required, and they may impact the overall performance for the real-time playback of high-definition video. In this paper, we simplify the RSF process via a basemode predictor and optimize the processing latency and buffer cost. Compared to the existing design [5] without supporting intra 8×8 coding, this design only introduces area overhead of 10% and 7.5% of gate counts and SRAM.
An architectural choice advocated for dealing with long past history of data is a memory hierarchy [7] . In the intra prediction, it utilizes the neighboring pixels to create a reliable predictor, leading to dependency on a long past history of data. This dependency can be solved by storing upper rows of pixels for predicting current pixels but is a challenging issue in implementation cost and access bandwidth. To optimize the introduced buffer cost and access efficiency, we use two internal Line SRAM1 and Line SRAM2 to store the Luma and Chroma upper line pixels, as illustrated in Figure 1 . By the ping-pong mechanism, the upper neighboring pixels of current MB (or MB pair) are stored to one of them, and the other Line SRAM is used to store next MB of upper neighboring pixels. This memory hierarchy facilitates the internal Line SRAM size and the decoding pipeline schedule. The rest of this paper is organized as follows. In Section 2, the architecture of an intra prediction module and memory hierarchy will be introduced. In Section 3, MBAFF decoding flow will be described. The Luma intra_8x8 decoding, RSFP, and latency analysis are illustrated in Section 4. Figure 2 shows the block diagram of the proposed highprofile intra compensation architecture. A pixel rearranging process, which is located on the bottom-left of Figure 2 , is proposed to reduce the complexity of neighbor fetching when MBAFF coding is enabled. The signal, Line SRAM1/2 data_out, is directly connected to the intra prediction block for replacing the last set of upper buffer memory. As for 8×8 intra coding, a dedicated pixel buffer memory is used to store the filtered neighboring pixels and reuse the overlapped pixel data. According to the relations between Luma intra_8x8 modes and numbers of filtered pixels which are needed in each mode, we minimize the number of stored pixels to 17 (i.e. 136 bits). The output of predicted pixel is interfaced to the filtered pixel buffer memory because RSFP is embedded in the intra prediction generator. 
II. PROPOSED HIGH-PROFILE INTRA PREDICTOR

III. MBAFF DECODING WITH DATA REUSE SETS
MBAFF is proposed to improve coding efficiency for interlaced video. However, it introduces longer dependency than conventional frame-coded picture. Moreover, there are no reports discussing the implementation of MBAFF. In this section, we analyze and realize it via upper, left, and corner data reuse sets (DRS) to reuse the pixels and improve the cost and access efficiency.
A. Upper DRS
For decoding an MBAFF-coded video, upper buffer memory is used to store the constructed upper pixels of current MB pair. These upper buffers are updated with the completion of prediction process on every 4×4 block. For each updated sub-row(s), they can be reused by the underside 4×4 blocks. According to the different prediction modes of MB pair, the upper buffer will store data from different directions. If current MB pair is frame mode, it only needs to load one row of upper buffer (16 pixels) at first, and when a 4×4 block is decoded, updating the two sub-rows in two rows (8 pixels) of upper buffer from top to down at one time, as illustrated in Figure 3 (a). In Figure 3 (b), a fieldcoded MB pair needs to load two rows of upper buffer (32 pixels), two times of frame-coded MB pairs. Then, only one sub-row of upper buffer memory will be updated when a 4×4 block is decoded. However, considering the fifth 4×4 block, it still needs a sub-row of upper buffer to predict, as shown in Figure 4 . In order to reduce the upper buffer memory size, the Line SRAM data_out is directly used. The only penalty to this scheme is that the Line SRAM data_out must hold the value until fifth 4×4 block is decoded. 
B. Left DRS
The updated direction of the left buffer is similar to that of the upper one. The direction ranges from left to right. When the left buffer is located on the right hand side of MB pair, the next MB pair can reuse these new pixels for the following prediction procedures. However, when the modes of current and previous MB pairs are different, the left neighbors of a 4x4 block will become complicated. To reduce computational complexity of this intra predictor, pixel rearranging process is exploited. If current MB pair is frame mode, each sub-column of left buffer will be updated when each 4x4 block is decoded. On the other hand, if current MB pair is field mode, first and third buffers in each sub-column of left buffer will be updated when each 4×4 block is decoded in the top MB. Second and fourth buffers in each sub-column of left buffer will be updated when each 4×4 block is decoded in bottom MB. Hence, we only need to consider what the mode current MB pair is instead of handling four coding modes for the combination of current and previous MB pairs, and therefore the complexity can be reduced. 
C. Corner DRS
Using corner buffer memory can efficiently reuse the upper left neighboring pixels. We change the positions of corner buffer from left [5] to top. Therefore, the total corner buffer size can be reduced by 38% (i.e. 64bits 40bits, because the MB number in horizontal is less than that of vertical MB pair). In particular, Figure 5 shows the updating directions of corner buffer. However, because the upper neighboring pixels will be either the last row or the row prior to the last row in upper MB pair, so the first corner of current MB pair has two processing states: reuse and reload. The first corner is reused when 1) the mode of current MB pair is identical to that of previous (left) MB pair or 2) before decoding the bottom MB of frame-coded MB pair. On the other hand, the first corner is reloaded when 1) the current MB pair has the different modes as previous (left) MB pair or 2) before decoding the bottom MB of fieldcoded MB pair. In summary, using neighboring buffer memory and their different directions of updates according to different modes of MB pair can reuse the neighboring pixels and improve the access efficiency. The associated pipeline structure of MBAFF decoding is shown in Figure 6 . We can see that during a MB pair decoding process, the interaction between buffers and Line SRAM can be completed easily and efficiently, and the communication between another Line SRAM and external SDRAM can be done at the same time. 
IV. INTRA 8X8 DECODING WITH BASE-MODES
Luma intra_8x8 is an additional intra block type supported in H.264 high profile. Before decoding an intra_8x8 block, there is an extra process that is different from intra_4x4 and intra_16x16, which called reference sample filtering process (RSFP). Original pixels will be filtered first, and then using these filtered pixels to predict subsequent 8×8 blocks. For an intra_4x4 and intra_8x8 block, 13 neighbors and 25 filtered neighbors are needed, respectively. According to the Luma intra_8x8 modes, the maximum number of filtered neighbors is 17, as illustrated in Table I . Hence, only 17 (i.e. 136 bits) filtered pixels need to be stored. In the intra_4x4 process, the prediction formula of each mode except DC mode has the same form: prediction_out = (A+2B+C+2) >> 2. Compared with the share-based [3] [4] [5] intra prediction generator, the proposed base-mode predictor not only reduces area cost (due to elimination of four adders) but also guarantees that four predicted pixels will be generated in one cycle of each intra_4x4 modes, including DC mode. In particular, we use this base mode predictor to generate the four predicted pixels in one cycle. In the RSFP, the form of formula is identical to that in intra_4x4, and also can be rewritten to the same form filtered_out = (A+2B+C+2) >> 2. Hence, we can share the hardware resource to generate filtered pixels, as shown in Figure 7 (a). Notice that an additional process, neighbor distribution, is needed to add in intra_8x8 process because we only store 17 filtered pixels instead of 25.
, where 0, 8, 16, 17
In a four-parallel intra prediction module, the latency of an 8x8 block will be increased to 0~5 cycles according to the different modes of 8×8 blocks. In order to reduce the latency penalty, we reserve filtered pixels when the mode of the first/third is equal to 3 or 7 and second/fourth 8x8 block is equal to 0, 2 (if upper is available), or 3~7 (i.e. the value N = 6. Otherwise, N = 0). Then these filtered pixels are directly used to predict second/fourth 8×8 block. To clarify the extra latency, Eq. (1) lists the decoding extra latency in an intra_8x8 MB, and TABLE I summarizes the # of filtered pixels in each 8×8 intra coding mode (i.e. the value of M). In particular, we list some examples to clarify the processing behavior of an intra 8×8 block in Figure 7(b) . If the modes of first and second 8×8 blocks are 3 (diagonal bottom left) and 7 (vertical left) or 3 (diagonal bottom left) and 4 (diagonal bottom right), only 10 and 11 pixels are needed to be filtered while decoding the second 8×8 block, as described in Figure 7 
V. SIMULATION RESULTS
To enhance system performance, our proposal is designed to optimize area, buffer size, and latency. We use two 0.64kb Line SRAMs which are connected to a 32-bit system bus to make decoding pipeline simple, and 0.688kb SRAM to store reused neighboring pixels. TABLE II shows the average cycles for decoding an I-MB in different video sequences of our proposed design for 30fps HD1080 video format at working frequency of 100MHz with MBAFF and Luma intra_8x8. The overhead of latency is less than 5% compared to preliminary architecture [6] . The overall area and buffer memory size for supporting H.264 BP/MP/HP are 14063 gates in UMC 0.18um technology and 688 bits, as shown in TABLE III. The overheads for supporting Luma intra_8x8 are 10% and 7.5% compared to [5] . VI. CONCLUSION
In this paper, we propose a high-profile intra predictor to support MBAFF and Luma intra_8x8 decoding. The proposed memory hierarchy includes upper, left and corner memory buffer which reuses the neighboring pixels for follow-up prediction procedures. In Luma_8x8 decoding process, we propose base-mode predictors to minimize the additional hardware cost, latency penalty, and filtered pixel buffer memory size. Compared to the existing design [5] without supporting intra 8x8 coding, this design only introduce 10% and 7.5% of gate counts and SRAM overheads. The proposed design can achieve real-time processing requirement for HD1080 format video in 30fps under the working frequency of 100MHz.
