In this paper, we propose an area-efficient design approach that covers both in-loop and post-loop filtering for multiple video coding standards. In addition, we propose a hybrid filter scheduling to improve system throughput. Compared with available designs [1] [2], the proposed approach saves about one-half of the processing cycles and hence reduces power dissipation. Compared to the original loop filter, the proposed loop/post filter incurs only 20.7% extra cost. Simulation results show that our proposal can easily achieve real-time decoding for 1080HD at a working frequency of 100 MHz.
INTRODUCTION
Various video coding standards are currently in use. Traditional MPEG standards support backward compatibility. However, H.264/AVC [3], the newest video standard, is not backward compatible with the former H.263 and MPEG-4 (Part 2) video coding standards. Therefore, a design that combines several video coding standards is needed to meet different system requirements. Both H.264/AVC and MPEG-4 adopt a de-blocking filter to eliminate blocking artifacts. However, H.264/AVC adopts the de-blocking filter as an in-loop process, whereas the other standards adopt it as a post-loop process. The detailed features of the de-blocking filters are listed in Table 1. To provide a single architecture for multiple video standards, we propose a hybrid scheme that integrates the standardized in-loop filter and the informative post-loop filter. We call it the loop/post de-blocking filter in this paper.
Because the post filter is not standardized [4], there is considerable freedom to develop an algorithm that is suitable for integration with the loop filter. Based on the original algorithm of the 4x4 loop filter, an 8x8 post filter has been developed. We modify the filtering order and the number of related pixels, so that the modified post filter can easily be integrated with the 4x4 loop filter. Simulation results also show that the proposed loop/post filter incurs a slight PSNR loss (0.03 dB) and 20.7% extra cost compared to the original loop filter.
With the single-port architecture in [1], the de-blocking filter is the system bottleneck (see Figure 1). Therefore, a high-throughput de-blocking filter is essential to improve the system throughput. In the traditional de-blocking filter architecture, vertical edges are filtered first, and then horizontal edges are filtered. Unfiltered data must be fetched for each direction, so memory accesses are doubled for each 4x4 sub-block or 8x8 block. We modify the processing order of the filtered block boundaries without violating the pre-defined data dependency. Compared to available designs [1] [2], the proposed loop/post filter architecture saves about one-half of the processing cycles.
The rest of this paper is organized as follows. Section 2 presents the combined scheme of the loop/post filter. Section 3 shows our hardware architecture for high-throughput design. Section 4 shows our simulation results. Finally, Section 5 summarizes our work and draws the conclusions.
LOOP/POST DE-BLOCKING FILTER
To reduce the cost overhead of supporting de-blocking for multiple standards, a hybrid algorithm and a unified de-blocking filter architecture are required. H.264/AVC adopts the de-blocking filter as an in-loop process, whereas the former MPEG standards adopt it as a post-loop process. However, the performance improvement is very mild when the loop filter is applied directly as the post filter in MPEG-4. Therefore, we propose a hybrid algorithm that makes a compromise between integration cost and performance loss. Figure 2 shows the decision of our proposed loop/post filter. The proposed hybrid algorithm retains the original loop filter because it is standardized in H.264/AVC. In addition, we modify the post filter (underlined in Figure 2) so that it integrates easily into the original loop filter design. The proposed algorithm exploits the features of both the loop and post filters. It can be partitioned into three main parts, as identified in Table 2. In the filter control, we retain the filtered edges on the 4x4 and 8x8 grids respectively, because the basic transform units are the 4x4 sub-block and the 8x8 block. Further, we modify the filtering order of the post filter to unify both filters into a hybrid structure. These filter controls are addressed again in Section 3. In the following, we introduce the loop/post filter algorithm in terms of mode decision and filtering mode.
*Work supported by the National Science Council of Taiwan, R.O.C., under Grant NSC 93-2220-E-009-010.
Mode Decision
There are several differences between the mode decisions of the loop and post filters. The loop filter is performed inside the DPCM loop and is controlled by the syntax parser. The post filter, in contrast, is applied after the video decoder and can be considered a post-processing unit; it is controlled by the neighboring pixels. To merge the mode decisions, we retain the mode decision features of both the loop and post filters. Further, we modify the mode decision of the post filter into an 8-pixel-related algorithm, instead of the 10-pixel-related algorithm of the original post filter. This modification reduces hardware complexity and makes the post filter suitable for integration with the loop filter, since both filters then operate on the same 8 related pixels.
Filtering Mode
To combine the edge filtering of the in-loop and post-loop filters, we modify the default mode of the post filter and apply the "bS=4" process in place of the DC offset mode of the post filter. In Table 2, the filtering modes are partitioned into strong and weak modes. The strong filtering mode of the post filter is similar to that of the loop filter, so we apply the "bS=4" filter process instead of the original DC offset mode in MPEG-4 Annex F.3 [4]. Further, we change the approximated DCT kernel from [2 -5 5 -2] to [2 -4 4 -2], so that shifters can be used instead of constant multipliers. We also apply a folding scheme to reduce the hardware cost: the three parallel operations (see Equation 1 [4]) are folded into a single operation executed over three cycles. All the modifications of the post filter design are underlined in Table 2. Figure 2 shows the architecture for the weak filtering strength in detail.
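The benefit of the kernel change can be illustrated with a small sketch. The function and argument names below are hypothetical (the four taps are simply called `v3` to `v6` here), and the code only shows that the modified kernel [2 -4 4 -2] can be evaluated with subtractions and left shifts; it is not the authors' datapath.

```python
def kernel_mul(v3, v4, v5, v6):
    """Reference form: explicit multiplication by the modified kernel [2 -4 4 -2]."""
    return 2 * v3 - 4 * v4 + 4 * v5 - 2 * v6


def kernel_shift(v3, v4, v5, v6):
    """Shift-only form: two subtractions, two left shifts and one addition."""
    return ((v3 - v6) << 1) + ((v5 - v4) << 2)


def kernel_original(v3, v4, v5, v6):
    """Original approximated DCT kernel [2 -5 5 -2], which needs a multiply by 5."""
    return 2 * v3 - 5 * v4 + 5 * v5 - 2 * v6


if __name__ == "__main__":
    taps = (10, 20, 30, 40)
    print(kernel_mul(*taps), kernel_shift(*taps), kernel_original(*taps))
```

With these definitions, the modified kernel differs from the original one by exactly `v5 - v4`, which is the approximation the simplification introduces.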
Pixel-in-Pixel-out Edge Filter (P-i-P-o Edge Filter)
We implement a Pixel-in-Pixel-out edge filter to integrate the loop and post filters into a unified architecture. As shown in Figure 2, the added MUXes switch between the different filtering functions. In loop filter mode, the filtering algorithm of H.264 is applied. The modified filtering algorithm of MPEG-4 Annex F.3 [4] is also realized in the loop/post filter architecture. Therefore, the proposed loop/post filter is suitable for implementing multiple video standards. We obtain an area-efficient de-blocking filter by exploiting the computational redundancy between the loop and post filters. The proposed loop/post filter with weak strength is depicted in Figure 2 (only p0' and q0' are shown). The extra shaded regions are added to the original loop filter design to perform the post filter. We partition the proposed loop/post filter into three main phases: difference generation, delta generation, and pixel generation. The difference generation phase initializes the pixel differences for edge filtering. The delta generation phase then generates the delta between the unfiltered and the filtered pixel, using the CLIP operation to limit the delta value between UB (Upper Bound) and LB (Lower Bound). Finally, the pixel generation phase adds the delta to the unfiltered data to obtain the final results.
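As an illustration of the three phases, the following sketch applies the H.264 normal-strength (bS < 4) update for p0' and q0'. Mapping the formula onto the three phases, the function names, and the use of `tc` as the clipping bound (LB = -tc, UB = tc) are assumptions made for this example; the alpha/beta threshold decision and the post-filter path are omitted.

```python
def clip3(lo, hi, x):
    """Limit x to the closed range [lo, hi]."""
    return max(lo, min(hi, x))


def weak_filter_p0_q0(p1, p0, q0, q1, tc):
    """Sketch of the weak-strength edge filter for 8-bit pixels (p0', q0' only)."""
    # Phase 1: difference generation across the edge
    d_inner = q0 - p0
    d_outer = p1 - q1

    # Phase 2: delta generation, clipped between LB = -tc and UB = +tc
    delta = clip3(-tc, tc, ((d_inner << 2) + d_outer + 4) >> 3)

    # Phase 3: pixel generation, adding the delta to the unfiltered pixels
    p0_new = clip3(0, 255, p0 + delta)
    q0_new = clip3(0, 255, q0 - delta)
    return p0_new, q0_new


if __name__ == "__main__":
    print(weak_filter_p0_q0(60, 60, 80, 80, tc=4))
```

In this example the unclipped delta is 8, so the CLIP step limits it to 4 and the step across the edge shrinks from 20 to 12.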
HIGH-THROUGHPUT ARCHITECTURE

Proposed Hybrid Scheduling
To reduce the overhead of reloading data when switching the filtering direction, we propose a hybrid filter scheduling that re-schedules the standard-defined edge order. In the standard-defined order, the de-blocking filter is performed on the vertical edges first, and then on the horizontal edges. From this order, we can deduce the filter order on each 4x4 sub-block, as shown in Figure 3(a): within one 4x4 block, the left edge is filtered first and the lower edge is filtered last. We propose a novel filter ordering that schedules the filter process on each edge as shown in Figure 3(b). All the numbered edges except the shaded ones are also processed for 8x8 post filtering. Therefore, there are in total 48 and 24 edges to be filtered in the loop and post filter operations respectively. The filter order within one macro-block still obeys the rule that the left edge comes first and the lower edge comes last. Compared to the traditional scheduling [1] [2], the proposed method avoids re-accessing data for the other direction and combines the vertical and horizontal filtering under the rules of the hybrid loop/post filter algorithm (see Table 2).
The main problem of the de-blocking filter in H.264/AVC is the considerable amount of memory access and processing cycles. To apply the proposed hybrid scheduling to the overall system and enhance the system throughput, we propose a high-throughput architecture for the de-blocking filter. Figure 4(a) shows the proposed design as a block diagram with its data flow. As shown in Figure 4(b), we choose the Line-of-Pixel (LoP) [2] data arrangement to improve pixel data utilization and reduce memory accesses. Single-port SRAM modules are used to keep the cost low; they store the adjacent and the current macro-block data. Before performing the loop/post de-blocking filter, we have to load the current data (i.e. 16 luma and 8 chroma 4x4 blocks, see the blank squares of Figure 3(b)) and the adjacent data (i.e. 8 luma and 8 chroma 4x4 blocks, see the grid squares of Figure 3(b)). Further, the external frame buffer is an off-chip memory whose size is determined by the frame size and the number of frames kept for long-term prediction. In Figure 4(a), the shaded arrows denote the data flow inside the de-blocking filter unit, and the black arrows denote the data flow outside it. The pixel buffer stores the intermediate pixel values when the proposed hybrid scheduling is applied; it holds four 4x4 blocks of pixels. The control unit is implemented according to Table 2. In addition, the edge filter is a simple pixel-in, pixel-out process, which was discussed in section 2.2.1.
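The totals of 48 and 24 edges can be reproduced with a short calculation. The sketch below assumes that each numbered edge is a 4-pixel segment, that every block contributes its left and upper edge, and that the macro-block is in 4:2:0 format (one 16x16 luma plane and two 8x8 chroma planes); this counting convention and the function name are assumptions for illustration.

```python
def edges_per_macroblock(block_size, segment=4):
    """Count 4-pixel edge segments per 4:2:0 macro-block for a given transform grid."""
    total = 0
    for plane_size in (16, 8, 8):  # luma, Cb, Cr plane widths in pixels
        edge_lines = plane_size // block_size       # filtered edge lines per direction
        segments_per_line = plane_size // segment   # 4-pixel segments along one line
        total += 2 * edge_lines * segments_per_line  # vertical plus horizontal edges
    return total


if __name__ == "__main__":
    print("loop filter (4x4 grid):", edges_per_macroblock(4))
    print("post filter (8x8 grid):", edges_per_macroblock(8))
```

Under these assumptions the 4x4 grid gives 32 luma and 16 chroma segments, and the 8x8 grid gives 16 luma and 8 chroma segments.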
Proposed Architecture of Loop/Post Filter
The detailed architecture of the de-blocking filter unit of Figure 4(a) is given in Figure 5. All data signals are 32 bits wide. There are four input signals {wt_B_0, wt_B_1, wt_B_2, wt_B_3} that write the four blocks of the pixel buffer. Further, there are three output signals {rd_B_0, rd_B_1, rd_B_2} that read three of these blocks, either to feed the P-i-P-o edge filter or to write to the frame buffer, the pixel buffer, or the adjacent MB memory. Following the same naming rule, each data flow represents writing to or reading from a storage module, namely the adjacent memory, the current memory, or the frame buffer. Having described the behavior of the pixel buffer, we use one MB with the 48 edges of the loop filter as an example to illustrate the remaining behavior of Figure 5. We omit the 24-edge post filter case, since it follows easily from similar operations. The behavior of Figure 5 can be partitioned into two main parts.
• Write Process is the writing mechanism through the signals {wt_A_0~2, wt_F_0~1, wt_B_0~3}.
• Read Process is the reading mechanism through the signals {rd_A_0~1, rd_C_0, rd_B_0~2}.
For writing to the adjacent memory, wt_A_0 writes the filtered data into the adjacent memory and is activated only on edges 6, 10, 14 and 16 (see Figure 3(b)). Further, wt_A_1 is activated on edges 31, 32, 40 and 48. For writing to the frame buffer, wt_F_0 writes filtered data into the external frame buffer. It is activated on every horizontal edge except those on which wt_A_1 or wt_B_0 is activated, since wt_F_0, wt_A_1 and wt_B_0 share the same root signal, P'_Pixel. Taking edge 6 as an example, the upper block of edge 6 is the P'_Pixel output of the edge filter. This block is written to the external frame buffer because it has been completely filtered on all of its edges {1,3,5,6}. The signal wt_F_1 works in the same way, except that its input comes from the output of the pixel buffer.
For reading from the adjacent memory, rd_A_0 is activated only on edges {1,2,17,18,31,33,34,39,41,42,47}. On edge 1, rd_A_0 is the input of the pixel buffer. However, for the vertical filtering of edges {5,9,13,15,21,25,29,37,45}, the data can be fed directly to the edge filter through rd_A_1. Finally, in contrast to existing approaches, the current memory in our proposed design is read-only: there is no need to store the results filtered in one direction into the current MB memory and read them back for the other direction. With the proposed hybrid scheduling, we combine the horizontal and vertical filtering processes into one filtering flow. Therefore, at most 4 sub-blocks are needed to perform the hybrid filtering.
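The adjacent-memory control described above amounts to a lookup on the edge index. The sketch below encodes only the edge sets stated in the text; the function name and the dictionary layout are hypothetical, and the frame-buffer and pixel-buffer signals are left out because their activation depends on details given only in Figure 3(b) and Figure 5.

```python
# Edge sets taken from the text (edge numbering as in Figure 3(b), loop filter, 48 edges)
WT_A_0_EDGES = frozenset({6, 10, 14, 16})
WT_A_1_EDGES = frozenset({31, 32, 40, 48})
RD_A_0_EDGES = frozenset({1, 2, 17, 18, 31, 33, 34, 39, 41, 42, 47})
RD_A_1_EDGES = frozenset({5, 9, 13, 15, 21, 25, 29, 37, 45})


def adjacent_memory_signals(edge):
    """Return which adjacent-memory read/write signals are active on a given edge."""
    if not 1 <= edge <= 48:
        raise ValueError("edge index must be in 1..48")
    return {
        "wt_A_0": edge in WT_A_0_EDGES,
        "wt_A_1": edge in WT_A_1_EDGES,
        "rd_A_0": edge in RD_A_0_EDGES,
        "rd_A_1": edge in RD_A_1_EDGES,
    }


if __name__ == "__main__":
    for edge in (1, 5, 6, 31):
        print(edge, adjacent_memory_signals(edge))
```

A table-driven decoder of this kind keeps the schedule in one place, which is convenient when the 24-edge post filter schedule is added as a second set of tables.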
SIMULATION RESULT
Simulation results are summarized in Table 3. The target technology is 0.18 μm, and the synthesized gate count is 22.5K, excluding the adjacent and current MB memories. Two single-port SRAMs store the current and adjacent MB data; their sizes are 96×32 and 64×32 respectively. We modify the post filter algorithm in [4] to make a compromise between integration cost and performance loss. We use "Foreman" and "Stefan" as our test sequences. As shown in Figure 6, the performance loss of the modified post filter is only 0.03 dB compared to the traditional post filter [4]. Moreover, the additional gate count for post filter processing is about 20.7% (i.e. 4.69k/22.5k, see Table 3).
For the loop filter operation in Table 4, the evaluated cycle counts are 151 cycles for the luma block and 90 cycles for the chroma blocks. Specifically, 4x32 cycles are needed to filter the horizontal and vertical edges of one luma MB. In addition, we need 12 cycles to write the filter results and incur 3 cycles due to structural hazards in our filtering process. In total, we need 151 (i.e. 8+4x32+12+3) cycles to filter the horizontal and vertical edges of a luma MB. By the same analysis, we need 90 cycles for the chroma blocks (i.e. 4+4x8+8+1=45 for each chroma component). Including 2 extra cycles for data hazards, the total is 243 cycles (i.e. 151+90+2). There is no need to account for cycle overhead in the control logic, since it operates in a pipelined fashion. In addition, the processing cycles of the post-loop filter are identical to those of the in-loop filter, since both share the same hardware architecture and data control flow. Compared with available approaches [1] [2], the proposed architecture saves about one-half of the processing cycles per MB. Originally, the de-blocking filter was the system bottleneck in terms of processing cycles (see Figure 1). With the proposed architecture, the de-blocking filter fits within 350 cycles/MB (i.e. the processing cycles of CAVLC in an I-frame), which improves system throughput: 350 cycles/MB corresponds to about 9523 MB/frame at 30 fps and 100 MHz. Therefore, this processing capability can decode 1080HD (1920x1088, i.e. 8160 MB/frame) or higher in 4:2:0 format in real time at a working frequency of 100 MHz.
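The two SRAM depths are consistent with the block counts given in Section 3. The following sketch assumes that one 32-bit word holds one line of four 8-bit pixels (the LoP arrangement) and that each 4x4 block therefore occupies four words; this mapping is an assumption that happens to reproduce the stated sizes, not a description taken from the paper.

```python
ROWS_PER_BLOCK = 4    # a 4x4 block has four pixel lines
PIXELS_PER_LINE = 4   # each line holds four pixels
BITS_PER_PIXEL = 8    # assumed 8-bit samples


def sram_depth(luma_blocks, chroma_blocks):
    """Number of words needed if every word stores one 4-pixel line."""
    return (luma_blocks + chroma_blocks) * ROWS_PER_BLOCK


word_width = PIXELS_PER_LINE * BITS_PER_PIXEL   # bits per SRAM word
current_depth = sram_depth(16, 8)               # current MB: 16 luma + 8 chroma blocks
adjacent_depth = sram_depth(8, 8)               # adjacent data: 8 luma + 8 chroma blocks

if __name__ == "__main__":
    print(f"current MB memory:  {current_depth}x{word_width}")
    print(f"adjacent MB memory: {adjacent_depth}x{word_width}")
```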
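The cycle and throughput figures above can be checked with a few lines of arithmetic. The variable names are illustrative, and the first term of each sum (8 for luma, 4 for each chroma component) is simply carried over as given, since the text does not itemize it.

```python
# Per-macro-block cycle counts, using the terms quoted in the text
luma_cycles = 8 + 4 * 32 + 12 + 3          # leading term + 4 cycles x 32 edges + write + hazard
chroma_cycles_each = 4 + 4 * 8 + 8 + 1     # same breakdown for one chroma component
chroma_cycles = 2 * chroma_cycles_each     # Cb and Cr
total_cycles = luma_cycles + chroma_cycles + 2   # plus 2 cycles of data hazard

# Throughput at the system-level budget of 350 cycles per macro-block
CLOCK_HZ = 100_000_000
FRAME_RATE = 30
BUDGET_CYCLES_PER_MB = 350
mb_per_frame_capacity = CLOCK_HZ // (BUDGET_CYCLES_PER_MB * FRAME_RATE)

# Macro-blocks in one 1080HD frame (1920x1088, 16x16 macro-blocks)
mb_per_1080hd_frame = (1920 // 16) * (1088 // 16)

if __name__ == "__main__":
    print("luma, chroma, total cycles:", luma_cycles, chroma_cycles, total_cycles)
    print("capacity (MB/frame):", mb_per_frame_capacity)
    print("1080HD needs (MB/frame):", mb_per_1080hd_frame)
    print("real-time feasible:", mb_per_frame_capacity >= mb_per_1080hd_frame)
```

The filter itself needs 243 cycles per MB, which is below the 350-cycle budget, so the de-blocking stage no longer limits the pipeline under these figures.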
CONCLUSION
An area-efficient and high-throughput de-blocking filter has been presented to meet the different requirements of multiple video coding standards. We modify the post filter algorithm to make a compromise between hardware integration complexity and performance loss. Further, we propose a hybrid scheduling to reduce processing cycles and improve system throughput. As a result, the proposed loop/post de-blocking architecture can easily achieve real-time decoding of 1080HD at 30 fps in 4:2:0 format.
