As the successor to the H.264/AVC video compression standard, High Efficiency Video Coding (HEVC) will play an important role in video coding. In its deblocking filter, HEVC inherits the basic properties of H.264/AVC and adds some new features. Based on these differences, this paper introduces a novel dual-mode deblocking filter architecture that supports both the HEVC and H.264/AVC standards. For HEVC, the proposed symmetric unified-cross unit (SUCU) based filtering scheme greatly reduces the design complexity; as a result, processing a 16 × 16 block takes 24 clock cycles. For H.264/AVC, processing a 16 × 16 macro-block (MB) takes 48 clock cycles. In synthesis, the proposed architecture occupies 41.6k equivalent gates at a frequency of 200 MHz in the SMIC 65 nm library, which satisfies the throughput requirement of super hi-vision (SHV) at 60 fps. With the filter reusing scheme, the unified design for the two standards saves 30% of the gate count in the filter part compared with dedicated filters. In addition, the total power consumption can be reduced by 57.2% with the skipping mode when edges need not be filtered.
Introduction
Digital video compression technology has developed enormously during the last twenty years. At present, commonly used video coding formats include MPEG-2, H.263, MPEG-4 and H.264/AVC. H.264/AVC is currently the most powerful video coding scheme and has been deployed widely in various applications [1], [2]. With consumers' rising demand for visual quality, high definition (HD), quad full high definition (QFHD) and even SHV display systems are coming into daily life. For example, NHK and BBC used SHV for public screenings during the 2012 London Olympic Games [3]. HEVC is proposed to satisfy this urgent demand and aims to reduce the bit rate by half with image quality comparable to advanced video codec (AVC) High Profile [4]. In the last two years, the Joint Collaborative Team on Video Coding (JCT-VC) has investigated many proposals and adopted promising ones into the HEVC Test Model (HM) for reference. The performance report shows that HM 5.0 could maintain an average bit rate of approximately 50% of AVC with the same subjective quality [5]. HEVC inherits the block-based hybrid video coding framework from H.264/AVC and therefore inevitably introduces block artifacts, mainly caused by the block-based discrete cosine transform (DCT) and motion compensation (MC) processes. As in H.264/AVC, a deblocking filter is employed in HEVC and adaptively applied to block boundaries to minimize block artifacts while preserving true edges, which provides better visual quality and bit rate reduction. The use of the deblocking filter in H.264/AVC saves 5-10% of the bit rate at equal objective quality [6]. However, due to its high adaptability and the small 4 × 4 basic processing block, the deblocking filter is one of the most computationally intensive components and becomes the bottleneck of VLSI design. Many works have focused on efficient deblocking filter design for H.264/AVC, such as [8]-[11].
In these works, parallelization is the most common technique for enhancing the throughput. Together with a suitable processing order and memory update structure, some works achieve high performance. For example, the design of [11] organizes 4 filters in 2 groups to process vertical and horizontal edges simultaneously. With the proposed zig-zag processing schedule, it takes 48 cycles for one macro-block (MB) in a pipeline, which can support QFHD at 60 fps. For edges that need not be filtered, the design of [8] utilizes clock gating to save power. The design of [10] applies two skipping patterns at the MB level to avoid unnecessary memory accesses and filtering.
All the aforementioned works are designed for H.264/AVC. There is no literature on the HEVC deblocking filter yet. After comparing the deblocking filters in HEVC and H.264/AVC, we find that the two are very similar in their filtering computations, so it is feasible to design a dual-standard deblocking filter that supports both. Since devices and media content using the currently popular H.264/AVC standard will not be replaced immediately, HEVC devices should remain backward compatible with H.264/AVC. In implementation, simply combining individual designs dedicated to each standard leads to high power and unjustified hardware cost. Thus it is necessary to design a universal decoder capable of supporting multiple video coding standards. This paper proposes a dual-mode architecture based on this idea.
In order to achieve real-time performance for videos with larger frame dimensions, HEVC significantly reduces the complexity, mainly through the use of a larger 8 × 8 processing block. Firstly, this effectively reduces the number of deblocked pixels. Samsung's report shows that the number of deblocked pixels is reduced by 41% on average for 1080p sequences [12]. Secondly, HEVC employs the coding block tree (CBT) structure and frame-based loop filtering. With the combination of the novel processing order and the 8 × 8 filtering block, HEVC removes some data dependency and reduces the design complexity to some extent. However, these changes bring problems as well. The frame-based processing is not suitable for hardware design, and the flexible coding unit (CU) size also adds difficulty to the architecture design. All the aforementioned works are MB-based designs and cannot simply be applied to the HEVC specification.
In this paper, 4 filters are utilized to process pixels in 4 lines simultaneously to enhance the throughput. By carefully studying the data dependency of the processing order of HEVC, symmetric unified-cross unit (SUCU) based processing scheme is introduced to solve the problems brought by the new features of HEVC. We also implement skipping mode to avoid unnecessary memory accesses and filtering operations, which could save the power consumption by up to 57.2%. This design is a dual-mode architecture that also supports H.264/AVC. The filter is fully reused, and 30% gate counts are saved compared with the individual implementations for the two standards. Finally, the proposed architecture processes one 16 × 16 block with 24 cycles for HEVC and 48 cycles for H.264/AVC. In synthesis result, the proposed architecture occupies 41.6k equivalent gate counts at 200 MHz in SMIC 65 nm library, which could satisfy the throughput requirement of QFHD on 60 fps for H.264/AVC and SHV on 60 fps for HEVC.
The rest of this paper is organized as follows. Section 2 introduces and compares the deblocking filter algorithm defined in HEVC and H.264/AVC standard. Section 3 presents the system architecture and function module design. In Sect. 4, the implementation results and power analysis discussion are addressed. Finally, Sect. 5 draws the conclusion.
Deblocking Filter Algorithm in HEVC
In HEVC, each frame is divided into a sequence of largest coding units (LCUs), which are processed in raster scan order as shown in an example frame in Fig. 1(a). Each LCU can be further split into CUs based on a quad-tree structure, and the size of each CU is not less than 8 × 8. The leaf CUs within an LCU are processed in Z scan order, as shown in the right part of Fig. 1(a). Each CU can be subdivided into prediction units (PUs) and transform units (TUs), which are drawn with dashed lines. Deblocking filtering takes place on a CU basis. For every CU, all the edges of 8 × 8 blocks are checked, and the edges of PUs, TUs or the CU are involved in the deblocking filter except at the boundary of a picture or slice.
A CU consists of luma and chroma (Cb, Cr) components as shown in Fig. 1(b). The vertical edges (V 0-3 ) and horizontal edges (H 0-3 ) that need to be filtered in the 16 × 16 CU are shown in bold. For each minimum unit of edge, 4 pixels (p 0-3 , q 0-3 or p 0-3 , q 0-3 ) on each side are involved in the filtering. For a vertical edge, the filtering is performed horizontally on each row of pixels, which is called horizontal filtering. For a horizontal edge, the filtering is performed vertically on each column of pixels, which is called vertical filtering. Basically, the vertical edges in the block are processed from left to right and the horizontal edges are processed from top to bottom. The deblocking filter is applied to the luma and chrominance components (Cb, Cr) successively.
The deblocking filter algorithm of H.264/AVC has been described in detail in previous literature [6]. This section mainly states the different filter orders and data dependency of the two standards based on HEVC draft 6 [14] and HM6.0 [15]. The filter on/off decision in HEVC, which is related to the proposed skipping mode, is also presented.
Frame-Based Loop Filtering and Data Dependency
The basic filter orders of HEVC and H.264/AVC are illustrated by the neighboring LCUs and MBs in Fig. 2 , where the symbol V n represents all the vertical edges in #n MB or #n LCU, while H n represents all the horizontal edges in #n MB or #n LCU.
HEVC has already adopted the frame-based filtering proposed by Sony Corporation [14]. As illustrated in the upper part of Fig. 2, the vertical edges in the whole frame are filtered first, and the horizontal edges are filtered afterwards. Filtering in each direction obeys the raster scan and Z scan order mentioned before. The filtering in H.264/AVC is MB-based loop processing. As illustrated in the lower part of Fig. 2, the vertical edges in one MB are filtered prior to the horizontal edges. Horizontal and vertical filtering in the same MB should be completed before moving to the next MB. Compared to the deblocking filter in H.264/AVC, the processing priority of the vertical edges in HEVC exists not only within the same LCU but also at the frame level.
In both standards, for each minimum edge unit, 4 pixels on each side are involved and up to 3 pixels on each side are finally modified. Since the H.264/AVC filter is basically performed on 4 × 4 blocks, the filtering operations on neighboring edges in the same direction depend on each other. For example, v 0n , v 1n , v 2n , v 3n in #n MB depend on each other. This kind of dependency is removed in HEVC because the filtering is performed on larger 8 × 8 blocks. For example, v 0n , v 1n in #n LCU are independent of each other. Moreover, with the frame-based processing, the vertical edges in the whole frame are also independent of each other, and the same holds for the horizontal edges.
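The dependency argument above can be checked with a small sketch. It assumes only the figures stated in the text (4 samples read and up to 3 samples modified on each side of an edge); the function name and the one-dimensional sample indexing are hypothetical and chosen for illustration.

```python
def edges_are_independent(spacing, read_taps=4, write_taps=3):
    """Check two parallel edges that are `spacing` samples apart.

    Edge A lies between samples -1 and 0, edge B between samples
    spacing-1 and spacing. The edges are independent when the samples
    one edge may modify never overlap the samples the other edge reads.
    """
    a_read = set(range(-read_taps, read_taps))
    a_written = set(range(-write_taps, write_taps))
    b_read = set(range(spacing - read_taps, spacing + read_taps))
    b_written = set(range(spacing - write_taps, spacing + write_taps))
    return a_written.isdisjoint(b_read) and b_written.isdisjoint(a_read)


# 4-sample grid (H.264/AVC) versus 8-sample grid (HEVC)
print("spacing 4 independent:", edges_are_independent(4))
print("spacing 8 independent:", edges_are_independent(8))
```

With a spacing of 4, the samples modified by one edge fall inside the read window of the next edge, which is the dependency described for H.264/AVC; with a spacing of 8 the windows no longer overlap.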
Filter Decision for HEVC
The filter decision for an edge includes two levels. The first is whether the filter is applied; the second is how strongly the filter is applied. Figure 3 depicts an example of a 4-pixel edge and the pixel samples involved in the filter decision. Boundary strength (BS) is one of the parameters that determine how strongly the filter is applied. BS takes values from 0∼2 according to the coding information of the blocks on both sides [15]. Compared to the BS values of 0∼4 in H.264/AVC [6], the BS calculation for HEVC is much simpler [14]. The BS of any chroma edge is identical to that of its corresponding luma edge. For a chroma edge, when BS is not equal to 2, the edge need not be filtered. For a luma edge, there are two cases in which the filter is not applied. Case (1) is BS = 0. Case (2) is that BS is non-zero and condition (3) is not satisfied. In condition (3), the parameter β is based on the quantization parameter (QP) value [6], and d0 + d3 is called the difference in the rest of this paper. The edges that need not be filtered are called skipped edges and are processed with the skipping mode proposed in the hardware implementation to reduce dynamic power.
After the filter is determined, strong filter or weak filter will be conditionally applied [14] .
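The on/off decision described above can be sketched as follows. Condition (3) is not reproduced in this text, so the sketch assumes the form used in the published HEVC standard: d0 and d3 are sums of absolute second differences on the first and last lines of the 4-pixel edge, and the edge is filtered when d0 + d3 is smaller than β. All function names are hypothetical.

```python
def second_difference(a, b, c):
    """Absolute second difference of three neighboring samples."""
    return abs(a - 2 * b + c)


def line_activity(p, q):
    """Activity of one line across the edge.

    p and q hold the samples on each side of the edge, with index 0
    nearest to the edge (p[0], q[0]) and index 3 farthest from it.
    """
    return second_difference(p[2], p[1], p[0]) + second_difference(q[2], q[1], q[0])


def luma_edge_is_filtered(bs, p_lines, q_lines, beta):
    """Assumed form of the luma decision: BS != 0 and d0 + d3 < beta."""
    if bs == 0:
        return False  # case (1): skipped edge
    d0 = line_activity(p_lines[0], q_lines[0])
    d3 = line_activity(p_lines[3], q_lines[3])
    difference = d0 + d3
    return difference < beta  # case (2) when this is False


def chroma_edge_is_filtered(bs):
    """A chroma edge is filtered only when BS equals 2."""
    return bs == 2


# A smooth ramp on both sides has zero second difference, so the edge is filtered.
smooth = [[10, 11, 12, 13]] * 4
print(luma_edge_is_filtered(1, smooth, smooth, beta=8))
```

A textured block yields a large difference and is therefore treated as a skipped edge, which is the situation the skipping mode later exploits.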
System Architecture and Function Module Design

SUCU-Based Processing
It is impractical to implement the frame-based process directly. Firstly, a mass of intermediate values from horizontal filtering need to be stored before the vertical filtering is performed, which brings high hardware cost. Secondly, the other components in the decoder are usually organized on CU basis. Performing deblocking filter on frame basis will degrade the performance of the whole decoder.
On the other hand, the deblocking filter is not suitable to be performed on a CU basis, because CUs may be of various sizes while the deblocking filter in HEVC is always performed on 8 × 8 blocks. The LCU size is relatively fixed, and it is downward compatible with H.264/AVC when the LCU size is set to 16 × 16. Thus, LCU-based processing seems the most reasonable method.
In order to make LCU-based processing produce the same results as frame-based processing, the problems with an LCU-based implementation must be considered. Figure 4 gives an example of LCU-based processing. To obey the basic processing order, the processing of the rightmost horizontal edges in the current LCU cannot start before the processing of the leftmost vertical edges in the next LCU is completed. For example, edges 21 and 22 should be processed after edges 17, 18, 19 and 20. From the time slots, it is easy to see that the filtering for #n+1, #n and #n-1 LCU is not sequential but interleaved, which introduces 3 drawbacks.
1. The control of this order is quite complex, which leads to significant hardware cost.
2. Filtering of each LCU involves the data from its upper, left and right neighboring LCUs. It increases the cost of buffers or memory accesses too much.
3. There is latency in the processing of each LCU, because the filtering of the current LCU cannot be completed before the data of the next LCU is available. The latency decreases the performance of the whole system.

These drawbacks mainly come from the data dependency between neighboring processing units. To solve this problem, it is better to combine different blocks to construct a sequence of new processing units with lower, or even no, data dependency between them. As shown in Fig. 5, a novel processing unit named SUCU is proposed; it consists of blocks from the current LCU, the left LCU and the upper LCUs. Each SUCU is symmetric and independent of its neighboring ones, so the SUCUs in the same frame can be processed sequentially in raster scan order, which avoids the disadvantages of normal LCU-based processing.
For each SUCU in this proposal, the hybrid processing order shown in Fig. 6 is used instead of the conventional order in Fig. 5. With this hybrid processing order, an SUCU can be further split into smaller basic units. The filtering of the larger SUCU is a simple loop of filtering over the smaller basic units, which greatly reduces the control complexity. The basic unit shown in Fig. 6 is named a cross unit since its 4 edges are arranged in a cross shape. Cross unit based processing can be applied to both luma and chroma components no matter how large the LCU is, which further reduces the control complexity. The cross units are independent of each other. For a cross unit, only four 4 × 4 blocks are involved in the filtering. The data is fully reused within the cross unit and the lifetime of intermediate data is as short as possible, which reduces the memory bandwidth and hardware cost.
Block Diagram
Figure 7 depicts the block diagram of the whole deblocking filter. The memory part contains two kinds of memories. Two single-port static random access memories (SRAMs) are used to provide the raw pixels of the current processing block and to store the final results. Each of them is 128 bits wide, so that the two together provide two 4 × 4 blocks at one time. In order to make the design adaptive to the largest LCU size, the depth of each SRAM is 192. The line buffer is used only for H.264/AVC to hold 4N pixels (N is the frame width), providing the necessary pixels not only for the current MB but also for the MBs in the same line.
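The SRAM dimensions above can be cross-checked with a short calculation. The sketch assumes 8-bit samples, 4:2:0 chroma subsampling and a largest LCU of 64 × 64; none of these is stated explicitly in this section, and the function name is hypothetical.

```python
def sram_depth(lcu_size=64, sample_bits=8, word_bits=128, banks=2):
    """Words needed per SRAM bank to hold one LCU (luma plus 4:2:0 chroma)."""
    luma_samples = lcu_size * lcu_size
    chroma_samples = 2 * (lcu_size // 2) ** 2      # Cb and Cr planes
    samples_per_word = word_bits // sample_bits    # 16 samples = one 4x4 block
    total_words = (luma_samples + chroma_samples) // samples_per_word
    return total_words // banks


print("depth per bank for a 64x64 LCU:", sram_depth())
```

Under these assumptions one 128-bit word holds exactly one 4 × 4 block, and the 384 blocks of a 64 × 64 LCU split evenly over the two banks, which agrees with the stated depth of 192.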
Since the neighboring filter operations on the same edge are independent of each other, it is feasible to process them in parallel. In order to achieve the high throughput required by SHV applications, 4 filters are employed to process 4 lines of pixels simultaneously. Eight buffers (T 0 ∼T 7 ) in the operation unit are made up of registers, and each can store 16 pixels. The buffers are used to store the temporary data for the edge filters. Four buffers are enough for HEVC. The other four buffers store the left four blocks from the previous MB and are used only for H.264/AVC.
The controller is used to control the filter order, and the skipping mode designed for the skipped edges described in Sect. 2. The controller detects the skipped edges according to the values of BS and difference. With the skipping mode, the memory accesses and operation unit are disabled respectively according to the different skip states. The details will be illustrated in Sect. 3.4.
Processing Flow for HEVC
For a cross unit, two vertical edges and two horizontal edges need to be filtered. Four buffers (T 0 ∼T 3 ) are used to store the temporary data. The filter order, memory update scenario and buffer usage for a cross unit are illustrated in Fig. 8.
In the first clock cycle, the vertical edge V 0 is processed: the 4 × 4 blocks B 0 and B 1 are read from memories S 0 and S 1 respectively, B 0 and B 1 are filtered, and the temporary data is stored in buffers T 0 and T 1 . In the second clock cycle, B 2 and B 3 are fetched and edge V 1 is processed; the temporary data is stored in T 2 and T 3 . In the third clock cycle, the horizontal edge H 0 is processed: the blocks stored in T 0 and T 2 are filtered and the results are written to memory. Finally, H 1 is processed: the blocks in T 1 and T 3 are filtered and written to memory. A cross unit is thus completed in 4 clock cycles, and 24 clock cycles are needed to process a 16 × 16 SUCU, of which 16 are for the luma component and 8 are for the chroma components.
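The cycle counts above can be reproduced with a small model of the schedule. The four-step schedule is taken from the text; the rule that each 8 × 8 sample area of a plane forms one cross unit, and the 4:2:0 chroma format, are assumptions chosen because they match the stated 16 + 8 cycles. All names are hypothetical.

```python
# One entry per clock cycle: (edge, source of the two blocks, destination)
CROSS_UNIT_SCHEDULE = [
    ("V0", "B0, B1 read from SRAM S0/S1", "buffers T0, T1"),
    ("V1", "B2, B3 read from SRAM S0/S1", "buffers T2, T3"),
    ("H0", "buffers T0, T2", "written back to SRAM"),
    ("H1", "buffers T1, T3", "written back to SRAM"),
]
CYCLES_PER_CROSS_UNIT = len(CROSS_UNIT_SCHEDULE)


def sucu_cycles(size=16):
    """Clock cycles for one size x size SUCU, split into luma and chroma."""
    luma_units = (size // 8) ** 2                 # one cross unit per 8x8 luma area
    chroma_units = 2 * ((size // 2) // 8) ** 2    # Cb and Cr planes in 4:2:0
    luma = luma_units * CYCLES_PER_CROSS_UNIT
    chroma = chroma_units * CYCLES_PER_CROSS_UNIT
    return {"luma": luma, "chroma": chroma, "total": luma + chroma}


for cycle, (edge, source, destination) in enumerate(CROSS_UNIT_SCHEDULE, start=1):
    print(f"cycle {cycle}: filter {edge} using {source} -> {destination}")
print(sucu_cycles(16))
```

Because every cross unit follows the same four-step schedule, the controller only needs a loop over cross units, which is the control simplification the SUCU scheme aims for.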
The purpose of splitting the memory into two banks is to arrange the final filtered results in the proper sequence. For example, B 0 and B 1 are fetched from the data bus and sent to the filters as a pair, but the processing of B 0 is completed before that of B 1 , so they cannot be written to the SRAM in the same clock cycle. If there were only one bank, B 0 and B 1 would be stored at different addresses after the filtering and could not be output to the data bus as a pair. In our design, as shown in Fig. 8, the final B 0 and B 1 are stored in different memories, and it is possible to output them together by selecting the addresses.
Skipping Mode for HEVC
As described in Sect. 2, some edges need not be filtered in several cases. The skipping mode is designed on a cross-unit basis. Basically, the skipping scheme for the deblocking filter includes two parts: edge filter skipping and memory access skipping. Edge filter skipping can be realized by adding clock gating, as shown in Fig. 9(a). Memory access skipping simply sends disable signals to the memory, as shown in Fig. 9(b).
This design includes both edge filter skipping and memory access skipping. The edge filter skipping part is almost the same as in [8] and [10]. The implementation method for the memory access skipping part is also similar to [10]. The difference lies in how the control signals are generated and how the skipping is made as efficient as possible. HEVC has less data dependency than H.264/AVC, so it is possible to achieve more efficient memory access skipping.
Since each edge has two states, skipped or filtered, there are 16 states for a cross unit. The state in which all the edges need to be filtered is the normal state. In the normal state, the filtering operations on 4 lines of pixels and the memory accesses for 2 blocks per clock cycle are completed in 4 clock cycles. The other 15 states are called skip states. In each skip state, there are 1∼4 skipped edges in the cross unit. If skip states can be detected in advance, unnecessary filtering and memory accesses can be disabled.
For filters, the situation is quite simple. Clock gating is used to terminate the filtering operations when the edge is skipped.
For memory, the different states lead to a more complex situation. The 15 skip states are classified into 3 types according to the number of skipped edges. As shown in Fig. 10, in Type1, four edges are skipped. In Type2, three edges are skipped. In Type3, one or two edges are skipped. For Type1, 100% of the memory accesses and filtering operations can be saved. For Type2, the two involved blocks need to be loaded from the memory for the single filtered edge. In the right instance of Fig. 10, the two blocks B 0 and B 2 are loaded from memory for edge H 0 , so 50% of the memory accesses are saved compared to the normal state, in which four blocks need to be loaded. For Type3, all the blocks in a cross unit are involved. They must be loaded from the memory and written back after the filtering, so no memory accesses can be saved.
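The classification above can be enumerated directly. The sketch below lists all 16 states of a cross unit and maps each to its type and the fraction of memory accesses saved, using only the rules stated in the text; the function and variable names are hypothetical.

```python
from collections import Counter
from itertools import product


def classify_state(skipped):
    """Map the skip flags of (V0, V1, H0, H1) to (type, memory accesses saved)."""
    count = sum(skipped)
    if count == 0:
        return "normal", 0.0
    if count == 4:
        return "Type1", 1.0   # nothing is loaded or filtered
    if count == 3:
        return "Type2", 0.5   # only the two blocks of the filtered edge are loaded
    return "Type3", 0.0       # one or two skipped edges: all four blocks are needed


# Enumerate every combination of skipped (True) and filtered (False) edges.
histogram = Counter(classify_state(state)[0] for state in product([False, True], repeat=4))
print(dict(histogram))
```

The enumeration gives one normal state and 15 skip states (one of Type1, four of Type2 and ten of Type3), so memory savings are available in 5 of the 15 skip states while filter gating applies to all of them.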
As described in Sect. 2, the filter is decided by BS and difference. For vertical edges, the difference is calculated based on raw data, while for horizontal edges, the difference is calculated by the intermediate data of the last filtering. The final state of vertical edges could be decided before the filtering on cross unit is started. But it is different for horizontal edges. For skipping mode design in controller, these different cases are analyzed in the following way.
The cases are shown in Fig. 11 according to the different situations. Figure 11 illustrates three kinds of conditions. A solid edge is a filtered edge and a dashed edge is a skipped edge. (BS 0 BS 1 ) represents the BS situation of H 0 and H 1 , where 0 represents a zero value and 1 represents a non-zero value.
1. Both V 0 and V 1 are skipped. The pixel values will not be changed, so D 2 and D 3 are valid for deciding the states of H 0 and H 1 . Together with their BS values, the memory accesses can be determined before the whole filtering. For example, when the BS situation of H 0 and H 1 is (0 1), if D 3 does not satisfy condition (3) in Sect. 2, all the edges are skipped and 100% of the memory accesses are saved; otherwise, 50% are saved.
2. Only one of V 0 and V 1 is skipped. For (0 0), the final state is Type2. For (0 1), (1 0) and (1 1), the final state cannot be decided in advance and all the blocks need to be loaded to the filters. These conditions are treated as Type3 and no memory accesses can be saved. D 2 and D 3 are corrected after the horizontal filtering and used to determine the filtering operations of H 0 and H 1 .
3. Neither V 0 nor V 1 is skipped. These conditions are also treated as Type3 and no memory accesses are saved. However, the filtering operations of H 0 and H 1 can be saved if the BS values and the corrected D 2 and D 3 satisfy the skip conditions.

In order to minimize the number of such buffers, the order should release the data in one buffer as fast as possible. Based on this principle, the order for H.264/AVC is considered as follows. Buffers T 4 ∼T 7 are used to store the blocks of the rightmost column from the previous MB. In order to release these 4 buffers, edges 0∼3 should be filtered first. T 0 should be released in the next step, so it is better to filter the edges around T 0 as early as possible. Horizontal filtering takes priority over vertical filtering, so edge 4 is filtered next. One clock cycle is needed for the data update of T 0 and T 4 , so we cannot immediately go to edge 6; instead, edge 5 is filtered next. After edge 6, for the same reason, the next edge is not edge 8 but edge 7. When all the edges around T 0 are finished, the data can be written to the memory and this buffer is released for the next unit.
Based on the above consideration, the following order is decided in the same way. It looks irregular for some part, but the order is basically from left to right and from top to bottom. Moreover, it should guarantee that blocks of rightmost column should be stored by T 4 ∼T 7 for the use of next MB.
Finally, the proposed order takes 32 clock cycles for luma filtering and 16 clock cycles for chroma filtering. Thus, 48 clock cycles are needed for each MB in H.264/AVC on average.
Filter Reusing Scheme
Because of the inheritance property between H.264/AVC and HEVC, the filtering equations for two standards are similar. The proposed architecture explores this similarity and gives the reusing scheme in structure of edge filter.
According to the filtering equations listed in the two standards [7] , [14] , all the reused components are listed in Fig. 13 . There are two kinds of reusing:
1. Some parts of the equations are the same and the same parts are extracted as the intermediate results, which could be reused for several outputs. The inter 1 and inter 2 in Fig. 13 are the examples of this type.
2. The final equations of outputs are exactly the same. The p 1 and p 2 in Fig. 13 are the examples of this type.
We have synthesized both the reused design and a design that simply puts the HEVC and H.264/AVC filters together. The reused design saves 30% of the gate count in the implementation.
Implementation and Analysis
Synthesis Results
The proposed dual-mode deblocking filter architecture for HEVC-H.264/AVC is synthesized with the SMIC 65 nm 1.08 V library. It is implemented with an equivalent gate count of 41.6k at a frequency of 200 MHz. To the best of our knowledge, there is no implementation of the HEVC deblocking filter in the open literature. In order to clarify its performance and specification, a comparison with the state-of-the-art work for H.264/AVC [11] is listed in Table 2.
As shown in Table 2, the major advantages of our design are its support for two standards and its skipping mode. The cost of these advantages is that the gate count is 37.7% higher than that of [11]. This increase in hardware cost has three causes:
1. The operation units are much more complex. Although the filter reuse reduces the gate count of the edge filter by about 30% compared with simply stacking two filters for the two standards, the edge filter of this design is still more complex than that of [11].
2. The eight register-based buffers are another hardware-consuming part. As explained in previous sections, these buffers are used to store the temporary data. However, the memory accesses are greatly reduced in this way.
3. The controller becomes bigger because of the skipping mode control.
In terms of throughput, our proposed architecture needs 24 clock cycles for a 16 × 16 block in HEVC mode. Thus, it achieves 2.13 × 10^9 pixels per second, which easily satisfies the requirement of SHV (7680 × 4320) at 60 fps [16]. For H.264/AVC, the throughput is half of this because twice as many edges must be filtered, and it still satisfies the requirement of QFHD (4096 × 2160). Some SHV and QFHD frames are tested with the synthesized netlist in Modelsim, which confirms that the function is correct and the required timing is satisfied.
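The throughput figures above follow from the clock frequency and the cycle counts. The sketch below repeats the arithmetic, counting one 16 × 16 block as 256 pixels; the 60 fps frame rate for QFHD is taken from the introduction, and the function name is hypothetical.

```python
def pixels_per_second(frequency_hz, cycles_per_block, block_size=16):
    """Sustained throughput when each block_size x block_size block takes a fixed cycle count."""
    return frequency_hz * block_size * block_size / cycles_per_block


FREQUENCY_HZ = 200e6
hevc_rate = pixels_per_second(FREQUENCY_HZ, cycles_per_block=24)
avc_rate = pixels_per_second(FREQUENCY_HZ, cycles_per_block=48)

shv_requirement = 7680 * 4320 * 60    # SHV at 60 fps
qfhd_requirement = 4096 * 2160 * 60   # QFHD at 60 fps

print(f"HEVC:      {hevc_rate:.3e} pixels/s, SHV needs  {shv_requirement:.3e}")
print(f"H.264/AVC: {avc_rate:.3e} pixels/s, QFHD needs {qfhd_requirement:.3e}")
```

The HEVC mode leaves a margin of roughly 7% over the SHV requirement, while the H.264/AVC mode has about twice the rate that QFHD needs.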
Power Analysis in Skipping Mode
As described in Sect. 2, when BS is 0 or the difference is larger than the threshold, skipping mode is triggered by the controller. Under the skipping situation, the power consumption could be greatly reduced in two aspects:
1. The memory accesses are eliminated by sending a disable signal from the controller to the memory.
2. The edge filters are stopped by the clock gating technique.
In Sect. 3.4, the different skipping situations have been discussed. Taking the sequence (3) in Table 1 as an example, the power analysis for skipping mode is carried out in the VCS and Power Compiler after synthesis. Figure 14 shows the power reduction brought by memory accesses elimination and filter clock gating.
As shown in Fig. 14, when 86% of the edges are skipped, the memory access elimination and the clock gating bring 30.0% and 38.9% power reduction respectively. In total, the power reduction reaches 57.2%.
Conclusion
This paper introduces a novel dual-mode deblocking filter architecture that supports both the HEVC and H.264/AVC standards. For the H.264/AVC standard, it takes 32 clock cycles for the luminance component and 16 clock cycles for the chrominance components of one 16 × 16 MB. For the HEVC standard, the proposed SUCU filtering scheme greatly reduces the design complexity. As a result, a 16 × 16 coding unit needs 16 and 8 clock cycles for the luminance and chrominance components respectively. In the implementation, the proposed design occupies 41.6k equivalent gates at a frequency of 200 MHz in the SMIC 65 nm library, which easily satisfies the throughput requirement of SHV. In addition, the total power consumption can be reduced by 57.2% with the skipping mode when edges need not be filtered.
