Abstract-The H.264/AVC standard defines an in-loop deblocking filter which is used in both the encoder and decoder. This work examines several methods for improving the performance of the H.264/AVC reference software implementation of the deblocking filter. Methods examined include general software optimization, parallelization through standard multimedia SIMD instructions, and augmenting standard SIMD instruction sets with new instructions. Using the above methods, we are able to achieve a large speedup of the deblocking filter computation.

I. INTRODUCTION

The H.264/AVC specification includes a standard integrated loop filter for deblocking in both the encoder and decoder. It is adaptive on several levels, from the slice level down to the pixel level. The deblocking filter forms a significant part of the computational complexity of the codec. In the H.264/AVC case, it is estimated that the deblocking filter forms one third of the computational complexity of the decoder [5], making it the most computationally demanding function of the decoder. Therefore, methods to improve the performance of the deblocking filter will provide significant performance benefits for H.264/AVC implementations.

The high computational requirements of the H.264/AVC deblocking filter have resulted in several dedicated hardware architectures for the operation, such as found in [2] or [3].

This paper explores a variety of methods to improve the performance of the reference implementation. We first examine performance improvement through general software optimization. We also explore improving the performance by using standard vector single-instruction-multiple-data (SIMD) instructions, available in current multimedia SIMD instruction sets. Lastly, we look at improvements achievable through adding new instruction extensions, specialized for the deblocking filter application.

II. BACKGROUND

A. Deblocking Filters

The block-based nature of many forms of video coding produces blocking effects in the output video. H.264/AVC, a block-based video coding standard, is prone to such effects. There are a number of sources for the distortion at block edges, as described in [1]. Typical sources of blocking effects include the block-based discrete cosine transform (DCT) and the edges of blocks used in motion compensated prediction. Because of these blocking effects, it is important to use some form of deblocking filtering on the output video sequence.

There are two main methods used to implement deblocking filters. They may be implemented either as a loop filter or as a post filter, with tradeoffs inherent in either approach [1]. Loop filtering can provide better visual quality and rate distortion performance, but it incurs additional computational requirements. H.264/AVC includes a normative loop filter, which is described in more detail below.

3) Pixel Level. At the finest granularity, the filter that is applied can vary at the pixel (pel) level. A decision is made at the pixel level by evaluating the difference between pixels next to the edge. If the difference between these pixels is higher than some quantization level dependent threshold [...]

A fundamental requirement of employing SIMD instructions in an application is that it is possible to arrange operations and data such that vector operations can be used. The pixel-level and block-level adaptivity of the H.264/AVC deblocking filter causes some problems here. In [1], the authors state that the conditional branching required consumes a large portion of the filtering time, and that the structure of the filter makes it ill-suited to SIMD implementation. A major impediment to SIMD execution results from the [...]

[...] with original source code, conditional checks are performed on a pixel-by-pixel basis to determine what filtering should be applied. It is not possible to design efficient SIMD code while preserving these conditional checks. Instead, we used a "predicated" approach for the filtering operations to allow for SIMD calculation of eight positions simultaneously. The results for all possible branches are calculated. Then, after filtering half a macroblock edge (eight positions), we select results from the different registers based on a set of conditionals held in separate registers.
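The predicated approach described above can be illustrated with a minimal sketch in C using SSE2 intrinsics. It is not taken from the reference software: the function names (`abs_diff_lt`, `select16`, `filter_q0_predicated`, `filter_q0_scalar`, `predicated_matches_scalar`) are hypothetical, and the rounded average used as the "filtered" result is only a placeholder, not the normative H.264/AVC filter. The sketch assumes pixel values have already been widened to 16-bit lanes, so that eight edge positions fit in one 128-bit register.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stdlib.h>     /* abs */
#include <string.h>     /* memcmp */

/* Per-lane mask: 0xFFFF where |a - b| < thr, 0x0000 elsewhere. */
static __m128i abs_diff_lt(__m128i a, __m128i b, __m128i thr)
{
    __m128i d   = _mm_sub_epi16(a, b);
    __m128i neg = _mm_sub_epi16(_mm_setzero_si128(), d);
    __m128i ad  = _mm_max_epi16(d, neg);           /* |a - b| */
    return _mm_cmplt_epi16(ad, thr);
}

/* Per-lane select: mask ? if_set : if_clear. */
static __m128i select16(__m128i mask, __m128i if_set, __m128i if_clear)
{
    return _mm_or_si128(_mm_and_si128(mask, if_set),
                        _mm_andnot_si128(mask, if_clear));
}

/* Predicated version: both outcomes are computed for all eight lanes,
 * then the mask picks one per lane. No per-pixel branch is executed. */
void filter_q0_predicated(const int16_t p0[8], int16_t q0[8], int16_t alpha)
{
    __m128i P0   = _mm_loadu_si128((const __m128i *)p0);
    __m128i Q0   = _mm_loadu_si128((const __m128i *)q0);
    __m128i mask = abs_diff_lt(P0, Q0, _mm_set1_epi16(alpha));

    /* Placeholder filter (assumption): rounded average of p0 and q0. */
    __m128i sum      = _mm_add_epi16(_mm_add_epi16(P0, Q0), _mm_set1_epi16(1));
    __m128i filtered = _mm_srai_epi16(sum, 1);

    _mm_storeu_si128((__m128i *)q0, select16(mask, filtered, Q0));
}

/* Branching scalar version with the same placeholder filter. */
void filter_q0_scalar(const int16_t p0[8], int16_t q0[8], int16_t alpha)
{
    for (int i = 0; i < 8; i++) {
        if (abs(p0[i] - q0[i]) < alpha)
            q0[i] = (int16_t)((p0[i] + q0[i] + 1) >> 1);
    }
}

/* Returns 1 if both versions agree on a small hand-picked input. */
int predicated_matches_scalar(void)
{
    const int16_t p0[8] = {100, 100, 100, 100, 20, 20, 200, 0};
    int16_t a[8] = {104, 90, 160, 100, 25, 90, 199, 255};
    int16_t b[8] = {104, 90, 160, 100, 25, 90, 199, 255};

    filter_q0_predicated(p0, a, 10);
    filter_q0_scalar(p0, b, 10);
    return memcmp(a, b, sizeof a) == 0;
}
```

The predicated version performs the filter arithmetic even for lanes that end up unfiltered. This trade is only worthwhile when processing eight positions at once saves more than the redundant computation costs.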
This approach allows for speedup in this case due to the fact that the gains from performing filtering [...] partially with vector SIMD and partially with serial instructions, [...]

In order to allow for parallel operation, we store the results of the conditional checks employed in the filtering in Boolean mask registers; each register holds eight 16-bit values, each of which is either 0xFFFF or 0x0000. Filter calculations are performed using SSE2's parallel arithmetic operations. After filter computations are complete, the different possible output registers are selected using the conditional registers. A code sample using SSE2 intrinsic functions and a conditional register is given below.

    Q0 = _mm_or_si128(
             _mm_and_si128(aq2, Q0_path1),
             _mm_andnot_si128(aq2, Q0_path2));

In the example given, the value of Q0 is set based on a combination of values from Q0_path1 and Q0_path2. The selection of elements is based on the conditional register aq2: lanes where aq2 holds 0xFFFF take the value from Q0_path1, and lanes where it holds 0x0000 take the value from Q0_path2. This type of operation happens regularly in the code, which led us to believe a new instruction to speed the operation above would be a good addition to the vector instruction set. The investigation into possible new instructions is described later in the paper.

[...] software optimization, we also examined the possibility of improving the speedup through the addition of new SIMD instructions. To test and implement the new instructions, we employed the Xtensa tool suite from Tensilica [6]. The Xtensa tool set allows the user to customize a base Xtensa processor with new instructions. An application can be built with and without instruction extensions to measure the performance difference resulting from the new instructions.

Tensilica includes a base SIMD engine which can be built with an Xtensa processor, the Vectra LX DSP engine. Similar to SSE2, Vectra provides SIMD operations which are able to work on a set of eight packed 16-bit integers. The new instructions were created using Tensilica Instruction Extensions (TIE) on top of a Vectra base. The next sections will describe different instructions that we implemented in TIE to improve the performance of the deblocking filter.

B. New Instructions

1) Reproduced SSE2 Instructions: SSE2 provides some general purpose SIMD instructions that are missing in the Vectra SIMD set. An important set of instructions missing from Vectra were SSE2's mask-generating comparison instructions. [...]

[...] In this paper, we have shown that a SIMD approach can be used to improve the performance of the H.264/AVC deblocking filter. We demonstrate that speedup is possible using SIMD, even in highly data-adaptive processing algorithms such as the H.264/AVC deblocking filter. Using standard SSE2 SIMD and a predicated approach, we have achieved a large speedup over the reference implementation. This paper also shows that the inclusion of some specific new instructions into existing SIMD instruction sets can provide additional speedup for the deblocking filter. Using a combination of the approaches described in this paper, it is possible to greatly speed up the H.264/AVC reference deblocking filter implementation.

4) Parallel Absolute Comparison. The pixel-level adaptivity of the H.264/AVC deblocking filter arises from comparisons made to the absolute difference between two pixel values (luma or chroma). To speed these operations, we introduced a SIMD mask-generating function which takes three arguments: value A, value B, and a threshold. The entries of the output register are formed by SIMD comparison to see if the absolute difference between vectors A and B is below the threshold vector. The result is a binary vector comprising fields of ones or zeroes, to be used as a data selection mask.

5) Parallel MUX instruction: Another instruction that we have introduced using TIE is dblk-select. This is a parallel MUX-type operation with three vector input registers. One of
