I. INTRODUCTION
COMPUTER architecture and video coding have mutually influenced each other during their technological advancements. The mutual influence is especially strong in the deployment of single instruction multiple data (SIMD) instructions. SIMD instructions were first introduced for the purpose of software-only MPEG-1 real-time decoding using general purpose processors (GPPs) [1], [2]. Since then, the concept of SIMD has been further exploited by many architectures for many subsequent video coding standards [3]. To allow for more efficient implementation, video coding standards also consider the SIMD capabilities of processors during the standardization process. For instance, explicit effort is made to define reasonable intermediate computation precision and eliminate sample dependencies.
The Joint Collaborative Team on Video Coding (JCTVC) has recently released the High Efficiency Video Coding (HEVC) [4] coding standard. HEVC allows for 50% bitrate reduction with the same subjective quality compared to H.264/AVC [5]. Similar to previous standards, significant attention was paid to allow the new standard to be accelerated with SIMD and custom hardware solutions. It is commonly known that hardware-only solutions can potentially provide much higher energy efficiency compared with software solutions using GPPs. Optimized software solutions, however, are required on platforms where hardware acceleration is not available, require less implementation effort and a shorter time-to-market, and avoid hardware overspecialization problems [6]. Many of the improvements to the coding tools responsible for coding efficiency improvements in HEVC over previous standards are also beneficial for accelerating the codec using SIMD. Among others, the larger block sizes, more accurate interpolation filter, and parallel deblocking filter (DF) could make SIMD acceleration even more important than in previous video coding standards.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2014.2364413
While HEVC has significant potential for SIMD acceleration, the HEVC standard is also more complex than previous ones. With support for three different coding tree block (CTB) sizes, more transform sizes, an additional loop filter, and more intra prediction angles, significantly more effort is required for fully accelerating HEVC using SIMD. This will only become more complex with the addition of future range extensions, which will introduce more chroma formats and higher bit depths. In addition, in recent years, much more diversity has been introduced in the instruction set architectures (ISAs) and their microarchitectures. SIMD ISAs have become more complex with multiple instruction set extensions, and even within the same ISA, the instructions have different performance characteristics depending on the implementation.
In this paper, we investigate the impact SIMD acceleration has on HEVC decoding. For this, the entire HEVC decoder has been accelerated, i.e., SIMD has been applied to all suitable kernels and operations. An implementation has been developed for all recent x86 SIMD extensions as well as ARM NEON. The main contributions of this paper are as follows.
1) SIMD acceleration is presented for all the data-parallel kernels of the HEVC decoder (main and main10 profiles).
2) Implementation and optimization of the HEVC decoder is performed for all relevant SIMD ISAs, including NEON, SSE2, SSSE3, SSE4.1, Extended Operations, and AVX2.
3) The effect of an interleaved chroma format on the SIMD implementation is investigated.
4) Performance evaluation is performed on 14 platforms providing a large coverage of recent architectures for 1080p (high definition) and 2160p [ultra-HD (UHD)] resolutions.
5) With SIMD optimizations the decoder is able to process up to 133 frames/s for 1080p and 37.8 frames/s for 2160p on average on recent architectures.
This paper is organized as follows. First, Section II gives an introduction to SIMD ISAs, while Section III presents the related work. Section IV describes the optimized HEVC decoder used as a baseline. Section V presents the SIMD implementation of the suitable HEVC decoding kernels. Section VI details the experimental setup, and in Section VII the performance results are discussed. Finally, in Section VIII, the conclusion is drawn.
II. OVERVIEW OF SIMD INSTRUCTIONS
SIMD instructions for GPPs are a variation of the classical SIMD computing paradigm [7] . Their purpose is, as the name implies, to process multiple data elements with the same instruction. In a GPP, this is achieved by partitioning each register into subwords and applying the same operation to all of them. With this approach, SIMD instructions offer significant performance improvements with relatively little additional hardware. The algorithm, however, must be suited for SIMD acceleration, i.e., it must contain data level parallelism.
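The subword partitioning described above can be illustrated without any real SIMD hardware. The following is a minimal sketch that models one 32-b register holding four 8-b lanes; the helper names (`pack`, `unpack`, `packed_add_u8`) are hypothetical and chosen for illustration only. It shows how a single word-wide operation can add all lanes at once as long as carries are kept from crossing lane boundaries.

```python
# Illustrative model of subword parallelism: four 8-bit lanes in one 32-bit word.
# All names here are assumptions made for this sketch, not part of any ISA.
LANES = 4
LANE_BITS = 8
HIGH = 0x80808080  # top bit of every lane
LOW = 0x7F7F7F7F   # low seven bits of every lane


def pack(values):
    """Pack four 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFF for i in range(LANES)]


def packed_add_u8(a, b):
    """Lane-wise modulo-256 add using one word-wide addition.

    The low seven bits of each lane are added first, so a carry can reach
    at most the top bit of its own lane; the top bits are then combined
    with XOR, which discards the carry out of the lane.
    """
    low_sum = (a & LOW) + (b & LOW)
    return (low_sum ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF
```

For example, `unpack(packed_add_u8(pack([250, 1, 127, 200]), pack([10, 2, 1, 100])))` yields `[4, 3, 128, 44]`, i.e., each lane wraps around independently. Hardware SIMD instructions provide this lane isolation directly, and usually saturating variants as well.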
SIMD instructions were first introduced for the PA-RISC architecture in [8] for accelerating MPEG-1 video decoding [9]. Since then, they have been included in almost all architectures and have been used for accelerating many applications apart from video decoding. One of the main properties of an SIMD ISA is the SIMD width, which defines the number of elements that can be processed in parallel within a register. The first SIMD ISAs, such as MAX-1 for the PA-RISC and MMX for x86, used 64-b registers. The next generation, including Altivec for PowerPC, SSE for x86, and NEON for ARM, increased the SIMD width to 128 b. In 2013, Intel introduced a new extension called AVX2 that increased the SIMD width to 256 b. Recently, a new extension specification for the x86 architecture with 512-b registers, called AVX-512, has been released. Table I shows an overview of the SIMD extensions released over the years.
In general, an SIMD ISA supports arithmetic, logical, load, store, type conversion, and data swizzling instructions. Some SIMD ISAs, however, are more complete than others and support, for instance, more packed data types or operations. The load and store instructions depend on the SIMD vector width, but some ISAs have stricter rules regarding data alignment than others. Often there are also differences in the data swizzling instructions that rearrange the data inside a vector. Although SIMD acceleration for most algorithms can generally follow a similar method for different ISAs, an optimal solution must be tuned individually for each ISA.
III. RELATED WORK
Using SIMD extensions, it was possible in 1995, for the first time with software only, to decode Common Intermediate Format (352 × 288) MPEG-1 videos in real time (25 frames/s) [1], [9] using a workstation running at 80 MHz. After that, SIMD instructions have been used for accelerating different video codecs such as MPEG-2, MPEG-4 Part 2, and more recently H.264/AVC and HEVC. A summary of works reporting SIMD optimization for codecs before H.264/AVC can be found in [3].
In the case of H.264/AVC, SIMD has been used to accelerate luma and chroma interpolation filters, inverse transform (IT), and DF. Using SSE2, complete application speedups ranging from 2.0 to 4.0 have been reported [16], [17]. Real-time decoding of 720p content using a Pentium IV processor with SSE3 and a low-power Pentium-M processor with SSE2 has been reported [18].
Some recent works have proposed SIMD acceleration for HEVC decoding. Yan et al. [19] have reported a decoder with Intel SSE2 optimization for luma and chroma interpolation filters, adaptive loop filter (not included in the final HEVC standard), DF, and IT. The obtained speedups are 6.08, 2.21, 5.21, and 2.98 for each kernel, respectively. The total application speedup is 4.16, taking as baseline the HEVC HM4.0 reference decoder. Using an Intel i5 processor running at 2.4 GHz, this system can decode 1080p videos at 27.7 to 44.4 frames/s depending on the content and bitrate. Bossen et al. [20] have presented an optimized HEVC decoder using SSE4.1 and ARM NEON. On an Intel i7 processor running at a turbo frequency of 3.6 GHz, the decoder can process more than 60 frames/s for 1080p video up to 7 Mb/s. On a Cortex-A9 processor running at 1.0 GHz, it can decode 480p videos at 30 frames/s and up to 2 Mb/s. Bossen [21] has also shown that on an ARMv7 processor running at 1.3 GHz, 1080p sequences can be decoded at 30 frames/s. In that paper, however, the experimental setup is not described in sufficient detail, so direct comparisons cannot be made. Bross et al. [22] have reported an optimized HEVC decoder using SSE4.1 on an Intel i7 processor with an overall speedup of 4.3 and 3.3 for 1080p and 2160p, respectively. When running at 2.5 GHz, the system is able to process, on average, 68.3 and 17.2 frames/s for 1080p and 2160p, respectively. Table II shows a summary of the mentioned works reporting SIMD acceleration for video decoders.
In previous works, results have been presented for one or two SIMD extensions (such as SSE4.1 and NEON) and one or two processor architectures. Or in some cases, only the complete application speedup with SIMD is reported but not the SIMD techniques and the per-stage speedups. Instead, in this paper, we present a detailed analysis of the impact of SIMD optimization on the HEVC decoder by comparing the implementations for multiple SIMD ISAs. In addition, we quantify the impact of the microarchitecture improvements over several processor generations. Furthermore, we evaluate the chroma interleaved format that benefits SIMD acceleration compared with the traditional planar format. Overall, compared with previous work, the performance of the presented decoder is higher when using similar processors and input videos.
IV. GENERAL STRUCTURE OF OPTIMIZED HEVC DECODER
In this section, we describe the optimized HEVC decoder used as a baseline for the SIMD optimization. In the description, we will focus on the decoding process of a coding tree unit (CTU). We discuss first the steps performed on the CTU level, and then in more detail, the parsing and reconstruction that are performed on smaller units, such as prediction units (PUs) and transform units (TUs). Afterward, we present the CTU memory management of the optimized decoder.
A. CTU Decoding
For performance reasons, the entire CTU decoding is performed on a small intermediate buffer that has space for the required data window, which is slightly more than two CTUs. Some of the kernels can be performed at CTU granularity, whereas others have to be performed on smaller units. Our decoder performs the following high-level steps on a CTU level.
1) Presynchronization: Before a CTU is decoded, a synchronization is performed to check whether the CTU dependencies are resolved. Depending on the parallelization strategy, it is checked whether either the top-right CTU has been decoded [wavefront parallel processing (WPP)] or the colocated CTU has been decoded (frame parallel).
2) Initialization: In the initialization phase, the syntax element data structures are filled with the appropriate neighboring data and initial values. To reduce the memory requirements and improve cache performance, only the data of the current, top, and left CTUs are restored.
3) Parsing and Reconstruction: The CTU syntax parsing, boundary strength (BS) calculation, intra prediction, motion compensation, and IT are performed in this step and will be further detailed in Section IV-B.
4) In-Loop Filtering: After reconstruction, the deblocking and SAO filters are performed. The filters are not fully applied on the current CTU. Due to data dependencies, the filters are partially or fully performed on previously decoded CTUs, as shown in Fig. 1 (for luma samples). For deblocking, all vertical edges of the CTU are filtered first [Fig. 1(a)], followed by the horizontal edges [Fig. 1(b)]. Due to the DF specification in HEVC, the deblocking of the horizontal edges cannot be fully performed because it requires the deblocked samples from the vertical edges as input. The last four sample columns on the right side of the CTU are not available before the first vertical edge of the next CTU has been deblocked. The filtering has to be delayed by a minimum of four samples to circumvent this issue. In the SIMD accelerated decoder, however, the horizontal edges are delayed by half the CTU width [Fig. 1(b)] for better performance. These details will be discussed in Section V-D, where the DF implementation is presented. The SAO filter also requires a delay for similar reasons [Fig. 1(c)]. Because it requires deblocked samples as input, it has to be delayed by four samples vertically. In the horizontal direction, a full CTU delay is used instead of the minimum delay of five samples, to avoid having to cross four CTUs with potentially different SAO modes on unaligned data. In the SAO step, the finalized samples are stored to the picture memory.
5) Postsynchronization: After each CTU is processed, any threads stalled on this CTU are notified. Depending on the parallelization strategy, stalled threads are notified either for each CTU (WPP) or at the end of a CTU line (frame parallel).
Some other management steps such as border exchange, syntax element and sample line buffer management, and picture border extension are performed by the decoder but are left out to simplify the description.
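The WPP dependency checked in the presynchronization step can be made concrete with a small scheduling model. The sketch below is an illustration under simplifying assumptions, not the decoder's actual synchronization code: every CTU is assumed to take exactly one time step, one thread is assumed per CTU row, and the function name `wpp_schedule` is hypothetical.

```python
def wpp_schedule(ctus_wide, ctus_high):
    """Earliest start step of each CTU under wavefront dependencies.

    A CTU waits for its left neighbour in the same row and for the
    top-right CTU in the row above (the last CTU of that row when the
    current CTU sits at the right picture edge).
    """
    start = [[0] * ctus_wide for _ in range(ctus_high)]
    for y in range(ctus_high):
        for x in range(ctus_wide):
            ready = 0
            if x > 0:
                # Left neighbour must be finished (one step after its start).
                ready = max(ready, start[y][x - 1] + 1)
            if y > 0:
                # Top-right neighbour must be finished.
                dep_x = min(x + 1, ctus_wide - 1)
                ready = max(ready, start[y - 1][dep_x] + 1)
            start[y][x] = ready
    return start
```

For pictures at least two CTUs wide, this model gives a start step of x + 2y for the CTU in column x and row y, which is the familiar two-CTU stagger between consecutive wavefront rows.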
B. CTU Split Process and Leaf Node Processing
In the main parsing and reconstruction step, the CTU is split into coding units (CUs) using a recursive quad-tree, and the leaf CU nodes are then further split into PUs and TUs. The CTB subdivision is shown in Fig. 2. The processing of a final leaf PU is as follows.
1) Parsing the Prediction Information: Depending on the CU type, parsing the prediction information involves retrieving the reference indices and motion vectors (inter/skip) or the intra luma and chroma modes (intra).
2) Motion Compensation: The prediction block (PB) samples for an inter predicted coding block (CB) are obtained from an area in a reference picture located by a reference index and motion vector. Interpolation filters (luma/chroma) are used to generate the samples for noninteger positions.
The processing of a final leaf TU is as follows.
1) Intra Prediction: In HEVC, intra prediction is interleaved with the IT on a TU basis (instead of on a PU basis). Using the mode retrieved for the corresponding PU, the intra prediction is invoked.
2) Parsing Coefficients and Inverse Quantization: If coefficients are available for this TU (which depends on various conditions), the coefficients are parsed from the bitstream. In our decoder, the inverse quantization is performed after parsing the coefficient level and sign. In this way, no zero coefficients are needlessly inverse quantized.
3) Inverse Transform: If coefficients have been parsed for this TU, the IT is performed to recreate the residual. This step also directly adds the residual to the corresponding prediction (inter or intra) and saturates the result.
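The final part of the inverse transform step, adding the residual to the prediction and saturating, can be sketched as follows. This is a scalar illustration only; the function name `reconstruct` and the list-of-rows block layout are assumptions made for the example.

```python
def reconstruct(pred, residual, bit_depth=8):
    """Add a residual block to a prediction block and saturate each sample.

    Both blocks are lists of rows with identical dimensions. The result is
    clipped to the valid sample range [0, 2**bit_depth - 1].
    """
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(pred, residual)
    ]
```

For instance, `reconstruct([[250, 10]], [[10, -20]])` returns `[[255, 0]]`. In a SIMD implementation this add-and-clip typically maps onto saturating add or pack instructions, so the clipping adds little or no extra cost.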
C. CTU Memory Management
To reduce the SIMD implementation complexity and improve the cache and memory efficiency, the intermediate samples are first stored in a local buffer instead of directly in the picture memory. This scheme has two main advantages compared with operating on the picture memory directly. First, the SIMD implementation complexity is reduced, as only two decoding steps, the motion compensation and the SAO filter, interact directly with the decoded picture buffer. Because the decoded samples are stored in either 8- or 16-b containers depending on the bit depth of the sequence, all decoding functions that interact with the pictures must have two variants.
[Figure: Overview of the optimized decoder implementation using local buffers. The results of each kernel are stored in the local CTU buffer until they are completely processed.]
Second, decoupling the writing to the picture buffer allows for additional memory and cache optimizations. The first writes to the picture buffer introduce significant memory stalls, as the output picture memory is not cached at this time. To reduce these stalls, CTBs in the picture memory are aligned to cache lines. Modern architectures that support write combining [23] will omit the line-fill read when they detect that entire cache lines are written. Aligning the CTBs ensures that as few cache lines as possible are written to, and consequently that they are written as fully as possible. In addition, nontemporal store instructions [13] can be used to bypass the cache hierarchy and write the lines directly to memory. This is beneficial because the caches are not polluted with picture buffer data that will not be read until at least the next frame starts decoding. As we will discuss in Section VII-E, the usage of nontemporal stores leads to a considerable reduction in capacity misses and, consequently, memory transfers.
V. SIMD OPTIMIZATION
Our HEVC decoder implements SIMD for all the HEVC processing steps except for the bitstream parsing. This includes inter prediction, intra prediction, IT, DF, SAO filter, and various memory movement operations. We have implemented this for x86, with specialized versions for each of the SIMD extension sets from SSE2 up to AVX2, as well as ARM NEON. For brevity, we will not discuss each implementation for every kernel in detail, but instead we will focus more on the general solutions and challenges involving the SIMD implementations and highlight some distinctions between different instruction sets when required. We will mainly discuss SIMD for the luma component and comment briefly on the chroma implementation.
A. Inter Prediction
During inter prediction, the PB samples must be created from previous pictures indexed by the reference indices. The associated motion vectors specify a translation in these pictures with quarter-sample precision. If the horizontal or vertical vector component points to a fractional position, interpolation is required. HEVC specifies that the interpolation is performed using a 7/8-tap finite-impulse response (FIR) filter, whose coefficients are listed in Table III.
Inter prediction is the most time-consuming step in HEVC [20]. To derive a horizontally interpolated sample, seven to eight multiplications and additions must be performed. If the vertical position is also fractional, a second filter iteration is applied on the horizontally interpolated samples to derive the final interpolated samples. This process is repeated for the other reference direction in case of bi-prediction, and the results are either averaged or weighted to form the final block prediction. The interpolation process is parallel for each sample and is well suited for SIMD acceleration. Fig. 4 shows the interpolation process for one direction for an 8 × 8 block.
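The horizontal filter pass described above can be written as a short scalar reference. The tap values below are the luma interpolation coefficients of the HEVC specification, which are assumed here to be the ones listed in Table III; the function names and the single-stage normalization in `to_sample` are simplifications introduced for this sketch and do not reproduce the exact two-stage rounding of the standard.

```python
# HEVC luma interpolation taps, indexed by the quarter-sample fraction (1..3).
# Assumed to correspond to Table III; each set sums to 64.
LUMA_TAPS = {
    1: (-1, 4, -10, 58, 17, -5, 1, 0),
    2: (-1, 4, -11, 40, 40, -11, 4, -1),
    3: (0, 1, -5, 17, 58, -10, 4, -1),
}


def interp_row(ref, x0, width, frac):
    """Horizontally interpolate `width` samples starting at integer column x0.

    `ref` is one row of reference samples; columns x0-3 .. x0+width+3 must
    exist. The returned intermediate values carry the filter gain of 64.
    """
    taps = LUMA_TAPS[frac]
    out = []
    for x in range(x0, x0 + width):
        acc = 0
        for k, c in enumerate(taps):
            acc += c * ref[x - 3 + k]  # taps cover positions x-3 .. x+4
        out.append(acc)
    return out


def to_sample(value, bit_depth=8):
    """Simplified normalization: remove the gain of 64 with rounding and clip."""
    return min(max((value + 32) >> 6, 0), (1 << bit_depth) - 1)
```

Every output sample depends only on the input row, so a SIMD version can compute eight or more neighbouring outputs per instruction; the inner loop over `k` becomes a sequence of vector multiply-accumulate operations.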
While a basic SIMD implementation is straightforward, simply multiplying and adding a vector eight times in either the horizontal or vertical direction, arriving at an optimal solution for each ISA requires more analysis.
B. Intra Prediction
The intra prediction has been refined in HEVC compared with H.264. In H.264, there are 10 distinct modes (dc, planar, and eight angular), of which up to nine are available depending on the block size. HEVC extends this to 35 modes (dc, planar, and 33 angular), which are available for all block sizes (4 × 4 to 32 × 32), as shown in Fig. 5.
For all modes, the derivation of the prediction values is independent, and therefore well suited for SIMD acceleration. For brevity, we will focus on the angular modes. Each sample can be derived by extrapolating its position to the boundary samples using the specified angle. If the intersecting position is fractional, the prediction value is derived by bilinear filtering of the two neighboring boundary samples. For modes that predict from the vertical (left) boundary, the prediction can still be performed horizontally, however, by first copying the vertical boundary samples to an array. The prediction samples will then be created in a bottom-to-top, left-to-right order. Storing the produced prediction samples in the right orientation then requires a 90° rotation, which can be implemented using an SIMD transpose and a reverse store of the transposed registers. While the derivation process of the samples can be accelerated with SIMD, this is not possible for the preparation of the boundary samples. Preparing the boundary samples can be quite complex, as samples must be extended for boundaries that are not available for prediction, and afterward must also be filtered. On average, preparing the boundary samples takes about as much time as the prediction.
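The angular derivation can be sketched for the simplest case, a vertical mode with a non-negative angle, using the 1/32-sample bilinear weighting of the HEVC specification. This is a scalar illustration under stated assumptions: the function names are hypothetical, negative angles and boundary preparation are left out, and the horizontal variant orders the left boundary top-to-bottom so that a plain transpose suffices (unlike the bottom-to-top ordering described in the text, which needs the additional reversal).

```python
def predict_angular_vertical(ref, size, angle):
    """Angular prediction from the top boundary for a non-negative angle.

    ref[0] is the top-left corner sample and ref[1 .. 2*size] the row above
    the block. `angle` is the displacement per row in 1/32-sample units.
    Returns the block as a list of rows.
    """
    pred = [[0] * size for _ in range(size)]
    for y in range(size):
        pos = (y + 1) * angle
        idx, fact = pos >> 5, pos & 31  # integer and fractional displacement
        for x in range(size):
            a = ref[x + idx + 1]
            # The second sample is only needed for fractional positions.
            b = ref[x + idx + 2] if fact else a
            pred[y][x] = ((32 - fact) * a + fact * b + 16) >> 5
    return pred


def predict_angular_horizontal(left_ref, size, angle):
    """Horizontal-mode prediction obtained by transposing the vertical case.

    left_ref[0] is the top-left corner and left_ref[1 .. 2*size] the left
    column from top to bottom (an ordering assumed for this sketch).
    """
    rows = predict_angular_vertical(left_ref, size, angle)
    return [list(col) for col in zip(*rows)]
```

Within one row, `idx` and `fact` are constant, so all samples of a row use the same two weights on consecutive reference samples. That is what makes a row a natural unit for one or two SIMD vectors.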
C. Inverse Transform
The IT has traditionally been a well-suited kernel for SIMD acceleration. This is also true in HEVC, and with block sizes up to 32 × 32, the transform is much more computationally complex compared to previous standards [24]. A common implementation of the 2-D IT is to perform a series of two 1-D transforms, first on the columns and then on the rows of the transform block (TB) [25]. The computation of a 1-D column transform consists of a series of matrix-vector multiplications, followed by adding/subtracting the partial results. Fig. 6 shows the 1-D IT for 32 × 32 TBs.
The figure shows that for the largest IT the odd positions of the input vector x are multiplied with the 16 × 16 matrix. Positions 2-30 with steps of 4 (4n + 2) are multiplied with the 8 × 8 matrix, and so on. Then, the resulting partial solutions are combined with additions and subtractions to form the final inverse transformed output vector. In HEVC, the smaller 1-D transforms are contained in the larger ones, meaning that the coefficients of the transform matrices are the same for the smaller transforms. The difference when performing a smaller transform is that the larger matrix-vector multiplications are omitted and the input position pattern starts with the odd pattern at one of the smaller matrices.
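The same even/odd structure is easiest to see in the smallest case. The sketch below uses the 4-point HEVC core transform matrix and compares a direct matrix-vector product with the decomposition in which the odd input positions go through a 2 × 2 matrix and the even positions through the 2-point transform. Function names are hypothetical, and the rounding shifts that follow each pass are omitted.

```python
# 4-point HEVC core transform matrix (rows are basis functions).
M4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def inverse_1d_direct(x):
    """Inverse 1-D transform as a plain product with the transposed matrix."""
    return [sum(M4[k][n] * x[k] for k in range(4)) for n in range(4)]


def inverse_1d_even_odd(x):
    """Same result using the even/odd decomposition.

    Odd inputs (x[1], x[3]) are multiplied with the 2x2 odd matrix, even
    inputs (x[0], x[2]) with the 2-point transform; the partial results are
    then combined with additions and subtractions.
    """
    o0 = 83 * x[1] + 36 * x[3]
    o1 = 36 * x[1] - 83 * x[3]
    e0 = 64 * x[0] + 64 * x[2]
    e1 = 64 * x[0] - 64 * x[2]
    return [e0 + o0, e1 + o1, e1 - o1, e0 - o0]
```

The decomposition needs six multiplications where the direct product needs sixteen, and the saving grows with the transform size because the even part is decomposed again recursively.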
An optimization that can be applied to the IT in general is omitting calculations depending on the input. It is very uncommon that all positions of the input array are fully populated with nonzero values. Similar to most hybrid video codecs, the coefficient scan pattern of HEVC concentrates the coefficients in the top left corner. Because of the larger transforms in HEVC compared with H.264/AVC, more columns and rows contain only zero coefficients, for which the computation can be dropped as the result would again be zero. An example of which inputs can be omitted from the computation is shown for a 16 × 16 TB in Fig. 7. For the special case that the input coefficients are all zero except for the top left corner position, an even more aggressive optimization can be performed, known as the dc transform. In this case, the entire IT is omitted, as it would result in the same value, which can be computed with a single rounding shift, for all positions in the residual block. These optimizations result in up to 5× speedup for the scalar code, and up to 3.6× and 2.8× for SSE2 and AVX2, respectively, for high quantization parameter (QP) videos.
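The dc shortcut can be checked against a full transform with a small model. The sketch below assumes a 4 × 4 block and the per-pass rounding shifts of the HEVC reference transform (7 after the first pass and 20 minus the bit depth after the second); the intermediate 16-b clipping is omitted and the function names are hypothetical. It keeps the two rounding stages explicit, which may differ in form from the single-shift variant mentioned in the text.

```python
# 4-point HEVC core transform matrix (rows are basis functions).
M4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def inverse_2d(coeffs, bit_depth=8):
    """Reference 4x4 inverse transform: columns first, then rows."""
    n = 4
    shift1, shift2 = 7, 20 - bit_depth
    tmp = [[0] * n for _ in range(n)]
    for c in range(n):  # first pass over columns
        for r in range(n):
            acc = sum(M4[k][r] * coeffs[k][c] for k in range(n))
            tmp[r][c] = (acc + (1 << (shift1 - 1))) >> shift1
    out = [[0] * n for _ in range(n)]
    for r in range(n):  # second pass over rows
        for c in range(n):
            acc = sum(M4[k][c] * tmp[r][k] for k in range(n))
            out[r][c] = (acc + (1 << (shift2 - 1))) >> shift2
    return out


def inverse_dc_only(dc, bit_depth=8):
    """Shortcut for a block whose only nonzero coefficient is the dc term.

    Both passes collapse to a scale by 64 followed by a rounding shift, and
    every residual sample receives the same value.
    """
    shift1, shift2 = 7, 20 - bit_depth
    t = (64 * dc + (1 << (shift1 - 1))) >> shift1
    v = (64 * t + (1 << (shift2 - 1))) >> shift2
    return [[v] * 4 for _ in range(4)]
```

Because 64 is a power of two, each stage is itself equivalent to a rounding shift of the input, e.g. `(dc + 1) >> 1` for the first pass.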
The SIMD implementation of the IT can be performed either inside one column or row transform, or using SIMD on multiple columns/rows, or as a combination of the two. We have experimented with these approaches and found that the differences are small. The fastest SSE2 implementation uses SIMD over columns followed by a transpose for both passes of the IT. In this approach, not all zero columns can be dropped, because eight columns are inverse transformed at once, and additional transpose overhead is present. This approach is still faster, because all the SIMD lanes can be used efficiently for the entire transform, which is not the case when applying SIMD to an individual column/row. With NEON, the best approach is the same as with SSE2 for the smaller transforms (4 × 4 and 8 × 8). For the larger transforms, a combined approach without any transposing overhead proves to be better, in which the first pass uses multicolumn SIMD and the second pass performs SIMD per row.
D. Deblocking Filter
The DF can be accelerated using SIMD by considering multiple edge parts simultaneously, as shown in Fig. 8. Because (for bit depths below 11 b) the computation precision of the DF is within 16 b, eight computation lanes are available for both the SSE2+ and NEON implementations, which can be used to process four-sample-wide edge parts at a time (for AVX2, sixteen 16-b lanes are available and can be used to process eight edge parts simultaneously). A major difference of the DF compared with other kernels is that a large part of it consists of evaluating conditions [26]. Because multiple edge parts fit inside an SIMD vector, the execution of the filter must be made conditional within the vector. The complete SIMD DF is performed in four phases as follows.
1) BS Check: In the first phase, if any of four consecutive BSs, which are computed during the parsing and reconstruction step, is larger than 0, the samples of four edge parts are loaded. For the vertical edge filter, these samples are transposed after loading.
2) Filtering Decisions: In the second phase, the first and the last sample column of each of the four edge parts are swizzled into a 128-b vector. The following filtering expressions are then evaluated for four edge parts at a time:
where p x,y and q x,y are the samples indicated in red in Fig. 8, and the value of β depends on the QP of the p and q samples.
3) Normal/Strong Filter Decision: If filter evaluates to true for any of the four parts, in the third phase, the same swizzled input is used to derive whether the filter should be strong or normal. The following expressions evaluate this:
The results of the first three phases are a strong and normal mask, a normal second sample mask, and a lossless mask.
4) Filtering Operations: In the last phase, in which the actual sample filtering takes place, we switch back to the originally loaded input and two edge parts are filtered at once. If the strong or normal mask is enabled for any of the two edge parts, the computation is performed and the masks are used to select between the filtered and the original samples. Finally, samples that are contained in lossless coded CBs should not be filtered independent of previous decisions, and the original samples are selected before storing back in the local CTU buffer.
The SIMD implementation performs more work compared with the scalar implementation, because multiple edge parts that might not require the same computation are processed together. We found that this divergent behavior is stronger if edge parts do not belong to the same block (TB, PB, CB, and CTB). The horizontal edge filtering would particularly suffer from this, as it must be delayed by at least one edge part in HEVC. Therefore, the horizontal filter is delayed by half the CTB width to reduce the divergent behavior.
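The decision phases and the mask-based selection can be sketched in scalar form. The expressions below follow the on/off and strong/normal decisions of the HEVC specification for one four-line edge part; they are offered as a plausible reading of the phases above rather than as the exact expressions of this paper. The function names, the `p[i][y]` indexing (distance `i` from the edge, line `y`), and the clipping threshold parameter `tc` are assumptions made for the illustration.

```python
def edge_part_decisions(p, q, beta, tc):
    """Return (filter_on, strong) for one edge part of four lines.

    p[i][y] and q[i][y] hold the sample at distance i (0..3) from the edge
    on line y (0..3). Only lines 0 and 3 take part in the decisions.
    """
    def second_diff(s, y):
        # Local activity across the edge side: |s2 - 2*s1 + s0|.
        return abs(s[2][y] - 2 * s[1][y] + s[0][y])

    dp0, dp3 = second_diff(p, 0), second_diff(p, 3)
    dq0, dq3 = second_diff(q, 0), second_diff(q, 3)
    filter_on = dp0 + dq0 + dp3 + dq3 < beta

    def strong_line(y, dpq):
        return (2 * dpq < (beta >> 2)
                and abs(p[3][y] - p[0][y]) + abs(q[0][y] - q[3][y]) < (beta >> 3)
                and abs(p[0][y] - q[0][y]) < ((5 * tc + 1) >> 1))

    strong = (filter_on
              and strong_line(0, dp0 + dq0)
              and strong_line(3, dp3 + dq3))
    return filter_on, strong


def blend(mask, filtered, original):
    """Per-lane select modelled on SIMD and/andnot/or.

    Mask lanes are either 0 (keep original) or -1 (all ones, take filtered).
    """
    return [(f & m) | (o & ~m) for m, f, o in zip(mask, filtered, original)]
```

In a vectorized filter, the comparison results of several edge parts form such masks directly, and `blend` corresponds to the final selection between filtered and original samples, including the restoration of samples in lossless coded CBs.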
E. SAO Filter
The SAO filter has two different modes, the edge offset (EO) mode and the band offset mode [27]. In both modes, the entire CTB is considered and the deblocked samples are used as input. In the edge offset mode, four sample offsets are transmitted in the bitstream. Each sample S x,y is derived 
In the band offset mode, also four offsets are transmitted in the bitstream. The index derivation is simpler, instead of looking at neighbor samples each sample value is classified in 1 of 32 bands using the five MSBs. The four transmitted offsets are associated to any four consecutive bands, while the other bands will default to an offset of 0.
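The band classification can be written in a few lines. In this sketch the 32-entry table, the wrap-around of the four consecutive bands, and the names `band_offset_table`, `sao_band`, and `band_position` are assumptions for illustration; the classification by the five most significant bits follows the description above.

```python
def band_offset_table(band_position, offsets):
    """Build the 32-entry offset table: four consecutive bands get an offset."""
    table = [0] * 32
    for i, off in enumerate(offsets):
        table[(band_position + i) & 31] = off
    return table


def sao_band(samples, band_position, offsets, bit_depth=8):
    """Apply SAO band offset to a list of deblocked samples."""
    table = band_offset_table(band_position, offsets)
    shift = bit_depth - 5  # keep the five MSBs as the band index
    max_val = (1 << bit_depth) - 1
    return [min(max(s + table[s >> shift], 0), max_val) for s in samples]
```

As an example, with `band_position=8` and offsets `[1, 2, 3, 4]`, the 8-b samples `[0, 64, 71, 72, 255]` become `[0, 65, 72, 74, 255]`: samples 64 and 71 fall in band 8, sample 72 in band 9, and the remaining samples in bands with a zero offset.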
SIMD can be applied to the entire SAO process by considering multiple samples at the same time, because each sample can be derived in parallel. The sign function can be implemented by clipping to the range [−1, 1] or, when available, with dedicated sign instructions, making the index calculation straightforward. In addition, the add and clip of the final sample value is possible with all SIMD implementations. The offset can be derived using an in-register table lookup. For NEON, the VTBL instruction can perform eight lookups at a time, and for SSSE3 and higher, 16 lookups can be performed using the PSHUFB instruction. For the x86 processors not supporting SSSE3, however, the lookup has to be performed using regular scalar code.
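The sign-by-clipping trick and the small offset table can be shown together in a scalar model of the edge offset mode. The sketch handles only the horizontal edge class on a single row and leaves the first and last sample untouched; the index formula and category order follow the HEVC specification, while the function names and the five-entry table layout are assumptions made for this example.

```python
def sign_by_clip(d):
    """Sign of d obtained by clipping the difference to the range [-1, 1]."""
    return min(max(d, -1), 1)


def sao_edge_row(row, offsets, bit_depth=8):
    """Apply SAO edge offset (horizontal class) to the interior of one row.

    `offsets` holds the four transmitted offsets. The lookup table is
    indexed by 2 + sign(s - left) + sign(s - right):
      0 local minimum, 1 and 3 edge-like shapes, 4 local maximum,
      2 flat or monotonic (no offset).
    """
    lut = [offsets[0], offsets[1], 0, offsets[2], offsets[3]]
    max_val = (1 << bit_depth) - 1
    out = list(row)
    for x in range(1, len(row) - 1):
        idx = (2 + sign_by_clip(row[x] - row[x - 1])
                 + sign_by_clip(row[x] - row[x + 1]))
        out[x] = min(max(row[x] + lut[idx], 0), max_val)
    return out
```

With offsets `[3, 1, -1, -3]`, the row `[10, 5, 10, 10, 12, 12, 20, 12]` becomes `[10, 8, 9, 11, 11, 13, 17, 12]`. The five-entry table is small enough to live in a single vector register, which is what makes byte-shuffle instructions usable as a parallel lookup.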
While the SAO SIMD implementation is straightforward, two considerations must be made. First, for correctness, the samples of lossless coded CBs must not be filtered. Samples at picture borders and some slice borders must not be filtered either. Because these cases are relatively rare, it is undesirable to add complicated checks inside the inner SIMD loop. We solve this instead by first filtering all the samples and afterward restoring the original samples where needed. Second, because the SAO is the last step in the decoding process, the output samples are written to the picture buffer and not back to the intermediate buffer. As discussed earlier in Section IV-C, nontemporal stores to the picture buffer are used to reduce cache misses and memory bandwidth requirements.
F. Other Kernels
At various places in the decoder, other than the highly sequential bitstream parsing, memory operations such as filling an array, copying memory, and clearing memory are required. For example, border extensions, setting initial values of syntax elements, and clearing coefficient arrays are also accelerated in our implementation.
G. Chroma Interleaving
Until now, we have only considered SIMD acceleration for the luma plane. SIMD acceleration can also be applied to all the kernels for the chroma planes. The chroma planes in some kernels (inverse transform, intra prediction, and SAO filter) have (almost) the same derivation process as luma and the code can be shared. In other kernels (deblocking and inter prediction), a different and less complex derivation process is followed, requiring a specialized chroma implementation.
In the HEVC main profile, only the YUV 4:2:0 format is supported. This means that all chroma blocks (TB, PB, CB, and CTB) are half the width and height of their luma blocks. Overall, this leads to worse SIMD utilization, as the smaller blocks are unable to fill the entire SIMD vector.
The SIMD efficiency can be improved by interleaving the two chroma planes horizontally sample-by-sample into one plane with double the width. This semiplanar sample format is referred to as NV12 for 8 b and P010 for 10 b [28]. In HEVC, the SIMD efficiency can be improved using this sample format, because colocated chroma blocks/samples always require the same computation for all the kernels. An exception to this is the IT, for which one of the two planes could have no coefficients transmitted. It should be noted that the use of chroma interleaved formats is not restricted to the HEVC codec and 4:2:0 formats but can be applied to any video codec that operates on a planar YUV format.
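The interleaved layout itself is simple to state in code. The following minimal sketch converts two planar chroma planes into one semiplanar plane and back, with Cb placed before Cr as in NV12; the function names and the list-of-rows plane representation are assumptions made for illustration.

```python
def interleave_chroma(cb, cr):
    """Merge planar Cb and Cr planes into one plane of double width.

    Each output row alternates Cb and Cr samples: Cb0, Cr0, Cb1, Cr1, ...
    """
    out = []
    for cb_row, cr_row in zip(cb, cr):
        row = []
        for u, v in zip(cb_row, cr_row):
            row.extend((u, v))
        out.append(row)
    return out


def deinterleave_chroma(cbcr):
    """Split a semiplanar chroma plane back into separate Cb and Cr planes."""
    return [row[0::2] for row in cbcr], [row[1::2] for row in cbcr]
```

A 4 × 4 chroma block thus occupies eight consecutive samples per row instead of two separate runs of four, so a kernel that applies the same operation to both components can fill twice as many SIMD lanes per load.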
The usage of chroma interleaved processing does require that the application receiving the output of the decoder must be able to handle the semiplanar color format. The support level for NV12 and P010 is, at the time of writing, not as good as the regular planar formats in many applications. Interleaved chroma formats are currently commonly used in video codec hardware accelerators, but are still rarely supported through the entire displaying software stack.
VI. EXPERIMENTAL SETUP
A wide range of platforms has been selected for evaluation, providing a good coverage of the important consumer processor architectures of the last 10 years. The properties of the 14 platforms are shown in Table IV; each platform represents a different core microarchitecture for a generation of processors. Most platforms have multiple cores, and most Intel platforms additionally support simultaneous multithreading (SMT). One AMD platform also supports hardware multithreading in the form of cluster multithreading (CMT), which promises better performance scaling compared with SMT.
To provide a fair and reproducible evaluation, a software stack that is as common as possible has been used. All x86 platforms use the Kubuntu 13.04 distribution with Linux kernel 3.8. The ARM platforms run a Linaro distribution that is also derived from the Ubuntu 13.04 packages. For all platforms, the GCC 4.8.1 compiler is used with the −O3 optimization level. Execution time is measured outside of the program using the time command, and performance metrics such as instructions, cycles, and frequency are collected with perf. For all platforms, dynamic voltage frequency scaling is disabled in all experiments, including the turbo boost and turbo core features of recent processors. The processors are fixed to their nominal frequency listed in Table IV.
Two video testsets have been selected for the experiments. The first includes all five 1080p videos from the JCTVC testset [29], and the second includes five 2160p50 videos from the EBU UHD-1 testset [30]. All videos were encoded with the HM-10.1 reference encoder using four QP points (24, 28, 32, and 36). The 1080p sequences are encoded with the random access main (8-b) configuration, and the 2160p sequences are encoded with the random access main10 (10-b) configuration. All videos are encoded with WPP enabled. Table V shows the resulting bitrates.
VII. RESULTS

A. Single-Threaded Performance
Tables VI and VII show the performance of a single thread in frames/s for 1080p and 2160p resolutions, respectively. The tables present the performance achieved on each architecture for four different QPs. For each QP, both the planar chroma (YUV) and the interleaved chroma (YC) results are shown. The results are averaged over the input videos with the same resolution.
The results show a large performance difference between QPs. Depending on the resolution and architecture, there is a 1.5× to 2.4× difference between QP 24 and QP 36. An even larger performance span can be observed across architectures. For instance, there is an up to 15.6× single-threaded performance difference between the tested Cortex-A9 and Haswell platforms. Most of the tested platforms, however, achieve the common movie frame rate of 24 frames/s for 1080p. Only the Dothan, Atom, and Cortex-A9 are not able to achieve this on a single core. This is different for 2160p 10-b, where none of the platforms achieves the frame rate of 50 frames/s, which is expected to become common, for any QP point. This shows the necessity of parallelization.
Chroma interleaving provides an improvement in all cases. For Intel processors, there is a 4.4%-8.6% improvement, and this is stable across resolutions. For AMD at 2160p, the improvement is even higher than 16%. On the ARM processors, the difference is smaller with up to 2.5% improvement. Chroma interleaving improves the SIMD utilization and memory/cache behavior for the chroma plane(s). For brevity, the results discussed in the following sections use the chroma interleaved configuration.
B. Impact of ISA and Architecture
In the previous section, the absolute single-threaded performance was discussed with respect to resolution, QP, and platform. In this section, we refine the platform-related results and show the impact of ISA and microarchitectural differences. For all the results, the runtimes of all the videos and QPs are averaged for each resolution. Table VIII shows the average number of instructions executed per frame for the different ISAs and their SIMD extensions. As can be observed, employing SIMD reduces the instruction count dramatically. The instruction count reduction compared with scalar ranges from 4.8× to 8.1× depending on the ISA and SIMD extension. The instruction count reduction is similar for 1080p 8-b and 2160p 10-b. For 10-b, typically fewer scalar instructions are replaced by SIMD instructions, because higher intermediate computation precision is required. This is counterbalanced by the higher resolution, which increases the portion of time spent in kernels that are improved by SIMD.
Comparing the 32- and 64-b x86 architectures shows that 64-b requires significantly fewer instructions per frame. This can mostly be attributed to the increased number of architectural registers (8 in x86 and 16 in x86-64), which reduces the generation of so-called spilling instructions due to too few available registers. The instruction count for the ARMv7 ISA, which also has 16 architectural general purpose registers, is even lower for scalar execution. NEON, however, is less powerful than SSE. While both ISAs have 128-b SIMD instructions, many NEON instructions that perform operations horizontally in the vector, such as table lookup (shuffle), halving adds, and narrowing and widening, are defined only for 64-b. Also, not all SIMD instructions and their options are exposed as intrinsics, placing a higher burden on the compiler for code generation.
In Fig. 9, the normalized performance and instructions per cycle (IPC) are shown for 1080p and 2160p, respectively. For the normalized performance, the frequency differences of the platforms are first factored out (by multiplying the runtimes with the frequency), and the results are then normalized to the Haswell scalar results. The plots show how the different architectures would compare with each other when running at the same frequency. It can be observed that SIMD always provides a significant speedup over the scalar baseline of the same platform. For instance, at the same frequency the Atom is 4.8× slower than Haswell in scalar execution, but is about equal when it uses SIMD. When also using SIMD on Haswell, however, the performance gap returns.
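The normalization described above can be written out as a short Python sketch. The function and parameter names are hypothetical, and the numbers in the usage note are illustrative rather than measured values from the paper.

```python
def normalized_performance(runtime_s: float, freq_ghz: float,
                           ref_runtime_s: float, ref_freq_ghz: float) -> float:
    """Performance relative to a reference run with frequency factored out.

    runtime * frequency is proportional to the number of cycles spent,
    so the ratio of cycle counts compares platforms as if they ran at
    the same clock frequency. Values above 1 are faster than the reference.
    """
    cycles = runtime_s * freq_ghz
    ref_cycles = ref_runtime_s * ref_freq_ghz
    return ref_cycles / cycles


def instructions_per_cycle(instructions: float, cycles: float) -> float:
    """IPC from the instruction and cycle counters reported by a profiler."""
    return instructions / cycles
```

For example, a platform that needs 40 s at 1.5 GHz, compared with a reference that needs 10 s at 3.0 GHz, has a normalized performance of 0.5: it is 4× slower in wall-clock time but only 2× slower per cycle.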
The improvements of the additional SIMD extensions are more incremental, except for avx2, for which the SIMD registers are 256-b instead of 128-b. The generation-to-generation architectural improvements have been more significant.
Finally, it can be observed that the IPC when using SIMD instructions is always lower than for scalar execution. Because only part of the application is accelerated with SIMD, the IPC of the parts with more limited acceleration becomes more dominant. Typically, the code parts related to bitstream parsing (CABAC) have low IPC because of frequent branches and data dependencies. SIMD code is also more optimized and processes data faster, leading to relatively more cache misses and control instructions, both of which typically reduce the IPC.
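The first effect, where the low-IPC parts become dominant, can be illustrated with a small Python sketch. The instruction counts and per-region IPC values in the usage note are invented for illustration and are not measurements from the paper.

```python
from typing import Iterable, Tuple


def overall_ipc(regions: Iterable[Tuple[float, float]]) -> float:
    """Overall IPC of a program made of regions given as (instructions, ipc).

    Each region spends instructions / ipc cycles, so the overall IPC is
    the total instruction count divided by the total cycle count.
    """
    regions = list(regions)
    total_instructions = sum(n for n, _ in regions)
    total_cycles = sum(n / ipc for n, ipc in regions)
    return total_instructions / total_cycles
```

Assume a scalar run with 8e9 kernel instructions at an IPC of 2.0 and 2e9 parsing instructions at an IPC of 1.0, which gives an overall IPC of about 1.67. If SIMD cuts the kernel instructions to 1e9 without changing the IPC of either region, the overall IPC drops to 1.2, even though the cycle count falls from 6e9 to 2.5e9.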
C. Speedup Per Stage
In Table IX, the per-stage speedup on Haswell is presented for different SIMD extensions, averaged over all QPs. For 1080p 8-b, the overall speedup ranges from 3.6× to 4.84× from sse2 to avx2. The stages that improve the most are the inter prediction (10.33×) and the SAO filter (11.18×). The IT and DF benefit less, with up to 3.69× and 3.44×, respectively. Although the effect is limited, the PSide, PCoeff, and other stages are also accelerated, mainly through faster filling and clearing of memory. (The PSide and PCoeff results are omitted from the table for space reasons.) Fig. 10 shows the execution profile of the decoder for scalar and SIMD for the Haswell platform. It can be observed that the portion of time spent on parsing the side information and coefficients (PSide and PCoeff), intra prediction (intra), and other increases when using SIMD. The contribution of these stages is relatively higher because the SIMD acceleration has limited effect in these stages.
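The relation between the per-stage speedups, the overall speedup, and the shifted execution profile follows Amdahl's law. The Python sketch below makes this explicit; the function names and any time shares passed to them are hypothetical, since the scalar profile values are not repeated here.

```python
from typing import Dict


def overall_speedup(shares: Dict[str, float], speedups: Dict[str, float]) -> float:
    """Overall speedup from scalar time shares and per-stage speedups.

    shares maps each stage to its fraction of the scalar runtime (sum 1.0).
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("scalar time shares must sum to 1")
    accelerated_time = sum(shares[s] / speedups[s] for s in shares)
    return 1.0 / accelerated_time


def accelerated_profile(shares: Dict[str, float],
                        speedups: Dict[str, float]) -> Dict[str, float]:
    """Time share of each stage after acceleration."""
    times = {s: shares[s] / speedups[s] for s in shares}
    total = sum(times.values())
    return {s: t / total for s, t in times.items()}
```

With two stages of equal scalar share, where one is accelerated 2× and the other is not, the overall speedup is only 1.33×, and the share of the unaccelerated stage grows from one half to two thirds. The same mechanism explains why the parsing and intra stages take a larger portion of the SIMD profile.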
The SAO filter shows the widest speedup span across SIMD extensions. This is mainly caused by sse2, which cannot accelerate table lookups; these require the PSHUFB instruction introduced in ssse3. Apart from avx2, which shows the biggest improvement due to the wider vectors, ssse3 improves the performance the most, mainly due to the introduction of the PSHUFB instruction.
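To show why a byte shuffle helps here, the following Python sketch emulates the 128-b PSHUFB semantics and uses them for an 8-b SAO band offset, in which four consecutive bands of eight sample values each receive an offset. It is an illustrative formulation under these assumptions and not necessarily how the decoder in this paper implements the filter. The function names are hypothetical, and signed offsets are modeled as plain Python integers rather than bytes.

```python
from typing import List, Sequence


def pshufb(table: Sequence[int], indices: Sequence[int]) -> List[int]:
    """Emulate 128-b PSHUFB: 16 parallel lookups into a 16-entry table.

    An index byte with its top bit set yields 0; otherwise the low four
    bits select the table entry.
    """
    if len(table) != 16 or len(indices) != 16:
        raise ValueError("PSHUFB operates on 16-byte vectors")
    return [0 if i & 0x80 else table[i & 0x0F] for i in indices]


def sao_band_offset_8bit(samples: Sequence[int], band_position: int,
                         offsets: Sequence[int]) -> List[int]:
    """Apply a 4-band SAO band offset to 16 8-b samples with one lookup."""
    # Entries 0..3 hold the offsets; all other entries leave samples unchanged.
    table = list(offsets) + [0] * 12
    # Band distance from the first offset band, wrapped to 0..31 and
    # saturated to 15 so that every band outside the four maps to a zero entry.
    idx = [min(((s >> 3) - band_position) & 31, 15) for s in samples]
    looked_up = pshufb(table, idx)
    return [max(0, min(255, s + o)) for s, o in zip(samples, looked_up)]
```

Without a vector table lookup, the offset of each sample has to be selected with compare-and-mask sequences or fetched one sample at a time, which is consistent with the lower sse2 speedup reported above.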
The 2160p (10-b) results show in general the same trend as the 1080p (8-b) results. The main differences are that the inter prediction and SAO filter achieve lower speedups, while the intra prediction, IT, and DF show higher speedups. The inter prediction and SAO filter are the two stages where data are read from and written to picture memory. With 10-b samples, the picture memory and bandwidth per sample are doubled, putting more pressure on the memory system. The SAO filter achieves less than half the speedup for 2160p 10-b compared with 1080p 8-b and is clearly bottlenecked by the memory system. The impact on inter prediction is smaller, because the memory accesses are spread over a larger portion of time and can be overlapped more effectively with computation. The computation for 10-b pictures is also more costly, counterbalancing the increased load operations.
The intra prediction, IT, and DF do not operate on the picture memory directly, but on an intermediate buffer and the increased bit depth has limited consequences in these stages. The speedup for 2160p sequences is slightly higher because of the more common use of larger prediction and TBs, which make more effective use of the SIMD vector width. For the DF, larger blocks result in less divergence in the filter evaluation, reducing the amount of redundant work.
Finally, as Tables X and XI show, memory-limited architectures and architectures without an SIMD table lookup instruction suffer from lower speedups for the SAO filter. Prescott and K8 only support SSE2, while Conroe and the ARM processors are known to be more memory limited. It can also be observed that the overall speedup on older architectures, such as Prescott, Conroe, K8, and the ARM processors, is lower than on the latest architectures of Intel and AMD. Over the years, more emphasis has been put on SIMD performance, because more relevant applications have been optimized for SIMD. An exception is the Bay Trail platform, which improves performance compared with Atom by introducing out-of-order execution. The SIMD execution remained in-order, leading to a relatively lower speedup.
D. Multithreaded Performance
Orthogonal to SIMD acceleration, additional performance can be gained using multithreading. For parallelization, we used an approach based on [31] , which combines WPP and frame-level parallelism. In this strategy, threads decode the rows of a frame in wavefront order using WPP substreams, and in addition, rows of the next frame are already started before fully completing the current frame. To make this approach fully standard compliant, motion vector dependencies are tracked dynamically.
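The wavefront order within a single frame can be illustrated with the following Python sketch. It assumes that every CTU takes one time step, that enough threads are available, and that a CTU depends on its left neighbor and on the top-right neighbor in the row above, which is the usual WPP dependency. The overlap with the next frame and the dynamic motion vector tracking are not modeled, and the function name is hypothetical.

```python
from typing import List


def wavefront_schedule(rows: int, cols: int) -> List[List[int]]:
    """Earliest start step of each CTU under WPP dependencies."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1])  # left neighbor in the same row
            if r > 0:
                # Top-right neighbor, clamped at the right picture border.
                deps.append(start[r - 1][min(c + 1, cols - 1)])
            start[r][c] = max(deps) + 1 if deps else 0
    return start
```

For a picture of 3 × 4 CTUs this yields start steps 0 to 3 for the first row, 2 to 5 for the second, and 4 to 7 for the third: each row trails the row above by two CTUs. The rows at the start and end of a frame therefore cannot keep all threads busy, which is the idle time that starting the next frame early helps to fill.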
Tables XII and XIII show the multithreaded performance for 1080p 8-b and 2160p 10-b, respectively. The results are averaged over the four QPs. For reproducibility in the experiments, the threads are first pinned to individual physical cores, and only then to the additional hardware threads exposed by hardware multithreading.
Overall, the decoder scales well with multiple physical cores for both 1080p and 2160p. With two cores the scaling is mostly close to 2×, while with four cores the speedup is around 3.8×. The ARM cores scale noticeably less, due to a relatively weaker memory system. It can also be observed that Intel's SMT and AMD's CMT do not provide the same improvement as when the threads are executed on individual cores. SMT provides a performance improvement between 7.2% and 37.2% with a typical improvement of ∼15%. CMT fares better with an improvement of 45%-47.1%, because more resources in the core are duplicated.
E. Memory Footprint and Bandwidth
The maximum resident set size of the optimized decoder for x86-64 when decoding the random access encoded sequences is 29.9 and 191.6 MB for 1080p 8-b and 2160p 10-b, respectively. When using wavefront and frame-level parallelism, an extra picture buffer is used and the memory footprint rises to 34.1 and 222.6 MB, respectively. For ARMv7 and x86 32-b architectures, the memory footprint is slightly lower (1%-2%) due to the smaller pointers.
A larger difference between ARMv7 and x86/x86-64 is present in the binary size. For ARMv7, this is 756 KB, while for x86-64 1.92 MB is required. This is mostly caused by the many SIMD extensions of x86, which are all contained in the same binary. During execution, however, only one part of this binary actually resides in the caches, as only one of the SIMD variants is used per sequence.
Table XIV shows the average bytes transferred per frame and the memory bandwidth when using scalar, avx2, and multithreading. The number of bytes written per frame is very similar for all configurations and corresponds closely to the actual data size of the frame with extended borders. The number of bytes read per frame is in all cases higher than the number of bytes written per frame. This is expected, because in the random access configuration the blocks mainly use bidirectional prediction, which reads more than two times the samples it writes. The benefit of using nontemporal stores is clearly visible when comparing scalar with avx2. Especially for 2160p 10-b, the memory transfer savings are significant. The 8-MB L3 cache can only fit part of the 27-MB picture buffers, and many additional capacity misses are avoided by not write allocating cache lines for the produced picture buffer data. Chroma interleaving also clearly reduces the memory requirements, as wider blocks improve the utilization of the data fetched in a cache line.
The average memory bandwidth is significantly higher for 10-b video because 10-b samples are stored in a 16-b type, compared with 8 b for 8-b video. The total memory bandwidth requirement for high frame rate (120 frames/s and above) 2160p video is high at around 10 GB/s, but is feasible for current mainstream systems, which have a practical limit of around 20-25 GB/s.
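The orders of magnitude in this section can be checked with a short Python sketch. It assumes 4:2:0 sampling, 16-b storage for any bit depth above 8 b, and an optional border parameter whose size is a hypothetical input, since the border width used by the decoder is not stated here.

```python
def picture_bytes(width: int, height: int, bit_depth: int, border: int = 0) -> int:
    """Size in bytes of one 4:2:0 picture, optionally with an extended border.

    border is the number of extra luma samples added on each side.
    """
    bytes_per_sample = 1 if bit_depth <= 8 else 2
    luma_samples = (width + 2 * border) * (height + 2 * border)
    # In 4:2:0, the two chroma planes together hold half as many samples as luma.
    total_samples = luma_samples * 3 // 2
    return total_samples * bytes_per_sample


def bandwidth_gb_per_s(bytes_per_frame: float, frames_per_s: float) -> float:
    """Average memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_frame * frames_per_s / 1e9
```

Without borders, a 1080p 8-b picture occupies about 3.1 MB and a 2160p 10-b picture about 24.9 MB, so the 2160p 10-b picture is eight times larger: four times the samples at twice the storage per sample. The extended borders account for the remaining difference to the picture buffer size quoted above.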
VIII. CONCLUSION
As with previous video coding standards, HEVC is well suited for acceleration with SIMD instructions. Compared with a well-optimized HEVC decoder, an additional speedup of 2.8× to 5× can be obtained over the complete decoding process using SIMD. The acceleration factor provided by SIMD and multithreading allows real-time HEVC decoding to be easily performed on current hardware platforms, even for UHD applications.
The large speedup, however, could only be achieved with high programming complexity and effort. The complexity of the HEVC standard and the diversity of current computer architectures required many specializations to achieve the optimal performance. Even when introducing a conceptually simple change, such as an interleaved chroma format, a specialized SIMD version for almost every chroma function is required. When video standards and applications continue to increase in complexity, the programmability of SIMD could become the main bottleneck for achieving the highest performance.
To alleviate the programmability issues of SIMD, the synergy between the SIMD ISA, the video codecs, and foremost the programming model has to be improved. While the cores of the SIMD ISAs fulfill the same purpose and are often interchangeable in functionality, none of the ISAs are truly compatible as a whole. These small ISA differences could be abstracted away elegantly through a vendor-independent, standardized SIMD extension to programming languages. A standard SIMD language extension would allow programmers to target multiple SIMD ISAs with a single SIMD parallelization method.
From a video codec perspective, the main implementation complexity arises from the support for multiple bit depths. For each function, the possible block sizes and the input and intermediate precisions must be carefully considered to avoid overflows. This could be avoided by reducing the number of supported bit depths, e.g., by supporting only 8- and 12-b instead of all the possible bit depths in between. Another possible solution is to specify the codec in standardized floating point operations. General purpose architectures have lately increased their floating point performance at a much faster rate than their integer performance, and the latest architectures even have a higher floating point throughput. As the use of floating point numbers might also improve compression performance, it is an interesting and promising direction for future work.
