The new High Efficiency Video Coding Standard (HEVC) was finalized in January 2013. Compared to its predecessor H.264 / MPEG4-AVC, this new international standard is able to reduce the bitrate by 50% for the same subjective video quality. This paper investigates decoder optimizations that are needed to achieve HEVC real-time software decoding on a mobile processor. It is shown that HEVC real-time decoding up to high definition video is feasible using instruction extensions of the processor while decoding 4K ultra high definition video in real-time requires additional parallel processing. For parallel processing, a picture-level parallel approach has been chosen because it is generic and does not require bitstreams with special indication.
INTRODUCTION
Recent studies predict that coded video data will become the major part of consumer internet traffic, with a share of 90% by 2015. 1 This trend is supported by mobile devices, whose increasing screen resolutions enable them to play back high definition (HD) video that is usually streamed or downloaded over mobile networks. Besides that, there are first attempts to broadcast 4K ultra-high definition (UHD) video in TV networks. All these developments call for a new, more efficient video codec that is able to reduce the bitrate without sacrificing video quality or to increase the video resolution without increasing the bitrate. In 2010, the two premier video standardization organizations, the ITU-T Video Coding Expert Group (VCEG) and the ISO/IEC Moving Pictures Expert Group (MPEG), accepted this challenge and established a Joint Collaborative Team on Video Coding (JCT-VC) to develop a new international video coding standard. This standard should be able to reduce the bitrate by 50% for the same quality compared to the state-of-the-art H.264/MPEG4-AVC standard. Three years later, in January 2013, the first edition of the new standard, called High Efficiency Video Coding (HEVC), was finalized. 2 In April 2013, the ITU-T published the HEVC specification text as Recommendation H.265, while in ISO/IEC, HEVC became MPEG-H Part 2 (ISO/IEC 23008-2).
The new HEVC standard provides improvements throughout the hybrid video coding design, which is the same basic design already applied in previous video coding standards. A summary of its main features and a general codec design overview is given by Sullivan et al. 3 Ohm et al. analyzed the coding efficiency of HEVC and compared it with previous video coding standards like H.264/MPEG4-AVC and H.262/MPEG2-Video. They report bitrate reductions of 50% for the same subjective quality compared to H.264/MPEG4-AVC. 4 In order to clarify whether this coding efficiency gain comes with increased complexity, Bossen et al. studied the complexity aspects of HEVC software encoding and decoding. This study concludes that encoding is much more challenging than decoding, e.g. encoding one second of a 1080p60 HD video with the HM reference encoder takes longer than an hour. 5 Hence, it is expected that the first applications of HEVC will be offline coded video content, e.g. internet video, video on demand and the like. While application in broadcast usually requires hardware decoder chips as in set-top boxes, people usually watch internet video on their computers, where software decoding plays an important role. Therefore, this paper analyzes the decoding performance of an optimized, multi-threaded HEVC software decoder for HD and 4K/UHD video on a current mobile processor.
The rest of the paper is structured as follows. The next section reviews approaches that have already been studied for parallel HEVC decoding. Section 3 presents code optimizations and a picture-level parallel decoding approach, and Section 4 reports runtime and profiling results for these techniques for HD and 4K/UHD test sequences. Finally, Section 5 concludes this paper and gives a short outlook.
TOWARDS HEVC REAL-TIME DECODING
It has been shown that for resolutions up to HD (1920×1080), code optimizations including heavy use of single-instruction multiple-data (SIMD) instructions are sufficient to achieve HEVC real-time software decoding. 5 When it comes to decoding UHD video (3840×2160), single-threaded execution with code optimization is no longer enough.
Several approaches to achieve HEVC real-time decoding of UHD video in software have been studied. [6] [7] [8] [9] All studies are based on an optimized version of the HEVC test model (HM) reference software decoder 10 because the original HM software was developed as a reference implementation focusing on correctness and completeness. Hence, the HM reference software is fairly slow. For example, when decoding an HEVC bitstream with only intra-coded pictures and a QP value of 27, it takes the HM decoder two minutes to decode ten seconds of a 1080p60 video. 5 The aforementioned modifications of the HM decoder include code optimization and multithreading support, both necessary to achieve real-time decoding. All of the studies make use of the HEVC high-level tools that allow for parallel decoding, namely slices, wavefront parallel processing and tiles. In the first one, Alvarez et al. investigated a wavefront-like concept using entropy slices in HM 3.0, which are no longer supported in the final version of the standard. 6 The following two papers suggest a slightly modified version of wavefront parallel processing, called overlapped wavefronts. 7, 8 This concept as well as parallel processing with tiles have been integrated in an optimized HM 4.1 decoder. The most recent publication shows results for overlapped wavefronts based on HM 8.0 and further reports speedup due to the use of SIMD code optimizations. 9 Although all these high-level concepts have been proven to provide real-time decoding, their main disadvantage is that they put constraints on HEVC bitstreams, i.e. they require explicit signaling, in order to do so. A more generic approach using picture-level parallelism is presented in the next section.
SIMD OPTIMIZATION AND PICTURE-LEVEL PARALLELISM
As already mentioned in the previous section, performing parallel decoding on a sub-picture granularity requires an indication of this sub-picture granularity in the bitstream, be it on a slice, tile or CTU line level as in wavefronts. Decoding whole pictures in parallel is a generic way to speed up a decoder using parallel decoding. Different from traditional picture-level parallelization, in which only completely independent pictures or slices are decoded in parallel, the employed parallelization strategy allows decoding dependent pictures in parallel by tracking the dependencies at a finer granularity. The decoding of a picture only stalls if a particular reference region has not been decoded yet, which allows any bitstream to be sped up independent of the employed referencing scheme. This picture-level parallelism has been integrated in an HEVC software decoder developed from scratch at the Fraunhofer Heinrich Hertz Institute (HHI). The single-threaded version of the HHI decoder is already optimized with regard to code structure and SIMD instruction set extensions of x86 processors.*
In general, all parts of a codec that perform the same operation on a large amount of data, e.g. a block of picture samples, can be sped up by optimizing them for SIMD instruction set extensions. For HEVC, interpolation, intra-picture prediction, inverse transformation, deblocking and memory copy operations were identified to benefit from SIMD optimizations. The sample adaptive offset filter operations are also well suited for SIMD optimizations, but this has not been implemented in the current optimized decoder. Currently, processor extensions from SSE2 to AVX are supported. Results showing how much the SIMD optimizations speed up the single-threaded decoder are presented in Section 4.2, where the HHI decoder with SIMD optimizations is compared to the HHI decoder without SIMD optimizations (scalar code) and the HM 12.0 reference decoder.
Besides code and SIMD optimization, a major speedup can be achieved by using multiple threads to run decoding operations in parallel. In the aforementioned picture-level parallel decoding approach, each picture to be decoded is assigned one worker thread that performs the decoding. For coding structures where every picture is coded with intra-picture prediction, almost linear speedup can be achieved because the synchronization overhead between threads is negligible. When inter-picture prediction coding structures are considered, the inter-picture dependencies require more complicated synchronization between the worker threads. This is further investigated in Section 4.3, where the speedup for different numbers of threads in intra-picture and inter-picture prediction coding structures is shown and analyzed.

* In cooperation with the Embedded Systems Architecture Group at the Technical University of Berlin.
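The synchronization idea described above (one worker thread per picture, stalling only while a required reference region is still missing) can be illustrated with a minimal sketch. This is not the HHI decoder's implementation: the class `Picture`, the functions `decode_picture` and `decode_in_parallel`, the row-based progress counter and the `reach` parameter (how many rows below the current row a picture is assumed to reference) are all hypothetical names and simplifications chosen for illustration. The actual decoder is written in C++ with Boost threads; Python is used here only to keep the sketch short.

```python
import threading


class Picture:
    """Tracks how many sample rows of one picture have been reconstructed."""

    def __init__(self, poc, num_rows):
        self.poc = poc
        self.num_rows = num_rows
        self.rows_done = 0
        self._cond = threading.Condition()

    def mark_row_done(self):
        # Publish progress and wake every thread waiting on this picture.
        with self._cond:
            self.rows_done += 1
            self._cond.notify_all()

    def wait_for_rows(self, rows_needed):
        # Block only until the required reference region is available.
        with self._cond:
            self._cond.wait_for(lambda: self.rows_done >= rows_needed)


def decode_picture(pic, ref, reach, log, log_lock):
    """Hypothetical worker: 'decodes' pic row by row, stalling on ref if needed."""
    for row in range(pic.num_rows):
        if ref is not None:
            # Assumption: row `row` may reference rows 0 .. row + reach of ref.
            ref.wait_for_rows(min(row + 1 + reach, ref.num_rows))
        with log_lock:
            seen = ref.rows_done if ref is not None else None
            log.append((pic.poc, row, seen))
        pic.mark_row_done()


def decode_in_parallel(num_pictures=3, num_rows=8, reach=2):
    """Starts one worker thread per picture; picture i references picture i - 1."""
    pictures = [Picture(poc, num_rows) for poc in range(num_pictures)]
    log, log_lock = [], threading.Lock()
    threads = []
    for i, pic in enumerate(pictures):
        ref = pictures[i - 1] if i > 0 else None
        t = threading.Thread(target=decode_picture,
                             args=(pic, ref, reach, log, log_lock))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return pictures, log
```

Calling `decode_in_parallel()` runs three dependent pictures concurrently; each log entry records how many reference rows were available when a row was processed. In standard CPython builds the global interpreter lock prevents a real speedup, so the sketch only demonstrates the dependency handling, not the performance.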
RESULTS
In this section, results for the optimized HEVC software decoder described in Section 3 are presented and discussed. First, the experimental setup is described in Subsection 4.1. In Subsection 4.2, the single-threaded performance of the HM reference decoder and the optimized HHI decoder is compared, followed by an analysis of the multithreaded execution of the HHI decoder in Subsection 4.3. Finally, profiling results for the optimized HHI decoder are given in Subsection 4.4.
Experimental setup
The parallel HEVC decoder has been implemented from scratch and optimized with SIMD intrinsics for SSE extensions. Multithreading has been implemented using the C++ Boost libraries, which offer a convenient C++ wrapper around platform-dependent threading libraries such as Pthreads. The optimized decoder supports multiple operating systems such as Linux, Microsoft Windows and Apple OS X, but the performance experiments presented in this paper have been conducted under Linux.
System
The system employed to measure performance is a Dell mobile workstation with an Intel i7-2920XM processor. This processor is based on the Sandy Bridge microarchitecture and includes four cores running at 2.5 GHz. The simultaneous multithreading feature (SMT, called Hyper-Threading by Intel) provides a total of eight hardware threads for the four cores. The processor supports SSE (up to version 4.2) and AVX SIMD instructions. Although AVX only provides 256-bit SIMD registers for floating point instructions, integer SIMD instructions still benefit from the three-operand mode (non-destructive instruction destination). All details of the hardware/software environment are listed in Table 1.
Test Sequences and HEVC Encoding
Test sequences in two resolutions have been used: 1080p which is representative for current high definition systems, and 2160p which is representative for the next generation of high quality video. For 1080p, the five class B sequences from the JCT-VC test set have been used which have 24, 50 and 60 frames per second (fps). For 2160p50, five sequences from the EBU UHD-1 50 fps test set 11 have been selected, namely Lupo confetti, fountain lady, rain fruits, studio dancer and waterfall pan. All the test sequences have been encoded with the HEVC HM reference encoder version 12.0 using the JCT-VC common test conditions. 12 Encoding options are based on HEVC main and main 10 profiles using two configurations: random access (RA) and all intra (AI). Each video is encoded at four different QP points: 22, 27, 32, 37. The 1080p sequences were encoded with random access main profile and all intra main 10 profiles, while the 2160p sequences were encoded using the random access main 10 profile.

Figure 1. Rate-distortion performance of the HD sequences (last panel: (f) 1080p60 AI-main10).

Figure 2. Rate-distortion performance of the UHD-1 sequences for the random access main10 configuration (curves: EBULupoconfetti-p50, EBUfountainlady-p50, EBUrainfruits-p50, EBUstudiodancer-p50, EBUwaterfallpan-p50).
Rate-Distortion Performance
Figures 1 and 2 show the resulting rate-distortion performance of the considered HD and UHD-1 test sequences. Here, the peak signal-to-noise ratio between the original and the reconstructed luma samples (PSNR Y) is used as the distortion measure. It can be observed that the RD performance is highly content dependent. For example, an average luma PSNR value of around 41 dB is measured for the sequence rain fruits at 21 Mbits/s, while coding fountain lady at the same luma PSNR value results in a bitrate of 41 Mbits/s.
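For readers unfamiliar with the distortion measure, the following sketch computes a luma PSNR value from two sample sequences using the common definition PSNR = 10·log10(peak² / MSE). The paper does not spell out the formula; the function name `psnr_luma` and the choice peak = 2^bit_depth − 1 are assumptions made for this illustration, and some tools use a slightly different peak value for 10-bit video.

```python
import math


def psnr_luma(original, reconstructed, bit_depth=8):
    """PSNR in dB between two equally sized sequences of luma samples."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all luma samples.
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical pictures: PSNR is unbounded
    peak = (1 << bit_depth) - 1  # assumption: peak sample value 2^bit_depth - 1
    return 10.0 * math.log10(peak * peak / mse)
```

For example, 8-bit samples that all differ by exactly one level give an MSE of 1 and therefore a PSNR of about 48.13 dB.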
Comparison with single threaded HM reference decoder
As a first step in the decoding runtime analysis, the optimized HHI decoder is compared with the HM reference decoder. When compiling the HM decoder with default settings and a state-of-the-art compiler, so-called auto vectorization already tries to automatically optimize the code for SIMD instructions. In order to perform a fair comparison of both decoders without SIMD optimizations, both have been compiled with the auto vectorization functionality turned off, and the SIMD intrinsics have been disabled for the HHI decoder. Table 2 shows the decoding speed in frames per second for all tested resolutions, frame rates, coding configurations and QP values, averaged over all sequences for a given resolution, frame rate and QP value. In general, it can be observed that the scalar speedup is rather constant over bitrates (QP values), while the speedup achieved with SIMD optimizations decreases with decreasing QP values (increasing bitrate). This can be explained by the fact that with decreasing QP values, more and larger quantized transform coefficients have to be decoded by the CABAC entropy decoder, and this part of the code does not benefit from SIMD optimizations. Detailed results for every sequence of the UHD-1 test set are shown in Figure 3. Here, the execution time per frame is plotted over the bitrate. Since the frame rate of the UHD-1 test sequences is 50 fps, all points that lie under 1000 ms per 50 frames (20 ms/frame) can be considered as decoded in real time. It can be seen that even with SIMD optimizations, single-threaded real-time decoding of 4K/UHD video is not possible. Furthermore, the impact of the auto vectorization for the HM decoder is not negligible because runtime reductions of around 50 ms/frame are observed for all rate points.
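The real-time criterion used above is simple arithmetic: a sequence at a given frame rate leaves 1000 ms divided by the frame rate for each frame. A small sketch, with the hypothetical helper names `frame_budget_ms` and `is_real_time`, makes the threshold explicit:

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at the given frame rate."""
    return 1000.0 / fps


def is_real_time(ms_per_frame, fps):
    """True if the measured decoding time per frame lies under the budget."""
    return ms_per_frame < frame_budget_ms(fps)
```

For the 50 fps UHD-1 sequences, `frame_budget_ms(50)` gives the 20 ms/frame limit used in the text; for 1080p60 content the budget shrinks to roughly 16.7 ms/frame.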
Multithreaded execution using picture-level parallelism
In a second step, the speedup when using the picture-level parallel decoding approach described in Section 3 is analyzed. In addition to the single threaded execution times presented in Subsection 4.2, the execution times of the HHI decoder with SIMD optimization are measured when two, four, eight and ten threads are used. Figure 4 illustrates the speedup compared to the single threaded execution averaged over all QP values and test sequences for a given resolution, frame rate and coding configuration. For all four subfigures, three different slope segments can be identified.
The first segment ranges from one to four worker threads, where the number of worker threads can be mapped to the number of physical CPU cores, which is four in the system used for the experiments. Here, the all intra configuration provides an almost one-to-one correlation between the number of threads and the speedup factor (speedup of almost 4 for four worker threads), while the random access configuration provides a flatter speedup (speedup between 2.5 and 3 for four worker threads). This can be explained by the fact that inter-picture prediction is used in the random access configuration, which introduces picture-to-picture dependencies. In the random access configuration, pictures are encoded using QP cascading. This results in different QP values for different pictures and, consequently, different execution times. Threads that process pictures which normally decode faster can stall because the required reference areas of the pictures they depend on are not available yet.
The next segment ranges from five to eight worker threads, which corresponds to the number of hardware threads made available by SMT. These hardware threads provide significantly less speedup than a physical CPU core, e.g. only 4.5 for eight worker threads for all intra configurations. This is expected, as the SMT threads share the execution core with the "normal" threads. In this particular processor, each core is shared by two threads. The speedup achieved using five to eight threads originates from the additional instruction-level parallelism (ILP) exposed by the additional threads, which increases the utilization of the functional units in the core.
In the last segment from nine worker threads on, no additional speedup is achieved for the all intra configuration, which is reasonable since no CPU resources are available anymore to decode a picture. For the random access configuration, however, using more than eight worker threads still provides an additional speedup. Due to the aforementioned inter-thread synchronization for inter-picture prediction, it may occur that one worker thread is idle. In that case, the associated CPU core can be used to start decoding another picture. Therefore, increasing the number of worker threads still provides a speedup for coding configurations using inter-picture prediction.

Figure 5. Average time per frame for 2160p50 RA-main10 for optimized decoder in SIMD modes (multi-threaded). Panels: (a) 2 threads, (b) 4 threads, (c) 8 threads, (d) 10 threads; each panel shows curves for the five EBU UHD-1 sequences and a 50 Hz line.
In order to illustrate what is needed to achieve real-time decoding of 4K/UHD video, detailed execution times for all tested UHD-1 sequences for two, four, eight and ten worker threads are shown in Figure 5. When two threads are used, it can be seen that all measured execution times are slower than the 20 ms/frame real-time limit. Allowing the HHI decoder to use four worker threads already results in execution times faster than 20 ms/frame for the two lowest bitrates (at 2.6 and 5 Mbits/s) of the slowest sequence, rain fruits. As already discussed, Figure 4a shows that the highest speedup for the UHD-1 test sequences and the random access main 10 configuration can be achieved when ten worker threads are used. Consequently, the execution times for at least all points below 20 Mbits/s are faster than 20 ms/frame in that case. Overall, it can be said that HEVC real-time decoding of 50 Hz 4K/UHD video on a quad-core mobile CPU like the i7 Sandy Bridge at 2.5 GHz is possible.
Profiling results
After investigating the overall performance, the contribution of the different parts of the optimized HHI decoder has been analyzed by profiling. The decoding process has been broken down into the following parts:
• PS: Parsing of prediction data side information, e.g. motion vectors, splitting flags and the like.
• PC: Parsing of transform coefficients.
• IP: Intra prediction with SIMD optimized code.
• IT: Inverse transform with SIMD optimized code.
• MC: Motion compensation with SIMD optimized code.
• DF: Deblocking filter with SIMD optimized code.
• SF: Sample adaptive offset filter.
• OT: All other operations including high-level syntax parsing but mostly copying samples from local buffer to picture memory.
The detailed profiling results can be found in Table 3 for 1080p all intra main 10, in Table 4 for 1080p random access main and in Table 5 for 2160p50 random access main 10.
When comparing the results for the two QP values, the first thing that can be observed is that for increasing QP values, the prediction parts increase while the transform coefficient parsing part decreases. This is plausible since a high QP value reduces the number and size of the transform coefficients. The other thing that can be noticed is that the SAO filtering takes up more decoding time for reconstructed sample values with finer quantization, while more time is spent on deblocking when the quantization is coarser (more blockiness).
Looking at the results for the different configurations, the main difference between the all intra and the random access configuration is that most decoding time in the all intra configuration is spent on coefficient parsing (34.2%) and intra prediction (14.5%), while the random access configuration spends almost half of the decoding time on motion compensation (41.4% and 45.3%). This is plausible since intra-picture prediction generally produces a much larger residual, which leads to more and larger transform coefficients.
Since the profiling results were generated using the optimized HHI decoder with SIMD optimizations, it would be interesting to know how much of the decoding time would be spent in the SIMD optimized parts when no SIMD optimizations are used. Therefore, average profiling results for the scalar code and a SIMD speedup factor for all parts that include SIMD optimized code are listed in the two rows below the average. It can be seen that the interpolation filter benefits the most, with measured speedup factors of 8.1 and 5.47. For intra prediction, SIMD optimizations only reduce the decoding time for that part by a factor of 1.4 to 1.7. This matches the decoding times shown in Table 2, where the overall speedup using SIMD optimization for the all intra configurations is much smaller than for the random access configuration.
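The relation between per-part SIMD speedup factors and the overall decoder speedup follows Amdahl's law: only the time share of the accelerated parts shrinks. The sketch below makes this relation concrete. The function name `project_simd` and the example shares are hypothetical and are not taken from Tables 3 to 5; only the idea of a per-part speedup factor comes from the text above.

```python
def project_simd(scalar_share, simd_factor):
    """Project overall speedup and new time shares from per-part SIMD factors.

    scalar_share: dict mapping each decoder part to its fraction of the
                  scalar decoding time.
    simd_factor:  dict mapping SIMD optimized parts to their speedup factor;
                  parts that are not listed are assumed to be unchanged.
    """
    # Time of each part after optimization, in units of scalar decoding time.
    simd_time = {part: share / simd_factor.get(part, 1.0)
                 for part, share in scalar_share.items()}
    total = sum(simd_time.values())
    overall_speedup = sum(scalar_share.values()) / total
    # Share of each part within the optimized decoder's runtime.
    simd_share = {part: t / total for part, t in simd_time.items()}
    return overall_speedup, simd_share
```

As a purely hypothetical example, if motion compensation took half of the scalar decoding time and were accelerated by a factor of 2, `project_simd({"MC": 0.5, "OTHER": 0.5}, {"MC": 2.0})` would yield an overall speedup of about 1.33, with motion compensation dropping to one third of the remaining runtime. This illustrates why a configuration dominated by parts that do not benefit from SIMD, such as coefficient parsing, gains less overall.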
CONCLUSION
In this paper, it has been shown that HEVC software decoding of 4K 50 Hz 10-bit video on a quad-core mobile CPU is possible in real time for bitrates up to 20 Mbits/s. In order to achieve that, SIMD code optimization and parallel decoding are essential. In future developments, further speedup could be obtained by using the most recent and upcoming SIMD instruction set extensions like AVX2 and by adding SIMD optimizations for the sample adaptive offset filter.
