Abstract: This paper analyzes the complexity of the HEVC video decoder being developed by the JCT-VC community. The HEVC reference decoder HM 3.1 is profiled with Intel VTune on an Intel Core 2 Duo processor. The analysis covers both Low Complexity (LC) and High Efficiency (HE) settings for resolutions varying from WQVGA (416 × 240 pixels) up to 1600p (2560 × 1600 pixels). The resulting cycle-accurate figures are compared with the respective results of the H.264/AVC Baseline Profile (BP) and High Profile (HiP) reference decoders. HEVC offers a significant improvement in compression efficiency over H.264/AVC: the average BD-rate saving of LC is around 51% over BP, whereas the BD-rate saving of HE is around 45% over HiP. However, the average decoding complexities of LC and HE are 61% and 87% higher than those of BP and HiP, respectively. In LC, the most complex functions are motion compensation (MC) and loop filtering (LF), which account on average for 50% and 14% of the decoder complexity. The decoding complexity of the HE configuration is on average 42% higher than that of the LC configuration. The majority of the difference is caused by extra LF stages. In HE, MC and LF account for 37% and 32% of the decoder complexity, respectively. In practice, a standard 3 GHz dual-core processor is expected to be able to decode 1080p HEVC content in real time.
I. INTRODUCTION
The wireless and wired transmission of next-generation resolutions demands coding efficiency that is beyond the capabilities of the current state-of-the-art H.264/AVC standard [1]. Therefore, MPEG and VCEG have established a Joint Collaborative Team on Video Coding (JCT-VC) to develop a successor to H.264/AVC. This forthcoming international standard is referred to as High Efficiency Video Coding (HEVC) [2]. HEVC focuses on coding of progressively scanned rectangular pictures whose resolution can vary at least between QVGA (320 × 240) and UHDTV (7680 × 4320).
The plan of JCT-VC is to publish draft versions of HEVC in 2012 and the final standard in early 2013. JCT-VC is currently in a collaborative phase, refining the technical content of the draft design that was originally created from the best-performing initial HEVC proposals [3]-[7]. The initial HEVC versions roughly halve the bit rate over H.264/AVC with the same subjective visual quality, whereas the respective BD-rate savings have been measured to be around 20-40% [8]. To make it possible to study trade-offs between complexity and coding efficiency, the HEVC coding tools are separately specified for Low Complexity (LC) and High Efficiency (HE) operation [9].
The public HEVC assessments have mainly focused on its BD-rate and BD-PSNR gains [3]-[8], whereas the complexity evaluation of HEVC is limited either to single HEVC tools [10] or to processing time comparisons between consecutive HEVC versions and H.264/AVC [8], [11]. This paper addresses the cycle-accurate complexity of the HEVC reference decoder and compares the results with the H.264/AVC reference decoder.
Since the standardization is still in progress, the experiments rely on the temporary HEVC Test Model HM 3.1, utilizing both LC and HE random access (RA) configurations [9]. HM 3.1 is benchmarked against the current JM 18.0 [12] reference decoder of H.264/AVC. The JM profiles used are Baseline Profile (BP) and High Profile (HiP). All cycle-accurate profiling results are obtained with Intel® VTune™ Amplifier XE 2011 on an Intel® Core™2 Duo E8400 processor.
The remainder of this paper is organized as follows. Section II presents the HEVC decoder and its main functions. Section III describes the setup for the complexity analysis and reports the cycle-accurate complexities of HM 3.1 LC and HE decoders. Section IV compares the complexities of HM 3.1 and JM 18.0. In addition, practical implementation alternatives for the HEVC decoders are discussed. Section V concludes the paper.
II. HEVC DECODER
The coding structure of HEVC is based on a quadtree scheme in which the size of the square-shaped Coding Unit (CU) is 2N × 2N, where N ∈ {4, 8, 16, 32}. Each CU can be recursively divided into four smaller CUs until N = 4. In inter/intra prediction, the CUs can be further partitioned into rectangular-shaped Prediction Units (PUs). With a CU of size 2N × 2N, the size of the PU can be 2N × 2N, 2N × N, N × 2N, or N × N [13]. For transforms, HEVC specifies the Transform Unit (TU), whose size can be from 4 × 4 to 32 × 32.
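The quadtree partitioning described above can be illustrated with a minimal sketch. The function names and the `should_split` callback are hypothetical and not part of HM; in a real decoder the split decisions are parsed from the bitstream.

```python
from typing import Callable, List, Tuple

MIN_CU_SIZE = 8  # 2N x 2N with N = 4, the smallest CU named in the text


def pu_partitions(cu_size: int) -> List[Tuple[int, int]]:
    """Return the (width, height) PU shapes listed in the text for one CU."""
    n = cu_size // 2
    return [(cu_size, cu_size), (cu_size, n), (n, cu_size), (n, n)]


def split_cu(x: int, y: int, size: int,
             should_split: Callable[[int, int, int], bool]
             ) -> List[Tuple[int, int, int]]:
    """Recursively split a CU into four quadrants.

    Returns the leaf CUs as (x, y, size). `should_split` is a hypothetical
    stand-in for the split flags a decoder would read from the bitstream.
    """
    if size > MIN_CU_SIZE and should_split(x, y, size):
        half = size // 2
        leaves: List[Tuple[int, int, int]] = []
        for dy in (0, half):
            for dx in (0, half):
                leaves += split_cu(x + dx, y + dy, half, should_split)
        return leaves
    return [(x, y, size)]


# Illustrative decision rule: keep splitting only the top-left CU.
leaves = split_cu(0, 0, 64, lambda x, y, size: x == 0 and y == 0)
```

With this rule a 64 × 64 CU ends up as three 32 × 32, three 16 × 16, and four 8 × 8 leaves, which together still cover the original area.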
A general HEVC decoder structure and its main functions are depicted in Fig. 1.
The intra prediction (IP) stage accesses a frame memory to compute the intra prediction (P_intra) for a decoded block. The frame memory contains previously decoded blocks of the current picture. HEVC increases the number of angular IP modes over H.264/AVC by specifying 17 modes for 4 × 4 blocks, 34 modes for 8 × 8, 16 × 16, and 32 × 32 blocks, as well as 3 modes for 64 × 64 blocks. In addition, HEVC contains a planar IP mode. If the decoder operates in IP mode, P_intra is added to a residual block and the reconstructed block is stored in the frame memory.
The motion compensation (MC) stage produces an inter prediction (P_inter) for a decoded block by addressing the decoded picture buffer (DPB) with MVs and Idxs. The DPB contains previously decoded pictures. HEVC uses an 8-tap interpolation filter for luminance and a 4-tap filter for chrominance samples in ⅛-pixel (chrominance only), ¼-pixel, and ½-pixel MC. If the decoder operates in inter prediction mode, P_inter is added to the residual block to form the reconstructed block.
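To make the 8-tap luminance interpolation concrete, the following is a minimal one-dimensional sketch of a half-pixel sample computation. The tap values are an assumption: they are the half-sample luma coefficients of the finalized HEVC standard, whereas the text only states that the filter has 8 taps, so HM 3.1 may differ. The function name and the border handling are likewise illustrative; a real decoder filters in two dimensions with higher intermediate precision.

```python
from typing import Sequence

# Assumption: half-sample luma taps of the finalized HEVC standard (sum = 64).
# The source text only states that the luma filter has 8 taps.
HALF_PEL_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)


def interpolate_half_pel(row: Sequence[int], i: int, bit_depth: int = 8) -> int:
    """Half-pixel sample between row[i] and row[i + 1] (1-D sketch).

    Positions outside the row are clamped to the nearest border sample,
    which is a simplification of reference picture padding.
    """
    last = len(row) - 1
    acc = 0
    for k, tap in enumerate(HALF_PEL_TAPS):
        pos = min(max(i - 3 + k, 0), last)  # taps cover row[i-3] .. row[i+4]
        acc += tap * row[pos]
    value = (acc + 32) >> 6  # add rounding offset, then divide by 64
    return min(max(value, 0), (1 << bit_depth) - 1)  # clip to sample range


flat = interpolate_half_pel([100] * 10, 4)
edge = interpolate_half_pel([0, 0, 0, 0, 100, 100, 100, 100], 3)
```

A flat region is reproduced unchanged, and the half-pixel sample on a step edge lands midway between the two levels. The eight multiplications per output sample in each filtering direction give a sense of why MC dominates the decoder profile.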
The loop filtering (LF) stage removes distortions and visible CU/PU/TU borders from the picture. The LF stage contains three in-loop filters: deblocking filter (DF), adaptive loop filter (ALF), and sample-adaptive offset (SAO). DF corresponds to DF in H.264/AVC, ALF improves quality with diamond-shaped 2D filters [13], and SAO applies offset values indicated in the bitstream [13]. The filters are applied sequentially, each according to the encoder decision. ALF is excluded from LC.
[9] with I-frames roughly at one second intervals and limiting the number of reference pictures in inter prediction to four. Each HM configuration has been run 10 times with all sequences, and the median of the sequence-specific test runs has been selected.
III. HEVC DECODER ANALYSIS
The profiling results are tabulated in TABLE III and TABLE IV, in which only the sequences with maximum and minimum complexities are reported for each format. The absolute complexities of the tabulated sequences are reported as million cycles per frame (Mcpf). In addition, the cycle counts are broken down into percentages for each decoder stage (ED, IQ/IT, IP, MC, and LF). Pre-processing, memory, and post-processing functions that do not belong directly to any of these stages are allocated to the group "Misc".
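The reported metrics can be sketched as follows. This is only an illustration of how a median run, an Mcpf figure, and per-stage percentages relate to raw cycle counts; the function names and all numbers are hypothetical and are not taken from the tables or from the VTune output.

```python
from statistics import median
from typing import Dict, List


def mcpf(total_cycles: float, frames: int) -> float:
    """Million cycles per frame for one decoding run."""
    return total_cycles / frames / 1e6


def stage_shares(cycles_by_stage: Dict[str, float]) -> Dict[str, float]:
    """Percentage of the total cycle count spent in each decoder stage."""
    total = sum(cycles_by_stage.values())
    return {stage: 100.0 * c / total for stage, c in cycles_by_stage.items()}


# Hypothetical cycle counts of repeated runs of one sequence (60 frames).
runs: List[float] = [6.1e9, 5.9e9, 6.0e9, 6.3e9, 6.0e9]
typical_mcpf = mcpf(median(runs), frames=60)

# Hypothetical per-stage breakdown of one run, including the "Misc" group.
shares = stage_shares({
    "ED": 0.3e9, "IQ/IT": 0.36e9, "IP": 0.12e9,
    "MC": 3.0e9, "LF": 0.84e9, "Misc": 1.38e9,
})
```

Taking the median rather than the mean keeps a single disturbed run, for example one interrupted by background activity, from skewing the reported figure.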
In LC, the average shares of ED, IQ/IT, IP, MC, and LF are 5%, 6%, 2%, 50%, and 14% of the decoder complexity, respectively. Changing the settings from LC to HE increases the complexity of HM by 42%, of which the majority is caused by ALF overhead in HE. The respective function-specific shares of HE are 6%, 5%, 1%, 37%, and 32%. When QP is changed from 22 to 37 in LC, the complexities of ED, IQ/IT, IP, MC, and LF decrease by 80%, 57%, 67%, 23%, and 34%, respectively. In HE, the respective decreases are 88%, 66%, 61%, 23%, and 54%.
Fig. 2 depicts the average QP-specific complexities of HM LC/HE and JM BP/HiP at each resolution. As in LC/HE, a hierarchical coding structure is also used in BP/HiP. Since 10-bit precision is not supported by BP/HiP, only 8-bit sequences are compared. On average, LC is 61% more complex than BP, and HE is 87% more complex than HiP. LC reduces the complexity of ED by 20% over BP, whereas the LC overheads of IQ/IT, IP, MC, and LF are 2.3x, 2.3x, 2.2x, and 1.2x, respectively. In HE, the corresponding ratios for ED, IQ/IT, IP, MC, and LF are 0.9x, 2.4x, 1.1x, 1.5x, and 4.2x over HiP. However, as illustrated in TABLE V, LC is able to reduce the average bit rate by about 51% over BP, whereas the average BD-rate saving of HE over HiP is over 45%. The gap has widened from the initial HM versions, where the respective percentages were only 20% and 36% [8].
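The statement that loop filtering causes most of the LC-to-HE increase can be cross-checked from the reported averages alone. The sketch below normalizes the LC total to 1.0 and scales the HE shares by the 42% increase. Because it combines averaged percentages rather than per-sequence cycle counts, the result is only approximate; the variable and function names are illustrative.

```python
from typing import Dict

# Average stage shares reported in the text (fractions of the total).
LC_SHARES = {"ED": 0.05, "IQ/IT": 0.06, "IP": 0.02, "MC": 0.50, "LF": 0.14}
HE_SHARES = {"ED": 0.06, "IQ/IT": 0.05, "IP": 0.01, "MC": 0.37, "LF": 0.32}
HE_OVER_LC = 1.42  # HE is on average 42% more complex than LC


def increase_by_stage() -> Dict[str, float]:
    """Per-stage complexity change from LC to HE, in units of the LC total."""
    return {
        stage: HE_SHARES[stage] * HE_OVER_LC - LC_SHARES[stage]
        for stage in LC_SHARES
    }


delta = increase_by_stage()
# Fraction of the overall 42% increase that is attributable to LF.
lf_fraction = delta["LF"] / (HE_OVER_LC - 1.0)
```

Under these assumptions LF grows from 0.14 to about 0.45 of the LC total, which is roughly three quarters of the overall increase, while MC stays almost constant in absolute terms even though its share drops from 50% to 37%.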
HM and JM realize all features of the respective standard without optimizations, so they are intended for research and conformance testing rather than for use as practical real-time decoders. Since HM is currently the only available HEVC decoder, its attainable complexity reduction is predicted here through the complexity ratio of JM and an optimized H.264/AVC decoder incorporated in FFmpeg [14]. In the same tests, the optimized H.264/AVC decoder consumes on average 75% less computational power than JM on a single thread. If an equivalent speed-up ratio is assumed between HM and an optimized HEVC decoder, the complexity of HEVC decoding would be below 200 Mcpf at 1080p format (TABLE III). That is, the real-time (30 fps) performance requirement would be around 6 000 M cycles per second. In theory, that complexity could be handled by a 3 GHz dual-core processor and a dual-threaded HEVC decoder.
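The real-time estimate above reduces to simple arithmetic, sketched below. The 800 Mcpf input is not a measured value: it is the illustrative upper bound that follows from the 200 Mcpf target and the assumed 75% reduction. The function names are hypothetical, and perfect scaling over two threads is assumed, as in the text.

```python
def optimized_mcpf(reference_mcpf: float, reduction: float = 0.75) -> float:
    """Mcpf after applying the assumed optimization gain (75% as for JM)."""
    return reference_mcpf * (1.0 - reduction)


def required_cycles_per_second(mcpf: float, fps: float = 30.0) -> float:
    """Cycle budget needed to decode `fps` frames per second."""
    return mcpf * 1e6 * fps


def available_cycles_per_second(clock_hz: float, cores: int) -> float:
    """Aggregate cycle budget, assuming perfect scaling over all cores."""
    return clock_hz * cores


# Illustrative bound: 800 Mcpf for HM would give 200 Mcpf after optimization.
needed = required_cycles_per_second(optimized_mcpf(800.0))
budget = available_cycles_per_second(3e9, cores=2)
real_time_feasible = needed <= budget
```

The requirement of 6 000 M cycles per second exactly matches the budget of two 3 GHz cores, so the estimate leaves no headroom for imperfect thread scaling or other system load.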
V. CONCLUSIONS
This paper analyzed the complexity of the HEVC reference decoder (HM 3.1) and compared the results with the H.264/AVC reference decoder (JM 18.0). In HM, changing the settings from LC to HE increases decoding complexity by 42%. The most complex functions of HM are MC and LF, whose respective shares are 50% and 14% in LC and 37% and 32% in HE. Under the same QP value, the average complexities of LC and HE are 61% and 87% higher than those of JM BP and JM HiP, respectively. However, HM noticeably outperforms JM in terms of coding efficiency. The average BD-rate gains of LC over BP and HE over HiP are 51% and 45%, respectively. Assuming that the complexity of HM can be reduced by 75%, as in the case of JM, real-time HEVC decoding up to 1080p format could be possible with a 3 GHz dual-core processor and a dual-threaded HEVC decoder. Improvements in processing technology will further ease the adoption of the HEVC standard in next-generation video products and services.
