Abstract—Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module in the most recent JCT-VC video coding standard HEVC/H.265. As in its predecessor H.264/AVC, CABAC is a well-known throughput bottleneck due to its strong data dependencies. Besides other optimizations, the replacement of the context model memory by a smaller cache has been proposed, resulting in an improved clock frequency. However, the effect of potential cache misses has not been properly evaluated. Our work fills this gap and performs an extensive evaluation of different cache configurations. Furthermore, it is demonstrated that application-specific context model prefetching can effectively reduce the miss rate and make it negligible. The best overall performance results were achieved with caches of two and four lines, where each cache line consists of four context models. Four cache lines allow a speed-up of 10% to 12% for all video configurations, while two cache lines improve the throughput by 9% to 15% for high-bitrate videos and by 1% to 4% for low-bitrate videos.
I. INTRODUCTION
High Efficiency Video Coding (HEVC/H.265) [1] is the most recent video coding standard developed by the Joint Collaborative Team on Video Coding (JCT-VC). It allows the compression of videos with the same perceptual quality as its predecessor H.264/AVC [2] while requiring only half the bitrate. Context-based Adaptive Binary Arithmetic Coding (CABAC) [3], [4] has been a throughput bottleneck in AVC due to its sequential nature, and it still is in HEVC. CABAC operates on the bitstream and decodes binary symbols (bins) which are concatenated to form syntax elements that control the remaining decoding steps. Where possible, bins are associated with context models that estimate the probability of each bin value; these context-coded bins exploit statistical properties and thereby increase the compression rate. However, when the probabilities cannot be predicted accurately, bins are decoded without context models; these are the so-called bypass-coded bins.
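For illustration, the sketch below shows in simplified form why context-coded bins are harder to parallelize than bypass-coded bins: each context-coded bin both reads and updates its context model, and the updated state is needed for the very next bin. This is a behavioral sketch only; the actual HEVC decoder uses table-driven integer range subdivision with 6-bit probability states, so the floating-point probabilities and update constants here are illustrative assumptions.

```python
# Minimal behavioral sketch of context-adaptive binary arithmetic
# decoding (illustrative only, not the table-driven HEVC procedure).

class ArithDecoder:
    def __init__(self, bits):
        self.bits = iter(bits)     # bitstream as an iterable of 0/1
        self.range_ = 512          # current interval width
        self.offset = 0
        for _ in range(9):         # initialize the offset from the stream
            self.offset = (self.offset << 1) | self._next_bit()

    def _next_bit(self):
        return next(self.bits, 0)

    def _renormalize(self):
        while self.range_ < 256:   # keep the interval wide enough
            self.range_ <<= 1
            self.offset = (self.offset << 1) | self._next_bit()

    def decode_context_bin(self, ctx):
        # ctx = [p_lps, mps]: an adaptive estimate that is read AND written
        # per bin, which is the dependency that serializes context-coded bins.
        r_lps = max(2, int(self.range_ * ctx[0]))
        self.range_ -= r_lps
        if self.offset >= self.range_:          # offset in the LPS part
            bin_val = 1 - ctx[1]
            self.offset -= self.range_
            self.range_ = r_lps
            ctx[0] = min(0.5, ctx[0] + 0.02)    # adapt towards the LPS
        else:
            bin_val = ctx[1]
            ctx[0] *= 0.95                      # adapt towards the MPS
        self._renormalize()
        return bin_val

    def decode_bypass_bin(self):
        # Bypass bins assume p = 0.5: no context load, no update and no
        # renormalization loop, hence the simple and fast data path.
        self.offset = (self.offset << 1) | self._next_bit()
        if self.offset >= self.range_:
            self.offset -= self.range_
            return 1
        return 0
```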
Strong bin-to-bin dependencies make low-level parallelization of CABAC decoding very challenging. Although high-level parallelization is possible in HEVC, it has to be enabled in the encoder, which is not mandatory. Besides many other optimizations, the replacement of the context model memory in the data path by a smaller cache has been proposed [5], [6]. This aims to shorten the critical path and increase the clock frequency and throughput. Unfortunately, the effect of potential cache misses has not been properly evaluated. Cache misses result in a performance degradation that might nullify much of the throughput improvement gained by the introduction of the cache. Prefetching has been proposed to address this issue [7], but a quantitative evaluation is also missing. In this paper we evaluate the cache miss rate both without and with prefetching to determine whether these are gainful optimizations.
Our work makes the following contributions:
• an optimized cache architecture (as a result of an extensive evaluation of different configurations)
• an efficient context model memory layout regarding spatial locality and prefetching efficiency, as well as the corresponding prefetching algorithm
• an evaluation of prefetching efficiency for different cache configurations

The remaining paper is structured as follows. An overview of related work is provided in Section II. The proposed decoder architecture with a context model cache and a prefetching module is described in Section III. Afterwards, Section IV quantitatively evaluates the cache miss rate and prefetching efficiency for different cache configurations. Finally, our work is concluded in Section V.
II. RELATED WORK
Many implementations of CABAC hardware decoders have been proposed. Although most of them cover AVC, the general ideas are also applicable to HEVC. Additionally, HEVC CABAC is designed to allow higher throughput by various optimizations in the standard, such as a reduced fraction of context-coded bins, grouping of bypass-coded bins, a reduction of the total number of bins, as well as more relaxed parsing and context selection dependencies [4].
Two architectural optimizations have proven to be effective, and at least one of them can be found in almost every proposed CABAC hardware decoder.

Figure 1: Two-stage pipeline of the proposed decoder architecture.

The first optimization is the decoding of multiple bins per cycle [8], [9], [10]. It is well-suited for bypass-coded bins because their decoding process is simple and does not require context models. The second optimization is pipelining, which is used to overlap the decoding of consecutive bins and thereby increase the throughput. Most proposals agree on a conceptual four-stage pipeline for CABAC decoding: context selection (CS) → context load (CL) → arithmetic decoding (AD) → de-binarization (DB). However, neighboring stages are often merged because the efficient implementation of deep pipelining is very complex for CABAC due to strong bin-to-bin dependencies. For example, CS depends on the results of AD and DB, which might lead to a flush of the complete pipeline. Decoders with three pipeline stages include [5] and [9]. Chen et al. [8] propose a very deep pipeline: a fifth pipeline stage is added at the beginning to compute a binary decision tree that is used for state prefetching in the remaining stages and thereby reduces pipeline stalls. Additionally, their implementation decodes up to two bypass-coded bins per cycle, resulting in a throughput of up to 2 Gbin/s, which represents the state-of-the-art. The CL stage can be shortened when the context model memory is replaced by a small cache; the stage might even be removed when the cache access fits into an adjacent stage. The shorter pipeline might result in fewer pipeline stalls. Cached designs have been proposed [5], [6], [7]; however, the potential performance degradation due to cache misses has not been evaluated. Prefetching was used to reduce the cache miss rate [7], but results for this optimization were also not provided. Our work evaluates both the cache miss rate and the effectiveness of prefetching.
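As a rough illustration of why these dependencies hurt a deep pipeline, the following toy cycle counter (our own illustrative model, not the architecture of any cited design) compares back-to-back issue against the case where a bin's context selection must wait for the previous bin's AD result:

```python
# A toy cycle counter for the conceptual four-stage pipeline
# CS -> CL -> AD -> DB. Purely illustrative: a bin whose context
# selection depends on the previous bin's decoded value cannot enter CS
# until that bin has left the AD stage, costing two stall cycles here.

NUM_STAGES = 4  # CS, CL, AD, DB

def pipeline_cycles(bins):
    """bins: list of booleans; True means the context selection of this
    bin depends on the decoded value of the previous bin."""
    start = 0            # cycle in which the current bin enters CS
    cycles = 0
    prev_start = 0
    for i, depends in enumerate(bins):
        if depends and i > 0:
            # Wait until the previous bin's AD result is available
            # (its CS started at prev_start, AD finishes two cycles later).
            start = max(start, prev_start + 3)
        prev_start = start
        cycles = start + NUM_STAGES   # cycle in which this bin leaves DB
        start += 1                    # next bin may enter one cycle later
    return cycles

print(pipeline_cycles([False] * 8))           # -> 11 (fully overlapped)
print(pipeline_cycles([False] + [True] * 7))  # -> 25 (2 stalls per dependent bin)
```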
III. ARCHITECTURE
The proposed decoder architecture is implemented with a two-stage pipeline (see Figure 1).
A. Context Model Cache
A non-cached version of the decoder has been implemented as a reference where the context model memory is directly accessed. The memory contains sixteen memory sets of context models and is capable of fast in-memory copies. This allows the maintenance of multiple context model memory sets and thus supports efficient CABAC decoding when high-level parallelization tools (wavefront processing, tiles) are used.

A cache can be used to replace the context model memory in the critical path and thereby allow a higher clock frequency. The cache fetches context model sets (CMS's) from the memory and writes them back when they have to be replaced. In our implementation, a cache miss results in a miss penalty of two clock cycles while the missing CMS is loaded from memory. CMS replacement is handled by a least recently used (LRU) policy. The cache is fully associative and contains a generic number of cache lines (1 to 64), each storing one CMS of four context models.

Table I shows the optimized context model memory layout. Mostly, context models for the same type of syntax element are grouped to exploit spatial locality, e.g., last_sig_x/y_prefix (ll. 0-9), sig_flags (ll. 16-27) and absG1_flags (ll. 32-37). However, in some cases context models that are not logically connected are put in the same CMS (e.g., ll. 10, 11, 29, sig_Cb/Cr_4x4). In all other cases the required context models fit in a CMS. With a smaller CMS size, the required context models for 4×4 transform sub-blocks would not fit because at least three sig_coeff_flags and four coeff_abs_level_greater1_flags are potentially needed. This is a critical issue as transform block decoding contains a high fraction of the decoded bins. Bigger CMS sizes would require that context models for different types of syntax elements are merged to keep the memory overhead small. Unfortunately, the context models that are used for the decoding of consecutive groups of equal syntax elements often depend on different parameters. For example, sig_coeff_flags depend on the transform block size and scan pattern, while coeff_abs_level_greater1_flags depend on the decoded bins in the previous 4×4 sub-block.
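The cache behavior described above can be captured in a few lines of behavioral pseudocode. The following Python sketch (our own model for illustration; the actual design is a hardware implementation) reproduces the fully associative organization, the LRU replacement and the two-cycle miss penalty; CMS indices stand in for the memory lines of Table I:

```python
# Behavioral model of the fully associative CMS cache with LRU
# replacement and a two-cycle miss penalty.

from collections import OrderedDict

class CmsCache:
    MISS_PENALTY = 2  # clock cycles to load a CMS from the context memory

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()   # cms_index -> None, ordered by recency
        self.hits = self.misses = self.stall_cycles = 0

    def access(self, cms_index):
        """Access one CMS; returns the stall cycles this access costs."""
        if cms_index in self.lines:
            self.lines.move_to_end(cms_index)   # refresh LRU position
            self.hits += 1
            return 0
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)      # evict least recently used
        self.lines[cms_index] = None
        self.misses += 1
        self.stall_cycles += self.MISS_PENALTY
        return self.MISS_PENALTY

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0

# Example: a trace alternating between two CMS's fits in two lines,
# so only the two cold misses remain.
cache = CmsCache(num_lines=2)
for cms in [16, 32, 16, 32, 16]:
    cache.access(cms)
print(cache.miss_rate())   # -> 0.4 (2 cold misses out of 5 accesses)
```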
B. Context Model Prefetching
Application-specific context model prefetching can significantly reduce the miss rate in HEVC CABAC. Admittedly, the required context model often depends on the result of the previously decoded bin and cannot be prefetched early enough. However, most of the time it is certain that the required context model is contained in a specific set of context models. If this set is available in the cache, a hit is guaranteed even though the exact context model is not known in advance.
The prefetching module reads the current state of the control state machine, the decoder settings, some decoded syntax elements and the currently decoded bin. Based on this information, it selects up to two CMS candidates that are likely to be needed soon. As the module keeps track of the CMS's in the cache, it can see whether they are already present. If one is not, a read request is sent to the context model memory. The first candidate has a higher priority than the second, so the second is only fetched when the first is already available. Unfortunately, this can lead to a behavior where the first candidate is available in the cache but is then replaced by the second because it is the next to be replaced according to the LRU policy. A refresh mechanism is implemented to avoid this behavior: the LRU index of the first CMS candidate is reset if it is already in the cache, so it will not be replaced next. The prefetching strategy depends on the number of available cache lines (CLs). The strategy for at least four CLs fetches CMS's that are likely to be used soon, while for smaller caches only those CMS's are fetched that will be used for certain. Prefetching with one CL can only be performed when no CMS is currently in use or when it is known that the current CMS will not be needed anymore; the second candidate is not used by the one-CL strategy.
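Continuing the behavioral sketch from Section III-A, the two-candidate priority scheme and the LRU refresh can be modeled as follows. The candidate selection itself is driven by the decoder state machine and is application specific, so it is represented here only by the two index arguments; the memory latency is likewise not modeled:

```python
# Behavioral sketch of the prefetching module, built on the CmsCache
# model above.

def prefetch(cache, first_candidate, second_candidate=None):
    """first_candidate/second_candidate: CMS indices likely needed soon."""
    if first_candidate in cache.lines:
        # Refresh: make the first candidate most recently used so a
        # subsequent fetch of the second candidate cannot evict it.
        cache.lines.move_to_end(first_candidate)
        if second_candidate is not None and second_candidate not in cache.lines:
            issue_read(cache, second_candidate)
    else:
        # The first candidate has priority; the second is only fetched
        # once the first is already available.
        issue_read(cache, first_candidate)

def issue_read(cache, cms_index):
    # Send a read request to the context model memory; for brevity the
    # latency is omitted and the arriving CMS replaces the LRU line.
    if len(cache.lines) >= cache.num_lines:
        cache.lines.popitem(last=False)
    cache.lines[cms_index] = None
```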
IV. EVALUATION
A hybrid HEVC decoder has been realized as a hardware-software co-design on the Xilinx Zynq-7045 SoC to validate the functionality of the proposed CABAC hardware decoder. The CABAC decoder is implemented in the FPGA while the remaining parts are executed on the ARM CPU. The highly optimized HEVC software decoder developed by the Embedded Systems Architecture Group at TU Berlin is used [11]. Five test sequences from the JCT-VC class B test set (1080p) are used for evaluation. They are encoded in all-intra (AI) and random-access (RA) mode with quantization parameters (QP) of 14, 22, 30 and 38. Wavefront Parallel Processing is enabled so that the same context models are used for a row of thirty CTUs. In the remainder of this paper, the arithmetic mean of the results for the test sequences is reported.
The remaining evaluation section is structured as follows. First, the impact of different cache sizes on the clock frequency is shown to provide an upper boundary for the overall speed-up. Afterwards, the cache miss rate without prefetching is presented to illustrate that a cache does not improve the overall throughput without further measures. Finally, it is demonstrated that the miss rate can be significantly reduced when application-specific context model prefetching is used, resulting in different overall speed-ups depending on the cache size.
A. Clock Frequency
The purpose of replacing the context model memory in the data path by a smaller cache is to shorten the critical path and thereby increase the achievable clock frequency and throughput. The proposed design has been synthesized with Xilinx Synthesis Technology 14.6 (optimization goal: speed). Both the memory and the cache are forced to be synthesized with the same FPGA resources to get a fair comparison that is not only valid for FPGAs. The influence of the cache size on the maximum clock frequency of the decoder can be seen in Figure 2. It is increased by 20.1% for a single CL, by 17.3% for two CLs and by 12.3% for four CLs. While there is no significant improvement for eight CLs (2.8%), the clock frequency is reduced for bigger caches. The rapid clock frequency reduction comes from the LRU implementation and CL selection, which are not well suited for bigger caches with full associativity. It should also be noted that these results can vary for different implementations; e.g., shorter pipeline stages can lead to greater relative improvements. While the improved clock frequency accelerates the decoding of all bins, only context-coded bins can result in cache misses. The fraction of context-coded bins in the test sequences varies from 64% to 77%. However, as up to two bypass-coded bins can be decoded in parallel, the fraction of decoding time spent on context-coded bins is slightly higher (74% to 83%).
B. Performance without Prefetching
Unfortunately, the improved clock frequency comes at the cost of potential cache misses. They lead to stalls in the decoder pipeline and thereby reduce the overall throughput. Figure 3 (top) presents the cache miss rate for different cache sizes and video modes. In general, the miss rate grows with higher QPs. This is due to the fact that smaller QPs result in more bins because of less quantization. As bins of the same syntax elements are grouped, temporal and spatial locality can be better exploited when accessing the required context models in the cache. A significant miss rate reduction can be observed for all video modes when the number of CLs is increased. However, there is no noticeable improvement with 64 CLs, where all context models required for the decoding of a specific CTU fit in the cache; the remaining cache misses are cold misses that occur on the first access. Often not all context models are used during the decoding of a CTU, especially if only few bins are decoded as in the RA QP38 configuration. In this case, 32 CLs are also sufficient and lead to the same results as 64 CLs. As there are high miss rates for few CLs and reduced clock frequencies for more than eight CLs, an overall performance improvement cannot be reached without prefetching. For example, with an AI QP14 video and two CLs (74% context-coded bin cycle fraction, 17.3% higher clock frequency, 22.6% miss rate), the overall performance is only 88% of that of a non-cached decoder.
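This figure is consistent with a simple first-order model (our own back-of-the-envelope estimate, not a formula from the evaluation itself): with relative clock frequency $f_{clk}$, context-coded cycle fraction $f_{ctx}$, miss rate $m$ and miss penalty $p$ cycles, the relative throughput is

$$T = \frac{f_{clk}}{1 + f_{ctx} \cdot m \cdot p} = \frac{1.173}{1 + 0.74 \cdot 0.226 \cdot 2} \approx 0.88.$$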
C. Performance with Prefetching
Our prefetching algorithm significantly reduces the cache miss rate (see Figure 3 (bottom)). The miss rate with only one CL is still not acceptable as it is greater than 20% for all configurations because of the restricted prefetching opportunities. Two CLs already result in significant improvements that depend on the video mode. For all AI modes and for the low QP RA modes, the miss rate is less than 5%, but it almost reaches 10% for the high QP RA modes. With four and more CLs, the miss rate is less than 1.5% for all modes. As a result, the decoder performance is no longer noticeably affected by the cache miss rate (less than 2.5% performance reduction) and almost the full gain of the clock frequency improvement remains. Figure 4 shows the speed-up over the non-cached design, but only for the cache sizes where an improvement was achieved. The miss rate for a single CL is too high to result in an overall speed-up (10% to 23% reduction depending on the video mode), while more than eight CLs cannot improve the throughput due to the reduced clock frequency: 16, 32 and 64 CLs result in throughput reductions of 8%, 17% and 21%, respectively. The cache with eight CLs combines a miss rate of at most 1% with a small clock frequency improvement of 2.8%, leading to a 1.0% to 2.5% throughput improvement.
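Plugging the reported clock gains and post-prefetching miss rates into the first-order model introduced in Section IV-B reproduces these speed-ups to first order (a sanity check with assumed representative values, not the methodology behind Figure 4):

```python
# Sanity check: apply T = f_clk / (1 + f_ctx * m * p) to the reported
# figures. f_ctx = 0.8 is a representative value from the 74%-83% range;
# the miss rates are the post-prefetching upper bounds quoted above.

def relative_throughput(f_clk, miss_rate, f_ctx=0.8, penalty=2):
    return f_clk / (1 + f_ctx * miss_rate * penalty)

configs = {                      # cache size: (clock gain, miss rate)
    "2 CL": (1.173, 0.05),
    "4 CL": (1.123, 0.015),
    "8 CL": (1.028, 0.01),
}
for name, (f_clk, m) in configs.items():
    print(f"{name}: {relative_throughput(f_clk, m):.3f}")
# -> 2 CL: 1.086, 4 CL: 1.097, 8 CL: 1.012, matching the reported
#    speed-up ranges to first order.
```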
D. Resource Utilization

Table III compares the resource utilization of the non-cached CABAC decoder with the cached designs with two and four CLs. Synthesis has been performed with area optimization to get meaningful results for a comparison between the different designs. Results with speed optimization are also provided to allow a fair comparison with other implementations. Two main conclusions can be drawn from the results. First, the cached designs (2 CL and 4 CL) require only 6% and 11% more registers, as well as 7% and 10% more LUTs, compared to the non-cached design. Second, less than 3% of the FPGA resources are needed to implement the CABAC decoder including the processor interface.
V. CONCLUSIONS
A quantitative performance analysis of an HEVC CABAC decoder has been conducted in this paper. We focused on the evaluation of the miss rate when the context model memory is replaced by a smaller cache. While this replacement results in significant clock frequency improvements for small cache sizes, the resulting cache misses can nullify this gain. However, the cache miss rate can be effectively reduced with a well-designed context model memory layout and a corresponding prefetching strategy.
The configurations with two and four CLs are most promising. Four CLs result in a speed-up of 10% to 12% due to effective prefetching and a solid clock frequency improvement. Two CLs allow even higher clock frequencies but the miss rate is also higher, especially for high QP RA videos. As a result the design with two CLs outperforms the design with four CLs for high bitrate videos (9% to 15% speed-up) but not for low bitrates (1% to 4% speed-up). However, as CABAC throughput is not critical for the latter, the design with two CLs can still be the preferred option.
Beyond the direct throughput improvement due to the enhanced clock frequency, other designs might remove the pipeline stage that performs the context model memory access when the cache can be shifted to an adjacent stage. The shortened pipeline might also significantly improve the throughput, as the strong dependencies in CABAC decoding make deep pipelining inefficient.
