Abstract-This paper proposes a 16x16 and 32x32 inverse transform architecture for HEVC (High Efficiency Video Coding). The large 16x16 and 32x32 transforms of HEVC suffer from very high computational complexity. To resolve this problem, we propose a new large inverse transform architecture based on hardware reuse. The processing element is optimized by exploiting the fully recursive and regular butterfly structure. To achieve low area, the processing element is implemented with shifters and adders, without multipliers. Implementation of the proposed 2-D inverse transform architecture in 0.18 μm technology achieves an operating frequency of about 300 MHz with an area of 287 Kgates, which can process 4K (3840x2160) @ 30 fps images.
I. INTRODUCTION
With the recent development of digital video compression technology, high-quality video applications such as HDTV have become popular. In the near future, next-generation video devices will offer much higher definition and resolution, such as UHD (Ultra High Definition) TV. In these services, the amount of multimedia data increases tremendously. For example, the data amount for UHD TV is sixteen times as large as that for full-HD TV. It is difficult to transmit UHD-resolution data to end users over current networks using current video coding standards such as H.264/AVC.
To resolve this problem, ISO-IEC/MPEG and ITU-T/VCEG recently formed the joint collaborative team on video coding (JCT-VC). It aims to develop the next-generation video coding standard called high efficiency video coding (HEVC) [1]. The main goal of HEVC is to achieve high compression, where the data rate is reduced by 50% compared to H.264/AVC at the same picture quality.
However, HEVC requires a huge amount of computation because it uses various complex algorithms to achieve high compression efficiency. Its computation is said to be 2-4 times larger than that of H.264/AVC at the same picture size. Furthermore, the computation grows rapidly with resolution, since HEVC supports images of up to 8K.
In practice, video coding standards such as H.264/AVC should be implemented in an SoC, where area reduction is a major concern [2, 3]. In particular, the huge computation of HEVC results in an extremely large area. To resolve this problem, a new design methodology is required. In this paper, a low-area inverse transform architecture for HEVC is proposed. By exploiting the fully recursive and regular butterfly structure, the proposed architecture achieves low area through hardware reuse, where its processing element consists of shifters, adders, and multiplexers only.
II. HEVC TRANSFORM

Coding Structure
In H.264/AVC, the basic coding unit is the MB (Macro Block) with 16x16 pixels. HEVC, in contrast, uses several basic units, i.e. CU (Coding Unit), PU (Prediction Unit), and TU (Transform Unit). The CU is the basic coding unit, like the MB in H.264/AVC. It is considered to be the fundamental square-shaped unit and has various sizes. The PU is the basic prediction unit. It is defined after the last level of CU splitting, so a CU can be further split into PUs. The TU is the basic unit for transform and quantization. Its size must be smaller than or equal to the CU size, but it can be larger than the PU size. The overall coding structure is characterized by CU, PU and TU.
Manuscript received Sep. 4, 2011; revised Dec. 1, 2011. School of Electronic Engineering, Soongsil University, Sangdo-dong, Dongjak-gu, Seoul, 156-743, Korea. E-mail: sslee@ssu.ac.kr
By using various CU sizes, efficient encoding for various spatial resolutions and block characteristics is possible. In general, when the spatial resolution is low or pixel values change significantly in a local area, intra and inter prediction with small CUs is more useful, as shown in Fig. 1(a). When the spatial resolution is high or pixel values change only a little in a local area, a large CU can improve coding efficiency, as shown in Fig. 1(b). When a large CU can be used for prediction instead of a small CU, the prediction error does not increase significantly [4].
HEVC exploits the RQT (Residual Quadtree) structure [5, 6]. The prediction error is transformed and quantized based on the RQT structure, as shown in Fig. 2. The TU size is adaptively determined based on the prediction error characteristics of the PUs. A PU can be further split into several TUs if the prediction errors of the split TUs in the PU are quite different. Conversely, several PUs can be combined into a TU if the prediction errors of the combined PUs in the TU are quite similar. By transforming and quantizing TUs of various sizes, the overall coding efficiency can be significantly improved.
Large Transform
For high resolution displays, a large transform has several advantages such as better energy compaction and reduced quantization error. In HD or UHD images, most image patterns in an MB with 16x16 pixels represent a small part of objects or backgrounds, which can be described as relatively homogeneous texture patterns with little variation. Therefore, the coding efficiency of high resolution video can be improved by using a large transform as well as a large block size [7].
To reduce complexity, the HEVC large transform is based on Chen's fast DCT algorithm [8], which has a fully recursive and regular butterfly structure. Fig. 3 shows a signal flow graph of the HEVC 32x32 inverse transform [5] based on Chen's fast DCT algorithm. The inputs are 32 pixels, and they are processed with 8-stage butterfly operations, so the required complexity is very high. The dotted box in Fig. 3 shows the signal flow graph of the HEVC 16x16 inverse transform. As the 16x16 inverse transform is also based on Chen's fast DCT algorithm, it has the same butterfly structure with 6 stages, but the coefficient values are different. Both the 16x16 and 32x32 inverse transforms have fully recursive and regular butterfly structures.
Compared with the conventional 8x8 or 4x4 H.264 inverse transform, the HEVC large inverse transforms (16x16 and 32x32) have extremely high complexity. This requires an impractically large hardware size, so it is important to minimize hardware area. In this paper, we propose a low-area architecture for the HEVC large inverse transform as follows. It reduces area by reusing processing elements (PE) [9]. The PE is optimized to be implemented with shifters, adders, and multiplexers only.
Fig. 3. But it does not meet the required throughput if only one PE is used. Therefore, we use 16 PEs in a processing unit, so that 16 pixels are computed by one processing unit. Note that the processing unit and the PE correspond to the thin solid line and bold solid line rectangles in Fig. 3, respectively.
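The even/odd symmetry that underlies such butterfly structures can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: it uses a floating-point, unnormalized inverse DCT rather than the integer coefficients of the HEVC draft, and it decomposes only the even half recursively while leaving the odd half as a direct matrix product, so it is not the full Chen factorization of Fig. 3. The function names `idct_direct` and `idct_butterfly` are hypothetical.

```python
import math


def idct_direct(coeffs):
    """Unnormalized N-point inverse DCT computed directly, O(N^2)."""
    n = len(coeffs)
    return [
        sum(coeffs[k] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
            for k in range(n))
        for i in range(n)
    ]


def idct_butterfly(coeffs):
    """Same transform via recursive even/odd decomposition.

    Even-indexed coefficients form an N/2-point inverse DCT (recursion);
    odd-indexed coefficients are combined by a direct N/2 x N/2 product.
    A final butterfly stage (one add, one subtract per pair) merges them.
    Assumes len(coeffs) is a power of two.
    """
    n = len(coeffs)
    if n == 1:
        return [coeffs[0]]
    half = n // 2
    even = idct_butterfly(coeffs[0::2])
    odd = [
        sum(coeffs[2 * j + 1]
            * math.cos((2 * i + 1) * (2 * j + 1) * math.pi / (2 * n))
            for j in range(half))
        for i in range(half)
    ]
    out = [0.0] * n
    for i in range(half):
        out[i] = even[i] + odd[i]          # upper output of the butterfly
        out[n - 1 - i] = even[i] - odd[i]  # mirrored lower output
    return out
```

Because the 16-point transform appears as the even half of the 32-point transform in this decomposition, the sketch also hints at why one structure can serve both sizes.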
III. IMPLEMENTATION
Hardware Reuse
A processing unit can compute both 16x16 inverse transform and 32x32 inverse transform. If 16 PEs in a processing unit perform six stages as dotted line rectangle in Fig. 3 
PE Optimization (1): Multiplexers
One serious problem of PE architecture in Fig. 4 
PE Optimization (2): Shifters and Adders
In a straightforward design, each PE requires 2 multipliers, which is impractically large. To reduce hardware area, the PE can be implemented with shifters and adders, without multipliers [9], based on coefficient decomposition. Table 2 shows only the 1st stage; considering all stages of the 16x16 and 32x32 inverse transforms, the maximum number of shifts for each PE is 4 or 5. As shown in Fig. 5, we therefore use 4 shifters (Type A) or 5 shifters (Type B).
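The idea of replacing a constant multiplier by shifts and additions can be sketched in Python as follows. This is a minimal illustration, assuming a plain binary decomposition of a non-negative integer coefficient; the decomposition actually used in Table 2 may differ (for example, it may also use subtraction), and the constant 90 as well as the function names are chosen here only for illustration.

```python
def shift_amounts(coeff):
    """Return the bit positions that are set in a non-negative coefficient."""
    if coeff < 0:
        raise ValueError("this sketch only handles non-negative coefficients")
    return [s for s in range(coeff.bit_length()) if (coeff >> s) & 1]


def shift_add_multiply(x, coeff):
    """Multiply x by coeff using only left shifts and additions."""
    return sum(x << s for s in shift_amounts(coeff))


# 90 = 64 + 16 + 8 + 2, so four shifters and three adders are enough.
print(shift_amounts(90))           # [1, 3, 4, 6]
print(shift_add_multiply(7, 90))   # 630
```

In this sketch the number of shifters needed for a coefficient equals the number of set bits, which is the quantity that the Type A (4 shifters) and Type B (5 shifters) PEs bound.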
2-D Inverse Transform
For high resolutions such as 4K and 8K images, a 32x32 2-D inverse transform is required. In this case, 1 processing unit with 16 PEs is not enough to meet the throughput requirement. Therefore, we used two 1-D processing units in a pipelined manner. As shown in Fig. 6, when one processing unit executes the 2-D transform for the current TU#0, the other processing unit can execute the 1-D transform for the next TU#1, so the total number of operation cycles can be reduced. However, in this case, two 32x32 buffers are required: one for the TU#0 outputs and the other for the TU#1 outputs. Furthermore, these buffers should perform the transpose operation.
To resolve this problem, we used a single 32x32 transpose buffer shared by the 2 processing units. As shown in Fig. 6 Fig. 7(b). Whenever the 2-D transpose data (32x1) of the current TU are calculated, the transpose data shift up. The proposed architecture supports both 16x16 and 32x32 inverse transforms, so the transpose data (16x16) are stored in the top-right corner of the transpose buffer when the proposed architecture performs the 16x16 inverse transform.
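The role of the transpose buffer between the two 1-D passes can be sketched in Python as below. This is a functional illustration only, assuming a generic row-wise 1-D transform given as a matrix; it does not model the pipelining of the two processing units or the shift-up behaviour of the shared buffer, and all names are hypothetical.

```python
def transform_rows(block, matrix):
    """Apply a 1-D transform (row vector times matrix) to every row."""
    size = len(matrix)
    return [
        [sum(row[k] * matrix[k][j] for k in range(size)) for j in range(size)]
        for row in block
    ]


def transpose(block):
    """Swap rows and columns, as the transpose buffer does."""
    return [list(col) for col in zip(*block)]


def inverse_transform_2d(block, matrix):
    """Separable 2-D transform: 1-D pass, transpose, 1-D pass, transpose."""
    first_pass = transform_rows(block, matrix)
    second_pass = transform_rows(transpose(first_pass), matrix)
    return transpose(second_pass)


# Tiny 2x2 example with a Hadamard-like matrix, chosen only for illustration.
A = [[1, 1], [1, -1]]
X = [[1, 2], [3, 4]]
print(inverse_transform_2d(X, A))  # [[10, -2], [-4, 0]]
```

The same 1-D routine is reused for both passes, which is why a transpose step is needed between them.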
Proposed Architecture
IV. RESULTS
The proposed 2-D HEVC large inverse transform (16x16, 32x32) is implemented in 0.18 μm CMOS technology. The total gate count of the proposed architecture is about 287K gates (2-NAND equivalent). Of these, the transpose buffer accounts for about 183K gates.
Since no 32x32 inverse transform architecture has been proposed in previous research yet, we compared the proposed architecture with other 1-D 8x8 inverse transform architectures, as shown in Table 3. Considering the 32x32 block size (16 times larger than 8x8), the proposed architecture shows a quite small gate count. Fig. 9 shows the layout of the proposed HEVC large inverse transform IP. Its maximum operating frequency is 300 MHz.
The execution time of the proposed architecture for a 32x32 block is 481 cycles, i.e. 1024/481 = 2.1289 pixels/cycle, as shown in Fig. 6. The required throughput for a 4K (3840x2160) @ 30 fps image is 3840 x 2160 x 30 = 248,832,000 pixels/s. Therefore, the required operating frequency is 248,832,000/2.1289 ≈ 117 MHz. However, in many video decoders, there may be some data stalls while waiting for data to and from connected IPs such as the motion compensation, intra prediction, and inverse quantization IPs. Even if the proposed architecture is stalled for 50% of its operation time, it can easily support 4K@30 fps images, since the required operating frequency is then 234 MHz.
In H.264 decoders, the inverse transform IP is small and fast. Therefore, in many cases, it is directly connected to other IPs without an interface buffer. Note that an interface buffer greatly reduces data stalls, but it occupies much additional area. However, the 32x32 HEVC inverse transform IP is quite large, so it is advantageous to use an interface buffer to speed it up. When the proposed architecture is stalled for 20% of its operation time, the required operating frequency for a 4K@60 fps image is 292 MHz. Therefore, the proposed architecture can support 4K@60 fps images by exploiting an interface buffer.
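The frequency figures above can be reproduced with a short Python sketch. It assumes that a stall fraction s simply scales the effective throughput by (1 - s), which is consistent with the 234 MHz and 292 MHz values quoted in the text; the function name and parameter names are hypothetical.

```python
def required_mhz(width, height, fps, stall_fraction=0.0,
                 cycles_per_block=481, block_size=32):
    """Operating frequency (MHz) needed to sustain the given video format."""
    pixels_per_cycle = block_size * block_size / cycles_per_block  # 1024/481
    pixels_per_second = width * height * fps
    effective_pixels_per_cycle = pixels_per_cycle * (1.0 - stall_fraction)
    return pixels_per_second / effective_pixels_per_cycle / 1e6


print(round(required_mhz(3840, 2160, 30)))       # 117 (no stall)
print(round(required_mhz(3840, 2160, 30, 0.5)))  # 234 (50% stall)
print(round(required_mhz(3840, 2160, 60, 0.2)))  # 292 (20% stall)
```

All three results stay below the 300 MHz maximum operating frequency reported for the implementation.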
V. CONCLUSIONS
In this paper, we proposed a 2-D large inverse transform architecture for HEVC. To reduce hardware area, we exploited hardware reuse and an optimized adder/shifter/multiplexer-only PE architecture. Implementation of the proposed 2-D inverse transform architecture in 0.18 μm technology achieves an operating frequency of about 300 MHz with an area of 287 Kgates, which can process 4K@30 fps images.
