# REGISTER TRANSFER LEVEL DESIGN OF TRANSPOSE MEMORY FOR THE TWO-DIMENSION INVERSE DISCRETE COSINE TRANSFORM FOR HIGH EFFICIENCY VIDEO CODING

GOH DIH JIANN

UNIVERSITI TEKNOLOGI MALAYSIA

A project report submitted in partial fulfilment of the requirements for the award of the degree of Master of Engineering (Computer and Microelectronic System)

Faculty of Electrical Engineering
Universiti Teknologi Malaysia

JUNE 2018

Specially dedicated to my supervisor, friends and family who encouraged me throughout my journey of education.

# ACKNOWLEDGEMENT

I would like to take this opportunity to thank my family members for their support, which allowed me to focus on fulfilling the objectives of this project. I would also like to express my sincere appreciation to my friends and colleagues for giving me time and space, and for their encouragement. I am also very thankful to my project supervisor, Dr. Ab Al-Hadi Ab Rahman, for his guidance and advice. The support from everyone around me has always been my strongest motivation.

# ABSTRACT

The rapid evolution of consumer devices has produced a variety of emerging video coding applications, which in turn place aggressive demands on video compression. The required compression efficiency keeps rising. The Advanced Video Coding (AVC) standard has been succeeded by the newer High Efficiency Video Coding (HEVC) standard because of HEVC's major advance in compression over its predecessor. However, the higher coding efficiency of HEVC comes at the cost of increased computational complexity. The Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) are therefore essential accelerators in an HEVC hardware implementation. The hardware design of these accelerators has in turn become more complicated because of the flexibility offered by the new standard. This project aimed to design a hardware transpose memory for the Two-Dimension Inverse Discrete Cosine Transform (2D IDCT) using a hardware description language. The first objective was to implement a transpose memory that supports different transform block dimensions (4x4, 8x8, 16x16 and 32x32 transform units). Both a register-based design and a RAM-based design were implemented. Second, a test bench was designed to validate the functionality of the RTL design. Third, the designed transpose memory was integrated with the 1D IDCT building block and the overall system functionality was validated. Finally, an analysis was carried out to identify the trade-offs in performance, resource and power between the register-based and dedicated RAM-based transpose memories. The results show that the register-based 2D IDCT has 2.24 times better throughput and 35.6% lower energy consumption than the RAM-based 2D IDCT. However, the register-based 2D IDCT uses 30 times more resources than the RAM-based 2D IDCT. Thus, the RAM-based 2D IDCT is more suitable for small electronic devices. If the area cost is negligible and performance is needed, the register-based 2D IDCT can be considered.

# ABSTRAK

Perkembangan teknologi dalam peranti pengguna telah menyebabkan kemunculan pelbagai aplikasi pengekodan video. Ini telah menyumbangkan permintaan resolusi video yang semakin tinggi terhadap keperluan dalam video rakaman dan video ulangan. Oleh itu, H.264/AVC standard mampatan video yang lama telah diganti dengan standard H.265/HEVC yang lebih ada kebolehan untuk memampat resolusi video yang tinggi dengan kualiti yang lebih tinggi. Sebagai salah satu langkah untuk melajukan pengekodan dan penyahkodan video, reka bentuk algoritma melalui perkakasan adalah munasabah. Walau bagaimanapun, perubahan yang diperkenalkan oleh standard HEVC iaitu penggunaan saiz blok yang berbeza-beza daripada saiz 4x4 kepada 32x32 dengan fleksibel telah meningkatkan kerumitan dalam reka bentuk perkakasan. Projek ini bertujuan untuk mereka dua dimensi transformasi songsang (2D IDCT) transpose memori dengan menggunakan bahasa perhuraian perkakasan (HDL). Objektif pertama dalam projek ini adalah mereka transpose memori yang boleh sokong saiz blok yang pelbagai dimensi. Reka bentuk kedua-dua transpose memori adalah berasaskan register dan RAM. Keduanya, ujian untuk reka bentuk telah direka untuk menguji fungsi-fungsi transpose memori supaya perkakasan berfungsi seperti yang diharapkan. Ketiga, integrasi antara transpose memori dan blok 1D IDCT telah dilaksanakan untuk menghasilkan 2D IDCT dan fungsi keseluruhan sistem telah diuji. Akhir sekali, analisis telah dilaksanakan untuk mengetahui prestasi, penggunaan kawasan dan penggunaan kuasa di antara 2D IDCT yang menggunakan register dan RAM. Keputusan menunjukkan bahawa register 2D IDCT mempunyai 2.24 kali lebih baik prestasi dan 35.6% kurang penggunaan tenaga berbanding dengan RAM 2D IDCT. Walau bagaimanapun, register 2D IDCT mempunyai penggunaan kawasan sebanyak 30 kali lebih berbanding dengan RAM 2D IDCT. Oleh itu, RAM 2D IDCT lebih sesuai untuk peranti elektronik yang kecil. Sekiranya perbelanjaan dalam penggunaan kawasan boleh diabaikan dan prestasi lebih dipentingkan, register 2D IDCT boleh dipertimbangkan.

# **TABLE OF CONTENTS**

| CHAPTER | TITLE |
|---------|-------|
|         | DECLARATION |
|         | DEDICATION |
|         | ACKNOWLEDGEMENT |
|         | ABSTRACT |
|         | ABSTRAK |
|         | TABLE OF CONTENTS |
|         | LIST OF TABLES |
|         | LIST OF FIGURES |
|         | LIST OF ABBREVIATIONS |
| 1       | INTRODUCTION |
|         | 1.1 Background of Study |
|         | 1.2 Problem Statement |
|         | 1.3 Objectives of Project |
|         | 1.4 Scope of Project |
|         | 1.5 Project Report Outline |
| 2       | LITERATURE REVIEW |
|         | 2.1 High Efficiency Video Coding |
|         | 2.2 HEVC Encoder and Decoder |
|         | 2.3 HEVC Transform |
|         | 2.3.1 2-D IDCT Architecture |
|         | 2.3.2 Transpose Memory |
|         | 2.4 Previous Work |
|         | 2.4.1 Single-Port SRAM-based Transpose Memory With Diagonal Data Mapping for Large Size 2-D DCT/IDCT [16] |
|         | 2.4.2 Data Mapping Scheme and Implementation For High-Throughput DCT/IDCT Transpose Memory [17] |
|         | 2.4.3 Hardware Design of the Inverse Discrete Cosine Transform For High Efficiency Video Coding [5] |
| 3       | RESEARCH METHODOLOGY |
|         | 3.1 Project Flow |
|         | 3.2 Design Tools |
|         | 3.3 Register-based transpose memory for 2-D IDCT |
|         | 3.3.1 Designing single-point transpose memory (Block Level) |
|         | 3.3.2 Designing transpose memory that support different TUs (Unit Level) |
|         | 3.3.3 Integrate with 1-D IDCT module to form 2-D IDCT (Integration Level) |
|         | 3.4 RAM-based transpose memory for 2-D IDCT |
|         | 3.4.1 Designing SRAM memory (Block Level) |
|         | 3.4.2 Designing Address Generator Module with SRAM (Unit Level) |
|         | 3.4.3 Integrate with 1-D IDCT module to form 2-D IDCT (Integration Level) |
|         | 3.5 Test Bench Design |
| 4       | RESULTS AND DISCUSSIONS |
|         | 4.1 FPGA Utilization |
|         | 4.2 Throughput |
|         | 4.3 Power and Energy Consumption |
| 5       | CONCLUSION AND FUTURE WORK |
|         | 5.1 Conclusion |
|         | 5.2 Future Work |
|         | REFERENCES |
|         | Appendix A |

# LIST OF TABLES

| TABLE NO. | TITLE |
|-----------|-------|
| 4.1       | Resource Utilization between register-based and RAM-based |
| 4.2       | Throughput comparison between register-based and RAM-based |
| 4.3       | Power and Energy consumption between register-based and RAM-based |

# LIST OF FIGURES

| FIGURE NO. | TITLE |
|------------|-------|
| 2.1 | General block diagram of a video coding system. |
| 2.2 | Block diagram of an HEVC encoder with built-in decoder (gray shaded). |
| 2.3 | (a) Partitioning of the picture into 16x16 macroblocks (b) Partitioning of the picture into 64x64 coding tree units. |
| 2.4 | Partitioning of coding tree block into coding blocks and transform blocks. |
| 2.5 | (a) The folded 2-D DCT architecture. (b) The full parallel 2-D DCT architectures. |
| 2.6 | Diagonal data mapping scheme for 8x8 2-D DCT with 4-sample/cycle write throughput. |
| 2.7 | The generalized architecture for single-port SRAM based transpose memory proposed in [17]. |
| 2.8 | Block diagram of basic inverse discrete cosine transform architecture. |
| 2.9 | Block diagram of basic Inverse Transform Blocks of Parallel8 and Parallel16 architectures. |
| 3.1 | Block diagram of single-point transpose memory. |
| 3.2 | Concept of transposition over 8x8 TU size, column output need h0 for first transposition result. |
| 3.3 | Read and write protocol that alternatively swapped between row and column direction, the first TU size is 4x4. |
| 3.4 | Transpose memory module that support 4x4 to 32x32 transform unit's size. |
| 3.5 | Block diagram of proposed register-based 2-D IDCT architecture. |
| 3.6 | Block diagram of RAM-based memory with 4 memory banks. |
| 3.7 | Mapping of matrix to 4 SRAM memory banks to perform transpose. |
| 3.8 | Block diagram of proposed RAM-based 2-D IDCT architecture. |

# LIST OF ABBREVIATIONS

| ABBREVIATION | MEANING |
|--------------|---------|
| AGM  | Address Generator Module              |
| DCT  | Discrete Cosine Transform             |
| RAM  | Random Access Memory                  |
| IDCT | Inverse Discrete Cosine Transform     |
| SRAM | Synchronous Random Access Memory      |
| HD   | High Definition                       |
| HEVC | High Efficiency Video Coding          |
| TU   | Transform Unit                        |
| MPEG | Moving Picture Experts Group          |
| ITU  | International Telecommunication Union |
| VCEG | Video Coding Experts Group            |
| RCDM | Row Column Decomposition Method       |
| HDL  | Hardware Description Language         |
| FPGA | Field-Programmable Gate Array         |
| SAIF | Switching Activity Interface Format   |

# **CHAPTER 1**

# INTRODUCTION

## 1.1 Background of Study

In recent years, electronic devices have become the norm, and the emergence of a variety of video coding applications has placed aggressive demands on video compression. With the increasing diversity of electronic devices that support digital video, the growing popularity of High Definition (HD) video, and the tendency of users to look for frame rates and resolutions beyond HD formats, the requirements on recording, coding, decoding and playback capability have risen sharply in pursuit of a better video experience [1].

To provide higher video coding and decoding efficiency, a new video standard, H.265/High Efficiency Video Coding (HEVC), was proposed by the MPEG and ITU bodies to replace the older H.264/AVC video standard [2]. The Transform Unit (TU) is the basic unit of the HEVC transform, and a larger TU size can improve coding and decoding efficiency. A major change introduced in HEVC is the use of four TU block sizes (4x4, 8x8, 16x16 and 32x32) with higher precision (16-bit), compared with the 4x4 and 8x8 sizes of AVC. As a result, implementing the new standard in hardware is more challenging: data exchange complexity increases significantly because more varied TU block sizes are required to perform the Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) [3].
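To give a rough sense of how the storage needed for transposition grows with TU size, the sketch below computes the number of bits required to hold one complete TU of intermediate samples. It assumes that every sample is stored at the 16-bit precision mentioned above and that the memory buffers exactly one TU at a time; both are simplifying assumptions for illustration, and the names `SAMPLE_BITS`, `TU_SIZES` and `transpose_buffer_bits` are hypothetical.

```python
# Assumption: each intermediate sample occupies 16 bits.
SAMPLE_BITS = 16
# The four TU sizes supported by HEVC (N means an N x N block).
TU_SIZES = (4, 8, 16, 32)


def transpose_buffer_bits(n, sample_bits=SAMPLE_BITS):
    """Bits needed to buffer one full n x n block of samples."""
    return n * n * sample_bits


# Storage per TU size, in bits.
BUFFER_BITS = {n: transpose_buffer_bits(n) for n in TU_SIZES}

for n, bits in BUFFER_BITS.items():
    print(f"{n}x{n} TU: {bits} bits ({bits // 8} bytes)")
```

Under these assumptions the 32x32 case needs 64 times the storage of the 4x4 case, which is one reason a memory that must serve all four sizes is harder to design than a fixed-size one.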

A hardware implementation of the HEVC video standard involves multiple block structures and parallelism features [1]. The 2-D DCT and IDCT have long been widely used in block-based video coding and decoding standards because the DCT concentrates the energy of video residual data into the low-frequency domain, and the IDCT reverses this [4]. The 2-D DCT is used for video compression, while the 2-D IDCT is used for video decompression. This project looks at the 2-D IDCT block and focuses on the implementation of the transpose memory. A 2-D IDCT consists of two 1-D IDCT units and a transpose memory; the detailed architecture of the 2-D IDCT is given in the next chapter.
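The structure of two 1-D IDCT units around a transpose memory can be sketched in software as follows. This is only an illustrative model: it uses a floating-point orthonormal IDCT rather than the integer transform matrices defined by HEVC, and the function names `idct_1d`, `transpose` and `idct_2d` are hypothetical. The final transpose simply restores the original orientation in software; a hardware design may instead deliver its output in column order.

```python
import math


def idct_1d(coeffs):
    """Orthonormal floating-point 1-D inverse DCT of one vector."""
    n = len(coeffs)
    out = []
    for x in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
        out.append(total)
    return out


def transpose(block):
    """Software stand-in for the transpose memory: rows become columns."""
    return [list(col) for col in zip(*block)]


def idct_2d(block):
    """2-D IDCT built from two 1-D passes with a transposition in between."""
    stage1 = [idct_1d(row) for row in block]      # first 1-D IDCT pass (rows)
    buffered = transpose(stage1)                  # transpose memory
    stage2 = [idct_1d(row) for row in buffered]   # second 1-D IDCT pass
    return transpose(stage2)                      # restore orientation


# Example: a 4x4 block containing only a DC coefficient.
dc_block = [[0.0] * 4 for _ in range(4)]
dc_block[0][0] = 4.0
flat = idct_2d(dc_block)
```

For the DC-only block above, every entry of `flat` is 1.0 up to floating-point rounding, since a DC coefficient spreads evenly over the whole block.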

## 1.2 Problem Statement

Software-only implementations limit the performance achievable for high-efficiency video coding and decoding in digital video. Dedicated hardware is therefore needed for the HEVC transform block to guarantee performance. As stated earlier, the larger TU sizes introduced by the new standard, together with the flexibility of supporting four TU sizes from 4x4 to 32x32, raise hardware design complexity to a new level. Internal building blocks, including the transpose memory, need to be redesigned to meet this requirement. Previously published studies [6]–[11] on implementing the HEVC transform have tended to focus on optimizing the 1-D IDCT block rather than the transpose memory; they mostly use a RAM-based memory design or avoid a transpose memory altogether. These studies would have been more informative had they also included a tradeoff analysis between transpose memory designs based on different memory types. Because benchmark data for such a tradeoff is insufficient, a more comprehensive study is needed. This project therefore aims to propose a transpose memory architecture that supports the new requirements of the HEVC transform using two different memory types, and to identify the tradeoffs between them at the top level (2-D IDCT) by analyzing performance, resource and energy consumption.

## 1.3 Objectives of Project

The objectives of this project are as follows:

- (1) To design transpose memory using register-based memory.
- (2) To design transpose memory using RAM-based memory.
- (3) To develop a test bench to validate the functionality of transpose memories.
- (4) To integrate transpose memories with 1-D IDCT unit to form 2-D IDCT.
- (5) To analyze and determine the tradeoff in performance, resource and energy consumption between register-based design and RAM-based design at integration level.

## 1.4 Scope of the Study

In this project, only the 2-D IDCT components are validated and analyzed; other HEVC components are not included. The transpose memory is designed using two types of memory: register-based memory and RAM-based memory. For the 2-D IDCT integration, the 1-D IDCT unit reuses the existing validated design in [5], so a new 1-D IDCT design is not needed. The control path of the existing control logic in [5] must be redesigned, because input is not always available to the second IDCT unit owing to the transposition delay caused by the different transform unit sizes. The transpose memory design must be compatible with the 1-D IDCT unit in [5], which uses a pipelined architecture and supports various transform unit sizes. The design and the test bench are written in a hardware description language (HDL).

## 1.5 Project Report Outline

This thesis consists of five chapters. Chapter 1 covers the background of the study, the problem statement, the objectives of the project and the project report outline. Chapter 2 reviews the literature on HEVC video and on 2-D IDCT designs by previous researchers. Chapter 3 describes the project methodology. Chapter 4 presents the results of the project and their analysis. Chapter 5 concludes the project and makes recommendations for future work.

# REFERENCES

1. Vivienne, S., Madhukar, B. and Gary, J.S. *High Efficiency Video Coding (HEVC) Algorithm and Architectures*. London: Springer. 39-88, 141-167, 303-341; 2014.
2. Ruhan, C.j., Claudio, S.J., Ricardo, J., Marcelo, P., Julio, M., Luciano, A. Hardware Design for the 32x32 IDCT of the HEVC Video Coding Standard. 26<sup>th</sup> Symposium, Curitiba, Brazil: Integrated Circuits and Systems Design (SBCCI). 2013. 1-6.
3. Meher, P.K., Park, S.Y., Mohanty, B.K., Lim, K.S. and Yeo, C. Efficient integer DCT architectures for HEVC. *IEEE Transactions on Circuits and Systems for Video Technology*, 2014. 24(1): 168-178.
4. Sha, S., Weiwei, S., Yibo, F., and Xiaoyang, Z. A Unified Forward/Inverse Transform Architecture for Multi-Standard Video Codec Design. *IEICE Trans. Fundamentals Electron. Computer Sci.*, 2013. 96-A(7): 1534-1542.
5. Ng Yan Duan. Hardware Design of the Inverse Discrete Cosine Transform For High Efficiency Video Coding. Master Thesis. Universiti Teknologi Malaysia; 2017.
6. Mehul, T., Chao-Tsung, H., Vivienne, S., and Anantha, C. Energy And Area-Efficient Hardware Implementation of HEVC Inverse Transform and Dequantization. *Image Processing (ICIP), 2014 IEEE International Conference*, Paris, France: IEEE. 2014. 2100-2104.
7. Sha, S., Weiwei, S., Yibo, F., and Xiaoyang, Z. A Unified 4/8/16/32-point Integer IDCT Architecture for Multiple Video Coding Standards. *Multimedia and Expo (ICME), IEEE International Conference*, Melbourne, VIC, Australia: IEEE. 2012. 788-793.
8. Heming, S., Dajiang, Z., Jiayi, Z., Shinji, K. and Satoshi, G. An Area-efficient 4/8/16/32-point Inverse DCT Architecture for UHDTV HEVC Decoder. *Visual Communications and Image Processing Conference*, Valletta, Malta: IEEE. 2014. 197-200.
9. Chih-Peng Fan. Fast 2-Dimensional 4x4 Forward Integer Transform Implementation for H.264/AVC. *IEEE Transactions on Circuits and Systems II: Express Briefs*, 2006. 53(3): 174-177.
10. Thomas, T., Stavros, D. A Novel Architecture for Fast 2D IDCT Decoders, with Reduced Number of Multiplications. *IEEE Transactions on Consumer Electronics*, 2011. 57(3): 1384-1389.
11. Ahmed, K., Maher, A., Ahmed, S., and Mohammed S.S. A Reconfigurable 2-D IDCT Architecture for HEVC Encoder/Decoder. *Microelectronic (ICM), 2015 27<sup>th</sup> International Conference*, Casablanca, Morocco: IEEE. 2015. 242-245.
12. Benjamin, B., Woo-Jin, H., Gary, J.S., Jens-Rainer, O., and Thomas, W. *High Efficiency Video Coding (HEVC) Text Specification Draft 9*, document JCTVC-K1003, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), Oct. 2012.
13. Gary, J.S., Jens-Rainer, O., Woo-Jin, H., and Thomas, W. Overview of the High Efficiency Video Coding (HEVC) Standard. *IEEE Transactions on Circuits and Systems for Video Technology*, 2012. 22(12): 1649-1668.
14. Jens-Rainer, O., Gary, J.S., Heiko, S., Thiow, K.T., and Thomas, W. Comparison of the Coding Efficiency of Video Coding Standards – Including High Efficiency Video Coding (HEVC). *IEEE Transactions on Circuits and Systems for Video Technology*, 2012. 22(12): 1669-1684.
15. Frank, B., Benjamin, B., Karsten, S., and David, F. HEVC Complexity and Implementation Analysis. *IEEE Transactions on Circuits and Systems for Video Technology*, 2012. 22(12): 1685-1696.
16. Qing, S., Yibo, F., Weiwei, S., Sha, S., and Xiaoyang, Z. Single-Port SRAM-based Transpose Memory with Diagonal Data Mapping for Larger Size 2-D DCT/IDCT. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 2014. 22(11): 2422-2462.
17. Zheng, X., YanHeng, L., Yibo, F. and Xiaoyang, Z. Data Mapping Scheme and Implementation for High Throughput DCT/IDCT Transpose Memory. *Solid State and Integrated Circuit Technology (ICSICT), 2014 12th IEEE International Conference*, Guilin, China: IEEE. 2014. 1-3.