43 research outputs found

    Algorithms and Hardware Co-Design of HEVC Intra Encoders

    Get PDF
    Digital video is becoming extremely important nowadays and its importance has greatly increased in the last two decades. Due to the rapid development of information and communication technologies, the demand for Ultra-High Definition (UHD) video applications is becoming stronger. However, the most prevalent video compression standard H.264/AVC released in 2003 is inefficient when it comes to UHD videos. The increasing desire for superior compression efficiency to H.264/AVC leads to the standardization of High Efficiency Video Coding (HEVC). Compared with the H.264/AVC standard, HEVC offers a double compression ratio at the same level of video quality or substantial improvement of video quality at the same video bitrate. Yet, HE-VC/H.265 possesses superior compression efficiency, its complexity is several times more than H.264/AVC, impeding its high throughput implementation. Currently, most of the researchers have focused merely on algorithm level adaptations of HEVC/H.265 standard to reduce computational intensity without considering the hardware feasibility. What’s more, the exploration of efficient hardware architecture design is not exhaustive. Only a few research works have been conducted to explore efficient hardware architectures of HEVC/H.265 standard. In this dissertation, we investigate efficient algorithm adaptations and hardware architecture design of HEVC intra encoders. We also explore the deep learning approach in mode prediction. From the algorithm point of view, we propose three efficient hardware-oriented algorithm adaptations, including mode reduction, fast coding unit (CU) cost estimation, and group-based CABAC (context-adaptive binary arithmetic coding) rate estimation. Mode reduction aims to reduce mode candidates of each prediction unit (PU) in the rate-distortion optimization (RDO) process, which is both computation-intensive and time-consuming. Fast CU cost estimation is applied to reduce the complexity in rate-distortion (RD) calculation of each CU. Group-based CABAC rate estimation is proposed to parallelize syntax elements processing to greatly improve rate estimation throughput. From the hardware design perspective, a fully parallel hardware architecture of HEVC intra encoder is developed to sustain UHD video compression at 4K@30fps. The fully parallel architecture introduces four prediction engines (PE) and each PE performs the full cycle of mode prediction, transform, quantization, inverse quantization, inverse transform, reconstruction, rate-distortion estimation independently. PU blocks with different PU sizes will be processed by the different prediction engines (PE) simultaneously. Also, an efficient hardware implementation of a group-based CABAC rate estimator is incorporated into the proposed HEVC intra encoder for accurate and high-throughput rate estimation. To take advantage of the deep learning approach, we also propose a fully connected layer based neural network (FCLNN) mode preselection scheme to reduce the number of RDO modes of luma prediction blocks. All angular prediction modes are classified into 7 prediction groups. Each group contains 3-5 prediction modes that exhibit a similar prediction angle. A rough angle detection algorithm is designed to determine the prediction direction of the current block, then a small scale FCLNN is exploited to refine the mode prediction

    Application-Specific Cache and Prefetching for HEVC CABAC Decoding

    Get PDF
    Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module in the HEVC/H.265 video coding standard. As in its predecessor, H.264/AVC, CABAC is a well-known throughput bottleneck due to its strong data dependencies. Besides other optimizations, the replacement of the context model memory by a smaller cache has been proposed for hardware decoders, resulting in an improved clock frequency. However, the effect of potential cache misses has not been properly evaluated. This work fills the gap by performing an extensive evaluation of different cache configurations. Furthermore, it demonstrates that application-specific context model prefetching can effectively reduce the miss rate and increase the overall performance. The best results are achieved with two cache lines consisting of four or eight context models. The 2 × 8 cache allows a performance improvement of 13.2 percent to 16.7 percent compared to a non-cached decoder due to a 17 percent higher clock frequency and highly effective prefetching. The proposed HEVC/H.265 CABAC decoder allows the decoding of high-quality Full HD videos in real-time using few hardware resources on a low-power FPGA.EC/H2020/645500/EU/Improving European VoD Creative Industry with High Efficiency Video Delivery/Film26

    A Deeply Pipelined CABAC Decoder for HEVC Supporting Level 6.2 High-tier Applications

    Get PDF
    High Efficiency Video Coding (HEVC) is the latest video coding standard that specifies video resolutions up to 8K Ultra-HD (UHD) at 120 fps to support the next decade of video applications. This results in high-throughput requirements for the context adaptive binary arithmetic coding (CABAC) entropy decoder, which was already a well-known bottleneck in H.264/AVC. To address the throughput challenges, several modifications were made to CABAC during the standardization of HEVC. This work leverages these improvements in the design of a high-throughput HEVC CABAC decoder. It also supports the high-level parallel processing tools introduced by HEVC, including tile and wavefront parallel processing. The proposed design uses a deeply pipelined architecture to achieve a high clock rate. Additional techniques such as the state prefetch logic, latched-based context memory, and separate finite state machines are applied to minimize stall cycles, while multibypass- bin decoding is used to further increase the throughput. The design is implemented in an IBM 45nm SOI process. After place-and-route, its operating frequency reaches 1.6 GHz. The corresponding throughputs achieve up to 1696 and 2314 Mbin/s under common and theoretical worst-case test conditions, respectively. The results show that the design is sufficient to decode in real-time high-tier video bitstreams at level 6.2 (8K UHD at 120 fps), or main-tier bitstreams at level 5.1 (4K UHD at 60 fps) for applications requiring sub-frame latency, such as video conferencing

    The Optimization of Context-based Binary Arithmetic Coding in AVS2.0

    Get PDF
    학위논문 (석사)-- 서울대학교 대학원 : 전기정보공학부, 2016. 2. 채수익.HEVC(High Efficiency Video Coding)는 지난 제너레이션 표준 H.264/AVC보다 코딩 효율성을 향상시키기를 위해서 국제 표준 조직과(International Standard Organization) 국제 전기 통신 연합(International Telecommunication Union)에 의해 공동으로 개발된 것이다. 중국 작업 그룹인 AVS(Audio and Video coding standard)가 이미 비슷한 노력을 바쳤다. 그들이 많이 창의적인 코딩 도구를 운용한 첫 제너레이션 AVS1의 압축 퍼포먼스를 높이도록 최신의 코딩 표준(AVS2 or AVS2.0)을 개발했다. AVS2.0 중에 엔트로피 코딩 도구로 사용된 상황 기반 2진법 계산 코딩(CBAC)은 전체적 코딩 표준 중에서 중요한 역하를 했다. HEVC에서 채용된 상황 기반 조정의 2진법 계산 코딩(CABAC)과 비슷하게 이 두 코딩은 다 승수 자유 방법을 채용해서 계산 코딩을 현실하게 된다. 그런데 각 코딩마다 각자의 특정한 알고리즘을 통해 곱셈 문제를 처리한 것이다. 본지는 AVS2.0중의 CBAC에 대한 더 깊이 이해와 더 좋은 퍼포먼스 개선의 목적으로 3가지 측면의 일을 한다. 첫째, 우리가 한 비교 제도를 다자인을 해서 AVS2.0플랫폼 중의 CBAC와 CABAC를 비교했다. 다른 실행 세부 사항을 고려하여 HEVC중의 CABAC 알고리즘을 AVS2.0에 이식한다.예를 들면, 상황 기반 초기치가 다르다. 실험 결과는 CBAC가 더 좋은 코딩 퍼포먼스를 달성한다고 알려진다. 그 다음에 CBAC 알고리즘을 최적화시키기를 위해서 몇 가지 아이디어를 제안하게 됐다. 코딩 퍼포먼스 향상시키기의 목적으로 근사 오차 보상(approximation error compensation)과 확률 추정 최적화(probability estimation)를 도입했다. 두 코딩은 다른 앵커보다 다 부호화효율 향상 결과를 얻게 됐다. 다른 한편으로는 코딩 시간을 줄이기를 위하여 레테 추정 모델(rate estimation model)도 제안하게 된다. 부호율-변형 최적화 과정(Rate-Distortion Optimization process)의 부호율-변형 대가 계산(Rate-distortion cost calculation)을 지지하도록 리얼 CBAC 알고리즘(real CBAC algorithm) 레테 추정(rate estimation)을 사용했다. 마지막으로 2진법 계산 디코더(decoder) 실행 세부 사항을 서술했다. AVS2.0 중의 상황 기반 2진법 계산 디코딩(CBAD)이 너무 많이 데이터 종속성과 계산 부담을 도입하기 때문에 2개 혹은 2개 이상의 bin 평행 디코딩인 처리량(CBAD)을 디자인을 하기가 어렵다. 2진법 계산 디코딩의 one-bin 제도도 여기서 디자인을 하게 됐다. 현재까지 AVS의 CBAD 기존 디자인이 없다. 우리가 우리의 다자인을 관련된 HEVC의 연구와 비교하여 설득력이 강한 결과를 얻었다.High Efficiency Video Coding (HEVC) was jointly developed by the International Standard Organization (ISO) and International Telecommunication Union (ITU) to improve the coding efficiency further compared with last generation standard H.264/AVC. The similar efforts have been devoted by the Audio and Video coding Standard (AVS) Workgroup of China. They developed the newest video coding standard (AVS2 or AVS2.0) in order to enhance the compression performance of the first generation AVS1 with many novel coding tools. The Context-based Binary Arithmetic Coding (CBAC) as the entropy coding tool used in the AVS2.0 plays a vital role in the overall coding standard. Similar with Context-based Adaptive Binary Arithmetic Coding (CABAC) adopted by HEVC, both of them employ the multiplier-free method to realize the arithmetic coding procedure. However, each of them develops the respective specific algorithm to deal with multiplication problem. In this work, there are three aspects work we have done in order to understand CBAC in AVS2.0 better and try to explore more performance improvement. Firstly, we design a comparison scheme to compare the CBAC and CABAC in the AVS2.0 platform. The CABAC algorithm in HEVC was transplanted into AVS2.0 with consideration about the different implementation detail. For example, the context initialization. The experiment result shows that the CBAC achieves better coding performance. Then several ideas to optimize the CBAC algorithm in AVS2.0 were proposed. For coding performance improvement, the proposed approximation error compensation and probability estimation optimization were introduced. Both of these two coding tools obtain coding efficiency improvement compared with the anchor. In the other aspect, the rate estimation model was proposed to reduce the coding time. Using rate estimation instead of the real CBAC algorithm to support the Rate-distortion cost calculation in Rate-Distortion Optimization (RDO) process, can significantly save the coding time due to the computation complexity of CBAC in nature. Lastly, the binary arithmetic decoder implementation detail was described. Since Context-based Binary Arithmetic Decoding (CBAD) in AVS2.0 introduces too much strong data dependence and computation burden, it is difficult to design a high throughput CBAD with 2 bins or more decoded in parallel. Currently, one-bin scheme of binary arithmetic decoder was designed in this work. Even through there is no previous design for CBAD of AVS up to now, we compare our design with other relative works for HEVC, and our design achieves a compelling experiment result.Chapter 1 Introduction 1 1.1 Research Background 1 1.2 Key Techniques in AVS2.0 3 1.3 Research Contents 9 1.3.1 Performance Comparison of CBAC 9 1.3.2 CBAC Performance Improvement 10 1.3.3 Implementation of Binary Arithmetic Decoder in CBAC 12 1.4 Organization 12 Chapter 2 Entropy Coder CBAC in AVS2.0 14 2.1 Introduction of Entropy Coding 14 2.2 CBAC Overview 16 2.2.1 Binarization and Generation of Bin String 17 2.2.2 Context Modeling and Probability Estimation 19 2.2.3 Binary Arithmetic Coding Engine 22 2.3 Two-level Scan Coding CBAC in AVS2.0 26 2.3.1 Scan order 28 2.3.2 First level coding 30 2.3.3 Second level coding 31 2.4 Summary 32 Chapter 3 Performance Comparison in CBAC 34 3.1 Differences between CBAC and CABAC 34 3.2 Comparison of Two BAC Engines 36 3.2.1 Statistics and initialization of Context Models 37 3.2.2 Adaptive Initialization Probability 40 3.3 Experiment Result 41 3.4 Conclusion 42 Chapter 4 CBAC Performance Improvement 43 4.1 Approximation Error Compensation 43 4.1.1 Error Compensation Table 43 4.1.2 Experiment Result 48 4.2 Probability Estimation Model Optimization 48 4.2.1 Probability Estimation 48 4.2.2 Probability Estimation Model in CBAC 52 4.2.3 The Optimization of Probability Estimation Model in CBAC 53 4.2.4 Experiment Result 56 4.3 Rate Estimation 58 4.3.1 Rate Estimation Model 58 4.3.2 Experiment Result 61 4.4 Conclusion 63 Chapter 5 Implementation of Binary Arithmetic Decoder in CBAC 64 5.1 Architecture of BAD 65 5.1.1 Top Architecture of BAD 66 5.1.2 Range Update Module 67 5.1.3 Offset Update Module 69 5.1.4 Bits Read Module 73 5.1.5 Context Modeling 74 5.2 Complexity of BAD 76 5.3 Conclusion 77 Chapter 6 Conclusion and Further Work 79 6.1 Conclusion 79 6.2 Future Works 80 Reference 82 Appendix 87 A.1. Co-simulation Environment 87 A.1.1 Range Update Module (dRangeUpdate.v) 87 A.1.2 Offset Update Module(dOffsetUpdate.v) 102 A.1.3 Bits Read Module (dReadBits.v) 107 A.1.4 Binary Arithmetic Decoding Top Module (BADTop.v) 115 A.1.5 Test Bench 117Maste

    Decoder Hardware Architecture for HEVC

    Get PDF
    This chapter provides an overview of the design challenges faced in the implementation of hardware HEVC decoders. These challenges can be attributed to the larger and diverse coding block sizes and transform sizes, the larger interpolation filter for motion compensation, the increased number of steps in intra prediction and the introduction of a new in-loop filter. Several solutions to address these implementation challenges are discussed. As a reference, results for an HEVC decoder test chip are also presented.Texas Instruments Incorporate

    Design and Implementation of Parallel Bypass Bin Processing for CABAC Encoder

    Get PDF
    The ever-increasing demand for high-quality digital video requires efficient compression techniques and fast video codecs. It necessitates increased complexity of the video codec algorithms. So, there is a need for hardware accelerators to implement such complex algorithms. The latest video compression algorithms such as High-Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) have been adopted Context-based Adaptive Binary Arithmetic Coding (CABAC) as the entropy coding method. The CABAC has two main data processing paths: regular and bypass bin path, which can achieve good compression when used with Syntax Elements (SEs) statistics. However, it is highly intrinsic data dependence and has sequential coding characteristics. Thus, it is challenging to parallelize. In this work, a 6-core bypass bin path having high-throughput and low hardware area has been proposed. It is a parallel architecture capable of processing up to 6 bypass bins per clock cycle to improve throughput. Further, the resource-sharing techniques within the binarization and a common controller block have reduced the hardware area. The proposed architecture has been simulated, synthesized, and prototyped on 28 nm Artix 7 Field Programmable Gate Array (FPGA). The implementation of Application Specific Integrated Circuit (ASIC) has been done using 65 nm CMOS technology. The proposed design achieved a throughput of 1.26 Gbin/s at 210 MHz operating frequency with a low hardware area compared to existing architectures. This architecture also supports multi-standard (HEVC/VVC) encoders for Ultra High Definition (UHD) applications

    Feasibility Study of High-Level Synthesis : Implementation of a Real-Time HEVC Intra Encoder on FPGA

    Get PDF
    High-Level Synthesis (HLS) on automatisoitu suunnitteluprosessi, joka pyrkii parantamaan tuottavuutta perinteisiin suunnittelumenetelmiin verrattuna, nostamalla suunnittelun abstraktiota rekisterisiirtotasolta (RTL) käyttäytymistasolle. Erilaisia kaupallisia HLS-työkaluja on ollut markkinoilla aina 1990-luvulta lähtien, mutta vasta äskettäin ne ovat alkaneet saada hyväksyntää teollisuudessa sekä akateemisessa maailmassa. Hidas käyttöönottoaste on johtunut pääasiassa huonommasta tulosten laadusta (QoR) kuin mitä on ollut mahdollista tavanomaisilla laitteistokuvauskielillä (HDL). Uusimmat HLS-työkalusukupolvet ovat kuitenkin kaventaneet QoR-aukkoa huomattavasti. Tämä väitöskirja tutkii HLS:n soveltuvuutta videokoodekkien kehittämiseen. Se esittelee useita HLS-toteutuksia High Efficiency Video Coding (HEVC) -koodaukselle, joka on keskeinen mahdollistava tekniikka lukuisille nykyaikaisille mediasovelluksille. HEVC kaksinkertaistaa koodaustehokkuuden edeltäjäänsä Advanced Video Coding (AVC) -standardiin verrattuna, saavuttaen silti saman subjektiivisen visuaalisen laadun. Tämä tyypillisesti saavutetaan huomattavalla laskennallisella lisäkustannuksella. Siksi reaaliaikainen HEVC vaatii automatisoituja suunnittelumenetelmiä, joita voidaan käyttää rautatoteutus- (HW ) ja varmennustyön minimoimiseen. Tässä väitöskirjassa ehdotetaan HLS:n käyttöä koko enkooderin suunnitteluprosessissa. Dataintensiivisistä koodaustyökaluista, kuten intra-ennustus ja diskreetit muunnokset, myös enemmän kontrollia vaativiin kokonaisuuksiin, kuten entropiakoodaukseen. Avoimen lähdekoodin Kvazaar HEVC -enkooderin C-lähdekoodia hyödynnetään tässä työssä referenssinä HLS-suunnittelulle sekä toteutuksen varmentamisessa. Suorituskykytulokset saadaan ja raportoidaan ohjelmoitavalla porttimatriisilla (FPGA). Tämän väitöskirjan tärkein tuotos on HEVC intra enkooderin prototyyppi. Prototyyppi koostuu Nokia AirFrame Cloud Server palvelimesta, varustettuna kahdella 2.4 GHz:n 14-ytiminen Intel Xeon prosessorilla, sekä kahdesta Intel Arria 10 GX FPGA kiihdytinkortista, jotka voidaan kytkeä serveriin käyttäen joko peripheral component interconnect express (PCIe) liitäntää tai 40 gigabitin Ethernettiä. Prototyyppijärjestelmä saavuttaa reaaliaikaisen 4K enkoodausnopeuden, jopa 120 kuvaa sekunnissa. Lisäksi järjestelmän suorituskykyä on helppo skaalata paremmaksi lisäämällä järjestelmään käytännössä minkä tahansa määrän verkkoon kytkettäviä FPGA-kortteja. Monimutkaisen HEVC:n tehokas mallinnus ja sen monipuolisten ominaisuuksien mukauttaminen reaaliaikaiselle HW HEVC enkooderille ei ole triviaali tehtävä, koska HW-toteutukset ovat perinteisesti erittäin aikaa vieviä. Tämä väitöskirja osoittaa, että HLS:n avulla pystytään nopeuttamaan kehitysaikaa, tarjoamaan ennen näkemätöntä suunnittelun skaalautuvuutta, ja silti osoittamaan kilpailukykyisiä QoR-arvoja ja absoluuttista suorituskykyä verrattuna olemassa oleviin toteutuksiin.High-Level Synthesis (HLS) is an automated design process that seeks to improve productivity over traditional design methods by increasing design abstraction from register transfer level (RTL) to behavioural level. Various commercial HLS tools have been available on the market since the 1990s, but only recently they have started to gain adoption across industry and academia. The slow adoption rate has mainly stemmed from lower quality of results (QoR) than obtained with conventional hardware description languages (HDLs). However, the latest HLS tool generations have substantially narrowed the QoR gap. This thesis studies the feasibility of HLS in video codec development. It introduces several HLS implementations for High Efficiency Video Coding (HEVC) , that is the key enabling technology for numerous modern media applications. HEVC doubles the coding efficiency over its predecessor Advanced Video Coding (AVC) standard for the same subjective visual quality, but typically at the cost of considerably higher computational complexity. Therefore, real-time HEVC calls for automated design methodologies that can be used to minimize the HW implementation and verification effort. This thesis proposes to use HLS throughout the whole encoder design process. From data-intensive coding tools, like intra prediction and discrete transforms, to more control-oriented tools, such as entropy coding. The C source code of the open-source Kvazaar HEVC encoder serves as a design entry point for the HLS flow, and it is also utilized in design verification. The performance results are gathered with and reported for field programmable gate array (FPGA) . The main contribution of this thesis is an HEVC intra encoder prototype that is built on a Nokia AirFrame Cloud Server equipped with 2.4 GHz dual 14-core Intel Xeon processors and two Intel Arria 10 GX FPGA Development Kits, that can be connected to the server via peripheral component interconnect express (PCIe) generation 3 or 40 Gigabit Ethernet. The proof-of-concept system achieves real-time. 4K coding speed up to 120 fps, which can be further scaled up by adding practically any number of network-connected FPGA cards. Overcoming the complexity of HEVC and customizing its rich features for a real-time HEVC encoder implementation on hardware is not a trivial task, as hardware development has traditionally turned out to be very time-consuming. This thesis shows that HLS is able to boost the development time, provide previously unseen design scalability, and still result in competitive performance and QoR over state-of-the-art hardware implementations

    High-Level Synthesis Based VLSI Architectures for Video Coding

    Get PDF
    High Efficiency Video Coding (HEVC) is state-of-the-art video coding standard. Emerging applications like free-viewpoint video, 360degree video, augmented reality, 3D movies etc. require standardized extensions of HEVC. The standardized extensions of HEVC include HEVC Scalable Video Coding (SHVC), HEVC Multiview Video Coding (MV-HEVC), MV-HEVC+ Depth (3D-HEVC) and HEVC Screen Content Coding. 3D-HEVC is used for applications like view synthesis generation, free-viewpoint video. Coding and transmission of depth maps in 3D-HEVC is used for the virtual view synthesis by the algorithms like Depth Image Based Rendering (DIBR). As first step, we performed the profiling of the 3D-HEVC standard. Computational intensive parts of the standard are identified for the efficient hardware implementation. One of the computational intensive part of the 3D-HEVC, HEVC and H.264/AVC is the Interpolation Filtering used for Fractional Motion Estimation (FME). The hardware implementation of the interpolation filtering is carried out using High-Level Synthesis (HLS) tools. Xilinx Vivado Design Suite is used for the HLS implementation of the interpolation filters of HEVC and H.264/AVC. The complexity of the digital systems is greatly increased. High-Level Synthesis is the methodology which offers great benefits such as late architectural or functional changes without time consuming in rewriting of RTL-code, algorithms can be tested and evaluated early in the design cycle and development of accurate models against which the final hardware can be verified
    corecore