11 research outputs found

    An efficient interpolation filter VLSI architecture for HEVC standard


    A Method for Reducing Redundant Interpolation Filter Computation for Fractional Motion Estimation in HEVC

    Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2016. 이혁재.

    High-Efficiency Video Coding (HEVC) [1] is the latest video coding standard established by the Joint Collaborative Team on Video Coding (JCT-VC). It aims to double the coding efficiency of its predecessor, the H.264 standard, at comparable video quality. Motion estimation (ME), which consists of integer motion estimation (IME) and fractional motion estimation (FME), is the bottleneck of HEVC computation. In the HM reference software, ME alone accounts for about 50% of the execution time, of which IME contributes about 20% and FME about 30% [2]. The enormous computational complexity of FME has two causes:

    • A large number of FME refinements: In HEVC, a frame is divided into coding tree units (CTUs), usually of 64x64 pixels. One 64x64 CTU consists of 85 coding units (CUs): one 64x64 CU at depth 0, four 32x32 CUs at depth 1, 16 16x16 CUs at depth 2, and 64 8x8 CUs at depth 3. Each CU can be partitioned into prediction units (PUs) according to a set of 8 allowable partition types. An HEVC encoder performs FME refinement for all possible PUs, usually with 4 reference frames, before deciding the best configuration for a CTU. As a result, the HEVC reference software, HM, typically has to process 2,372 FME refinements for one CTU, which consumes a lot of computational resources.
    • A complicated and redundant interpolation process: Conventionally, FME refinement, which consists of interpolation and sum of absolute transformed differences (SATD) calculation, is performed for every PU in 4 reference frames. As a result, for a 64x64 CTU, FME needs to interpolate 6,232,900 fractional pixels. In addition, fractional pixels in HEVC, which comprise half-pixel and quarter-pixel positions, are interpolated by 8-tap and 7-tap filters instead of the 6-tap and bilinear filters of previous standards. The interpolation process in FME therefore imposes an extreme computational burden on HEVC encoders.

    This work proposes two algorithms, one for each of the two causes above. The first algorithm, Advanced Decision of PU Partitions and CU Depths for FME, estimates the cost of IME and selects the PU partition types at the CU level and the CU depths at the CTU level for FME. Experimental results show that the algorithm reduces the complexity by 67.47% with a BD-BR degradation of 1.08%. The second algorithm, A Reduction of the Interpolation Redundancy for FME, reduces interpolation computation by up to 86.46% without any loss in encoding performance. Together, the two algorithms form a coherent solution for reducing the complexity of FME. Considering that interpolation is half of the complexity of an FME refinement, the complexity of FME could be reduced by more than 85% with a BD-BR increase of 1.66%.

    Contents:
    Chapter 1. Introduction
        1. Introduction to Video Coding
            1.1. Definition of Video Coding
            1.2. The Need of Video Coding
            1.3. Basics of Video Coding
            1.4. Video Coding Standard
        2. Introduction to HEVC
            2.1. HEVC Background and Development
            2.2. Block Partitioning Structure in HEVC
    Chapter 2. Fractional Motion Estimation in HEVC and Related Works on Complexity Reduction
        1. Motion Estimation
        2. Fractional Motion Estimation
            2.1. Interpolation
            2.2. Sum of Absolute Transformed Difference Calculation
            2.3. Fractional Motion Estimation Procedure
    Chapter 3. Complexity Reduction for FME
        1. Problem Statement and Previous Studies
            1.1. Problem Statement
            1.2. Previous Studies
        2. Proposed Algorithms
            2.1. Advanced Decision of PU Partitions and CU Depths for Fractional Motion Estimation in HEVC
            2.2. Range-based interpolation algorithm
    Chapter 4. Experiment Results
        1. Advanced Decision of PU Partitions and CU Depths for Fractional Motion Estimation in HEVC Algorithms
            1.1. Advanced Decision of PU Partitions
            1.2. Advanced Decision of CU Partitions
            1.3. Combination of Advanced PU Partition and CU Depth Decision
            1.4. Comparison with Other Similar Works
        2. Range-based Algorithm
            2.1. Software Implementation
            2.2. Hardware Implementation of the Algorithm
    Chapter 5. Conclusion
    Bibliography
    Abstract in Korean
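
    The CU and refinement counts quoted in the abstract above can be checked with a little arithmetic. The sketch below is only one possible reading: the abstract does not say which partition types are evaluated at each depth, so the split used here (seven types at depths 0 to 2, only the three symmetric types for 8x8 CUs) is an assumption that happens to reproduce the 2,372 figure. The function and table names are hypothetical.

```python
# Hypothetical tally of CUs and FME refinements for one 64x64 CTU.
# Assumption (not stated in the abstract): CUs at depths 0-2 evaluate
# 2Nx2N, 2NxN, Nx2N and the four asymmetric types, while 8x8 CUs at
# depth 3 evaluate only 2Nx2N, 2NxN and Nx2N.

# Number of PUs produced by each partition type.
PUS_PER_PARTITION = {
    "2Nx2N": 1, "2NxN": 2, "Nx2N": 2,
    "2NxnU": 2, "2NxnD": 2, "nLx2N": 2, "nRx2N": 2,
}
SYMMETRIC = ("2Nx2N", "2NxN", "Nx2N")


def cus_at_depth(depth):
    """A quadtree has 4**depth nodes at the given depth."""
    return 4 ** depth


def pus_per_cu(depth, max_depth=3):
    """PUs evaluated for one CU under the assumed partition sets."""
    types = SYMMETRIC if depth == max_depth else tuple(PUS_PER_PARTITION)
    return sum(PUS_PER_PARTITION[t] for t in types)


def fme_refinements(max_depth=3, reference_frames=4):
    """One FME refinement per PU per reference frame."""
    pus = sum(cus_at_depth(d) * pus_per_cu(d, max_depth)
              for d in range(max_depth + 1))
    return pus * reference_frames


total_cus = sum(cus_at_depth(d) for d in range(4))
print(total_cus)          # 1 + 4 + 16 + 64 CUs
print(fme_refinements())  # PUs over all depths, times 4 reference frames
```

    Under these assumptions the script prints 85 and 2372, matching the numbers in the abstract; a different choice of partition sets would give a different total.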

    High-Level Synthesis Based VLSI Architectures for Video Coding

    High Efficiency Video Coding (HEVC) is the state-of-the-art video coding standard. Emerging applications such as free-viewpoint video, 360-degree video, augmented reality, and 3D movies require standardized extensions of HEVC. These extensions include HEVC Scalable Video Coding (SHVC), HEVC Multiview Video Coding (MV-HEVC), MV-HEVC + Depth (3D-HEVC), and HEVC Screen Content Coding. 3D-HEVC is used for applications such as view synthesis generation and free-viewpoint video. The depth maps coded and transmitted in 3D-HEVC are used for virtual view synthesis by algorithms such as Depth Image Based Rendering (DIBR). As a first step, we profiled the 3D-HEVC standard and identified its computationally intensive parts for efficient hardware implementation. One of the computationally intensive parts of 3D-HEVC, HEVC, and H.264/AVC is the interpolation filtering used for Fractional Motion Estimation (FME). The hardware implementation of the interpolation filtering is carried out using High-Level Synthesis (HLS) tools. Xilinx Vivado Design Suite is used for the HLS implementation of the interpolation filters of HEVC and H.264/AVC. The complexity of digital systems has greatly increased. High-Level Synthesis is a methodology that offers great benefits: late architectural or functional changes can be made without time-consuming rewriting of RTL code, algorithms can be tested and evaluated early in the design cycle, and accurate models can be developed against which the final hardware can be verified
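
    To make the interpolation filtering mentioned above concrete, here is a minimal software sketch of the HEVC luma 8-tap half-sample and 7-tap quarter-sample filters applied to a single row of 8-bit samples. It is not the thesis's HLS design: the function name, the border clamping, and the single-stage rounding are simplifying assumptions (the standard keeps higher intermediate precision for two-dimensional positions).

```python
# HEVC luma interpolation filter coefficients; each set sums to 64.
HALF_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)   # 8-tap, half-sample position
QUARTER_TAPS = (-1, 4, -10, 58, 17, -5, 1)     # 7-tap, quarter-sample position


def interpolate(row, pos, taps):
    """Return the fractional sample just right of integer position `pos`.

    `row` is a list of 8-bit samples. The filter window starts three
    samples to the left of `pos`. Indices outside the row are clamped to
    the nearest border sample (an assumption made for this sketch).
    """
    start = pos - 3
    acc = 0
    for k, coeff in enumerate(taps):
        idx = min(max(start + k, 0), len(row) - 1)
        acc += coeff * row[idx]
    value = (acc + 32) >> 6          # divide by 64 with rounding
    return min(max(value, 0), 255)   # clip to the 8-bit range


ramp = [0, 1, 2, 3, 4, 5, 6, 7]
print(interpolate(ramp, 3, HALF_TAPS))     # sample near position 3.5
print(interpolate(ramp, 3, QUARTER_TAPS))  # sample near position 3.25
```

    A flat row is reproduced exactly because the taps sum to 64, which is a quick sanity check for any hardware or HLS port of the same arithmetic.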

    Low-power and application-specific SRAM design for energy-efficient motion estimation

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 181-189).

    Video content is expected to account for 70% of total mobile data traffic in 2015. High-efficiency video coding, in this context, is crucial for lowering the transmission and storage costs for portable electronics. However, modern video coding standards impose a large hardware complexity. Hence, the energy efficiency of these hardware blocks is becoming more critical than ever for mobile devices. SRAMs are critical components in almost all SoCs and affect the overall energy efficiency. This thesis focuses on algorithm and architecture development as well as low-power and application-specific SRAM design targeting motion estimation. First, a motion estimation design is considered for the next-generation video standard, HEVC. Trade-offs between hardware cost and coding efficiency are quantified, and an optimum design choice between hardware complexity and coding efficiency is proposed. A hardware-efficient search algorithm, a search range shared across CU engines, and pixel pre-fetching algorithms provide 4.3x area, 56x on-chip bandwidth, and 151x off-chip bandwidth reductions. Second, a highly parallel motion estimation design targeting ultra-low-voltage operation and supporting the AVC/H.264 and VC-1 standards is considered. Hardware reconfigurability along with frame-parallel and macro-block-parallel processing are implemented for this engine to maximize hardware sharing between multiple standards and to meet throughput constraints. Third, in the context of low-power SRAMs, a 6T and an 8T SRAM are designed in 28nm and 45nm CMOS technologies targeting low-voltage operation. The 6T design operates down to 0.6V and the 8T design down to 0.5V, providing ~2.8x and ~4.8x reductions in energy/access, respectively. Finally, an application-specific SRAM design targeted at motion estimation is developed. By exploiting the correlation of pixel data to reduce bit-line switching activity, this SRAM achieves up to 1.9x energy savings compared to a similar conventional 8T design. These savings demonstrate that application-specific SRAM design can introduce a new dimension and can be combined with voltage scaling to maximize energy efficiency.

    by Mahmut Ersin Sinangil. Ph.D.

    Implementation of a motion estimation algorithm for Intel FPGAs using OpenCL

    Motion Estimation is one of the main tasks behind any video encoder. It is a computationally costly task; therefore, it is usually delegated to specific or reconfigurable hardware, such as FPGAs. Over the years, multiple FPGA implementations have been developed, mainly using hardware description languages such as Verilog or VHDL. Since programming using hardware description languages is a complex task, it is desirable to use higher-level languages to develop FPGA applications. The aim of this work is to evaluate OpenCL, in terms of expressiveness, as a tool for developing this kind of FPGA application. To do so, we present and evaluate a parallel implementation of the Block Matching Motion Estimation process using OpenCL for Intel FPGAs, usable and tested on an Intel Stratix 10 FPGA. The implementation efficiently processes Full HD frames completely inside the FPGA. In this work, we show the resource utilization when synthesizing the code on an Intel Stratix 10 FPGA, as well as a performance comparison with multiple CPU implementations with varying levels of optimization and vectorization capabilities. We also compare the proposed OpenCL implementation, in terms of resource utilization and performance, with estimations obtained from an equivalent VHDL implementation.

    Funding: Junta de Castilla y León, Consejería de Educación: Project PROPHET-2 (VA226P20). Ministerio de Economía, Industria y Competitividad (PID2019-104834GB-I00) and European Regional Development Fund (ERDF) program: Project PCAS (TIN2017-88614-R). Ministerio de Ciencia e Innovación (PID2019-104184RB-I00 / AEI / 10.13039/501100011033). Xunta de Galicia and EU ERDF funds (Centro de Investigación de Galicia accreditation 2019-2022, ref. ED431G 2019/01; Consolidation Program of Competitive Reference Groups, ref. ED431C 2021/30). Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación and "European Union NextGenerationEU/PRTR" (MCIN/AEI/10.13039/501100011033), grant TED2021-130367B-I00. Open-access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), under Operational Programme 2014ES16RFOP009 FEDER 2014-2020 DE CASTILLA Y LEÓN, Action: 20007-CL - Apoyo Consorcio BUCL
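
    As a rough illustration of the Block Matching Motion Estimation process named in the abstract above, the following sketch performs an exhaustive (full) search with a sum-of-absolute-differences cost. It is plain sequential Python, not the paper's OpenCL kernel; the function names, the frame layout (lists of rows), and the synthetic test frames are assumptions made for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def full_search(cur, ref, bx, by, n, search_range):
    """Try every displacement within +/- search_range and return
    (best_cost, dx, dy); candidates leaving the frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= by + dy and by + dy + n <= height and
                      0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic 12x12 reference frame and a current frame whose content is
# the reference shifted by (dx, dy) = (2, 1), clamped at the borders.
SIZE = 12
ref_frame = [[x * x + 3 * y * y + x * y for x in range(SIZE)]
             for y in range(SIZE)]
cur_frame = [[ref_frame[min(y + 1, SIZE - 1)][min(x + 2, SIZE - 1)]
              for x in range(SIZE)] for y in range(SIZE)]

print(full_search(cur_frame, ref_frame, bx=4, by=4, n=4, search_range=2))
```

    Every candidate displacement is independent of the others, which is the property a parallel FPGA or GPU implementation would exploit.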

    Low energy video processing and compression hardware designs

    Digital video processing and compression algorithms are used in many commercial products such as mobile devices, unmanned aerial vehicles, and autonomous cars. The increasing resolution of the videos used in these products has increased the computational complexity of digital video processing and compression algorithms. It is therefore necessary to reduce the computational complexity of these algorithms, and the energy consumption of the corresponding hardware, without reducing visual quality. In this thesis, we propose a novel adaptive 2D digital image processing algorithm for 2D median filter, Gaussian blur, and image sharpening. We designed low-energy 2D median filter, Gaussian blur, and image sharpening hardware using the proposed algorithm. We propose approximate HEVC intra prediction and HEVC fractional interpolation algorithms, and we designed low-energy approximate HEVC intra prediction and HEVC fractional interpolation hardware. We also propose several HEVC fractional interpolation hardware architectures. We propose novel computational complexity and energy reduction techniques for HEVC DCT and inverse DCT/DST, and we designed high-performance, low-energy hardware for HEVC DCT and inverse DCT/DST that includes the proposed techniques. We quantified the computation reductions achieved and the video quality loss caused by the proposed algorithms and techniques. We implemented the proposed hardware architectures in Verilog HDL. We mapped the Verilog RTL code to Xilinx Virtex 6 and Xilinx ZYNQ FPGAs, and estimated their power consumption using the Xilinx XPower Analyzer tool. The proposed algorithms and techniques significantly reduced the power and energy consumption of these FPGA implementations, in some cases with no PSNR loss and in some cases with very small PSNR loss

    Applications in Electronics Pervading Industry, Environment and Society

    This book features the manuscripts accepted for the Special Issue “Applications in Electronics Pervading Industry, Environment and Society—Sensing Systems and Pervasive Intelligence” of the MDPI journal Sensors. Most of the papers come from a selection of the best papers of the 2019 edition of the “Applications in Electronics Pervading Industry, Environment and Society” (APPLEPIES) Conference, which was held in November 2019. All these papers have been significantly enhanced with novel experimental results. The papers give an overview of the trends in research and development activities concerning the pervasive application of electronics in industry, the environment, and society. The focus of these papers is on cyber-physical systems (CPS), with research proposals for new sensor acquisition and ADC (analog-to-digital converter) methods, high-speed communication systems, cybersecurity, big data management, and data processing, including emerging machine learning techniques. Physical implementation aspects are discussed, as well as the trade-offs found between functional performance and hardware/system costs

    Holistic Optimization of Embedded Computer Vision Systems

    Despite strong interest in embedded computer vision, the computational demands of Convolutional Neural Network (CNN) inference far exceed the resources available in embedded devices. Thankfully, the typical embedded device has a number of desirable properties that can be leveraged to significantly reduce the time and energy required for CNN inference. This thesis presents three independent and synergistic methods for optimizing embedded computer vision: 1) Reducing the time and energy needed to capture and preprocess input images by optimizing the image capture pipeline for the needs of CNNs rather than humans. 2) Exploiting temporal redundancy within incoming video streams to perform computationally cheap motion estimation and compensation in lieu of full CNN inference for the majority of frames. 3) Leveraging the sparsity of CNN activations within the frequency domain to significantly reduce the number of operations needed for inference. Collectively these techniques significantly reduce the time and energy needed for computer vision at the edge, enabling a wide variety of exciting new applications