170 research outputs found

    Exploring the design space of HEVC inverse transforms with dataflow programming

    This paper presents a design space exploration of the hardware-based inverse fixed-point integer transform for High Efficiency Video Coding (HEVC). The designs are specified at a high level in the CAL dataflow language and automatically synthesized to HDL for FPGA implementation. Several parallel design alternatives are proposed, each trading performance against resource usage. The HEVC transform consists of several independent components: discrete cosine transforms from 4x4 to 32x32 and a 4x4 discrete sine transform. This work explores strategies to compute the transforms efficiently by applying data parallelism to the different components. Results show that an intermediate degree of parallelism, in which the 4x4 and 8x8 transforms are merged together and the 16x16 and 32x32 transforms are merged together, gives the best trade-off between performance and resource usage. The results also give insight into how the HEVC transform can be designed efficiently in parallel for hardware implementation.
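    As a software reference for the smallest component mentioned in this abstract, the following is a minimal sketch of a 4x4 inverse integer DCT in Python. It is not the paper's CAL dataflow design. The matrix `M4`, the stage shifts (7 and 20 minus the bit depth) and the 16-bit clipping are assumptions based on the commonly documented HEVC core transform; the names `inverse_dct4x4` and `clip16` are hypothetical.

```python
# Assumed 4x4 HEVC core transform matrix (integer approximation of the DCT).
M4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def clip16(value):
    """Clamp an intermediate value to the signed 16-bit range."""
    return max(-32768, min(32767, value))


def inverse_dct4x4(coeffs, bit_depth=8):
    """Two-stage inverse transform: residual ~ M4^T * coeffs * M4."""
    n = 4
    # Stage 1 (columns): multiply by M4 transposed, round, shift right by 7.
    stage1 = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = sum(M4[k][i] * coeffs[k][j] for k in range(n))
            stage1[i][j] = clip16((acc + 64) >> 7)
    # Stage 2 (rows): multiply by M4, round, shift right by 20 - bit_depth.
    shift = 20 - bit_depth
    rounding = 1 << (shift - 1)
    residual = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = sum(stage1[i][k] * M4[k][j] for k in range(n))
            residual[i][j] = (acc + rounding) >> shift
    return residual


# A block holding only a DC coefficient reconstructs to a flat residual.
dc_only = [[64, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
flat = inverse_dct4x4(dc_only)
```

    The two stages are independent per column and per row, which is the kind of data parallelism a hardware design can exploit; how the paper partitions the work across transform sizes is not shown here.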

    Low-Complexity Reconfigurable DCT-V Architecture

    This brief presents a low-complexity, reconfigurable architecture for the Discrete Cosine Transform of type V (DCT-V) of length 32. The proposed architecture can be reconfigured to compute five DCT-Vs of length 4 with negligible area overhead. As the DCT-V is one of the odd-type transforms employed in the Adaptive Multiple Transform (AMT) scheme, the effect of the fixed-point implementation has been assessed in the Joint Exploration Model (JEM) developed by the JVET group for the forthcoming Versatile Video Coding (VVC) standard. Simulation results show that the proposed architecture is not only low-complexity and reconfigurable but also introduces imperceptible quality loss. Moreover, when implemented in 90 nm CMOS technology, it occupies only 90k equivalent gates and runs at 187 MHz.

    Performance engineering for HEVC transform and quantization kernel on GPUs

    Continuous growth of video traffic and video services, especially in the field of high-resolution and high-quality video content, places heavy demands on video coding and its implementations. The High Efficiency Video Coding (HEVC) standard doubles the compression efficiency of its predecessor, H.264/AVC, at the cost of high computational complexity. To address these computing issues, high-performance video processing takes advantage of heterogeneous multiprocessor platforms. In this paper, we present a highly performance-optimized HEVC transform and quantization kernel with all-zero-block (AZB) identification, designed for execution on a Graphics Processing Unit (GPU). The performance optimization strategy involved all three aspects of parallel design: exposing as much of the application's intrinsic parallelism as possible, exploiting high-throughput memory, and using instructions efficiently. It combines efficient mapping of transform blocks to thread blocks with efficient vectorized access patterns to shared memory for all transform sizes supported in the standard. Two different GPUs of the same architecture were used to evaluate the proposed implementation. The achieved processing times are 6.03 and 23.94 ms for DCI 4K and 8K Full Format, respectively. Speedup factors compared to CPU, cuBLAS and AVX2 implementations are up to 80, 19 and 4 times, respectively. The proposed implementation outperforms previous work by a factor of 1.22.
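    To illustrate what all-zero-block identification means in general terms, here is a small Python sketch. It assumes a plain uniform quantizer that truncates toward zero, which is not the HEVC quantizer and not the paper's GPU kernel; `quantize`, `is_all_zero_block` and the `step` parameter are hypothetical names chosen for illustration.

```python
def quantize(coeffs, step):
    """Hypothetical uniform quantizer: divide by step, truncate toward zero."""
    return [[int(c / step) for c in row] for row in coeffs]


def is_all_zero_block(levels):
    """A block is an all-zero block when every quantized level is zero."""
    return all(level == 0 for row in levels for level in row)


# Small transform coefficients vanish after quantization with step 8.
weak_block = [[5, -3], [2, 0]]
strong_block = [[20, 0], [0, 0]]

weak_is_azb = is_all_zero_block(quantize(weak_block, step=8))
strong_is_azb = is_all_zero_block(quantize(strong_block, step=8))
```

    Flagging such blocks lets an encoder skip further work on them; how the paper's kernel detects them on the GPU is not described in the abstract.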

    Performance analysis of Discrete Cosine Transform in Multibeamforming

    Aperture arrays are widely used in beamforming applications where element signals are steered to a particular direction of interest and a single beam is formed. Multibeamforming is an extension of single beamforming that is desirable in fields where sources located in multiple directions are of interest. The Discrete Fourier Transform (DFT) is usually used in these scenarios to segregate the received signals based on their directions of arrival. In the case of broadband signals, the DFT of the data at each sensor of an array decomposes the signal into multiple narrowband signals. However, if hardware cost and implementation complexity are of concern while maintaining the desired performance, the Discrete Cosine Transform (DCT) outperforms the DFT. In this work, the DCT is used instead of the DFT to decompose the received signal into multiple beams in multiple directions. The DCT offers a simple and efficient hardware implementation. Also, when low-frequency signals are of interest, the DCT can process correlated data and perform close to the ideal Karhunen-Loeve Transform (KLT). To further improve the accuracy and reduce the implementation cost, an efficient technique using Algebraic Integer Quantization (AIQ) of the DCT is presented. Both 8-point and 16-point versions of the DCT using AIQ mapping are presented, and their performance is analyzed in terms of accuracy and hardware complexity. It is shown that the proposed AIQ DCT offers considerable savings in hardware compared to the DFT and the classical DCT while maintaining the same beam-steering accuracy in multibeamforming applications.
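    The abstract does not give the AIQ mapping, so the sketch below only shows the plain orthonormal 8-point DCT-II that such designs typically start from, assuming the standard textbook definition. It illustrates one reason the DCT is cheaper than the DFT in hardware: every coefficient is real. The helper names `dct2_matrix` and `apply_transform` are hypothetical.

```python
import math


def dct2_matrix(n):
    """Build the orthonormal n-point DCT-II matrix (real-valued)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append(
            [scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
        )
    return rows


def apply_transform(matrix, samples):
    """Multiply the transform matrix by a vector of samples."""
    return [sum(row[i] * samples[i] for i in range(len(samples))) for row in matrix]


C8 = dct2_matrix(8)
# A constant input concentrates all of its energy in the first output bin.
spectrum = apply_transform(C8, [1.0] * 8)
```

    For fully correlated (constant) input, all energy lands in the first coefficient, which is the energy-compaction behaviour that makes the DCT a close approximation of the KLT for highly correlated data.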

    Exploring manycore architectures for next-generation HPC systems through the MANGO approach

    The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and the support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671668. Flich Cardo, J.; Agosta, G.; Ampletzer, P.; Atienza-Alonso, D.; Brandolese, C.; Cappe, E.; Cilardo, A.... (2018). Exploring manycore architectures for next-generation HPC systems through the MANGO approach. Microprocessors and Microsystems. 61:154-170. https://doi.org/10.1016/j.micpro.2018.05.011