    Performance engineering for HEVC transform and quantization kernel on GPUs

    Continuous growth of video traffic and video services, especially high-resolution and high-quality video content, places heavy demands on video coding and its implementations. The High Efficiency Video Coding (HEVC) standard doubles the compression efficiency of its predecessor, H.264/AVC, at the cost of high computational complexity. To address this computational load, high-performance video processing takes advantage of heterogeneous multiprocessor platforms. In this paper, we present a highly performance-optimized HEVC transform and quantization kernel with all-zero-block (AZB) identification, designed for execution on a Graphics Processing Unit (GPU). The performance optimization strategy involved all three aspects of parallel design: exposing as much of the application's intrinsic parallelism as possible, exploiting high-throughput memory, and using instructions efficiently. It combines efficient mapping of transform blocks to thread blocks with efficient vectorized access patterns to shared memory for all transform sizes supported in the standard. Two different GPUs of the same architecture were used to evaluate the proposed implementation. The achieved processing times are 6.03 and 23.94 ms for DCI 4K and 8K Full Format, respectively. Speedup factors compared to CPU, cuBLAS and AVX2 implementations are up to 80, 19 and 4 times, respectively. The proposed implementation outperforms previous work by a factor of 1.22.
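
To make the transform, quantization and AZB steps concrete, here is a minimal CPU-side sketch in Python for a single 4x4 block. It is not the paper's GPU kernel. The matrix `C4` is the HEVC 4x4 core transform matrix; the helper names, the `qstep` parameter and the plain uniform quantizer are assumptions made for illustration and do not reproduce the normative HEVC scaling and rounding.

```python
# HEVC 4x4 core transform matrix (integer approximation of the DCT basis).
C4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]

def transform_quantize(residual, qstep):
    """Return (levels, is_all_zero) for one 4x4 residual block.

    Assumption: a plain uniform quantizer with step `qstep`, not the
    normative HEVC scaling and rounding.
    """
    # 2-D separable transform: C4 * X * C4^T.
    coeffs = matmul(matmul(C4, residual), transpose(C4))
    # Each row of C4 has norm close to 2^7, so the 2-D gain is close to 2^14.
    scale = (1 << 14) * qstep
    levels = []
    for row in coeffs:
        out = []
        for y in row:
            mag = int(abs(y) / scale + 0.5)  # round the magnitude to nearest
            out.append(mag if y >= 0 else -mag)
        levels.append(out)
    # An all-zero block (AZB) has no nonzero quantized level.
    is_all_zero = all(v == 0 for row in levels for v in row)
    return levels, is_all_zero
```

In this sketch a flat residual block of value 16 with `qstep = 8` keeps a single DC level, while a flat block of value 1 with `qstep = 16` quantizes to all zeros and is flagged as an AZB, so later stages could skip it.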

    Exploring manycore architectures for next-generation HPC systems through the MANGO approach

    The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.
    This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671668.
    Flich Cardo, J.; Agosta, G.; Ampletzer, P.; Atienza-Alonso, D.; Brandolese, C.; Cappe, E.; Cilardo, A.... (2018). Exploring manycore architectures for next-generation HPC systems through the MANGO approach. Microprocessors and Microsystems. 61:154-170. https://doi.org/10.1016/j.micpro.2018.05.011

    Cardiovascular diseases and air pollution in Novi Sad, Serbia

    Objectives: A large body of evidence has documented that air pollutants have adverse effects on human health as well as on the environment. The aim of this study was to determine whether there was an association between outdoor concentrations of sulfur dioxide (SO2) and nitrogen dioxide (NO2) and the daily number of hospital admissions due to cardiovascular diseases (CVD) in Novi Sad, Serbia, among patients aged above 18. Material and Methods: The investigation was carried out over a 3-year period (from January 1, 2007 to December 31, 2009) in the area of Novi Sad. The number (N = 10 469) of daily CVD (ICD-10: I00-I99) hospital admissions was collected according to patients' addresses. Daily mean levels of NO2 and SO2, measured in the ambient air of Novi Sad via a network of fixed samplers, were used to represent outdoor air pollution. Associations between air pollutants and hospital admissions were first analyzed using linear regression in a single-pollutant model, and then through single- and multi-pollutant adjusted generalized linear Poisson models. Results: The single-pollutant model (without confounding factors) indicated a linear increase in the number of hospital admissions due to CVD in relation to the linear increase in concentrations of SO2 (p = 0.015; 95% confidence interval (95% CI): 0.144-1.329, R2 = 0.005) and NO2 (p = 0.007; 95% CI: 0.214-1.361, R2 = 0.007). However, the single- and multi-pollutant adjusted models revealed that only NO2 was associated with CVD (p = 0.016, relative risk (RR) = 1.049, 95% CI: 1.009-1.091 and p = 0.022, RR = 1.047, 95% CI: 1.007-1.089, respectively). Conclusions: This study shows a significant positive association between hospital admissions due to CVD and outdoor NO2 concentrations in the area of Novi Sad, Serbia.
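
The relative risks above come from a Poisson model, where the usual relationship is RR = exp(beta) with a 95% CI of exp(beta ± 1.96·SE). The sketch below illustrates that arithmetic in Python. The function names are hypothetical, and the coefficient and standard error are back-calculated from the reported interval under the assumption that the CI is symmetric on the log scale; the abstract does not report them, nor the concentration increment to which the RR refers.

```python
import math

def rr_with_ci(beta, se, z=1.96):
    """Relative risk and its confidence interval from a log-linear coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def coef_from_ci(lower, upper, z=1.96):
    """Recover (beta, se) from a CI, assuming symmetry on the log scale."""
    beta = (math.log(lower) + math.log(upper)) / 2
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se

# Reported single-pollutant adjusted result for NO2: 95% CI 1.009-1.091.
beta, se = coef_from_ci(1.009, 1.091)
rr, lower, upper = rr_with_ci(beta, se)
```

Under that symmetry assumption, the point estimate recovered from the interval agrees with the reported RR of 1.049 to three decimals.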

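    Arhitekture sustava za odlučivanje o načinu pravovremenoga videotranskodiranja na raznorodnim računalima visokih performanci (System architectures for just-in-time video transcoding mode decision on heterogeneous high-performance computers)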

    Today, Internet traffic is dominated by video content, and projections show that this trend will continue. Advances in technology have also allowed the introduction of High Definition (HD) and Ultra-High Definition (UHD) video. These facts make efficient storage and streaming of video content a necessity. Current methods and paradigms for storing and streaming multimedia are not sustainable. To store and stream video as efficiently as possible, it must be possible to encode the video with occasion-specific parameters. This can be achieved using just-in-time transcoding. Efficient video transcoding requires significant work on modeling, mapping and optimizing parts of the algorithms to different underlying architectural elements. Software optimizations are required but not sufficient, and the use of hardware accelerator kernels for critical parts of the algorithm is mandatory to enable efficient processing from the performance, power and QoE perspectives. Balancing these three characteristics in real time is a great challenge and is often considered a critical point, called mode decision. Mode decision algorithms can be efficiently co-designed with hardware-based accelerator kernels to provide greater performance while maintaining quality and compression efficiency.
    In this thesis, two original scientific contributions were achieved:
    1. Design of performance-optimized just-in-time video transcoding mode decision algorithms and hardware-based accelerator kernels for heterogeneous high-performance computers.
    2. Performance-efficient integration of system architectures composed of the implemented just-in-time video transcoding mode decision algorithms and hardware-based accelerator kernels on heterogeneous high-performance computers.
    Today, video content makes up 80% of total Internet traffic, and projections show that this share will continue to grow over the coming years. This exceptional volume of video content is the main driver behind the development of new video encoding (compression) standards that enable its efficient storage and transmission. Instead of transcoding video content into different formats when it is stored on the server, it is possible to store only the highest-quality version and transcode it later, just in time, on user request. This process is called just-in-time video transcoding. Just-in-time video transcoding is an extremely computationally demanding process that solves the problem of storing multiple copies of the same content on the server, and it also enables dynamic adaptation of the video properties to the user's device and environment, which saves energy and improves efficiency and user experience. An efficient video transcoding system requires modeling, mapping and optimizing algorithms for different execution architectures. Accelerator kernels must be used for the critical parts of the algorithm to achieve efficiency from three perspectives: performance, energy consumption and quality of service. Balancing these three characteristics in real time is considered a critical part of the system, also called the mode decision algorithm.

    Performance-efficient integration and programming approach of DCT accelerator for HEVC in MANGO platform

    Video encoding based on the HEVC standard is an extremely computationally expensive process, and achieving efficient encoding requires intelligent utilization of all available resources, from both the software and the hardware perspective. Profiling and analysis of the encoding process identified the discrete cosine transform (DCT) as one of the key kernels that consume most of the application's runtime. Therefore, a high-throughput, fully pipelined hardware accelerator was designed in FPGA and integrated into the MANGO platform. The MANGO platform is a heterogeneous HPC system that consists of different types of nodes, from general-purpose nodes (GN) to heterogeneous nodes (HN). While executing specific kernels on GNs is a straightforward process, executing kernels on accelerator-based HNs is a more complex procedure and requires specific integration to successfully exploit the heterogeneous architecture. This paper presents a performance-efficient integration of the DCT hardware accelerator in the MANGO platform, focusing on the performance of the encoder while maintaining the coding efficiency and video quality of the encoded bitstream. Several approaches were considered, tested and compared, from standalone integration, where a series of single tasks was offloaded to the DCT accelerator, to more complex solutions based on smart buffer utilization.

    Dynamic load balancing algorithm based on HEVC tiles for just-in-time video encoding for heterogeneous architectures

    This paper proposes a novel algorithm for dynamic tile partitioning to achieve optimal workload balance on parallel processing architectures in just-in-time HEVC encoding. Tile boundaries are dynamically shifted depending on the tile cost, a value that denotes the predicted computational complexity of a single tile in a frame. The overall cost of a tile is determined as a combination of the costs of the three most computationally expensive and resource-hungry operations in HEVC encoding: prediction, transformation, and entropy coding. The algorithm aims at exploiting different types of processing architectures, from homogeneous multicore CPU architectures to heterogeneous architectures, under the actual conditions in which streaming servers operate. The experimental results show that the proposed algorithm outperforms uniform tiling by up to 5.5% in processing time while maintaining the same video quality and bitrate. Compared to state-of-the-art algorithms, the proposed algorithm achieves up to 8.85% speedup, depending on the number of videos being encoded concurrently on a video streaming server.
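
The idea of shifting tile boundaries according to cost can be sketched as follows. This is only an illustration, not the paper's algorithm: the weighted-sum cost in `tile_cost`, its default weights, the per-column cost input and the nearest-to-equal-share rule in `balance_boundaries` are all assumptions, since the abstract does not say how the three costs are combined or how boundaries are moved.

```python
def tile_cost(prediction, transform, entropy, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined cost: a weighted sum of the three stage costs."""
    return (weights[0] * prediction
            + weights[1] * transform
            + weights[2] * entropy)

def balance_boundaries(col_costs, num_tiles):
    """Return the start column of each tile so that tile costs are balanced.

    Each boundary is placed where the cumulative cost is closest to an
    equal share of the total. Requires len(col_costs) >= num_tiles.
    """
    n = len(col_costs)
    prefix = [0.0]
    for cost in col_costs:
        prefix.append(prefix[-1] + cost)
    total = prefix[-1]
    starts = [0]
    for k in range(1, num_tiles):
        target = total * k / num_tiles
        first = starts[-1] + 1        # every tile keeps at least one column
        last = n - (num_tiles - k)    # leave one column per remaining tile
        best = min(range(first, last + 1),
                   key=lambda j: abs(prefix[j] - target))
        starts.append(best)
    return starts
```

For example, with per-column costs `[1, 1, 1, 1, 2, 6]` and two tiles, a uniform split at column 3 gives tile costs of 3 and 9, whereas this sketch moves the boundary to column 5, giving 6 and 6.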

    Highly parallel GPU accelerator for HEVC transform and quantization

    Analysis of today's Internet traffic shows that digital video content prevails. Its dominance will continue to grow in the upcoming years and reach 82% of all traffic by 2021. Expressed as Internet video minutes, this equals about one million video minutes per second. Video processing devices are therefore expected to provide and support improved compression capability. This will relieve the pressure on storage systems and communication networks while creating the preconditions for further development of video services. Transform and quantization is one of the most compute-intensive parts of modern hybrid video coding systems, where the coding algorithm itself is commonly standardized. High Efficiency Video Coding (HEVC) is a state-of-the-art video coding standard that achieves high compression efficiency at the cost of high computational complexity. In this paper, we present a highly parallel GPU accelerator for HEVC transform and quantization that targets the most common heterogeneous computing system, CPU+GPU. The accelerator is implemented using the CUDA programming model. All the relevant state-of-the-art techniques related to kernel vectorization, shared memory optimization and overlapping data transfers with computation were investigated, customized and carefully combined to obtain a performance-efficient solution across all applicable transform sizes. The proposed solution is compared against a reference implementation that uses the NVIDIA cuBLAS library to perform the same work. The speedup factors obtained for a DCI 4K frame are 2.46 times for the largest transform size and 130.17 times for the smallest transform size, which revealed a substantial performance gap of this library when targeting a GPU of the Kepler architecture. The achieved processing time for frame transform and quantization is up to 4.82 ms.
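
To put the reported processing time in perspective, the short Python sketch below converts a per-frame time into a throughput ceiling. The function name is hypothetical, and the result covers only the transform and quantization stage, not a full encoder.

```python
def frames_per_second(ms_per_frame):
    """Upper bound on frame rate for a stage that takes ms_per_frame."""
    return 1000.0 / ms_per_frame

# Worst-case time reported in the abstract for a DCI 4K frame.
stage_fps = frames_per_second(4.82)
```

At 4.82 ms per frame, this stage alone could sustain roughly 207 frames per second.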