188 research outputs found

pocl: A Performance-Portable OpenCL Implementation

Author: Berg Heikki
de La Lama Carlos Sánchez
Jääskeläinen Pekka
Raiskila Kalle
Schnetter Erik
Takala Jarmo
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with different characteristics, and, thus, required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications when compiled using pocl were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand. Comment: This article was published in 2015; it is now openly accessible via arxi

arXiv.org e-Print Archive

Trepo - Institutional Repository of Tampere University

Portable and efficient FFT and DCT algorithms with the Heterogeneous Butterfly Processing Library

Author: Amor Margarita
Fraguela Basilio B.
Vázquez Pardo Sergio
Publication venue: Elsevier
Publication date: 01/03/2019

Final accepted version of: https://doi.org/10.1016/j.jpdc.2018.11.011. This version of the article: Vázquez, S., Amor, M., Fraguela, B. B. (2019). 'Portable and efficient FFT and DCT algorithms with the heterogeneous butterfly processing library', has been accepted for publication in Journal of Parallel and Distributed Computing, 125, 135–146. The Version of Record is available online at https://doi.org/10.1016/j.jpdc.2018.11.011. [Abstract]: The existence of a wide variety of computing devices with very different properties makes essential the development of software that is not only portable among them, but which also adapts to the properties of each platform. In this paper, we present the Heterogeneous Butterfly Processing Library (HBPL), which provides optimized portable kernels for problems of small sizes that allow using orthogonal transform algorithms such as the FFT and DCT on different accelerators and regular CPUs. Our library is implemented on the OpenCL standard, which provides portability on a large number of platforms. Furthermore, high performance is achieved on a wide range of devices by exploiting run-time code generation and metaprogramming guided by a parametrization strategy. An exhaustive evaluation on different platforms shows that our proposal obtains competitive or better performance than related libraries. This research has received financial support from the Ministerio de Economía y Competitividad of Spain and European Regional Development Fund (ERDF) funds (80%) of the EU (TIN2016-75845-P), by the Consellería de Cultura, Educación e Ordenación Universitaria, Xunta de Galicia co-funded by European Regional Development Fund (ERDF) funds under the Consolidation Programme of Competitive Reference Groups (Ref. ED431C 2017/04) and the Consolidation Programme of Competitive Research Units (Ref. R2014/049 and Ref. R2016/037) as well as by the Consellería de Cultura, Educación e Ordenación Universitaria, Xunta de Galicia (Centro Singular de Investigación de Galicia accreditation 2016–2019) and the European Union (European Regional Development Fund, ERDF) under Grant Ref. ED431G/01. Xunta de Galicia; ED431C 2017/04. Xunta de Galicia; ED431G/01. Xunta de Galicia; R2014/049. Xunta de Galicia; R2016/03

Repositorio da Universidade da Coruña

Performance engineering for HEVC transform and quantization kernel on GPUs

Author: Alen Duspara
Igor Piljić
Leon Dragić
Mario Kovač
Mate Čobrnić
Publication venue: Informa UK Limited
Publication date: 01/01/2020

Continuous growth of video traffic and video services, especially in the field of high-resolution and high-quality video content, places heavy demands on video coding and its implementations. The High Efficiency Video Coding (HEVC) standard doubles the compression efficiency of its predecessor H.264/AVC at the cost of high computational complexity. To address those computing issues, high-performance video processing takes advantage of heterogeneous multiprocessor platforms. In this paper, we present a highly performance-optimized HEVC transform and quantization kernel with all-zero-block (AZB) identification designed for execution on a Graphics Processing Unit (GPU). The performance optimization strategy involved all three aspects of parallel design: exposing as much of the application’s intrinsic parallelism as possible, exploiting high-throughput memory, and using instructions efficiently. It combines efficient mapping of transform blocks to thread-blocks and efficient vectorized access patterns to shared memory for all transform sizes supported in the standard. Two different GPUs of the same architecture were used to evaluate the proposed implementation. Achieved processing times are 6.03 and 23.94 ms for DCI 4K and 8K Full Format, respectively. Speedup factors compared to CPU, cuBLAS and AVX2 implementations are up to 80, 19 and 4 times, respectively. The proposed implementation outperforms previous work by a factor of 1.22.

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

FPGA Virtualisation on Heterogeneous Computing Systems: Model, Tools, and Systems

Author: Pham Khoa
Publication date: 01/08/2020

The University of Manchester - Institutional Repository

Exploiting Intrinsic Hardware Guardbands and Software Heterogeneity to Improve the Energy Efficiency of Computing Systems

Author: Κουτσοβασίλης Παναγιώτης Σ.
Publication date: 01/01/2020

University of Thessaly Institutional Repository

Overview of Parallel Platforms for Common High Performance Computing

Author: Adamec Filip
Fryza Tomas
Marsalek Roman
Prokopec Jan
Svobodova Jitka
Publication venue: Společnost pro radioelektronické inženýrství
Publication date: 01/04/2012

The paper deals with various parallel platforms used for high performance computing in the signal processing domain. More precisely, methods that exploit multicore central processing units, such as the Message Passing Interface (MPI) and OpenMP, are taken into account. The properties of the programming methods are verified experimentally on a fast Fourier transform and a discrete cosine transform, and they are compared with the capabilities of MATLAB's built-in functions and of Texas Instruments digital signal processors with very long instruction word architectures. New FFT and DCT implementations were proposed and tested. The implementation phase was compared with CPU-based computing methods and with the capabilities of the Texas Instruments digital signal processing library on C6747 floating-point DSPs. The optimal combination of computing methods in the signal processing domain and the implementation of new, fast routines is proposed as well.

Directory of Open Access Journals

Digital library of Brno University of Technology

Exploring manycore architectures for next-generation HPC systems through the MANGO approach

The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671668. Flich Cardo, J.; Agosta, G.; Ampletzer, P.; Atienza-Alonso, D.; Brandolese, C.; Cappe, E.; Cilardo, A.... (2018). Exploring manycore architectures for next-generation HPC systems through the MANGO approach. Microprocessors and Microsystems. 61:154-170. https://doi.org/10.1016/j.micpro.2018.05.011

Infoscience - École polytechnique fédérale de Lausanne

Archivio istituzionale della ricerca - Politecnico di Milano

Crossref

Hes-so: ArODES Open Archive (University of Applied Sciences and Arts Western Switzerland / Haute école spécialisée de Suisse occidentale / FH Westschweiz)

RiuNet