
    High-level programming of stencil computations on multi-GPU systems using the SkelCL library

    The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually tuned coding using low-level approaches like OpenCL and CUDA. This makes the development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach, which combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard by three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) an automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons that specifically target stencil computations, MapOverlap and Stencil, describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers performance competitive with hand-tuned OpenCL code.
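    The abstract includes no code; the small self-contained C++ sketch below only illustrates the idea behind the MapOverlap pattern it describes: a user function is applied to every element together with a fixed-radius neighborhood, while boundary handling and iteration are left to the skeleton. This is not the SkelCL API; the function name mapOverlap and the clamp-to-border padding policy are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// Illustrative (non-SkelCL) MapOverlap: applies a user function to each
// element together with a fixed-radius neighborhood. Out-of-range accesses
// are clamped to the border, one of several possible padding policies.
std::vector<float> mapOverlap(const std::vector<float>& in, int radius,
                              const std::function<float(const float*)>& f) {
  std::vector<float> out(in.size());
  std::vector<float> window(2 * radius + 1);
  for (std::size_t i = 0; i < in.size(); ++i) {
    for (int d = -radius; d <= radius; ++d) {
      long j = std::clamp<long>(static_cast<long>(i) + d, 0L,
                                static_cast<long>(in.size()) - 1);
      window[d + radius] = in[static_cast<std::size_t>(j)];
    }
    out[i] = f(window.data());  // user code sees a contiguous neighborhood
  }
  return out;
}

int main() {
  std::vector<float> v{1, 2, 3, 4, 5};
  // A 3-point average: the kind of user function a stencil skeleton expects.
  auto blur = [](const float* w) { return (w[0] + w[1] + w[2]) / 3.0f; };
  for (float x : mapOverlap(v, 1, blur)) std::cout << x << ' ';
  std::cout << '\n';
}
```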

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
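    As one concrete example of the kind of transformation described (combining data reuse with pipelining), the sketch below rewrites a 3-point stencil with a sliding-window register buffer so that each input element is read from memory once and the loop can be pipelined. The kernel is not taken from the paper, and the pragma syntax follows the Xilinx Vitis/Vivado HLS convention; other vendors use different directives.

```cpp
// Illustrative HLS-style data-reuse transformation: a 3-point stencil with a
// sliding-window buffer so the loop pipelines at an initiation interval of 1.
// Boundary elements are omitted for brevity.
constexpr int N = 1024;

void stencil3(const float in[N], float out[N]) {
  float window[3] = {0.0f, 0.0f, 0.0f};
#pragma HLS ARRAY_PARTITION variable=window complete dim=1

  for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
    // Shift the window and load one new element per iteration (data reuse):
    // every input value is read from memory exactly once.
    window[0] = window[1];
    window[1] = window[2];
    window[2] = in[i];
    if (i >= 2) {
      out[i - 1] = 0.25f * window[0] + 0.5f * window[1] + 0.25f * window[2];
    }
  }
}
```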

    Improving Data Locality in Distributed Processing of Multi-Channel Remote Sensing Data with Potentially Large Stencils

    Distributing the processing of multi-channel remote sensing data with potentially large stencils is a difficult challenge. The goal of this master's thesis was to investigate the performance impact of such processing on a distributed system and to determine whether the total execution time can be improved by exploiting data locality and memory alignment. The thesis also gives a brief overview of the current state of the art in distributed remote sensing data processing and points out why distributed computing will become more important for it in the future. For the experimental part of the thesis, an application for processing huge arrays on a distributed system was implemented with DASH, a C++ Template Library for Distributed Data Structures with Support for Hierarchical Locality for High Performance Computing and Data-Driven Science. Based on the first results, an optimization model was developed that aims to reduce network traffic while initializing a distributed data structure and executing computations on it with potentially large stencils. Furthermore, a tool was implemented to estimate the memory layouts with the lowest network communication cost for a given multi-channel remote sensing processing workflow. The configurations produced by this optimization were then executed and evaluated. The results show that the initialization of a large image can be sped up by 25% by taking brick locality into account. The optimization model also generates valid decisions for the initialization of the PGAS memory layouts. However, for a real implementation, the optimization model has to be modified to reflect implementation-dependent sources of overhead. The thesis presents approaches to challenges of distributed computing that can be applied to real-world remote sensing imaging applications, and contributes towards addressing the challenges of the modern Big Data world for future scientific data exploitation.
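    The abstract does not reproduce any code; the sketch below only illustrates the basic DASH usage the thesis builds on: allocating a block-distributed global array and initializing the locally owned brick without generating network traffic. The specific calls (dash::init, dash::Array, lbegin/lend, barrier) follow DASH's public documentation and should be read as assumptions, not as the thesis implementation.

```cpp
#include <libdash.h>
#include <iostream>

int main(int argc, char* argv[]) {
  dash::init(&argc, &argv);

  // One global array, blocked across all units; each unit owns a contiguous
  // brick, which is the locality the thesis exploits during initialization.
  dash::Array<double> image(1024 * 1024);

  // Initialize only the local elements: no network traffic is generated here.
  for (auto it = image.lbegin(); it != image.lend(); ++it) {
    *it = 0.0;
  }
  image.barrier();

  if (dash::myid() == 0) {
    std::cout << "initialized " << image.size() << " elements on "
              << dash::size() << " units\n";
  }
  dash::finalize();
  return 0;
}
```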

    A Technique to Automatically Determine Ad-hoc Communication Patterns at Runtime

    Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should take into account the combination of data communication across distributed memory and the exploitation of shared-memory models. In this paper, we present a new communication calculation technique to be applied across different SPMD (Single Program Multiple Data) code blocks containing several uniform data access expressions. We have implemented this technique in Trasgo, a programming model and compilation framework that transforms parallel programs from a high-level parallel specification that deals with parallelism in a unified, abstract, and portable way. The proposed technique computes exact coarse-grained communications for distributed message-passing processes at runtime. Applying the technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. Our approach allows the automatic generation of pre-compiled multi-level parallel routines, libraries, or programs that can adapt their communication, synchronization, and optimization structures to the target system, even when computing nodes have different capabilities. Our experimental results show that, despite the runtime calculation, our approach can automatically produce efficient programs compared with MPI reference codes and with codes generated by auto-parallelizing compilers. This work was supported by MICINN (Spain) and the ERDF program of the European Union through the HomProg-HetSys project (TIN2014-58876-P), CAPAP-H6 (TIN2016-81840-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS), and by the computing facilities of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.
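    The Trasgo front end itself is not shown in the abstract; the plain-MPI sketch below merely illustrates the kind of coarse-grained, exactly sized communication that such a runtime technique produces: a halo exchange whose message sizes are derived from a tile size chosen at runtime rather than fixed at compile time. All names, sizes, and the one-dimensional decomposition are illustrative assumptions.

```cpp
// Plain-MPI illustration (not Trasgo itself) of runtime-computed,
// coarse-grained communication between neighboring SPMD processes.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Tile size decided at runtime (e.g. from the problem size and 'size').
  const int tile = 1 << 16;
  const int halo = 2;  // width implied by the stencil's access expressions
  std::vector<double> block(tile + 2 * halo, static_cast<double>(rank));

  int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
  int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

  // One coarse-grained exchange per neighbor instead of many fine messages.
  MPI_Sendrecv(&block[halo],        halo, MPI_DOUBLE, left,  0,
               &block[tile + halo], halo, MPI_DOUBLE, right, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  MPI_Sendrecv(&block[tile],        halo, MPI_DOUBLE, right, 1,
               &block[0],           halo, MPI_DOUBLE, left,  1,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  MPI_Finalize();
  return 0;
}
```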