8 research outputs found

Using accelerators to speed up scientific and engineering codes: perspectives and problems

Author: Calore E.
Schifano S.F.
Tripiccione R.
Publication venue: CIMNE
Publication date: 01/01/2015

Accelerators are quickly emerging as the leading technology to further boost computing performance; their main feature is a massively parallel on-chip architecture. NVIDIA and AMD GPUs and the Intel Xeon-Phi are examples of accelerators available today. Accelerators are power-efficient and deliver up to one order of magnitude more peak performance than traditional CPUs. However, existing codes for traditional CPUs require substantial changes to run efficiently on accelerators, including rewriting with specific programming languages. In this contribution we present our experience in porting large codes to NVIDIA GPU and Intel Xeon-Phi accelerators. Our reference application is a CFD code based on the Lattice Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for processor architectures with a large degree of parallelism. However, the challenge of exploiting a large fraction of the theoretically available performance is not easy to meet. We consider a state-of-the-art two-dimensional LB model based on 37 populations (a D2Q37 model), that accurately reproduces the thermo-hydrodynamics of a 2D-fluid obeying the equation-of-state of a perfect gas. We describe in detail how we implement and optimize our LB code for Xeon-Phi and GPUs, and then analyze performance on single- and multi-accelerator systems. We finally compare results with those available on recent traditional multi-core CPUs.

UPCommons. Portal del coneixement obert de la UPC

Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications

Author: Biferale
Calore
Crimi
Dick
Etinski
Ge
Khabi
Lim
Mantovani
Mazouz
Peraza
Sbragaglia
Scagliarini
Succi
Sundriyal
Williams
Wittmann
Publication venue: Wiley
Publication date: 01/01/2017

Energy efficiency is becoming increasingly important for computing systems, in particular for large-scale HPC facilities. In this work we evaluate, from a user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors, in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable by running the full code on an HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need for any additional hardware, and we enable it to change CPU and GPU clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy-saving strategies which can be easily adopted on recent high-end HPC systems for generic applications.

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Ferrara

Optimization of lattice Boltzmann simulations on heterogeneous computers

Author: Calore Enrico
Gabbana Alessandro
Schifano Sebastiano Fabio
Tripiccione Raffaele
Publication venue: SAGE Publications
Publication date: 01/01/2019

High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs.

Archivio istituzionale della ricerca - Università di Ferrara

Performance and portability of accelerated lattice Boltzmann applications with OpenACC

Author: Calore Enrico
Gabbana Alessandro
Kraus Jiri
Schifano Sebastiano Fabio
Tripiccione Raffaele
Publication venue: Wiley
Publication date: 01/01/2016

An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed using specific programming languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives to mark regions of existing C, C++, or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test-bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Ferrara

Benchmarking GPUs with a parallel Lattice-Boltzmann code

Author: J. Kraus
M. Pivanti
M. Zanella
R. Tripiccione
S. F. Schifano
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2013

Accelerators are an increasingly common option to boost performance of codes that require extensive number crunching. In this paper we report on our experience with NVIDIA accelerators to study fluid systems using the Lattice Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for processor architectures with a large degree of parallelism, such as recent multi- and many-core processors and GPUs; however, the challenge of exploiting a large fraction of the theoretically available performance of this new class of processors is not easily met. We consider a state-of-the-art two-dimensional LB model based on 37 populations (a D2Q37 model), that accurately reproduces the thermo-hydrodynamics of a 2D-fluid obeying the equation-of-state of a perfect gas. The computational features of this model make it a significant benchmark to analyze the performance of new computational platforms, since critical kernels in this code require both high memory-bandwidth on sparse memory addressing patterns and floating-point throughput. In this paper we consider two recent classes of GPU boards based on the Fermi and Kepler architectures; we describe in detail all steps done to implement and optimize our LB code and analyze its performance first on single-GPU systems, and then on parallel multi-GPU systems based on one node as well as on a cluster of many nodes; in the latter case we use CUDA-aware MPI as an abstraction layer to assess the advantages of advanced GPU-to-GPU communication technologies like GPUDirect. On our implementation, aggregate sustained performance of the most compute intensive part of the code breaks the 1 double-precision Tflops barrier on a single-host system with two GPUs.

Crossref

Archivio istituzionale della ricerca - Università di Ferrara