Performance and portability of accelerated lattice Boltzmann applications with OpenACC
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators usually had to be programmed in architecture-specific languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives that mark regions of existing C, C++, or Fortran code to be run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm written in CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
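The directive-based style the abstract describes can be sketched in a few lines. The kernel below is a hypothetical BGK-style relaxation loop, not the paper's actual code: with an OpenACC compiler the pragma offloads the loop to an accelerator, while a plain C compiler simply ignores it and runs the same source on the CPU, which is the portability argument being made.

```c
#include <stddef.h>

/* Illustrative OpenACC kernel: relax each lattice population towards
 * its local equilibrium (BGK collision). The pragma asks the compiler
 * to parallelize the loop on an accelerator and to manage the two
 * array transfers; without OpenACC support it is ignored and the loop
 * runs serially on the host. */
void relax_populations(double *f, const double *f_eq,
                       size_t n, double omega)
{
    #pragma acc parallel loop copy(f[0:n]) copyin(f_eq[0:n])
    for (size_t i = 0; i < n; i++) {
        f[i] = f[i] - omega * (f[i] - f_eq[i]);
    }
}
```

The same source file can thus be built for CPUs and GPUs alike, with per-architecture tuning delegated to the compiler.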
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging
from traditional multi-core CPU processors, supporting a wide class of
applications but delivering moderate computing performance, to many-core GPUs,
exploiting aggressive data parallelism and delivering higher performance for
streaming computing applications. In this scenario, code portability (and
performance portability) becomes necessary for easy maintainability of
applications; this is very relevant in scientific computing, where code changes
are very frequent, making it tedious and error-prone to keep different code
versions aligned. In this work we present the design and optimization of a
state-of-the-art production-level LQCD Monte Carlo application, using the
directive-based OpenACC programming model. OpenACC abstracts parallel
programming to a descriptive level, relieving programmers from specifying how
codes should be mapped onto the target architecture. We describe the
implementation of a code fully written in OpenACC, and show that we are able to
target several different architectures, including state-of-the-art traditional
CPUs and GPUs, with the same code. We also measure performance, evaluating the
computing efficiency of our OpenACC code on several architectures, comparing
with GPU-specific implementations and showing that a good level of
performance-portability can be reached.
Comment: 26 pages, 2 png figures, preprint of an article submitted for consideration in International Journal of Modern Physics
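A second ingredient such OpenACC codes rely on, relevant for a Monte Carlo application that repeats the same update many times, is keeping the data resident on the accelerator across sweeps. The sketch below is illustrative only (the field, the trivial update, and all names are invented, not taken from the LQCD code): the outer `data` region moves the array once, and each sweep then finds it already present on the device.

```c
#include <stddef.h>

/* One sweep of some local update; the loop may run on the device,
 * where the field is assumed to be already present. */
void sweep(double *field, size_t n, double eps)
{
    #pragma acc parallel loop present(field[0:n])
    for (size_t i = 0; i < n; i++)
        field[i] += eps;               /* stand-in for a real update */
}

/* Wrap the whole Monte Carlo run in one data region: a single
 * host<->device copy instead of one per sweep. */
void run_sweeps(double *field, size_t n, int nsweeps, double eps)
{
    #pragma acc data copy(field[0:n])
    for (int s = 0; s < nsweeps; s++)
        sweep(field, n, eps);
}
```

Minimizing these transfers is typically where most of the tuning effort of a directive-based port goes.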
Energy-efficiency evaluation of Intel KNL for HPC workloads
Energy consumption is increasingly becoming a limiting factor to the design
of faster large-scale parallel systems, and development of energy-efficient and
energy-aware applications is today a relevant issue for HPC code-developer
communities. In this work we focus on energy performance of the Knights Landing
(KNL) Xeon Phi, the latest many-core architecture processor introduced by Intel
into the HPC market. We take into account the 64-core Xeon Phi 7230, and
analyze its energy performance using both the on-chip MCDRAM and the regular
DDR4 system memory as main storage for the application data-domain. As a
benchmark application we use a Lattice Boltzmann code heavily optimized for
this architecture and implemented using different memory data layouts to store
its lattice. We then assess the energy consumption for different memory
data layouts, kinds of memory (DDR4 or MCDRAM), and numbers of threads per core.
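The data-layout comparison mentioned above usually comes down to array-of-structures (AoS) versus structure-of-arrays (SoA) storage of the lattice populations. A minimal sketch with an invented population count (not the benchmark's actual layouts): in SoA, population p of all sites is contiguous, giving the stride-1 accesses that vector units and wide memory interfaces like MCDRAM favour.

```c
#include <stddef.h>

#define NPOP 4   /* illustrative population count */

/* AoS: all populations of one site are contiguous */
typedef struct { double f[NPOP]; } site_aos;

double sum_site_aos(const site_aos *lat, size_t i)
{
    double s = 0.0;
    for (int p = 0; p < NPOP; p++) s += lat[i].f[p];
    return s;
}

/* SoA: population p of all sites is contiguous, so a loop over
 * sites for fixed p is a unit-stride (vectorizable) access */
double sum_site_soa(const double *lat, size_t nsites, size_t i)
{
    double s = 0.0;
    for (int p = 0; p < NPOP; p++) s += lat[p * nsites + i];
    return s;
}
```

Both layouts hold the same data; which one is faster (and cheaper in energy) depends on the access pattern of the kernel and the memory it is served from.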
Lattice QCD based on OpenCL
We present an OpenCL-based Lattice QCD application using a heatbath algorithm
for the pure gauge case and Wilson fermions in the twisted mass formulation.
The implementation is platform independent and can be used on AMD or NVIDIA
GPUs, as well as on classical CPUs. On the AMD Radeon HD 5870 our double
precision dslash implementation performs at 60 GFLOPS over a wide range of
lattice sizes. The hybrid Monte Carlo code presented reaches a speedup of four over
the reference code running on a server CPU.
Comment: 19 pages, 11 figures
Towards a portable and future-proof particle-in-cell plasma physics code
We present the first reported OpenCL implementation of EPOCH3D, an extensible particle-in-cell plasma physics code developed at the University of Warwick. We document the challenges and successes of this porting effort, and compare the performance of our implementation executing on a wide variety of hardware from multiple vendors. The focus of our work is on understanding the suitability of existing algorithms for future accelerator-based architectures, and identifying the changes necessary to achieve performance portability for particle-in-cell plasma physics codes.
We achieve good levels of performance with limited changes to the algorithmic behaviour of the code. However, our results suggest that a fundamental change to EPOCH3D's current accumulation step (and its dependency on atomic operations) is necessary in order to fully utilise the massive levels of parallelism supported by emerging parallel architectures.
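The accumulation step singled out above is a scatter: every particle adds its contribution to a grid cell, and in parallel two particles can hit the same cell, which is why the direct version needs atomics. A serial sketch of the pattern and of the standard privatization alternative, where each worker owns a grid copy that is merged afterwards (illustrative, not EPOCH3D's code):

```c
#include <stddef.h>
#include <string.h>

/* Direct particle-to-grid scatter: in a parallel version each '+='
 * on grid[] must be an atomic add, since different threads can
 * target the same cell. */
void scatter_direct(double *grid, size_t ncells,
                    const size_t *cell, const double *charge, size_t np)
{
    for (size_t i = 0; i < np; i++)
        grid[cell[i] % ncells] += charge[i];
}

/* Privatized accumulation: each worker writes contention-free into
 * its own grid copy (priv must hold nworkers*ncells doubles); the
 * copies are then merged in a reduction. Trades memory for the
 * removal of atomics. */
void scatter_privatized(double *grid, size_t ncells,
                        const size_t *cell, const double *charge,
                        size_t np, double *priv, size_t nworkers)
{
    memset(priv, 0, nworkers * ncells * sizeof(double));
    for (size_t w = 0; w < nworkers; w++)        /* parallel region */
        for (size_t i = w; i < np; i += nworkers)
            priv[w * ncells + cell[i] % ncells] += charge[i];
    for (size_t w = 0; w < nworkers; w++)        /* merge/reduction  */
        for (size_t c = 0; c < ncells; c++)
            grid[c] += priv[w * ncells + c];
}
```

On wide accelerators the memory cost of one private copy per worker is what makes this trade-off non-trivial, which is the tension the abstract points at.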
Automatic calculation and evaluation of flow in complex geometries using finite volume and lattice Boltzmann methods
Despite significant progress, computational fluid dynamics (CFD) can still not be used as a “black box approach”
as meshing often requires manual intervention, and choosing numerical parameters requires deep knowledge of the
methods behind CFD. Improving CFD towards such a black-box solution not only lowers the barrier to application,
as less specialized knowledge is required, but also enables scientific insight. For example, many more of the data
points needed to develop accurate models for some problems can be generated. This thesis illustrates these benefits with
three exemplary applications:
• The accurate prediction of the pressure drop of a sphere packed bed is of great importance in engineering. For
geometries where the spheres are relatively large compared to the confinement, the wall effect also plays an
important role. Many correlations have been presented, usually based on experimental measurements, which
differ from one another by approx. 20 %. Here, the combination of simulated packing generation and CFD is used to
evaluate the pressure drop for a very large number of packings with different sphere diameters and different
geometries of the confining walls. It is shown that for small ratios of sphere diameter to hydraulic diameter of
the reactor the pressure drop is a non-monotonic function, which can explain the differences in experimental
findings.
• The Fischer-Tropsch synthesis is again of increasing interest, as it allows the production of carbon-neutral fuel.
Transport pores can be added to the catalyst needed for the reaction to enhance transport and consequently
the yield. A three-dimensional extension of a one-dimensional transport-and-reaction model from the literature
is presented here. The automation of the calculation is used to enable the algorithmic optimization of the
catalyst layers. The results show that for transport pores larger than 50 µm the problem must be treated as
three-dimensional. Larger transport pores up to a diameter of 250 µm can also be used to achieve a gain in
area-time yield, but thicker catalyst layers and a higher transport pore porosity are needed to overcome the
drawbacks of larger pores.
• Nasal septum deviation is very common in the general population, but it is unclear why it causes symptoms for
certain patients while others report no discomfort. Previous studies focused on the analysis of a few selected
cases, which did not lead to clear results as the human nose shows high natural variations in geometry. Here, a
fully automatic approach for calculating critical parameters like the pressure drop and the flow distribution
between the two airways from computed tomography (CT) scans is presented. Furthermore, a method to
reduce the computational time by removing the paranasal sinuses from the scan using machine learning
is proposed. For this case, fully automatic processing can be used to convert a whole database of
CT scans into fluid dynamic parameters that can be used for statistical analysis. Furthermore, it could allow
the introduction of CFD analysis into clinical practice.
The lattice Boltzmann method (LBM) is an alternative to “classical” finite-volume-based solvers of the
Navier-Stokes equations. Since it offers easy generation of grids, a novel LBM implementation is used here to
calculate the flow through the sphere packings and the nasal cavity. The implementation features good portability
to various systems and hardware like GPUs, which, due to their cost-effectiveness, broaden the applicability of CFD.
It can utilize grid refinement, and a meshing algorithm suitable for GPUs is presented. To overcome slow I/O and to
simplify automatic evaluation, GPU-assisted co-processing is implemented. Nevertheless, the application case of
Fischer-Tropsch synthesis shows that “classical” finite-volume-based solvers like OpenFOAM are also a valid choice
for automatic processing if structured meshes can be used. Furthermore, for some applications it is easier to model
the problem using partial differential equations, which can be directly solved using the finite volume method (FVM).
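The "easy generation of grids" argument for LBM made above can be illustrated concretely: since the grid is a regular array of cells, meshing reduces to flagging each cell fluid or solid directly from the geometry, with no unstructured mesh generation step. A toy 2-D sketch for a spherical obstacle (illustrative; the thesis' actual voxelizer works on sphere packings and CT data):

```c
#include <stddef.h>

typedef enum { FLUID = 0, SOLID = 1 } cell_t;

/* "Mesh" an nx-by-ny regular grid by flagging every cell inside a
 * circle of radius r around (cx, cy) as solid. Each cell is decided
 * independently, so the loop parallelizes trivially, e.g. on a GPU. */
size_t voxelize_sphere(cell_t *cells, size_t nx, size_t ny,
                       double cx, double cy, double r)
{
    size_t nsolid = 0;
    for (size_t y = 0; y < ny; y++)
        for (size_t x = 0; x < nx; x++) {
            double dx = (double)x - cx, dy = (double)y - cy;
            cell_t c = (dx * dx + dy * dy <= r * r) ? SOLID : FLUID;
            cells[y * nx + x] = c;
            nsolid += (c == SOLID);
        }
    return nsolid;
}
```

The same cell-wise test works for any geometry given as an inside/outside predicate, which is why regular-grid methods lend themselves to fully automatic pipelines.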
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and the Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid
applications using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the application-level
programmer, and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas, including partial differential equations, graph cluster
metric calculations, and random number generation. We report on programming experiences
and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs,
a CellBE, and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
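The "CPU threads controlling independent GPU devices" scheme described above can be sketched with POSIX threads: one host thread per device, each driving its own data slice. This is a hypothetical illustration, not the paper's code; the "device work" here is a plain CPU loop standing in for selecting a GPU and launching a kernel on it.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct {
    double *data;
    size_t  begin, end;   /* slice owned by this device/thread */
} device_job;

static void *drive_device(void *arg)
{
    device_job *job = (device_job *)arg;
    /* In the real hybrid setup this thread would select its GPU,
     * copy its slice over, and launch a kernel; here we just square
     * the slice on the CPU as a stand-in. */
    for (size_t i = job->begin; i < job->end; i++)
        job->data[i] *= job->data[i];
    return NULL;
}

/* Coarse-grained parallelism: one host thread per device
 * (ndevices is assumed to be at most 16 in this sketch). */
void run_on_devices(double *data, size_t n, size_t ndevices)
{
    pthread_t tid[16];
    device_job job[16];
    size_t chunk = (n + ndevices - 1) / ndevices;
    for (size_t d = 0; d < ndevices; d++) {
        job[d].data  = data;
        job[d].begin = d * chunk;
        job[d].end   = (d + 1) * chunk < n ? (d + 1) * chunk : n;
        pthread_create(&tid[d], NULL, drive_device, &job[d]);
    }
    for (size_t d = 0; d < ndevices; d++)
        pthread_join(tid[d], NULL);
}
```

The coarse-grained/fine-grained split is visible in the structure: threading happens on the host, while the per-slice loop is where data-parallel device code would go.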
OpenCL-based implementation of an unstructured edge-based finite element convection-diffusion solver on graphics hardware
The solution of problems in computational fluid dynamics (CFD) represents a classical field for the application of advanced numerical methods. Many different approaches have been developed over the years to address CFD applications. Good examples are finite volumes, finite differences (FD), and finite elements (FE), but also newer approaches such as the lattice-Boltzmann (LB) method, smoothed particle hydrodynamics, or the particle finite element method. FD and LB methods on regular grids are known to be superior in terms of raw computing speed, but such regular discretizations represent an important limitation in dealing with complex geometries. Here, we concentrate on unstructured approaches, which are less common in the GPU world. We employ a nonstandard FE approach which leverages an optimized edge-based data structure allowing a highly parallel implementation. This technique is applied to the convection-diffusion problem, which is often considered a first step towards CFD because of its similarities to the nonconservative form of the Navier-Stokes equations. In this regard, an existing highly optimized parallel OpenMP solver is ported to graphics hardware based on the OpenCL platform. The optimizations performed are discussed in detail. A number of benchmarks prove that the GPU-accelerated OpenCL code consistently outperforms the OpenMP version.
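The edge-based data structure mentioned above stores the assembled operator as one coefficient per mesh edge, so applying it becomes a single flat loop over edges that scatters a contribution to both endpoint nodes. A toy sketch with a graph-Laplacian-style diffusion term (illustrative; not the solver's actual kernels):

```c
#include <stddef.h>

/* One mesh edge: endpoint node indices and its operator coefficient */
typedef struct { size_t i, j; double k; } edge;

/* Apply the edge-based operator: out = A * u, computed as one loop
 * over edges. On a GPU each edge maps to a thread, and the two
 * scattered updates need atomics or edge coloring to avoid races. */
void apply_edge_operator(double *out, const double *u,
                         size_t nnodes, const edge *edges, size_t nedges)
{
    for (size_t n = 0; n < nnodes; n++) out[n] = 0.0;
    for (size_t e = 0; e < nedges; e++) {
        double flux = edges[e].k * (u[edges[e].j] - u[edges[e].i]);
        out[edges[e].i] += flux;
        out[edges[e].j] -= flux;
    }
}
```

The appeal for GPUs is that the edge array is a flat, uniform work list, unlike an element loop whose per-element work varies with element type.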
Towards enhancing coding productivity for GPU programming using static graphs
The main contribution of this work is to increase the coding productivity of GPU programming by using the concept of Static Graphs. GPU capabilities have been increasing significantly in terms of performance and memory capacity. However, there are still some problems in terms of scalability and limitations to the amount of work that a GPU can perform at a time. To minimize the overhead associated with the launch of GPU kernels, as well as to maximize the use of GPU capacity, we have combined the new CUDA Graph API with the CUDA programming model (including CUDA math libraries) and the OpenACC programming model. We use as test cases two different, well-known and widely used problems in HPC and AI: the Conjugate Gradient method and the Particle Swarm Optimization. In the first test case (Conjugate Gradient) we focus on the integration of Static Graphs with CUDA. In this case, we are able to significantly outperform the NVIDIA reference code, reaching an acceleration of up to 11× thanks to a better implementation, which can benefit from the new CUDA Graph capabilities. In the second test case (Particle Swarm Optimization), we complement the OpenACC functionality with the use of CUDA Graph, achieving again accelerations of up to one order of magnitude, with average speedups ranging from 2× to 4×, and performance very close to a reference and optimized CUDA code. Our main target is to achieve a higher coding productivity model for GPU programming by using Static Graphs, which provides, in a very transparent way, a better exploitation of the GPU capacity. The combination of using Static Graphs with two of the current most important GPU programming models (CUDA and OpenACC) considerably reduces the execution time compared with using CUDA and OpenACC alone, achieving accelerations of more than one order of magnitude.
Finally, we propose an interface to incorporate the concept of Static Graphs into the OpenACC Specifications.
This research was funded by the EPEEC project from the European Union's Horizon 2020 Research and Innovation program under grant agreement No. 801051. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan, accessed on 13 April 2022).