171 research outputs found

    Performance and portability of accelerated lattice Boltzmann applications with OpenACC

    Get PDF
    An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators usually had to be programmed in architecture-specific languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives that mark regions of existing C, C++, or Fortran code to run on accelerators. This approach directly addresses code portability, leaving support for each different accelerator to the compiler, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using a massively parallel lattice Boltzmann algorithm as a test bench. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance portability of OpenACC-based applications across several state-of-the-art architectures.

    Design and optimization of a portable LQCD Monte Carlo code using OpenACC

    Full text link
    The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting aggressive data parallelism and delivering higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent, making it tedious and error-prone to keep different code versions aligned. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance portability can be reached.
    Comment: 26 pages, 2 png figures; preprint of an article submitted for consideration in International Journal of Modern Physics

    Energy-efficiency evaluation of Intel KNL for HPC workloads

    Get PDF
    Energy consumption is increasingly becoming a limiting factor in the design of faster large-scale parallel systems, and the development of energy-efficient and energy-aware applications is today a relevant issue for HPC code-developer communities. In this work we focus on the energy performance of the Knights Landing (KNL) Xeon Phi, the latest many-core architecture processor introduced by Intel into the HPC market. We consider the 64-core Xeon Phi 7230 and analyze its energy performance using both the on-chip MCDRAM and the regular DDR4 system memory as main storage for the application data domain. As a benchmark application we use a lattice Boltzmann code heavily optimized for this architecture and implemented using different memory data layouts to store its lattice. We then assess the energy consumption using different memory data layouts, memory types (DDR4 or MCDRAM), and numbers of threads per core.

    Lattice QCD based on OpenCL

    Get PDF
    We present an OpenCL-based Lattice QCD application using a heatbath algorithm for the pure gauge case and Wilson fermions in the twisted-mass formulation. The implementation is platform independent and can be used on AMD or NVIDIA GPUs, as well as on classical CPUs. On the AMD Radeon HD 5870 our double-precision dslash implementation performs at 60 GFLOPS over a wide range of lattice sizes. The hybrid Monte Carlo presented reaches a speedup of four over the reference code running on a server CPU.
    Comment: 19 pages, 11 figures

    Towards a portable and future-proof particle-in-cell plasma physics code

    Get PDF
    We present the first reported OpenCL implementation of EPOCH3D, an extensible particle-in-cell plasma physics code developed at the University of Warwick. We document the challenges and successes of this porting effort, and compare the performance of our implementation executing on a wide variety of hardware from multiple vendors. The focus of our work is on understanding the suitability of existing algorithms for future accelerator-based architectures, and identifying the changes necessary to achieve performance portability for particle-in-cell plasma physics codes. We achieve good levels of performance with limited changes to the algorithmic behaviour of the code. However, our results suggest that a fundamental change to EPOCH3D's current accumulation step (and its dependency on atomic operations) is necessary in order to fully utilise the massive levels of parallelism supported by emerging parallel architectures.

    Automatic calculation and evaluation of flow in complex geometries using finite volume and lattice Boltzmann methods

    Get PDF
    Despite significant progress, computational fluid dynamics (CFD) can still not be used as a "black box" approach, as meshing often requires manual intervention and choosing numerical parameters requires deep knowledge of the methods behind CFD. Improving CFD towards such a black-box solution not only lowers the barrier to application, since less specialized knowledge is required, but also enables scientific insight: for example, far more data points can be generated, which is needed to develop accurate models for some problems. This thesis illustrates these benefits with three exemplary applications:
    • The accurate prediction of the pressure drop of a sphere packed bed is of great importance in engineering. For geometries where the spheres are relatively large compared to the confinement, the wall effect plays another important role.
    Many correlations have been presented, usually based on experimental measurements, which differ from one another by approximately 20 %. Here, the combination of simulated packing generation and CFD is used to evaluate the pressure drop for a very large number of packings with different sphere diameters and different geometries of the confining walls. It is shown that for small ratios of sphere diameter to hydraulic diameter of the reactor the pressure drop is a non-monotonic function, which can explain the differences in experimental findings.
    • The Fischer-Tropsch synthesis is again of increasing interest, as it allows the production of carbon-neutral fuel. Transport pores can be added to the catalyst needed for the reaction to enhance transport and consequently the yield. A three-dimensional extension of a one-dimensional model from the literature for transport and reaction is presented here. The automation of the calculation enables the algorithmic optimization of the catalyst layers. The results show that for transport pores larger than 50 µm the problem must be treated as three-dimensional. Larger transport pores up to a diameter of 250 µm can also be used to achieve a gain in area-time yield, but thicker catalyst layers and a higher transport-pore porosity are needed to overcome the drawbacks of larger pores.
    • Nasal septum deviation is very common in the general population, but it is unclear why it causes symptoms for certain patients while others report no discomfort. Previous studies focused on the analysis of a few selected cases, which did not lead to clear results as the human nose shows high natural variation in geometry. Here, a fully automatic approach for calculating integral parameters such as the pressure drop and the flow distribution between the two airways from computed tomography (CT) scans is presented. Furthermore, a method to reduce the computational time by removing the paranasal sinuses from the scan using machine-learning algorithms is proposed.
    For this case, fully automatic processing can be used to convert a whole database of CT scans into fluid-dynamic parameters that can be used for statistical analysis. Furthermore, it could allow the introduction of CFD analysis into clinical practice.
    The lattice Boltzmann method (LBM) is an alternative to "classical" finite-volume-based solvers of the Navier-Stokes equations. Since it offers easy grid generation, a novel LBM implementation is used here to calculate the flow through the sphere packings and the nasal cavity. The implementation features good portability to various systems and to hardware such as GPUs, whose cost-effectiveness broadens the applicability of CFD. It can utilize grid refinement, and a meshing algorithm suitable for GPUs is presented. To overcome slow I/O and to simplify automatic evaluation, GPU-assisted co-processing is implemented. Nevertheless, the application case of Fischer-Tropsch synthesis shows that "classical" finite-volume-based solvers such as OpenFOAM are also a valid choice for automatic processing if structured meshes can be used. Furthermore, for some applications it is easier to model the problem using partial differential equations, which can be solved directly using the finite volume method (FVM).

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Get PDF
    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications-level programmer, and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas: partial differential equations; graph cluster metric calculations; and random number generation. We report on programming experiences and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs, a CellBE, and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.

    OpenCL‐based implementation of an unstructured edge‐based finite element convection‐diffusion solver on graphics hardware

    Get PDF
    The solution of problems in computational fluid dynamics (CFD) represents a classical field for the application of advanced numerical methods. Many different approaches have been developed over the years to address CFD applications. Good examples are finite volumes, finite differences (FD), and finite elements (FE), but also newer approaches such as the lattice Boltzmann method (LB), smoothed particle hydrodynamics, or the particle finite element method. FD and LB methods on regular grids are known to be superior in terms of raw computing speed, but such regular discretizations represent an important limitation in dealing with complex geometries. Here, we concentrate on unstructured approaches, which are less common in the GPU world. We employ a nonstandard FE approach that leverages an optimized edge-based data structure allowing a highly parallel implementation. This technique is applied to the convection-diffusion problem, which is often considered a first step towards CFD because of its similarity to the nonconservative form of the Navier–Stokes equations. In this regard, an existing highly optimized parallel OpenMP solver is ported to graphics hardware based on the OpenCL platform. The optimizations performed are discussed in detail. A number of benchmarks prove that the GPU-accelerated OpenCL code consistently outperforms the OpenMP version.

    Towards enhancing coding productivity for GPU programming using static graphs

    Get PDF
    The main contribution of this work is to increase the coding productivity of GPU programming by using the concept of Static Graphs. GPU capabilities have been increasing significantly in terms of performance and memory capacity. However, there are still some problems in terms of scalability and limitations to the amount of work that a GPU can perform at a time. To minimize the overhead associated with the launch of GPU kernels, as well as to maximize the use of GPU capacity, we have combined the new CUDA Graph API with the CUDA programming model (including CUDA math libraries) and the OpenACC programming model. We use as test cases two different, well-known and widely used problems in HPC and AI: the Conjugate Gradient method and Particle Swarm Optimization. In the first test case (Conjugate Gradient) we focus on the integration of Static Graphs with CUDA. In this case, we are able to significantly outperform the NVIDIA reference code, reaching an acceleration of up to 11× thanks to a better implementation that can benefit from the new CUDA Graph capabilities. In the second test case (Particle Swarm Optimization), we complement the OpenACC functionality with the use of CUDA Graph, again achieving accelerations of up to one order of magnitude, with average speedups ranging from 2× to 4×, and performance very close to a reference, optimized CUDA code. Our main target is to achieve a higher coding-productivity model for GPU programming by using Static Graphs, which provides, in a very transparent way, better exploitation of the GPU capacity. The combination of Static Graphs with two of the currently most important GPU programming models (CUDA and OpenACC) is able to considerably reduce the execution time with respect to using CUDA and OpenACC alone, achieving accelerations of up to more than one order of magnitude.
    Finally, we propose an interface to incorporate the concept of Static Graphs into the OpenACC specification.
    This research was funded by the EPEEC project from the European Union's Horizon 2020 Research and Innovation program under grant agreement No. 801051. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan, accessed on 13 April 2022).
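The launch-overhead amortization the abstract describes rests on the stream-capture portion of the CUDA Graph API: a fixed sequence of kernel launches is recorded once, instantiated, and then replayed with a single `cudaGraphLaunch` per iteration. A hedged sketch (CUDA 10/11-era `cudaGraphInstantiate` signature; allocation, error checking, and the paper's actual kernels omitted):

```cuda
#include <cuda_runtime.h>

__global__ void axpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

/* Capture two dependent kernel launches into a graph, then replay the
 * whole graph with one launch call per iteration. */
void run_graph(int n, int iters, float *d_x, float *d_y) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    axpy<<<(n + 255) / 256, 256, 0, stream>>>(n, 2.0f, d_x, d_y);
    axpy<<<(n + 255) / 256, 256, 0, stream>>>(n, -1.0f, d_y, d_x);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);

    for (int it = 0; it < iters; it++)
        cudaGraphLaunch(exec, stream);  /* one launch replays both kernels */
    cudaStreamSynchronize(stream);
}
```

The saving comes from replacing per-kernel launch overhead inside the iteration loop with a single graph replay, which is what allows the speedups reported above.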
