10 research outputs found

    Evaluation of Image Pixels Similarity Measurement Algorithm Accelerated on GPU with OpenACC

    OpenACC is a directive-based parallel programming model that allows existing C, C++ and Fortran applications to be accelerated with minimal code modifications. Annotating the bottleneck sections of the code with OpenACC directives simplifies acceleration and leads to high performance portability across different target graphics processing units (GPUs). In this work, the portability of a parallelizable chi-square-based pixel similarity measurement algorithm is evaluated on two consumer- and professional-grade GPUs. To the best of our knowledge, this is the first performance evaluation report that uses the OpenACC optimization clauses (collapse and tile) on different GPUs to process a light workload (a low-resolution image of 581 x 429 pixels) and a heavy workload (a high-resolution image of 4500 x 3500 pixels) to demonstrate the effectiveness and high portability of OpenACC.
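
    The abstract does not give the exact similarity formula, so the following is only a minimal sketch, assuming the common chi-square distance (the sum of (a - b)^2 / (a + b) over all pixels) on two grayscale images stored row-major. The function name `chi_square_distance` and its parameters are hypothetical. It shows where a `collapse(2)` clause could sit on the pixel loop nest; a `tile` clause would be the alternative placement on the same loop. Compilers without OpenACC support ignore the pragma and run the loops serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the paper's code: chi-square distance between two
   grayscale images of size height x width, stored row-major as floats. */
double chi_square_distance(const float *a, const float *b, int height, int width)
{
    assert(a != NULL && b != NULL && height > 0 && width > 0);
    double sum = 0.0;

    /* collapse(2) merges the row and column loops into one iteration space;
       the reduction clause combines the per-thread partial sums. */
    #pragma acc parallel loop collapse(2) reduction(+:sum) \
        copyin(a[0:height*width], b[0:height*width])
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            size_t i = (size_t)y * (size_t)width + (size_t)x;
            double d = (double)a[i] - (double)b[i];
            double s = (double)a[i] + (double)b[i];
            if (s > 0.0) {            /* skip pixel pairs that are both zero */
                sum += d * d / s;
            }
        }
    }
    return sum;
}
```

    Identical images give a distance of zero, and larger values indicate less similar images.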

    Design and optimization of a portable LQCD Monte Carlo code using OpenACC

    The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPUs, which support a wide class of applications but deliver moderate computing performance, to many-core GPUs, which exploit aggressive data parallelism and deliver higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent, making it tedious and error-prone to keep different code versions aligned. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance portability can be reached. Comment: 26 pages, 2 png figures, preprint of an article submitted for consideration in International Journal of Modern Physics.

    Performance and portability of accelerated lattice Boltzmann applications with OpenACC

    An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed only with specific programming languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives to mark regions of existing C, C++, or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance portability of OpenACC-based applications across several state-of-the-art architectures.

    Massively parallel lattice–Boltzmann codes on large GPU clusters

    This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on one GPU and to scale well to a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high-performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops, and a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.

    Locality optimized unstructured mesh algorithms on GPUs

    Unstructured-mesh-based numerical algorithms such as finite volume and finite element algorithms form an important class of applications for many scientific and engineering domains. The key difficulty in achieving higher performance from these applications is the indirect accesses that lead to data races when parallelized. Current methods for handling such data races lead to reduced parallelism and suboptimal performance. Particularly on modern many-core architectures such as GPUs, which have increasing core and thread counts, reducing data movement and exploiting memory locality are vital for gaining good performance. In this work we present novel locality-exploiting optimizations for the efficient execution of unstructured-mesh algorithms on GPUs. Building on a two-layered coloring strategy for handling data races, we introduce novel reordering and partitioning techniques to further improve execution efficiency. The new optimizations are then applied to several well-established unstructured-mesh applications, investigating their performance on NVIDIA's latest P100 and V100 GPUs. We demonstrate significant speedups (1.1–1.75×) compared to the state of the art. A range of performance metrics is benchmarked, including runtime, memory transactions, achieved bandwidth, GPU occupancy and data reuse factors, and these are used to understand and explain the key factors impacting performance. The optimized algorithms are implemented as an open-source software library, and we illustrate its use for improving the performance of existing or new unstructured-mesh applications.
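
    To illustrate why coloring removes the data races mentioned above, here is a minimal sketch of plain single-level greedy edge coloring. It is not the paper's two-layered scheme or its reordering techniques, and the names `color_edges` and `accumulate_by_color` are hypothetical. Edges that share a node receive different colors, so all edges of one color can update their end nodes concurrently without conflicts.

```c
#include <assert.h>
#include <stdlib.h>

/* Greedy edge coloring (illustrative only). edges holds 2 * num_edges node
   indices; color receives one color per edge. Returns the number of colors
   used, or -1 on allocation failure or if more than 64 colors are needed. */
int color_edges(const int *edges, int num_edges, int num_nodes, int *color)
{
    assert(edges != NULL && color != NULL && num_edges >= 0 && num_nodes > 0);
    /* used[n] is a bitmask of the colors already taken at node n. */
    unsigned long long *used = calloc((size_t)num_nodes, sizeof *used);
    if (used == NULL) return -1;

    int num_colors = 0;
    for (int e = 0; e < num_edges; ++e) {
        int n0 = edges[2 * e], n1 = edges[2 * e + 1];
        unsigned long long taken = used[n0] | used[n1];
        int c = 0;
        while (c < 64 && ((taken >> c) & 1ULL)) ++c;   /* first free color */
        if (c == 64) { free(used); return -1; }
        color[e] = c;
        used[n0] |= 1ULL << c;
        used[n1] |= 1ULL << c;
        if (c + 1 > num_colors) num_colors = c + 1;
    }
    free(used);
    return num_colors;
}

/* Scatter-add edge values to nodes, one color at a time. Within one color no
   two edges touch the same node, so the inner loop iterations are independent
   and could be run in parallel; colors are processed one after another. */
void accumulate_by_color(const int *edges, const int *color, int num_edges,
                         int num_colors, const double *edge_val, double *node_val)
{
    for (int c = 0; c < num_colors; ++c) {
        for (int e = 0; e < num_edges; ++e) {
            if (color[e] == c) {
                node_val[edges[2 * e]]     += edge_val[e];
                node_val[edges[2 * e + 1]] += edge_val[e];
            }
        }
    }
}
```

    The cost of this approach is the one the abstract points to: each color is a separate pass with less parallelism, and edges of the same color are scattered in memory, which is why locality-aware reordering matters.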

    Accelerating a C++ CFD Code with OpenACC

    Today's HPC systems increasingly use accelerators to lower time to solution for their users and to reduce power consumption. To exploit the higher performance and energy efficiency of these accelerators, application developers need to rewrite at least parts of their codes. Taking the C++ flow solver ZFS as an example, we show that the directive-based programming model allows one to achieve good performance with reasonable effort, even for mature codes with many lines of code. Using OpenACC directives permitted us to accelerate ZFS incrementally, focusing on the parts of the program that are relevant for the problem at hand. Two new OpenACC 2.0 features, unstructured data regions and atomics, are required for this. OpenACC's interoperability with existing GPU libraries via the host_data use_device construct allowed us to use CUDA-aware MPI to achieve multi-GPU scalability comparable to the CPU version of ZFS. Like those of many other codes, the data structures of ZFS have been designed with traditional CPUs and their relatively large private caches in mind. This leads to suboptimal memory access patterns on accelerators such as GPUs. We show how the texture cache on NVIDIA GPUs can be used to minimize the performance impact of these suboptimal patterns without writing platform-specific code. For the kernel most affected by the memory access pattern, we compare the initial array-of-structures memory layout with a structure-of-arrays layout.
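
    The difference between the two layouts compared in the last sentence can be sketched as follows. The field names (`rho`, `u`, `v`, `w`) and the type and function names are hypothetical and are not taken from ZFS; the sketch only shows how the same per-cell quantity is reached in each layout.

```c
#include <assert.h>
#include <stddef.h>

/* Array of structures (AoS): all fields of one cell are adjacent in memory. */
typedef struct { double rho, u, v, w; } CellAoS;

/* Structure of arrays (SoA): each field is its own contiguous array. */
typedef struct { double *rho, *u, *v, *w; } CellsSoA;

/* Reading only rho in AoS strides over the unused fields u, v, w. */
double sum_rho_aos(const CellAoS *cells, size_t n)
{
    assert(cells != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += cells[i].rho;
    }
    return sum;
}

/* Reading only rho in SoA touches consecutive addresses, so neighbouring
   loop iterations (or GPU threads) access neighbouring memory locations. */
double sum_rho_soa(const CellsSoA *cells, size_t n)
{
    assert(cells != NULL && cells->rho != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += cells->rho[i];
    }
    return sum;
}
```

    Both functions compute the same result; only the stride between consecutive accesses differs, which is what determines how well memory accesses coalesce on a GPU.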

    Real-Time Simulation and Prognosis of Smoke Propagation in Compartments Using a GPU

    The evaluation of life safety in buildings in case of fire is often based on smoke spread calculations. However, recent simulation models, which are generally based on computational fluid dynamics, often require long execution times or high-performance computers to deliver simulation results in real time or faster. Therefore, the objective of this study is the development of a concept for the real-time and prognosis simulation of smoke propagation in compartments using a graphics processing unit (GPU). The developed concept is summarized in an expandable open-source software basis called JuROr (Jülich's Real-time simulation within ORPHEUS). JuROr simulates buoyancy-driven, turbulent smoke spread based on a reduced modeling approach using finite differences and a Large Eddy Simulation turbulence model to solve the incompressible Navier-Stokes and energy equations. This reduced model is fully adapted to the target hardware of highly parallel computer architectures. The code is written in the object-oriented programming language C++ and the pragma-based programming model OpenACC. This model makes it possible to maintain a single source code that can be executed in serial and in parallel on various architectures. Further, the study provides a proof of JuROr's concept of balancing sufficient accuracy and practicality. First, the code was successfully verified using unit and (semi-)analytical tests. Then, the underlying model was validated by comparing the numerical results to the experimental results of scenarios relevant for fire protection. Verification and validation showed acceptable accuracy for JuROr's application. Lastly, the performance criteria of JuROr (being real-time and prognosis capable, with comparable performance across various architectures) were successfully evaluated. Here, JuROr also showed high speedup results on a GPU and faster time to solution compared to the established Fire Dynamics Simulator. 
These results show JuROr's practicality.