MHD code using multi graphical processing units: SMAUG+
This paper introduces the Sheffield Magnetohydrodynamics Algorithm Using GPUs (SMAUG+), an advanced numerical code for solving magnetohydrodynamic (MHD) problems using multi-GPU systems. Multi-GPU systems facilitate the development of accelerated codes and enable us to investigate larger model sizes and/or more detailed computational domain resolutions. This is a significant advancement over the parent single-GPU MHD code, SMAUG (Griffiths, M., Fedun, V., and Erdélyi, R. (2015). A fast MHD code for gravitationally stratified media using graphical processing units: SMAUG. Journal of Astrophysics and Astronomy, 36(1):197-223). Here, we demonstrate the validity of the SMAUG+ code, describe the parallelisation techniques and investigate performance benchmarks. The initial configuration of the Orszag-Tang vortex simulations is distributed among 4, 16, 64 and 100 GPUs. Furthermore, different simulation box resolutions are applied: , , and . We also tested the code with the Brio-Wu shock tube simulations with a model size of 800 employing up to 10 GPUs. Based on the test results, we observed speed-ups and slow-downs, depending on the granularity and the communication overhead of certain parallel tasks. The main aim of the code development is to provide massively parallel code without the memory limitation of a single GPU. By using our code, the applied model size could be significantly increased. We demonstrate that we are able to successfully compute numerically valid and large 2D MHD problems.
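The trade-off between granularity and communication overhead mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from SMAUG+: the 1000 x 1000 grid, the ghost-layer width of 2 cells, and the function names `decompose` and `halo_fraction` are all assumptions made for illustration. The sketch splits a 2D grid into rectangular tiles (one per GPU) and reports how many ghost cells each tile must exchange relative to the cells it computes.

```python
def decompose(nx, ny, px, py):
    """Split an nx-by-ny grid into px*py rectangular tiles, one per GPU.

    Returns a list of (x_start, x_stop, y_start, y_stop) half-open index ranges.
    """
    def split(n, parts):
        # Distribute n cells over `parts` chunks; the first n % parts chunks
        # receive one extra cell so the chunks cover the axis exactly.
        base, extra = divmod(n, parts)
        bounds, start = [], 0
        for i in range(parts):
            stop = start + base + (1 if i < extra else 0)
            bounds.append((start, stop))
            start = stop
        return bounds

    return [(x0, x1, y0, y1)
            for (x0, x1) in split(nx, px)
            for (y0, y1) in split(ny, py)]


def halo_fraction(tile, ghost=2):
    """Ratio of ghost (halo) cells to interior cells for one tile.

    `ghost` is the halo width in cells; the value 2 is an assumption.
    """
    x0, x1, y0, y1 = tile
    w, h = x1 - x0, y1 - y0
    interior = w * h
    padded = (w + 2 * ghost) * (h + 2 * ghost)
    return (padded - interior) / interior


# Hypothetical 1000 x 1000 grid, split for 4 and for 100 GPUs.
tiles_4 = decompose(1000, 1000, 2, 2)
tiles_100 = decompose(1000, 1000, 10, 10)
print(len(tiles_4), halo_fraction(tiles_4[0]))
print(len(tiles_100), halo_fraction(tiles_100[0]))
```

For a fixed grid, smaller tiles carry proportionally more ghost cells, which is one plausible reason why adding GPUs can slow a run down rather than speed it up.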
The muphyII Code: Multiphysics Plasma Simulation on Large HPC Systems
Collisionless astrophysical and space plasmas cover regions that typically
display a separation of scales that exceeds any code's capabilities. To help
address this problem, the muphyII code utilizes a hierarchy of models with
different inherent scales, unified in an adaptive framework that allows
stand-alone use of models as well as a model-based dynamic and adaptive domain
decomposition. This requires ensuring excellent conservation properties,
careful treatment of inner-domain model boundaries for model coupling, and
robust time stepping algorithms, especially with the use of electron
subcycling. This multi-physics approach is implemented in the muphyII code,
tested on different scenarios of space plasma reconnection and evaluated
against space probe data and higher-fidelity simulation results from
literature. Adaptive model refinement is highlighted in particular, and a
hybrid model with kinetic ions, pressure-tensor fluid electrons, and Maxwell
fields is appraised.
Parallel computing 2011, ParCo 2011: book of abstracts
This book contains the abstracts of the presentations at the conference Parallel Computing 2011, 30 August - 2 September 2011, Ghent, Belgium.
pyCRay: A flexible and GPU-accelerated Radiative Transfer Framework for Simulating the Cosmic Epoch of Reionization
Detailed modelling of the evolution of neutral hydrogen in the intergalactic
medium during the Epoch of Reionization, , is critical in
interpreting the cosmological signals from current and upcoming 21-cm
experiments such as Low-Frequency Array (LOFAR) and the Square Kilometre Array
(SKA). Numerical radiative transfer codes offer the most physically motivated
approach for simulating the reionization process. However, they are
computationally expensive as they must encompass enormous cosmological volumes
while accurately capturing astrophysical processes occurring at small scales
(). Here, we present pyCRay, an updated version of the
massively parallel ray-tracing and chemistry code, CRay, which has been
extensively employed in reionization simulations. The most time-consuming part
of the code is calculating the hydrogen column density along the path of the
ionizing photons. Here, we present the Accelerated Short-characteristics
Octahedral RAytracing (ASORA) method, a ray-tracing algorithm specifically
designed to run on graphical processing units (GPUs). We include a modern
Python interface, allowing easy and customized use of the code without
compromising computational efficiency. We test pyCRay on a series of
standard ray-tracing tests and a complete cosmological simulation with volume
size , mesh size of and approximately sources.
Compared to the original code, pyCRay achieves the same results with
negligible fractional differences, , and a speedup factor of two
orders of magnitude. Benchmark analysis shows that ASORA takes a few
nanoseconds per source per voxel and scales linearly for an increasing number
of sources and voxels within the ray-tracing radii.
Comment: 16 pages, 13 figures
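The quantity this abstract identifies as the most expensive to compute, the hydrogen column density along a photon path, can be sketched for a single ray as follows. This is a generic illustration, not the ASORA method: the function names, the cell densities, and the path lengths are hypothetical, and the only physical constant used is the standard hydrogen photoionization cross-section at threshold.

```python
import math

SIGMA_HI = 6.30e-18  # cm^2, hydrogen photoionization cross-section at 13.6 eV


def column_densities(n_HI, ds):
    """Cumulative HI column density (cm^-2) in front of each cell on one ray.

    n_HI : neutral hydrogen number densities of the cells the ray crosses (cm^-3)
    ds   : path length of the ray through each of those cells (cm)
    Entry i of the returned list is the column between the source and the
    near edge of cell i; the second return value is the column behind the
    last cell.
    """
    columns, total = [], 0.0
    for n, length in zip(n_HI, ds):
        columns.append(total)
        total += n * length
    return columns, total


def transmitted_fraction(column):
    """Fraction of threshold-frequency photons that survive a given HI column."""
    return math.exp(-SIGMA_HI * column)


n_HI = [1e-4, 1e-4, 2e-4]      # hypothetical densities, cm^-3
ds = [3.0e21, 3.0e21, 3.0e21]  # hypothetical path lengths, roughly 1 kpc each
columns, total = column_densities(n_HI, ds)
print(columns, total, transmitted_fraction(total))
```

Each cell needs the column accumulated over all cells between it and the source, so the work grows with both the number of sources and the number of voxels reached, consistent with the linear scaling reported above.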
Trace-based Performance Analysis for Hardware Accelerators
This thesis presents how performance data from hardware accelerators can be included in event logs. It extends the capabilities of trace-based performance analysis to also monitor and record data from this novel parallelization layer. The increasing awareness of the power consumption of computing devices has led to an interest in hybrid computing architectures as well.
High-end computers, workstations, and mobile devices start to employ hardware accelerators to offload computationally intense and parallel tasks, while at the same time retaining a highly efficient scalar compute unit for non-parallel tasks. This execution pattern is typically asynchronous so that the scalar unit can resume other work while the hardware accelerator is busy. Performance analysis tools provided by the hardware accelerator vendors cover the situation of one host using one device very well.
Yet, they do not address the needs of the high performance computing community. This thesis investigates ways to extend existing methods for recording events from highly parallel applications to also cover scenarios in which hardware accelerators aid these applications. After introducing a generic approach that is suitable for any API based acceleration paradigm, the thesis derives a suggestion for a generic performance API for hardware accelerators and its implementation with NVIDIA CUPTI. In a next step the visualization of event logs containing data from execution streams on different levels of parallelism is discussed. In order to overcome the limitations of classic performance profiles and timeline displays, a graph-based visualization using Parallel Performance Flow Graphs (PPFGs) is introduced. This novel technical approach uses program states to display similarities and differences between the potentially very large number of event streams and, thus, enables a fast way to spot load imbalances. The thesis concludes with the in-depth analysis of a case-study of PIConGPU, a highly parallel, multi-hybrid plasma physics simulation, that benefited greatly from the developed performance analysis methods.
This dissertation shows how the execution of application parts that have been offloaded to hardware accelerators can be recorded as part of a program trace. This extends the established technique of trace-based application performance analysis so that it also captures this new level of parallelism. Limits on the electrical power consumption of computer systems have led to a growing number of hybrid computer architectures.
High-performance computers as well as workstations and mobile devices now use hardware accelerators to offload compute-intensive, parallel program parts, relieving the scalar main processor so that it is used only for non-parallel program parts. This execution scheme is typically asynchronous: the scalar processor can continue working while the hardware accelerator computes.
The performance analysis tools of hardware accelerator vendors cover the standard case (one host system with one hardware accelerator) very well, but fail to support highly parallel computer systems. This dissertation investigates to what extent multi-hybrid applications can also record the activity of hardware accelerators. To this end, the existing method for generating program traces for highly parallel applications is extended accordingly. The investigation first develops a general methodology for creating a program trace for any API-based hardware acceleration. Building on this, a dedicated programming interface is developed that makes it possible to record further performance-relevant data. The implementation of this interface is presented using NVIDIA CUPTI as an example. A further part of the work deals with the visualization of program traces that contain recordings from the different levels of parallelism. To overcome the limitations of classic performance profiles or timeline displays, Parallel Performance Flow Graphs (PPFGs) are introduced as a new graph-based form of visualization.
This novel approach shows a program trace as a sequence of program states with common and differing execution paths. Divergent program behavior and load imbalances can thus be located much more easily. The work concludes with a detailed analysis of PIConGPU, a multi-hybrid simulation from plasma physics, which benefited greatly from the analysis capabilities developed in this work.
GAMER: a GPU-Accelerated Adaptive Mesh Refinement Code for Astrophysics
We present the newly developed code, GAMER (GPU-accelerated Adaptive MEsh
Refinement code), which has adopted a novel approach to improve the performance
of adaptive mesh refinement (AMR) astrophysical simulations by a large factor
with the use of the graphic processing unit (GPU). The AMR implementation is
based on a hierarchy of grid patches with an oct-tree data structure. We adopt
a three-dimensional relaxing TVD scheme for the hydrodynamic solver, and a
multi-level relaxation scheme for the Poisson solver. Both solvers have been
implemented in GPU, by which hundreds of patches can be advanced in parallel.
The computational overhead associated with the data transfer between CPU and
GPU is carefully reduced by utilizing the capability of asynchronous memory
copies in GPU, and the computing time of the ghost-zone values for each patch
is hidden by overlapping it with the GPU computations. We demonstrate
the accuracy of the code by performing several standard test problems in
astrophysics. GAMER is a parallel code that can be run in a multi-GPU cluster
system. We measure the performance of the code by performing purely-baryonic
cosmological simulations in different hardware implementations, in which
detailed timing analyses provide comparison between the computations with and
without GPU(s) acceleration. Maximum speed-up factors of 12.19 and 10.47 are
demonstrated using 1 GPU with 4096^3 effective resolution and 16 GPUs with
8192^3 effective resolution, respectively.
Comment: 60 pages, 22 figures, 3 tables. More accuracy tests are included.
Accepted for publication in ApJ.
Plasma Physics Computations on Emerging Hardware Architectures
This thesis explores the potential of emerging hardware architectures to increase the impact of high performance computing in fusion plasma physics research. For next generation tokamaks like ITER, realistic simulations and data-processing tasks will become significantly more demanding of computational resources than current facilities. It is therefore essential to investigate how emerging hardware such as the graphics processing unit (GPU) and field-programmable gate array (FPGA) can provide the required computing power for large data-processing tasks and large scale simulations in plasma physics specific computations.
The use of emerging technology is investigated in three areas relevant to nuclear fusion: (i) a GPU is used to process the large amount of raw data produced by the synthetic aperture microwave imaging (SAMI) plasma diagnostic, (ii) the use of a GPU to accelerate the solution of the Bateman equations which model the evolution of nuclide number densities when subjected to neutron irradiation in tokamaks, and (iii) an FPGA-based dataflow engine is applied to compute massive matrix multiplications, a feature of many computational problems in fusion and more generally in scientific computing. The GPU data processing code for SAMI provides a 60x acceleration over the previous IDL-based code, enabling inter-shot analysis in future campaigns and the data-mining (and therefore analysis) of stored raw data from previous MAST campaigns. The feasibility of porting the whole Bateman solver to a GPU system is demonstrated and verified against the industry standard FISPACT code. Finally a dataflow approach to matrix multiplication is shown to provide a substantial acceleration compared to CPU-based approaches and, whilst not performing as well as a GPU for this particular problem, is shown to be much more energy efficient.
Emerging hardware technologies will no doubt continue to provide a positive contribution in terms of performance to many areas of fusion research and several exciting new developments are on the horizon with tighter integration of GPUs and FPGAs with their host central processor units. This should not only improve performance and reduce data transfer bottlenecks, but also allow more user-friendly programming tools to be developed. All of this has implications for ITER and beyond where emerging hardware technologies will no doubt provide the key to delivering the computing power required to handle the large amounts of data and more realistic simulations demanded by these complex systems.
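For readers unfamiliar with the Bateman equations named in this abstract, the simplest case is a linear chain 1 -> 2 -> 3 in which nuclide 3 is stable. The sketch below is not the thesis's GPU solver and does not model neutron irradiation; the function names, rates, and initial inventory are hypothetical. It evaluates the textbook closed-form solution for that chain and checks it against a plain explicit Euler integration of the same rate equations.

```python
import math


def bateman_two_step(n0, lam1, lam2, t):
    """Closed-form solution for the chain 1 -> 2 -> 3 (nuclide 3 stable).

    n0   : initial number of nuclide 1 (nuclides 2 and 3 start at zero)
    lam1 : removal rate of nuclide 1 (1/s)
    lam2 : removal rate of nuclide 2 (1/s); must differ from lam1
    """
    n1 = n0 * math.exp(-lam1 * t)
    n2 = n0 * lam1 / (lam2 - lam1) * (math.exp(-lam1 * t) - math.exp(-lam2 * t))
    n3 = n0 - n1 - n2  # the total number of nuclei is conserved
    return n1, n2, n3


def bateman_euler(n0, lam1, lam2, t, steps=100000):
    """Explicit Euler integration of the same system, as an independent check."""
    dt = t / steps
    n1, n2, n3 = n0, 0.0, 0.0
    for _ in range(steps):
        d1 = -lam1 * n1
        d2 = lam1 * n1 - lam2 * n2
        d3 = lam2 * n2
        n1, n2, n3 = n1 + dt * d1, n2 + dt * d2, n3 + dt * d3
    return n1, n2, n3


# Hypothetical inventory and rates, chosen only to exercise the functions.
exact = bateman_two_step(1.0e6, 0.3, 0.1, 5.0)
approx = bateman_euler(1.0e6, 0.3, 0.1, 5.0)
print(exact)
print(approx)
```

Realistic inventories couple many nuclides through a sparse rate matrix, which is where a matrix-based solver and GPU acceleration would plausibly become worthwhile.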
Gaussian Process Modeling for Upsampling Algorithms With Applications in Computer Vision and Computational Fluid Dynamics
Across a variety of fields, interpolation algorithms have been used to upsample low-resolution or coarse data fields. In this work, novel Gaussian Process based methods are employed to solve a variety of upsampling problems. Specifically, three applications are explored: coarse data prolongation in Adaptive Mesh Refinement (AMR) in the field of Computational Fluid Dynamics, accurate document image upsampling to enhance Optical Character Recognition (OCR) accuracy, and fast and accurate Single Image Super Resolution (SISR). For AMR, a new, efficient, and “3rd order accurate” algorithm called GP-AMR is presented. Next, a novel, non-zero mean, windowed GP model is generated to upsample low resolution document images to generate a higher OCR accuracy, when compared to the industry standard. Finally, a hybrid GP convolutional neural network algorithm is used to generate a computationally efficient and high quality SISR model.
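The basic idea of Gaussian Process upsampling can be shown with a minimal 1D sketch. It is a generic zero-mean GP regression with a squared-exponential kernel, not the GP-AMR algorithm or the windowed non-zero-mean model described in this abstract; the kernel choice, the length scale, the jitter value, and all function names are assumptions.

```python
import math


def sq_exp(a, b, length=1.0):
    """Squared-exponential covariance between two scalar locations."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gp_upsample(xs, ys, x_new, length=1.0, jitter=1e-10):
    """Zero-mean GP posterior mean at x_new given coarse samples (xs, ys)."""
    K = [[sq_exp(a, b, length) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    weights = solve(K, ys)  # weights = K^{-1} y
    return [sum(w * sq_exp(x, a, length) for w, a in zip(weights, xs))
            for x in x_new]


coarse_x = [0.0, 1.0, 2.0, 3.0, 4.0]
coarse_y = [math.sin(x) for x in coarse_x]
fine_x = [0.5 * i for i in range(9)]  # twice the sampling density
fine_y = gp_upsample(coarse_x, coarse_y, fine_x)
print([round(v, 3) for v in fine_y])
```

The posterior mean passes through the coarse samples (up to the jitter term) and fills in the new points from the kernel-weighted neighbours, which is the property that makes a GP usable as a prolongation or upsampling operator.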