819 research outputs found
Immersive ExaBrick: Visualizing Large AMR Data in the CAVE
Rendering large adaptive mesh refinement (AMR) data in real-time in virtual reality (VR) environments is a complex challenge that demands sophisticated techniques and tools. The proposed solution harnesses the ExaBrick framework and integrates it as a plugin in COVISE, a robust visualization system equipped with the VR-centric OpenCOVER render module. This setup enables direct navigation and interaction within the rendered volume in a VR environment. The user interface incorporates rendering options and functions, ensuring a smooth and interactive experience. We show that high-quality volume rendering of AMR data in VR environments at interactive rates is possible using GPUs.
Efficient GPU Offloading with OpenMP for a Hyperbolic Finite Volume Solver on Dynamically Adaptive Meshes
We identify and show how to overcome an OpenMP bottleneck in the administration of GPU memory. It arises for a wave equation solver on dynamically adaptive block-structured Cartesian meshes, which keeps all CPU threads busy and allows all of them to offload sets of patches to the GPU. Our studies show that multithreaded, concurrent, non-deterministic access to the GPU leads to performance breakdowns, since the GPU memory bookkeeping as offered through OpenMP's map clause, i.e., the allocation and freeing, becomes another runtime challenge besides expensive data transfer and actual computation. We therefore propose to retain the memory management responsibility on the host: a caching mechanism acquires memory on the accelerator for all CPU threads, keeps hold of this memory, and hands it out to the offloading threads upon demand. We show that this user-managed, CPU-based memory administration helps us to overcome the GPU memory bookkeeping bottleneck and speeds up the time-to-solution of Finite Volume kernels by more than an order of magnitude.
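The host-side caching mechanism described above can be sketched as a thread-safe pool that allocates each buffer size once and recycles it across offloading threads. This is a minimal illustrative sketch, not the paper's implementation: the class name and structure are hypothetical, and for a host-only example `::operator new` stands in where real offloading code would call `omp_target_alloc`/`omp_target_free`.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

// Host-managed cache of accelerator buffers: acquire memory once, keep
// hold of it, and hand it out to offloading CPU threads on demand,
// instead of letting every offload allocate and free through the map
// clause. Hypothetical sketch; ::operator new stands in for
// omp_target_alloc in this host-only version.
class DeviceBufferCache {
public:
    void* acquire(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto& free_list = free_[bytes];
        if (!free_list.empty()) {            // reuse a cached buffer
            void* p = free_list.back();
            free_list.pop_back();
            return p;
        }
        void* p = ::operator new(bytes);     // stand-in for omp_target_alloc
        owned_.push_back(p);
        return p;
    }

    // Return a buffer to the cache; nothing is freed until shutdown.
    void release(void* p, std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_[bytes].push_back(p);
    }

    ~DeviceBufferCache() {                   // free everything exactly once
        for (void* p : owned_) ::operator delete(p);
    }

private:
    std::mutex mutex_;
    std::map<std::size_t, std::vector<void*>> free_;  // size -> idle buffers
    std::vector<void*> owned_;                        // all allocations ever made
};
```

Because a released buffer is reused on the next `acquire` of the same size, steady-state offloading performs no allocator calls at all, which is the bottleneck the abstract targets.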
GPU-Native Adaptive Mesh Refinement with Application to Lattice Boltzmann Simulations
The Lattice Boltzmann Method (LBM) has garnered significant interest in General-Purpose Graphics Processing Unit (GPGPU) programming for computational fluid dynamics due to its straightforward GPU parallelization, and it could benefit greatly from Adaptive Mesh Refinement (AMR). AMR can assist in efficiently resolving flows with regions of interest requiring a high degree of resolution. An AMR scheme that could manage a computational mesh entirely on the GPU without intermediate data transfers to/from the host would provide a substantial speedup to GPU-accelerated solvers; however, implementations commonly employ CPU/hybrid frameworks instead, due to the lack of a recursive data structure. A block-based GPU-native algorithm is presented for AMR in the context of GPGPU programming and implemented in an open-source C++ code. The meshing code is equipped with a Lattice Boltzmann solver for assessing performance. Different AMR approaches and their implementation consequences are considered before a careful selection of data structures enabling efficient refinement and coarsening compatible with single-instruction-multiple-data architectures is detailed. Inter-level communication is achieved by tricubic interpolation and standard spatial averaging. Although the present open-source implementation is tailored for LBM simulations, the outlined grid refinement procedure is compatible with solvers for cell-centered block-structured grids. Link to repository: https://github.com/KhodrJ/AGAL. Comment: 30 pages, 16 figures
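A key point in the abstract is replacing a recursive tree with GPU-friendly data structures. One common way to do this, sketched below under assumed names (this is not the AGAL code), is a flat struct-of-arrays block table: parallel arrays indexed by block id, so kernels can sweep all blocks with coalesced, branch-light accesses, and refinement simply appends child blocks.

```cpp
#include <cstdint>
#include <vector>

// Flat, struct-of-arrays block table sketching a GPU-native block-based
// AMR layout: no recursive tree, just parallel arrays indexed by block id.
// Hypothetical names; refinement appends children and deactivates the
// parent, leaving leaves marked active. Links are plain indices (-1 = none).
struct BlockTable {
    std::vector<int>    level;   // refinement level per block
    std::vector<int>    parent;  // parent block id, -1 for a root block
    std::vector<int8_t> active;  // 1 = leaf block carrying solution data

    int add_block(int lvl, int par) {
        level.push_back(lvl);
        parent.push_back(par);
        active.push_back(1);
        return static_cast<int>(level.size()) - 1;
    }

    // Refine block b: deactivate it and append 8 children (3D octree split).
    void refine(int b) {
        active[b] = 0;
        for (int c = 0; c < 8; ++c)
            add_block(level[b] + 1, b);
    }
};
```

Because every field is a contiguous array, a refinement or coarsening pass maps naturally onto data-parallel primitives (mark, scan, compact) rather than pointer chasing.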
Matching non-uniformity for program optimizations on heterogeneous many-core systems
As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: the deepening non-uniform relations among the computing elements in both hardware and software. Besides traditional non-uniform memory accesses, much deeper non-uniformity shows in a processor, runtime, and application, exemplified by the asymmetric cache sharing, memory coalescing, and thread divergences on multicore and many-core processors. Being oblivious to the non-uniformity, current applications fail to tap into the full potential of modern computing devices. My research presents a systematic exploration into this emerging property. It examines the existence of such a property in modern computing, its influence on computing efficiency, and the challenges of establishing a non-uniformity-aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement, and controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise of these techniques in maximizing computing throughput, especially for programs with complex data access patterns.
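The "data reorganization to eliminate non-coalesced accesses" mentioned above is commonly realized as an array-of-structs to struct-of-arrays transformation: after the reshuffle, consecutive GPU threads reading the same field touch consecutive addresses, so the loads coalesce. A minimal host-side sketch, with illustrative names (the thesis's own transformation machinery is more general):

```cpp
#include <vector>

// AoS -> SoA reorganization sketch. In the AoS layout, thread i reading
// field x touches address i * sizeof(ParticleAoS) (strided, non-coalesced);
// in the SoA layout it touches soa.x[i] (contiguous, coalesced).
struct ParticleAoS { float x, y, z; };

struct ParticlesSoA {
    std::vector<float> x, y, z;
};

ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const auto& p : aos) {   // one pass: scatter each field to its array
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}
```

On a real device the reorganization cost is paid once (or hidden asynchronously, as the abstract suggests) and amortized over every subsequent kernel that reads the data.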
Performance Portable Solid Mechanics via Matrix-Free p-Multigrid
Finite element analysis of solid mechanics is a foundational tool of modern engineering, with low-order finite element methods and assembled sparse matrices representing the industry standard for implicit analysis. We use performance models and numerical experiments to demonstrate that high-order methods greatly reduce the costs to reach engineering tolerances while enabling effective use of GPUs. We demonstrate the reliability, efficiency, and scalability of matrix-free p-multigrid methods with algebraic multigrid coarse solvers through large deformation hyperelastic simulations of multiscale structures. We investigate accuracy, cost, and execution time on multi-node CPU and GPU systems for moderate to large models using AMD MI250X (OLCF Crusher), NVIDIA A100 (NERSC Perlmutter), and V100 (LLNL Lassen and OLCF Summit), resulting in order-of-magnitude efficiency improvements over a broad range of model properties and scales. We discuss efficient matrix-free representation of Jacobians and demonstrate how automatic differentiation enables rapid development of nonlinear material models without impacting debuggability and workflows targeting GPUs.
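The core idea of "matrix-free representation" above is to compute the action y = A x on the fly instead of assembling and storing a sparse A. As a minimal stand-in for the high-order hyperelastic Jacobians of the abstract, here is the matrix-free application of a 1D Poisson stencil with zero Dirichlet ends (an illustrative sketch, not the paper's operator):

```cpp
#include <cstddef>
#include <vector>

// Matrix-free operator application: y = A*x for the tridiagonal
// 1D Laplacian A = tridiag(-1, 2, -1) with zero Dirichlet boundaries,
// computed row-by-row without ever storing A. This is the pattern
// Krylov and multigrid smoother iterations need: only the action of A.
std::vector<double> apply_laplacian(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;  // zero Dirichlet at ends
        double right = (i + 1 < n) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;             // row [-1 2 -1] of A
    }
    return y;
}
```

For high-order elements the memory and bandwidth savings grow quickly, because the assembled matrix's per-row cost scales with the polynomial order while the operator action can stay close to the cost of the field data itself.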
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library. Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of the PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU"
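The frontier-centric abstraction described above can be illustrated with breadth-first search: each iteration expands the current vertex frontier along a CSR graph and emits the next frontier of unvisited neighbors. This serial sketch mirrors the spirit of Gunrock's advance operator but is not Gunrock's API; on the GPU the inner loops become data-parallel kernels with load-balanced edge expansion.

```cpp
#include <vector>

// Frontier-based BFS over a CSR graph (row_offsets/col_indices).
// Each while-iteration is one bulk-synchronous "advance": expand every
// vertex in the current frontier, mark first-time visits, and collect
// them into the next frontier. Returns per-vertex BFS levels (-1 = unreached).
std::vector<int> bfs_levels(const std::vector<int>& row_offsets,
                            const std::vector<int>& col_indices,
                            int source) {
    std::vector<int> level(row_offsets.size() - 1, -1);
    std::vector<int> frontier{source};
    level[source] = 0;
    int depth = 0;
    while (!frontier.empty()) {
        std::vector<int> next;
        for (int u : frontier)                              // expand frontier
            for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
                int v = col_indices[e];
                if (level[v] == -1) {                       // first visit wins
                    level[v] = depth + 1;
                    next.push_back(v);
                }
            }
        frontier.swap(next);                                // advance one level
        ++depth;
    }
    return level;
}
```

Expressing traversal as repeated frontier transformations is what lets a library like Gunrock share one set of tuned GPU primitives (expand, filter, compact) across many graph algorithms.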