819 research outputs found
Immersive ExaBrick: Visualizing Large AMR Data in the CAVE
Rendering large adaptive mesh refinement (AMR) data in real-time in virtual reality (VR) environments is a complex challenge that demands sophisticated techniques and tools. The proposed solution harnesses the ExaBrick framework and integrates it as a plugin in COVISE, a robust visualization system equipped with the VR-centric OpenCOVER render module. This setup enables direct navigation and interaction within the rendered volume in a VR environment. The user interface incorporates rendering options and functions, ensuring a smooth and interactive experience. We show that high-quality volume rendering of AMR data in VR environments at interactive rates is possible using GPUs.
Efficient GPU Offloading with OpenMP for a Hyperbolic Finite Volume Solver on Dynamically Adaptive Meshes
We identify and show how to overcome an OpenMP bottleneck in the administration of GPU memory. It arises for a wave equation solver on dynamically adaptive block-structured Cartesian meshes, which keeps all CPU threads busy and allows all of them to offload sets of patches to the GPU. Our studies show that multithreaded, concurrent, non-deterministic access to the GPU leads to performance breakdowns, since the GPU memory bookkeeping as offered through OpenMP's map clause, i.e., the allocation and freeing, becomes another runtime challenge besides expensive data transfer and actual computation. We therefore propose to retain the memory management responsibility on the host: a caching mechanism acquires memory on the accelerator for all CPU threads, keeps hold of this memory, and hands it out to the offloading threads upon demand. We show that this user-managed, CPU-based memory administration helps us to overcome the GPU memory bookkeeping bottleneck and speeds up the time-to-solution of Finite Volume kernels by more than an order of magnitude.
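The host-side caching mechanism described above can be sketched as a thread-safe pool that allocates each buffer size once and recycles it across offloading threads. This is a minimal illustrative sketch, not the paper's implementation: the class name and structure are hypothetical, and for a host-only example `::operator new` stands in where real offloading code would call `omp_target_alloc`/`omp_target_free`.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

// Host-managed cache of accelerator buffers: acquire memory once, keep
// hold of it, and hand it out to offloading CPU threads on demand,
// instead of letting every offload allocate and free through the map
// clause. Hypothetical sketch; ::operator new stands in for
// omp_target_alloc in this host-only version.
class DeviceBufferCache {
public:
    void* acquire(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto& free_list = free_[bytes];
        if (!free_list.empty()) {            // reuse a cached buffer
            void* p = free_list.back();
            free_list.pop_back();
            return p;
        }
        void* p = ::operator new(bytes);     // stand-in for omp_target_alloc
        owned_.push_back(p);
        return p;
    }

    // Return a buffer to the cache; nothing is freed until shutdown.
    void release(void* p, std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_[bytes].push_back(p);
    }

    ~DeviceBufferCache() {                   // free everything exactly once
        for (void* p : owned_) ::operator delete(p);
    }

private:
    std::mutex mutex_;
    std::map<std::size_t, std::vector<void*>> free_;  // size -> idle buffers
    std::vector<void*> owned_;                        // all allocations ever made
};
```

Because a released buffer is reused on the next `acquire` of the same size, steady-state offloading performs no allocator calls at all, which is the bottleneck the abstract targets.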
GPU-Native Adaptive Mesh Refinement with Application to Lattice Boltzmann Simulations
The Lattice Boltzmann Method (LBM) has garnered significant interest in General-Purpose Graphics Processing Unit (GPGPU) programming for computational fluid dynamics due to its straightforward GPU parallelization, and it could benefit greatly from Adaptive Mesh Refinement (AMR). AMR can assist in efficiently resolving flows with regions of interest requiring a high degree of resolution. An AMR scheme that could manage a computational mesh entirely on the GPU without intermediate data transfers to/from the host would provide a substantial speedup to GPU-accelerated solvers; however, implementations commonly employ CPU/hybrid frameworks instead, due to the lack of a recursive data structure. A block-based GPU-native algorithm is presented for AMR in the context of GPGPU programming and implemented in an open-source C++ code. The meshing code is equipped with a Lattice Boltzmann solver for assessing performance. Different AMR approaches and their implementation consequences are considered before a careful selection of data structures enabling efficient refinement and coarsening compatible with single-instruction-multiple-data architectures is detailed. Inter-level communication is achieved by tricubic interpolation and standard spatial averaging. Although the present open-source implementation is tailored for LBM simulations, the outlined grid refinement procedure is compatible with solvers for cell-centered block-structured grids. Link to repository: https://github.com/KhodrJ/AGAL. Comment: 30 pages, 16 figures
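A key point in the abstract is replacing a recursive tree with GPU-friendly data structures. One common way to do this, sketched below under assumed names (this is not the AGAL code), is a flat struct-of-arrays block table: parallel arrays indexed by block id, so kernels can sweep all blocks with coalesced, branch-light accesses, and refinement simply appends child blocks.

```cpp
#include <cstdint>
#include <vector>

// Flat, struct-of-arrays block table sketching a GPU-native block-based
// AMR layout: no recursive tree, just parallel arrays indexed by block id.
// Hypothetical names; refinement appends children and deactivates the
// parent, leaving leaves marked active. Links are plain indices (-1 = none).
struct BlockTable {
    std::vector<int>    level;   // refinement level per block
    std::vector<int>    parent;  // parent block id, -1 for a root block
    std::vector<int8_t> active;  // 1 = leaf block carrying solution data

    int add_block(int lvl, int par) {
        level.push_back(lvl);
        parent.push_back(par);
        active.push_back(1);
        return static_cast<int>(level.size()) - 1;
    }

    // Refine block b: deactivate it and append 8 children (3D octree split).
    void refine(int b) {
        active[b] = 0;
        for (int c = 0; c < 8; ++c)
            add_block(level[b] + 1, b);
    }
};
```

Because every field is a contiguous array, a refinement or coarsening pass maps naturally onto data-parallel primitives (mark, scan, compact) rather than pointer chasing.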
Matching non-uniformity for program optimizations on heterogeneous many-core systems
As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: the deepening non-uniform relations among the computing elements in both hardware and software. Besides traditional non-uniform memory accesses, much deeper non-uniformity shows in a processor, runtime, and application, exemplified by the asymmetric cache sharing, memory coalescing, and thread divergences on multicore and many-core processors. Being oblivious to the non-uniformity, current applications fail to tap into the full potential of modern computing devices. My research presents a systematic exploration into this emerging property. It examines the existence of such a property in modern computing, its influence on computing efficiency, and the challenges of establishing a non-uniformity-aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement, and controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise of these techniques in maximizing computing throughput, especially for programs with complex data access patterns.
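The "data reorganization to eliminate non-coalesced accesses" mentioned above is commonly realized as an array-of-structs to struct-of-arrays transformation: after the reshuffle, consecutive GPU threads reading the same field touch consecutive addresses, so the loads coalesce. A minimal host-side sketch, with illustrative names (the thesis's own transformation machinery is more general):

```cpp
#include <vector>

// AoS -> SoA reorganization sketch. In the AoS layout, thread i reading
// field x touches address i * sizeof(ParticleAoS) (strided, non-coalesced);
// in the SoA layout it touches soa.x[i] (contiguous, coalesced).
struct ParticleAoS { float x, y, z; };

struct ParticlesSoA {
    std::vector<float> x, y, z;
};

ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const auto& p : aos) {   // one pass: scatter each field to its array
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}
```

On a real device the reorganization cost is paid once (or hidden asynchronously, as the abstract suggests) and amortized over every subsequent kernel that reads the data.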
Performance Portable Solid Mechanics via Matrix-Free p-Multigrid
Finite element analysis of solid mechanics is a foundational tool of modern engineering, with low-order finite element methods and assembled sparse matrices representing the industry standard for implicit analysis. We use performance models and numerical experiments to demonstrate that high-order methods greatly reduce the costs to reach engineering tolerances while enabling effective use of GPUs. We demonstrate the reliability, efficiency, and scalability of matrix-free p-multigrid methods with algebraic multigrid coarse solvers through large deformation hyperelastic simulations of multiscale structures. We investigate accuracy, cost, and execution time on multi-node CPU and GPU systems for moderate to large models using AMD MI250X (OLCF Crusher), NVIDIA A100 (NERSC Perlmutter), and V100 (LLNL Lassen and OLCF Summit), resulting in order-of-magnitude efficiency improvements over a broad range of model properties and scales. We discuss efficient matrix-free representation of Jacobians and demonstrate how automatic differentiation enables rapid development of nonlinear material models without impacting debuggability and workflows targeting GPUs.
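The core idea of "matrix-free representation" above is to compute the action y = A x on the fly instead of assembling and storing a sparse A. As a minimal stand-in for the high-order hyperelastic Jacobians of the abstract, here is the matrix-free application of a 1D Poisson stencil with zero Dirichlet ends (an illustrative sketch, not the paper's operator):

```cpp
#include <cstddef>
#include <vector>

// Matrix-free operator application: y = A*x for the tridiagonal
// 1D Laplacian A = tridiag(-1, 2, -1) with zero Dirichlet boundaries,
// computed row-by-row without ever storing A. This is the pattern
// Krylov and multigrid smoother iterations need: only the action of A.
std::vector<double> apply_laplacian(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double left  = (i > 0)     ? x[i - 1] : 0.0;  // zero Dirichlet at ends
        double right = (i + 1 < n) ? x[i + 1] : 0.0;
        y[i] = 2.0 * x[i] - left - right;             // row [-1 2 -1] of A
    }
    return y;
}
```

For high-order elements the memory and bandwidth savings grow quickly, because the assembled matrix's per-row cost scales with the polynomial order while the operator action can stay close to the cost of the field data itself.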
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based algorithms and ranking algorithms to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library. Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC), an extended version of the PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU"
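The frontier-centric abstraction described above can be illustrated with breadth-first search: each iteration expands the current vertex frontier along a CSR graph and emits the next frontier of unvisited neighbors. This serial sketch mirrors the spirit of Gunrock's advance operator but is not Gunrock's API; on the GPU the inner loops become data-parallel kernels with load-balanced edge expansion.

```cpp
#include <vector>

// Frontier-based BFS over a CSR graph (row_offsets/col_indices).
// Each while-iteration is one bulk-synchronous "advance": expand every
// vertex in the current frontier, mark first-time visits, and collect
// them into the next frontier. Returns per-vertex BFS levels (-1 = unreached).
std::vector<int> bfs_levels(const std::vector<int>& row_offsets,
                            const std::vector<int>& col_indices,
                            int source) {
    std::vector<int> level(row_offsets.size() - 1, -1);
    std::vector<int> frontier{source};
    level[source] = 0;
    int depth = 0;
    while (!frontier.empty()) {
        std::vector<int> next;
        for (int u : frontier)                              // expand frontier
            for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
                int v = col_indices[e];
                if (level[v] == -1) {                       // first visit wins
                    level[v] = depth + 1;
                    next.push_back(v);
                }
            }
        frontier.swap(next);                                // advance one level
        ++depth;
    }
    return level;
}
```

Expressing traversal as repeated frontier transformations is what lets a library like Gunrock share one set of tuned GPU primitives (expand, filter, compact) across many graph algorithms.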