1,919 research outputs found
Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines
In this paper, we address the problem of efficient execution of a computation
pattern, referred to here as the irregular wavefront propagation pattern
(IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in
several image processing operations. In the IWPP, data elements in the
wavefront propagate waves to their neighboring elements on a grid if a
propagation condition is satisfied. Elements receiving the propagated waves
become part of the wavefront. This pattern results in irregular data accesses
and computations. We develop and evaluate strategies for efficient computation
and propagation of wavefronts using a multi-level queue structure. This queue
structure improves the utilization of fast memories in a GPU and reduces
synchronization overheads. We also develop a tile-based parallelization
strategy to support execution on multiple CPUs and GPUs. We evaluate our
approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs
and 2 multicore CPUs) using the IWPP implementations of two widely used image
processing operations: morphological reconstruction and euclidean distance
transform. Our results show significant performance improvements on GPUs. The
use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x
with respect to single core CPU executions for morphological reconstruction and
euclidean distance transform, respectively.Comment: 37 pages, 16 figure
HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU
The end of Dennard scaling and the slowdown of Moore's law led to a shift in
technology trends toward parallel architectures, particularly in HPC systems.
To continue providing performance benefits, HPC should embrace Approximate
Computing (AC), which trades application quality loss for improved performance.
However, existing AC techniques have not been extensively applied and evaluated
in state-of-the-art hardware architectures such as GPUs, the primary execution
vehicle for HPC applications today.
This paper presents HPAC-Offload, a pragma-based programming model that
extends OpenMP offload applications to support AC techniques, allowing portable
approximations across different GPU architectures. We conduct a comprehensive
performance analysis of HPAC-Offload across GPU-accelerated HPC applications,
revealing that AC techniques can significantly accelerate HPC applications
(1.64x LULESH on AMD, 1.57x NVIDIA) with minimal quality loss (0.1%). Our
analysis offers deep insights into the performance of GPU-based AC that guide
the future development of AC algorithms and systems for these architectures.Comment: 12 pages, 12 pages. Accepted at SC2
Status and Future Perspectives for Lattice Gauge Theory Calculations to the Exascale and Beyond
In this and a set of companion whitepapers, the USQCD Collaboration lays out
a program of science and computing for lattice gauge theory. These whitepapers
describe how calculation using lattice QCD (and other gauge theories) can aid
the interpretation of ongoing and upcoming experiments in particle and nuclear
physics, as well as inspire new ones.Comment: 44 pages. 1 of USQCD whitepapers
Generalized Neighbor Search using Commodity Hardware Acceleration
Tree-based Nearest Neighbor Search (NNS) is hard to parallelize on GPUs.
However, newer Nvidia GPUs are equipped with Ray Tracing (RT) cores that can
build a spatial tree called Bounding Volume Hierarchy (BVH) to accelerate
graphics rendering. Recent work proposed using RT cores to implement NNS, but
they all have a hardware-imposed constraint on the type of distance metric,
which is the Euclidean distance. We propose and implement two approaches for
generalized distance computations: filter-refine, and monotone transformation,
each of which allows non-euclidean nearest neighbor queries to be performed in
terms of Euclidean distances. We find that our reductions improve the time
taken to perform distance computations during the search, thereby improving the
overall performance of the NNS
- …