697 research outputs found
GPUs as Storage System Accelerators
Massively multicore processors, such as Graphics Processing Units (GPUs),
provide, at a comparable price, an order of magnitude higher peak
performance than traditional CPUs. This drop in the cost of computation, as any
order-of-magnitude drop in the cost per unit of performance for a class of
system components, triggers the opportunity to redesign systems and to explore
new ways to engineer them to recalibrate the cost-to-performance relation. This
project explores the feasibility of harnessing GPUs' computational power to
improve the performance, reliability, or security of distributed storage
systems. In this context, we present the design of a storage system prototype
that uses GPU offloading to accelerate a number of computationally intensive
primitives based on hashing, and introduce techniques to efficiently leverage
the processing power of GPUs. We evaluate the performance of this prototype
under two configurations: as a content addressable storage system that
facilitates online similarity detection between successive versions of the same
file and as a traditional system that uses hashing to preserve data integrity.
Further, we evaluate the impact of offloading to the GPU on competing
applications' performance. Our results show that this technique can bring
tangible performance gains without negatively impacting the performance of
concurrently running applications. Comment: IEEE Transactions on Parallel and Distributed Systems, 201
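The hashing-based similarity detection mentioned in this abstract can be illustrated with a minimal sketch. It is not the prototype's implementation: the fixed-size chunking, the SHA-256 digest, the 4096-byte chunk size, and the function names `chunk_hashes` and `similarity` are all assumptions chosen for illustration, and the hashing here runs on the CPU rather than being offloaded to a GPU.

```python
import hashlib


def chunk_hashes(data: bytes, chunk_size: int = 4096) -> list[str]:
    """Split data into fixed-size chunks and hash each one (assumed scheme)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def similarity(old: bytes, new: bytes, chunk_size: int = 4096) -> float:
    """Fraction of the new version's chunks already present in the old one."""
    old_set = set(chunk_hashes(old, chunk_size))
    new_hashes = chunk_hashes(new, chunk_size)
    if not new_hashes:
        return 1.0  # an empty new version adds nothing that needs storing
    shared = sum(1 for h in new_hashes if h in old_set)
    return shared / len(new_hashes)


# Two versions of a hypothetical file that share their first chunk only.
v1 = b"a" * 4096 + b"b" * 4096
v2 = b"a" * 4096 + b"c" * 4096
ratio = similarity(v1, v2)
```

In a content addressable store, chunks whose digests already exist need not be written again, so the digest computation becomes the dominant cost. That is the kind of primitive the abstract proposes to offload.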
PoCL-R: An Open Standard Based Offloading Layer for Heterogeneous Multi-Access Edge Computing with Server Side Scalability
We propose a novel computing runtime that exposes remote compute devices via
the cross-vendor open heterogeneous computing standard OpenCL and can execute
compute tasks on the MEC cluster side across multiple servers in a scalable
manner. Intermittent UE connection loss is handled gracefully even if the
device's IP address changes on the way. Network-induced latency is minimized by
transferring data and signaling command completions between remote devices in a
peer-to-peer fashion directly to the target server with a streamlined TCP-based
protocol that yields a command latency of only 60 microseconds on top of
network round-trip latency in synthetic benchmarks. The runtime can utilize
RDMA to speed up inter-server data transfers by an additional 60% compared to
the TCP-based solution. The benefits of the proposed runtime in MEC
applications are demonstrated with a smartphone-based augmented reality
rendering case study. Measurements show up to 19x improvements to frame rate
and 17x improvements to local energy consumption when using the proposed
runtime to offload AR rendering from a smartphone. Scalability to multiple GPU
servers in real-world applications is shown in a computational fluid dynamics
simulation, which scales with the number of servers at roughly 80% efficiency,
which is comparable to an MPI port of the same simulation. Comment: 13 pages, 17 figure
Accelerating sequential programs using FastFlow and self-offloading
FastFlow is a programming environment specifically targeting cache-coherent
shared-memory multi-cores. FastFlow is implemented as a stack of C++ template
libraries built on top of lock-free (fence-free) synchronization mechanisms. In
this paper we present a further evolution of FastFlow enabling programmers to
offload part of their workload on a dynamically created software accelerator
running on unused CPUs. The offloaded function can be easily derived from
pre-existing sequential code. We emphasize in particular the effective
trade-off between human productivity and execution efficiency of the approach. Comment: 17 pages + cove
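The self-offloading pattern described in this abstract can be sketched in Python as follows. This is only a structural analogy and not FastFlow's API: FastFlow is a C++ template library built on lock-free queues, whereas the hypothetical `Accelerator` class below uses Python's lock-based `queue.Queue` and threads, which do not give true CPU parallelism under the interpreter lock. The method names `offload` and `wait` are assumptions made for illustration.

```python
import queue
import threading


class Accelerator:
    """Hypothetical software accelerator: worker threads fed through a queue."""

    def __init__(self, fn, workers=2):
        self._fn = fn
        self._tasks = queue.Queue()
        self._results = queue.Queue()
        self._threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(workers)
        ]
        for t in self._threads:
            t.start()

    def _run(self):
        # Each worker pulls tasks until it receives the None sentinel.
        while True:
            item = self._tasks.get()
            if item is None:
                break
            idx, arg = item
            self._results.put((idx, self._fn(arg)))

    def offload(self, idx, arg):
        """Hand one unit of work to the accelerator without blocking."""
        self._tasks.put((idx, arg))

    def wait(self):
        """Stop the workers and return results ordered by task index."""
        for _ in self._threads:
            self._tasks.put(None)
        for t in self._threads:
            t.join()
        out = []
        while not self._results.empty():
            out.append(self._results.get())
        return [result for _, result in sorted(out)]


def square(x):
    # Stands in for a function lifted from pre-existing sequential code.
    return x * x


acc = Accelerator(square)
for i in range(5):
    acc.offload(i, i)  # the sequential loop body becomes an offload call
squares = acc.wait()
```

The point of the sketch is the shape of the transformation: the body of a sequential loop is moved into a function, the loop submits arguments instead of computing, and the caller collects results once the accelerator has drained its queue.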
Enabling GPU Accelerated Computing in the SUNDIALS Time Integration Library
As part of the Exascale Computing Project (ECP), a recent focus of
development efforts for the SUite of Nonlinear and DIfferential/ALgebraic
equation Solvers (SUNDIALS) has been to enable GPU-accelerated time integration
in scientific applications at extreme scales. This effort has resulted in
several new GPU-enabled implementations of core SUNDIALS data structures,
support for programming paradigms which are aware of the heterogeneous
architectures, and the introduction of utilities to provide new points of
flexibility. In this paper, we discuss our considerations, both internal and
external, when designing these new features and present the features
themselves. We also present performance results for several of the features on
the Summit supercomputer and early access hardware for the Frontier
supercomputer, which demonstrate negligible performance overhead resulting from
the additional infrastructure and significant speedups when using both NVIDIA
and AMD GPUs.
Exploring Fully Offloaded GPU Stream-Aware Message Passing
Modern heterogeneous supercomputing systems are comprised of CPUs, GPUs, and
high-speed network interconnects. Communication libraries supporting efficient
data transfers involving memory buffers from the GPU memory typically require
the CPU to orchestrate the data transfer operations. A new offload-friendly
communication strategy, stream-triggered (ST) communication, was explored to
allow offloading the synchronization and data movement operations from the CPU
to the GPU. A Message Passing Interface (MPI) one-sided active target
synchronization based implementation was used as an exemplar to illustrate the
proposed strategy. A latency-sensitive nearest neighbor microbenchmark was used
to explore the various performance aspects of the implementation. The offloaded
implementation shows significant on-node performance advantages over standard
MPI active RMA (36%) and point-to-point (61%) communication. The current
multi-node improvement is less (23% faster than standard active RMA but 11%
slower than point-to-point), but plans are in progress to pursue further
improvements. Comment: 12 pages, 17 figure
ACCL+: an FPGA-Based Collective Engine for Distributed Applications
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs
or network-attached accelerators. Despite their potential, developing
distributed FPGA-accelerated applications remains cumbersome due to the lack of
appropriate infrastructure and communication abstractions. To facilitate the
development of distributed applications with FPGAs, in this paper we propose
ACCL+, an open-source versatile FPGA-based collective communication library.
Portable across different platforms and supporting UDP, TCP, as well as RDMA,
ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective
communication. Additionally, it can serve as a collective offload engine for
CPU applications, freeing the CPU from networking tasks. It is user-extensible,
allowing new collectives to be implemented and deployed without having to
re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100
Gb/s networking, comparing its performance against software MPI over RDMA. The
results demonstrate ACCL+'s significant advantages for FPGA-based distributed
applications and highly competitive performance for CPU applications. We
showcase ACCL+'s dual role with two use cases: seamlessly integrating as a
collective offload engine to distribute CPU-based vector-matrix multiplication,
and serving as a crucial and efficient component in designing fully FPGA-based
distributed deep-learning recommendation inference.