6,562 research outputs found
GPUs as Storage System Accelerators
Massively multicore processors, such as Graphics Processing Units (GPUs),
provide, at a comparable price, a one order of magnitude higher peak
performance than traditional CPUs. This drop in the cost of computation, as any
order-of-magnitude drop in the cost per unit of performance for a class of
system components, triggers the opportunity to redesign systems and to explore
new ways to engineer them to recalibrate the cost-to-performance relation. This
project explores the feasibility of harnessing GPUs' computational power to
improve the performance, reliability, or security of distributed storage
systems. In this context, we present the design of a storage system prototype
that uses GPU offloading to accelerate a number of computationally intensive
primitives based on hashing, and introduce techniques to efficiently leverage
the processing power of GPUs. We evaluate the performance of this prototype
under two configurations: as a content addressable storage system that
facilitates online similarity detection between successive versions of the same
file and as a traditional system that uses hashing to preserve data integrity.
Further, we evaluate the impact of offloading to the GPU on competing
applications' performance. Our results show that this technique can bring
tangible performance gains without negatively impacting the performance of
concurrently running applications.Comment: IEEE Transactions on Parallel and Distributed Systems, 201
An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor
This paper reports the implementation and performance evaluation of the
OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the
Parallella single-board computer. The Epiphany architecture exhibits massive
many-core scalability with a physically compact 2D array of RISC CPU cores and
a fast network-on-chip (NoC). While fully capable of MPMD execution, the
physical topology and memory-mapped capabilities of the core and network
translate well to Partitioned Global Address Space (PGAS) programming models
and SPMD execution with SHMEM.Comment: 14 pages, 9 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and
Related Technologie
The impact of global communication latency at extreme scales on Krylov methods
Krylov Subspace Methods (KSMs) are popular numerical tools for solving large linear systems of equations. We consider their role in solving sparse systems on future massively parallel distributed memory machines, by estimating future performance of their constituent operations. To this end we construct a model that is simple, but which takes topology and network acceleration into account as they are important considerations. We show that, as the number of nodes of a parallel machine increases to very large numbers, the increasing latency cost of reductions may well become a problematic bottleneck for traditional formulations of these methods. Finally, we discuss how pipelined KSMs can be used to tackle the potential problem, and appropriate pipeline depths
A Sparse SCF algorithm and its parallel implementation: Application to DFTB
We present an algorithm and its parallel implementation for solving a self
consistent problem as encountered in Hartree Fock or Density Functional Theory.
The algorithm takes advantage of the sparsity of matrices through the use of
local molecular orbitals. The implementation allows to exploit efficiently
modern symmetric multiprocessing (SMP) computer architectures. As a first
application, the algorithm is used within the density functional based tight
binding method, for which most of the computational time is spent in the linear
algebra routines (diagonalization of the Fock/Kohn-Sham matrix). We show that
with this algorithm (i) single point calculations on very large systems
(millions of atoms) can be performed on large SMP machines (ii) calculations
involving intermediate size systems (1~000--100~000 atoms) are also strongly
accelerated and can run efficiently on standard servers (iii) the error on the
total energy due to the use of a cut-off in the molecular orbital coefficients
can be controlled such that it remains smaller than the SCF convergence
criterion.Comment: 13 pages, 11 figure
Recommended from our members
UPC++ v1.0 Programmer’s Guide, Revision 2020.3.0
UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores
Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations
AbstractSimulation of in vivo cellular processes with the reaction–diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems
- …