1,095 research outputs found
A complete and efficient CUDA-sharing solution for HPC clusters
In this paper we detail the key features, architectural design, and implementation of rCUDA,
an advanced framework to enable remote and transparent GPGPU acceleration in HPC
clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators,
which brings enhanced flexibility to cluster configurations. This opens the door to configurations
with fewer accelerators than nodes, as well as permits a single node to exploit the
whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly
interact with any GPU in the cluster, independently of its physical location. Thus,
GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU
servers, depending on the cluster administrator’s policy. This proposal leads to savings not
only in space but also in energy, acquisition, and maintenance costs. The performance evaluation
in this paper with a series of benchmarks and a production application clearly demonstrates
the viability of this proposal. Concretely, experiments with the matrix–matrix
product reveal excellent performance compared with regular executions on the local
GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up
to 11x speedup employing 8 remote accelerators from a single node with respect to a
12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration
in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization
frameworks are also evaluated.
2014 Elsevier B.V. All rights reserved.This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-004-01. It was also supported by MINECO, FEDER funds, under Grant TIN2011-23283, and by the Fundacion Caixa-Castello Bancaixa, Grant P11B2013-21. This work was also supported in part by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. Authors are grateful for the generous support provided by Mellanox Technologies to the rCUDA Project. The authors would also like to thank Adrian Castello, member of The rCUDA Development Team, for his hard work on rCUDA.Peña Monferrer, AJ.; Reaño González, C.; Silla JimĂ©nez, F.; Mayo Gual, R.; Quintana-Orti, ES.; Duato MarĂn, JF. (2014). A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Computing. 40(10):574-588. https://doi.org/10.1016/j.parco.2014.09.011S574588401
Acceleration-as-a-Service: Exploiting Virtualised GPUs for a Financial Application
'How can GPU acceleration be obtained as a service in a cluster?' This
question has become increasingly significant due to the inefficiency of
installing GPUs on all nodes of a cluster. The research reported in this paper
is motivated to address the above question by employing rCUDA (remote CUDA), a
framework that facilitates Acceleration-as-a-Service (AaaS), such that the
nodes of a cluster can request the acceleration of a set of remote GPUs on
demand. The rCUDA framework exploits virtualisation and ensures that multiple
nodes can share the same GPU. In this paper we test the feasibility of the
rCUDA framework on a real-world application employed in the financial risk
industry that can benefit from AaaS in the production setting. The results
confirm the feasibility of rCUDA and highlight that rCUDA achieves similar
performance compared to CUDA, provides consistent results, and more
importantly, allows for a single application to benefit from all the GPUs
available in the cluster without loosing efficiency.Comment: 11th IEEE International Conference on eScience (IEEE eScience) -
Munich, Germany, 201
Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application
Graphics Processing Units (GPUs) are becoming popular accelerators in modern
High-Performance Computing (HPC) clusters. Installing GPUs on each node of the
cluster is not efficient resulting in high costs and power consumption as well
as underutilisation of the accelerator. The research reported in this paper is
motivated towards the use of few physical GPUs by providing cluster nodes
access to remote GPUs on-demand for a financial risk application. We
hypothesise that sharing GPUs between several nodes, referred to as
multi-tenancy, reduces the execution time and energy consumed by an
application. Two data transfer modes between the CPU and the GPUs, namely
concurrent and sequential, are explored. The key result from the experiments is
that multi-tenancy with few physical GPUs using sequential data transfers
lowers the execution time and the energy consumed, thereby improving the
overall performance of the application.Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC),
10 June 201
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging
from traditional multi-core CPU processors, supporting a wide class of
applications but delivering moderate computing performance, to many-core GPUs,
exploiting aggressive data-parallelism and delivering higher performances for
streaming computing applications. In this scenario, code portability (and
performance portability) become necessary for easy maintainability of
applications; this is very relevant in scientific computing where code changes
are very frequent, making it tedious and prone to error to keep different code
versions aligned. In this work we present the design and optimization of a
state-of-the-art production-level LQCD Monte Carlo application, using the
directive-based OpenACC programming model. OpenACC abstracts parallel
programming to a descriptive level, relieving programmers from specifying how
codes should be mapped onto the target architecture. We describe the
implementation of a code fully written in OpenACC, and show that we are able to
target several different architectures, including state-of-the-art traditional
CPUs and GPUs, with the same code. We also measure performance, evaluating the
computing efficiency of our OpenACC code on several architectures, comparing
with GPU-specific implementations and showing that a good level of
performance-portability can be reached.Comment: 26 pages, 2 png figures, preprint of an article submitted for
consideration in International Journal of Modern Physics
- …