2,053 research outputs found
Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application
Graphics Processing Units (GPUs) are becoming popular accelerators in modern
High-Performance Computing (HPC) clusters. Installing GPUs on each node of the
cluster is not efficient resulting in high costs and power consumption as well
as underutilisation of the accelerator. The research reported in this paper is
motivated towards the use of few physical GPUs by providing cluster nodes
access to remote GPUs on-demand for a financial risk application. We
hypothesise that sharing GPUs between several nodes, referred to as
multi-tenancy, reduces the execution time and energy consumed by an
application. Two data transfer modes between the CPU and the GPUs, namely
concurrent and sequential, are explored. The key result from the experiments is
that multi-tenancy with few physical GPUs using sequential data transfers
lowers the execution time and the energy consumed, thereby improving the
overall performance of the application.Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC),
10 June 201
Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics
Graphics Processing Units (GPUs) are having a transformational effect on
numerical lattice quantum chromodynamics (LQCD) calculations of importance in
nuclear and particle physics. The QUDA library provides a package of mixed
precision sparse matrix linear solvers for LQCD applications, supporting single
GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This
library, interfaced to the QDP++/Chroma framework for LQCD calculations, is
currently in production use on the "9g" cluster at the Jefferson Laboratory,
enabling unprecedented price/performance for a range of problems in LQCD.
Nevertheless, memory constraints on current GPU devices limit the problem sizes
that can be tackled. In this contribution we describe the parallelization of
the QUDA library onto multiple GPUs using MPI, including strategies for the
overlapping of communication and computation. We report on both weak and strong
scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in
excess of 4 Tflops.Comment: 11 pages, 7 figures, to appear in the Proceedings of Supercomputing
2010 (submitted April 12, 2010
The End of Slow Networks: It's Time for a Redesign
Next generation high-performance RDMA-capable networks will require a
fundamental rethinking of the design and architecture of modern distributed
DBMSs. These systems are commonly designed and optimized under the assumption
that the network is the bottleneck: the network is slow and "thin", and thus
needs to be avoided as much as possible. Yet this assumption no longer holds
true. With InfiniBand FDR 4x, the bandwidth available to transfer data across
network is in the same ballpark as the bandwidth of one memory channel, and it
increases even further with the most recent EDR standard. Moreover, with the
increasing advances of RDMA, the latency improves similarly fast. In this
paper, we first argue that the "old" distributed database design is not capable
of taking full advantage of the network. Second, we propose architectural
redesigns for OLTP, OLAP and advanced analytical frameworks to take better
advantage of the improved bandwidth, latency and RDMA capabilities. Finally,
for each of the workload categories, we show that remarkable performance
improvements can be achieved
QPACE 2 and Domain Decomposition on the Intel Xeon Phi
We give an overview of QPACE 2, which is a custom-designed supercomputer
based on Intel Xeon Phi processors, developed in a collaboration of Regensburg
University and Eurotech. We give some general recommendations for how to write
high-performance code for the Xeon Phi and then discuss our implementation of a
domain-decomposition-based solver and present a number of benchmarks.Comment: plenary talk at Lattice 2014, to appear in the conference proceedings
PoS(LATTICE2014), 15 pages, 9 figure
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes. This
coupled with new application workloads brought forward by Deep Learning
frameworks like Caffe and Microsoft CNTK pose additional design constraints due
to very large message communication of GPU buffers during the training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK.Comment: 8 pages, 3 figure
Acceleration-as-a-Service: Exploiting Virtualised GPUs for a Financial Application
'How can GPU acceleration be obtained as a service in a cluster?' This
question has become increasingly significant due to the inefficiency of
installing GPUs on all nodes of a cluster. The research reported in this paper
is motivated to address the above question by employing rCUDA (remote CUDA), a
framework that facilitates Acceleration-as-a-Service (AaaS), such that the
nodes of a cluster can request the acceleration of a set of remote GPUs on
demand. The rCUDA framework exploits virtualisation and ensures that multiple
nodes can share the same GPU. In this paper we test the feasibility of the
rCUDA framework on a real-world application employed in the financial risk
industry that can benefit from AaaS in the production setting. The results
confirm the feasibility of rCUDA and highlight that rCUDA achieves similar
performance compared to CUDA, provides consistent results, and more
importantly, allows for a single application to benefit from all the GPUs
available in the cluster without loosing efficiency.Comment: 11th IEEE International Conference on eScience (IEEE eScience) -
Munich, Germany, 201
- …