Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application
Graphics Processing Units (GPUs) are becoming popular accelerators in modern
High-Performance Computing (HPC) clusters. Installing GPUs on every node of the
cluster is inefficient, resulting in high costs and power consumption as well
as underutilisation of the accelerators. The research reported in this paper is
motivated by the use of a small number of physical GPUs, with cluster nodes given
access to remote GPUs on demand, for a financial risk application. We
hypothesise that sharing GPUs between several nodes, referred to as
multi-tenancy, reduces the execution time and energy consumed by an
application. Two data transfer modes between the CPU and the GPUs, namely
concurrent and sequential, are explored. The key result from the experiments is
that multi-tenancy with few physical GPUs using sequential data transfers
lowers the execution time and the energy consumed, thereby improving the
overall performance of the application.
Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC), 10 June 201
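The abstract contrasts sequential and concurrent CPU-GPU data transfer modes. As a rough illustration only (not the paper's code, which targets remote virtual GPUs shared between cluster nodes), the following CUDA sketch shows the two modes on a single local device; the buffer size, the number of streams, and the scale kernel are assumptions made for the example.

```cuda
// Hedged sketch (not the paper's code): contrasting sequential and concurrent
// host-to-device transfers with CUDA streams. Buffer size, chunk count and the
// trivial scale kernel are illustrative assumptions only.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *d, size_t n, float f) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const size_t N = 1 << 24;                // ~16M floats (assumed size)
    const int CHUNKS = 4;                    // number of streams (assumed)
    const size_t CHUNK = N / CHUNKS;

    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned memory, required for async copies
    cudaMalloc(&d, N * sizeof(float));

    // Sequential mode: one blocking copy, then compute.
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(N + 255) / 256, 256>>>(d, N, 2.0f);
    cudaDeviceSynchronize();

    // Concurrent mode: chunked async copies overlapped with kernels on streams.
    cudaStream_t s[CHUNKS];
    for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&s[c]);
    for (int c = 0; c < CHUNKS; ++c) {
        size_t off = c * CHUNK;
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        scale<<<(CHUNK + 255) / 256, 256, 0, s[c]>>>(d + off, CHUNK, 2.0f);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(s[c]);
    cudaFree(d);
    cudaFreeHost(h);
    printf("done\n");
    return 0;
}
```

In the concurrent mode the chunked asynchronous copies on separate streams can overlap with kernel execution, which is the general idea behind overlapping transfers with computation.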
A Survey of Methods For Analyzing and Improving GPU Energy Efficiency
Recent years have witnessed a phenomenal growth in the computational
capabilities and applications of GPUs. However, this trend has also led to a
dramatic increase in their power consumption. This paper surveys research works
on analyzing and improving energy efficiency of GPUs. It also provides a
classification of these techniques on the basis of their main research idea.
Further, it attempts to synthesize research works which compare energy
efficiency of GPUs with other computing systems, e.g., FPGAs and CPUs. The aim
of this survey is to provide researchers with knowledge of the state of the art in
GPU power management and to motivate them to architect highly energy-efficient
GPUs of tomorrow.
Comment: Accepted with minor revision in ACM Computing Surveys (impact factor 3.85, five-year impact factor 7.85).
Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs
Convolution is a fundamental operation in many applications, such as computer
vision, natural language processing, image processing, etc. Recent successes of
convolutional neural networks in various deep learning applications place even
higher demands on fast convolution. The high computation throughput and memory
bandwidth of graphics processing units (GPUs) make them a natural choice for
accelerating convolution operations. However, maximally exploiting the
available memory bandwidth of GPUs for convolution is a challenging task. This
paper introduces a general model to address the mismatch between the memory
bank width of GPUs and computation data width of threads. Based on this model,
we develop two convolution kernels, one for the general case and the other for
a special case with one input channel. By carefully optimizing memory access
patterns and computation patterns, we design a communication-optimized kernel
for the special case and a communication-reduced kernel for the general case.
Experimental data based on implementations on Kepler GPUs show that our kernels
achieve 5.16X and 35.5% average performance improvement over the latest cuDNN
library, for the special case and the general case, respectively.
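As a rough, hedged baseline for the kind of memory-access pattern such convolution kernels start from, here is a plain shared-memory tiled 1D convolution in CUDA; the filter radius, tile size, and all names are assumptions, and the sketch does not implement the paper's model for matching memory bank width to per-thread data width.

```cuda
// Hedged sketch, not the paper's kernels: a basic shared-memory tiled 1D
// convolution. Filter radius, tile size and names are illustrative assumptions.
#include <cuda_runtime.h>

#define RADIUS 3          // assumed filter radius
#define TILE   256        // assumed threads per block (blockDim.x == TILE)

// Filter coefficients, filled from the host via cudaMemcpyToSymbol.
__constant__ float c_filter[2 * RADIUS + 1];

__global__ void conv1d(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + RADIUS;

    // Stage the tile plus its halo into shared memory, zero-padding the borders.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;
        int right = gid + TILE;
        tile[lid - RADIUS] = (left >= 0) ? in[left]  : 0.0f;
        tile[lid + TILE]   = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[lid + k] * c_filter[k + RADIUS];
        out[gid] = acc;
    }
}
```

The paper's kernels go further than this baseline by reorganising such accesses so that the shared-memory bank width and the data width processed by each thread line up.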
Tree Contraction, Connected Components, Minimum Spanning Trees: a GPU Path to Vertex Fitting
Standard parallel computing operations are considered in the context of algorithms for solving 3D graph problems which have applications, e.g., in vertex finding in HEP. Exploiting GPUs for tree-accumulation and graph algorithms is challenging: GPUs offer extreme computational power and high memory-access bandwidth, but their model of fine-grained parallelism may not suit the irregular distribution of linked representations of graph data structures. Achieving data-race-free computations may demand serialization through atomic transactions, inevitably producing poor parallel performance. A Minimum Spanning Tree algorithm for GPUs is presented, its implementation discussed, and its efficiency evaluated on GPU and multicore architectures.
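The remark about atomic transactions can be made concrete with a small, hedged CUDA sketch (not the paper's algorithm): one atomicMin-based label-propagation pass over an edge list, a building block commonly used in GPU connected-components and Borůvka-style minimum-spanning-tree codes. The edge-list layout and all names are assumptions.

```cuda
// Hedged sketch, not the paper's algorithm: one label-propagation pass over an
// edge list. Repeated until no label changes, it assigns every vertex the
// minimum label in its connected component. Names and layout are assumptions.
#include <cuda_runtime.h>

__global__ void propagate(const int *src, const int *dst, int m,
                          int *label, int *changed) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= m) return;

    int a = label[src[e]];
    int b = label[dst[e]];
    if (a == b) return;

    // Atomically pull the higher-labelled endpoint down to the lower label;
    // the atomic is what serializes conflicting updates from different edges.
    if (a < b) {
        if (atomicMin(&label[dst[e]], a) > a) *changed = 1;
    } else {
        if (atomicMin(&label[src[e]], b) > b) *changed = 1;
    }
}
```

A host loop would relaunch this kernel, clearing changed before each pass, until no label changes; every conflicting update goes through the atomic, which is exactly the serialization the abstract identifies as a performance risk.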
Towards High-Level Programming of Multi-GPU Systems Using the SkelCL Library
Application programming for GPUs (Graphics Processing Units) is complex and error-prone, because the popular approaches, CUDA and OpenCL, are intrinsically low-level and offer no special support for systems consisting of multiple GPUs. The SkelCL library presented in this paper is built on top of the OpenCL standard and offers pre-implemented recurring computation and communication patterns (skeletons) which greatly simplify programming for multi-GPU systems. The library also provides an abstract vector data type and a high-level data (re)distribution mechanism to shield the programmer from the low-level data transfers between the system's main memory and multiple GPUs. In this paper, we focus on the specific support in SkelCL for systems with multiple GPUs and use a real-world application study from the area of medical imaging to demonstrate the reduced programming effort and competitive performance of SkelCL compared to OpenCL and CUDA. In addition, we illustrate how SkelCL adapts to large-scale, distributed heterogeneous systems in order to simplify their programming.
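SkelCL itself builds on OpenCL, so the following sketch is hedged and illustrative only: written in CUDA purely for brevity, with all names and sizes assumed, it shows the kind of per-device allocation, partitioning, and transfer boilerplate that an abstract vector type with a high-level (re)distribution mechanism is meant to hide.

```cuda
// Hedged sketch, not SkelCL code: manual multi-GPU bookkeeping (per-device
// allocation, data partitioning, copies and launches). Device-count handling,
// sizes and the square kernel are assumptions.
#include <cuda_runtime.h>
#include <vector>

__global__ void square(float *d, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= d[i];
}

void square_on_all_gpus(std::vector<float> &host) {
    int devices = 0;
    cudaGetDeviceCount(&devices);
    size_t chunk = host.size() / devices;          // assume it divides evenly

    std::vector<float*> dptr(devices);
    for (int d = 0; d < devices; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&dptr[d], chunk * sizeof(float));
        cudaMemcpy(dptr[d], host.data() + d * chunk,
                   chunk * sizeof(float), cudaMemcpyHostToDevice);
        square<<<(chunk + 255) / 256, 256>>>(dptr[d], chunk);
    }
    for (int d = 0; d < devices; ++d) {
        cudaSetDevice(d);
        cudaMemcpy(host.data() + d * chunk, dptr[d],
                   chunk * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dptr[d]);
    }
}
```

Each device gets an even share of the vector with its own copies and launch; a skeleton library would instead express the same computation as a single map over a distributed vector.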
