29,238 research outputs found
Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application
Graphics Processing Units (GPUs) are becoming popular accelerators in modern
High-Performance Computing (HPC) clusters. Installing GPUs on each node of the
cluster is not efficient resulting in high costs and power consumption as well
as underutilisation of the accelerator. The research reported in this paper is
motivated towards the use of few physical GPUs by providing cluster nodes
access to remote GPUs on-demand for a financial risk application. We
hypothesise that sharing GPUs between several nodes, referred to as
multi-tenancy, reduces the execution time and energy consumed by an
application. Two data transfer modes between the CPU and the GPUs, namely
concurrent and sequential, are explored. The key result from the experiments is
that multi-tenancy with few physical GPUs using sequential data transfers
lowers the execution time and the energy consumed, thereby improving the
overall performance of the application.Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC),
10 June 201
Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs
The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well
suited to an efficient implementation for massively parallel computing, due to
the prevalence of local operations in the algorithm. This paper presents and
analyses the performance of a 3D lattice Boltzmann solver, optimized for third
generation nVidia GPU hardware, also known as `Kepler'. We provide a review of
previous optimisation strategies and analyse data read/write times for
different memory types. In LBM, the time propagation step (known as streaming),
involves shifting data to adjacent locations and is central to parallel
performance; here we examine three approaches which make use of different
hardware options. Two of which make use of `performance enhancing' features of
the GPU; shared memory and the new shuffle instruction found in Kepler based
GPUs. These are compared to a standard transfer of data which relies instead on
optimised storage to increase coalesced access. It is shown that the more
simple approach is most efficient; since the need for large numbers of
registers per thread in LBM limits the block size and thus the efficiency of
these special features is reduced. Detailed results are obtained for a D3Q19
LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter
case the use of a read-only data cache is explored, and peak performance of
over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The
appearance of a periodic bottleneck in the solver performance is also reported,
believed to be hardware related; spikes in iteration-time occur with a
frequency of around 11Hz for both GPUs, independent of the size of the problem.Comment: 12 page
A parallel priority queue with fast updates for GPU architectures
The high computational throughput of modern graphics processing units (GPUs)
make them the de-facto architecture for high-performance computing
applications. However, to achieve peak performance, GPUs require highly
parallel workloads, as well as memory access patterns that exhibit good
locality of reference. As a result, many state-of-the-art algorithms and data
structures designed for GPUs sacrifice work-optimality to achieve the necessary
parallelism. Furthermore, some abstract data types are avoided completely due
to there being no corresponding data structure that performs well on the GPU.
One such abstract data type is the priority queue. Many well-known algorithms
rely on priority queue operations as a building block. While various priority
queue structures have been developed that are parallel, cache-aware, or
cache-oblivious, none has been shown to be efficient on GPUs. In this paper, we
present the parBucketHeap, a parallel, cache-efficient data structure designed
for modern GPU architectures that supports standard priority queue operations,
as well as bulk update. We analyze the structure in several well-known
computational models and show that it provides both optimal parallelism and is
cache-efficient. We implement the parBucketHeap and, using it, we solve the
single-source shortest path (SSSP) problem. Experimental results indicate that,
for sufficiently large, dense graphs with high diameter, we out-perform current
state-of-the-art SSSP algorithms on the GPU by up to a factor of 5. Unlike
existing GPU SSSP algorithms, our approach is work-optimal and places
significantly less load on the GPU, reducing power consumption
An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters
Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi-GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPU) are now being augmented with multiple GPUs in each compute-node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes. In this study, we pursue mixed MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfer and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations
Dynamic Hardware Resource Management for Efficient Throughput Processing.
High performance computing is evolving at a rapid pace, with throughput oriented processors such as graphics processing units (GPUs), substituting for traditional processors as the computational workhorse. Their adoption has seen a tremendous increase as they provide high peak performance and energy efficiency while maintaining a friendly programming interface. Furthermore, many existing desktop, laptop, tablet, and smartphone systems support accelerating non-graphics, data parallel workloads on their GPUs. However, the
multitude of systems that use GPUs as an accelerator run different genres of data parallel applications, which have significantly contrasting runtime characteristics.
GPUs use thousands of identical threads to efficiently exploit the on-chip hardware resources. Therefore, if one thread uses a resource (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource. This contention will eventually saturate the performance of the GPU due to contention for the bottleneck resource,leaving other resources underutilized at the same time. Traditional policies of managing the massive hardware resources work adequately, on well designed traditional scientific
style applications. However, these static policies, which are oblivious to the application’s resource requirement, are not efficient for the large spectrum of data parallel workloads with varying resource requirements. Therefore, several standard hardware policies such as using maximum concurrency, fixed operational frequency and round-robin style scheduling are not efficient for modern GPU applications.
This thesis defines dynamic hardware resource management mechanisms which improve the efficiency of the GPU by regulating the hardware resources at runtime. The first step in successfully achieving this goal is to make the hardware aware of the application’s characteristics at runtime through novel counters and indicators. After this detection, dynamic hardware modulation provides opportunities for increased performance, improved energy consumption, or both, leading to efficient execution. The key mechanisms for modulating the hardware at runtime are dynamic frequency regulation, managing the amount of
concurrency, managing the order of execution among different threads and increasing cache utilization. The resultant increased efficiency will lead to improved energy consumption of the systems that utilize GPUs while maintaining or improving their performance.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113356/1/asethia_1.pd
- …