GPUs as Storage System Accelerators
Massively multicore processors, such as Graphics Processing Units (GPUs), offer roughly an order of magnitude higher peak performance than traditional CPUs at a comparable price. As with any order-of-magnitude drop in the cost per unit of performance for a class of system components, this drop in the cost of computation creates an opportunity to redesign systems and to explore new ways of engineering them that recalibrate the cost-to-performance relation. This
project explores the feasibility of harnessing GPUs' computational power to
improve the performance, reliability, or security of distributed storage
systems. In this context, we present the design of a storage system prototype
that uses GPU offloading to accelerate a number of computationally intensive
primitives based on hashing, and introduce techniques to efficiently leverage
the processing power of GPUs. We evaluate the performance of this prototype
under two configurations: as a content-addressable storage system that supports online similarity detection between successive versions of the same file, and as a traditional system that uses hashing to preserve data integrity.
Further, we evaluate the impact of offloading to the GPU on competing
applications' performance. Our results show that this technique can bring
tangible performance gains without negatively impacting the performance of
concurrently running applications.
Comment: IEEE Transactions on Parallel and Distributed Systems, 201
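To make the hashing primitive concrete, the following minimal Python sketch (an illustration, not the prototype's actual design) compares two versions of a file chunk by chunk using content hashes; the chunk size, the choice of SHA-1, and the function names are assumptions made for the example.

```python
# Illustrative sketch of chunk-level similarity detection between two file
# versions, the kind of hashing workload the prototype offloads to the GPU.
# CHUNK_SIZE and the use of SHA-1 are arbitrary choices for this example.
import hashlib

CHUNK_SIZE = 64 * 1024  # hypothetical fixed chunk size (64 KiB)

def chunk_hashes(path, chunk_size=CHUNK_SIZE):
    """Return one content hash per fixed-size chunk of the file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            hashes.append(hashlib.sha1(chunk).hexdigest())
    return hashes

def similarity(old_path, new_path):
    """Fraction of the new version's chunks already present in the old version."""
    old = set(chunk_hashes(old_path))
    new = chunk_hashes(new_path)
    if not new:
        return 1.0
    return sum(1 for h in new if h in old) / len(new)

# In a content-addressable store, only chunks whose hash is unseen need to be
# written; the per-chunk hashing itself is the throughput-critical step that
# makes GPU offloading attractive.
```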
Scheduling data flow program in xkaapi: A new affinity based Algorithm for Heterogeneous Architectures
Efficient implementations of parallel applications on heterogeneous hybrid
architectures require a careful balance between computations and communications
with accelerator devices. Even if most of the communication time can be
overlapped by computations, it is essential to reduce the total volume of
communicated data. The literature therefore abounds with ad hoc methods for reaching this balance, but they are architecture- and application-dependent. We
propose here a generic mechanism to automatically optimize the scheduling
between CPUs and GPUs, and compare two strategies within this mechanism: the
classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new,
parametrized Distributed Affinity Dual Approximation algorithm (DADA), which groups tasks by affinity before running a fast dual
approximation. We ran experiments on a heterogeneous parallel machine with six
CPU cores and eight NVIDIA Fermi GPUs. Three standard dense linear algebra
kernels from the PLASMA library have been ported on top of the Xkaapi runtime.
We report their performance. The results show that both HEFT and DADA perform well under various experimental conditions, but that DADA performs better for larger systems and larger numbers of GPUs and, in most cases, incurs far less data transfer than HEFT to achieve the same performance.
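To illustrate the earliest-finish-time rule at the core of HEFT (DADA's affinity grouping and dual approximation are not sketched), the following Python snippet assigns independent tasks to whichever CPU or GPU would finish them first; the task costs, the ranking heuristic, and the function name are illustrative assumptions, and the real algorithm also accounts for task dependencies and data-transfer costs.

```python
# Simplified earliest-finish-time (EFT) placement over a pool of CPUs and GPUs.
# Tasks are assumed independent; costs are made up for illustration.

def eft_schedule(task_costs, n_cpus=6, n_gpus=8):
    """task_costs: list of (cpu_time, gpu_time). Returns (task, kind, unit, finish)."""
    # One availability entry per processing unit: (time it becomes free, kind, index).
    units = [(0.0, "cpu", i) for i in range(n_cpus)] + \
            [(0.0, "gpu", i) for i in range(n_gpus)]
    # Order tasks by decreasing average cost, a crude stand-in for HEFT's upward ranks.
    order = sorted(range(len(task_costs)),
                   key=lambda t: -(task_costs[t][0] + task_costs[t][1]) / 2)
    schedule = []
    for t in order:
        cpu_time, gpu_time = task_costs[t]
        # Pick the unit on which this task would finish earliest.
        best = min(units, key=lambda u: u[0] + (cpu_time if u[1] == "cpu" else gpu_time))
        units.remove(best)
        finish = best[0] + (cpu_time if best[1] == "cpu" else gpu_time)
        units.append((finish, best[1], best[2]))
        schedule.append((t, best[1], best[2], finish))
    return schedule

print(eft_schedule([(4.0, 1.0), (2.0, 3.0), (5.0, 0.5), (1.0, 1.0)], n_cpus=2, n_gpus=2))
```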
Direct Communication Methods for Distributed GPUs
Today, GPUs and other parallel accelerators are widely used in high-performance computing due to their high computational power and high performance per watt. Still, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs, which affects not only performance but also power consumption. Often, a data transfer between two distributed GPUs even requires intermediate copies in host memory; this overhead particularly penalizes small data movements and synchronization operations.
In this work, different communication methods for distributed GPUs are implemented and evaluated. First, a new technique, called GPUDirect RDMA, is implemented for the Extoll device and evaluated. The performance results show that this technique brings performance benefits for small- and medium-sized data transfers, but for larger transfer sizes a staged protocol is preferable, since the PCIe bus does not support peer-to-peer data transfers well.
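A simple cost model shows why this crossover appears: the direct peer-to-peer path has a lower start-up latency but a lower streaming bandwidth than a staged copy through host memory. All latency and bandwidth figures below are assumed for illustration and are not measurements from this work.

```python
# Back-of-the-envelope comparison of a direct (GPUDirect-RDMA-style) transfer
# against a staged transfer through host memory. Numbers are illustrative only.

def t_direct(n_bytes, latency=2e-6, bandwidth=6e9):
    """Direct GPU-to-GPU path: low start-up cost, limited PCIe peer-to-peer bandwidth."""
    return latency + n_bytes / bandwidth

def t_staged(n_bytes, latency=10e-6, bandwidth=11e9):
    """Staged path via host memory: higher start-up cost, higher streaming rate."""
    return latency + n_bytes / bandwidth

for size in (1 << 10, 1 << 16, 1 << 20, 1 << 24):  # 1 KiB .. 16 MiB
    better = "direct" if t_direct(size) < t_staged(size) else "staged"
    print(f"{size:>10} B -> {better}")  # small/medium favour direct, large favours staged
```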
In the next step, GPUs are integrated into the one-sided communication library GPI-2. Since this interface was designed for heterogeneous memory structures, it allows an easy integration of GPUs. The performance results show that one-sided communication for GPUs brings some performance benefits compared to two-sided communication, which is the current state of the art. However, using GPI-2 for communication still requires a host thread to control GPU-related communication, even though the data is transferred directly between the GPUs without any host copies. Therefore, the subsequent part of the work analyzes GPU-controlled communication.
First, a put/get communication interface for the GPU, based on InfiniBand verbs, is implemented. This interface enables the GPU to independently source and synchronize communication requests without any involvement of the CPU. However, the InfiniBand verbs protocol adds considerable sequential overhead to the communication, so the performance of GPU-controlled put/get communication falls far behind that of CPU-controlled put/get communication.
Another problem is intra-GPU synchronization: because GPU thread blocks are non-preemptive, issuing communication requests within a GPU can easily result in deadlock. Dynamic parallelism solves this problem. Although the performance of applications using GPU-controlled communication is still slightly worse than that of hybrid applications, the performance per watt increases, since the CPU is relieved of the communication work.
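The non-preemption hazard can be mimicked on the CPU with a fixed-size worker pool; the Python toy below (not CUDA code; pool size and timeout are arbitrary) shows a waiting task that is never satisfied because the task that would produce its data is never scheduled.

```python
# A "consumer" occupies the only worker while waiting for data that a "producer"
# would provide, but the producer cannot run until the consumer finishes: the
# same circular wait that arises with non-preemptive GPU thread blocks.
from concurrent.futures import ThreadPoolExecutor
from threading import Event

data_ready = Event()

def consumer():
    # A timeout stands in for the hang; a real deadlock would wait forever.
    return "got data" if data_ready.wait(timeout=2.0) else "deadlocked (timed out)"

def producer():
    data_ready.set()
    return "sent data"

with ThreadPoolExecutor(max_workers=1) as pool:  # a single non-preemptive "SM"
    c = pool.submit(consumer)                    # scheduled first, never yields
    p = pool.submit(producer)                    # can only start after consumer ends
    print(c.result())                            # -> "deadlocked (timed out)"
```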
As a communication model more in line with the massive parallelism of GPUs, the performance of a hardware-supported global address space for GPUs is evaluated. This global address space allows communication with simple load and store instructions, which can be performed by multiple threads in parallel. With this method, the latency of a GPU-to-GPU data transfer can be reduced to 3 µs using an FPGA. The results show that a global address space is best suited for applications that require small, non-blocking, and irregular data transfers. However, the main drawback of this method is that it does not allow overlapping communication with computation, which is possible with put/get communication.
Overall, by using GPU-optimized communication models, between 10% and 50% better energy efficiency can be reached, depending on the application, than with a hybrid model using CPU-controlled communication.
Enabling GPU Support for the COMPSs-Mobile Framework
Using the GPUs embedded in mobile devices allows for increasing the performance of the applications running on them while reducing the energy consumption of their execution. This article presents a task-based solution for adaptive, collaborative heterogeneous computing on mobile cloud environments. To implement our proposal, we extend the COMPSs-Mobile framework – an implementation of the COMPSs programming model for building mobile applications that offload part of the computation to the Cloud – to support offloading computation to GPUs through OpenCL. To evaluate our solution, we subject the prototype to three benchmark applications representing different application patterns. This work is partially supported by the Joint-Laboratory on Extreme Scale Computing (JLESC), by the European Union through the Horizon 2020 research and innovation programme under contract 687584 (TANGO Project), by the Spanish Government (TIN2015-65316-P, BES-2013-067167, EEBB-2016-11272, SEV-2011-00067) and by the Generalitat de Catalunya (2014-SGR-1051).
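As a rough, hypothetical sketch of the programming-model idea (this is not the COMPSs-Mobile API; the decorator, the availability flag, and the function names are invented for illustration), a task is written once and the runtime selects either the CPU implementation or a GPU variant that would wrap an OpenCL kernel launch.

```python
# Hypothetical task annotation: the "runtime" dispatches each call to a CPU or
# GPU implementation. In the real framework, device discovery and OpenCL kernel
# execution replace the placeholders below.
GPU_AVAILABLE = False  # would come from OpenCL device discovery

def task(gpu_impl=None):
    """Mark a function as an offloadable task with an optional GPU variant."""
    def wrap(cpu_impl):
        def run(*args, **kwargs):
            impl = gpu_impl if (GPU_AVAILABLE and gpu_impl) else cpu_impl
            return impl(*args, **kwargs)
        return run
    return wrap

def vector_add_gpu(a, b):
    # Placeholder for an OpenCL kernel launch in the GPU-enabled runtime.
    return [x + y for x, y in zip(a, b)]

@task(gpu_impl=vector_add_gpu)
def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

print(vector_add([1, 2, 3], [4, 5, 6]))  # placement decided by the dispatcher
```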
Brook Auto: High-Level Certification-Friendly Programming for GPU-powered Automotive Systems
Modern automotive systems require increased performance to implement Advanced Driving Assistance Systems (ADAS). GPU-powered platforms are promising candidates for such computational tasks; however, current low-level programming models complicate the accelerator-software certification process and limit the hardware selection to a fraction of the available platforms. In this paper we present Brook Auto, a high-level programming language for automotive GPU systems which removes these limitations. We describe the challenges and solutions we faced in its implementation, as well as a complete evaluation in terms of performance and productivity, which shows the effectiveness of our method. This work has been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2015-65316-P and the HiPEAC Network of Excellence.
Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices
Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8–4.3× speedup over an optimized CPU implementation when tested with four different input matrices; the evaluated configuration compared one Skylake CPU against one Skylake CPU plus one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9× and a 48.2× speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU-to-GPU interconnects, respectively.
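For reference, the small Python example below uses SciPy's lobpcg to show what the block eigensolver computes; it is a CPU stand-in rather than the paper's OpenMP/OpenACC GPU implementation, and the test matrix and block size are arbitrary.

```python
# Minimal LOBPCG run on a sparse 1-D Laplacian, standing in for the paper's
# large sparse matrices. The per-iteration hot spots are the SpMM (A @ X) and
# the block inner products (X.T @ Y) that the paper tiles to exceed GPU memory.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg

n, block = 1000, 4
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

rng = np.random.default_rng(0)
X = rng.standard_normal((n, block))  # initial block of guess vectors

eigvals, eigvecs = lobpcg(A, X, largest=False, tol=1e-6, maxiter=200)
print("smallest eigenvalues:", np.sort(eigvals))
```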