    Multi-Tenant Virtual GPUs for Optimising Performance of a Financial Risk Application

    Graphics Processing Units (GPUs) are becoming popular accelerators in modern High-Performance Computing (HPC) clusters. Installing GPUs on each node of the cluster is inefficient, resulting in high costs and power consumption as well as underutilisation of the accelerators. The research reported in this paper is motivated by the goal of using only a few physical GPUs, giving cluster nodes on-demand access to remote GPUs for a financial risk application. We hypothesise that sharing GPUs between several nodes, referred to as multi-tenancy, reduces the execution time and energy consumed by an application. Two data transfer modes between the CPU and the GPUs, namely concurrent and sequential, are explored. The key result from the experiments is that multi-tenancy with few physical GPUs using sequential data transfers lowers the execution time and the energy consumed, thereby improving the overall performance of the application. Comment: Accepted to the Journal of Parallel and Distributed Computing (JPDC), 10 June 201

    Acceleration-as-a-Service: Exploiting Virtualised GPUs for a Financial Application

    'How can GPU acceleration be obtained as a service in a cluster?' This question has become increasingly significant due to the inefficiency of installing GPUs on all nodes of a cluster. The research reported in this paper addresses the above question by employing rCUDA (remote CUDA), a framework that facilitates Acceleration-as-a-Service (AaaS), such that the nodes of a cluster can request acceleration from a set of remote GPUs on demand. The rCUDA framework exploits virtualisation and ensures that multiple nodes can share the same GPU. In this paper we test the feasibility of the rCUDA framework on a real-world application employed in the financial risk industry that can benefit from AaaS in a production setting. The results confirm the feasibility of rCUDA and highlight that rCUDA achieves performance similar to CUDA, provides consistent results, and, more importantly, allows a single application to benefit from all the GPUs available in the cluster without losing efficiency. Comment: 11th IEEE International Conference on eScience (IEEE eScience) - Munich, Germany, 201

    GPU-Job Migration: The rCUDA Case

    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Virtualization techniques have been shown to benefit data centers and other computing facilities. In this regard, not only do virtual machines reduce the size of the computing infrastructure while increasing overall resource utilization, but virtualizing individual components of computers may also provide significant benefits. This is the case, for instance, for the remote GPU virtualization technique, implemented in several frameworks in recent years. The large degree of flexibility provided by the remote GPU virtualization technique can be further increased by applying a migration mechanism to it, so that the GPU part of an application can be live-migrated to another GPU elsewhere in the cluster during execution in a transparent way. In this paper we present the implementation of the migration mechanism within the rCUDA remote GPU virtualization middleware. Furthermore, we present a thorough performance analysis of this implementation. To that end, we use both synthetic and real production applications as well as three different generations of NVIDIA GPUs. Additionally, two different versions of the InfiniBand interconnect are used in this study. Several use cases are provided in order to show the benefits that the GPU-job migration mechanism can bring to data centers.

    A performance comparison of CUDA remote GPU virtualization frameworks

    © 2015 IEEE. Using GPUs reduces the execution time of many applications but increases acquisition cost and power consumption. Furthermore, GPUs usually attain relatively low utilization. In this context, remote GPU virtualization solutions were recently created to overcome these drawbacks. Currently, many different remote GPU virtualization frameworks exist, with very different characteristics, and these differences may lead to differences in performance. In this work we present a performance comparison among the only three CUDA remote GPU virtualization frameworks publicly available at no cost. Results show that performance greatly depends on the exact framework used, with the rCUDA virtualization solution standing out among them. Furthermore, rCUDA doubles performance over CUDA for pageable memory copies.

    Tuning remote GPU virtualization for InfiniBand networks

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-016-1754-3. In the past few years, a tendency towards using InfiniBand networks to interconnect high performance computing clusters can be observed. Indeed, most of the supercomputers appearing in the TOP500 list use either Ethernet or InfiniBand interconnects. Regarding the latter, the complexity of the InfiniBand programming API (i.e., InfiniBand Verbs) makes it difficult for applications to get the maximum performance from these networks. In this paper we describe how we have tuned a remote GPU virtualization framework whose communications module is implemented using InfiniBand Verbs. The net result is a noticeable increase in the performance of this framework, significantly reducing the gap between remote and local GPUs.

    InfiniBand verbs optimizations for remote GPU virtualization

    © 2015 IEEE. The use of InfiniBand networks to interconnect high performance computing clusters has increased considerably in recent years, so much so that the majority of the supercomputers included in the TOP500 list use either Ethernet or InfiniBand interconnects. Regarding the latter, due to the complexity of the InfiniBand programming API (i.e., InfiniBand Verbs) and the lack of documentation, there are not enough recent studies explaining how to optimize applications to get the maximum performance from this fabric. In this paper we present two different optimizations to be used when developing applications using InfiniBand Verbs, providing average bandwidth improvements of 3.68% and 217.14%, respectively. In addition, we show that when combining both optimizations, the average bandwidth gain is 43.29%. This bandwidth increment is key for remote GPU virtualization frameworks. In fact, this noticeable gain translates into a reduction of up to 35% in the execution time of applications using remote GPU virtualization frameworks.

    Reducing the Costs of Teaching CUDA in Laboratories while Maintaining the Learning Experience Quality

    Graphics Processing Units (GPUs) have become widely used to accelerate scientific applications; therefore, it is important that Computer Science and Computer Engineering curricula include the fundamentals of parallel computing with GPUs. Regarding the practical part of the training, one important concern is how to introduce GPUs into a laboratory: installing GPUs in all the computers of the lab may not be affordable, while sharing a remote GPU server among several students may result in a poor learning experience because of its associated overhead. In this paper we propose a solution to address this problem: the use of the rCUDA (remote CUDA) middleware, which enables programs running on one computer to make concurrent use of GPUs located in remote servers. Hence, students would be able to concurrently and transparently share a single remote GPU from their local machines in the laboratory without having to log into the remote server. In order to demonstrate that our proposal is feasible, we present results from a real scenario. The results show that the cost of the laboratory is noticeably reduced while the quality of the learning experience is maintained.

    Exploring the use of data compression for accelerating machine learning in the edge with remote virtual graphics processing units

    Internet of Things (IoT) devices are usually low performance nodes connected by low bandwidth networks. To improve performance in such scenarios, some computations could be done at the edge of the network. However, edge devices may not have enough computing power to accelerate applications such as popular machine learning workloads. Using remote virtual graphics processing units (GPUs) can address this concern by accelerating applications with a GPU installed in a remote device. However, this requires exchanging data with the remote GPU across the slow network. To mitigate the impact of the slow network, the data to be exchanged with the remote GPU could be compressed. In this article, we explore the suitability of using data compression in the context of remote GPU virtualization frameworks in edge scenarios executing machine learning applications. We use popular machine learning applications to carry out this exploration. After characterizing the GPU data transfers of these applications, we analyze the usage of existing compression libraries for compressing those data transfers to/from the remote GPU. Our exploration shows that transferring compressed data becomes more beneficial as networks get slower, reducing transfer time by up to 10 times. Our analysis also reveals that efficient integration of compression into remote GPU virtualization frameworks is essential.
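
    The trade-off between compression cost and network speed can be illustrated with a rough model. This is a minimal sketch under our own assumptions, not the authors' method: link latency is ignored, decompression time on the remote side is ignored, and Python's standard zlib module stands in for whichever compression libraries the article evaluates. The function names and the bandwidth values in the usage note are hypothetical.

```python
import time
import zlib


def transfer_time(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Idealised time to move num_bytes over a link (latency ignored)."""
    return num_bytes / bandwidth_bytes_per_s


def compressed_transfer_time(payload: bytes, bandwidth_bytes_per_s: float) -> float:
    """Time to compress payload locally, then send the compressed bytes.

    Decompression on the receiving side is not modelled (assumption).
    """
    start = time.perf_counter()
    compressed = zlib.compress(payload, 1)  # level 1: fastest zlib setting
    compress_seconds = time.perf_counter() - start
    return compress_seconds + transfer_time(len(compressed), bandwidth_bytes_per_s)


def speedup(payload: bytes, bandwidth_bytes_per_s: float) -> float:
    """Ratio of plain to compressed transfer time; above 1 means compression helps."""
    plain = transfer_time(len(payload), bandwidth_bytes_per_s)
    return plain / compressed_transfer_time(payload, bandwidth_bytes_per_s)
```

    For a highly compressible payload such as bytes(1_000_000), speedup(payload, 1e5) on a slow link is far above 1, whereas speedup(payload, 1e12) on a very fast link falls below 1 because the compression time dominates. Real GPU transfers are rarely this compressible, so the figures from this sketch say nothing about the reported measurements.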

    On the Effect of using rCUDA to Provide CUDA Acceleration to Xen Virtual Machines

    Nowadays, many data centers use virtual machines (VMs) in order to achieve a more efficient use of hardware resources. The use of VMs reduces equipment and maintenance expenses and lowers electricity consumption. Nevertheless, current virtualization solutions, such as Xen, do not easily provide graphics processing units (GPUs) to applications running in the virtualized domain with the flexibility usually required in data centers (i.e., managing virtual GPU instances and concurrently sharing them among several VMs). Therefore, the execution of GPU-accelerated applications within VMs is hindered by this lack of flexibility. Remote GPU virtualization solutions may address this concern. In this paper we analyze the use of the remote GPU virtualization mechanism to accelerate scientific applications running inside Xen VMs. We conduct our study with six different applications, namely CUDA-MEME, CUDASW++, GPU-BLAST, LAMMPS, a triangle count application referred to as TRICO, and a synthetic benchmark used to emulate different application behaviors. Our experiments show that remote GPU virtualization is a feasible approach to sharing GPUs among several VMs, featuring a very low overhead if an InfiniBand fabric is already present in the cluster.

    Accelerator Virtualization in Fog Computing: Moving from the Cloud to the Edge

    Hardware accelerators are available on the cloud for enhanced analytics. Next-generation clouds aim to bring enhanced analytics using accelerators closer to user devices at the edge of the network for improving quality of service (QoS) by minimizing end-to-end latencies and response times. The collective computing model that utilizes resources at the cloud-edge continuum in a multi-tier hierarchy comprising the cloud, edge, and user devices is referred to as fog computing. This article identifies challenges and opportunities in making accelerators accessible at the edge. A holistic view of the fog architecture is key to pursuing meaningful research in this area.
