GPU-Job Migration: The rCUDA Case
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Virtualization techniques have been shown to benefit data centers and other computing facilities. Not only do virtual machines reduce the size of the computing infrastructure while increasing overall resource utilization, but virtualizing individual components of computers may also provide significant benefits. This is the case, for instance, of the remote GPU virtualization technique, implemented in several frameworks in recent years. The large degree of flexibility provided by remote GPU virtualization can be further increased by applying a migration mechanism to it, so that the GPU part of an application can be live-migrated to another GPU elsewhere in the cluster, during execution and in a transparent way. In this paper we present the implementation of the migration mechanism within the rCUDA remote GPU virtualization middleware, along with a thorough performance analysis of that implementation. To that end, we leverage both synthetic and real production applications as well as three different generations of NVIDIA GPUs. Additionally, two different versions of the InfiniBand interconnect are used in this study. Several use cases are provided in order to show the extraordinary benefits that the GPU-job migration mechanism can bring to data centers.
This work was funded by the Generalitat Valenciana under Grant PROMETEO/2017/77. Authors are grateful for the generous support provided by Mellanox Technologies Inc.
Prades, J.; Silla Jiménez, F. (2019). GPU-Job Migration: The rCUDA Case. IEEE Transactions on Parallel and Distributed Systems. 30(12):2718-2729. https://doi.org/10.1109/TPDS.2019.2924433
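The abstract above describes live-migrating the GPU part of an application to another server. As a rough conceptual sketch only (not rCUDA's actual implementation; `GpuServer`, `migrate_job`, and all field names are invented for illustration), the core bookkeeping such a mechanism must perform can be modeled like this:

```python
# Conceptual model of GPU-job migration: quiesce the job, snapshot its
# device-memory allocations on the source server, recreate them on the
# destination, repoint the job, and resume. Pure-Python toy, no real GPUs.

class GpuServer:
    def __init__(self, name):
        self.name = name
        self.allocations = {}   # handle -> bytes (image of device memory)

def migrate_job(job, src, dst):
    """Move a job's GPU-side state from src to dst transparently."""
    # 1. Quiesce: stop issuing new GPU calls for this job.
    job["paused"] = True
    # 2. Snapshot the device memory owned by the job on the source.
    snapshot = {h: src.allocations.pop(h) for h in list(job["handles"])}
    # 3. Recreate the allocations on the destination server.
    dst.allocations.update(snapshot)
    # 4. Repoint the job at the new server and resume.
    job["server"] = dst
    job["paused"] = False
    return job

a, b = GpuServer("gpu0"), GpuServer("gpu1")
a.allocations["buf"] = b"weights"
job = {"handles": ["buf"], "server": a, "paused": False}
migrate_job(job, a, b)
print(job["server"].name, b.allocations["buf"])  # gpu1 b'weights'
```

The application-facing handle (`"buf"`) is unchanged, which is the sense in which the migration is transparent to the application.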
A performance comparison of CUDA remote GPU virtualization frameworks
© 2015 IEEE.
Using GPUs reduces the execution time of many applications
but increases acquisition cost and power consumption.
Furthermore, GPUs usually attain a relatively low utilization.
In this context, remote GPU virtualization solutions were
recently created to overcome the drawbacks of using GPUs.
Currently, many different remote GPU virtualization frameworks
exist, all of them presenting very different characteristics.
These differences among them may lead to differences in
performance. In this work we present a performance comparison
among the only three CUDA remote GPU virtualization
frameworks publicly available at no cost. Results show that
performance greatly depends on the exact framework used,
with the rCUDA virtualization solution standing
out among them. Furthermore, rCUDA doubles performance
over CUDA for pageable memory copies.
This work was funded by the Generalitat Valenciana under
Grant PROMETEOII/2013/009 of the PROMETEO program
phase II. Authors are also grateful for the generous support
provided by Mellanox Technologies.
Reaño González, C.; Silla Jiménez, F. (2015). A performance comparison of CUDA remote GPU virtualization frameworks. IEEE. https://doi.org/10.1109/CLUSTER.2015.76
InfiniBand verbs optimizations for remote GPU virtualization
© 2015 IEEE.
The use of InfiniBand networks to interconnect
high performance computing clusters has considerably increased
in recent years, so much so that the majority of
the supercomputers in the TOP500 list use either
Ethernet or InfiniBand interconnects. Regarding the latter,
due to the complexity of the InfiniBand programming API
(i.e., InfiniBand Verbs) and the lack of documentation, there
are few recent studies explaining how to
optimize applications to extract the maximum performance from
this fabric. In this paper we present two different optimizations
to be used when developing applications with InfiniBand
Verbs, providing average bandwidth improvements of
3.68% and 217.14%, respectively. In addition, we show that
when combining both optimizations, the average bandwidth
gain is 43.29%. This bandwidth increment is key for remote
GPU virtualization frameworks. Actually, this noticeable gain
translates into a reduction of up to 35% in execution time of
applications using remote GPU virtualization frameworks.
This work was funded by the Generalitat Valenciana under
Grant PROMETEOII/2013/009 of the PROMETEO program
phase II. Authors are also grateful for the generous support
provided by Mellanox Technologies.
Reaño González, C.; Silla Jiménez, F. (2015). InfiniBand verbs optimizations for remote GPU virtualization. IEEE. https://doi.org/10.1109/CLUSTER.2015.139
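The abstract does not name the two Verbs-level optimizations, so the following is only a generic illustration with invented rates: it models why pipelining a large transfer through pre-registered chunks, so that memory-registration cost overlaps the wire transfer, raises effective bandwidth on an RDMA fabric.

```python
# Toy cost model (not from the paper): serial "register then send" vs a
# pipeline where registering chunk i overlaps sending chunk i-1.
# All rates are illustrative assumptions, in bytes per second.

def serial_time(size, reg_rate, wire_rate):
    # Register the whole buffer, then send it: the two costs add up.
    return size / reg_rate + size / wire_rate

def pipelined_time(size, chunk, reg_rate, wire_rate):
    # Split into chunks; in steady state throughput is bounded by the
    # slower of the two overlapped stages, not by their sum.
    n = size // chunk
    stage = max(chunk / reg_rate, chunk / wire_rate)
    # The first chunk must still be registered before anything is sent.
    return chunk / reg_rate + n * stage

size, chunk = 1 << 30, 1 << 22          # 1 GiB moved in 4 MiB chunks
reg, wire = 8e9, 6e9                    # registration vs wire rates
print(serial_time(size, reg, wire))
print(pipelined_time(size, chunk, reg, wire))  # smaller: stages overlap
```

The same overlap arithmetic explains why such fabric-level gains translate into shorter execution times for remote GPU virtualization frameworks, whose traffic is dominated by large memory transfers.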
Reducing the Costs of Teaching CUDA in Laboratories while Maintaining the Learning Experience Quality
Graphics Processing Units (GPUs) have become widely used to accelerate scientific applications;
therefore, it is important that Computer Science and Computer Engineering curricula include the
fundamentals of parallel computing with GPUs. Regarding the practical part of the training, one
important concern is how to introduce GPUs into a laboratory: installing GPUs in all the computers of
the lab may not be affordable, while sharing a remote GPU server among several students may result
in a poor learning experience because of its associated overhead.
In this paper we propose a solution to address this problem: the use of the rCUDA (remote CUDA)
middleware, which enables programs running on one computer to make concurrent use of GPUs
located in remote servers. Hence, students would be able to concurrently and transparently share a
single remote GPU from their local machines in the laboratory without having to log into the remote
server. In order to demonstrate that our proposal is feasible, we present results of a real scenario. The
results show that the cost of the laboratory is noticeably reduced while the learning experience quality
is maintained.
Reaño González, C.; Silla Jiménez, F. (2015). Reducing the Costs of Teaching CUDA in Laboratories while Maintaining the Learning Experience Quality. In INTED2015 Proceedings. IATED. 3651-3660. http://hdl.handle.net/10251/70229
Accelerator Virtualization in Fog Computing: Moving from the Cloud to the Edge
Hardware accelerators are available on the cloud for enhanced analytics. Next-generation clouds aim to bring enhanced analytics using accelerators closer to user devices at the edge of the network, improving quality of service (QoS) by minimizing end-to-end latencies and response times. The collective computing model that utilizes resources at the cloud-edge continuum in a multi-tier hierarchy comprising the cloud, the edge, and user devices is referred to as fog computing. This article identifies challenges and opportunities in making accelerators accessible at the edge. A holistic view of the fog architecture is key to pursuing meaningful research in this area.
Varghese, B.; Reaño González, C.; Silla Jiménez, F. (2018). Accelerator Virtualization in Fog Computing: Moving from the Cloud to the Edge. IEEE Cloud Computing. 5(6):28-37. https://doi.org/10.1109/MCC.2018.064181118
On the Deployment and Characterization of CUDA Teaching Laboratories
When teaching CUDA in laboratories, an important issue is the economic cost of GPUs, which may
prevent some universities from building large enough labs to teach CUDA. In this paper we propose
an efficient solution to build CUDA labs reducing the number of GPUs. It is based on the use of the
rCUDA (remote CUDA) middleware, which enables programs running on one computer to
concurrently use GPUs located in remote servers. To study the viability of our proposal, we first
characterize the use of GPUs in this kind of labs with statistics taken from real users, and then present
results of sharing GPUs in a real teaching lab. The experiments validate the feasibility of our proposal,
showing an overhead under 5% with respect to having a GPU at each of the students’ computers.
These results clearly improve on alternative approaches, such as logging into remote GPU servers, which
present an overhead of about 30%.
This work was partially funded by Escola Tècnica Superior d'Enginyeria Informàtica de la Universitat Politècnica de València and by Departament d'Informàtica de Sistemes i Computadors de la Universitat Politècnica de València.
Reaño González, C.; Silla Jiménez, F. (2015). On the Deployment and Characterization of CUDA Teaching Laboratories. In EDULEARN15 Proceedings. IATED. http://hdl.handle.net/10251/70225
Intra-node Memory Safe GPU Co-Scheduling
GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper is motivated by the goal of improving GPU utilisation: we propose a framework, referred to as schedGPU, to facilitate intra-node GPU co-scheduling such that a GPU can be safely shared among multiple applications by taking memory constraints into account. Two approaches, namely a client-server approach and a shared-memory approach, are explored. The shared-memory approach proves more suitable due to its lower overheads. Four policies are proposed in schedGPU to handle applications that are waiting to access the GPU, two of which account for priorities. The feasibility of schedGPU is validated on three real-world applications, and the key observation is that a performance gain is achieved. For single applications, a gain of over 10 times, as measured by GPU utilisation and GPU memory utilisation, is obtained. For workloads comprising multiple applications, a speed-up of up to 5x in total execution time is noted. Moreover, the average GPU utilisation and average GPU memory utilisation are increased by 5 and 12 times, respectively.
This work was funded by Generalitat Valenciana under grant PROMETEO/2017/77.
Reaño González, C.; Silla Jiménez, F.; Nikolopoulos, DS.; Varghese, B. (2018). Intra-node Memory Safe GPU Co-Scheduling. IEEE Transactions on Parallel and Distributed Systems. 29(5):1089-1102. https://doi.org/10.1109/TPDS.2017.2784428
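The abstract does not show schedGPU's API, so the sketch below uses invented names and only illustrates the core idea of memory-constrained admission: a job declaring its GPU-memory footprint is admitted when it fits in the remaining device memory and blocks otherwise, which is what makes the co-scheduling memory safe.

```python
# Minimal sketch of memory-constrained GPU co-scheduling: a shared gate
# tracks free device memory; acquire() blocks rather than over-commit.
# Class and method names are illustrative, not schedGPU's real interface.
import threading

class MemGate:
    def __init__(self, total_bytes):
        self.free = total_bytes
        self.cv = threading.Condition()

    def acquire(self, need):
        with self.cv:
            while need > self.free:      # would overflow device memory
                self.cv.wait()           # block until another job releases
            self.free -= need

    def release(self, amount):
        with self.cv:
            self.free += amount
            self.cv.notify_all()         # wake jobs waiting for memory

gate = MemGate(8 << 30)                  # one 8 GiB GPU
gate.acquire(6 << 30)                    # job A fits and is admitted
# a job asking for 4 GiB would now block until A releases its memory
gate.release(6 << 30)
```

The waiting policies mentioned in the abstract (including priority-aware ones) would decide which of several blocked jobs is woken first; this sketch simply wakes all waiters and lets them re-check.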
Enhancing large-scale docking simulation on heterogeneous systems: An MPI vs rCUDA study
Virtual Screening (VS) methods can considerably aid clinical research by predicting how ligands interact with pharmacological targets, thus accelerating the slow and critical process of finding new drugs. VS methods screen large databases of chemical compounds to find a candidate that interacts with a given target. The computational requirements of VS models, along with the size of the databases, which contain up to millions of biological macromolecular structures, mean that computer clusters are a must. However, programming current clusters of computers is no easy task, as they have become heterogeneous and distributed systems where various programming models need to be used together to fully leverage their resources. This paper evaluates several strategies to provide peak performance to a GPU-based molecular docking application called METADOCK in heterogeneous clusters based on CPUs and NVIDIA Graphics Processing Units (GPUs). Our developments start with an OpenMP, MPI and CUDA METADOCK version as a baseline case of cluster utilization. Next, we explore the virtualized GPUs provided by the rCUDA framework in order to facilitate the programming process. rCUDA allows us to use remote GPUs, i.e. GPUs installed in other nodes of the cluster, as if they were installed in the local node, enabling access to them using only OpenMP and CUDA. Finally, several load balancing strategies are analyzed in a search to enhance performance. Our results reveal that the use of middleware like rCUDA is a convincing alternative for leveraging heterogeneous clusters, as it offers even better performance than traditional approaches and also makes it easier to program these emerging clusters.
This work is jointly supported by the Fundación Séneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grant 18946/JLI/13, and by the Spanish MEC and European Commission FEDER under grants TIN2015-66972-C5-3-R and TIN2016-78799-P (AEI/FEDER, UE). We also thank NVIDIA for hardware donation under GPU Educational Center 2014-2016 and Research Center 2015-2016. Furthermore, researchers from Universitat Politècnica de València are supported by the Generalitat Valenciana under Grant PROMETEO/2017/077. Authors are also grateful for the generous support provided by Mellanox Technologies Inc.
Imbernón, B.; Prades Gasulla, J.; Gimenez Canovas, D.; Cecilia, JM.; Silla Jiménez, F. (2018). Enhancing large-scale docking simulation on heterogeneous systems: An MPI vs rCUDA study. Future Generation Computer Systems. 79:26-37. https://doi.org/10.1016/j.future.2017.08.050
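The specific load-balancing strategies the study analyzes are not detailed in the abstract. As a generic, hedged illustration (worker names and relative speeds are invented), one common dynamic scheme hands the next docking task to whichever worker becomes idle first, so faster workers naturally absorb more work than a static even split would give them:

```python
# Greedy dynamic load balancing over heterogeneous workers: each task is
# assigned to the worker with the earliest idle time. A toy simulation,
# not METADOCK's actual scheduler.
from queue import Queue

def schedule(tasks, worker_speeds):
    """Return (tasks assigned per worker, simulated makespan)."""
    q = Queue()
    for t in tasks:
        q.put(t)
    assigned = {w: [] for w in worker_speeds}
    busy_until = {w: 0.0 for w in worker_speeds}
    while not q.empty():
        w = min(busy_until, key=busy_until.get)  # earliest-idle worker
        t = q.get()
        busy_until[w] += t / worker_speeds[w]    # time = work / speed
        assigned[w].append(t)
    return assigned, max(busy_until.values())

# 12 equally sized task batches over one CPU core, one local GPU, and one
# remote (rCUDA-provided) GPU; the speed ratios are purely illustrative.
plan, makespan = schedule([10] * 12,
                          {"cpu": 1.0, "gpu_local": 8.0, "gpu_remote": 6.0})
```

With these invented speeds the GPU workers end up with far more tasks than the CPU, which is the behaviour a dynamic strategy is meant to produce on a heterogeneous cluster.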
On the design of a demo for exhibiting rCUDA
© 2015 IEEE.
CUDA is a technology developed by NVIDIA
which provides a parallel computing platform and programming
model for NVIDIA GPUs and compatible ones. It takes
advantage of the enormous parallel processing power of GPUs
in order to accelerate a wide range of applications, thus
reducing their execution time.
rCUDA (remote CUDA) is a middleware which grants
applications concurrent access to CUDA-compatible devices
installed in other nodes of the cluster in a transparent way so
that applications are not aware of accessing a remote device.
In this paper we present a demo which shows, in real time,
the overhead introduced by rCUDA in comparison to CUDA
when running image filtering applications. The approach
followed in this work is to develop a graphical demo which
contains both an appealing design and technical content.
This work was funded by the Spanish MINECO and
FEDER funds under Grant TIN2012-38341-C04-01. Authors are also grateful for the generous support provided
by Mellanox Technologies and the equipment donated by
NVIDIA Corporation.
Reaño González, C.; Pérez López, F.; Silla Jiménez, F. (2015). On the design of a demo for exhibiting rCUDA. IEEE. https://doi.org/10.1109/CCGrid.2015.53
Fault-tolerant vertical link design for effective 3D stacking
Recently, 3D stacking has been proposed to alleviate the memory bandwidth limitation arising in chip multiprocessors
(CMPs). As the number of integrated cores in the chip increases, access to external memory becomes the bottleneck, thus
demanding larger memory amounts inside the chip. The most accepted solution to implement vertical links between stacked dies
is by using Through Silicon Vias (TSVs). However, TSVs are exposed to misalignment and random defects, compromising the yield of
the manufactured 3D chip. A common solution to this problem is over-provisioning, which impacts area and cost. In this paper,
we propose a fault-tolerant vertical link design. With its adoption, fault-tolerant vertical links can be implemented in a 3D chip design
at low cost, without adding redundant TSVs (no over-provisioning). Preliminary results are very promising, as the fault-tolerant
vertical link design increases switch area by only 6.69% while the achieved interconnect yield tends to 100%.
This work was supported by the Spanish MEC and MICINN, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04. It was also partly supported by the project NaNoC (project label 248972), funded by the European Commission within Research Programme FP7.
Hernández Luz, C.; Roca Pérez, A.; Flich Cardo, J.; Silla Jiménez, F.; Duato Marín, JF. (2011). Fault-tolerant vertical link design for effective 3D stacking. IEEE Computer Architecture Letters. 10(2):41-44. https://doi.org/10.1109/L-CA.2011.17
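As a back-of-the-envelope illustration of why a link that degrades gracefully can push interconnect yield toward 100% (the numbers and the independence assumption are invented, not the paper's model), compare the yield of a plain n-TSV vertical link, which fails on any defective TSV, with a fault-tolerant link that keeps operating while at least some minimum number of its TSVs survive:

```python
# Toy yield model for vertical links, assuming independent TSV defects.
from math import comb

def plain_yield(n_tsv, p_defect):
    # A conventional vertical link works only if every TSV is defect-free.
    return (1 - p_defect) ** n_tsv

def fault_tolerant_yield(n_tsv, p_defect, min_ok):
    # A fault-tolerant link degrades gracefully: it still operates while
    # at least min_ok TSVs survive (traffic is serialized over them).
    return sum(comb(n_tsv, k)
               * (1 - p_defect) ** k * p_defect ** (n_tsv - k)
               for k in range(min_ok, n_tsv + 1))

# 64-TSV link with a 1% independent defect rate (illustrative values):
print(round(plain_yield(64, 0.01), 3))               # ~0.526
print(round(fault_tolerant_yield(64, 0.01, 32), 6))  # ~1.0
```

Under these toy numbers the all-or-nothing link already loses almost half the chips, while tolerating failed TSVs makes link failure vanishingly unlikely, which matches the paper's observation that yield tends to 100% without redundant TSVs.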