Boosting the performance of remote GPU virtualization using InfiniBand Connect-IB and PCIe 3.0
© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

A clear trend has emerged involving the acceleration of scientific applications by using GPUs. However, the capabilities of these devices are still generally underutilized. Remote GPU virtualization techniques can help increase GPU utilization rates, while reducing acquisition and maintenance costs. The overhead of using a remote GPU instead of a local one is introduced mainly by the difference in performance between the internode network and the intranode PCIe link. In this paper we show how using the new InfiniBand Connect-IB network adapters (attaining similar throughput to that of the most recently emerged GPUs) boosts the performance of remote GPU virtualization, reducing the overhead to a mere 0.19% in the application tested.

This work was funded by the Generalitat Valenciana under Grant PROMETEOII/2013/009 of the PROMETEO program phase II. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (SC-21), under Contract No. DE-AC02-06CH11357. Authors from the Universitat Politècnica de València and Universitat Jaume I are grateful for the generous support provided by Mellanox Technologies.

Reaño González, C.; Silla Jiménez, F.; Peña Monferrer, AJ.; Shainer, G.; Schultz, S.; Castelló Gimeno, A.; Quintana Orti, ES.... (2014). Boosting the performance of remote GPU virtualization using InfiniBand Connect-IB and PCIe 3.0. In 2014 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. 266-267. doi:10.1109/CLUSTER.2014.6968737
Reducing the Costs of Teaching CUDA in Laboratories while Maintaining the Learning Experience Quality
Graphics Processing Units (GPUs) have become widely used to accelerate scientific applications;
therefore, it is important that Computer Science and Computer Engineering curricula include the
fundamentals of parallel computing with GPUs. Regarding the practical part of the training, one
important concern is how to introduce GPUs into a laboratory: installing GPUs in all the computers of
the lab may not be affordable, while sharing a remote GPU server among several students may result
in a poor learning experience because of its associated overhead.
In this paper we propose a solution to address this problem: the use of the rCUDA (remote CUDA)
middleware, which enables programs being executed in a computer to make concurrent use of GPUs
located in remote servers. Hence, students would be able to concurrently and transparently share a
single remote GPU from their local machines in the laboratory without having to log into the remote
server. In order to demonstrate that our proposal is feasible, we present results of a real scenario. The
results show that the cost of the laboratory is noticeably reduced while the learning experience quality
is maintained.

Reaño González, C.; Silla Jiménez, F. (2015). Reducing the Costs of Teaching CUDA in Laboratories while Maintaining the Learning Experience Quality. In INTED2015 Proceedings. IATED. 3651-3660. http://hdl.handle.net/10251/70229
Simulation of reaction-diffusion processes in three dimensions using CUDA
Numerical solution of reaction-diffusion equations in three dimensions is one
of the most challenging applied mathematical problems. Since these simulations
are very time consuming, any ideas and strategies aiming at the reduction of
CPU time are important topics of research. A general and robust approach is the
parallelization of source codes and programs. Recently, the technological
development of graphics hardware has made it possible to use desktop video
cards to solve numerically intensive problems. We present a powerful parallel
computing framework to solve reaction-diffusion equations numerically using the
Graphics Processing Units (GPUs) with CUDA. Four different reaction-diffusion
problems: (i) diffusion of a chemically inert compound, (ii) Turing pattern
formation, (iii) phase separation in the wake of a moving diffusion front, and
(iv) air pollution dispersion were solved, and additionally both the Shared
method and the Moving Tiles method were tested. Our results show that the parallel
implementation achieves typical acceleration values in the order of 5-40 times
compared to a single-threaded CPU implementation on a 2.8 GHz desktop computer.

Comment: 8 figures, 5 tables
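As a rough illustration of the numerical scheme such GPU frameworks parallelize, the following is a minimal NumPy sketch of one explicit finite-difference step for a 3D reaction-diffusion equation; the function names and parameter values are illustrative stand-ins, not code from the paper:

```python
import numpy as np

def laplacian(u, dx):
    """Discrete 3D Laplacian with periodic boundaries (central differences)."""
    return (
        np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0)
        + np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1)
        + np.roll(u, 1, axis=2) + np.roll(u, -1, axis=2)
        - 6.0 * u
    ) / dx**2

def step(u, D, dt, dx, reaction):
    """One explicit Euler step of du/dt = D * laplacian(u) + R(u)."""
    return u + dt * (D * laplacian(u, dx) + reaction(u))

# Case (i) above, diffusion of a chemically inert compound: R(u) = 0.
u = np.zeros((32, 32, 32))
u[16, 16, 16] = 1.0                      # point source in the center
for _ in range(100):
    u = step(u, D=1.0, dt=0.1, dx=1.0, reaction=lambda v: 0.0)
# The periodic stencil conserves total mass while the peak spreads out.
print(float(u.sum()))
```

On a GPU, each grid point of this stencil update maps naturally to one CUDA thread, which is the data parallelism the paper's Shared and Moving Tiles kernels exploit.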
On the Deployment and Characterization of CUDA Teaching Laboratories
When teaching CUDA in laboratories, an important issue is the economic cost of GPUs, which may
prevent some universities from building large enough labs to teach CUDA. In this paper we propose
an efficient solution to build CUDA labs reducing the number of GPUs. It is based on the use of the
rCUDA (remote CUDA) middleware, which enables programs being executed in a computer to
concurrently use GPUs located in remote servers. To study the viability of our proposal, we first
characterize the use of GPUs in this kind of labs with statistics taken from real users, and then present
results of sharing GPUs in a real teaching lab. The experiments validate the feasibility of our proposal,
showing an overhead under 5% with respect to having a GPU at each of the students’ computers.
These results clearly improve on alternative approaches, such as logging into remote GPU servers, which presents an overhead of about 30%.

This work was partially funded by the Escola Tècnica Superior d'Enginyeria Informàtica de la Universitat Politècnica de València and by the Departament d'Informàtica de Sistemes i Computadors de la Universitat Politècnica de València.

Reaño González, C.; Silla Jiménez, F. (2015). On the Deployment and Characterization of CUDA Teaching Laboratories. In EDULEARN15 Proceedings. IATED. http://hdl.handle.net/10251/70225
Reduciendo el coste económico de las prácticas de CUDA manteniendo la calidad del aprendizaje
Abstract:

General-purpose computing on graphics processing units consists in using graphics processing units (GPUs) to perform the computation of applications traditionally handled by regular processors (CPUs). Due to their increasing use, it is important that Computer Engineering and Computer Science curricula include the basics of this computing trend. As regards the practical part of the training, one major issue is how to introduce GPUs into a laboratory: buying GPUs for all the workstations of the lab may be too expensive, whereas installing one GPU in a server and requesting the students to log into this server may lead to a low teaching quality due to its associated overhead.

In this paper we suggest a new solution to introduce GPUs into a laboratory: the rCUDA (remote CUDA) framework, which allows applications running in a computer to use GPUs installed in remote servers. Hence, students will be capable of sharing a remote GPU (concurrently and transparently) from their local workstations in the lab, without logging into the server. To prove that our approach is feasible, we show experiments in a real laboratory. The experiments demonstrate that our proposal reduces the cost of the laboratory while the teaching quality is maintained.
Improving the management efficiency of GPU workloads in data centers through GPU virtualization
Graphics processing units (GPUs) are currently used in data centers to reduce the execution time of compute-intensive applications. However, the use of GPUs presents several side effects, such as increased acquisition costs and larger space requirements. Furthermore, GPUs require a nonnegligible amount of energy even while idle. Additionally, GPU utilization is usually low for most applications. In a similar way to the use of virtual machines, using virtual GPUs may address the concerns associated with the use of these devices. In this regard, the remote GPU virtualization mechanism could be leveraged to share the GPUs present in the computing facility among the nodes of the cluster. This would increase overall GPU utilization, thus reducing the negative impact of the increased costs mentioned before. Reducing the amount of GPUs installed in the cluster could also be possible. However, in the same way as job schedulers map GPU resources to applications, virtual GPUs should also be scheduled before job execution. Nevertheless, current job schedulers are not able to deal with virtual GPUs. In this paper, we analyze the performance attained by a cluster using the remote Compute Unified Device Architecture middleware and a modified version of the Slurm scheduler, which is now able to assign remote GPUs to jobs. Results show that cluster throughput, measured as jobs completed per time unit, is doubled at the same time that the total energy consumption is reduced up to 40%. GPU utilization is also increased.

This work was supported by the Generalitat Valenciana under grant PROMETEO/2017/077, and by MINECO and FEDER under grants TIN2014-53495-R, TIN2015-65316-P and TIN2017-82972-R.

Iserte, S.; Prades, J.; Reaño González, C.; Silla, F. (2021). Improving the management efficiency of GPU workloads in data centers through GPU virtualization. Concurrency and Computation: Practice and Experience. 33(2):1-16. https://doi.org/10.1002/cpe.5275
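The scheduling idea above can be sketched in a few lines. The toy first-fit scheduler below is purely illustrative (the class and function names are invented for this sketch and are not taken from the modified Slurm code): with remote GPU virtualization, a job may be placed on a GPU with a free slot anywhere in the cluster, not only on a GPU local to its execution node.

```python
from dataclasses import dataclass, field

@dataclass
class RemoteGPU:
    node: str
    slots: int                     # concurrent jobs this GPU may serve
    jobs: list = field(default_factory=list)

def assign(job, gpus):
    """First-fit: place the job on any remote GPU with a free slot."""
    for gpu in gpus:
        if len(gpu.jobs) < gpu.slots:
            gpu.jobs.append(job)
            return gpu
    return None                    # no free slot: the job must wait

cluster = [RemoteGPU("node0", slots=2), RemoteGPU("node1", slots=2)]
placed = [assign(f"job{i}", cluster) for i in range(5)]
print([g.node if g else None for g in placed])
# → ['node0', 'node0', 'node1', 'node1', None]
```

Allowing several jobs per GPU (slots > 1) is what raises utilization and throughput in the paper's experiments; a local-only scheduler would leave remote slots idle.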
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some applications paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid
applications using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications-level
programmer, and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas, including: partial differential equations; graph cluster
metric calculations; and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
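The hybrid pattern described, one host thread per accelerator device, can be sketched as follows. This is a generic illustration (the device count and the work function are stand-ins, not the paper's code): each CPU thread drives its own device while the coarse-grained decomposition of the data happens on the host side.

```python
import threading

def run_on_device(device_id, chunk, results, lock):
    """Host thread driving one device: in a real hybrid code this would
    bind a GPU context and launch kernels; here we just square numbers."""
    partial = sum(x * x for x in chunk)   # stand-in for the device kernel
    with lock:                            # merge partial results safely
        results[device_id] = partial

data = list(range(8))
num_devices = 2                           # pretend we control two GPUs
chunks = [data[i::num_devices] for i in range(num_devices)]

results, lock = {}, threading.Lock()
threads = [
    threading.Thread(target=run_on_device, args=(d, chunks[d], results, lock))
    for d in range(num_devices)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results.values()))  # → 140, i.e. 0² + 1² + … + 7²
```

The same shape works with CUDA streams or OpenCL command queues in place of the stand-in kernel; the host threads provide the coarse-grained parallelism, the devices the fine-grained data parallelism.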
CU2rCU: A CUDA-to-rCUDA Converter
GPUs (Graphics Processing Units) are being increasingly embraced by the high performance computing and computational communities as an effective way of considerably reducing application execution time by accelerating significant parts of their codes. CUDA (Compute Unified Device Architecture) is a technology developed by NVIDIA which leverages the parallel compute engine in GPUs. However, the use of GPUs in current HPC clusters presents certain negative side effects, mainly related to acquisition costs and power consumption.

rCUDA (remote CUDA) was recently developed as a software solution to address these concerns. Specifically, it is a middleware that allows transparently sharing a reduced number of CUDA-compatible GPUs among the nodes in a cluster, reducing acquisition costs and power consumption. While the initial prototype versions of rCUDA demonstrated its functionality, they also revealed several concerns related to usability and performance. With respect to usability, the rCUDA framework was limited by its lack of support for the CUDA extensions to the C language; it was thus necessary to manually convert the original CUDA source code into functionally identical plain C code that does not include such extensions. For that purpose, in this document we present a new component of the rCUDA suite that automatically transforms any CUDA source code into plain C code, so that it can be effectively accommodated within the rCUDA technology.

Reaño González, C. (2012). CU2rCU: A CUDA-to-rCUDA Converter. http://hdl.handle.net/10251/27435
Structure evolution of spin-coated phase separated EC/HPC films
Porous phase-separated films made of ethylcellulose (EC) and hydroxypropylcellulose (HPC) are commonly used for controlled drug release. The structure of these thin films controls the drug transport from the core to the surrounding liquid in the stomach or intestine. However, a detailed understanding of the structure evolution is lacking. In this work, we use spin-coating, a widely applied technique for making thin uniform polymer films, to mimic the industrial manufacturing process of fluidized bed spraying. The aim of this work is to understand the structure evolution and phase separation kinetics of single-layer and multi-layer spin-coated EC/HPC films. The structure evolution is characterized using confocal laser scanning microscopy (CLSM) and image analysis.

The influence of spin-coating parameters and EC:HPC ratio on the final phase-separated structure and the film thickness was determined. Varying spin speed and EC:HPC ratio gave us precise control over the characteristic length scale and thickness of the film. The results show that the phase separation occurs through spinodal decomposition and that the characteristic length scale increases with decreasing spin speed and with increasing HPC ratio. The thickness of the spin-coated film decreases with increasing spin speed.

Furthermore, optimized spin-coating parameters were selected to study the kinetics of phase separation in situ, in particular the coarsening mechanisms and the time dependence of domain growth as a function of EC:HPC ratio. We identified two main coarsening mechanisms: interfacial-tension-driven hydrodynamic growth for the bicontinuous structures, and diffusion-driven coalescence for the discontinuous structures. In addition, we obtained information on the wetting, the shrinkage, and the evaporation process by looking at a film cross section, which allowed an estimation of the binodal of the phase diagram.

The findings from this work give a good understanding of the mechanisms responsible for the morphology development and open the road towards tailoring thin EC/HPC film structures for controlled drug release.