14 research outputs found

    Reducing the Costs of Teaching CUDA in Laboratories while Maintaining the Learning Experience Quality

    Full text link
    Graphics Processing Units (GPUs) have become widely used to accelerate scientific applications; therefore, it is important that Computer Science and Computer Engineering curricula include the fundamentals of parallel computing with GPUs. Regarding the practical part of the training, one important concern is how to introduce GPUs into a laboratory: installing GPUs in all the computers of the lab may not be affordable, while sharing a remote GPU server among several students may result in a poor learning experience because of its associated overhead. In this paper we propose a solution to address this problem: the use of the rCUDA (remote CUDA) middleware, which enables programs executed on a computer to make concurrent use of GPUs located in remote servers. Hence, students would be able to concurrently and transparently share a single remote GPU from their local machines in the laboratory without having to log into the remote server. In order to demonstrate that our proposal is feasible, we present results from a real scenario. The results show that the cost of the laboratory is noticeably reduced while the learning experience quality is maintained.
    Reaño González, C.; Silla Jiménez, F. (2015). Reducing the Costs of Teaching CUDA in Laboratories while Maintaining the Learning Experience Quality. In INTED2015 Proceedings. IATED. 3651-3660. http://hdl.handle.net/10251/70229
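    As an illustration of why this setup is transparent to the students, the sketch below is a minimal, standard CUDA vector-addition exercise of the kind typically used in a first GPU lab; it is not code from the paper. Because rCUDA intercepts the CUDA calls, the same source compiles with nvcc and runs unmodified whether the GPU is local or remote; only the runtime environment (for example, variables pointing at the remote GPU server, whose exact names depend on the rCUDA release) changes.

        // Minimal CUDA vector addition: the kind of lab exercise that runs
        // unmodified whether the GPU is local or reached through rCUDA.
        #include <cstdio>
        #include <cstdlib>
        #include <cuda_runtime.h>

        __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) c[i] = a[i] + b[i];
        }

        int main() {
            const int n = 1 << 20;
            const size_t bytes = n * sizeof(float);
            float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
            for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

            float *da, *db, *dc;
            cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
            cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

            vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);  // executes on the (possibly remote) GPU
            cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

            printf("c[42] = %f\n", hc[42]);
            cudaFree(da); cudaFree(db); cudaFree(dc);
            free(ha); free(hb); free(hc);
            return 0;
        }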

    SLURM Support for Remote GPU Virtualization: Implementation and Performance Study

    Full text link
    © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    SLURM is a resource manager that can be leveraged to share a collection of heterogeneous resources among the jobs in execution in a cluster. However, SLURM is not designed to handle resources such as graphics processing units (GPUs). Concretely, although SLURM can use a generic resource plugin (GRes) to manage GPUs, with this solution the hardware accelerators can only be accessed by the job that is in execution on the node to which the GPU is attached. This is a serious constraint for remote GPU virtualization technologies, which aim at providing user-transparent access to all the GPUs in the cluster, independently of where the application runs with respect to the GPU node. In this work we introduce a new type of device in SLURM, "rgpu", in order to gain access from any application node to any GPU node in the cluster using rCUDA as the remote GPU virtualization solution. With this new scheduling mechanism, a user can access any number of GPUs, as SLURM schedules the tasks taking into account all the graphics accelerators available in the complete cluster. We present experimental results that show the benefits of this new approach in terms of increased flexibility for the job scheduler.
    The researchers at UPV were supported by the Generalitat Valenciana under Grant PROMETEOII/2013/009 of the PROMETEO program phase II. Researchers at UJI were supported by MINECO, by FEDER funds under Grant TIN2011-23283, and by the Fundacion Caixa-Castelló Bancaixa (Grant P11B2013-21).
    Iserte Agut, S.; Castello Gimeno, A.; Mayo Gual, R.; Quintana Ortí, ES.; Silla Jiménez, F.; Duato Marín, JF.; Reaño González, C.... (2014). SLURM Support for Remote GPU Virtualization: Implementation and Performance Study. In Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on. IEEE. 318-325. https://doi.org/10.1109/SBAC-PAD.2014.49

    Design Patterns for Scientific Computations on Sparse Matrices

    Full text link

    Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters

    Get PDF
    X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today's ultra-fast X-ray scattering detectors can provide tremendous information on the structural properties of materials, a primary challenge remains in the analysis of the resulting data. We are developing novel high-performance computing algorithms, codes, and software tools for the analysis of X-ray scattering data. In this paper we describe two such HPC algorithm advances. Firstly, we have implemented a flexible and highly efficient Grazing Incidence Small Angle Scattering (GISAXS) simulation code based on the Distorted Wave Born Approximation (DWBA) theory with C++/CUDA/MPI on a cluster of GPUs. Our code can compute the scattered light intensity from any given sample in all directions of space, thus allowing full construction of the GISAXS pattern. Preliminary tests on a single GPU show speedups of over 125x compared to the sequential code, and almost linear speedup when executing across a GPU cluster with 42 nodes, resulting in an additional 40x speedup compared to using one GPU node. Secondly, for the structural fitting problems in inverse modeling, we have implemented a Reverse Monte Carlo simulation algorithm with C++/CUDA using one GPU. Since there are large numbers of parameters to fit in the X-ray scattering simulation model, the earlier single-CPU code required weeks of runtime. Deploying the AccelerEyes Jacket/Matlab wrapper to use the GPU gave around a 100x speedup over the pure CPU code. Our further C++/CUDA optimization delivered an additional 9x speedup.
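    As a hedged illustration of the data-parallel pattern behind such scattering simulations (not the authors' DWBA/MPI code), the kernel below assigns one GPU thread per reciprocal-space point q and accumulates a complex amplitude over the sample's scatterers before storing the intensity |F(q)|^2; the kernel name, the plain Born-approximation sum, and the data layout are illustrative assumptions.

        // Illustrative kernel: one thread per q-point sums complex phase
        // contributions from all scatterer positions r and stores |F(q)|^2.
        #include <cuda_runtime.h>
        #include <cuComplex.h>

        __global__ void scatteredIntensity(const float3 *q, int nq,
                                           const float3 *r, int nr,
                                           float *intensity) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= nq) return;
            cuFloatComplex amp = make_cuFloatComplex(0.0f, 0.0f);
            for (int j = 0; j < nr; ++j) {
                float phase = q[i].x * r[j].x + q[i].y * r[j].y + q[i].z * r[j].z;
                amp = cuCaddf(amp, make_cuFloatComplex(cosf(phase), sinf(phase)));
            }
            float mag = cuCabsf(amp);
            intensity[i] = mag * mag;  // |F(q)|^2
        }

    Each q-point is independent, which is the property that lets such a computation scale almost linearly when the q-grid is partitioned across the GPUs of a cluster.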

    CU2rCU: A CUDA-to-rCUDA Converter

    Full text link
    GPUs (Graphics Processor Units) are being increasingly embraced by the high performance computing and computational communities as an effective way of considerably reducing application execution time by accelerating significant parts of their codes. CUDA (Compute Unified Device Architecture) is a technology developed by NVIDIA which leverages the parallel compute engine in GPUs. However, the use of GPUs in current HPC clusters presents certain negative side-effects, mainly related to acquisition costs and power consumption. rCUDA (remote CUDA) was recently developed as a software solution to address these concerns. Specifically, it is a middleware that allows a reduced number of CUDA-compatible GPUs to be transparently shared among the nodes in a cluster, reducing acquisition costs and power consumption. While the initial prototype versions of rCUDA demonstrated its functionality, they also revealed several concerns related to usability and performance. With respect to usability, the rCUDA framework was limited by its lack of support for the CUDA extensions to the C language. Thus, it was necessary to manually convert the original CUDA source code into plain C code that is functionally identical but does not include such extensions. For this purpose, in this document we present a new component of the rCUDA suite that automatically transforms any CUDA source code into plain C code, so that it can be effectively accommodated within the rCUDA technology.
    Reaño González, C. (2012). CU2rCU: A CUDA-to-rCUDA Converter. http://hdl.handle.net/10251/27435
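    To make the kind of rewriting concrete, the fragment below contrasts a kernel launch written with the CUDA extensions to C against an equivalent launch expressed through the plain runtime API call cudaLaunchKernel. This is only an illustrative sketch of the style of transformation; the actual CU2rCU rules are more extensive (they also cover kernel qualifiers and other extensions) and may emit different runtime calls.

        #include <cuda_runtime.h>

        __global__ void scale(float *v, float alpha, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) v[i] *= alpha;
        }

        void launch_scale(float *d_v, float alpha, int n) {
            dim3 grid((n + 255) / 256), block(256);

            // Launch written with the CUDA extensions to C (triple-angle-bracket syntax):
            // scale<<<grid, block>>>(d_v, alpha, n);

            // Equivalent launch through the plain runtime API, with no <<<...>>> syntax:
            void *args[] = { &d_v, &alpha, &n };
            cudaLaunchKernel((const void *)scale, grid, block, args, 0, 0);
        }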

    Exploiting the capabilities of modern GPUs for dense matrix computations

    No full text
    We present several algorithms to compute the solution of a linear system of equations on a graphics processor (GPU), as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We compare the single and double precision performance of a modern GPU with unified architecture, and show how iterative refinement with mixed precision can be used to regain full accuracy in the solution of linear systems, exploiting the potential of the processor for single precision arithmetic. Experimental results on a GTX280 using CUBLAS 2.0, the implementation of BLAS for NVIDIA® GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed.
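    The mixed-precision scheme mentioned in the abstract can be summarized with a short host-side sketch: solve in single precision on the GPU, measure the residual in double precision, and refine until the residual is small. The two helpers below (solve_single_gpu and residual_double) are hypothetical placeholders for a single-precision GPU factorization/solve and a double-precision residual computation; the loop itself is the standard refinement scheme, not code from the paper.

        // Sketch of mixed-precision iterative refinement for Ax = b.
        #include <math.h>
        #include <stdlib.h>

        /* Hypothetical helpers: single-precision GPU solve and double-precision residual. */
        void solve_single_gpu(int n, const double *A, const double *rhs, double *x);
        void residual_double(int n, const double *A, const double *x, const double *b, double *r);

        void refine(int n, const double *A, const double *b, double *x,
                    int max_iters, double tol) {
            double *r = (double *)malloc(n * sizeof(double));
            double *d = (double *)malloc(n * sizeof(double));

            solve_single_gpu(n, A, b, x);                  /* x0 from the single-precision solve  */
            for (int k = 0; k < max_iters; ++k) {
                residual_double(n, A, x, b, r);            /* r = b - A*x in double precision     */
                double nrm = 0.0;
                for (int i = 0; i < n; ++i) nrm = fmax(nrm, fabs(r[i]));
                if (nrm < tol) break;
                solve_single_gpu(n, A, r, d);              /* correction: solve A*d = r in single */
                for (int i = 0; i < n; ++i) x[i] += d[i];  /* update the iterate in double        */
            }
            free(r); free(d);
        }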

    Applications on emerging paradigms in parallel computing

    Get PDF
    The area of computing is seeing parallelism increasingly being incorporated at various levels: from the lowest levels of vector processing units following Single Instruction Multiple Data (SIMD) processing, Simultaneous Multi-threading (SMT) architectures, and multi/many-cores with thread-level shared memory and SIMT parallelism, to the higher levels of distributed memory parallelism as in supercomputers and clusters, and scaling them to large distributed systems such as server farms and clouds. Together, these form a large hierarchy of parallelism. Developing high-performance parallel algorithms and efficient software tools, which make use of the available parallelism, is inevitable in order to harness the raw computational power these emerging systems have to offer. In the work presented in this thesis, we develop architecture-aware parallel techniques on such emerging paradigms in parallel computing, specifically, the parallelism offered by the emerging multi- and many-core architectures, as well as the emerging area of cloud computing, to target large scientific applications. First, we develop efficient parallel algorithms to compute optimal pairwise alignments of genomic sequences on heterogeneous multi-core processors, and demonstrate them on the IBM Cell Broadband Engine. Then, we develop parallel techniques for scheduling all-pairs computations on heterogeneous systems, including clusters of Cell processors and NVIDIA graphics processors. We compare the performance of our strategies on Cell, GPU and Intel Nehalem multi-core processors. Further, we apply our algorithms to specific applications taken from the areas of systems biology, fluid dynamics and materials science: pairwise Mutual Information computations for reconstruction of gene regulatory networks; pairwise Lp-norm distance computations for coherent structure discovery in the design of flapping-wing Micro Air Vehicles; and construction of stochastic models for a set of properties of heterogeneous materials. Lastly, in the area of cloud computing, we propose and develop an abstract framework to enable computations in parallel on large tree structures, to facilitate easy development of a class of scientific applications based on trees. Our framework, in the style of Google's MapReduce paradigm, is based on two generic user-defined functions through which a user writes an application. We implement our framework as a generic programming library for a large cluster of homogeneous multi-core processors, and demonstrate its applicability through two applications: all-k-nearest-neighbors computations, and Fast Multipole Method (FMM) based simulations.
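    Among the all-pairs computations mentioned, the pairwise Lp-norm distances map naturally onto a GPU: each thread can own one (i, j) pair. The kernel below is a minimal illustrative CUDA sketch of that mapping, with no tiling or shared-memory blocking, and is not the scheduling framework developed in the thesis.

        // Illustrative all-pairs Lp-norm distance kernel: thread (i, j) computes
        // the distance between feature vectors i and j of an nvec x dim dataset.
        #include <cuda_runtime.h>
        #include <math.h>

        __global__ void allPairsLp(const float *data, int nvec, int dim,
                                   float p, float *dist) {
            int j = blockIdx.x * blockDim.x + threadIdx.x;
            int i = blockIdx.y * blockDim.y + threadIdx.y;
            if (i >= nvec || j >= nvec) return;
            float acc = 0.0f;
            for (int k = 0; k < dim; ++k) {
                float d = fabsf(data[i * dim + k] - data[j * dim + k]);
                acc += powf(d, p);
            }
            dist[i * nvec + j] = powf(acc, 1.0f / p);
        }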

    Analysis and evaluation of Binary Cascade Iterative Refinement and comparison to other iterative refinement algorithms for solving linear systems

    Get PDF
    Iterative refinement is a widely used method to reduce the round-off errors in the solution of a linear system. The cost of the iterative improvement is very low compared to the cost of factorizing the matrix, yet it can yield a solution that is accurate to machine precision. Many variations on the standard iterative refinement method exist, which use different working precisions to refine the solution. Extra precise iterative refinement uses extended precision to improve the result. Mixed precision iterative refinement tries to exploit the benefits of using lower precisions to compute a solution and then uses iterative refinement to achieve the accuracy of the higher precision. The focus of this thesis is binary cascade iterative refinement (BCIR), which chooses the working precisions according to the input data. This algorithm depends on arbitrary precision arithmetic to support working precisions outside the IEEE standard data types provided by most hardware vendors. The thesis analyses the properties of BCIR and conducts experiments which compare the algorithm to other iterative refinement methods, focusing on numerical accuracy and convergence. The arbitrary precision arithmetic is implemented using the GNU MPFR software library. Because the arbitrary precisions are simulated in software, they do not provide accurate information about the performance gains and losses caused by using different precisions. Therefore a performance model is introduced in order to compare the performance of the algorithms and to analyse the possible performance gains.
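    For reference, all of these variants build on the classical refinement iteration, which in standard notation (not specific to BCIR) reads:

        \begin{aligned}
        r^{(k)}   &= b - A x^{(k)}     &&\text{(residual, precision } \varepsilon_r\text{)} \\
        A d^{(k)} &= r^{(k)}           &&\text{(correction solve, working precision } \varepsilon_w\text{)} \\
        x^{(k+1)} &= x^{(k)} + d^{(k)} &&\text{(update)}
        \end{aligned}

    The variants differ in how these precisions are chosen: extra precise refinement raises the residual precision, mixed precision refinement lowers the working precision of the solve, and BCIR selects a cascade of working precisions derived from the input data.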

    Diseño de un sistema de comunicaciones para virtualización remota de aceleradores gráficos sobre sistemas heterogéneos

    Get PDF
    Energy consumption is one of the main concerns in the design of any HPC system and has recently been recognized as one of the major challenges on the way to the next milestone in supercomputer performance: one EXAFLOPS. Reaching this ambitious goal requires designing increasingly energy-efficient supercomputers without losing sight of performance. In this context, the incorporation of graphics accelerators into current HPC systems has led to clusters of multi-core machines in which each node is equipped with its own accelerator. In principle, this has increased the energy efficiency of these configurations. However, the accelerators may remain idle much of the time, during which they still consume a significant amount of energy. To achieve a more efficient use of GPUs, several GPU virtualization technologies have been developed that allow GPU-accelerated applications to access a graphics accelerator installed in a remote node. At present, the most prominent solution in terms of robustness, flexibility and efficiency is rCUDA. Another strategy to increase the energy efficiency of clusters consists of replacing nodes based on general-purpose processors, which have a high energy consumption, with a larger number of platforms whose cores offer lower compute capability but draw little electrical power. However, these configurations increase the execution time of HPC applications, which in the long run may translate into higher energy consumption. This research work addresses the design, implementation and evaluation of a communication system for remote GPU virtualization based on rCUDA, using high-performance networks over heterogeneous systems. Specifically, the proposals developed in this thesis exploit the energy savings that can be obtained by applying GPU virtualization in a heterogeneous cluster comprising nodes based on general-purpose processors, low-power multi-core platforms and hybrid (CPU-GPU) architectures, interconnected by high-performance networks that support the RDMA protocol. The experimental evaluation of performance and energy consumption is carried out with a set of applications accelerated with remote GPUs. The evaluation framework covers several configurations representative of future HPC systems, characterized by heterogeneous architectures aimed at increasing compute power while taking energy efficiency into account. The results obtained demonstrate the potential of the proposals developed in this work to increase the energy efficiency of the rCUDA virtualization solution.