586 research outputs found

    General Purpose Computation on Graphics Processing Units Using OpenCL

    Get PDF
    Computational Science has emerged as a third pillar of science along with theory and experiment, where the parallelization for scientific computing is promised by different shared and distributed memory architectures such as, super-computer systems, grid and cluster based systems, multi-core and multiprocessor systems etc. In the recent years the use of GPUs (Graphic Processing Units) for General purpose computing commonly known as GPGPU made it an exciting addition to high performance computing systems (HPC) with respect to price and performance ratio. Current GPUs consist of several hundred computing cores arranged in streaming multi-processors so the degree of parallelism is promising. Moreover with the development of new and easy to use interfacing tools and programming languages such as OpenCL and CUDA made the GPUs suitable for different computation demanding applications such as micromagnetic simulations. In micromagnetic simulations, the study of magnetic behavior at very small time and space scale demands a huge computation time, where the calculation of magnetostatic field with complexity of O(Nlog(N)) using FFT algorithm for discrete convolution is the main contribution towards the whole simulation time, and it is computed many times at each time step interval. This study and observation of magnetization behavior at sub-nanosecond time-scales is crucial to a number of areas such as magnetic sensors, non volatile storage devices and magnetic nanowires etc. Since micromagnetic codes in general are suitable for parallel programming as it can be easily divided into independent parts which can run in parallel, therefore current trend for micromagnetic code concerns shifting the computationally intensive parts to GPUs. My PhD work mainly focuses on the development of highly parallel magnetostatic field solver for micromagnetic simulators on GPUs. I am using OpenCL for GPU implementation, with consideration that it is an open standard for parallel programming of heterogeneous systems for cross platform. The magnetostatic field calculation is dominated by the multidimensional FFTs (Fast Fourier Transform) computation. Therefore i have developed the specialized OpenCL based 3D-FFT library for magnetostatic field calculation which made it possible to fully exploit the zero padded input data with out transposition and symmetries inherent in the field calculation. Moreover it also provides a common interface for different vendors' GPUs. In order to fully utilize the GPUs parallel architecture the code needs to handle many hardware specific technicalities such as coalesced memory access, data transfer overhead between GPU and CPU, GPU global memory utilization, arithmetic computation, batch execution etc. In the second step to further increase the level of parallelism and performance, I have developed a parallel magnetostatic field solver on multiple GPUs. Utilizing multiple GPUs avoids dealing with many of the limitations of GPUs (e.g., on-chip memory resources) by exploiting the combined resources of multiple on board GPUs. The GPU implementation have shown an impressive speedup against equivalent OpenMp based parallel implementation on CPU, which means the micromagnetic simulations which require weeks of computation on CPU now can be performed very fast in hours or even in minutes on GPUs. In parallel I also worked on ordered queue management on GPUs. Ordered queue management is used in many applications including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on usage of a sorting algorithm for priority queues. Lately, the usage of graphic cards for general purpose computing has again revisited sorting algorithms. In this work i have presented the analysis of different sorting algorithms with respect to sorting time, sorting rate and speedup on different GPU and CPU architectures and provided a new sorting technique on GPU

    Status and Future Perspectives for Lattice Gauge Theory Calculations to the Exascale and Beyond

    Full text link
    In this and a set of companion whitepapers, the USQCD Collaboration lays out a program of science and computing for lattice gauge theory. These whitepapers describe how calculation using lattice QCD (and other gauge theories) can aid the interpretation of ongoing and upcoming experiments in particle and nuclear physics, as well as inspire new ones.Comment: 44 pages. 1 of USQCD whitepapers

    Automatic matching of legacy code to heterogeneous APIs: An idiomatic approach

    Get PDF
    Heterogeneous accelerators often disappoint. They provide the prospect of great performance, but only deliver it when using vendor specific optimized libraries or domain specific languages. This requires considerable legacy code modifications, hindering the adoption of heterogeneous computing. This paper develops a novel approach to automatically detect opportunities for accelerator exploitation. We focus on calculations that are well supported by established APIs: sparse and dense linear algebra, stencil codes and generalized reductions and histograms. We call them idioms and use a custom constraint-based Idiom Description Language (IDL) to discover them within user code. Detected idioms are then mapped to BLAS libraries, cuSPARSE and clSPARSE and two DSLs: Halide and Lift. We implemented the approach in LLVM and evaluated it on the NAS and Parboil sequential C/C++ benchmarks, where we detect 60 idiom instances. In those cases where idioms are a significant part of the sequential execution time, we generate code that achieves 1.26× to over 20× speedup on integrated and external GPUs

    PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

    Full text link
    High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie

    Communication reduction techniques in numerical methods and deep neural networks

    Get PDF
    Inter-node communication has turned out to be one of the determining factors of the performance on modern HPC systems. Furthermore, the situation only gets worse with the ever-incresing size of the cores involved. Hence, this thesis explore the various possible techniques to reduce the communication during the execution of a parallel program. It turned out that there is no one-size-fit-all approach to the challenge. Despite this, the problems in each field, due to their unique characteristics, dispose of distinct opportunities for the communication reduction. The thesis, first devles into numerical linear algebra, develops an evolution of the Pipelined CG called IFCG. It eliminates the synchronizations normally take place towards the end of each iteration to increase the parallelism. Secondly, the thesis draws its attention on reducing the necessity to transfer the parameters between the CPU host and GPUs during a neural network training. It develops two routines: ADT and AWP in order to compress and decompress the weights with a reduced data representation format prior and right after the data transfer takes place. The compress rate is adjusted vis-à-vis the L2-norm of the weights of every layer. In the third contribution, the thesis diminish the communication in model parallelizing a deep neural network. Instead of splitting and distributing the neurons of each layer to the available processes on the system, now it is done every other layers. This results in a 50% percent reduction of the communication whereas it introduces 50% of extra local FP computation.La comunicació entre els nodes de computació multi-core sorgeix com un dels factors principals que impacta el rendiment d’un sistema HPC d’avui en dia. I més, mentre més core es pusa, pitjor la situació. Per tant aquesta tesi explora les possibles tècniques per a reduir la comunicació en l’execució d’un programa paral·lel. Tot i això, resulta que no existeix una sola tècnica que pugui resoldre aquest obstacle. Tot i que els problemes en cada àmbit, com que té els seus propis caracristics, disposa variosos oportunitats per la reducció de comunicació. La tesi, en primer lloc, dins de l’àmbit de l’àlgebra lineal numèriques desenvolupa un algoritme IFCG que és una evolució de Pipelined CG. IFCG elimina les sincronitzacions normalment posa cap al final de cada iteració per augmentar el paral·lelisme. En la segona contribució, la tesi dirigeix l’atenció a reduir la necessitat de transferir els paràmetres entre el CPU i els GPUs durant l’entrenament d’una xarxa neuronal. Desenvolupa rutines ADT i AWP per comprimir i descomprimir els pesos amb una representació de dades reduïda abans i just desprès de la transferència. La representació es decideix dinàmicament segons el L2-norm dels pesos a cada capa. Al final la tesi disminueix la comunicació en paral·lelitzar el model duna xarxa neurona. En lloc de distribuir les neurones de cada capa als processos disponibles en el sistema, es fa cada dues capes. Així que corta com mitja de la comunicació. En canvi, com que distribueix només cada dues capes, les capes restes es repliquen, resulta que incorre en una augmenta de 50% de computació local.Postprint (published version

    Communication reduction techniques in numerical methods and deep neural networks

    Get PDF
    Inter-node communication has turned out to be one of the determining factors of the performance on modern HPC systems. Furthermore, the situation only gets worse with the ever-incresing size of the cores involved. Hence, this thesis explore the various possible techniques to reduce the communication during the execution of a parallel program. It turned out that there is no one-size-fit-all approach to the challenge. Despite this, the problems in each field, due to their unique characteristics, dispose of distinct opportunities for the communication reduction. The thesis, first devles into numerical linear algebra, develops an evolution of the Pipelined CG called IFCG. It eliminates the synchronizations normally take place towards the end of each iteration to increase the parallelism. Secondly, the thesis draws its attention on reducing the necessity to transfer the parameters between the CPU host and GPUs during a neural network training. It develops two routines: ADT and AWP in order to compress and decompress the weights with a reduced data representation format prior and right after the data transfer takes place. The compress rate is adjusted vis-à-vis the L2-norm of the weights of every layer. In the third contribution, the thesis diminish the communication in model parallelizing a deep neural network. Instead of splitting and distributing the neurons of each layer to the available processes on the system, now it is done every other layers. This results in a 50% percent reduction of the communication whereas it introduces 50% of extra local FP computation.La comunicació entre els nodes de computació multi-core sorgeix com un dels factors principals que impacta el rendiment d’un sistema HPC d’avui en dia. I més, mentre més core es pusa, pitjor la situació. Per tant aquesta tesi explora les possibles tècniques per a reduir la comunicació en l’execució d’un programa paral·lel. Tot i això, resulta que no existeix una sola tècnica que pugui resoldre aquest obstacle. Tot i que els problemes en cada àmbit, com que té els seus propis caracristics, disposa variosos oportunitats per la reducció de comunicació. La tesi, en primer lloc, dins de l’àmbit de l’àlgebra lineal numèriques desenvolupa un algoritme IFCG que és una evolució de Pipelined CG. IFCG elimina les sincronitzacions normalment posa cap al final de cada iteració per augmentar el paral·lelisme. En la segona contribució, la tesi dirigeix l’atenció a reduir la necessitat de transferir els paràmetres entre el CPU i els GPUs durant l’entrenament d’una xarxa neuronal. Desenvolupa rutines ADT i AWP per comprimir i descomprimir els pesos amb una representació de dades reduïda abans i just desprès de la transferència. La representació es decideix dinàmicament segons el L2-norm dels pesos a cada capa. Al final la tesi disminueix la comunicació en paral·lelitzar el model duna xarxa neurona. En lloc de distribuir les neurones de cada capa als processos disponibles en el sistema, es fa cada dues capes. Així que corta com mitja de la comunicació. En canvi, com que distribueix només cada dues capes, les capes restes es repliquen, resulta que incorre en una augmenta de 50% de computació local
    corecore