
    Finite Element Algorithms and Data Structures on Graphical Processing Units

    The finite element method (FEM) is one of the most commonly used techniques for the solution of partial differential equations on unstructured meshes. This paper discusses both the assembly and the solution phases of the FEM, with special attention to the balance of computation and data movement. We present a GPU assembly algorithm that scales to basis functions of arbitrary polynomial degree, at the expense of redundant computation. We show how the storage of the stiffness matrix affects the performance of both the assembly and the solution. We investigate two approaches - global assembly into the CSR and ELLPACK matrix formats, and matrix-free algorithms - and show the trade-off between the amount of indexing data and stiffness data. We discuss the performance of the different approaches in light of the implicit caches on Fermi GPUs, and show a speedup over a two-socket 12-core CPU of up to 10 times in the assembly phase and up to 6 times in the solution phase. We present our sparse matrix-vector multiplication algorithms, which are part of a conjugate gradient iteration, and show that a matrix-free approach may be up to 2 times faster than the global assembly approaches and up to 4 times faster than NVIDIA's cuSPARSE library, depending on the preconditioner used.
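    To make the indexing-versus-stiffness-data trade-off concrete, consider a minimal CSR SpMV kernel in the one-thread-per-row style commonly used on GPUs (the identifiers below are illustrative, not taken from the paper): every nonzero costs one stiffness value plus one column index, and it is exactly this per-nonzero indexing traffic that a matrix-free approach trades for recomputation.

    // Minimal CSR SpMV sketch: one thread per row (illustrative, not the paper's code).
    __global__ void spmv_csr(int nRows, const int *rowPtr, const int *colIdx,
                             const double *val, const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nRows) {
            double sum = 0.0;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[colIdx[j]];  // one index read per stiffness value
            y[row] = sum;
        }
    }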

    Developing a New Storage Format and a Warp-Based SpMV Kernel for Configuration Interaction Sparse Matrices on the GPU

    Configuration interaction (CI) is a post-Hartree–Fock method commonly used to solve the nonrelativistic Schrödinger equation for quantum many-electron systems at molecular scale. CI captures instantaneous electron correlation and can treat the ground state as well as multiple excited states. The CI matrix is sparse, and the larger the CI matrix, the more electron correlation can be captured. Because of its large size, however, much of the time spent on the eigenvalue computations goes into multiplying the CI matrix by numerous dense vectors, i.e., sparse matrix-vector multiplication (SpMV), an operation that also underlies linear systems and eigenvalue problems across many scientific applications. In this work, we develop a new hybrid approach for CI sparse matrices. The proposed model comprises a newly developed hybrid format for storing CI sparse matrices on the Graphics Processing Unit (GPU), together with an SpMV kernel, written in C on the CUDA platform, that multiplies a CI matrix in the proposed format by a vector. The proposed SpMV kernel is a vector kernel that uses the warp approach. We gauged the newly developed model on two primary factors, memory usage and performance. Compared against the cuSPARSE library and the CSR5 (Compressed Sparse Row 5) format, our kernel outperforms both: CSR5 by 250.7% and cuSPARSE by 395.1%. Keywords: CI, SpMV, Linear System, GPU, Kernel, CUDA
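    The paper's hybrid storage format is specific to CI matrices and is not reproduced here, but the "warp approach" of a vector kernel can be sketched on plain CSR: one warp cooperates on each row, with lanes striding across the row's nonzeros and combining their partial sums through warp shuffles. The following is a generic sketch under that assumption, not the authors' kernel.

    // Warp-per-row CSR SpMV sketch (generic illustration of a vector kernel).
    __global__ void spmv_csr_vector(int nRows, const int *rowPtr, const int *colIdx,
                                    const double *val, const double *x, double *y)
    {
        int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
        int lane   = threadIdx.x % 32;
        if (warpId < nRows) {
            double sum = 0.0;
            for (int j = rowPtr[warpId] + lane; j < rowPtr[warpId + 1]; j += 32)
                sum += val[j] * x[colIdx[j]];       // lanes read consecutive nonzeros
            for (int off = 16; off > 0; off >>= 1)  // warp-level reduction
                sum += __shfl_down_sync(0xffffffff, sum, off);
            if (lane == 0)
                y[warpId] = sum;
        }
    }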

    Energy Consumption of Iterative Methods for Sparse Systems on Graphics Processors

    Solving large sparse systems of linear equations is one of the most common operations in scientific and engineering applications. Their growing size motivates the development of Green Computing techniques for designing energy-aware applications in which energy efficiency is the top priority. This doctoral thesis designs a methodology based on CUDA kernel fusion techniques that reduces the number of kernels and, with it, launch overheads and data transfers. Its use, together with synchronizing the GPUs in blocking mode, reduces the energy consumption of heterogeneous CPU-GPU computing systems. These techniques are of particular interest on GPUs that support dynamic parallelism. Applying this methodology to the solution of sparse linear systems shows notable improvements in energy efficiency, striking a balance between performance and energy consumption.
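    As a rough illustration of both techniques (the thesis's actual implementation is not shown here, and the kernel below is a generic sketch), one can fuse an AXPY update with the start of a dot-product reduction, saving one kernel launch and one pass over the data per iteration, and select blocking synchronization so that the waiting CPU thread sleeps instead of spin-waiting on the GPU:

    #include <cuda_runtime.h>

    // Fused AXPY + partial dot product: y += alpha*x, then each block reduces
    // y.y into a per-block partial sum, replacing two kernel launches with one.
    __global__ void fused_axpy_dot(int n, double alpha, const double *x,
                                   double *y, double *partial)
    {
        __shared__ double sh[256];  // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        double v = 0.0;
        if (i < n) {
            y[i] += alpha * x[i];
            v = y[i] * y[i];
        }
        sh[threadIdx.x] = v;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) sh[threadIdx.x] += sh[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) partial[blockIdx.x] = sh[0];
    }

    int main()
    {
        // Blocking synchronization: the waiting CPU thread yields the core,
        // trading a little wake-up latency for lower CPU energy consumption.
        cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
        // ... allocate buffers, launch fused_axpy_dot<<<blocks, 256>>>(...),
        // finish the reduction over the partial sums, cudaDeviceSynchronize() ...
        return 0;
    }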

    Performance Optimization Of GPU ELF-Codes

    GPUs (Graphics Processing Units) are of interest for their favorable GF/s-to-price ratio. Compared to their beginnings in the early 1980s, today's GPU architectures are much more similar to general-purpose architectures, but with (much) larger numbers of cores - the GF100 architecture released by NVIDIA in 2009-2010, for example, has a true hardware cache hierarchy, a unified memory address space, double-precision support, and up to 512 cores. Exploiting the computational power of GPUs for non-graphics applications has, however, always been hard. Initially, in the early 2000s, the only way to program GPUs was through graphics library APIs, which made writing non-graphics codes tedious at best and virtually impossible at worst. In 2003, the Brook compiler and runtime system gave users the ability to generate GPU code from a high-level programming language. In 2006 NVIDIA introduced CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model developed specifically for its GPUs, which further eases general-purpose GPU programming. Code written in CUDA is portable between different NVIDIA GPU architectures, which is one of the reasons why NVIDIA claims that user productivity is much higher than with previous solutions. Optimizing GPU code for utmost performance nevertheless remains very hard, especially for NVIDIA GPUs based on the GF100 architecture - e.g., Fermi GPUs and some Tesla GPUs - because a) the real instruction set architecture (ISA) is not publicly available, b) the code of the NVIDIA compiler, nvcc, is not open, and c) users cannot write code directly in the real assembly - ELF, in NVIDIA parlance. Compilers, while enabling immense increases in programmer productivity by eliminating the need to code at the (tedious) assembly level, are to date incapable of achieving performance similar to that of an expert assembly programmer with good knowledge of the underlying architecture. In fact, it is widely accepted that high-level language programming, even with state-of-the-art compilers, loses on average a factor of 3 in performance - and sometimes much more - relative to what a good assembly programmer could achieve, and that is on a conventional, simple, single-core machine. Compilers for more complex machines, such as NVIDIA GPUs, are likely to do much worse because, among other things, they face (even more) complex trade-offs between often undecidable and NP-hard problems. However, because NVIDIA a) makes it virtually impossible to gain access to the actual assembly language used by its GF100 architecture, b) does not publicly explain many of the internal mechanisms implemented in its compiler, nvcc, and c) makes it virtually impossible to learn the details of its very complex GF100 architecture in sufficient detail to exploit them, obtaining an estimate of the performance difference between CUDA programming and machine-level programming for GF100 GPUs - let alone achieving a priori guarantees of shortest execution time - had been impossible prior to this work. To optimize GPU code, users must use CUDA or PTX (Parallel Thread Execution), a virtual instruction set architecture. The CUDA or PTX files are given as input to nvcc, which produces fatbin files as output.
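    For concreteness, this compilation flow can be driven with a trivial kernel and the standard toolchain (the flags shown are the usual ones for Fermi-class, sm_20, targets, not necessarily the dissertation's exact invocations):

    // saxpy.cu - a trivial kernel to push through the nvcc toolchain.
    //
    // Compile the CUDA source into a fatbin for a chosen target architecture:
    //   nvcc -arch=sm_20 -fatbin saxpy.cu -o saxpy.fatbin
    // Disassemble the embedded ELF into SASS to inspect what will actually run:
    //   cuobjdump -sass saxpy.fatbin
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }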
A fatbin file is produced for the target GPU architecture selected by the user, via a flag passed to nvcc. In a fatbin file, zero or more parts will be executed by the CPU - think of these as the C/C++ parts - while the remaining parts - think of these as the ELF parts - will be executed by the specific GPU model for which the CUDA or PTX file was compiled. The fatbin files are usually very different from the corresponding CUDA or PTX files, and this lack of control can completely ruin any effort made at the CUDA or PTX level to optimize the ELF parts of the fatbin file that will be executed by the target GPU. We therefore reverse engineer the real ISA used by the GF100 architecture and derive a set of editing guidelines that force nvcc to generate fatbin files with at least the minimum resources needed to later modify them into the desired ELF algorithmic implementations - this gives control over the ELF code executed by any GPU based on the GF100 architecture. During the reverse engineering we also discover all the correspondences between PTX instructions and ELF instructions - a single PTX instruction can be transformed into one or more ELF instructions - and between PTX registers and ELF registers. Our procedure is completely repeatable for any NVIDIA Kepler GPU - we do not need to rewrite our code. Being able to obtain the desired ELF algorithmic implementations is not, however, enough to optimize the ELF code of a fatbin file: we also need to discover, understand, and quantify several undisclosed GPU behaviors that can slow down the execution of ELF code. This is necessary to understand how to carry out the optimization process, and while we cannot report all of our results here, we explain a) how to force an even distribution of the GPU thread blocks across the streaming multiprocessors; b) how we discovered and quantified several warp-scheduling phenomena; c) how to avoid warp-scheduling load imbalance, which cannot be controlled directly, within the streaming multiprocessors; d) how we determined, for each ELF instruction, the minimum time that must elapse before a warp scheduler can schedule the warp again - yes, this time can differ between ELF instructions; e) how we determined the time that must elapse before data in a previously read or written register can be read again - this too can differ between ELF instructions, and depends on whether the data was previously read or written; and f) how we discovered a warp-management overhead that does not grow linearly with a linear increase in the number of resident warps on a streaming multiprocessor.
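The dissertation's measurement harness is not reproduced here, but per-instruction latencies of this kind are commonly probed with the GPU's on-chip cycle counter; the sketch below (standard CUDA only, illustrative names) times a dependent chain of operations from a single warp, so total cycles divided by chain length approximates one instruction's reissue latency:

    // Latency probe sketch: time a dependent FMA chain with clock64().
    // Launched with a single warp and no ILP, (stop - start) / ITERS
    // approximates the read-after-write latency of the generated instruction.
    __global__ void latency_probe(float *out, long long *cycles)
    {
        const int ITERS = 1024;
        float v = out[0];
        long long start = clock64();
        #pragma unroll
        for (int i = 0; i < ITERS; ++i)
            v = v * v + 1.0f;      // each iteration depends on the previous one
        long long stop = clock64();
        out[0] = v;                // keep the chain live past dead-code elimination
        cycles[0] = stop - start;
    }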
Next we explain a) the transformations that must be applied to the ELF code of a fatbin file to optimize it, making its execution time as short as possible; b) why the fatbin files generated from the original fatbin file during the optimization process need to be classified, and how we do so using several criteria whose end result is to place each generated fatbin file in a taxonomy we have created; c) how a fatbin file's position in the taxonomy determines whether it is eligible for an empirical analysis - which we explain - a theoretical analysis, or both; and d) how, for a fatbin file eligible for the theoretical analysis we have devised, we carry out that analysis and give an a priori guarantee - without any prior execution of the fatbin file, provided it satisfies all the requirements of the analysis - of the shortest execution time of the ELF code that will be executed by the target GPU for which the fatbin file was compiled.