    Acceleration of Linear Finite-Difference Poisson-Boltzmann Methods on Graphics Processing Units

    Electrostatic interactions play crucial roles in biophysical processes such as protein folding and molecular recognition. Models based on the Poisson-Boltzmann equation (PBE) are widely used to model these processes. Although great effort has gone into developing efficient numerical PBE models, challenges remain due to the high dimensionality of typical biomolecular systems. In this study, we implemented and analyzed commonly used linear PBE solvers for biomolecular simulations on ever-improving graphics processing units (GPUs), including both standard and preconditioned conjugate gradient (CG) solvers with several alternative preconditioners. Our implementation uses the standard Nvidia CUDA libraries cuSPARSE, cuBLAS, and CUSP. Extensive tests show that good numerical accuracy can be achieved even though single precision is often used for numerical applications on GPU platforms. The best GPU performance was observed with the Jacobi-preconditioned CG solver, which gave a significant speedup over the standard CG solver on CPU across our diverse test cases. Our analysis further shows that the matrix storage format considerably affects the efficiency of the different linear PBE solvers on GPU, with the diagonal format best suited to our standard finite-difference linear systems. Further efficiency may be possible with matrix-free operations and an integrated grid stencil setup specifically tailored to the banded matrices in PBE-specific linear systems. Comment: 5 figures, 2 table
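
As a rough illustration of the combination highlighted in the abstract above (a Jacobi-preconditioned CG solver operating on a matrix stored by diagonals), the following minimal CPU sketch in plain Python may help. It is not the authors' CUDA implementation: the function names, the row-indexed diagonal layout, and the toy 1D system with a varying diagonal shift are all assumptions made for illustration only.

```python
from math import sqrt

def dia_matvec(offsets, data, x):
    """y = A @ x for A stored by diagonals.

    Assumed layout: data[d][i] holds A[i][i + offsets[d]]; entries that
    would fall outside the matrix are simply never read.
    """
    n = len(x)
    y = [0.0] * n
    for off, diag in zip(offsets, data):
        for i in range(max(0, -off), min(n, n - off)):
            y[i] += diag[i] * x[i + off]
    return y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def jacobi_pcg(offsets, data, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient with a Jacobi (inverse main diagonal) preconditioner."""
    n = len(b)
    inv_diag = [1.0 / d for d in data[offsets.index(0)]]
    x = [0.0] * n
    r = list(b)                                   # residual for a zero initial guess
    z = [m * ri for m, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    b_norm = sqrt(dot(b, b)) or 1.0
    for it in range(1, max_iter + 1):
        Ap = dia_matvec(offsets, data, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sqrt(dot(r, r)) <= tol * b_norm:       # relative residual test
            return x, it
        z = [m * ri for m, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

# Hypothetical toy system: 1D finite-difference operator with a varying
# non-negative shift on the diagonal, which keeps the matrix symmetric
# positive definite and makes the Jacobi scaling non-trivial.
n = 50
shift = [0.1 * (i % 5) for i in range(n)]
offsets = [-1, 0, 1]
data = [[-1.0] * n, [2.0 + s for s in shift], [-1.0] * n]
b = [1.0] * n

x, iters = jacobi_pcg(offsets, data, b)
residual = [bi - yi for bi, yi in zip(b, dia_matvec(offsets, data, x))]
```

Storing only the three diagonals avoids per-entry column indices, which is one plausible reason a diagonal format suits banded finite-difference systems; on a GPU each inner loop over `i` would map naturally onto parallel threads.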

    Communication reduction techniques in numerical methods and deep neural networks

    Inter-node communication has turned out to be one of the determining factors of performance on modern HPC systems, and the situation only gets worse as the number of cores involved keeps increasing. Hence, this thesis explores possible techniques to reduce communication during the execution of a parallel program. It turns out that there is no one-size-fits-all approach to the challenge. Nevertheless, the problems in each field, owing to their unique characteristics, offer distinct opportunities for communication reduction. First, the thesis delves into numerical linear algebra and develops IFCG, an evolution of Pipelined CG. IFCG eliminates the synchronizations that normally take place towards the end of each iteration in order to increase parallelism. Second, the thesis turns its attention to reducing the need to transfer parameters between the CPU host and the GPUs during neural network training. It develops two routines, ADT and AWP, that compress and decompress the weights using a reduced data representation format just before and right after the data transfer takes place. The compression rate is adjusted dynamically according to the L2-norm of the weights of each layer. In the third contribution, the thesis reduces the communication incurred when model-parallelizing a deep neural network. Instead of splitting and distributing the neurons of every layer across the available processes on the system, this is done only for every other layer, and the remaining layers are replicated. This results in a 50% reduction in communication, at the cost of 50% extra local floating-point computation. Postprint (published version

    High Performance Matrix-Free Method for Large-Scale Finite Element Analysis on Graphics Processing Units

    This thesis presents a high performance computing (HPC) algorithm on graphics processing units (GPUs) for large-scale numerical simulations. In particular, the research focuses on the development of an efficient matrix-free conjugate gradient solver for the acceleration and scalability of steady-state heat transfer finite element analysis (FEA) on a three-dimensional uniform structured hexahedral mesh using a voxel-based technique. One of the greatest challenges in large-scale FEA is the availability of computer memory for solving the linear system of equations. In large-scale heat transfer simulations, where the assembled system matrix becomes very large, the FEA solver requires huge amounts of computational time and memory that very often exceed the memory limits of the available hardware. To overcome this problem, a matrix-free conjugate gradient (MFCG) method that avoids the global matrix assembly is designed and implemented for finite element computations. The main difference between the MFCG and the classical conjugate gradient (CG) solver lies in the implementation of the matrix-vector product. The matrix-vector operation was found to be the most expensive step, accounting for more than 80% of the total computation in the numerical solution, so a matrix-free matrix-vector (MFMV) approach saves both memory and computational time throughout the execution of the FEA. In summary, the MFMV algorithm consists of three nested loops: (a) a loop over the mesh elements of the domain, (b) a loop over the element nodal values to perform the element matrix-vector operations, and (c) the summation and transformation of the nodal values into their correct positions in the global index. A performance analysis of a serial implementation and a parallel implementation on a GPU shows that the MFCG solver outperforms the classical CG while consuming significantly less memory, allowing for much larger simulations.
The outcome of this study suggests that the MFCG can also speed up and scale the execution of large-scale finite element simulations.
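
The three MFMV steps (a) to (c) described above can be sketched on a much smaller problem. The example below is only a hedged illustration in plain Python: it uses a 1D uniform mesh of two-node linear elements rather than the thesis's 3D voxel mesh of hexahedra, and every function name is hypothetical. As on a voxel mesh, all elements share one element matrix, so no global matrix is ever stored.

```python
def element_matrix(k, h):
    """2x2 conductivity matrix of a linear 1D element (conductivity k, length h)."""
    c = k / h
    return [[c, -c], [-c, c]]

def matrix_free_matvec(n_elems, ke, x):
    """y = K @ x computed element by element, without assembling K."""
    y = [0.0] * (n_elems + 1)
    for e in range(n_elems):                 # (a) loop over mesh elements
        nodes = (e, e + 1)                   # global node indices of element e
        xe = [x[g] for g in nodes]           # gather the element's nodal values
        # (b) element-level matrix-vector product
        ye = [sum(ke[a][b] * xe[b] for b in range(2)) for a in range(2)]
        for a, g in enumerate(nodes):        # (c) scatter-add into global positions
            y[g] += ye[a]
    return y

def assembled_matvec(n_elems, ke, x):
    """Reference path: assemble the dense global matrix, then multiply."""
    n = n_elems + 1
    K = [[0.0] * n for _ in range(n)]
    for e in range(n_elems):
        for a in range(2):
            for b in range(2):
                K[e + a][e + b] += ke[a][b]
    return [sum(K[i][j] * x[j] for j in range(n)) for i in range(n)]

n_elems = 8
ke = element_matrix(k=1.0, h=1.0 / n_elems)
x = [float(i * i) for i in range(n_elems + 1)]

y_free = matrix_free_matvec(n_elems, ke, x)
y_ref = assembled_matvec(n_elems, ke, x)
```

Inside a CG iteration, `matrix_free_matvec` would take the place of the assembled product. In a parallel version the scatter-add in step (c) is where concurrent writes to shared nodes need care, for example through atomic additions or element coloring.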

    Austrian High-Performance-Computing meeting (AHPC2020)

    This booklet is a collection of abstracts presented at the AHPC conference

    Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

    This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among processors, together with structural diversity among sparse matrices, leads to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this end, we present an approach that first identifies the performance bottlenecks of SpMV for a given sparse matrix on the target platform, either through profiling or by matrix property inspection, and then selects suitable optimizations to tackle those bottlenecks. Our optimization pool is based on the widely used Compressed Sparse Row (CSR) sparse matrix storage format and has low preprocessing overheads, making our overall approach practical even in cases where fast decision making and optimization setup are required. We evaluate our optimizer on three x86-based computing platforms and demonstrate that it is able to distinguish and appropriately optimize SpMV for the majority of matrices in a representative test suite, leading to significant speedups over the CSR and Inspector-Executor CSR SpMV kernels available in the latest release of the Intel MKL library. Comment: 10 pages, 7 figures, ICPP 201
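
For readers unfamiliar with the CSR format that the abstract above builds on, here is a minimal baseline sketch in plain Python. It shows only the textbook CSR layout and SpMV loop, not the paper's optimizer; the helper `row_length_stats` is a hypothetical example of the kind of cheap matrix property one might inspect, and is not taken from the paper.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))   # row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    return values, col_idx, row_ptr

def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x with A in CSR; x is read indirectly through col_idx."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

def row_length_stats(row_ptr):
    """Hypothetical inspection helper: (min, max, mean) nonzeros per row."""
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    return min(lengths), max(lengths), sum(lengths) / len(lengths)

dense = [
    [4, 0, 0, 1],
    [0, 3, 0, 0],
    [2, 0, 5, 0],
    [0, 0, 0, 6],
]
values, col_idx, row_ptr = dense_to_csr(dense)
y = csr_spmv(values, col_idx, row_ptr, [1, 2, 3, 4])
stats = row_length_stats(row_ptr)
```

The indirect access `x[col_idx[k]]` and the varying row lengths are the usual places where SpMV performance depends on the matrix structure, which is why a per-matrix choice of optimization can pay off.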

    Multi-GPU acceleration of large-scale density-based topology optimization

    This work presents a parallel implementation of density-based topology optimization using distributed GPU computing systems. The use of multiple GPU devices allows us to accelerate the computation and to increase the device memory available for GPU computing. This additional device memory enables us to address large models that commonly do not fit into one GPU device. Most modern scientific computers incorporate these devices to build energy-efficient, low-cost systems with high computing power. However, proper techniques must be adopted to take advantage of the computational resources of such high-performance many-core computing systems. It is well known that the bottleneck of density-based topology optimization is solving the linear elasticity problem using Finite Element Analysis (FEA) during the topology optimization iterations. We solve the linear system of equations obtained from FEA using a distributed conjugate gradient solver preconditioned by a smooth aggregation-based algebraic multigrid (AMG) using GPU computing with multiple devices. The use of aggregation-based AMG reduces memory requirements and improves the efficiency of the interpolation operation, which benefits GPU computing. We evaluate the performance and scalability of the distributed GPU system using structured and unstructured meshes. We also test the performance using different 3D finite elements and relaxing operators. In addition, we evaluate the use of numerical approaches to increase the topology optimization performance.
Finally, we present a comparison between the many-core computing instance and one efficient multi-core implementation to highlight the advantages of using GPU computing in large-scale density-based topology optimization problems. This work has been supported by the AEI/FEDER and UE under the contract DPI2016-77538-R, and by the “Fundación Séneca – Agencia de Ciencia y Tecnología de la Región de Murcia” of Spain under the contract 20911/PI/18