3,432 research outputs found

    Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS

    Full text link
    GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighborsearching, and we discuss the present and future challenges we see for exascale simulation - in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin

    The inherent overlapping in the parallel calculation of the Laplacian

    Get PDF
    Producción CientíficaA new approach for the parallel computation of the Laplacian in the Fourier domain is presented. This numerical problem inherits the intrinsic sequencing involved in the calculation of any multidimensional Fast Fourier Transform (FFT) where blocking communications assure that its computation is strictly carried out dimension by dimension. Such data dependency vanishes when one considers the Laplacian as the sum of n independent one-dimensional kernels, so that computation and communication can be naturally overlapped with nonblocking communications. Overlapping is demonstrated to be responsible for the speedup figures we obtain when our approach is compared to state-of-the-art parallel multidimensional FFTs.Junta de Castilla León (grant number VA296P18

    Multi-GPU support on the marrow algorithmic skeleton framework

    Get PDF
    Dissertação para obtenção do Grau de Mestre em Engenharia InformáticaWith the proliferation of general purpose GPUs, workload parallelization and datatransfer optimization became an increasing concern. The natural evolution from using a single GPU, is multiplying the amount of available processors, presenting new challenges, as tuning the workload decompositions and load balancing, when dealing with heterogeneous systems. Higher-level programming is a very important asset in a multi-GPU environment, due to the complexity inherent to the currently used GPGPU APIs (OpenCL and CUDA), because of their low-level and code overhead. This can be obtained by introducing an abstraction layer, which has the advantage of enabling implicit optimizations and orchestrations such as transparent load balancing mechanism and reduced explicit code overhead. Algorithmic Skeletons, previously used in cluster environments, have recently been adapted to the GPGPU context. Skeletons abstract most sources of code overhead, by defining computation patterns of commonly used algorithms. The Marrow algorithmic skeleton library is one of these, taking advantage of the abstractions to automate the orchestration needed for an efficient GPU execution. This thesis proposes the extension of Marrow to leverage the use of algorithmic skeletons in the modular and efficient programming of multiple heterogeneous GPUs, within a single machine. We were able to achieve a good balance between simplicity of the programming model and performance, obtaining good scalability when using multiple GPUs, with an efficient load distribution, although at the price of some overhead when using a single-GPU.projects PTDC/EIA-EIA/102579/2008 and PTDC/EIA-EIA/111518/200

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    Get PDF
    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201
    • …
    corecore