956 research outputs found

    Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory

    New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment.
    Comment: 9 pages, 6 figures

    Feedback Driven Annotation and Refactoring of Parallel Programs

    A domain-specific language and matrix-free stencil code for investigating electronic properties of Dirac and topological materials

    We introduce PVSC-DTM (Parallel Vectorized Stencil Code for Dirac and Topological Materials), a library and code generator based on a domain-specific language tailored to implement the specific stencil-like algorithms that can describe Dirac and topological materials such as graphene and topological insulators in a matrix-free way. The generated hybrid-parallel (MPI+OpenMP) code is fully vectorized using Single Instruction Multiple Data (SIMD) extensions. It is significantly faster than matrix-based approaches on the node level and performs in accordance with the roofline model. We demonstrate the chip-level performance and distributed-memory scalability of basic building blocks such as sparse matrix-(multiple-) vector multiplication on modern multicore CPUs. As an application example, we use the PVSC-DTM scheme to (i) explore the scattering of a Dirac wave on an array of gate-defined quantum dots, (ii) calculate a bunch of interior eigenvalues for strong topological insulators, and (iii) discuss the photoemission spectra of a disordered Weyl semimetal.
    Comment: 16 pages, 2 tables, 11 figures

    A class-based approach to parallelization of legacy codes

    Computation-intensive legacy codes for numerical models stand to benefit from application of parallel computing. However, parallelization of legacy codes poses special challenges. These codes are very large and complex. Manual parallelization has proven to be extremely time-consuming and error-prone. Furthermore, while a large number of parallelization tools exist, they cannot handle these complex legacy codes. Development of automatic parallelization tools for legacy codes remains a research area of considerable interest. This thesis describes a new approach to automatic parallelization of legacy codes. Our approach focuses on special classes of codes as opposed to parallelization of arbitrary codes. The advantage is that we are able to use high-level knowledge of the special class to manage the complexity of the parallelization problem. This approach provides a pragmatic solution for parallelization: it requires the user to specify the high-level knowledge, but automates tasks which are time-consuming, tedious, and error-prone for the user. Using this new approach, we have developed parAgent, a parallelizing tool which facilitates quick development of efficient parallel codes for legacy Fortran-77 codes based on the explicit time-marching finite difference model. parAgent has been used on several well-known and widely used mesoscale meteorological codes. It took only a few weeks to parallelize each of these legacy codes. Qualitatively, the performance of the parallelized codes has been found to be on par with that of manual parallelization. This new approach can be applied to a variety of problem domains. The key benefits are substantial reuse of existing software and considerable saving of time and effort for developing efficient parallel code. Although the new approach and parAgent have been developed with parallelization as the main objective, the information provided by the tool can be used for various purposes. For example, the information about the underlying numerical method and the exchange of data is valuable to the application scientist.

    From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

    Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.
    Comment: 18 pages, 4 figures, accepted for publication in Scientific Programming