956 research outputs found
Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory
New algorithms and optimization techniques are needed to balance the
accelerating trend towards bandwidth-starved multicore chips. It is well known
that the performance of stencil codes can be improved by temporal blocking,
lessening the pressure on the memory interface. We introduce a new pipelined
approach that makes explicit use of shared caches in multicore environments and
minimizes synchronization and boundary overhead. For clusters of shared-memory
nodes we demonstrate how temporal blocking can be employed successfully in a
hybrid shared/distributed-memory environment.Comment: 9 pages, 6 figure
A domain-specific language and matrix-free stencil code for investigating electronic properties of Dirac and topological materials
We introduce PVSC-DTM (Parallel Vectorized Stencil Code for Dirac and
Topological Materials), a library and code generator based on a domain-specific
language tailored to implement the specific stencil-like algorithms that can
describe Dirac and topological materials such as graphene and topological
insulators in a matrix-free way. The generated hybrid-parallel (MPI+OpenMP)
code is fully vectorized using Single Instruction Multiple Data (SIMD)
extensions. It is significantly faster than matrix-based approaches on the node
level and performs in accordance with the roofline model. We demonstrate the
chip-level performance and distributed-memory scalability of basic building
blocks such as sparse matrix-(multiple-) vector multiplication on modern
multicore CPUs. As an application example, we use the PVSC-DTM scheme to (i)
explore the scattering of a Dirac wave on an array of gate-defined quantum
dots, to (ii) calculate a bunch of interior eigenvalues for strong topological
insulators, and to (iii) discuss the photoemission spectra of a disordered Weyl
semimetal.Comment: 16 pages, 2 tables, 11 figure
A class-based approach to parallelization of legacy codes
Computation-intensive legacy codes for numerical models stand to benefit from application of parallel computing. However, parallelization of legacy codes poses special challenges. These codes are very large and complex. Manual parallelization has proven to be extremely time-consuming and error-prone. Furthermore, while a large number of parallelization tools exist, they cannot handle these complex legacy codes. Development of automatic parallelization tools for legacy codes remains a research area of considerable interest;This thesis describes a new approach to automatic parallelization of legacy codes. Our approach focuses on special classes of codes as opposed to parallelization of arbitrary codes. The advantage is that we are able to use high-level knowledge of the special class to manage the complexity of the parallelization problem. This approach provides a pragmatic solution for parallelization: it requires the user to specify the high-level knowledge, but automates tasks which are time-consuming, tedious, and error-prone for the user;Using this new approach, we have developed parAgent--a parallelizing tool which facilitates quick development of efficient parallel codes for legacy Fortran-77 codes based on the explicit time-marching finite difference model. parAgent has been used on several well-known and widely-used Mesoscale Meteorological codes. It took only a few weeks to parallelize each of these legacy codes. Qualitatively, the performance of parallelization have been found to be on par with manual parallelization;This new approach can be applied to a variety of problem domains. The key benefits are: substantial reuse of existing software and considerable saving of time and effort for developing efficient parallel code. Although the new approach and parAgent have been developed with parallelization as the main objective, the information provided by the tool can be used for various purposes. For example, the information about the underlying numerical method and the exchange of data is valuable to the application scientist
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
Starting from a high-level problem description in terms of partial
differential equations using abstract tensor notation, the Chemora framework
discretizes, optimizes, and generates complete high performance codes for a
wide range of compute architectures. Chemora extends the capabilities of
Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient
manner for complex applications, without low-level code tuning. Chemora
achieves parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient loop
traversal strategies, dynamically selected data and instruction cache usage
strategies, and JIT compilation of GPU code tailored to the problem
characteristics. The discretization is based on higher-order finite differences
on multi-block domains. Chemora's capabilities are demonstrated by simulations
of black hole collisions. This problem provides an acid test of the framework,
as the Einstein equations contain hundreds of variables and thousands of terms.Comment: 18 pages, 4 figures, accepted for publication in Scientific
Programmin
- …