1,283 research outputs found

    Acceleration of a Full-scale Industrial CFD Application with OP2

    Get PDF

    A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence

    Get PDF
    A hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP for shared memory parallelism is presented. The work is motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence on emerging petascale, high core-count, massively parallel processing systems. The hybrid implementation derives from and augments a well-tested scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new picture for the domain decomposition of the pseudospectral grids, which is helpful in understanding, among other things, the 3D transpose of the global data that is necessary for the parallel fast Fourier transforms that are the central component of the numerical discretizations. Details of the hybrid implementation are provided, and performance tests illustrate the utility of the method. It is shown that the hybrid scheme achieves near ideal scalability up to ~20000 compute cores with a maximum mean efficiency of 83%. Data are presented that demonstrate how to choose the optimal number of MPI processes and OpenMP threads in order to optimize code performance on two different platforms.Comment: Submitted to Parallel Computin

    A parallel compact-TVD method for compressible fluid dynamics employing shared and distributed-memory paradigms

    Get PDF
    A novel multi-block compact-TVD finite difference method for the simulation of compressible flows is presented. The method combines distributed and shared-memory paradigms to take advantage of the configuration of modern supercomputers that host many cores per shared-memory node. In our approach a domain decomposition technique is applied to a compact scheme using explicit flux formulas at block interfaces. This method offers great improvement in performance over earlier parallel compact methods that rely on the parallel solution of a linear system. A test case is presented to assess the accuracy and parallel performance of the new method

    MPI+X: task-based parallelization and dynamic load balance of finite element assembly

    Get PDF
    The main computing tasks of a finite element code(FE) for solving partial differential equations (PDE's) are the algebraic system assembly and the iterative solver. This work focuses on the first task, in the context of a hybrid MPI+X paradigm. Although we will describe algorithms in the FE context, a similar strategy can be straightforwardly applied to other discretization methods, like the finite volume method. The matrix assembly consists of a loop over the elements of the MPI partition to compute element matrices and right-hand sides and their assemblies in the local system to each MPI partition. In a MPI+X hybrid parallelism context, X has consisted traditionally of loop parallelism using OpenMP. Several strategies have been proposed in the literature to implement this loop parallelism, like coloring or substructuring techniques to circumvent the race condition that appears when assembling the element system into the local system. The main drawback of the first technique is the decrease of the IPC due to bad spatial locality. The second technique avoids this issue but requires extensive changes in the implementation, which can be cumbersome when several element loops should be treated. We propose an alternative, based on the task parallelism of the element loop using some extensions to the OpenMP programming model. The taskification of the assembly solves both aforementioned problems. In addition, dynamic load balance will be applied using the DLB library, especially efficient in the presence of hybrid meshes, where the relative costs of the different elements is impossible to estimate a priori. This paper presents the proposed methodology, its implementation and its validation through the solution of large computational mechanics problems up to 16k cores

    From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation

    Full text link
    Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient manner for complex applications, without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.Comment: 18 pages, 4 figures, accepted for publication in Scientific Programmin
    corecore