9,536 research outputs found
Distributed-memory parallelization of an explicit time-domain volume integral equation solver on Blue Gene/P
Two distributed-memory schemes for efficiently parallelizing the explicit marching-on in-time based solution of the time domain volume integral equation on the IBM Blue Gene/P platform are presented. In the first scheme, each processor stores the time history of all source fields and only the computationally dominant step of the tested field computations is distributed among processors. This scheme requires all-to-all global communications to update the time history of the source fields from the tested fields. In the second scheme, the source fields as well as all steps of the tested field computations are distributed among processors. This scheme requires sequential global communications to update the time history of the distributed source fields from the tested fields. Numerical results demonstrate that both schemes scale well on the IBM Blue Gene/P platform and the memory efficient second scheme allows for the characterization of transient wave interactions on composite structures discretized using three million spatial elements without an acceleration algorithm
MPI+X: task-based parallelization and dynamic load balance of finite element assembly
The main computing tasks of a finite element code(FE) for solving partial
differential equations (PDE's) are the algebraic system assembly and the
iterative solver. This work focuses on the first task, in the context of a
hybrid MPI+X paradigm. Although we will describe algorithms in the FE context,
a similar strategy can be straightforwardly applied to other discretization
methods, like the finite volume method. The matrix assembly consists of a loop
over the elements of the MPI partition to compute element matrices and
right-hand sides and their assemblies in the local system to each MPI
partition. In a MPI+X hybrid parallelism context, X has consisted traditionally
of loop parallelism using OpenMP. Several strategies have been proposed in the
literature to implement this loop parallelism, like coloring or substructuring
techniques to circumvent the race condition that appears when assembling the
element system into the local system. The main drawback of the first technique
is the decrease of the IPC due to bad spatial locality. The second technique
avoids this issue but requires extensive changes in the implementation, which
can be cumbersome when several element loops should be treated. We propose an
alternative, based on the task parallelism of the element loop using some
extensions to the OpenMP programming model. The taskification of the assembly
solves both aforementioned problems. In addition, dynamic load balance will be
applied using the DLB library, especially efficient in the presence of hybrid
meshes, where the relative costs of the different elements is impossible to
estimate a priori. This paper presents the proposed methodology, its
implementation and its validation through the solution of large computational
mechanics problems up to 16k cores
- …