Effects of partitioning and scheduling sparse matrix factorization on communication and load balance
A block-based, automatic partitioning and scheduling methodology is presented for sparse matrix factorization on distributed-memory systems. Using experimental results, this technique is analyzed for communication and load imbalance overhead. To study the performance effects, these overheads were compared with those obtained from a straightforward 'wrap-mapped' column assignment scheme. All experimental results were obtained using test sparse matrices from the Harwell-Boeing data set. The results show that there is a tradeoff between communication and load balance: the block-based method results in lower communication cost, whereas the wrap-mapped scheme gives better load balance.
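The load-balance side of this tradeoff can be illustrated with a small sketch. It compares a wrap (cyclic) column mapping against a plain contiguous-block mapping. The contiguous mapping is only a simplified stand-in and is not the automatic block-based partitioner of the paper; the function names and the per-column work model (a dense triangular factor of order 8) are assumptions made for illustration.

```python
def wrap_map(n_cols, n_procs):
    """Wrap (cyclic) mapping: column j goes to processor j mod n_procs."""
    return [j % n_procs for j in range(n_cols)]


def block_map(n_cols, n_procs):
    """Contiguous mapping: consecutive columns share a processor.

    Simplified stand-in, NOT the paper's block-based partitioner.
    """
    size = -(-n_cols // n_procs)  # ceiling division
    return [j // size for j in range(n_cols)]


def imbalance(owner, work, n_procs):
    """Heaviest processor load divided by the average load (1.0 is ideal)."""
    loads = [0.0] * n_procs
    for j, p in enumerate(owner):
        loads[p] += work[j]
    return max(loads) / (sum(loads) / n_procs)


# Hypothetical work model: column j of a dense triangular factor
# of order 8 touches 8 - j rows.
work = [8 - j for j in range(8)]
wrap = imbalance(wrap_map(8, 4), work, 4)
block = imbalance(block_map(8, 4), work, 4)
print(f"wrap-mapped imbalance: {wrap:.3f}, contiguous imbalance: {block:.3f}")
```

Under this toy model the wrap mapping spreads heavy and light columns across processors, so its imbalance ratio is lower; the sketch says nothing about communication cost, which is the other half of the tradeoff.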
OPTIMIZATION ISSUES IN FINITE ELEMENT CODES FOR SOLVING OPEN DOMAIN 3D ELECTROMAGNETIC SCATTERING AND CONFORMAL ANTENNA PROBLEMS
The first part of the paper presents the implementation and performance of a new absorbing boundary condition (ABC) for truncating finite element meshes. This ABC can be applied conformally to the surface of the structure for scattering and antenna radiation calculations. Consequently, the computational domain is reduced dramatically, thus allowing the simulation of much larger structures, and results are presented for three-dimensional bodies. The latter part of the paper discusses optimization issues relating to the solver's CPU speed on parallel and vector processors. It is shown that a jagged diagonal storage scheme leads to a four-fold increase in the FLOP rate of the code, and a standard matrix profile reduction algorithm substantially reduces the inter-processor communication. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/50414/1/241_ftp.pd
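For readers unfamiliar with jagged diagonal storage (JDS), the following is a minimal sketch of the general technique, not the code from the paper. Rows are sorted by decreasing nonzero count, and the k-th nonzero of every row is stored contiguously as the k-th "jagged diagonal", which gives long stride-one inner loops that suit vector processors. The input layout (one list of `(col, value)` pairs per row) and the function names are assumptions.

```python
def to_jds(rows):
    """Convert rows (a list of [(col, value), ...] lists) to JDS arrays."""
    # Permutation that orders rows by decreasing number of nonzeros.
    perm = sorted(range(len(rows)), key=lambda r: -len(rows[r]))
    max_len = len(rows[perm[0]]) if rows else 0
    vals, cols, start = [], [], [0]
    for k in range(max_len):
        for r in perm:
            if len(rows[r]) <= k:
                break  # rows are sorted by length, so later rows are shorter too
            c, v = rows[r][k]
            cols.append(c)
            vals.append(v)
        start.append(len(vals))  # end of the k-th jagged diagonal
    return perm, vals, cols, start


def jds_matvec(jds, x):
    """Compute y = A x from the JDS arrays produced by to_jds."""
    perm, vals, cols, start = jds
    y_perm = [0.0] * len(perm)
    for k in range(len(start) - 1):
        lo, hi = start[k], start[k + 1]
        for t in range(hi - lo):  # long, stride-one loop over one diagonal
            y_perm[t] += vals[lo + t] * x[cols[lo + t]]
    # Undo the row permutation.
    y = [0.0] * len(perm)
    for pos, r in enumerate(perm):
        y[r] = y_perm[pos]
    return y


rows = [[(0, 2.0)], [(0, 1.0), (1, 3.0), (2, 4.0)], [(1, 5.0), (2, 6.0)]]
y = jds_matvec(to_jds(rows), [1.0, 2.0, 3.0])
print(y)
```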
Run-time parallelization and scheduling of loops
Run-time methods are studied for automatically parallelizing and scheduling the iterations of a do loop in cases where compile-time information is inadequate. The methods presented involve execution-time preprocessing of the loop. At compile time, these methods set up the framework for performing a loop dependency analysis. At run time, wavefronts of concurrently executable loop iterations are identified. Using this wavefront information, loop iterations are reordered for increased parallelism. Symbolic transformation rules are used to produce inspector procedures, which perform the execution-time preprocessing, and executors, which are transformed versions of the source code loop structures. These transformed loop structures carry out the calculations planned in the inspector procedures. Performance results are presented from experiments conducted on the Encore Multimax. These results illustrate that run-time reordering of loop indices can have a significant impact on performance. Furthermore, the overheads associated with this type of reordering are amortized when the loop is executed several times with the same dependency structure.
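The inspector/executor split described above can be sketched for a loop of the form `x[i] = x[i] + b[i] * x[ia[i]]`, where the index array `ia` is known only at run time. This is an illustrative sketch, not the paper's generated code: the loop body, the function names, and the restriction `ia[i] <= i` (so that only flow dependences on earlier iterations occur) are all assumptions. The executor here runs each wavefront sequentially; the iterations inside one wavefront are independent and could be run concurrently.

```python
def inspector(ia):
    """Group iterations into wavefronts from the run-time index array ia."""
    assert all(j <= i for i, j in enumerate(ia)), "sketch assumes ia[i] <= i"
    wave = [0] * len(ia)
    for i, j in enumerate(ia):
        if j < i:  # iteration i reads the value written by iteration j
            wave[i] = wave[j] + 1
    n_waves = max(wave, default=-1) + 1
    fronts = [[] for _ in range(n_waves)]
    for i, w in enumerate(wave):
        fronts[w].append(i)
    return fronts


def executor(fronts, x, b, ia):
    """Run the loop body wavefront by wavefront, updating x in place."""
    for front in fronts:
        for i in front:  # iterations in one front do not depend on each other
            x[i] = x[i] + b[i] * x[ia[i]]
    return x


ia = [0, 0, 1, 0, 3, 2]
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fronts = inspector(ia)  # computed once, reusable while ia is unchanged
reordered = executor(fronts, [1.0] * 6, b, ia)

# Reference: the original sequential loop.
sequential = [1.0] * 6
for i in range(6):
    sequential[i] = sequential[i] + b[i] * sequential[ia[i]]
print(fronts, reordered == sequential)
```

Because `fronts` depends only on `ia`, the inspector cost is paid once and amortized over every later execution of the loop with the same dependency structure.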
Software Support for Irregular and Loosely Synchronous Problems
A large class of scientific and engineering applications may be classified as irregular and loosely synchronous from the perspective of parallel processing. We present a partial classification of such problems. This classification has motivated us to enhance Fortran D to provide language support for irregular, loosely synchronous problems. We present techniques for parallelization of such problems in the context of Fortran D.
Compiler Support for Sparse Tensor Computations in MLIR
Sparse tensors arise in problems in science, engineering, machine learning,
and data analytics. Programs that operate on such tensors can exploit sparsity
to reduce storage requirements and computational time. Developing and
maintaining sparse software by hand, however, is a complex and error-prone
task. Therefore, we propose treating sparsity as a property of tensors, not a
tedious implementation task, and letting a sparse compiler generate sparse code
automatically from a sparsity-agnostic definition of the computation. This
paper discusses integrating this idea into MLIR.
Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs
Sparse linear iterative solvers are essential for many large-scale
simulations. Much of the runtime of these solvers is often spent in the
implicit evaluation of matrix polynomials via a sequence of sparse
matrix-vector products. A variety of approaches has been proposed to make these
polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial
preconditioners or s-step Krylov methods. Furthermore, it is nowadays a popular
practice to approximate triangular solves by a matrix polynomial to increase
parallelism. Such algorithms allow the polynomial to be evaluated using a so-called
matrix power kernel (MPK), which computes the product between a power of a
sparse matrix A and a dense vector x, or a related operation. Recently we have
shown that using the level-based formulation of sparse matrix-vector
multiplications in the Recursive Algebraic Coloring Engine (RACE) framework we
can perform temporal cache blocking of MPK to increase its performance. In this
work, we demonstrate the application of this cache-blocking optimization in
sparse iterative solvers.
By integrating the RACE library into the Trilinos framework, we demonstrate
the speedups achieved in (preconditioned) s-step GMRES, polynomial
preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we
achieve speedups of up to 3x on modern multi-core compute nodes. For algorithms
with moderate contributions from subspace orthogonalization, the gain reduces
significantly, which is often caused by the insufficient quality of the
orthogonalization routines. Finally, we showcase the application of
RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind)
and highlight the new opportunities and perspectives opened up by RACE as a
cache-blocking technique for MPK-enabled sparse solvers. Comment: 25 pages, 11 figures, 3 tables
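As a point of reference for the matrix power kernel mentioned in this abstract, here is a minimal, unoptimized sketch that computes x, Ax, ..., A^p x by repeated sparse matrix-vector products on a CSR matrix. It is the naive baseline only: it does not implement the level-based temporal cache blocking of RACE, and the function names and CSR argument layout are assumptions.

```python
def csr_matvec(indptr, indices, data, x):
    """Return y = A x for a matrix A stored in CSR format."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y


def matrix_power_kernel(indptr, indices, data, x, p):
    """Return the list [x, A x, ..., A^p x] using p back-to-back SpMVs."""
    out = [list(x)]
    for _ in range(p):
        # Each sweep streams the whole matrix from memory again; temporal
        # blocking schemes aim to reuse cached matrix data across sweeps.
        out.append(csr_matvec(indptr, indices, data, out[-1]))
    return out


# A = [[2, 1], [0, 3]] in CSR form.
indptr, indices, data = [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0]
powers = matrix_power_kernel(indptr, indices, data, [1.0, 1.0], 2)
print(powers)
```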