Abstract. One of the major challenges facing high performance computing is the daunting task of producing programs that will achieve acceptable levels of performance when run on parallel architectures. Although many organizations have been actively working in this area for some time, many programs have yet to be parallelized. Furthermore, some programs that were parallelized were written for systems that are now obsolete. These programs may run poorly, if at all, on the current generation of parallel computers. Therefore, a straightforward approach to parallelizing vectorizable codes is needed, one that introduces no changes to the algorithm or to the convergence properties of the codes. Combining loop-level parallelism with RISC-based shared memory SMPs has proven to be a successful approach to solving this problem.

Keywords: Parallel programming, high performance computer, supercomputer, loop-level parallelism.
Introduction
One of the major challenges facing high performance computing is the daunting task of producing programs that will achieve acceptable levels of performance when run on parallel architectures. To meet this challenge, a program must simultaneously achieve three goals: 1) Achieve a reasonable level of parallel speedup at an acceptable cost. 2) Demonstrate an acceptable level of serial performance, so that moderate-sized problems do not require enormous levels of resources. 3) Use an algorithm with a high enough level of algorithmic efficiency that the problem remains tractable.
Even though many organizations have been actively working in this area for 5-10 years (or longer), many programs have yet to be parallelized. Furthermore, some programs that were parallelized were written for what are now obsolete systems (e.g., SIMD computers from Thinking Machines and MASPAR), and these programs run poorly, if at all, on the current generation of parallel computers. A further problem is that some approaches to parallelization can subtly change the algorithm and result in convergence problems when large numbers of processors are used [1]. This is a common problem, particularly when using domain decomposition with implicit computational fluid dynamics (CFD) codes. There are algorithmic solutions to this problem (e.g., multigrid codes or the use of a preconditioner); however, many of these solutions have problems in their own right (e.g., poor scalability).
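The effect of decomposition on convergence can be seen in a toy setting. The sketch below is not taken from any of the codes discussed here; the function name `sweeps_to_converge`, the 1-D model problem, and the tolerance are all assumptions chosen for illustration. It runs Gauss-Seidel on a 1-D Laplace problem, but lets each subdomain read its left neighbour's edge value from the previous sweep, as it would if the subdomains were updated concurrently. With one subdomain this is ordinary Gauss-Seidel; with one point per subdomain it degenerates to Jacobi, which converges more slowly.

```python
def sweeps_to_converge(n, blocks, tol=1e-6, max_sweeps=200_000):
    """Count sweeps needed to solve u'' = 0 on n interior points, u(0)=0, u(1)=1.

    blocks=1 is ordinary Gauss-Seidel. With blocks>1, each subdomain runs
    Gauss-Seidel internally but takes its left neighbour's edge value from
    the previous sweep (a stand-in for a halo exchange between processors).
    Assumes blocks divides n evenly.
    """
    u = [0.0] * (n + 2)        # interior points plus the two boundary points
    u[n + 1] = 1.0             # right boundary condition
    size = n // blocks         # points per subdomain
    for sweep in range(1, max_sweeps + 1):
        old = u[:]             # values as they stood before this sweep
        change = 0.0
        for i in range(1, n + 1):
            first_in_block = (i - 1) % size == 0
            # Lagged value across a subdomain boundary, fresh value inside.
            left = old[i - 1] if first_in_block else u[i - 1]
            new = 0.5 * (left + u[i + 1])
            change = max(change, abs(new - u[i]))
            u[i] = new
        if change < tol:
            return sweep
    raise RuntimeError("did not converge within max_sweeps")


if __name__ == "__main__":
    for blocks in (1, 4, 32):
        print(blocks, "subdomain(s):", sweeps_to_converge(32, blocks), "sweeps")
```

In this simplified model the arithmetic per sweep is unchanged, yet the number of sweeps depends on how many subdomains are used. Real implicit CFD solvers are far more complex, so this should be read only as an indication of the mechanism.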
At the other end of the spectrum, there are those who champion automatic parallelization. They expect to soon be able to parallelize production codes for efficient execution on modern production hardware. Unfortunately, as a general rule, this has not happened.
From discussions such as this, talks with numerous researchers, and the authors' research at the U.S. Army Research Laboratory, the following can be concluded:
- Writing parallel programs is a challenge.
- Writing efficient serial programs on today's RISC and CISC processors with their memory hierarchies (i.e., cache) is a challenge.
- Requiring the program to show near-linear scalability out to hundreds or thousands of processors greatly complicates matters.
- Requiring the program to show portable performance across all or most modern parallel architectures greatly complicates matters.
- Modern processors are fast enough that many problems traditionally considered the sole domain of supercomputers may now be solvable on a moderate-sized system (e.g., 10-100 GFLOPS of peak processing power), given a sufficiently efficient algorithm and implementation.

Therefore, a straightforward approach to the parallelization of one or more important classes of codes is needed. This approach must meet the following requirements:
- It will work with a class of machines that has more than one member, but it need not include the entire universe of parallel computers.
- It will not require an unreasonable amount of effort.
- The results achieved by using this approach must satisfy the needs of the user community.
- The combined hardware and software costs must be acceptable.
- At least for small- to moderate-sized problems, it must be possible to complete the project before the equipment is obsolete.

The remainder of this paper will begin by discussing an approach developed at the U.S. Army Research Laboratory that was designed to meet all of these requirements for a large and important class of codes. Following this, the results of applying this
