Since its beginnings in the 1950's, the Fortran language has been the language of choice for most scientific and engineering programming. Compilers, seizing on the simplicity of the language, have historically generated highly-optimized machine code. High performance Fortran (HPF) is a relative new addition to the Fortran dialect. It is an attempt to provide an efficient high-level Fortran parallel programming language for the latest generation of parallel computers. Its success has been debatable. By operating at a high level, the HPF standard does not provide some low-level detail required to achieve maximum performance in a multiprocessor system. Message passing using highly-tuned libraries, such as the message passing interface (MPI), will more often than not require less wall clock time than a comparable HPF code. However, the HPF language and its compilers continue to mature and improve. HPF provides a convenient way to represent parallelism for those most comfortable with data parallel methodologies. As such, it can require a shorter time to solution and provide an acceptable level of efficiency. This report discusses our experiences with the language, as well as coding strategies and vendor-specific "hooks" that can be used to boost performance.
Introduction
The Fortran programming language has a long history in the scientific and engineering communities. A vast array of Fortran legacy and dusty-deck codes exist, and the simplicity of the language and related compiler-generated code efficiency help to keep it to a large extent the language of choice for parallel program development in scientific computing [l] .
The language continues to evolve and mature with compilers now available for the newest Fortran 90 and Fortran 95 standards. These latest versions begin to provide some facets of object-oriented program design and language extension by providing modules and abstract data typing [2] .
However, Fortran has continued to lack a set of features needed for portable and efficient For example, array syntax notation in Fortran 90 implies than an identical operation be applied to all elements of the array. In most cases, this parallelism is at the statement level in a source program, and is therefore often known as fine-grain parallelism. This was a natural fit for the massively parallel, fine-grain computers popular at the time, such as the Connection Machine CM-2 and Mas Par MP-2.
Several vendors supply HPF compilers for today's high performance computing systems.
Data parallel, also known as SIMD (single instruction, multiple data), computers have largely been replaced by coarse-grain computers where processing units can work independently [4] .
The data parallel methodology is general enough that it can be translated in practice to work quite well on these multiple instruction, multiple data (MIMD) computers. This report presents our experiences in using HPF for the COMPOSE (Composite Manufacturing
Process Simulation Environment) suite of codes under development at the U.S. Army Re-1 search Laboratory (ARL) and the University of Minnesota. It describes the state of the HPF compiler and provides some programming strategies. It also gives some performance statistics. A more detailed comparison between performance achieved by the HPF compiler and an explicit message passing approach using MPI is beyond the intended scope of this report, but will be provided in follow-on manuscripts. Indeed, extension to shared memory parallelism using techniques such as OpenMP are also planned for comparison.
Algorithm Design for Parallel Computers

Parallel Architectures.
The obvious distinguishing feature of parallel computers is their ability to have numerous processors executing at once. There are, however, several means to this end. Numerous classes of architectures and topologies have been developed during the short history of parallel computing. Symmetric multi-processors (SMPs) provide an easy parallel programming environment and can be inexpensive to assemble. These machines use a shared memory in which all processors have the same access penalty to memory.
Memory access can potentially limit scalability due to contention in a bus-based topology.
Accordingly, most of these architectures use no more than 64 processors. Tiered memory systems using cache have been used to mitigate these effects. The Sun Microsystems Ultra
Enterprise 10000 is an example of an SMP. Sun has reported good scalability in clustering its SMPs [5] .
Arrays and clusters have been developed where each node could be an SMP or something as simple as a personal computer [6] . They are most often connected using high-speed networks. They are scalable, but provide no shared memory and can be difficult to program.
Massively parallel processors (MPPs) most often feature distributed memories with large numbers of processors. The processor interconnection network topologies have ranged from hypercubes to three-dimensional (3-D) meshes. Examples include the Connection Machine-5
(CM-5) that had the ability to function in multiple instruction-multiple data (MIMD) mode, where each processor can execute different instructions at the same time on different data sets, or single instruction-multiple data (SIMD) mode, where each processor executed the same instruction in lockstep. The Cray T3E-1200 is also an MPP using a tightly-coupled 3-D bidirectional torus configuration. Lastly, scalable symmetric multiprocessors (S2MPs) have been designed to provide the best features of,SMPs, but they allow for MPP scalability.
The SGI Origin 2000 and Origin 3800 are examples of these architectures. They employ a distributed shared memory and large cache architecture built on a bristled hypercube topology. The design intent is to remain scalable by limiting the number of potential "hops" data must make before arriving at the processor requesting the information.
It should be noted that any discussion of interconnection topologies is mostly educational.
New cross-platform tools and standardized parallel programming methodologies have greatly reduced the need for a detailed knowledge of the processor interconnection network. It does remain relevant, however, to have a good understanding of the memory systems on multiprocessor computers. For example, the main solution to overcoming the widening gap between memory and CPU performance has been the development of layered caches.
Algorithms or coding practices that do not try to remain "cache friendly" are invariably doomed to poor execution on the majority of high performance computers in use today.
Parallel Algorithms and Programming.
The parallel programmer has several tasks. The first is to design or select an appropriate algorithm for parallel execution. Often, the best sequential algorithm, or the most apparent one, is not the algorithm of choice in a parallel environment. Criteria for selection can include any number of metrics, including wall clock execution time, FLOP (Floating Point Operations per Second) rates, throughput, and implementation difficulty. In most cases, an absolute definition of "performance" is ubiquitous and remains a complex issue. For example, spending an inordinate amount of time to boost FLOP rates on an algorithm that will only be executed infrequently at most hardly seems worth the effort.
Once an algorithm is selected or designed within the performance requirements, it must then be coded, tested, and optimized. Several parallel programming models may be apparent and applicable to the application at hand. Message passing is widely used. Many tasks may be created with data exchange and interaction between the processes being carried out by sending and receiving messages. The MPI has become the predominate tool in this regard.
Data parallelism is also widely found in many applications. Data parallelism exploits the fact that the same operation is often performed on each item in a set of data. A data parallel program is a sequence of such operations. Currently, High Performance Fortran (HPF) is the most widely used language to represent data parallelism. Fortran 90 is a data parallel language in its own right, but it is limited to strict array syntax parallelism. A final methodology is shared memory programming, which allows processes to execute concurrently using a common memory space. Examples include OpenMP, SGI loop-level parallelism, and Cray shared memory programming (SHMEM). Most often, these approaches require programmer-inserted directives into the code to establish parallel regions or indicate safe loops for multithreaded execution.
When mapped to the underlying architecture, each of these models has several issues.
Chiefly, these are data placement, the number of processors available, the size of the data, and the coordination of the processors. There is, however, no best answer as to which programming paradigm best fits which particular architecture. Data parallel codes can perform well in distributed shared memory machines as well as systolic SIMD computers. When used on distributed memory or distributed shared memory machines, the compiler typically generates a Single Program-Multiple Data (SPMD) program in which each processor executes the same program on a different data set [7] . However, data communication and layout are controlled by the compiler and cannot be easily optimized if required. Furthermore, the maturation rate of data parallel compilers, especially HPF, has been slow due to the complexity of the task. Therefore, they remain, at least somewhat, in a developmental phase.
MPI codes also create SPMD parallelism. These codes require explicit programmer data decomposition and can be very efficient; they are known as the "assembly language" of par-allel programming because of the low-level details surrounding the code. Directive-based loop-level parallelism can be scalable, but on shared memory machines, tedious and sometimes pejorative coding styles are often required to achieve the desired cache optimization.
Furthermore, loop-level parallelism alone may not properly address all numerical formulations.
The choice in parallel methodology can sometimes be difficult. The following sections describe some of our experiences in parallelizing the main solver in the COMPOSE suite.
The physical solutions have been based on domains rendered by finite elements. This code has been parallelized using an element-by-element data parallel approach, and by a message passing approach. This discussion is limited to our experiences and strategies garnered from using the HPF compiler for the data parallel approach. A detailed comparison between the two methods will be reported at a later time.
HPF Data Parallel Programming
The High Performance Fortran model encompasses both communication and parallelism.
It augments Fortran 90, itself a data parallel language that provides constructs to repre- In the following sections, some of the parallel constructs available in HPF are discussed.
Proper strategies to ensure optimized code generation, as well as good and bad coding practices, are also highlighted. Also discussed is the importance of data mapping to achieve locality of reference and avoid potentially costly communication.
In this regard, we also discuss various actions that can be taken to mitigate communications when dealing with Since HPF is a language, it requires a compiler to generate executable code. This starkly contrasts with MPI, which is not a separate language, but rather calls to a library of functions. The compiler for MPI code is a native compiler (usually C or FORTRAN) that sees the calls to MPI simply as calls to functions. The native compiler has no ability to optimize parallelism through message passing library calls. As such, the MPI programmer has sole responsibility for writing the best possible parallel code. The HPF user can also write clear and precise parallel code, but the compiler is more forgiving of poorly written code. It can also restructure code to promote parallelism.
Consider the case of processor synchronization. In parallel programming, synchronization points are costly. Most often, these points are collective barriers at which each process must arrive before they can all continue. These points may also exist at entry and exit points for code that is sequential and must be processed by only one processor. The HPF compiler tries very hard to remove or limit barrier code generation. One way it does this is through 7 loop fusing. As an example, consider the following two FORALL loops and scalar initialization of the variable a:
As long as it maintains the intent of the FORALL construct, the compiler can generate the most efficient code possible to achieve the objective. The compiler can determine that there are no dependencies between the two loops, and that the second FORALL construct contains no references to the scalar computation. In this case, the compiler can simply move the scalar computation before or after the FORALL statements and fuse the two loops without losing correctness or semantics. This allows for the removal of library calls to begin and end local parallel sections. The HPF compiler moved the scalar computation to a point following the second FORALL, fused the separate FORALL loops, and generated the following intermediate code. The code generated for the programmer-fused loop is identical:
This example is somewhat contrived; since these FORALL loops are independent, the source These statements appear to be well suited for vector processing. On MIMD shared memory machines, the HPF compiler will convert this code to SPMD execution by creating equivalent In the past, many data parallel languages required the extensive use of array syntax to describe parallelism. CM-Fortran codes for the CM-5 relied heavily on array syntax to achieve parallel execution. While it can be argued that this syntax is easier to read, it has several potential faults. A major drawback in using array syntax notation for parallelism is that many temporary multidimensional arrays are often required. This problem can quickly get out of hand for codes with large data sets. This was a major reason for establishing the INDEPENDENT do loop and NEW clause construct in HPF.
Consider the following array syntax statements: = fel(l,l,j)*tl+fel(l,2,j)*t2+fe1(1,3,j)*t3 aelpk(2,j) = fel(2,l,j)*tl+fel(2,2,j)*t2+fe1(2,3,j)*t3 aelpk(3,j) = fel(3,l,j)*tl+fel(3,2,j)*t2+fe1(3,3,j)*t3 enddo Here, the programmer assisted the compiler by telling it certain values will be reused. The code required to compute the communication schedules and other overhead was cut by over 50% to approximately 49 lines.
Also, the concurrent do loop contained direct references to the scalars t 1, t2, and t3, and generally had less complicated memory referencing. Most current architectures must use at least two layers of cache to overcome the discrepancy between memory access speeds and processor speeds. The opportunity for independent scalar quantities gives the parallel programmer the ability to write more cache-friendly code. This INDEPENDENT directive should also extend well to new techniques and capabilities in the Fortran 90 language, such as abstract data types and pointers, and new approaches for cache optimizations [8] . The execution time for a very small problem was reduced from 24 s using the array syntax version to 21 s using the INDEPENDENT formulation. Obviously, larger problems should see more pronounced gains.
Locality.
As with any parallel code, the paramount concern rests in limiting to the greatest extent possible the amount of communication that must occur between the processors in the parallel pool. Determining the optimal distribution of data objects operated on by a program is a global optimization problem, and as such is not tractable. Accordingly, HPF provides directives for data mapping (alignment and distribution) to advise the compiler on how best to distribute data elements to the parallel processors. As would be expected, these directives work best at reducing communication in an environment comprised of regular, grid-based data. For instance, with a 2-D or 3-D grid of data, it is relatively straightforward for the compiler to distribute data evenly across the parallel processors. For "ghost points," those items on the data mapping borders which are shared between processors, it is also feasible for the compiler to vectorize and agglomerate the data that must be communicated between processors, thus reducing the overall time spent in communication.
Efficient scheduling in these cases is also possible to hide memory hierarchy latency.
Communication is implicit in HPF as compared to explicit calls found in message passing codes. While this in principle is a factor to make coding in HPF "easier" than traditional There is a sum-scatter HPF library function to perform the reduction wgnode = sum-scatter(we1, wgnode, lm)
The full details of the implementation are hidden, but most likely this call computes a communication schedule for data going to and arriving from remote nodes, moves the data, Since it is a library routine, our assumption was that each time it was called, a schedule was being computed and executed, and any information gathered by the scheduling algorithm was being discarded before the next call.
Conversations with the Portland Group confirmed that this was the case.
Obviously, the ability to reuse communication schedules is essential to getting good performance with this code. The Portland Group has already recognized this need. They While'gather operations do not contain the arithmetic reduction, they are equally problematic. As mentioned previously, gathers generate a lot of code to compute off-node data locations. In test runs, the gathers started to take longer than the reusable schedule scatter operations. The ability to reuse communication schedules inside of do independent loops used for gathers is also under investigation. Syntax to accomplish this has been proposed by the Japan Association for HPF (JAHPF) [9] . C urrently, the ability to reuse communication schedules remains a vendor-specific feature, as the HPF 3.0 standard does not address the issue. It will undoubtedly be incorporated into future standards.
Mesh Reconfiguration.
The locality of reference greatly impacts the performance of a data parallel program. HPF provides several directives and distributions to map data and promote locality. Of these, the DISTRIBUTE and ALIGN directives are most common. The DISTRIBUTE directive indicates how an array is to be partitioned to the various processors. Array alignment, to make sure that corresponding entries of different arrays are on the same processor, can be specified using the ALIGN directive. For example, in the earlier example of A = B * C, by aligning B and C with A, we know that A(l), B(l), and C (1) all reside on the same processor. The array dimensions can be distributed as *, BLOCK, or CYCLIC.
For some applications, such as 2-D image processing routines, these distributions map easily and intuitively to the data to promote locality of reference. However, for unstructured finite element mesh-based data, the data sets are usually element and node based.
Depending on the quality of the original finite element mesh, a BLOCK or CYCLIC distribution of the data will require differing amounts of communication. For example, consider the Accordingly, it is extremely advantageous to have a preprocessing step which reorders the data in a smart fashion based on the expected number of processors to be used. This technique is already used in SPMD message passing codes where the input domain is decomposed into a number of partitions equal to the number of processors. The following strategy was employed.
First, the mesh was partitioned using unstructured graph or mesh partitioning software.
These packages attempt to divide the mesh, either according the the nodal or element data, into a number of partitions while attempting to limit the number of shared nodes between partitions. This step provided a list of elements for each domain. Second, the domain shared node vectors were computed. Third, for each domain or partition, the nodes that are shared between domains are grouped and renumber first. will also have to be loaded and may not be used on the next iteration. The overall performance is hard to determine a priori due to the differing cache architectures (set associative, n-way, etc.) and hardware-software interactions (instruction prefetching, speculative code execution, etc.), but it is obvious that this does not lead to cache affinity.
Renumbering techniques were implemented that were proven to improve cache performance by reducing the bandwidth and envelope for sparse matrices [lo] . The Reverse CuthillMcKee (RCM) approach renumbers poorly numbered meshes to reduce the dramatic cardinality changes between connected element nodes. On message passing implementations of COMPOSE, these preprocessing steps have shown roughly a 10% reduction in execution time requirements on the Cray T3E system. The timings reported in this monograph are based on meshes with somewhat poor numbering, and were done prior to implementing RCM. We can roughly assume a similar reduction in the reported timings had the RCM pass been available.
While the aforementioned techniques are very effective at reducing communications, they still do not address the problem of using a mesh that cannot be rigidly broken along domains to fit into the compiler partition sizes. With BLOCK distributions, the data is partitioned into contiguous, equal-sized blocks the size of $, where N is the cardinality of the data set and P is the number of processors. So, even while using an advanced renumbering technique as described previously, some communication will be required to move data for these overlapping points.
One possible way to eliminate these communications for ghost points is through "mesh padding." In this process, a graph partition is determined and renumbered as before. Then, the block distribution boundaries are computed based on the mesh size and compared to how well the graph/finite element partition maps to these block boundaries. Nodes and elements are then created as needed to make the finite element mesh fit the mesh partition. Depending upon the initial mesh configuration, the number of nodes and elements that have to be created vary from the t,ens to the hundreds. However, this approach has several drawbacks. The original mesh intent may be lost by adding new nodes and elements. Furthermore, we are adding local computations at the expense of reducing communications.
A better approach has been developed. A new data distribution technique available in the Portland Group HPF compiler release 3.0 addresses these concerns by providing asymmetrical block distribution, thus removing the need for mesh padding [9] . The new approach allows for "soft" BLOCK boundaries that can be set by the software or the user at runtime and not the compiler at compile time. Combined with the mesh renumber technique described above, it should limit all communications except for true domain boundary entries.
The syntax is !HPF$ DISTRIBUTE A(GEN-BLOCK(DIST) 1.
Here, A is the array to be distributed, and DIST is an array whose cardinality is the number of processors on which the code will run. Each element of DIST contains the size of the data block that is to be placed on the processor. Since the mesh partitioning and renumbering pass has already decomposed the data into domains, it can also provide the DIST array with precise numbers as to the size of the asymmetrical blocks. The overall effect should be to limit communication to all but the essential shared nodes and hence reduce execution time. Figure 3 shows the configuration of this optimal data-parallel mesh.
Unfortunately, new languages and compilers do have their faults, as we were unable to utilize this new feature. As released, the compiler is unable to generate an executable code that uses generic block distribution and is compiled with the shared memory addressing mode enabled. In the case of the T3E system, shared memory referencing causes the compiler to generate code that uses very efficient Cray one-sided "get" and "put" communication calls. 
Performance Data
For two reasons, the discussion of the performance of HPF is limited to the Cray T3E system alone. First, the ability to reuse communication schedules is only supported on the HPF compiler for the Cray T3E. Second, PGHPF has been highly optimized for the Cray T3E system compared to other major architectures [II] . The wall clock execution times for a representative problem with varying processor counts are given in Table 1. This table lists the time of execution using reusable communication schedules for the SUM-SCATTER operation.
Since we overwhelmingly showed the need for reusable schedules in the 16-processor case, we considered further benchmarking using the SUM-SCATTER library routine to be a waste of CPU hours. Table 1 shows the sometimes dramatic improvement of simple mesh preprocessing for data parallel execution. The limits to speedup appear to reside wholly in areas of the code requiring interprocessor communication. As such, the ability to perform precommunication reductions and reuse communication schedules appear to be the best opportunity at increasing code speed. The hope is that asymmetric block distribution and schedule reuse be incorporated into future releases of the compiler. Mesh renumbering, combined with these techniques, present the best hope for making HPF viable for unstructured grid problems.
Currently, MPI provides the best wall clock solution for this class of problems. More information will be provided in future monographs, but an example is telling. On one comparision problem (45,547 nodes, 89,945 elements), the MPI implementation of COMPOSE required about 1038 s to complete on a 16-processor run on the Cray T3E system. The preprocessed and optimized HPF run required almost 3x as long at 2949 s. HPF must overcome its current limitations to be more broadly applicable.
Conclusion
The HPF language and the data parallel model it is built upon have both shown applicability to past, current, and no doubt future parallel computer platforms. As more and more choices become available to achieve parallelism, it should be pointed out that there is no single panacea for parallel programming. Some researchers believe that data parallel is a natural way of thinking and achieving parallelism, compared to something like shared memory programming or message passing. Others, first exposed to parallelism with loop-level directives or MPI, would probably say that shared memory or SPMD programming is more natural. To a large extent, it no doubt depends on the individual and past history.
Furthermore, it is also difficult to judge effort and payoff between various parallel programming tools and languages. Some researchers have expressed beliefs that it is easier to write and maintain code in HPF rather than message passing with MPI [12] . Such statements are misleading. Good performance, however measured, always takes time to achieve.
HPF code requires careful and time consuming tuning, as does a code written using other languages and libraries.
In summary, HPF provides a viable parallel programming capability for some classes of problems, but it needs improvements. The standard suffers by not addressing very important considerations like interprocessor communication schedules. These issues can only be addressed at the compiler and standardization levels. No amount of code tuning can correct these problems. Therefore, it is up to the individual to decide if this degraded performance is acceptable over some possibly more "expensive" alternative. As the compilers and the standard continue to mature, they will only improve-their survival depends upon it. Since its beginnings in the 1950's, the Fortran language has been the language of choice for most scientific and engineering programming.
Compilers, seizing on the simplicity of the language, have historically generated highly-optimized machine code. High performance Fortran (HPF) is a relative new addition to the Fortran dialect. It is an attempt to provide an efficient high-level Fortran parallel programming language for the latest generation 01 parallel computers. Its success has been debatable. By operating at a high level, the HPF standard does not provide some low-level detail required to achieve maximum performance in a multiprocessor system. Message passing using highly-tuned libraries, such as the message passing interface (MPI), will more often than not require less wall clock time than a comparable HPF code. However, the HPF language and its compilers continue to mature and improve. HPF provides a convenient way to represent parallelism for those most comfortable with data parallel methodologies. As such, it can require a shorter time to solution and provide an acceptable level of efficiency. This report discusses our experiences with the language, as well as coding strategies and vendor-specific "hooks" that can be used to bcmsl pfOllllallCe.
14. SUBJECT 
