In this paper we examine the use of a shared memory programming model to address the problem of portability of application codes between distributed memory and shared memory architectures. We do this with an extension of the Parallel C Preprocessor. The extension, borrowed from Split-C and AC, uses type qualifiers instead of storage class modifiers to declare variables that are shared among processors. The type qualifier declaration supports an abstract shared memory facility on distributed memory machines while making direct use of hardware support on shared memory architectures.
Introduction
Message passing has evolved as the portability vehicle of choice for parallel programs that must run on both distributed memory and shared memory architectures. This has occurred because message passing is relatively easy to implement. Widely accepted standards such as the Message Passing Interface (MPI) [1] exist with implementations available on many shared and distributed memory platforms. The use of message passing is a natural choice on distributed memory platforms, but its use on shared memory systems can sacrifice performance in applications that are sensitive to communication latency and bandwidth. Whether one can devise alternative parallel programming paradigms that efficiently span shared and distributed memory architectures in a portable manner is the subject of this paper.
We investigate a shared memory paradigm as a portability vehicle for programs that must run on both shared and distributed memory architectures. The goal of the paradigm is to make the low latency communication capability of a shared memory platform available to the programmer while preserving portability to a distributed memory platform. Applications written in such a programming model have a communication granularity that is driven by the characteristics of the algorithm itself. Declaring shared variables with type qualifiers gives the programmer the ability to indicate sharing status at all levels of indirection. It also provides the compiler with the needed hooks to handle shared memory references on a distributed memory platform.
Unlike the Split-C and AC compilers that are based on the GNU C compiler and produce assembly language, we have implemented our extension of the PCP programming model as a source-to-source translator. The backend target is the vendor-supported C compiler combined with a runtime library that handles synchronization and communication operations that are not part of the architectural support provided by the vendor. Implementing PCP as a source-to-source translator creates a more easily ported compiler and leverages the substantial effort vendors usually make to optimize the performance of their proprietary C compilers.
Target Architectures
In this paper we evaluate the utility of the type qualifier approach to shared memory declarations on the following platforms: the DEC 8400 symmetric shared memory multiprocessor, the SGI Origin 2000 coherent cache non-uniform memory access architecture, the Cray T3D and T3E distributed memory platforms which support remote memory references in hardware, and the Meiko CS-2 which supports one-sided messages in software.
DEC 8400
The DEC Alpha 8400 symmetric shared memory multiprocessor is a conventional bus-based parallel system. The DEC platform may have up to 12 processors that share a system bus possessing a sustainable bandwidth of 1600 megabytes per second. The memory system can support up to 16-way interleaving, the actual amount of memory interleave depending upon the specific memory configuration. Further details can be found in the Technical Summary [7] for the DEC 8400 system. For our benchmarking we use a DEC 8400 with 8 processors running at 440 megahertz with 4-way interleaved memory.
The DEC Alpha processor provides a weakly consistent memory model [8]. If program correctness requires the ordering of memory operations, a memory barrier instruction is used to force pending memory operations to complete before further instructions are allowed to issue. Load-linked and store-conditional instructions are provided in the DEC Alpha processor to support synchronization operations.
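As an illustration of how load-linked and store-conditional instructions are typically used for synchronization, the following sketch builds an atomic increment from a compare-and-swap retry loop written with C11 atomics. On processors that provide load-linked and store-conditional, compilers commonly lower such a loop to that instruction pair. The function name `fetch_and_increment` is hypothetical and this is not the PCP runtime code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Atomically add 1 to *counter and return the previous value.
   Hypothetical helper for illustration; not part of the PCP runtime. */
long fetch_and_increment(atomic_long *counter)
{
    assert(counter != NULL);
    long seen = atomic_load(counter);
    /* Retry until no other processor has modified the counter between
       our read and our conditional write. When the exchange fails,
       `seen` is refreshed with the value currently in memory. */
    while (!atomic_compare_exchange_weak(counter, &seen, seen + 1)) {
        /* retry with the refreshed value */
    }
    return seen;
}
```

The default sequentially consistent ordering of these C11 operations also supplies the memory barriers that a weakly consistent processor requires around the update.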
SGI Origin 2000
The SGI Origin 2000 provides a shared memory architecture implemented as a distributed shared memory system as opposed to a symmetric bus architecture. The Origin 2000 system [9] is built from a number of nodes, each containing two R10000 processors and their associated second level caches. A portion of the system memory and the cache directory are positioned on the node as well. The nodes are interconnected by a communications fabric implementing a hypercube for modest configurations of up to 32 nodes. A directory-based cache coherence protocol is implemented over this fabric in order to provide a system-wide cache coherent shared memory.
The SGI Origin 2000 offers increased scalability because it does not depend upon a single system bus to provide for shared memory. Unlike the case for a symmetric shared memory platform, however, application performance can be sensitive to data layout. Data layout is controlled on the Origin 2000 by influencing the page placement algorithms of the virtual memory system at runtime. The memory model provided on the Origin 2000 is sequentially consistent [10]. Load-linked and store-conditional instructions are provided to support synchronization.
Cray T3D and T3E
The Cray T3D and T3E platforms are distributed memory machines that employ DEC Alpha microprocessors and a torus network to implement interprocessor communication. Each processor operates on its own local address space but mechanisms are provided to support remote memory references through the communications network. In the T3D, remote memory references are implemented in the support circuits surrounding the processor. Before a load or store is issued by the processor, a special instruction may be used to set the target CPU to a non-local processor.
A prefetch queue and a block transfer engine permit the programmer to hide latency on the Cray T3D. We employ the prefetch queue to implement vector fetches from distributed to local memory. A remote read-modify-write cycle and a hardware barrier instruction are provided to support synchronization. To obtain the best possible performance, a large fraction of the PCP remote memory reference runtime library support is written in assembly language on the T3D.
The multiprocessing support architecture introduced in the T3D has been substantially refined in the Cray T3E. Memory mapped registers, called E registers, are used to access all of the multiprocessing support operations. Through the use of the E register mechanism remote memory references, read-modify-write operations, and a barrier instruction can be accessed. E register operations can also be used to implement efficient vector transfers between local and distributed memory. A key advantage of the T3E is that the E register mechanism is directly accessible from an optimizing C compiler.
Both the Cray T3D and T3E provide a weakly consistent memory model. This occurs at two levels. The DEC Alpha processor, used in both platforms, implements weakly ordered memory operations. The remote memory reference operations implemented in the multiprocessing support logic are also weakly ordered. One must wait on remote reads to complete in order to ensure that they precede other remote memory references. Remote memory writes are tracked in the support logic and their completion is explicitly waited for if algorithm correctness depends upon memory ordering.
Meiko CS-2
The Meiko CS-2 is a distributed memory computing platform based on the SUN SPARC processor, using a separate Elan [11] processor to implement interprocessor communication across a network. The Elan processor on the local node executes a communications protocol that communicates with the Elan processor on a remote node to implement memory-to-memory (DMA) transfers. Since the communication protocol runs in software on the Elan, the startup latency for data transfers is significant. This requires data movement to occur in large block transfers in order to obtain good performance.
The memory-to-memory transfers implemented by the Elan communications processor on the Meiko CS-2 are weakly ordered. One must explicitly wait on an event associated with a DMA operation to ensure that it is complete before continuing with further data transfers if algorithm correctness depends upon ordering. There are no remote read-modify-write cycles implemented in the Elan library. As a result, we were forced to resort to Lamport's algorithm [12] for mutual exclusion.
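The text does not say which of Lamport's mutual exclusion algorithms was used. As one possibility, the sketch below shows the bakery algorithm, which needs only reads and writes and no read-modify-write cycle. It uses sequentially consistent C11 atomics where an implementation on the CS-2 would use remote reads and writes followed by explicit waits. `MAX_PROCS` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_PROCS 8   /* hypothetical upper bound on participants */

static atomic_int choosing[MAX_PROCS];  /* 1 while a process picks a ticket */
static atomic_int ticket[MAX_PROCS];    /* 0 means "not competing" */

/* Enter the critical region as process `me`. */
void bakery_lock(int me)
{
    assert(me >= 0 && me < MAX_PROCS);

    /* Take a ticket one larger than any ticket currently held. */
    atomic_store(&choosing[me], 1);
    int max = 0;
    for (int j = 0; j < MAX_PROCS; j++) {
        int t = atomic_load(&ticket[j]);
        if (t > max)
            max = t;
    }
    int mine = max + 1;
    atomic_store(&ticket[me], mine);
    atomic_store(&choosing[me], 0);

    /* Wait for every process that holds a smaller ticket. Equal
       tickets are ordered by process index. */
    for (int j = 0; j < MAX_PROCS; j++) {
        if (j == me)
            continue;
        while (atomic_load(&choosing[j])) {
            /* j is still picking its ticket */
        }
        for (;;) {
            int t = atomic_load(&ticket[j]);
            if (t == 0 || t > mine || (t == mine && j > me))
                break;
        }
    }
}

/* Leave the critical region. */
void bakery_unlock(int me)
{
    assert(me >= 0 && me < MAX_PROCS);
    atomic_store(&ticket[me], 0);
}

/* Current ticket of process `me`, exposed for inspection. */
int bakery_ticket(int me)
{
    assert(me >= 0 && me < MAX_PROCS);
    return atomic_load(&ticket[me]);
}
```

Each lock acquisition reads the state of every other participant, so on a platform with high per-message software overhead the cost of mutual exclusion grows with the processor count.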
PCP Translations and Runtime Libraries
The PCP programming model is implemented as a source-to-source translator that performs a full parse of the PCP programming language. It produces ANSI C augmented by either calls to communication and synchronization routines in the PCP runtime library or vendor-specific functionality. The translator is implemented as a single program with several output options selected by a flag. One or more output options may be viable on a given target machine and a runtime library is generally associated with each output option.
Shared Memory Platforms
Some vendors of shared memory multiprocessor systems do not provide shared static variables in their C compiler that we can target with our PCP translator. We have used two strategies to make up for this lack of direct support for shared variables. In the first strategy we convert static addresses to shared status in place. In the second approach we create a shared duplicate of the program data area at an offset address.
Conversion in Place
Some shared memory multiprocessor systems provide system calls that are capable of remapping an existing valid region of virtual memory. If the address ordering of variables defined in a source file is preserved by the loading process, we can arrange to convert the variables to shared status in place. To do this, the PCP translator produces two backend files for each PCP source file. One file contains the code, private data definitions and externalized declarations for the shared variables. The second file contains the data definitions for the shared variables.
When the parallel program is built, all of the data definitions for shared variables are concatenated into a single file along with a header and trailer that mark the beginning and end of the shared memory region. When the resulting binary program is run, a page aligned region of memory is found that starts in the header and ends in the trailer. This data region, which contains all of the static shared variables, is then written out to a file and mapped back in to give it shared semantics under process fork. Once the shared data segment is created, the PCP runtime forks the required number of processes to start the parallel job.
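The conversion step can be sketched with standard POSIX calls. This is a minimal illustration under stated assumptions, not the PCP runtime: a static array `shared_area` stands in for the region between the header and trailer, `tmpfile` supplies the backing file, and the names `convert_in_place` and `demo_conversion_in_place` are hypothetical. The demonstration finds the page aligned portion of the region, remaps it as a shared file mapping at the same address, and then checks that a write made by a forked child is visible to the parent.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the concatenated shared data definitions. The real
   region would be delimited by the header and trailer symbols. */
static unsigned char shared_area[1 << 18];

/* Replace the private pages in [start, start + len) with a shared,
   file-backed mapping that holds the same contents. Both arguments
   must be page aligned. Returns 0 on success, -1 on failure. */
static int convert_in_place(void *start, size_t len)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;
    /* Write the current contents of the region out to the file. */
    int ok = fwrite(start, 1, len, fp) == len && fflush(fp) == 0;
    void *p = MAP_FAILED;
    if (ok)
        /* Map the file back over the same addresses. */
        p = mmap(start, len, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fileno(fp), 0);
    fclose(fp);   /* the mapping keeps the file contents alive */
    return p == start ? 0 : -1;
}

/* Returns the value seen by the parent after a forked child has
   incremented a variable in the converted region, or -1 on error. */
int demo_conversion_in_place(void)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t begin = (uintptr_t)shared_area;
    uintptr_t lo = (begin + page - 1) / page * page;          /* round up */
    uintptr_t hi = (begin + sizeof shared_area) / page * page; /* round down */
    assert(hi > lo);

    volatile int *counter = (volatile int *)lo;
    *counter = 41;                      /* set before the conversion */
    if (convert_in_place((void *)lo, (size_t)(hi - lo)) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                     /* child writes to the region */
        *counter += 1;
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return *counter;                    /* parent observes the write */
}
```

Without the conversion, the child would modify its own copy-on-write page and the parent would still read the original value.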
In addition to the above tasks of creating the static shared data segment and starting the parallel job, the PCP runtime library implements locks for critical regions, dynamic allocation of shared memory, and barrier synchronization.
Address Offsetting
On platforms that do not preserve address ordering for variables defined in a single source file, or are not capable of converting an existing virtual memory region to shared status, we have implemented a second strategy for creating the shared data segment. In this case, the PCP translator adds a constant offset to the address of all shared static variables in order to reach an unused portion of virtual memory. Before parallel processes are forked, a shared copy of the entire program data area is created at this offset address.
The address offsetting strategy of establishing the shared data segment simplifies building a code and makes library management easier. The cost to be paid for this convenience is the additional runtime overhead of adding a constant offset to the addresses of statically scoped shared variables. In our benchmarks, which go to some degree of effort to minimize shared memory use, this additional overhead has amounted to only a few percent.
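The effect of address offsetting on a reference to a static shared variable can be sketched as follows. This is an illustration under stated assumptions rather than the translator's actual output: the struct `data_area` stands in for the program data area, `malloc` stands in for a shared mapping in an unused part of virtual memory, the offset is computed at run time instead of being a compile-time constant, and the macro `SHARED_REF` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the program data area holding static variables. */
static struct {
    int private_count;
    int shared_count;
} data_area = { 1, 2 };

/* Byte distance from the data area to its shared duplicate. */
static intptr_t shared_offset;

/* What a translator could emit for a reference to a static shared
   variable: the variable's ordinary address plus the offset. */
#define SHARED_REF(type, var) \
    (*(type *)((intptr_t)&(var) + shared_offset))

/* Create the duplicate of the data area and record the offset.
   Returns 0 on success, -1 on failure. */
int setup_shared_copy(void)
{
    void *copy = malloc(sizeof data_area);
    if (copy == NULL)
        return -1;
    memcpy(copy, &data_area, sizeof data_area);
    shared_offset = (intptr_t)copy - (intptr_t)&data_area;
    return 0;
}
```

After `setup_shared_copy()` succeeds, an assignment such as `SHARED_REF(int, data_area.shared_count) = 42;` changes only the duplicate, while the original variable keeps its value. The extra addition on every such reference is the runtime overhead discussed above.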
Other than the different way of establishing the shared memory segment and the offset added to static shared addresses by the PCP translator, the PCP runtime support is identical to that of conversion in place described above. On some platforms, both options can be supported.
Distributed Memory Platforms
For distributed memory platforms the PCP translator manages the distribution of arrays across the processors. Arrays are distributed on object boundaries in such a manner that the first element of a statically allocated array resides on processor zero. Pointers to shared objects may refer to any array element in the same unrestricted manner as conventional pointers in C. Pointer arithmetic and remote memory reference operations are handled in software if required for a given architecture.
The format of a pointer to a shared object depends upon the target architecture. Some platforms implement pointers that are 64 bits wide and admit the packing of the processor index into unused address bits. An example of this is the Cray T3D which leaves the upper 16 bits of a pointer value unused. A processor index for up to 64K processors can be accommodated in this unused field. On other platforms a pointer is only 32 bits wide and will not accommodate the available system-wide memory for even a modest number of processors. In this case, we define a pointer to a shared object as a structure that contains the address and processor index as separate fields. Using a structure value for pointers is cleaner, but most C compilers are clumsy when dealing with structure values as arguments to or returned from subroutines.
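The two pointer formats can be sketched as follows. The 16-bit processor field in the upper bits of a 64-bit value follows the T3D description above. The type and function names, and the field widths of the structure form, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PROC_SHIFT 48   /* processor index occupies the upper 16 bits */
#define ADDR_MASK  ((UINT64_C(1) << PROC_SHIFT) - 1)

/* Packed form for platforms with 64-bit pointers. */
typedef uint64_t packed_shared_ptr;

packed_shared_ptr pack_shared(unsigned proc, uint64_t local_addr)
{
    assert(proc < (1u << 16));               /* up to 64K processors */
    assert((local_addr & ~ADDR_MASK) == 0);  /* upper bits must be free */
    return ((uint64_t)proc << PROC_SHIFT) | local_addr;
}

unsigned shared_proc(packed_shared_ptr p)
{
    return (unsigned)(p >> PROC_SHIFT);
}

uint64_t shared_addr(packed_shared_ptr p)
{
    return p & ADDR_MASK;
}

/* Structure form for platforms with 32-bit pointers: the address and
   the processor index are kept in separate fields. */
typedef struct {
    uint32_t addr;
    uint32_t proc;
} wide_shared_ptr;
```

The packed form travels in a single register, which avoids the cost of passing and returning structure values noted above.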
In the PCP implementation for distributed memory platforms, the programmer provides information about the processor count at compile time. A shared array of size N is allocated (N+NPROCS-1)/NPROCS elements in the C language output for the array definition, where NPROCS is the lower bound on the number of processors that will be used to execute the application.
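The allocation rule and the mapping from a global index to a processor can be sketched as follows. The per-processor size is the formula given above. The cyclic mapping of element i to processor i mod NPROCS is our assumption, consistent with the first element residing on processor zero. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Elements reserved on each processor for a shared array of n
   elements: (n + nprocs - 1) / nprocs, the ceiling of n / nprocs. */
size_t local_extent(size_t n, size_t nprocs)
{
    assert(nprocs > 0);
    return (n + nprocs - 1) / nprocs;
}

/* Processor that holds global element i (assumed cyclic layout). */
size_t owner_of(size_t i, size_t nprocs)
{
    assert(nprocs > 0);
    return i % nprocs;
}

/* Position of global element i within its owner's local block. */
size_t local_index_of(size_t i, size_t nprocs)
{
    assert(nprocs > 0);
    return i / nprocs;
}
```

For example, a 10-element array compiled with NPROCS equal to 4 reserves 3 elements per processor. If the job then runs on 8 processors, the largest local index in use is 1, which still fits, and this is why the compile-time count only needs to be a lower bound.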
All distributed memory implementations require an implementation of arithmetic for pointers to shared data. These routines are inlined if the backend compiler on the target platform supports procedure inlining. Routines that support remote references for all of the ANSI C basic data types, and aggregate types such as structures, must also be written for the target architecture. The only restriction that this places on the target architecture is that one-sided communication be supported. In addition to functions to perform address arithmetic and remote memory operations, library support for parallel job startup, allocation of distributed arrays, mutual exclusion, and barrier synchronization must be written for the target architecture.
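Arithmetic on pointers to shared data can be sketched for the structure form of the pointer. The sketch assumes that consecutive elements are placed on consecutive processors, so that incrementing a pointer advances the processor index and wraps to the next local element. The type `SharedPtr` and the function names are hypothetical, and the local position is kept as an element index rather than a byte address to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t   local;  /* element index within the owner's local block */
    unsigned proc;   /* owning processor */
} SharedPtr;

/* Return p advanced by k elements; k may be negative. */
SharedPtr shared_ptr_add(SharedPtr p, ptrdiff_t k, unsigned nprocs)
{
    assert(nprocs > 0 && p.proc < nprocs);
    /* Convert to a global element number, move, and convert back. */
    ptrdiff_t global = (ptrdiff_t)p.local * (ptrdiff_t)nprocs
                     + (ptrdiff_t)p.proc + k;
    assert(global >= 0);
    SharedPtr q;
    q.local = (size_t)global / nprocs;
    q.proc  = (unsigned)((size_t)global % nprocs);
    return q;
}

/* Number of elements from b to a, as in the C expression a - b. */
ptrdiff_t shared_ptr_diff(SharedPtr a, SharedPtr b, unsigned nprocs)
{
    assert(nprocs > 0 && a.proc < nprocs && b.proc < nprocs);
    ptrdiff_t ga = (ptrdiff_t)a.local * (ptrdiff_t)nprocs + (ptrdiff_t)a.proc;
    ptrdiff_t gb = (ptrdiff_t)b.local * (ptrdiff_t)nprocs + (ptrdiff_t)b.proc;
    return ga - gb;
}
```

Routines of this size are natural candidates for inlining, since a division and a remainder per pointer operation would otherwise be dominated by call overhead.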
If possible, vendor-specific functionality is employed to optimize the performance of communication and synchronization operations. On the Cray T3D, remote memory reference routines are written in assembly language that directly manipulates the hardware communications support. Similarly, on the Cray T3E, remote memory reference operations are implemented as inlined functions that directly access the E register mechanism. This removes routine overhead from single word remote memory accesses. On the Meiko CS-2, the Elan one-sided communications library [13] is used to implement remote memory references. There is substantial software overhead associated with the Elan library.
Each of the distributed memory target architectures offers means to reduce the impact of communication latency. This can be exploited in PCP by using blocked data movement, implemented as remote access to C structures, or vector data movement, implemented with a subroutine interface. On the Meiko CS-2, the movement of large blocks of data can amortize the software costs of communication startup. The hardware prefetch queue on the T3D is used to efficiently overlap the fetch of individual words in a C structure, and to efficiently overlap the fetch of single word array elements distributed across the processors. Similarly, the E register mechanism is used on the T3E to overlap remote memory access.
Benchmark Results
In order to evaluate the utility of the type qualifier shared memory declarations, we have implemented Gaussian elimination with backsubstitution, a 2-D fast Fourier transform, and a matrix-matrix multiply in the extended PCP programming model. The performance of these benchmarks was evaluated on all of the target architectures.
Gaussian Elimination
Our first benchmark is a parallel version of the Gaussian elimination algorithm [14]. For a dense linear system of size N × N the algorithm begins with N reduction steps wherein scalar multiples of the i-th row, known as the pivot row, are subtracted from the rows below in order to reduce the matrix to upper triangular form. Once the reduction is complete, backsubstitution produces the solution vector elements in succession.
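For reference, the serial form of the algorithm just described can be sketched as follows. This is a minimal illustration and not the benchmark code: the matrix is stored row-major in a flat array, no pivot search is performed, so every diagonal pivot is assumed to be nonzero, and the function name `gauss_solve` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Solve A x = b for a dense n-by-n system stored row-major in a.
   Row i serves as the pivot row at step i. The matrix is reduced in
   place and, on return, b holds the solution vector. */
void gauss_solve(size_t n, double *a, double *b)
{
    assert(a != NULL && b != NULL);

    /* Reduction to upper triangular form. */
    for (size_t i = 0; i < n; i++) {
        assert(a[i * n + i] != 0.0);
        for (size_t r = i + 1; r < n; r++) {
            double m = a[r * n + i] / a[i * n + i];
            for (size_t c = i; c < n; c++)
                a[r * n + c] -= m * a[i * n + c];
            b[r] -= m * b[i];
        }
    }

    /* Backsubstitution, producing solution elements from last to first. */
    for (size_t i = n; i-- > 0; ) {
        double s = b[i];
        for (size_t c = i + 1; c < n; c++)
            s -= a[i * n + c] * b[c];
        b[i] = s / a[i * n + i];
    }
}
```

The inner update of a row by a scalar multiple of the pivot row is a DAXPY operation, which is why the DAXPY rate is a useful point of reference for this benchmark.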
In the parallel version, an array of flags located in shared memory indicates when a pivot row is ready for use in the reduction. The same array of flags, being reset to zero, indicates when an element of the solution vector is ready for use in the backsubstitution. At the start of the algorithm a processor's share of the rows of the matrix, and the associated portion of the right hand side, are copied from shared memory to private memory. This copying is carried out element-by-element, but may be handled in a routine that executes the copy in a vectorized manner if element-by-element communication can be overlapped on a given architecture. A pivot row, and/or an element of the solution vector, is copied back out to shared memory when the data is ready for use by other processors. We note that the ordering relationship between the setting of a flag and the assignment of its corresponding data must be carefully enforced on machines for which the memory consistency model is not sequential.
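The ordering requirement between a flag and its data can be sketched with C11 atomics. The release store keeps the row data from being reordered after the flag, and the acquire load keeps the reads of the row from being reordered before it. On the platforms above the same effect is obtained with memory barrier instructions or explicit waits on remote writes. The type `pivot_slot`, the length `ROW_LEN`, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define ROW_LEN 4   /* hypothetical row length for the example */

typedef struct {
    double     row[ROW_LEN];  /* pivot row data in shared memory */
    atomic_int ready;         /* flag: nonzero once row is valid */
} pivot_slot;

void pivot_slot_init(pivot_slot *slot)
{
    assert(slot != NULL);
    atomic_init(&slot->ready, 0);
}

/* Producer: write the data first, then set the flag. */
void publish_pivot(pivot_slot *slot, const double *row)
{
    assert(slot != NULL && row != NULL);
    for (size_t k = 0; k < ROW_LEN; k++)
        slot->row[k] = row[k];
    atomic_store_explicit(&slot->ready, 1, memory_order_release);
}

/* Consumer: returns 1 and copies the row if it is ready, else 0. */
int try_read_pivot(pivot_slot *slot, double *out)
{
    assert(slot != NULL && out != NULL);
    if (!atomic_load_explicit(&slot->ready, memory_order_acquire))
        return 0;
    for (size_t k = 0; k < ROW_LEN; k++)
        out[k] = slot->row[k];
    return 1;
}
```

If the flag were written with an ordinary store on a weakly ordered machine, a consumer could observe the flag as set while still reading stale row data.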
In presenting these performance results, we report the rate at which the parallel code executes a 1024 × 1024 linear system solve, and the speedup compared to the parallel code executed by one processor, as a function of the number of processors. All rates are expressed in millions of floating point operations per second (MFLOPS). Execution times are measured on dedicated or suitably gang scheduled [15] machines. To provide a point of reference, we also report the rate at which a processor can repetitively add a scalar multiple of a vector to another vector (DAXPY). We use a vector length of 1000 so all operations hit cache. In Tables 1 through 5, the floating point rate for parallel execution of the Gaussian elimination benchmark as a function of the processor count, P, is shown in the column labeled MFLOPS. The speedup, measured as the execution time of the parallel code running with one processor divided by the time for P processors, is shown in the column labeled Speedup.
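The reference kernel and the reported quantities can be written down as follows. This is a sketch of the definitions only, with hypothetical function names, and it is not the timing harness used for the measurements. A DAXPY of length n performs one multiply and one add per element, for 2n floating point operations.

```c
#include <assert.h>
#include <stddef.h>

/* y := alpha * x + y for vectors of length n. */
void daxpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Floating point operations in one DAXPY of length n. */
double daxpy_flops(size_t n)
{
    return 2.0 * (double)n;
}

/* Rate in millions of floating point operations per second. */
double mflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1.0e6;
}

/* Speedup: one-processor time of the parallel code divided by the
   time on P processors. */
double speedup(double time_one, double time_p)
{
    assert(time_p > 0.0);
    return time_one / time_p;
}
```

With these definitions a speedup greater than P is superlinear.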
In Table 1 we show the performance of an 8 processor DEC 8400 system on the Gaussian elimination benchmark as a function of the number of processors. The basic DAXPY speed of a processor on this machine is 157.9 MFLOPS for compiled C code. The speedups in the table are superlinear. This is caused by the increasing amount of high speed cache memory available as the processor count is increased. The MFLOPS rate does not exceed the number of processors times the cache hit DAXPY rate, as one would expect.
Table 2. Gaussian Elimination Performance on the SGI Origin 2000
In Table 3 we show the results for the Gaussian elimination benchmark on the Cray T3D platform. Due to the capability of the T3D to overlap remote memory accesses, using a vectorized interface to the communication hardware can reduce the overheads associated with communication.
Table 3. Gaussian Elimination Performance on the Cray T3D
In Table 4 we show the results for the Gaussian elimination benchmark on the Cray T3E platform. As was the case for the T3D, the performance for both scalar and vector access to shared memory is shown. Vector access to shared memory is required to obtain good performance on the T3E.
Table 4. Gaussian Elimination Performance on the Cray T3E-600
In Table 5 we show the results for the Gaussian elimination benchmark on the Meiko CS-2. The basic DAXPY speed for a single processor on the CS-2 is 14.93 MFLOPS. Although the Meiko Elan library provides for overlapped communication, attempting to overlap small one-sided messages does not result in any performance gain. The substantial latency placed on single word communication prevents this particular algorithm for Gaussian elimination from performing well on the Meiko CS-2. Performance could be improved by changing the data layout so that a given row of the matrix is contained on one processor, enabling more efficient use of the DMA capability on the CS-2, and by using a software tree to broadcast pivot rows. 
Fast Fourier Transform
The work load is balanced for any power of two for the processor count. The shared data accesses can be vectorized with a stride of one for the sweeps in the y direction and with a stride of 2048 for the sweeps in the x direction. On coherent cache based shared memory multiprocessors, the stride of 2048 can be unfortunate, as will be shown in the benchmarks. This is dealt with by padding the arrays by one element. The index scheduling for the sweeps in the x direction can also be unfortunate on coherent cache based shared memory multiprocessors, leading to false sharing of cache lines. This is dealt with by blocking the index scheduling.
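The two remedies can be sketched as index calculations. The sketch assumes that the unblocked schedule hands out indices cyclically, so that neighboring indices, which share a cache line, are processed by different processors. The blocked schedule gives each processor a contiguous range instead, and padding changes the stride between rows from 2048 to 2049 elements. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Blocked schedule: processor p of nprocs handles the contiguous
   index range [*lo, *hi) out of n indices. */
void block_range(size_t n, size_t nprocs, size_t p, size_t *lo, size_t *hi)
{
    assert(nprocs > 0 && p < nprocs && lo != NULL && hi != NULL);
    size_t chunk = (n + nprocs - 1) / nprocs;
    size_t first = p * chunk;
    if (first > n)
        first = n;
    size_t last = first + chunk;
    if (last > n)
        last = n;
    *lo = first;
    *hi = last;
}

/* Cyclic schedule, for comparison: owner of index i. */
size_t cyclic_owner(size_t i, size_t nprocs)
{
    assert(nprocs > 0);
    return i % nprocs;
}

/* Flat index of element (row, col) when each row of ncols elements
   is padded by one unused element. */
size_t padded_index(size_t row, size_t col, size_t ncols)
{
    assert(col < ncols);
    return row * (ncols + 1) + col;
}
```

With a power-of-two stride, successive accesses tend to map to the same few cache sets. The odd stride produced by padding spreads them across the cache.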
In presenting performance results, we document the time in seconds for computing a 2048 × 2048 2-D transform with serial code, and the execution time and speedup of the parallel code as a function of processor count. The speedup is relative to the time for one processor executing the parallel code. The efficiency relative to the serial algorithm is known given the separate timing of the serial code that we provide.
In Table 6 we show the performance of the DEC 8400 platform on the FFT benchmark. The execution time and associated speedup, as a function of processor count, P, are shown in the columns labeled Time and Speedup, respectively. In the columns labeled Time Blocked and Speedup Blocked, we show the execution time and associated speedup for a version of the FFT benchmark that blocks the index scheduling in a manner that minimizes false sharing. In the columns labeled Time Padded and Speedup Padded, we show the execution time and associated speedup for the blocked version of the FFT benchmark that pads the arrays to minimize cache line collisions.
The execution time for a serial implementation of a 2048 × 2048 FFT is 10.82 seconds. The execution time for the same serial 2-D transform, with the arrays padded by an extra element to reduce cache line collisions when reading or writing memory in the x direction, is 8.55 seconds. Within measurement error, the timings for the serial code are identical to the timings for the associated parallel code executed with one processor. This indicates that parallelization overhead is insignificant. We note that blocked index scheduling does not change performance in a significant way on the DEC 8400.
Since we know that false sharing of cache lines must be occurring on the x directional sweep, the cost of this communication between processors must be relatively low. The best absolute performance and speedup is obtained by padding the arrays to avoid cache line collisions.
Table 6. FFT Performance on the DEC 8400
The SGI Origin 2000 is a distributed shared memory platform wherein each page resides on a computational node. If one processor performs the initialization of the 2-D array, all of the pages of memory reside on the node that contains this processor, leading to a performance bottleneck. Obtaining good performance required adjustments to the code to deal with this problem. We also found that the presence of a bottleneck in the virtual memory system prevented good performance on an initial 2-D FFT when processors were taking page or memory management unit faults. This was addressed by performing the FFT twice and timing the second instance.
In Table 7 we show the performance of the SGI Origin 2000 platform on the FFT benchmark. In the columns labeled Time Sinit and Speedup Sinit we show the execution times and relative speedup as a function of processor count, P, for a version of the FFT benchmark wherein a single processor does the initialization. As a result, pages are located on one node.
In the columns labeled Time Pinit and Speedup Pinit we show the execution times and relative speedup for a version of the FFT benchmark in which the processors share the initialization tasks. Pages are therefore distributed across the machine. The improvement in performance obtained by distributing the pages across the computational nodes is clear. All further benchmark measurements on the SGI Origin 2000 use a parallel initialization to remove this potential bottleneck.
In the columns labeled Time Blocked and Speedup Blocked we show the execution times and relative speedup for a version of the FFT benchmark in which index scheduling in the x direction is blocked to reduce false sharing. A significant increase in speedup is obtained. Finally, in the columns labeled Time Padded and Speedup Padded, we show the execution times and relative speedup for a blocked version of the FFT benchmark in which the y array dimension is padded by one element to reduce cache line collisions. As was the case for the DEC 8400, the best absolute performance and relative speedups are obtained by padding the arrays.
The execution time for a serial implementation of a 2048 × 2048 FFT is 11.0 seconds. The execution time for the same 2-D transform, with the arrays padded by an extra element to avoid cache line collisions, is 7.58 seconds. Comparing the serial implementation execution times with that of the parallel implementation using one processor, we see that the parallelization overhead is low.
In Table 8 we show the performance of the Cray T3D platform on the FFT benchmark. The execution time and speedup for a version of the FFT that uses scalar access to shared memory are shown in the columns labeled Time and Speedup, respectively.
Table 8. FFT Performance on the Cray T3D
In Table 9 we show the performance of the Cray T3E-600 platform on the FFT benchmark. The execution time and speedup for scalar and vector access to shared memory are shown in the same way as for the T3D. The execution time for a serial implementation of a 2048 × 2048 FFT is 16.93 seconds. As was the case for the T3D, vector access to shared memory produces good results on the T3E.
Table 9. FFT Performance on the Cray T3E-600
In Table 10 we show the performance of the Meiko CS-2 platform on the FFT benchmark. The execution time for a serial implementation of a 2048 × 2048 FFT is 39.96 seconds. The absolute performance and speedup for the FFT benchmark on the Meiko CS-2 are poor, caused by the high software overhead placed on shared memory access. Results could be improved through the use of a blocked layout for the 2-D arrays. We will demonstrate this in the results of the matrix multiply benchmark below. 
Matrix-Matrix Product
Our matrix multiply benchmark is the computation of the product of two matrices located in shared memory, placing the result in shared memory. This benchmark is for double precision matrices of size 1024 × 1024, and is coded in PCP without using vendor-specific libraries to accelerate performance.
In this benchmark we employ a block decomposition for the 1024 × 1024 matrices. We treat the matrices as 64 × 64 arrays of 16 × 16 submatrices. This is done by packing the submatrices into a C structure. In PCP, shared memory is interleaved on an object boundary where the object in this case is a C structure. This places the submatrix on one processor and allows the efficient blocked copying of 2048 bytes of memory for each remote memory access.
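The packing of a submatrix into a C structure can be sketched as follows. With 8-byte doubles a 16 × 16 block occupies 16 × 16 × 8 = 2048 bytes, which matches the transfer size quoted above. The type name `block` and the function names are hypothetical, and the kernel shown is a plain triple loop rather than the tuned benchmark code.

```c
#include <assert.h>
#include <stddef.h>

#define BS 16   /* submatrix dimension */

/* One submatrix, packed so that it is a single object and therefore
   resides on a single processor. */
typedef struct {
    double e[BS][BS];
} block;

/* Set *b to the identity submatrix. */
void block_set_identity(block *b)
{
    assert(b != NULL);
    for (size_t i = 0; i < BS; i++)
        for (size_t j = 0; j < BS; j++)
            b->e[i][j] = (i == j) ? 1.0 : 0.0;
}

/* c := c + a * b on private copies of the three submatrices. */
void block_multiply_add(block *c, const block *a, const block *b)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < BS; i++)
        for (size_t k = 0; k < BS; k++) {
            double aik = a->e[i][k];
            for (size_t j = 0; j < BS; j++)
                c->e[i][j] += aik * b->e[k][j];
        }
}
```

A processor would fetch the two operand blocks from shared memory with one structure copy each, run the kernel on the private copies, and write the result block back with a single structure copy.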
In the benchmark results below we include for comparison purposes the performance for a serial implementation of the 1024 × 1024 matrix multiply, blocked using 16 × 16 submatrices. In Tables 11 through 15, we present the performance of the parallel code as a function of the processor count, P.
In Table 11 we show the performance of the DEC 8400 on the matrix multiply benchmark.
Table 11. Matrix Multiply Performance on the DEC 8400
In Table 12 we show the performance of the SGI Origin 2000 on the matrix multiply benchmark. The performance of the serial blocked algorithm on this platform is 126.69 MFLOPS. As was the case for the FFT, the matrix multiply was computed twice and the second pass timed. The virtual memory overhead incurred on the first pass slows down execution by more than a factor of three at thirty processors. The scaling is better than the DEC 8400, but shows signs of diminishing returns above 16 processors.
In Table 13 we show the performance of the Cray T3D on the matrix multiply benchmark. The performance of a serial implementation of the blocked algorithm is 23.38 MFLOPS. We note the superlinear speedups for processor counts between 2 and 8. This has no explanation in terms of scaling cache size as the processor count is increased. It is likely caused by a performance degradation arising in the use of prefetch logic by a given processor to communicate with its own memory.
Table 13. Matrix Multiply Performance on the Cray T3D
In Table 14 we show the performance of the Cray T3E-600 on the matrix multiply benchmark. The performance of a serial implementation of the blocked algorithm is 97.62 MFLOPS. The T3E, in contrast to the T3D, benefits from an on-chip cache that is fully coherent with the local memory. Memory references from remote processors do not cause gratuitous cache line spills. The parallelization overhead is 24% and good speedups are obtained.
Table 14. Matrix Multiply Performance on the Cray T3E-600
In Table 15 we show the performance of the Meiko CS-2 on the matrix multiply benchmark. The performance of a serial implementation of the blocked algorithm is 14.24 MFLOPS. Note the sharp contrast of the speedups here with the speedups for the case of the FFT benchmark. The blocked data movement has greatly improved performance on the CS-2. Coding for blocked data movement is essential on a distributed memory platform that places high software overhead on communication. 
Discussion
We have investigated the type qualifier approach, first introduced in the Split-C and AC compilers, as an alternative to explicit message passing for writing parallel programs that port across a wide range of architectures. Our investigation was carried out with an extension of the PCP translator that is capable of addressing both shared memory and distributed memory architectures.
The resulting programming model is efficient enough to be the method of choice for shared memory platforms while still providing latency hiding mechanisms that make efficient use of distributed memory architectures possible. Superlinear speedups were obtained for all of the coherent cache shared memory platforms that we tested with the Gaussian elimination benchmark. The inherent scalability of the programming model became clear with the FFT benchmark on the Cray T3D platform, wherein a speedup of 251 for 256 processors was obtained.
A shared memory programming model implemented with the type qualifier approach does not remove the need to exploit latency hiding mechanisms on architectures where communication latency is an obstacle to good performance. Communication latency is significant on all of the distributed memory platforms we tested. Vectorized movement of data between shared and private memory was found to be sufficient to provide good performance on the Cray T3D and T3E platforms. It was refreshing to see true distributed memory systems provide good performance when using well known and understood techniques that did not require much code restructuring.
The Meiko CS-2, however, does not possess communication hardware that is capable of efficiently overlapping vector access to arrays of single words. More aggressive code restructuring to provide large block access to shared data is required in order to obtain good performance. We demonstrated the value of blocking shared data access on the Meiko CS-2 in the matrix multiply benchmark.
