Exploiting locality is a central goal of translating problem-oriented parallel programming languages for distributed memory parallel machines. Modula-2* places the burden of automatically deriving good data and process distribution on the compiler.
1 Introduction
Straightforward compilation of FORALL statements and allocation of array elements onto massively parallel machines results in a significant amount of interprocessor data motion. Therefore, data and process distribution is an essential problem of numerous compiler projects targeting distributed memory machines.
There is widespread agreement about the two goals of data and process distribution: (1) Data locality. To reduce the amount of communication and achieve minimal run-time, all data elements which are used by a process should be stored locally on the same PE. (2) Parallelism. Using just one processor results in perfect data locality and minimal communication cost. In general, however, the run-time can be improved by exploiting the full degree of parallelism provided by the hardware. A trade-off between the conflicting goals of data locality and parallelism must be found.
Whereas the goals are agreed upon, totally different approaches to reach them have been developed.
In many programming languages the user must explicitly provide the data layout. Some languages require an explicit mapping of the data onto the topology [16, 9, 14], others are more abstract and offer either sets of directives for the compiler or interactive or knowledge-based environments that help determine the alignment of array dimensions and mapping functions [3, 10, 7, 1, 2]. Recent work [5, 6, 4, 15, 8] focuses on static compile-time analysis to automatically find a data decomposition that achieves both goals for vector and data-parallel operations.
Modula-2* [17] is designed for high-level, problem-oriented, and machine-independent parallel programming. The programmer can focus on the problem to be solved, abstracting from the available number of processors and the interconnection network. Therefore, the compiler has to determine an appropriate data and process distribution.
Known approaches to automatically derive good data allocations have targeted purely data-parallel programming languages, i.e. languages in which the parallelism comes from vector manipulations. In these approaches it is sufficient to find good data allocations. Locality is achieved by applying the owner-computes rule to distribute the statement execution onto the processors accordingly.
Modula-2*, however, is not a purely data-parallel programming language. When designing Modula-2*, we wanted to preserve the main advantages of data-parallel languages while avoiding the drawbacks [13]. Although data-parallel programming is possible, the notion of process is present. Therefore, both data and process distribution must be found by the compiler.
In this paper we present our approach to derive both data and process distribution for Modula-2* programs. Our technique is based on the work of Knobe [6] but extends her ideas with the consideration of process distribution and the clear separation of high-level data arrangement and physical data layout.
In section 2 we present the basic characteristics of Modula-2*. Section 3 explains the general approach of the Modula-2* compiler. In sections 4 and 5 we give some more details on the alignment graphs, the conflict detection, and the heuristic search mechanism.
2 Modula-2*
The programming language Modula-2* was developed to allow for high-level, problem-oriented and machine-independent parallel programming. As described in [17], it provides the following features:
An arbitrary number of processes operate on data in the same single address space. Note that shared memory is not required; a single address space merely permits all memory to be addressed, but not necessarily at uniform speed. Synchronous and asynchronous parallel computations as well as arbitrary nestings thereof can be formulated in a totally machine-independent way. Procedures may be called in any context (sequential, synchronous, or asynchronous) at any nesting depth. Furthermore, additional parallel processes can be created inside procedures (recursive parallelism). All abstraction mechanisms of Modula-2 are available for parallel programming.
Modula-2* extends Modula-2 with just two language constructs:
1. The only way to introduce parallelism into Modula-2* programs is by means of the FORALL statement, which has a synchronous and an asynchronous version.
2. The distribution of array data is optionally specified by so-called allocators. These machine-independent allocators do not have any semantic meaning. They are just hints about data layout for the compiler.
Because of the compactness and simplicity of these extensions, they could easily be incorporated into other imperative programming languages, such as Fortran, C, or Ada.
FORALL statement
In Modula-2*, the syntax of the FORALL statement is: 
Allocation of array data
Modula-2* provides a simple, machine-independent construct for controlling the allocation of array data. This construct is optional and does not change the meaning of a program. The modified declaration syntax for arrays is:
Array elements whose indices differ only in dimensions that are marked LOCAL are associated with the same processor. This facility is used to avoid distribution of data in a given dimension.
Dimensions with allocator SPREAD are divided into segments, one for each of the available processors. A vector with n elements is assigned to P processors by allocating a segment of length $\lceil n/P \rceil$ to each processor. While utilizing all available processors, SPREAD minimizes the cost of nearest-neighbor communication.
Dimensions with allocator CYCLE are distributed in a round-robin fashion over the available processors. Given P processors, the elements of a vector whose indices are identical modulo P are associated with the same processor. In contrast to SPREAD, CYCLE maximizes the cost of nearest-neighbor communication, since neighboring array elements are always on different processors. In return, it leads to better processor utilization if a parallel algorithm operates on subsegments of a vector.
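The two mappings can be sketched as follows. This is a minimal illustration of the definitions above, not the compiler's actual layout code; the function names `spread_owner` and `cycle_owner` are hypothetical, and processors are assumed to be numbered from 0.

```python
def spread_owner(i: int, n: int, p: int) -> int:
    """Processor owning element i of an n-element vector under SPREAD."""
    seg = (n + p - 1) // p      # segment length ceil(n / P)
    return i // seg


def cycle_owner(i: int, p: int) -> int:
    """Processor owning element i under CYCLE (indices equal modulo P share a processor)."""
    return i % p


def remote_neighbor_pairs(owners: list) -> int:
    """Number of adjacent element pairs that live on different processors."""
    return sum(1 for a, b in zip(owners, owners[1:]) if a != b)


n, p = 10, 4
spread = [spread_owner(i, n, p) for i in range(n)]
cycle = [cycle_owner(i, p) for i in range(n)]
print(spread, remote_neighbor_pairs(spread))
print(cycle, remote_neighbor_pairs(cycle))
```

For a vector of 10 elements on 4 processors, SPREAD places only the segment boundaries on different processors, whereas CYCLE separates every pair of neighbors.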
Dimensions with allocator RANDOM are distributed randomly over the available processors. In contrast to CYCLE, RANDOM leads to better processor utilization if a parallel algorithm accesses the dimension in a random pattern.
If SPREAD, CYCLE, or RANDOM applies to several successive dimensions, then these dimensions are "unrolled" into one pseudo-vector with a length that is the product of the lengths of the individual dimensions. This scheme idles fewer processors than applying SPREAD, CYCLE, or RANDOM to individual dimensions.
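The effect on idle processors can be illustrated with a small count. The sketch below rests on assumptions that are not stated in the text: row-major unrolling, and, for the per-dimension case, a processor grid of `pr` by `pc` processors. All function names are hypothetical.

```python
def seg_len(n: int, p: int) -> int:
    """Segment length ceil(n / p) used by SPREAD."""
    return (n + p - 1) // p


def used_unrolled(rows: int, cols: int, p: int) -> int:
    """Processors that receive data when both dimensions are unrolled (row-major, assumed)."""
    seg = seg_len(rows * cols, p)
    return len({(r * cols + c) // seg for r in range(rows) for c in range(cols)})


def used_per_dimension(rows: int, cols: int, pr: int, pc: int) -> int:
    """Processors that receive data when SPREAD is applied to each dimension separately."""
    sr, sc = seg_len(rows, pr), seg_len(cols, pc)
    return len({(r // sr, c // sc) for r in range(rows) for c in range(cols)})


# A 5 x 5 array on 16 processors, viewed as a 4 x 4 grid in the per-dimension case.
print(used_unrolled(5, 5, 16))          # 13 of 16 processors hold data
print(used_per_dimension(5, 5, 4, 4))   # 9 of 16 processors hold data
```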
Allocators SBLOCK and CBLOCK apply SPREAD and CYCLE, respectively, to each dimension individually. For two successive dimensions, SBLOCK has the effect of creating rectangular subarrays and assigning those to the processors. With this arrangement, nearest-neighbor communication in all dimensions is best supported when the interconnection network can be configured into the same number of dimensions as the arrays.
CBLOCK for two dimensions also creates two-dimensional subarrays, but the rows and columns of these subarrays are then distributed in a round-robin fashion over the processor grid. Again, SBLOCK minimizes nearest-neighbor communication, while CBLOCK allows high processor utilization if smaller subarrays are processed in parallel.
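For two dimensions, the difference between the two block allocators can be sketched as below. The processor grid of `pr` by `pc` processors and the function names are assumptions made for illustration; the sketch simply applies the SPREAD and CYCLE rules to each dimension.

```python
def sblock_owner(r: int, c: int, rows: int, cols: int, pr: int, pc: int) -> tuple:
    """Grid coordinates of the processor owning element (r, c) under SBLOCK."""
    seg_r = (rows + pr - 1) // pr
    seg_c = (cols + pc - 1) // pc
    return (r // seg_r, c // seg_c)


def cblock_owner(r: int, c: int, pr: int, pc: int) -> tuple:
    """Grid coordinates of the processor owning element (r, c) under CBLOCK."""
    return (r % pr, c % pc)


# Owners of the top-left 2 x 2 subarray of a 4 x 4 array on a 2 x 2 processor grid.
sub = [(r, c) for r in range(2) for c in range(2)]
print({sblock_owner(r, c, 4, 4, 2, 2) for r, c in sub})   # one processor
print({cblock_owner(r, c, 2, 2) for r, c in sub})         # all four processors
```

The small subarray is handled by a single processor under SBLOCK but by all four processors under CBLOCK, which is the utilization effect described above.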
3 Alignment in Modula-2*
In this section we present the general ideas of our data and process alignment strategies.
Data layout is the decision about which element of an array is physically stored on which processor. Arrangement is the process of arranging array elements so that the elements of different arrays which are used together will end up on the same processor.
Although arrangement and layout are seen as one step in the literature, we propose to separate these issues into two phases:
Alignment = Arrangement + Layout
Data Alignment
In terms of Modula-2* we use a source-to-source transformation in the first phase to achieve the arrangement. The analysis would not arrange arrays A and B if the programmer had used different allocators. In this case, the compiler issues a performance warning, which suggests reconsidering the allocators used. If the programmer does not use any allocator, the compiler selects an appropriate one.
In the second phase, the layout algorithm maps both arrays to the available processors in the same way. Since both arrays have the same declaration, elements with the same index end up on the same processor. Our layout algorithm, which is described in [12], reaches the following goals: (a) Exploit fast communication patterns if there is special hardware support, e.g. nearest-neighbor networks. (b) Perform simple address calculations. The computation of processor numbers and addresses of data elements reduces to fast shift and mask operations.
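Goal (b) can be illustrated with a small sketch. It assumes that the segment length is a power of two, which is one way to make shift and mask operations sufficient; whether the layout algorithm of [12] works exactly this way is not stated here, and the name `layout` and the parameter `log2_seg` are hypothetical.

```python
def layout(i: int, log2_seg: int) -> tuple:
    """Map global index i to (processor number, local address).

    Assumes a SPREAD-like layout whose segment length is 2 ** log2_seg,
    so that division and remainder become a shift and a mask.
    """
    pe = i >> log2_seg                   # processor number
    addr = i & ((1 << log2_seg) - 1)     # offset within the local segment
    return pe, addr


# With segments of length 4, element 13 is the second element on processor 3.
print(layout(13, 2))
```

The result agrees with `divmod(i, 2 ** log2_seg)`, but avoids the division.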
Process Alignment
Up to now we have only dealt with the data alignment and its realization. Process alignment is also achieved by a source-to-source transformation. The code generator then simply considers the range of the FORALL as an array and invokes the layout algorithm to determine which processor has to simulate which of the conceptual processes in a virtualization loop. In the above example the original FORALL has been split into two parts. In both FORALLs the process with index i will be executed where data element B[i] resides, resulting in purely local accesses. This could not be achieved with a single FORALL. Furthermore, providing the code generator with exact alignment information facilitates easy exploitation of nearest-neighbor communication networks.
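A virtualization loop of this kind can be sketched as follows. The sketch assumes a SPREAD-like layout for both the array and the FORALL range and simulates the processors one after another; the names `local_range` and `simulate_forall` are hypothetical and do not reflect the code the compiler actually generates.

```python
from typing import Callable


def local_range(pe: int, n: int, p: int) -> range:
    """Indices of an n-element range that a SPREAD-like layout places on processor pe."""
    seg = (n + p - 1) // p
    return range(min(pe * seg, n), min((pe + 1) * seg, n))


def simulate_forall(n: int, p: int, body: Callable[[int, int], None]) -> None:
    """Run the conceptual processes 0..n-1 on p processors.

    The outer loop stands in for the physical processors, which would run
    concurrently; the inner loop is the virtualization loop on one processor.
    """
    for pe in range(p):
        for i in local_range(pe, n, p):
            body(pe, i)


trace = []
simulate_forall(10, 4, lambda pe, i: trace.append((pe, i)))
print(trace)
```

Because the FORALL range is laid out like the array, process i runs on the processor that holds element i, so an access to that element is local.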
The arrangement does not always work that smoothly. In general, there are many alignment preferences both for data usage and process alignment. Additionally, suitable cost estimation is required. Depending on the overhead cost of splitting up a FORALL, it may be advantageous on particular parallel hardware to accept some non-locality instead.
The following two sections are more specific and show our arrangement algorithm in some detail.
4 Arrangement Graphs and Conflicts
During static compile-time analysis we create an arrangement graph. Nodes of this graph are array references of arbitrary type and FORALL-variables. Edges express arrangement preferences and are attributed with the type and the structure of the detected preference.
Type and Structure
We found four types of arrangement preferences to be necessary. The first two types were introduced by Knobe and provide data arrangement information.
An identity preference is an arrangement request that relates a defining occurrence of an array to a using occurrence of the same array. It indicates a preference to align identical elements of the array on the same processor for the two occurrences. The idea is to avoid redistribution cost.
A conformance preference relates two array occurrences that are operated on together in a parallel expression. The goal is to group elements of different arrays so that all data accesses can be done locally.
Knobe has introduced a third preference for expressing data arrangement information. An independence anti-preference is a property of specific array dimensions if these dimensions contain a potentially parallel subscript. For the analysis of Modula-2*, this type of preference is not necessary, because (a) the allocators already indicate distributed storage and (b) parallelism is explicit in array subscripts inside FORALL statements.
The next two types of arrangement preferences are used to gather information for process alignment. Arranging the processes with all LMOs in the body of the FORALL will achieve perfect locality of processes and data that is accessed in parallel. The process will run where the data is located. Since conformance preferences already ensure that all data which is operated on together will be arranged, only LMOs are considered. An LMO preference relates two successive LMOs of the same array in the body of a FORALL if these are subscripted in the same dimension with an expression using the same FORALL-variable. LMO preferences represent the cost of splitting up FORALLs, i.e. the increased virtualization overhead. If all LMO preferences are honored the FORALL will not be split up. 
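One possible in-memory representation of such a graph is sketched below. It is an illustration only: the class and field names are hypothetical, node labels such as "A#1" merely stand for individual array occurrences, and only the three preference types named in this section are modeled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Pref(Enum):
    IDENTITY = "identity"        # same array, defining vs. using occurrence
    CONFORMANCE = "conformance"  # arrays operated on together
    LMO = "lmo"                  # successive LMOs of one array in a FORALL body


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    pref: Pref
    structure: str               # free-form description of the preference structure


@dataclass
class ArrangementGraph:
    nodes: Set[str] = field(default_factory=set)
    edges: List[Edge] = field(default_factory=list)

    def add(self, src: str, dst: str, pref: Pref, structure: str) -> None:
        self.nodes.update((src, dst))
        self.edges.append(Edge(src, dst, pref, structure))

    def data_edges(self) -> List[Edge]:
        """Edges that carry data arrangement information."""
        return [e for e in self.edges if e.pref in (Pref.IDENTITY, Pref.CONFORMANCE)]

    def process_edges(self) -> List[Edge]:
        """Edges that carry process alignment information."""
        return [e for e in self.edges if e.pref is Pref.LMO]


g = ArrangementGraph()
g.add("A#1", "A#2", Pref.IDENTITY, "dimension 1")
g.add("A#2", "B#1", Pref.CONFORMANCE, "dimension 1")
g.add("B#1", "B#2", Pref.LMO, "dimension 1, FORALL variable i")
print(len(g.data_edges()), len(g.process_edges()))
```

Keeping the preference type on the edge makes it easy to extract the data arrangement subgraph and the process arrangement subgraph separately, as the conflict analysis requires.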
Conflicts
The arrangement graph usually is not free of conflicts. In general, it is impossible to arrange data elements and processes in a way that all accesses are local without any redistribution of data or processes. We distinguish between data arrangement conflicts and process arrangement conflicts.
Data Arrangement Conflicts
In the following example the data arrangement graph is cyclic. In our approach, we avoid data redistribution at run-time. Therefore, there are two possible data arrangements. In both cases, one conformance preference is honored, the other one is broken.
To determine all possible arrangements, we apply the following algorithm to each cycle in the data arrangement graph:
Otherwise, there is no data arrangement conflict.
The compiler preserves all conflict-free data arrangements and all conflicts, i.e. all possible data arrangements that require breaking at least one data arrangement preference. The way this information is used is presented in section 5.
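The enumeration of alternatives for a cycle can be sketched as follows. The conflict criterion used here is an assumption made for illustration, not the algorithm of this paper: each preference edge is given a hypothetical integer index offset, and a cycle is taken to be conflict-free exactly when the offsets along it sum to zero. In case of a conflict, every alternative breaks one preference of the cycle.

```python
from typing import List, Tuple

# (source occurrence, target occurrence, index offset); the offset encoding is hypothetical
Edge = Tuple[str, str, int]


def cycle_conflicts(cycle: List[Edge]) -> bool:
    """Assumed criterion: the offsets around a consistent cycle cancel out."""
    return sum(offset for _, _, offset in cycle) != 0


def possible_arrangements(cycle: List[Edge]) -> List[List[Edge]]:
    """Return the sets of preferences that can be honored together."""
    if not cycle_conflicts(cycle):
        return [list(cycle)]                       # everything can be honored
    # Otherwise each alternative breaks exactly one preference of the cycle.
    return [cycle[:k] + cycle[k + 1:] for k in range(len(cycle))]


conflicting = [("A", "B", 0), ("B", "A", 1)]
for arrangement in possible_arrangements(conflicting):
    print(arrangement)
```

For a cycle of two preferences that cannot both hold, the sketch yields two arrangements, each honoring one preference and breaking the other.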
Process Arrangement Conflicts
In the following example the process arrangement graph is cyclic:
7. Consider all edges of a subgraph. There is a dimension conflict if among those there is a pair of process preference edges with differing dimensions in a single array.
The compiler keeps all conflict-free process arrangements and all conflicts, i.e. all possible process arrangements that require breaking at least one LMO preference. The way this information is used is presented in the following section.
5 Cost Considerations
In the previous section the processing of the data arrangement graph has resulted in a collection of several possible data arrangements for the whole program. For each FORALL statement in this program the compiler has derived a collection of possible process distributions.
Finding an optimal process distribution with a brute force algorithm would involve an exponential search space. A FORALL with n statements and p possible distributions requires the cost estimation for $p^n$ different combinations.
Unfortunately, the combination of two optimal process distributions for the statement sequences $1 \ldots \lfloor n/2 \rfloor$ and $\lfloor n/2 \rfloor + 1 \ldots n$ does not necessarily result in a global optimum, since redistribution of processes imposes additional costs. Under the assumption that the process redistribution cost, i.e., the cost of splitting up a FORALL into several FORALLs, is small compared to the communication cost due to data access, the probable loss of optimality can be tolerated. Therefore, a dynamic programming approach with a time complexity of $O(n \log n)$ is feasible:
Although the compiler considers both possible data arrangements, in this example we will only consider the second arrangement. Therefore, we will only present steps 2 to 5 of the search algorithm from section 5.
[Table with columns "line", "(A,1,1,1)", and "(A,1,1,0)"; surviving entries: 6g+2s+1f, 0-0-1-0, 3g+1s+2f]
In the table, s, g, and f denote the cost of a send operation, the cost of a get operation, and the cost of splitting up a FORALL. In the first step, the costs of executing individual lines are computed for all process distributions. Merging lines 1 and 2 is obvious, since in both lines (A,1,1,0) is superior. This is shown by 0-0 in the table. For merging lines 3 and 4 there are three possibilities. All must be considered, since f is not zero: (1) use (A,1,1,1) for both lines at a cost of 2g+1s, (2) use (A,1,1,0) for both lines at a cost of 3g+1s, or (3) redistribute 1→0 at a cost of 2g+1f, which is the cheapest. When considering the whole FORALL statement in the last step, there are again three options: (1) select data distribution (A,1,1,1) for all lines at a cost of 3g+2s, (2) select (A,1,1,0) for the first two lines and (A,1,1,1) for the last two lines at a cost of 6g+2s+1f, or (3) redistribute again, resulting in a cost of 3g+1s+2f. Given the values for g, s, and f, the best process distribution will split up the given FORALL after the second and the third line.
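The merging scheme described above can be sketched as a recursive search. This is a simplified illustration under stated assumptions, not the compiler's search algorithm: costs are plain numbers, `costs[line][d]` is a hypothetical table holding the communication cost of a line under distribution `d`, and when two halves are merged the candidates are (a) one uniform distribution for the whole range and (b) the two best halves joined, paying the split cost `f` if the distribution changes at the boundary.

```python
from typing import List, Sequence, Tuple


def best_distribution(costs: Sequence[Sequence[float]], f: float) -> Tuple[float, List[int]]:
    """Heuristically choose one distribution per line of a FORALL body."""
    p = len(costs[0])

    def solve(lo: int, hi: int) -> Tuple[float, List[int]]:
        if hi - lo == 1:
            d = min(range(p), key=lambda k: costs[lo][k])
            return costs[lo][d], [d]
        mid = (lo + hi) // 2
        left_cost, left = solve(lo, mid)
        right_cost, right = solve(mid, hi)
        # Candidate: keep the best halves, split the FORALL if they disagree.
        split = f if left[-1] != right[0] else 0.0
        best = (left_cost + right_cost + split, left + right)
        # Candidates: a single distribution for all lines in the range.
        for d in range(p):
            total = sum(costs[i][d] for i in range(lo, hi))
            if total < best[0]:
                best = (total, [d] * (hi - lo))
        return best

    return solve(0, len(costs))


# Hypothetical per-line costs for two distributions (columns 0 and 1).
table = [[5, 1], [4, 1], [1, 3], [3, 1]]
print(best_distribution(table, f=0.5))   # cheap splits: distributions change twice
print(best_distribution(table, f=10))    # expensive splits: one distribution throughout
```

With a small split cost the sketch splits the FORALL after the second and the third line; with a large split cost it keeps a single distribution. Each of the log n merge levels inspects every line once per distribution, which gives the O(n log n) behavior for a fixed number of distributions. As in the text, the result is not guaranteed to be globally optimal.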
Assuming the second data arrangement, the code fragment is transformed as follows. Note that for the sake of clarity the transformations related to data arrangement are left out in this example, i.e. all arrays are still presented in their original declaration with the original subscripts.
6 Conclusion
In this paper we have presented a technique that enhances locality using a source-to-source transformation. The result of this program transformation is a data and process alignment that results in better performance: initial benchmarks show a performance improvement of at least 60% on the MasPar MP-1.
We consider this result to be initial evidence that automatic data and process distribution by the compiler is possible and can achieve attractive performance improvements.
