Abstract-The polytope model has been used successfully as a tool for program analysis and transformation in the field of automatic loop parallelization. However, for the final step of automatic code generation, the generated code is either only usable on shared memory architectures or severely restricts the parallelization methods that can be applied. In this paper, we present a fully automated method for generating efficient target code, which is executable on clusters that are based on a distributed memory architecture. We also provide speedup results of experiments on a local cluster.
In the context of this paper, we deal only with loops whose bounds are affine linear expressions in indices of surrounding loops and in (symbolic) constants. Using some additional mathematical definitions, this allows for a more concrete description of index spaces.
Definition 3: In an -dimensional vector space, a hyperplane is an " -dimensional subspace. It divides the vector space into two half spaces.
Definition 4: A polyhedron is the intersection of finitely many half spaces. A polytope is a bounded polyhedron. A polyhedron (polytope) can be defined by an affine inequality system, which we usually represent in matrix form:
is the constant vector. In our application, every loop surrounding a statement corresponds to one dimension in the vector space, i.e., the index space, in the model. Every affine loop bound is modeled by an inequality. Thus, the -dimensional index space of a statement can be represented as 
When we are not interested in the inequalities but only in the dimensions in which the polytope lies, we denote a polytope 6 by a sequence of sub-vectors which represent the names of the variables of the relevant dimensions: (4)
B. Dependences
As the final step of modeling a loop nest, we must integrate the interactions of different operations into the model. 
where Q is the lexicographic order. Definition 8: The (multi-dimensional) function T that maps every operation to its logical processor coordinate is called placement. This function has no correctness constraints, but influences heavily the number of communications.
These functions together define the parallel execution: all operations that are scheduled at a common time step C can be executed in parallel on the logical processors assigned by the placement function T . In the polytope model, we restrict ourselves to (piecewise) affine functions for schedule and placement. Therefore, they can be written in matrix notation, either each individually or also both together.
Definition 9: The affine transformation that consists of the schedule and placement functions is called space-time mapping. It maps each operation to a logical time 
D. A basic communication polytope
By applying the space-time transformation to the source and destination ISPC of each dependence 6 , we obtain a dependence relation 6 a in the TISPC. If the space coordinate of the source and the destination differ, a communication must be launched. Let us define a mathematical representation for these communications.
Definition 10: A communication polytope
is constructed from 
We can use this communication polytope to generate the required communication code. For each transformed true dependence 6 , we construct two point-to-point communication statements, one for for sending and one for receiving the data element described by . At this point, we can automatically generate code for distributed memory architectures: we supply the polytopes modeling the space-time mapped computation statements and the send and receive statements to a tool that converts the polytope description to a loop nest, e.g., CLooG [4] .
However, the target code obtained by application of the approach mentioned above is usually not efficient enough to produce a speed-up. The reason is the fine granularity of parallelism in the target program.
E. A refined communication polytope
In order to obtain coarser-grained parallelism, we apply tiling [1] , [22] . We partition an q -dimensional space (often called the offset) lies within the tile described in (8) . Combining this with (9), we obtain an inequality that expresses the relation between tile coordinates and virtual coordinates:
Note that tiling doubles the dimensionality since the virtual coordinates are decomposed into tile coordinates and offsets.
In contrast to the traditional framework, we apply tiling to the transformed index space, i.e., after the application of the space-time mapping [12] . Thus, we can check separately the effect of tiling space and time dimensions.
By applying tiling to spatial coordinates, communication is only performed between processor tile coordinates, thereby avoiding the overhead from unnecessary communication between the large number of virtual processors which are executed on the same real processor.
Time tiling is used for aggregating virtual time steps into global time steps. During each global time step, data elements are not communicated immediately, but are aggregated in buffers, which are transmitted at the end of each global time step, thereby achieving message vectorization. Aside: it can be shown that only a single virtual time dimension can be tiled with a tile size greater than 1 and less than the maximal extent of the time dimension [11] . Let us discuss space and time tiling in more detail. 1) Tiling time dimensions: Since tiling time postpones the transmission of data, it also delays the receiver. This leads to a deadlock if two processors would like to exchange data between two virtual time steps: the sending is postponed, but none of the two processors can continue execution.
We can avoid this problem if we restrict ourselves to placements that satisfy an additional property. If (11) also holds for anti and output dependences, we say that the placement satisfies the strict FCO property. Note that there exist placement algorithms that guarantee the (strict) FCO property [14] .
If we have a placement satisfying the FCO property, we know that every communication targets a processor with a higher number than the source, thus avoiding communication cycles. We obtain a correct execution order of the tiles by delaying tiles executed on processors with high numbers longer than tiles executed on processors with small numbers [11] . More formally, we delay the execution of a given time tile £ by the sum of entries in its processor tile coordinate (
Note that restricting ourselves to placements that satisfy the strict FCO property allows our communication scheme to unpack all data elements from the receiver's buffer immediately after the corresponding communication, i.e., one global time step after the write access that lead to communication [13] .
2 
Unfortunately, the resulting index space is non-convex and can only be expressed by the union of polytopes for
and
(using the lexicographic order µ in all processor dimensions). However, we can again exploit the FCO property and restrict the polytope representations to the case that Equation 14 holds.
3) The refined communication polytope: To summarize, tiling extends our communication polytope by additional tiling dimensions for time tile and processor tile coordinates for the source ( ¹ ) and destination ( ¹ ) of a transformed dependence relation (with skewing already applied):
F. Statement types for our communication scheme
For our refined communication scheme, we generate three new types of statements in addition to the transformed computation statements from the original input program. For every transformed dependence, two types of buffer management statements are generated. The third statement type is responsible for performing the communication.
1) Write-buffer statements: After each write access at the source of a true dependence, the computed data element is not sent directly by a point-to-point communication, but stored in a buffer by a write-buffer statement. This buffer holds all data destined for the corresponding destination processor.
2) Unpack-buffer statements: For each write-buffer statement, a corresponding unpack-buffer statement is generated, which is used to unpack data elements from the communicated buffers at the global time step immediately following the global time step of the write access.
For the generation of both buffer management statements, our extended communication polytope 
Ä
The polytopes for the write-buffer and the unpack-buffer statements contain both source and destination processor tile dimensions, but they differ in the ordering of these dimensions.
It is important to use the same ordering of elements in the corresponding write and read buffers. For that purpose, we must use the same polytope description of logical space-time coordinates for both buffer management statements. It is irrelevant whether we use the coordinates of the source or the destination index space of a given dependence -we just have to use the same description for both statements.
In our communication scheme, each unpacking operation of communicated data from a communication buffer is always performed at the global time step that follows the corresponding write operation (cf. Section II-E.1). Therefore, it is not required to enumerate all global time coordinates of the destination index space for each dependence. Instead, the global time coordinate for the unpack buffer statement is fully determined by the global time coordinate of the source index space of the corresponding dependence, to which one global time step is added. The resulting enumeration order for write-buffer and unpack-buffer statements is given by projecting the communication polytope as follows:
, we prevent redundant communication of data. Consider the case that 3) Communication statement: For each global time step that includes computation, a communication statement is generated, which performs the actual communication of the data elements stored in separate buffers for each destination processor.
For this communication statement, we re-use the polytope description of the transformed computation statements, where space-time mapping, tiling and skewing has already been applied. In order to enumerate all global time iterations that execute computations, we project on the global time coordinate:
Here, no processor tile coordinates are required, because the communication is performed by a collective operation that is executed on all real processors.
At this point we have polytope descriptions for all computation and communication-relevant statements of our refined communication scheme. The final step is to merge them correctly and to generate a program from them.
G. From the polytope representation to target code
For the generation of the target program's loop nest, all polytopes of the four different statement types have to be merged and scanned correctly. For this task, we use CLooG, an improved implementation by Bastoul [4] of Quilleré's algorithm [21] .
To describe the execution domain for each statement in the target program, we use our polytope descriptions. However, we also have to take care of the scheduling of the four different statement types relative to each other: 1) For each virtual time step within a global time step, our refined communication scheme can lead to at most three different types of statements being executed in the following order: a) Unpack-buffer statement (reads the received value from the receive-buffer) b) Compute statement (computes the new value) c) Write-buffer statement (writes the new value to the send-buffer) 2) At the end of each global time step, the communication statement is executed on each processor. We use additional constant scheduling dimensions that are inserted to guarantee the desired ordering [16] . CLooG provides a mechanism for inserting and reordering additional dimensions for the target loop nest. Affine functions (socalled scatter functions) are used to define equations between variables from the domain descriptions and the additionally inserted scatter dimensions. We also use these scatter functions to change the enumeration order of processor tiles for the write-buffer and unpack-buffer statement, allowing to use the same polytopes for the domain description of both statement types.
H. Post-processing
The generated loop nest is post-processed automatically, in order to insert if-statements that restrict the execution of processor tile iterations to processors with the corresponding MPI process number. If the number of available MPI processes is known at compile time, the tile sizes for processor tiles can be adjusted in order to obtain a mapping of each processor tile on exactly one MPI process. In this case, we know the number of processor tile coordinates in each dimension at compile time and can compute a one-dimensional MPI process number by an appropriate function getRank.
Inside the innermost loop for processor tiles 
. // execute inner loops }
As an additional step in post-processing, the code for buffer management and performing the actual communication is inserted for the statement placeholders that are generated by CLooG within the loop bodies.
I. Mapping to real processors
If processor tiling is used for mapping the processor tile coordinates to one-dimensional MPI process numbers (as mentioned in Section II-H), the program has to be re-generated with adjusted tile sizes, if another number of available MPI processes is used.
In order to avoid this costly code generation, we use a dynamic approach for mapping to real processors: ÷ A map is created by enumerating all processor tile coordinates that contain computation statements and mapping them to a one-dimensional coordinate, using a cyclic or block distribution. For this purpose, the polytope representations of all computation statements are projected to their processor tile dimensions.
÷
The function getRank is replaced by a function lookupRP, that performs a lookup at run time in the generated processor map:
if(lookupRP(p1,p2) == mpi_rank) { ... // execute inner loops } Especially when using the cyclic distribution strategy, this approach yields very good load balance.
III. EXPERIMENTS

A. Input program
For our experiments, we used one-dimensional successive over-relaxation (SOR) [10] : M=10000, N=500000 M=10000, N=1000000 M=10000, N=2000000 M=10000, N=5000000
Fig. 2. Efficiency
and processor dimension respectively. In this case, we only have one-dimensional time and processor dimensions, so we have to use a rectangular tile shape.
B. Target architecture
We used a local cluster for our benchmarks that consists of 32 nodes of dual-Pentium III 1 GHz CPUs with 512 MB memory. These nodes are interconnected by an SCI network. Of the 32 nodes, we used between one and 16 nodes for our experiments, while using only one CPU core for each node.
C. Results
The resulting speedup is displayed in Figure 1 for different parameter settings. Figure 2 shows the corresponding efficiency. We chosen a larger number of iterations (¥ Our experiments have shown that our refined communication scheme produces high speedups. For the problem sizes used in our examples, the generated code scales up to to 14 processors with quite good efficiency (between 51% and 71%).
The described code generation using CLooG for enumerating the target loop nest from the polytope descriptions usually yields very large code sizes for the target program (e.g. more than lines of code in the example used in our experiments). However, the overhead from buffer management, synchronization and the complexity of the loop bounds is reasonably low, as documented by the high efficiency in the case that the parallel target program is run on only one processor.
IV. RELATED WORK
There are other projects that also address automatic code generation for distributed memory architectures. Faber [6] , [7] describes an approach that is also based on the polytope model. He generates HPF target code that includes annotations of how data is distributed among processors. The HPF compiler then generates the required communication from this distribution annotations. However, the application of this method is restricted because common HPF compilers often fail to generate the corresponding communication code for cases that require irregular communications. The code generation method described in this paper, by contrast, enables the generation of communication code for arbitrary affine communication patterns, as long as the applied placement satisfies the strict FCO property.
Ferner [9] also describes a method of automatically deriving communication code for distributed memory architectures. His approach is based on a parallelization technique of Lim and Lam [19] that generates so-called partitions of the index spaces of all statements in the input program. These partitions can be executed in parallel on logical processors. Ferner proposes a mapping algorithm that uses affine mappings from logical processors (partitions) to real processors and illustrates how the corresponding communication code loop nests can be obtained.
However, as opposed to the original parallelization technique described by Lim and Lam, he restricts the partitions to be one-dimensional, which limits the degree of parallelism obtained in the target program, but simplifies the code generation. He also restricts the target programs to asynchronous parallel programs, whereas Lim and Lam's method often results in synchronous parallel target programs that require an outer sequential loop for enumerating time steps.
Athanasaki et al. describe another approach that also uses tiling to reduce communication for distributed memory based clusters [2] . In their approach, an additional tiling transformation is used for aggregating processor tiles along certain hyperplanes into so-called groups, which can be executed efficiently by exploiting the availability of communicationfree shared memory processors on each node in the cluster. They also use overlapping of computation and communication operations to achieve a form of pipeline parallelism. For the implementation of the communication statements, they make use of low level adaptations, e.g. using zero-copy network protocols [20] for SCI networks.
However, their approach is restricted to perfectly nested forloops with uniform dependences. Also, the number of nodes and the number of shared memory processors per node have to be known at code generation time for this method, because both numbers are used statically for determining the tile and group sizes, respectively.
V. CONCLUSION
