Applications with varying array access patterns require to dynamically change array mappings on distributed-memory parallel machines. Hpf (High Performance Fortran) provides such remappings, on data that can be replicated, explicitly through the realign and redistribute directives and implicitly at procedure calls and returns. However such features are left out of the hpf subset or of the currently discussed hpf kernel for e ciency reasons. This paper presents a new compilation technique to handle hpf remappings for message-passing parallel architectures. The rst phase is global and removes all useless remappings that appear naturally in procedures. The code generated by the second phase takes advantage of replications to shorten the remapping time. It is proved optimal: A minimal number of messages, containing only the required data, is sent over the network.
11 Spmd remapping code : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 16 12 Remapping-based transpose code for hpfc : : : : : : : : : : : : : : : : : : : : : 20 13 Transposition speed per processor on P(2,2) : : : : : : : : : : : : : : : : : : : 21 14 Transposition speed per processor on P(2,3) : : : : : : : : : : : : : : : : : : : 22 15 Transposition speed per processor for (block,block) distributions : : : : : : : : 23
Introduction
Many applications, such as ADI (Alternating Direction Integration) and FFT 18] (Fast Fourier Transform), require di erent array mappings at di erent computation phases for e cient execution on distributed-memory parallel machines (e.g. Cray t3d, Ibm sp2, Dec Alpha farm). Data replication, sometimes partial, is used to share data between processors. Data remapping and replication often need to be combined: A parallel matrix multiplication accesses a whole row and column of data to compute each single target element, hence the need to remap data with some replication for parallel execution. Moreover, automatic data layout tools 24, 7] suggest data remappings between computation phases. Thus handling data remappings e ciently is an important issue for high performance computing.
Hpf (High Performance Fortran 14, 27] , a Fortran 90-based data-parallel language) targets distributed-memory parallel architectures. Standard directives are provided to specify array mappings that may involve some replication. These mappings are changed dynamically, explicitly with executable directives (realign, redistribute) and implicitly at procedure calls and returns for prescriptive argument mappings. These useful features are perceived as di cult to compile e ciently and thus are left out of the hpf subset or of the currently discussed hpf kernel 15] . If not supported, or even not well supported, applications requiring them will not be ported to hpf: : : The key issues to be addressed are the reduction of the runtime overheads induced by remappings, and the management of the rich variety of hpf mappings. 
Related work

Contributions
Remapping overheads are attacked at di erent levels by the compilation technique implemented in hpfc, our prototype hpf compiler. At a global level, all useless remappings are removed. This optimization is presented in the rst part of the paper. Such remappings arise naturally in programs.
The second part of paper focuses on the remapping code generation problem for messagepassing parallel architectures with non-blocking sends and blocking receives. The problem is tted into a single powerful linear framework, which integrates all issues. Arbitrary remappings, involving partial replication, alignment strides, general cyclic distributions and di erently shaped processor grids are handled. The spmd generated code is based on the enumeration of polyhedron solutions that abstracts the required communications. Load balancing and broadcasts are also considered. Correctness and optimality results are discussed. The technique is fully implemented in hpfc 8, 9, 10], a prototype hpf compiler developed within the pips project 22] . Experiments on a Dec Alpha farm are also presented. To our knowledge, this technique is the rst to integrate all hpf mapping issues in a single framework, to address load balancing and possible broadcasts in the generated code and to present optimality results.
Section 1 presents the remapping graph construction from the control ow graph and a global optimization on this graph to remove statically all useless remappings. Section 2 introduces an example and notations for the code generation. The remapping problem formalization into a polyhedron is described and illustrated in Section 3: The hpf constraints are presented, then optimizations taking into account (1) particular distribution of data onto the processors such as replication and (2) e cient communication capabilities of distributed memory machines such as broadcast, are added to the compilation scheme. The SPMD code generation is presented in Section 4 and optimality properties are discussed in Section 5. Finally, Section 6 presents and analyzes experimental results.
Remapping Graph
Useless remappings may appear naturally in hpf programs. First, the change of both alignment and distribution of an array requires a realign and a redistribute, hence resulting in two remappings if no special care is taken. Second, the redistribution of a template 2 induces the remapping of all aligned arrays, even if they are not all referenced afterwards. Third, at an interprocedural level, two consecutive subroutine calls may require the same remapping for a given array, resulting in a useless remapping on return from the rst subroutine and on entry in the second. If two di erent mappings are required, it may also be interesting to remap data directly rather than using the intermediate original mapping. Such examples do not arise from badly written programs, but from a normal use of hpf features. They demonstrate the need for compile time optimizations to avoid useless costly remappings at runtime. Figure 1 . The loop nest involving two remappings is typical of ADI computations. Template T is redistributed at 1, inducing B and C remappings, but C is not referenced afterwards. Moreover argument A is never referenced with its initial mapping.
In this section, the remapping graph, its construction from the control ow graph and its optimizations are presented. This approach deals with descriptive and prescriptive mappings, i.e. when the compiler is aware of data distributions. This information is depicted in Figure 2 . To the vertex is associated the remapped arrays A and B, with the leaving mapping as a subscript (0 for A, 1 for B) and the set of reaching mappings as a superscript (f1; 3g for A, f2g for B). Referenced arrays are underlined (here only B). The compiler must generate remapping codes for each pair (reaching to leaving mappings). Let us describe how G R is built from the program control ow graph. First, the entry and exit vertices are created. The distributed subroutine local variables are attached to the entry, with their initial mapping as a leaving mapping. The subroutine distributed formal parameters are attached to both entry and exit vertices. The leaving mapping for those variables on entry is the initial mapping. The reaching (resp. leaving) mapping for the entry (resp. exit) is an a priori unknown X mapping. If the directives are descriptive, the reaching mapping on entry of the subroutine is the initial mapping (X=0). On exit, distributed formal parameters are tagged as used: without further interprocedural information, the compiler assumes that the nal remapping is needed. Figure 3 shows the initial graph for remaps. The next phase of the G R construction is the propagation of the initial mappings from the subroutine entry, till remappings are encountered. This must be done for each couple (v; A) of vertex and arrays remapped at this vertex. First, initialize the set of couples to be propagated with the entry vertex associated to the distributed arrays. Then for each such couple (v; A), propagate in the control graph from the corresponding vertex till meeting remapping statements for that array mapping. Tag the array remapping as used if a reference is encountered while propagating. Let w be one of the encountered remapping statements. Add a corresponding w vertex in G R if necessary. Add A to S(w) and compute L A (w) if necessary, and (w; A) is a new couple to be explored later on. Add L A (v) to R A (w). The resulting remapping graph for remaps is shown in Figure 4 . 
Exit
Optimization
In G R , arrays that are remapped after a remapping without having been referenced are tagged as not used for this remapping. In such cases, at least two remappings will be performed at runtime without referencing the array in between, as array A in Figure 4 after the entry vertex. Such useless remappings must be removed. However the successive remapping statements must be aware that they were not performed and that they may have to deal with other reaching mappings. Indeed, the whole set of reaching mappings must be recomputed. Some are no longer of use and others must be added. This optimization is performed as follow:
First, remove all useless remappings 4 , simply by deleting the leaving mapping for those vertices and arrays. The iterative resolution of the optimizing function is increasing and bounded, thus it converges. The resulting graph for remaps is shown in Figure 5 .
{X,2} 8v; 8A 2 S(v)^Used A (v); 8a 2 R A (v); 9w and a path from w to v in G R , so that a 2 L A (w) and A is not used on the path: Proof: construction of the path by induction on the solution of the data ow problem. Note that the path in G R re ects an underlying path in the control ow graph with no use and no remapping of the array.
Discussion
If subroutine local arrays are not used from the entry point in their initial mapping, the compiler may delay the allocation till a used mapping is needed, or chose another initial mapping among directly useful ones.
Remappings involving unknown X mappings should be propagated to call sites in order to be instantiated.
The set of needed remappings after this optimization may have been reduced or extended.
What is minimized is the number of remappings performed at run-time, not those that must be addressed at compile time. Our compiler keeps a database of generated remapping codes in order not to generate some code twice.
The remapping graph was presented at an intraprocedural level. It is natural to extend it to the interprocedural level, for instance by providing a summary of the entry and exit remappings to be used at the call sites foroptimizations. Remappings of arguments should be decided and performed at call site.
G R for remaps includes an edge from the entry to the exit vertex, because the DO loop may be empty and thus array A may reach the exit vertex without remapping. If the compiler can determine that the loop body is always executed, the skipping edge can be removed from the control graph, thus improving remapping graph G R quality.
In order to simplify the presentation, it was assumed that only one mapping for an array could leave a remapping statement. This is not necessarily the case in hpf. Thus several leaving mappings may be associated to a vertex and array, and for each of these mappings a set of reaching mappings. Care must also be taken in the building phase. Use-information must be attached to the leaving mappings.
The runtime needs to keep track of the mapping status of each array to chose the right remapping routine when needed.
Some additional bene ts may be obtained by moving remapping in the control ow graph, in order to perform a remapping only when the array is to be actually referenced in its new shape.
Example and notations
Let us consider the example in Figure 6 . This example is deliberately contrived, and designed to show all the capabilities of our algorithm. Real application remappings should not present all these di culties at once, but they should frequently include some of them. Vector A is remapped from a block distribution onto 3-d processor grid Ps to a general cyclic distribution onto 2-d processor grid Pt through template T redistribution. Both source and target mappings involve partial replication. The corresponding data layouts are depicted in Figure 7 . The colors denote the data to processor a ectation. The initial mapping is a block distribution of A onto the second dimension of Ps. Each column of (dark and light) processors in Ps owns a full copy of A. Thus A is replicated 4 times. The target mapping is a cyclic(2) distribution onto Pt rst dimension. Each line owns a full copy of A and A is replicated twice. Let us describe how the spmd generated code handles the remapping communications. The arrows in Figure 7 denote the source to target processor assignment. On the target side, each column of processors waits for exactly the same data, hence the opportunity to broadcast the same messages to these pairs. On the source side, each column can provide any needed data, since it owns a full copy of A. The di erent columns can deal with di erent target processors, thus balancing the load of generating and sending the messages. For the running example, 5 di erent target processor sets are waiting for data that can be addressed by 4 source processor groups. The source to target processor assignment statically cyclically balances the targets among the possible senders.
Linear algebra provides a powerful framework to characterize array element sets, represent hpf directives, generate e cient code, de ne and introduce optimizations and make possible correctness proof of the compilation scheme. Our compilation scheme uses polyhedra to represent the hpf data remapping problem. Notations are shown in Tables 1, 2 Figure 8 shows the declaration constraints for the objects involved in the source and target mappings of the running example. Lower and upper bounds are de ned for each dimension of arrays, templates and processor grids. Figure 9 shows the constraints derived from hpf directives. The sets of distributed and replicated dimensions are also shown. The alignment is quite simple here, but a ne expressions are needed for general alignments (for instance align A(i,*) with T(*,3*i-2) would lead to 2 = 3 1 2). The template distributions require additional variables, for modeling blocks and cycles: is the o set within a block; is the cycle number, i.e. the number of wraps around the processors for cyclic distributions. General cyclic distributions need both variables. Distributions on replicated dimensions are useless, thus are not included. This linear modelization is extensivelly described in 3]. Figure 10 presents the local declarations and global to local address translations generated by hpfc, expressed through linear constraints. Thus they are directly included in our compilation scheme. However such an integration is not required: providing global to local address translation functions would be su cient, although more expensive at run time. i.e. apart from replicated dimensions, only one processor owns a data on the source and target processor grids, thus constraining the possible communications.
Proof: Hpf mapping semantics.
Broadcasts and load balance
A spmd code must be generated from such a polyhedron linking the array elements to their corresponding source and target processors. However, in the general case, because of data replication, R is not constrained enough for attributing one source to a target processor for a given needed array element. Indeed, R jpR assigns exactly one source to a target as shown in Proposition 2, but p R variables are still free (Proposition 1). The underconstrained system allows choices to be made in the code generation. On the target side, replication provides an opportunity for broadcasts. On the source side, it allows to balance the load of generating and sending the messages.
Broadcasts
In the target processor grid, di erent processors on the replicated dimensions own the same data set. Thus they must somehow receive the same data. Let us decide that the very same messages will be broadcasted to replicated target processors from the source processors. From the communication point of view, replicated target processors are seen as one abstract processor to be sent a message. On the polyhedron point of view, 0 R dimensions are collapsed for message 5 Some variables, as , are of no interest for the code generation and can be exactly eliminated, reducing the size of the system without loss of generality nor precision. Load balancing Now one sender among the possible ones ( R ) must be chosen, as suggested in Figure 7 . This choice must be independent of the replicated target processors, because of the broadcast decision. Moreover, in order to minimize the number of messages by sending elements in batches, it should not depend on the array element to be communicated. Thus the only possible action is to link the abstract target processors 0 D to R . These processors wait for disjoined data sets (Proposition 2) that can be provided by any source replicated processors (Proposition 1).
To assign 0 D to R in a balanced way, the basic idea is to attribute cyclically distributed target to replicated source processor dimensions. This cyclic distribution must involve processors seen as vectors on both side. In order to obtain this view of 0 D and R , a linearization is required to associate a single identi er to a set of indices.
The rationale for the linearization is to get rid of the dimension structuration in order to balance the cyclic distribution from all available source replicated processors onto all target distributed processors. Source processors that own the same elements are attributed a unique identi er through lin( R ), as well as target processors requiring di erent elements through lin( If all processors must enumerate all the integer solutions to polyhedron E, this is equivalent to the runtime resolution technique and is very ine cient. Moreover, it would be interesting to pack at once the data to be sent between two processors, in order to have only one bu er for message aggregation. Therefore some manipulations are needed to generate e cient code. This projection may not be exact 6 . P represents processors that may have to communicate: empty messages may be generated for processor couples in P. To avoid sending and receiving these empty messages, while preserving the balance of messages, the following runtime technique is used: (1) in the send part, messages empty after packing are not sent; (2) in the receive part, messages are lazily received when some data must be unpacked. The technique is shown in Figure 11 . Thirdly, in a spmd code executed in parallel, each (maybe virtual) processor plays a part (or none) in the processor grids Ps and Pt. Hence not all processors should enumerate the couples of communicating processors: processors in Ps resp. Pt] are just interested in enumerating their matching target resp. source] processors for sending resp. receiving] data. P can be used to generate guards to select relevant processors and then to directly enumerate the sole matching processors in the other grid. At last, processors from di erent processor grids may be allocated to the same physical processor. Thus the code must not send a message from a processor to itself, but rather generate a local copy of the required elements instead. Figure 11 shows the spmd code generated with P and E . The code is composed of a send and a receive part. The send part rst selects the processors in Ps, and among them those which may communicate (P j 0 ; ). Then the corresponding target processors are enumerated ( 0 D loop). If there is a broadcast or if the target and source physical processor di er, the data are packed in a message (e loop). Then the message is sent of not empty. Function local source address() computes the local address for a given array element on the source processors. If the local addressing scheme is integrated in the modelization, the local adress is directly enumerated in e.
The receive part is the dual of the send part. It selects the processors in Pt, and among them those processors which may have to receive some data. Then the corresponding senders are enumerated, and the messages are lazily received and unpacked to the local target array. If the sender was the processor itself, a local copy is performed: the communicating couple led to identical physical processors. The data is just copied from the source to target arrays. The copy is performed on the receive part in order not to delay the messages sending.
The generated code requires the enumeration of some polyhedra. Techniques based on Fourier elimination 21, 2] or a parametric simplex 13] generates code to exactly enumerate the solutions to a polyhedron. The correctness of the communications requires that the messages are packed and unpacked in the same order. This is enforced because the very same loop nest on E is generated for both packing and unpacking.
Optimality and discussion
For a given remapping, a minimal number of messages, containing only the required data, is sent over the network:
Theorem 2 (Only Required Data is Sent) If the source and target processor grids are disjoined on the physical processors, only required data is communicated.
Proof: E exactly describes the array elements and their mapping (derived from Proposition 2). Polyhedron scanning techniques exactly enumerate the elements in E and these elements must be communicated if the processors are disjoined.
Theorem 3 (Minimum Number of Messages is Sent) If the source and target processor grids are disjoined on the real processors, a minimal number of messages is sent over the network.
Proof: Only required data is communicated (Theorem 2), all possible aggregations are performed (Proposition 6) and empty messages are not sent.
Theorem 4 (Memory Requirements) The maximum amount of memory required per hpf processor for a remapping is 2 (memory(As) + memory(At)).
Proof: Local arrays plus send and receive bu ers may be allocated at the same time. The bu er sizes are bounded by the local array sizes because no more than owned is sent (even for broadcasts) and no more than needed is received (Theorem 2).
Special remappings that involve no communications can be automatically detected: if the target mapping is a particularization of the source mapping, i.e. all the needed data is locally available. The usual Fourier elimination technique needs to know the number of processors. However the parametric extension presented in 1] allows to generate code if this number is parametric.
Since processor distributed dimensions are independent in E, the practical complexity of the code generation for multiple dimensions roughly is the number of dimensions times the complexity of the code generation for one dimension. As expected, simple codes are generated for simple remappings, and more complicated ones for general cyclic distributions.
Experiments
The remapping generation technique is implemented in hpfc, our prototype hpf compiler. Pvm is used for handling communications. Hpfc aims primarily at demonstrating feasability and being portable, rather than achieving very high e ciency on a peculiar architecture. These new features were tested on a DEC Alpha farm at LIFL (Universit e de Lilles, France).
The experimental results and derived data are presented in this section. They show some improvement over communications generated by the dec hpf compiler, despite our high level implementation. Quite good performances for complex remappings compared to the simple and straightforward block distribution case were obtained. Experimental conditions and raw measures are presented and analyzed.
Experimental conditions
This section presents the hardware and software environment used for the tests, the measurements and the experiments. Compilers: DEC Fortran OSF/1 f77 version 3.5 with "-fast -O3 -u" options. DEC C OSF/1 cc with "-O4" option (for part of the hpfc runtime library). DEC hpf f90 version FT1.2 with "-fast -wsf n" options for comparison with the remapping codes generated by hpfc, our prototype hpf compiler.
Communications: Pvm version 3.3.9 standard installation, used with direct route option and raw data encoding. PVMBUFSIZE not changed. 1 MB intermediate bu er used to avoid packing each array element through pvm.
Transposition: A square matrix transposition was tested for various matrix sizes 8 , processor arrangements and distributions. The time to complete A=TRANSPOSE(B) with A and B initially aligned was measured. It includes packing, sending, receiving and unpacking the data, plus performing the transposition. The code is shown in Figure 12 . The 2d remapping compilation times ranged in 0:3 2:5s on a sun ss10. Other experiments with 1d remappings ranged in 0:2 1:5s.
Measures: The gures present the best wall-clock execution time of at least 20 instances, after substraction of a measure overhead under-estimation. The starting time was taken 7 Raw measures for transpositions are displayed. Tables 5 and 6 show the transposition times for various matrix sizes and distributions. The column heads describe the distribution of the array dimensions: for instance c5c7 stands for (cyclic(5),cyclic (7)). Table 7 show the (block,block) transposition time for various array arrangements, involving up to 16 processors. These raw measures are analyzed in the next section.
Performance analysis
From the previous raw measures, derived data are presented in Figures 13, 14 and 15 . For comparison purposes, we introduce the transposition speed per processor, expressed in MB/s/pe 9 . This unit is independent of the matrix size n, the type length 10 l and the number of processors p involved. If t is the measured time, then speed s is de ned as: Figure 13 displays the performances on P(2,2). Cases bb and cc are very similar: Indeed, in both cases 2 processors must exchange all their local data and 2 do not have to communicate at all, thus the generated codes are very similar. Also two degradations due to pvm are noticeable for large matrix sizes, when the amount of communication steps over 1 MB and 2 MB. bc and c48c64 show similar performances. c5c7 is a tricky case, and the enumeration costs are higher. 9 MB stands for Mega Bytes, and 1M = 2 20 = 1 048576 10 l = 8 in our experiments based on real*8 arrays. Finally Figure 15 shows performance of (block,block) transpositions on various processor arrangements for hpfc and the dec hpf compiler. The transpose intrinsic is serialized according to the release notes, so it was not used. Since the independent, realign and redistribute directives are not implemented, only the available hpf forall instruction was used to transpose the matrix. Our performance is degraded for large matrix sizes because of the pvm 1 MB bu er size limit. Also the more processor the larger the matrix size is needed to get comparable speeds. Transposition speed based on our code show 20-30% improvements over the dec compiler, up to pvm bu er size problems. However these results are not comparable: our code is a higher level one, based on pvm, and simple standard optimizations would be usefull for the compiling enumeration code e ciently. 11 Complex remappings show quite good results with respect to simple ones. However the performances should be compared somehow to the peak 12:5 MB/s/link available. Some measurements show that the pvm overhead represents up to 80% of the measured time. A more aggressive bu er management for the remappings and the PvmDataInPlace option 4] may reduce this overhead. Generating code closer to the machine would also help, but at the price of portability. The experiments also show that lattice detection is an important issue for generating good code, and that optimizations such as invariant code motion can have a great in uence on the performances of the polyhedron enumeration code.
Conclusion
A general compilation technique was presented to handle hpf remappings e ciently. This technique is implemented in hpfc, our prototype hpf compiler. Portable pvm-based 16] code is generated. The remapping code generation problem is put in a single linear framework that deals with all hpf issues such as alignments or general cyclic distributions. Optimality results
