Abstract: Ownership sets are fundamental to the partitioning of program computations across processors by the owner-computes rule. These sets arise due to the mapping of arrays onto processors. In this paper, we focus on how ownership sets can be efficiently determined in the context of the HPF language and show how the structure of these sets can be symbolically characterized in the presence of arbitrary array alignment and array distribution directives. Our starting point is a system of equalities and inequalities due to Ancourt et al. [1] that captures the array mapping problem in HPF. We arrive at a refined system that enables us to efficiently solve for the ownership set using the Fourier-Motzkin Elimination technique and that requires the course vector as the only auxiliary vector. The formulation makes it possible to enumerate the elements of the ownership set exactly once, a feature that is very beneficial when such sets are applied to handle DO loops qualified by HPF's INDEPENDENT directive. We develop important and general properties pertaining to HPF alignments and distributions and show how they can be used to eliminate redundant communication due to array replication. We present polynomial-time schemes that determine whether the ownership set of a particular processor, with respect to some array, is the empty set or whether the ownership set of every processor, with respect to some array, is the empty set. We show how distribution directives with unspecified processor meshes can be efficiently handled at compile time. We also show how to avoid the generation of communication code when pairs of array references are ultimately mapped onto the same processors. Experimental data demonstrating the improved code performance that the latter optimization enables is presented and discussed.

INTRODUCTION
In languages such as High Performance Fortran (HPF) [10], array mappings guide the partitioning of program computations across processors. They are specified by the programmer in terms of annotations called directives. The actual mapping process typically involves two steps: arrays are first aligned with a template, and templates are then distributed over a virtual mesh of processors. The alignment operation, performed via the ALIGN directive, assigns every array element to at least one template cell. The distribution operation, done using the DISTRIBUTE directive, associates every template cell with exactly one processor. In this way, array elements are eventually mapped onto processors.
In an automated code generation scenario, the compiler decides the processors on which to execute the various compute operations occurring in a program. In allocating program computations to processors, the compiler uses the mapping information associated with the data. A possible scheme, known as the owner-computes rule [15] , is to allow only the owner of the left-hand side reference in an assignment statement to execute the statement. By the owner-computes rule, expressions that use data located on the same processor can be evaluated locally on that processor, without the need for interprocessor communication. When the need to transfer remotely located data arises, the compiler produces the relevant communication code. Hence, the owner-computes rule leads to the notion of an array's ownership set, which is the set of all its elements mapped onto a processor by virtue of the alignment and distribution directives.
Since the assignment of computations to processors is determined by the allocation of data to processors, one of the aims of the HPF compilation problem is to find a suitable way of representing the mapping information. Given such a representation, the next issue that must be addressed is how it can be used to realize the owner-computes rule. Does the proposed framework provide insights into the nature of the array mapping problem? Does the representation reveal general properties that can be leveraged to generate efficient code? In this paper, we investigate these questions in the context of a recent representation proposed by Ancourt et al. [1].
An Example
Fig. 1 shows a four-point stencil computation that occurs over a two-dimensional grid of $1022 \times 1022$ points. The value at every grid point is updated with the average of its four neighbors. Once the entire grid is updated this way, the process is repeated with the new values. The example introduces an abstract two-dimensional array T of size $1024 \times 1024$, called the template, against which the arrays A and B are aligned. The alignment directives align the array elements $A(i, j)$ and $B(i, j)$ with the template cell $T(i, j)$. The DISTRIBUTE directive maps blocks of template cells in T onto a two-dimensional virtual processor mesh P. The extent of a block along a dimension is determined by the extents of T and P along that dimension. Since the extents along each dimension in this case are 1024 and 2, respectively, a block size of $(1024/2) \times (1024/2) = 512 \times 512$ is used. Thus, the cells of the template T in the region $T(512 p_1 + l_1,\ 512 p_2 + l_2)$ get mapped onto the processor $P(p_1, p_2)$, where $0 \le l_1, l_2 < 512$ and $0 \le p_1, p_2 < 2$. We could have also chosen a CYCLIC(B) distribution along a particular processor dimension instead of a BLOCK distribution; in this case, the distribution is allowed to "wrap around" the processor dimension. Because $A(i, j)$ and $B(i, j)$ are aligned with $T(i, j)$, the (BLOCK, BLOCK) distribution results in each of the four quadrants in A and B being mapped onto the corresponding "quadrant processor" in P. Hence, a specific processor would execute only those statement instances in Fig. 1 whose left-hand sides lie in the quadrant of A or B owned by it.
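The block arithmetic of this example can be made concrete with a minimal sketch. The helper names (`block_owner`, `cyclic_owner`, `owner_2d`) are hypothetical and not part of HPF or PARADIGM, and template cells are assumed to be numbered from 0 along each dimension, as in the region expression above.

```python
def block_owner(cell, extent, nprocs):
    """Processor coordinate that holds a 0-based template cell under BLOCK."""
    block = -(-extent // nprocs)  # ceiling division gives the block size
    return cell // block


def cyclic_owner(cell, block, nprocs):
    """Processor coordinate under CYCLIC(block): blocks wrap around the mesh."""
    return (cell // block) % nprocs


def owner_2d(i, j, extent=1024, nprocs=2):
    """Owner of template cell T(i, j) under (BLOCK, BLOCK) on a 2 x 2 mesh."""
    return (block_owner(i, extent, nprocs), block_owner(j, extent, nprocs))


if __name__ == "__main__":
    print(owner_2d(511, 512))         # a cell in the upper-right quadrant
    print(cyclic_owner(512, 256, 2))  # the third block wraps back to processor 0
```

Under the owner-computes rule, a processor would then execute only those statement instances whose left-hand side cell maps to its own coordinates.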
Related Work
The problem of array alignment and array distribution has been extensively studied, and numerous structures that describe the mapping of arrays to processors have been suggested and examined [7], [4], [2], [13], [14], [3]. Early representations focused on BLOCK distributions alone and were incapable of conveniently describing the general CYCLIC(B) distribution. This deficiency was addressed in subsequent work by using techniques ranging from finite state machines and virtual processor meshes to set-theoretic methods [6], [8], [12]. However, these schemes primarily concentrated on enumerating local memory access sequences and handling array expressions. A generalized view of the HPF mapping problem was subsequently presented by Ancourt et al. [1], who showed how a system of equalities and inequalities could be used to mathematically express the regular alignment and distribution of arrays to processors. These systems were then used to formulate ownership sets and compute sets for loops qualified by the INDEPENDENT directive, and parametric solutions for the latter were provided based on the Hermite Normal Form [1].
Contributions
In this paper, we investigate the ownership set formulation in the Ancourt et al. framework and show how it can be refined to a form that requires a course vector as the only auxiliary vector and that also enables the efficient enumeration of its constituent elements. This property is desirable when ownership sets are applied to handle DO loops qualified by HPF's INDEPENDENT directive. Our approach to solving for the ownership set is based on the Fourier-Motzkin Elimination (FME) technique and, in that respect, we deviate from [1]. We also formulate an efficient polynomial-time test with which redundant communication due to array replication can be avoided. We present a sufficient condition, called the mapping test, that eliminates the need for generating communication code for certain right-hand side array references in an assignment statement. These two optimizations in turn depend upon whether the ownership set of a particular processor with respect to some array is the empty set or whether the ownership set of every processor with respect to some array is the empty set. We discuss how these decisions can be arrived at in polynomial time, once the symbolic representation of the ownership set is known. The mapping test often results in marked performance improvements, and we substantiate this by presenting experimental data. Finally, we discuss how our schemes permit the efficient handling at compile time of distributions in which the processor meshes are unspecified. The techniques mentioned in this paper have been incorporated into a new version of the PARADIGM compiler [9] using Mathematica as the symbolic manipulation engine.
Outline
The rest of this paper is organized as follows: In Section 2, we describe previous research that forms the foundations of our work. We discuss in this section how information pertaining to alignments and distributions can be compactly expressed as systems of equalities and inequalities. The ownership set is formally defined in this section. We refine the ownership set formulation in Section 3 and prove an important equivalence relation. In Section 4, we present the replication test and explain how it can be used to avoid redundant communication due to array replication. We also show in Section 4 how to ascertain in polynomial time whether the ownership set is empty or nonempty, given its symbolic FME solution. In Section 5, we describe the handling of distributions that lack an explicitly specified processor mesh. The mapping test is derived in Section 6 and we illustrate its workings using three examples. Finally, in Section 7, we report and analyze experimental data that demonstrates the performance benefits of the mapping test optimization.
PRELIMINARIES
Fig. 2 shows an artificial code fragment comprising a declaration and a set of HPF directives. Since the first dimension of A and the single subscript-triplet expression conform, this fragment is equivalent to that shown in Fig. 3. The dummy variables i, j, k, and l in Fig. 3 satisfy the constraints $-1 \le i \le 20$, $3 \le j \le 40$, $0 \le k \le 20$, and $0 \le l \le 99$, respectively.
The alignment directives in Fig. 3 can be compactly expressed through the following collection of equalities and inequalities [1]:
Similarly, the distribution directives in Fig. 3 can be represented by the following system [1]:
We use x, y, and z to represent the number of dimensions of the alignee [10], template, and processor mesh, respectively. Using this notation, the alignee's index vector and the template index vector $\mathbf{t}$ consist of x and y elements, respectively, while the column vectors $\mathbf{p}$, $\mathbf{l}$, and the course vector have z elements each. The ownership set of a processor $\mathbf{p}$, which is defined with respect to an array X, denotes those elements of X that are finally mapped onto $\mathbf{p}$ by virtue of the alignment and distribution directives. In set-theoretic notation, this set is:
where $\mathbf{p}$ ranges over the processors of the mesh; that is, each of its elements is nonnegative and strictly less than the extent of the corresponding mesh dimension. For instance, we could evaluate (8) to obtain the ownership set of $\mathbf{p}$ with respect to B for the example in Fig. 1. If we assume a (CYCLIC(256), CYCLIC(256)) distribution (instead of the (BLOCK, BLOCK) distribution shown in Fig. 1), this turns out to be:
OWNERSHIP SETS REVISITED
To solve (8) symbolically, we apply the Fourier-Motzkin Elimination (FME) technique [5], eliminating the variables corresponding to the unknown vectors: the alignee's index vector, $\mathbf{l}$, $\mathbf{t}$, and the course vector. During the elimination process, the vector $\mathbf{p}$, which denotes the processor's identity in a Cartesian processor mesh, is treated like any other constant; it therefore manifests in the bound expressions of the solution system. When resolved at runtime to a particular processor's identity, the solution system will represent that processor's ownership set. Fig. 4 shows the outcome when the FME technique is applied on the system corresponding to the ownership set of $\mathbf{p}$ with respect to A for the example in Fig. 3. We represent the chosen elimination order by the elimination vector. The elimination vector is a column vector whose last element denotes the first unknown to be eliminated and whose first element denotes the last unknown to be eliminated. In Fig. 4, the elimination vector is composed of four subvectors, of which $\mathbf{l}$ is the second and $\mathbf{t}$ is the last. Note that the actual region of interest is the set of points corresponding to the alignee's index vector in the solution.

We can scan the elements of this ownership set by constructing a loop nest in which the outermost loop corresponds to the first element of the elimination vector. Its lower bound is the maximum of two expressions in $p_1$, rounded up, and its upper bound is the minimum of two expressions in $p_1$, rounded down (see Fig. 4). Nested within it, in the order given by the elimination vector, are the loops corresponding to the remaining unknowns, ending with $t_1$, $t_2$, and $t_3$. Thus, the innermost loop, which corresponds to $t_3$, has $\max(0, e)$ as its lower bound and $\min(99, e)$ as its upper bound, where $e$ is the sum of 117 times the second course element, $l_2$, and $13 p_2$. Within the body of such a loop nest, the first two elements of the alignee's index vector will be the subscripts of the elements of the array A owned by processor $\mathbf{p}$.
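The shape of such a scanning loop nest can be illustrated with a minimal one-dimensional sketch. It assumes a single template dimension of 100 cells (0 to 99) distributed in blocks of 13 over 9 processors, values chosen to match the coefficients 117 = 13 × 9 and 13 in the innermost bound above; the function name `owned_cells` and the variable names are hypothetical, and this is not the loop nest of Fig. 4.

```python
def owned_cells(p, block=13, nprocs=9, extent=100):
    """Scan the template cells owned by processor p along one dimension.

    A cell is written as block*nprocs*c + block*p + l, where c is the
    course (cycle) number and l is the offset inside the block.
    """
    cells = []
    c = 0
    while block * nprocs * c + block * p < extent:  # loop over course values
        for l in range(block):                      # loop over block offsets
            t = block * nprocs * c + block * p + l
            if t < extent:                          # clip to the template bound
                cells.append(t)
        c += 1
    return cells


if __name__ == "__main__":
    for p in range(9):
        print(p, owned_cells(p))
```

With these assumed values, processor 7 receives a partial block and processor 8 receives nothing, which is the kind of empty ownership set that the tests of Section 4 detect.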
Refinement
Since the elements of the course vector, $\mathbf{l}$, and $\mathbf{t}$ serve only as auxiliary variables in (8), an important question at this juncture is whether formulations for the ownership set exist that require fewer auxiliary variables. Fewer unknowns translate to a smaller loop nest depth and, hence, a more efficient scan of the ownership set. This is especially important when compute sets derived from ownership sets are used to partition loops [9]. Reducing the number of unknowns also improves the overall timing of the FME solver.
The first improvement that can be made is removing the offsets vector $\mathbf{l}$. This can be accomplished in a straightforward manner, and Lemma 3.1 shows how. The system in Lemma 3.1, shown in Fig. 5, is an improvement over that in (8) because there is a reduction in the number of unknowns (by $\|\mathbf{l}\|$) and in the number of inequalities (by $2\|\mathbf{l}\|$).
Removing t
A key observation to make in (9) is that the replicated dimensions of a template do not affect the elements of the alignee's index vector. This is because premultiplying $\mathbf{t}$ by its coefficient matrix in (9) elides the rows that correspond to replicated dimensions in $\mathbf{t}$. What is a "replicated dimension"? We refer to those dimensions of a template that contain an asterisk (*) or an unmatched dummy variable in the alignment specification as the template's replicated dimensions (see [10]). The remaining dimensions are called its aligned dimensions.
The dimensions of a template that are not mapped onto any processor dimension are said to be collapsed; the remaining dimensions of the template are referred to as its distributed dimensions. If a template dimension is collapsed and not replicated, that template dimension can only affect the values of the alignee's index vector through the alignment equality and through the constraint that bounds $\mathbf{t}$ between $\mathbf{0}$ and the difference of the template's upper and lower bound vectors. To then find the corresponding elements of the alignee's index vector, we need only consider the pair:
template dimensions that are both aligned and distributed can be eliminated. Therefore, it is worth investigating whether a new system can be constructed for the ownership set that has fewer unknowns than that required by (9). As Lemma 3.2 in Fig. 6 will show, such a formulation does indeed exist.
The new system defined in (10) has exactly the same number of inequalities as the system in (9); however, the gain is a reduction of $\|\mathbf{t}\|$ in the number of unknowns.
Further Refinements
From the standpoint of efficiently scanning the ownership set, the system in (10) has further scope for improvement. If we were to consider those processor dimensions j along which replicated template dimensions are distributed, the corresponding course elements are only constrained by
This means that for a given alignee index vector, there could exist more than one course vector for which the system in (10) holds. Hence, if the solution system for (10) were used to scan the ownership set, members could get enumerated more than once. To remedy this problem, we extend the system in (10) by an additional equation, which yields the system in (11). Lemma 3.3 in Fig. 7 ensures the validity of such a transformation.
An important implication of Lemma 3.3 is that if the FME technique is now applied to the system in (11), then, whatever the order of elimination of the unknown variables (corresponding to the elements of the alignee's index vector and the course vector), the associated loop nest will scan every member of the ownership set exactly once. To see why this is so, consider one such elimination order. Its elimination vector has $x + z$ elements, where x and z denote the number of dimensions of the alignee and processor mesh, respectively. The integer bound expressions returned by the FME solver can be used to construct a loop nest that scans the ownership set. Any two distinct iteration points of this loop nest differ in the alignee index vector, in the course vector, or in both. But from Lemma 3.3, only one course vector can satisfy the system for a given alignee index vector. Thus, the alignee index vectors associated with two distinct iteration points must also be different. In other words, every member of the ownership set gets enumerated exactly once.
Comparisons
From a mathematical perspective, the systems in (8), (9), (10), and (11) are all equivalent: they refer to the same ownership set in all cases. However, they differ in such metrics as the number of inequalities to be solved, the number of variables to be eliminated (and, consequently, the number of auxiliary parameters required to scan the set), and, finally, in the way the members of the set get enumerated. Table 1 summarizes these metrics. As can be seen from the table, the original system in (8) requires the elimination of the largest number of variables and handles the largest number of inequalities. If a loop nest were generated from its FME solution system to scan the ownership set, it is not guaranteed that every member will get enumerated exactly once. This inability to ensure the "uniqueness" of each enumerated member has serious repercussions if the ownership set is used to generate other sets. For instance, we could use the compute sets defined in [1] to handle loops qualified by the INDEPENDENT directive. In [1], these sets were defined using a formulation of the ownership set similar to that in (8). If the FME solution system to such a compute set formulation were scanned, certain iterations in the set could get enumerated more than once for certain alignment/distribution combinations. However, if the formulation in (11) is used, this problem can be avoided. The FME approach that we adopt to solve for these sets is quite different from the approach in [1], where a parametric solution based on the Hermite Normal Form was exploited for the same purpose.
In Fig. 8, we show the result of applying the FME technique to the formulation in (11). Though the system has the same number of inequalities as the formulation in (8), the number of unknowns to be eliminated is smaller by five, and this results in a significant reduction in the time required to solve the system, from 0.58 seconds to 0.25 seconds.
TABLE 1 Comparisons of the Ownership Set Formulations
d. When the FME technique is applied, equalities of the form $f = g$, where $f$ and $g$ are functions of the unknown vectors, get replaced by a pair of inequalities of the form $f \le g$ and $f \ge g$.
e. To recapitulate, x is the number of dimensions of the alignee, y is the number of dimensions of the template, z is the number of dimensions of the processor mesh, and m is the number of aligned dimensions of the template.
An Equivalence Relation
It is interesting to enquire into the nature of ownership sets across processors. That is, for an arbitrary alignment/ distribution combination, can these sets partially overlap? Or, are they equal or disjoint? The answers to these questions can be used to devise an efficient runtime test that avoids redundant communication due to array replication (see Section 4).
The matrix that the following lemmas use to compare processor coordinates is a square diagonal matrix of size $z \times z$. It is easy to see that the principal diagonal elements of this matrix are either 0 or 1. It is also easy to see that the jth principal diagonal element is 0 if and only if the template dimension distributed on the jth processor dimension is a replicated one. To review, replicated dimensions are those dimensions of a template that contain either an asterisk (*) or an unmatched dummy variable in the alignment specification. Thus, if an array X is aligned to a template T that is then distributed onto a processor mesh P, Lemma 3.4 states, in Fig. 9, that the ownership sets with respect to X of two processors $\mathbf{p}$ and $\mathbf{q}$ in P will be the same if (1) $\mathbf{p}$ and $\mathbf{q}$ own at least one element of the alignee array X, and (2) $\mathbf{p}$ and $\mathbf{q}$ match in at least those dimensions on which the aligned dimensions of the template T are distributed.
The reverse is also true: if the ownership sets of two processors with respect to an array overlap, then their coordinates along those dimensions on which the aligned dimensions of the template are distributed must match. This is what Lemma 3.5 states in Fig. 10. The above two lemmas can be used to prove Theorem 1 in Fig. 11.
Let us define a binary relation $\sim$ on a mapped array such that two array elements are related by $\sim$ if and only if they are mapped onto the same processors. The rules of HPF ensure that for any legal ALIGN/DISTRIBUTE combination, every element of the mapped array will reside on at least one processor [10]. Hence, $\sim$ must be reflexive. Also, if one element is related to another, the second is obviously related to the first. Therefore, $\sim$ is symmetric. Finally, if a first element is related to a second and the second is related to a third, then, from Theorem 1, the first is related to the third. That is, $\sim$ is transitive. Hence, the ALIGN and DISTRIBUTE directives for a mapped array define an equivalence relation on that array.
Fig. 8. The FME solver applied on the modified ownership set formulation.
Fig. 13 show this optimization.
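The "equal or disjoint" behaviour described by Lemmas 3.4 and 3.5 can be observed on a small, assumed configuration: a one-dimensional array A of 8 elements with A(i) aligned to T(i, *), and T distributed (BLOCK, BLOCK) on a 2 × 2 mesh. The configuration and the helper names (`owners`, `ownership_set`) are hypothetical and chosen only for illustration.

```python
def owners(i, mesh=(2, 2), extent=8):
    """Processors holding A(i) when A(i) is aligned with T(i, *).

    The second template dimension is replicated, so A(i) lives on every
    processor along the second mesh dimension.
    """
    block = -(-extent // mesh[0])  # BLOCK size along the first dimension
    p1 = i // block
    return frozenset((p1, p2) for p2 in range(mesh[1]))


def ownership_set(p, extent=8):
    """Elements of A owned by processor p under the mapping above."""
    return frozenset(i for i in range(extent) if p in owners(i, extent=extent))


if __name__ == "__main__":
    for p in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(p, sorted(ownership_set(p)))
```

Processors that agree in the first coordinate, the only one on which an aligned template dimension is distributed, own the same elements; processors that differ there own disjoint sets.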
THE REPLICATION TEST
Once the integer FME solution for the system of equalities and inequalities that describe the ownership set is obtained, computing whether the ownership set is empty for a given $\mathbf{p}$ incurs only an additional polynomial-time overhead. The key idea that enables this decision is that, in the FME solution system for (11), a particular $p_j$ will occur in the bound expressions for the jth course element (see footnote 3). That is, there will be an inequality pair in the solution system that bounds the jth course element from below and from above by expressions in $p_j$. In addition, there can be at most one more inequality pair in the solution system that also contains $p_j$ in its bound expressions; the bounds in this pair are expressions in $p_j$ and the jth course element. Hence, if (11) has a solution for a given $\mathbf{p}$, each of the z disjoint inequality groups
in the solution system must independently admit a solution. The task of checking whether each of these groups has a solution for a particular $p_j$ is clearly of quadratic complexity. Hence, the complexity of ascertaining whether the ownership set is nonempty for a given $\mathbf{p}$ is polynomial. Since the complexity of evaluating the condition that the diagonal matrix of Section 3 applied to $\mathbf{p} - \mathbf{q}$ is nonzero is $O(z)$, the overall runtime complexity of evaluating the Boolean predicates in Fig. 13, given the integer FME solution system for the ownership set (known at compile time), becomes polynomial.
Observe that in the absence of replication, this diagonal matrix is the identity matrix; in this situation, the condition holds if and only if $\mathbf{p} \ne \mathbf{q}$. Hence, in the absence of replication, the test degenerates to the usual $\mathbf{p} \ne \mathbf{q}$ condition.
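The predicate itself is cheap to evaluate, as the following minimal sketch suggests. The diagonal matrix is represented by the list of its 0/1 principal diagonal entries; the function name `differs_in_aligned_dims` is hypothetical, and how Fig. 13 combines this predicate with the emptiness checks is not reproduced here.

```python
def differs_in_aligned_dims(p, q, diag):
    """True when diag * (p - q) is nonzero.

    diag[j] is 0 when a replicated template dimension is distributed on
    processor dimension j, and 1 otherwise.
    """
    return any(d * (pj - qj) != 0 for d, pj, qj in zip(diag, p, q))


if __name__ == "__main__":
    # Second mesh dimension carries a replicated template dimension.
    print(differs_in_aligned_dims((0, 0), (0, 1), diag=[1, 0]))
    print(differs_in_aligned_dims((0, 0), (1, 0), diag=[1, 0]))
```

With an all-ones diagonal (no replication), the predicate reduces to a plain comparison of the two processor coordinates, and the loop visits each of the z mesh dimensions once.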
UNSPECIFIED PROCESSOR MESHES
When a variable is eliminated from a system of inequalities by the FME technique, the method partitions the system into three sets: one in which the coefficients of the variable are negative, one in which they are zero, and one in which they are positive [5]. Therefore, for those variables that are to be eliminated by the technique, knowledge regarding the signs of their coefficients must be available. In the context of the system in (11), this implies that the principal diagonal entries of the square diagonal matrix $\mathbf{C}$ and of the processor-extent matrix can be symbolic. This is because these elements are known to be positive a priori and their actual values are of no concern until runtime. A useful consequence of this fact is that a processor mesh need not be provided in a DISTRIBUTE directive. Hence, the determination of the actual processor mesh can be postponed until runtime.
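The three-way partition can be seen in a minimal sketch of a single elimination step. This version works over the rationals with numeric coefficients only; the integer and symbolic handling used in the paper (via Mathematica) is not modelled, and the row representation and the name `eliminate` are assumptions made for illustration.

```python
from fractions import Fraction


def eliminate(rows, k):
    """Eliminate variable k from rows (coeffs, bound), each meaning
    sum(coeffs[i] * x[i]) <= bound. The result does not involve x[k]."""
    neg = [r for r in rows if r[0][k] < 0]    # give lower bounds on x[k]
    zero = [r for r in rows if r[0][k] == 0]  # unaffected by x[k]
    pos = [r for r in rows if r[0][k] > 0]    # give upper bounds on x[k]
    out = list(zero)
    for ca, ba in pos:
        for cb, bb in neg:
            # Scale both rows so the coefficients of x[k] are +1 and -1,
            # then add them so that x[k] cancels.
            sa = Fraction(1) / ca[k]
            sb = Fraction(1) / -cb[k]
            coeffs = [sa * u + sb * v for u, v in zip(ca, cb)]
            out.append((coeffs, sa * ba + sb * bb))
    return out


if __name__ == "__main__":
    # 0 <= x0, x0 <= x1, x1 <= 5; eliminating x1 leaves 0 <= x0 <= 5.
    system = [([-1, 0], 0), ([1, -1], 0), ([0, 1], 5)]
    for coeffs, bound in eliminate(system, 1):
        print(coeffs, "<=", bound)
```

Only the signs of the coefficients of the eliminated variable decide which rows are paired, which is why positive but otherwise unknown diagonal entries can be carried symbolically.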
For instance, suppose that the distribution directive in Fig. 3 does not specify the extents of the processor mesh, so that the two principal diagonal entries of the processor-extent matrix are symbolic and known only to be positive. If we apply the FME method on the system in (11) with this knowledge, we obtain the solution shown in Fig. 14. Notice that on replacing both symbolic entries in this solution system by 9, we obtain the solution system shown in Fig. 8.
3. The symbol $p_j$ indicates the jth element of $\mathbf{p}$; the jth element of the course vector is indicated analogously.
Apart from being positive, since the third dimension of T is BLOCK distributed on the second dimension of P, the second of these entries would have to be additionally constrained so that 13 times its value is at least 100 (i.e., the product of the jth principal diagonal entries of $\mathbf{C}$ and the processor-extent matrix must be at least $u_{k_j} - l_{k_j} + 1$, where $k_j$ is the template dimension distributed in a BLOCK fashion on the jth processor dimension). More generally, the principal diagonal elements of the processor-extent matrix would have to be chosen so as to satisfy the following inequalities:
A possible strategy for determining the principal diagonal elements of the processor-extent matrix at runtime is presented in Fig. 15. In general, more than one distributee may be associated with a processor mesh, and the handling of such a case is highlighted in the figure. The displayed pseudocode fragment assumes that two templates T and S are distributed on the processor mesh P. The procedure number_of_processors() is a generic system inquiry function that returns the number of physical processors available in the lower-level processor arrangement. A quantity determined at compile time equals the number of processor dimensions on which template dimensions are distributed in a CYCLIC(B) fashion, considering both T and S; it is the rank of a product of two matrices, one for each of T and S. Note that, by this strategy, some of the physical processors may remain unused. More sophisticated schemes that utilize all of the physical processors and that also optimize with respect to some other criteria could be devised.
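The BLOCK constraint on a mesh extent reduces to a ceiling division, as the following minimal sketch shows. It covers only the constraint check for one processor dimension and is not the strategy of Fig. 15; the function names are hypothetical.

```python
def min_block_extent(block, lower, upper):
    """Smallest mesh extent e such that block * e >= upper - lower + 1."""
    cells = upper - lower + 1
    return -(-cells // block)  # ceiling division


def extent_is_legal(extent, block, lower, upper):
    """Check the BLOCK constraint for a mesh extent chosen at runtime."""
    return extent > 0 and block * extent >= upper - lower + 1


if __name__ == "__main__":
    # Third dimension of T: 100 cells (0..99), block size 13.
    print(min_block_extent(13, 0, 99))
    print(extent_is_legal(9, 13, 0, 99))
```

An extent of 9, the value used for Fig. 8, satisfies the constraint with some slack, which is consistent with the remark that some processors may own nothing.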
THE MAPPING TEST
Consider an assignment statement whose left-hand side and right-hand side array references bear affine subscript expressions and that is contained in a loop nest characterized by a loop iteration vector. The communication sets (see Fig. 12 and Fig. 13) that would be generated for such an assignment statement take into account the relative alignments and distributions of the left-hand side and right-hand side array references. If these communication sets are empty, no communication will occur at runtime. However, the overhead of checking at runtime whether a particular processor should dispatch a section of its array to some other processor that views it exists irrespective of whether data is actually communicated. This could result in an expensive runtime cost for communication checks, even in cases where it could be avoided, such as when the elements referenced on the left-hand and right-hand sides are ultimately mapped onto the same processor. If the compiler could detect such situations, it could refrain from generating communication code for such array reference pairs. This is what the mapping test attempts to do.
A Sufficient Condition
Theorem 2, shown in Fig. 16, states a sufficient condition for the mapping test; it is proven below.
Proof. Suppose the element referenced on the left-hand side belongs to the ownership set of some processor $\mathbf{p}$; as discussed earlier, a legal ALIGN/DISTRIBUTE combination will admit at least one such $\mathbf{p}$. Thus, there exists a course vector such that, from the system in (11), we have
By virtue of (11) and after some rearranging, (II.1) becomes
Checking the other two requirements of Theorem 2 is an $O(z)$ operation. Establishing whether the ownership set of every processor $\mathbf{r}$ in the mesh is nonempty is a polynomial-time operation, given the symbolic representation of the ownership set (see Section 4). Thus, the overall time complexity for verifying the requirements of Theorem 2 is polynomial once the FME solution system for the ownership set is known.
The impact of the mapping test on runtimes can often be dramatic. To illustrate the savings, the runtimes for the ADI benchmark, for arrays of sizes $4 \times 1024 \times 2$ on a $1 \times 4$ mesh of processors, with and without this optimization were 0.51 and 64.79 seconds, respectively. The large value of 64.79 seconds arose due to three assignment statements that were the sinks of loop-independent flow dependencies and that were enclosed within a triply nested loop spanning an iteration space of $2048 \times 2 \times 1022$ points. Each of these three assignment statements included right-hand side array references that were finally distributed onto the same processor as the corresponding left-hand side array reference. Hence, in all, 18 communication checks (nine for MPI_SEND and another nine for MPI_RECV) per iteration were eliminated.
The Matrix Multiplication Benchmark
To demonstrate how the mapping test works, we begin with the Matrix Multiplication benchmark (from the Livermore Kernel 21 [11]), in which each of the templates T and S is simultaneously collapsed and replicated. The mapping test correctly determines that none of the right-hand side array references in the benchmark's single assignment statement require any communication code to be generated.
Consider the pair of references C(i, j) and A(i, k) in Fig. 17. To verify the first requirement, we need to determine the integer FME solution system for the ownership set of $\mathbf{p}$ with respect to A; this solution system is (15). There are two constraints in (15) that contain $p_1$ in their bound expressions and another constraint that contains $p_2$ in its bound expressions. Hence, by separately considering two disjoint inequality groups, we can determine in polynomial time that the ownership set of every processor $\mathbf{r}$ in the mesh with respect to A is nonempty. The second and third requirements of the mapping test are also satisfied.
We can thus conclude that A(i, k) is identically mapped onto the same processor that owns C(i, j), for all values of the loop iteration vector. We can similarly show that B(k, j) is also identically mapped onto the same processor that owns C(i, j), for all iterations of the loop nest. Since the third right-hand side array reference C(i, j) is trivially mapped onto the same processor as the left-hand side array reference, no communication code needs to be generated for this benchmark.
How does the replication test fare on this benchmark? Note that its Boolean predicate simplifies to $p_1 \ne q_1$. Thus, though the replication test avoids some of the communication check overhead in the send and receive actions shown in Fig. 13, the send and receive sets would still be computed and checked whenever $p_1$ and $q_1$ are not equal. However, because of the nature of the affine subscript expressions of the array references involved (i.e., C(i, j) and A(i, k)), the send and receive sets evaluate to empty sets even when $p_1$ and $q_1$ are not equal. Thus, in the case of this benchmark, the entire overhead of performing communication checks at runtime can be avoided, which is exactly what the mapping test detects.
A Synthetic Example
Consider the synthetic code for this example. Applying the FME solver with a chosen elimination order, we find the integer FME solution system for the ownership set of $\mathbf{p}$; this solution system is (16).
Equation (16) consists of two disjoint inequality groups; we can therefore determine in polynomial time that the ownership set of every processor $\mathbf{r}$ in the mesh is nonempty. Therefore, from Theorem 2, the right-hand side array reference is identically mapped onto the same processor that owns the left-hand side array reference. This is indeed the case, as can be verified by visualizing the alignments and distributions. Notice that terms unknown at compile time are only manipulated symbolically by the mapping test.
Had the two references instead led to the processor coordinates $\lfloor (2i + 2j + 6)/16 \rfloor \bmod 9$ and $\lfloor (3i + 2j + 6)/16 \rfloor \bmod 9$, the test would have failed, since nothing can be said at compile time about the equality of these two expressions. For some values of i and j they are equal (say, $i = 3$, $j = 2$), while for others they are not (say, $i = 2$, $j = 2$). Thus, in such a situation, the test fails and, conservatively, communication code is generated.
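The two sample points quoted for the failing case can be checked numerically. The sketch below assumes that the processor coordinate of a template cell is its block number (block size 16) taken modulo 9, which is what the two expressions compute; the function names are hypothetical.

```python
def coord(cell, block=16, nprocs=9):
    """Processor coordinate of a template cell: block number modulo mesh extent."""
    return (cell // block) % nprocs


def same_processor(i, j):
    """Compare the coordinates given by the subscripts 2i+2j+6 and 3i+2j+6."""
    return coord(2 * i + 2 * j + 6) == coord(3 * i + 2 * j + 6)


if __name__ == "__main__":
    print(same_processor(3, 2))  # cells 16 and 19 fall in the same block
    print(same_processor(2, 2))  # cells 14 and 16 fall in different blocks
```

Because the answer depends on the iteration, a compile-time test cannot certify equality, and generating communication code is the conservative choice.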
The ADI Benchmark
The last example shows how the mapping test makes it possible for the compiler to avoid generating any communication code for the Alternating Direction Implicit (ADI) integration benchmark (from the Livermore Kernel 8 [11]), when the alignments and distributions are suitably chosen. In essence, the ADI benchmark comprises six arrays of type REAL, of which three are one-dimensional and consist of 1,024 elements each, while the remaining three are three-dimensional arrays consisting of $4 \times 1024 \times 2$ elements each. The only flow dependencies that this benchmark exhibits are three loop-independent ones, and all of the source-sink statement pairs are contained in the innermost loop. The right-hand side array references on which these dependencies terminate can be aligned and distributed onto the same processors that own the corresponding left-hand side array references, and it is this particular case that the mapping test optimizes.
In the code excerpt shown in Fig. 19, only one source-sink statement pair among the three is indicated. The other two are similar, with the array DU1 replaced, respectively, by DU2 and DU3 in the source statements and with the array AU1 replaced, respectively, by AU2 and AU3 in the source and sink statements. We now consider the pair of array references AU1(kx, ky, 2) and DU1(ky) occurring in the second assignment statement of Fig. 19. Thus,
we see that the first requirement is fulfilled. In addition, since # ", the second requirement is also satisfied. Finally, g X Thus, the third requirement is also satisfied. We can therefore conclude that DU1(ky) gets identically mapped onto the same processor that owns AU1(kx, ky, 2) for all values of the loop iteration vector . It can be similarly shown that for the same alignment and distribution directives, the right-hand side array references in the other assignment statements on which flow dependencies terminate in this benchmark are also identically mapped onto the same processor as the respective left-hand side array references. Thus, no communication code needs to be generated for the benchmark.
MAPPING TEST MEASUREMENTS
Execution times and compilation times were measured for the PARADIGM (version 2.0) system with and without the mapping test optimization. For the sake of comparison, execution times and compilation times for the original sequential sources and the parallelized codes generated by pghpf (version 2.4) and xlhpf (version 1.03) were also recorded. pghpf and xlhpf are commercial HPF compilers from The Portland Group, Inc. (PGI) and International Business Machines (IBM), respectively. In the input codes to the pghpf compiler, DO loops were recast into FORALL equivalents where possible and were qualified with the INDEPENDENT directive when appropriate. The FORALL construct and the INDEPENDENT directive were not mixed in the inputs to pghpf, and the tabulated execution times correspond to the better of the two cases. All of the PARADIGM measurements were done in the presence of the replication test.
System Specifications
The IBM compilers xlf and mpxlf were used to handle Fortran 77 and Fortran 77+MPI sources, respectively. The HPF sources were compiled using xlhpf and pghpf.
The -O option, which results in the generation of optimized code, was always used during compilations done with xlf, xlhpf, mpxlf, and pghpf. Compilation times were obtained by considering the source-to-source transformation effected by PARADIGM, as well as the source-to-executable compilation done using mpxlf (version 5.01). The source-to-source compilation times for PARADIGM were measured on an HP Visualize C180 with a 180MHz HP PA-8000 CPU running HP-UX 10.20 and having 128MB of RAM. Compilation times for pghpf, xlhpf as well as mpxlf were measured on an IBM E30 running AIX 4.3 and having a 133MHz PowerPC 604 processor and 96MB of main memory. In those tables that tabulate the execution times, the RS6000 column refers to the sequential execution times obtained on the IBM E30. The parallel codes were executed on a 16-node IBM SP2 multicomputer running AIX 4.3 and in which each processor was a 62.5MHz POWER node having 128MB of RAM. Interprocessor communication on the IBM SP2 was across a high performance adapter switch.
Alignments and Distributions
Measurements for the mapping test were taken across three benchmarks: ADI, Euler Fluxes (from FLO52 in the Perfect Club Suite) and Matrix Multiplication. For all the input samples, fixed templates and alignments were chosen; these are shown in Fig. 20 . Note that the most suitable alignments were chosen for the benchmark input samples. For the Matrix Multiplication and ADI benchmarks, these alignments resulted in communication-free programs independent of the distributions. For the Euler Fluxes benchmark, the distributions resulted in varying amounts of communication.
The PROCESSORS and the DISTRIBUTE directives were changed in every benchmark's input sample. The various distributions were chosen arbitrarily, the idea being to demonstrate the ability of the mapping test to handle any given alignment/distribution combination. 
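To make concrete what an alignment/distribution combination decides, the following sketch maps a template cell to its owning processor under the standard HPF BLOCK and CYCLIC(k) rules, using 0-based indices. The function names, the sample alignments, and the sizes are assumptions chosen for illustration; they are not the directives of Fig. 20. The brute-force enumeration at the end merely stands in for the mapping test, which reaches its verdict symbolically at compile time without enumerating iterations.

```python
def block_owner(cell, n_cells, n_procs):
    """Owner of a 0-based template cell under BLOCK distribution."""
    block = -(-n_cells // n_procs)  # ceiling of n_cells / n_procs
    return cell // block


def cyclic_owner(cell, k, n_procs):
    """Owner of a 0-based template cell under CYCLIC(k) distribution."""
    return (cell // k) % n_procs


def identically_mapped(cell_of_lhs, cell_of_rhs, iterations, owner):
    """Check by enumeration that both references land on the same processor."""
    return all(owner(cell_of_lhs(it)) == owner(cell_of_rhs(it)) for it in iterations)


# Hypothetical two-deep loop nest over (kx, ky).
iterations = [(kx, ky) for kx in range(4) for ky in range(16)]


def owner(cell):
    # Hypothetical distribution: CYCLIC(2) over 4 processors.
    return cyclic_owner(cell, k=2, n_procs=4)


def lhs(it):
    return it[1]  # e.g. X(kx, ky) with ky aligned to template cell ky


def rhs_same(it):
    return it[1]  # e.g. Y(ky) aligned to template cell ky


def rhs_shifted(it):
    return it[1] + 1  # e.g. Y(ky + 1), one template cell further along


print(identically_mapped(lhs, rhs_same, iterations, owner))     # True
print(identically_mapped(lhs, rhs_shifted, iterations, owner))  # False
```

In the first case no communication is ever needed, whatever the distribution; in the second, cells 1 and 2 fall on different processors, so communication code would have to be kept.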
Analysis
As Table 2 reveals, the benefits of the mapping test were most pronounced for the ADI benchmark, followed by the Matrix Multiplication benchmark. In the case of the Euler Fluxes benchmark, the mapping test eliminated six communication checks per iteration for the first input sample and eight communication checks per iteration for the second and third input samples. In the absence of the mapping test, 10 communication checks per iteration were generated. On account of loop-carried flow dependencies, the associated communication codes were hoisted immediately within the outermost loop. However, since the outermost loop ran for a mere 100 iterations, the optimized compiled codes did not exhibit any significant improvement in runtimes. For all of the ADI benchmark input samples, the iteration space comprised 2,048 × 2 × 1,022 points, and the communication codes generated in the absence of the mapping test were hoisted immediately within the innermost loop. For the three Matrix Multiplication benchmark samples, the numbers of iteration points were 512 × 512 × 512, 1,024 × 1,024 × 1,024, and 1,024 × 1,024 × 1,024, respectively, and the single communication check that was generated in the absence of the mapping test was hoisted within the second innermost loop.
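The ordering of the benefits follows from how often the hoisted checks run. The short calculation below is a sketch under one assumption: code placed immediately within a loop executes once per iteration of that loop and of every loop enclosing it. The helper name `executions` is hypothetical, and the loop extents are those quoted above.

```python
from math import prod


def executions(loop_extents, hoist_depth):
    """Times that code placed immediately inside the loop at hoist_depth runs.

    loop_extents lists trip counts from outermost to innermost;
    hoist_depth = 1 means "immediately within the outermost loop".
    """
    return prod(loop_extents[:hoist_depth])


adi = executions((2048, 2, 1022), 3)             # innermost loop of a 3-deep nest
matmul_small = executions((512, 512, 512), 2)    # second innermost loop
matmul_large = executions((1024, 1024, 1024), 2)
euler = executions((100,), 1)                    # outermost loop only

print(adi)           # 4186112
print(matmul_small)  # 262144
print(matmul_large)  # 1048576
print(euler)         # 100
```

Under this assumption, the ADI checks run millions of times, the Matrix Multiplication check runs between a quarter of a million and a million times, and the Euler Fluxes checks run only 100 times, which matches the order in which the mapping test pays off.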
Given a sequential input source written using Fortran 77 and having HPF directives, PARADIGM produces an SPMD output consisting of Fortran 77 statements and procedure calls to the MPI library. The compilation of this SPMD code into the final executable is then performed using mpxlf. Since the mapping test eliminates the generation of communication code where possible, it also exerts an influence on the overall compilation times. That is, the application of the mapping test often results in the generation of a smaller intermediate SPMD code, and this improves the back-end source-to-executable compilation time. Note that applying the mapping test does not necessarily mean an increased time for the source-to-source compilation phase performed by PARADIGM. Although compilation in the presence of the mapping test involves the additional effort of identifying the candidate array reference pairs that are identically mapped, it saves the communication code generation that would otherwise have to be done for the same array reference pairs. Hence, compilation times for the source-to-source compilation phase may in fact be higher in the absence of the mapping test, and this was found to be true for nearly all of the benchmark samples tested. As Table 3 also reveals, the back-end compilation times were nearly always higher in the absence of the mapping test because of the larger intermediate SPMD code sizes handled.
SUMMARY
The preceding sections have shown certain basic and interesting properties that ownership sets exhibit, even in the presence of arbitrary alignments and distributions. Our approach to solving for the ownership set (and other sets derived from it) is based on integer FME solutions to the systems characterizing these sets. We also showed how the system of equalities and inequalities originally proposed in [1] can be refined to a form requiring the course vector as the only auxiliary vector. This refinement is beneficial to the FME approach. We derived the fundamental property of ownership set equivalence and demonstrated how it can be used to eliminate redundant communication due to array replication. We also briefly described how to efficiently make decisions regarding the "emptiness" of an ownership set. Finally, we derived a sufficient condition that, when true, ensures that a right-hand side array reference of an assignment statement is available on the same processor that owns the left-hand side array reference, thus making it possible to avoid generating communication code for the pair.
The mapping test is a very useful optimization. Its positive effect was also observable in the case of other benchmarks such as Jacobi, TOMCATV, and 2D Explicit Hydrodynamics (from the Livermore Kernel 18), and was significant in most situations. This is because, typically, suitably chosen ALIGN and DISTRIBUTE directives perfectly align and distribute at least one pair of left-hand side and right-hand side array references in at least one assignment statement in the program, and such alignments and distributions are often valid independent of the values through which the loop iteration vector ranges. Thus, by exploiting the ownership set, efficient SPMD code can be generated at compile time.
TABLE 3 Compilation Times in Seconds
k. xlhpf does not permit a CYCLIC blocking factor greater than 1.
l. Arrays were of type REAL; array sizes were 512 × 512.
m. Arrays were of type REAL; array sizes were 1,024 × 1,024.
Note that
$I$, where $I$ is the $m \times m$ identity matrix. However, is a $y \times y$ matrix in which the only nonzero elements are the principal diagonal elements; the $i$th principal diagonal element is 1 if and only if the $i$th template dimension is an aligned dimension. Similarly, % % is the $z \times z$ identity matrix whereas % % is a $y \times y$ matrix in which the only nonzero elements are the principal diagonal elements; the $i$th principal diagonal element is 1 if and only if the $i$th template dimension is a distributed dimension. Square matrices in which the only nonzero elements are those that reside on the principal diagonal are usually termed square diagonal.
If
$A$ and $B$ are two conforming square diagonal matrices, then the commutative law for matrix multiplication holds (i.e., $AB = BA$).
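The commutativity claim can be verified entrywise. In the short derivation below, the two matrices are written as $A = \mathrm{diag}(a_1, \ldots, a_n)$ and $B = \mathrm{diag}(b_1, \ldots, b_n)$; these names, the scalars $a_i$ and $b_i$, and the Kronecker delta $\delta_{ij}$ are notation introduced here for illustration only.

```latex
\begin{align*}
(AB)_{ij} &= \sum_{k=1}^{n} A_{ik} B_{kj}
           = \sum_{k=1}^{n} a_i \delta_{ik} \, b_k \delta_{kj}
           = a_i b_i \, \delta_{ij}, \\
(BA)_{ij} &= \sum_{k=1}^{n} B_{ik} A_{kj}
           = \sum_{k=1}^{n} b_i \delta_{ik} \, a_k \delta_{kj}
           = b_i a_i \, \delta_{ij}.
\end{align*}
```

Since scalar multiplication commutes, $a_i b_i = b_i a_i$ for every $i$, so the two products agree in every entry and are themselves square diagonal.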
