This paper presents a framework, based on global array data-flow analysis, to reduce communication costs in a program being compiled for a distributed memory machine. We introduce the available section descriptor (ASD), a novel representation of communication involving array sections. This representation allows us to apply techniques for partial redundancy elimination to obtain powerful communication optimizations. With a single framework, we are able to capture optimizations like (i) vectorizing communication, (ii) eliminating communication that is redundant on any control flow path, (iii) reducing the amount of data being communicated, (iv) reducing the number of processors to which data must be communicated, and (v) moving communication earlier to hide latency and to subsume previous communication. We show that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems even in the context of an array section representation, which makes the analysis procedure more efficient. We present encouraging results from a preliminary implementation of this framework, which demonstrate the effectiveness of this analysis in improving the performance of programs.
1 Introduction

[...] information used in optimizing communications. In Section 5, we describe how the different communication optimizations are captured by the data flow analysis. Section 6 presents the algorithms for the various operations on ASDs. Section 7 presents some preliminary results and in Section 8, we discuss related work. Finally, Section 9 presents conclusions.

[Figure: two side-by-side versions of a program fragment (statements 8 to 29) containing loop nests such as z(i,j) = e(i) and z(j,100) = w(j), annotated with communications such as e(i) → VPROCS(i, j); e(i), d(i) → VPROCS(i, j); and e(1) → VPROCS(1, j).]
Representation of ASD
The ASD is defined as a pair $\langle D, M \rangle$, where $D$ is a data descriptor, and $M$ is a descriptor of the function mapping elements in $D$ to virtual processors. Thus, $M(D)$ refers collectively to processors where data in $D$ is available. For an array variable, the data descriptor represents an array section. For a scalar variable, it consists of just the name of the variable. In this paper, we shall use the bounded regular section descriptor (BRSD) [17] to represent array sections and treat scalar variables as degenerate cases of arrays with no dimensions. Bounded regular sections allow representation of subarrays that can be specified using the Fortran 90 triplet notation. We represent a bounded regular section as an expression $A(S)$, where $A$ is the name of an array variable and $S$ is a vector of subscript values such that each of its elements is either (i) an expression of the form $\alpha \cdot k + \beta$, where $k$ is a loop index variable and $\alpha$ and $\beta$ are invariants, (ii) a triple $l : u : s$, where $l$, $u$, and $s$ are invariants (the triple represents the expression discussed above expanded over a range), or (iii) $\perp$, indicating no knowledge of the subscript value.
The processor space is regarded as an unbounded grid of virtual processors. The abstract processor space is similar to a template in High Performance Fortran (HPF) [10], which is a grid over which different arrays are aligned. The mapping function descriptor $M$ is a pair $\langle P, F \rangle$, both $P$ and $F$ being vectors of length equal to the dimensionality of the processor grid. The $i$th element of $P$ (denoted as $P^i$) indicates the dimension of the array $A$ that is mapped to the $i$th grid dimension, and $F^i$ is the mapping function for that array dimension, i.e., $F^i(j)$ returns the position(s) along the $i$th grid dimension to which the $j$th element of the array dimension is mapped. We represent a mapping function, when known statically, as

$$F^i(j) = (c \cdot j + l : c \cdot j + u : s)$$
In the above expression, $c$, $l$, $u$ and $s$ are invariants. The parameters $c$, $l$ and $u$ may take rational values, as long as $F^i(j)$ evaluates to a range over integers, over the data domain. The above formulation allows representation of one-to-one mappings (when $l = u$), one-to-many mappings (when $u \geq l + s$), and also constant mappings (when $c = 0$). The one-to-many mappings expressible with this formulation are more general than the replicated mappings for ownership that may be specified using HPF [10]. Under an HPF alignment directive, the $j$th element of array dimension $P^i$ may be mapped along the $i$th grid dimension to position $c \cdot j + o$ or "$*$", which represents all positions in that grid dimension.
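The three kinds of mapping described above can be illustrated with a minimal Python sketch. The function name `mapping_positions` and the choice of `fractions.Fraction` for rational parameters are assumptions made for illustration; they are not part of the paper's implementation.

```python
from fractions import Fraction

def mapping_positions(c, l, u, s, j):
    """Evaluate F(j) = (c*j + l : c*j + u : s) and return the grid positions.

    c, l, u may be rational, but both bounds must come out as integers
    (hypothetical helper; the name and interface are assumptions).
    """
    lo = Fraction(c) * j + Fraction(l)
    hi = Fraction(c) * j + Fraction(u)
    if lo.denominator != 1 or hi.denominator != 1:
        raise ValueError("F(j) must evaluate to a range over integers")
    return list(range(int(lo), int(hi) + 1, s))

one_to_one = mapping_positions(1, 0, 0, 1, 5)            # l == u
one_to_many = mapping_positions(1, 0, 2, 1, 5)           # u >= l + s
constant = mapping_positions(0, 1, 4, 1, 5)              # c == 0
rational = mapping_positions(Fraction(1, 2), 0, 0, 1, 6) # c = 1/2, even j
print(one_to_one, one_to_many, constant, rational)
```

With a rational coefficient such as 1/2, the helper rejects odd values of `j`, which mirrors the requirement that the mapping evaluate to an integer range over the data domain.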
If an array has fewer dimensions than the processor grid (this also holds for scalars, which are viewed as arrays with no dimensions), there is no array dimension mapped to some of the grid dimensions. For each such grid dimension $m$, $P^m$ takes a special value that represents a "missing" array dimension. In that case, $F^m$ is no longer a function of a subscript position. It is simply an expression of the form $l : u : s$, and indicates the position(s) in the $m$th grid dimension at which the array is available. As with the usual triplet notation, we shall omit the stride, $s$, from an expression when it is equal to one. When the compiler is unable to infer knowledge about the availability of data, the corresponding mapping function is set to $\perp$. We also define a special, universal mapping function descriptor $U$, which represents the mapping of each data element on all of the processors.
Example. Consider a 2-D virtual processor grid VPROCS, and an ASD $\langle A(2 : 100 : 2, 1 : 100), \langle [1, 2], F \rangle \rangle$ [...]. This ASD is illustrated in Figure 2. Figure 2(a) shows the array $A$, where each horizontal stripe $A_i$ represents $A(2i, 1 : 100)$. Figure 2(b) represents the mapping of the array section onto the virtual processor template VPROCS, where each subsection $A_i$ is replicated along its corresponding row.
Computing Generated Communication
Given an assignment statement of the form lhs = rhs, we describe how communication needed for each reference in the rhs expression is represented as an ASD. This section deals only with communication needed for a single instance of the assignment statement, which may appear nested inside loops. The procedure for summarizing communication requirements of multiple instances of a statement with respect to the surrounding loops is discussed in the next section. We shall describe our procedure for array references with an arbitrary number of dimensions; references to scalar variables can be viewed as special cases with zero dimensions.

As remarked earlier, the identity of senders is ignored in our representation of communication. The ASD simply represents the intended availability of data to be realized via the given communication, or equivalently, the availability of data following that communication. Clearly, that depends on the mapping of computation to processors. In this work, we determine the generation of communication based on the owner computes rule, which assigns the computation to the owner of the lhs. The algorithm can be modified to incorporate other methods of assigning computation to processors [7], as long as that decision is made statically.
Let $D_L$ be the data descriptor for the lhs reference and $M_L = \langle P_L, F_L \rangle$ be the mapping function descriptor representing the ownership of the lhs variable (this case represents an exception where the mapping function of the ASD corresponds to ownership of data rather than its availability at other processors). $M_L$ is directly obtained from the HPF alignment information, which specifies both the mapping relationship between array dimensions and grid dimensions (giving $P_L$) and the mapping of array elements to grid positions (giving $F_L$), as described earlier. We calculate the mapping of the rhs variable $\langle D_R, M_R \rangle$ that results from enforcing the owner computes rule. The new ASD, denoted CGEN, represents the rhs data aligned with lhs. The regular section descriptor $D_R$ represents the element referenced by rhs. The mapping descriptor $M_R = \langle P_R, F_R \rangle$ is obtained by the following procedure:
Step 1. Align array dimensions with processor grid dimensions:
1. For each processor grid dimension $i$, if the lhs subscript expression, $S_L^{P_L^i}$, in dimension $P_L^i$ has the form $\alpha_1 \cdot k + \beta_1$ and there is a rhs subscript expression $S_R^n = \alpha_2 \cdot k + \beta_2$, for the same loop index variable $k$, set $P_R^i$ to the rhs subscript position $n$.
2. For each remaining processor grid dimension $i$, set $P_R^i$ to $j$, where $j$ is an unassigned rhs subscript position. If there is no unassigned rhs subscript position left, set $P_R^i$ to the value that represents a "missing" array dimension.
Step 2. Calculate the mapping function for each grid dimension:
For each processor grid dimension $i$, let [...] represents the range of all positions in that grid dimension). We determine the rhs mapping function $F_R^i(j)$ from the lhs and the rhs subscript expressions corresponding respectively to dimensions $P_L^i$ and $P_R^i$. The details are specified in Table 1.
The first entry in Table 1 follows from the fact that element $j = \alpha_2 \cdot k + \beta_2$ of $S_R^{P_R^i}$ is made available at grid position $c \cdot (\alpha_1 \cdot k + \beta_1) + o$ along the $i$th dimension; substituting $k$ by $(j - \beta_2) / \alpha_2$ directly leads to the given result. The second and the third entries correspond to the special cases when the rhs dimension has a constant subscript or there is no rhs dimension mapped to grid dimension $i$. The last entry represents the case when there is no lhs array dimension mapped to grid dimension $i$. In that case, the mapping function of the lhs variable must have $c = 0$.
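The substitution described for the first entry can be carried out mechanically. The sketch below is only an illustration of that algebra, not a reproduction of Table 1: the function name `rhs_mapping` is hypothetical, and the lhs mapping is assumed to have the single-position form $c \cdot j + o$. It returns the coefficients $(a, b)$ of the resulting rhs mapping $a \cdot j + b$.

```python
from fractions import Fraction

def rhs_mapping(c, o, a1, b1, a2, b2):
    """Return (a, b) such that the rhs mapping is F_R(j) = a*j + b.

    The lhs subscript is a1*k + b1 and is owned at position c*(a1*k + b1) + o.
    The rhs subscript is j = a2*k + b2, so k = (j - b2) / a2.
    (Hypothetical helper; integer inputs are assumed.)
    """
    if a2 == 0:
        raise ValueError("constant rhs subscript: a different case applies")
    a = Fraction(c * a1, a2)
    b = Fraction(c) * (b1 - Fraction(a1 * b2, a2)) + o
    return a, b

# lhs subscript j, rhs subscript j - 1, identity lhs mapping (c = 1, o = 0)
print(rhs_mapping(1, 0, 1, 0, 1, -1))
# lhs subscript i, rhs subscript 2*i, identity lhs mapping
print(rhs_mapping(1, 0, 1, 0, 2, 0))
```

With an identity lhs mapping (an assumption here), an rhs subscript of $j - 1$ against an lhs subscript of $j$ yields $F_R(j) = j + 1$, and an rhs subscript of $2i$ against an lhs subscript of $i$ yields $F_R(j) = j / 2$.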
Example. Consider the assignment statement in the code fragment: [...] This mapping descriptor is derived from the HPF alignment specification. Applying Step 1 of the compute rule algorithm, $P_R$ is set to $[1, 2]$, that is, the first dimension of VPROCS is aligned with the first dimension of B, and the second dimension of VPROCS is aligned with the second dimension of B. The second step is to determine the mapping function $F_R$. For the first grid dimension, $P_L^1$ corresponds to the subscript expression $i$ and $P_R^1$ corresponds to the subscript expression $2 \cdot i$. Therefore, using $F_L^1$ and the first rule in Table 1, [...] Table 1, $F_R^2(j)$ is set to $j + 1$. The mapping descriptor thus obtained maps B(2*i, j-1) onto VPROCS(i, j).
Data Flow Analysis
In this section, we present a procedure for obtaining data flow information regarding communication for a structured program. The analysis is performed on the control flow graph representation [1] of the program, in which nodes represent computation, and edges represent the flow of control. We are able to perform a collection of communication optimizations within a single framework, based on the following observations. Determining the data availability resulting from communication is a problem similar to determining available expressions in classical data flow analysis. Thus, optimizations like eliminating and hoisting communications are similar to eliminating redundant expressions and code motion. Furthermore, applying partial redundancy elimination techniques at the granularity of sections of arrays and processors enables not merely elimination, but also reduction in the volume of communication along different control flow paths.
The bidirectional data-flow analysis for suppression of partial redundancies, introduced by Morel and Renvoise [24], and refined subsequently [8, 20, 9], defines a framework for unifying common optimizations on available expressions. We adapt this framework to solve the set of communication optimizations described in Section 2. This section presents the following results. In contrast to previous work, solving these equations for ASDs requires array data-flow analysis. In Section 4.3, we present the overall data-flow procedure that uses interval analysis. As with other similar frameworks, we require the following edge-splitting transformation to be performed on the control flow graph before the analysis begins: any edge that runs directly from a node with more than one successor, to a node with more than one predecessor, is split [9]. This transformation is illustrated in Figure 3. Thus, in the transformed graph, there is no direct edge from a branch node to a join node.
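The edge-splitting transformation can be sketched in a few lines of Python. The graph encoding (a dictionary from node name to successor list) and the naming scheme for the new nodes are assumptions made for illustration.

```python
def split_critical_edges(succ):
    """Split every edge from a branch node to a join node.

    succ maps each node to its list of successors. A new node named
    "split_<src>_<dst>" (a hypothetical naming scheme) is placed on each
    edge whose source has several successors and whose target has
    several predecessors.
    """
    preds = {}
    for node, targets in succ.items():
        for t in targets:
            preds.setdefault(t, []).append(node)

    new_succ = {node: list(targets) for node, targets in succ.items()}
    for node, targets in succ.items():
        if len(targets) <= 1:
            continue  # not a branch node
        for idx, t in enumerate(targets):
            if len(preds.get(t, [])) > 1:  # target is a join node
                mid = f"split_{node}_{t}"
                new_succ[node][idx] = mid
                new_succ[mid] = [t]
    return new_succ

cfg = {"a": ["b", "c"], "b": ["c"], "c": []}
print(split_critical_edges(cfg))
```

In this small graph, `a` is a branch node and `c` is a join node, so only the edge from `a` to `c` receives a new intermediate node.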
Data Flow Variables and Equations
We use the following definitions for data-flow variables representing information about communication at different nodes in the control flow graph. Each of these variables is represented as an ASD.
$ANTLOC_i$: communication in node $i$ that is not preceded by a definition in node $i$ of any data being communicated (i.e., local communication that may be anticipated at entry to node $i$).
$CGEN_i$: communication in node $i$ that is not followed by a definition in node $i$ of any data being communicated.
$KILL_i$: data being killed (on all processors) due to a definition in node $i$.
The problem of determining the availability of data ($AVIN_i$ / $AVOUT_i$) is similar to the classical data-flow problem of determining available expressions [1]. This computation proceeds in the forward direction through the control flow graph. The first equation ensures that any data overwritten inside node $i$ is removed from the availability set, and data communicated during node $i$ (and not overwritten later) is added to the availability set. The second equation indicates that at entry to a join node in the control flow graph, only the data available at exit on each of the predecessor nodes can be considered to be available.

[...] gives an additional property to $PPIN_i$, namely that all data included in $PPIN_i$ must be available at entry to node $i$ on every incoming path due to original or moved communication. $PPOUT_i$ is set to communication that can be placed at entry to each of the successor nodes to $i$, as shown by Equation 4. Thus, $PPOUT_i$ represents communication that can legally and safely appear at the exit of node $i$. The property of safety implies that the communication is necessary, regardless of the flow of control in the program. Hence, the compiler avoids doing any speculative communication in the process of moving communication earlier.
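The forward availability computation can be sketched on an acyclic graph as follows. This is a simplified illustration under stated assumptions: ordinary Python sets of item names stand in for ASDs (so all operations are exact), and the two equations are taken in the form suggested by the prose, namely $AVOUT_i = (AVIN_i - KILL_i) \cup CGEN_i$ and $AVIN_i = \bigcap_{p \in pred(i)} AVOUT_p$. The function name is hypothetical.

```python
def forward_availability(order, preds, cgen, kill):
    """Compute AVIN/AVOUT for nodes listed in topological order.

    Sets of strings stand in for ASDs (an assumption of this sketch).
    A node without predecessors starts with nothing available.
    """
    avin, avout = {}, {}
    for n in order:
        ps = preds.get(n, [])
        if ps:
            # only data available on every incoming path survives a join
            avin[n] = set.intersection(*(avout[p] for p in ps))
        else:
            avin[n] = set()
        # drop overwritten data, add data communicated in this node
        avout[n] = (avin[n] - kill[n]) | cgen[n]
    return avin, avout

order = ["n1", "n2", "n3", "n4"]
preds = {"n2": ["n1"], "n3": ["n1"], "n4": ["n2", "n3"]}
cgen = {"n1": set(), "n2": {"e"}, "n3": {"e", "d"}, "n4": set()}
kill = {n: set() for n in order}
avin, avout = forward_availability(order, preds, cgen, kill)
print(avin["n4"])
```

At the join node `n4`, only `e` is available, because `d` is communicated on just one of the two incoming paths.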
As Equations 3 and 4 show, the value of $PPIN_i$ for a node $i$ is not only used to compute $PPOUT_p$ for its predecessor node $p$, but it also depends on the value of $PPOUT_p$. Hence, this computation represents a bidirectional data flow problem.
Finally, $INSERT_i$ represents communication that should be inserted at the exit of node $i$ as a result of the optimization. Given that $PPOUT_i$ represents safe communication at that point, as shown in Equation 5, $INSERT_i$ consists of $PPOUT_i$ minus the following two components: (i) data already available at exit of node $i$ due to original communication, given by $AVOUT_i$, and (ii) data available at entry to node $i$ due to moved or original communication, and which has not been overwritten inside node $i$; this component is given by $(PPIN_i - KILL_i)$. Following the insertions, any communication in node $i$ that is not preceded by a definition of data (i.e., $ANTLOC_i$) and which also forms part of $PPIN_i$ becomes redundant. This directly follows from the property of $PPIN_i$ that any data included in $PPIN_i$ must be available at entry to node $i$ on every incoming path due to original or moved communication. Thus, in Equation 6, $REDUND_i$ represents communication in node $i$ that can be deleted.

Footnote 1: The original equation in [9] for $PPIN_i$ has an additional term, corresponding to the right hand side being further intersected with $PAVIN_i$, the partial availability of data at entry to node $i$. This term is important in the context of eliminating partially redundant computation, because it prevents unnecessary code motion that increases register pressure. However, moving communication early can be useful even if it does not lead to a reduction in previous communication, because it may help hide the latency. Hence, we drop that term in our equation for $PPIN_i$.
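The description of $INSERT_i$ and $REDUND_i$ translates directly into set operations. The sketch below again uses plain Python sets as stand-ins for ASDs, and the function name is hypothetical; it follows the prose above rather than a reproduced form of Equations 5 and 6.

```python
def insert_and_redund(ppin, ppout, avout, antloc, kill):
    """Return (INSERT, REDUND) for one node, with sets standing in for ASDs.

    INSERT: safe communication at the node exit, minus what is already
    available there and minus what reaches the exit from the node entry.
    REDUND: locally anticipated communication already covered at entry.
    """
    insert = (ppout - avout) - (ppin - kill)
    redund = antloc & ppin
    return insert, redund

# "e" is already available at the exit, so only "d" must be inserted.
print(insert_and_redund(set(), {"e", "d"}, {"e"}, set(), set()))
# "e" arrives at the entry on every path, so the local communication is redundant.
print(insert_and_redund({"e"}, set(), {"e"}, {"e"}, set()))
```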
The union, intersection, and difference operations on ASDs are described later in the paper, in Section 6. The ASDs are not closed under these operations (the intersection operation is always exact, except in the special case when two mapping functions, of the form $F^i(j) = c \cdot j + l : c \cdot j + u : s$, for corresponding array dimensions have different values of the coefficient $c$). Therefore, it is important to know for each operation whether to underestimate or overestimate the result, in case an approximation is needed.
Decomposition of Bidirectional Problem
Before we describe our data-flow procedure using the above equations, we need to resolve the problem of bidirectionality with the computation of $PPIN_i$ and $PPOUT_i$. Solving a bidirectional problem usually requires an algorithm that goes back and forth until convergence is reached. A preferable approach is to decompose the bidirectional problem, if possible, into simpler unidirectional problem(s) which can be solved more efficiently.
Dhamdhere et al. [9] prove some properties about the bidirectional problem of eliminating redundant computation, and also prove that those properties are sufficient to allow the decomposition of that problem into two unidirectional problems. One of those properties, distributivity, does not hold in our case, because we represent data-flow variables as ASDs rather than bit strings, and the operations like union and difference are not exact, unlike the boolean operations. However, we are able to prove directly the following theorem:
Theorem 1. The bidirectional problem of determining $PPIN_i$ and $PPOUT_i$, as given by Equations 3 and 4, can be decomposed into a backward approximation, given by Equations 7 and 8, followed by a forward correction, given by Equation 9.
Proof: $BA\_PPIN_i$ represents a backward approximation to the value of $PPIN_i$ (intuitively, it represents communication that can legally and safely be moved to the entry of node $i$). We will show that the correction term $\bigcap_{p \in pred(i)} (AVOUT_p \cup PPOUT_p)$ applied to a node $i$ to obtain $PPIN_i$ cannot lead to a change in the value of $PPOUT$ for any node in the control flow graph, and that in turn implies that the $PPIN$ values of other nodes are also unaffected by this change. The correction term, being an intersection operation, can only lead to a reduction in the value of the set $PPIN_i$. Let $X = BA\_PPIN_i - PPIN_i$ denote this reduction, and let $x$ denote an arbitrary element of $X$. Thus, $x \in BA\_PPIN_i$, and $x \notin PPIN_i$. Hence, there must exist a predecessor of $i$, say, node $p$ (see Figure 4), such that $x \notin AVOUT_p$ and $x \notin PPOUT_p$. Therefore, $p$ must have another child $j$ such that $x \notin BA\_PPIN_j$, otherwise $x$ would have been included in $PPOUT_p$. Now let us consider the possible effects of removal of $x$ from $PPIN_i$. From the given equations, a change in the value of $PPIN_i$ can only affect the value of $PPOUT$ for a predecessor of $i$ (which can possibly lead to other changes). Clearly, the value of $PPOUT_p$ does not change because $PPOUT_p$ already does not include $x$. But node $i$ cannot have any predecessors other than $p$ because $p$ is a branch node, and by virtue of the edge splitting transformation on the control flow graph, there can be no edge from a branch node to a join node. Hence, the application of the correction term at a node $i$ cannot change the $PPOUT$ value of any node: this implies the validity of the above process of decomposing the bidirectional problem. $\Box$

We observe that since the application of the correction term to a node does not change the value of $PPOUT$ or $PPIN$ of any other node, it does not require a separate pass through the control flow graph.
During the backward pass itself, after the value of $PPOUT_p$ is computed for a node $p$, the correction term can be applied to its successor node $i$ by intersecting $BA\_PPIN_i$ with $AVOUT_p \cup PPOUT_p$.
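The decomposition can be illustrated on an acyclic, edge-split graph. The sketch below makes several assumptions that go beyond the text: sets stand in for ASDs; the backward approximation is taken in the standard partial-redundancy form $BA\_PPIN_i = ANTLOC_i \cup (PPOUT_i - KILL_i)$ with $PPOUT_i = \bigcap_{s \in succ(i)} BA\_PPIN_s$; and a node with no predecessors is given an empty $PPIN$, since nothing can be inserted ahead of it. For clarity the correction is written as a second loop, although it can be folded into the backward pass as noted above.

```python
def backward_then_correct(order, succs, preds, antloc, kill, avout):
    """Backward approximation followed by forward correction (sketch).

    order is a topological order of an acyclic, edge-split graph.
    Returns (PPIN, PPOUT) with Python sets standing in for ASDs.
    """
    ba_ppin, ppout = {}, {}
    for n in reversed(order):
        ss = succs.get(n, [])
        ppout[n] = set.intersection(*(ba_ppin[s] for s in ss)) if ss else set()
        ba_ppin[n] = antloc[n] | (ppout[n] - kill[n])

    ppin = {}
    for n in order:
        ps = preds.get(n, [])
        if not ps:
            ppin[n] = set()  # assumption: nothing can be placed before entry
            continue
        ppin[n] = set(ba_ppin[n])
        for p in ps:
            ppin[n] &= avout[p] | ppout[p]  # correction term
    return ppin, ppout

order = ["n1", "n2", "n3", "n4"]
succs = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n4"]}
preds = {"n2": ["n1"], "n3": ["n1"], "n4": ["n2", "n3"]}
kill = {n: set() for n in order}
avout = {"n1": set(), "n2": {"e"}, "n3": {"e"}, "n4": {"e"}}
antloc = {"n1": set(), "n2": {"e"}, "n3": {"e"}, "n4": set()}
ppin, ppout = backward_then_correct(order, succs, preds, antloc, kill, avout)
print(ppout["n1"], ppin["n2"], ppin["n3"])
```

When both branches need `e`, the communication becomes placeable at the exit of `n1`. If only one branch needed it, $PPOUT$ at `n1` would be empty and the correction would clear $PPIN$ at that branch, so no speculative communication is introduced.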
Overall Data-Flow Procedure
So far we have discussed the data flow equations that are applied in a forward or backward direction along the edges of a control flow graph to determine the data flow information for each node. In the presence of loops, which lead to cycles in the control flow graph, one approach employed by classical data flow analysis is to iteratively apply the data flow equations over the nodes until the data flow solution converges [1]. We use the other well-known approach, interval analysis [2, 13], which makes a definite number of traversals through each node and is well-suited to analysis such as ours which attempts to summarize data flow information for arrays.
We use Tarjan intervals [29], which correspond to loops in a structured program. Each interval in a structured program has a unique header node $h$. As a further restriction, we require each interval to have a single loop exit node $l$. Each interval has a back-edge $\langle l, h \rangle$. The edge-splitting transformation, discussed earlier, adds a node $b$ to split the back-edge $\langle l, h \rangle$ into two edges, $\langle l, b \rangle$ and $\langle b, h \rangle$. We now describe how interval analysis is used in the overall data flow procedure.
INTERVAL ANALYSIS
Interval analysis is precisely defined in [5]. The analysis is performed in two phases, an elimination phase followed by a propagation phase. The elimination phase processes the program intervals in a bottom-up (innermost to outermost) traversal. During each step of the elimination phase, data flow information is summarized for inner intervals, and each such interval is logically collapsed and replaced by a summary node. Thus, when an outer interval is traversed, the inner interval is represented by a single node. At the end of the elimination phase, there are no more cycles left in the graph. For the purpose of our description, the top-level program is regarded as a special interval with no back-edge, which is the first to be processed during the propagation phase. Each step of the propagation phase expands the summary nodes representing collapsed intervals, and computes the data flow information for nodes comprising those intervals, propagating information from outside to those nodes.
Our overall data flow procedure is sketched in Figure 5. We now provide details of the analysis.
Elimination Phase
We now describe how the values of local data-flow variables, CGEN, KILL, and ANTLOC are summarized for nodes corresponding to program intervals in each step of the elimination phase. These values are used in the computations of global data-flow variables outside that interval. The computation of KILL and CGEN proceeds in the forward direction, i.e., the nodes within each interval are traversed in topological-sort order. For the computation of KILL, we define the variable $K_i$ as the data that may be killed along any path from the header node $h$ to node $i$. We initialize the data availability information and the kill information ($K_h$) at the header node as follows:
The transfer function for $K_i$ at all other nodes is defined as follows:
The transfer functions given by Equations 1, 2 and 10 are then applied to each statement node during the forward traversal of the interval, as shown in Figure 6. Finally, the data availability generated for the interval last node $l$ must be summarized for the entire interval, and associated with a summary node $s$. However, the data availability at $l$, obtained from (1) and (2), is only for a single iteration of the loop. The following equations define the transfer functions which summarize the data being killed and the data being made available in an interval with loop index $k$, for all iterations $low : high$. [...]

where AntiDepDef represents each definition in the interval loop that is the target of an anti-dependence at the loop nesting level (we conclude that a given dependence exists at a loop nesting level $m$ if the direction vector corresponding to direction '=' for all outer loops and direction '<' or '=' for the loop at level $m$ is included in the direction vector representation [32] of that dependence). If the loop is a doall loop, the range of AntiDepDef is empty, so that we get $CGEN_s = expand(AVOUT_l, k, low : high)$.

Explanation. For computing the interval kill set $KILL_s$, we simply expand the kill set generated at $l$ over the interval loop bounds $low : high$. Computing the interval availability set $CGEN_s$ requires more work, because a variable definition in an iteration $k$ may kill data made available in previous iterations $low : k$. Therefore, we first expand the data made available in a single iteration, obtaining all data made available in any iteration, and then subtract out the data that may be killed after it is made available. A definition kills data made available in a previous or the same iteration of a loop if it is the target of an anti-dependence at the loop nesting level, that is, if it defines data previously used.
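The expansion of a single subscript over the loop bounds can be sketched as follows. This is one plausible reading of `expand` for a single affine subscript with a unit-stride loop; the function name and tuple encoding are assumptions, and the subtraction of data killed by anti-dependent definitions is not modelled.

```python
def expand_subscript(alpha, beta, low, high):
    """Expand the subscript alpha*k + beta over k = low..high (unit stride).

    Returns a triple (l, u, s) in the usual triplet notation. Assumes
    low <= high; a loop-invariant subscript (alpha == 0) stays a single point.
    """
    if alpha == 0:
        return (beta, beta, 1)
    first = alpha * low + beta
    last = alpha * high + beta
    if alpha < 0:
        first, last = last, first  # keep the triple in increasing order
    return (first, last, abs(alpha))

# The subscript 2*i over i = 1..50 covers the section 2:100:2.
print(expand_subscript(2, 0, 1, 50))
```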
Propagation Phase
The propagation phase processes the program intervals in a top-down (outermost to innermost) traversal. During the expansion of an interval, data flow information from outside is propagated to nodes inside that interval. During this traversal of nodes in an interval, any inner interval is treated as a single node represented by its summary node.
Given an interval representing a loop, our analysis calculates the data flow information for each node for a loop iteration $k$. For the forward traversal determining the solutions to $AVIN$ / $AVOUT$, the value of $AVIN_h$ at the beginning of the $k$th loop iteration is given by: [...]

Explanation. $AVIN_h^{low}$ represents the data that is available at entry to the header before the loop is entered (this is the information that is propagated from outside the interval). The data available at the beginning of iteration $k$ consists of:
1. The data made available before the loop is entered, which is not killed in iterations $low : k - 1$, and
2. The data made available on all previous iterations $low : k - 1$, which has not been killed before iteration $k$.

The two terms unioned together in the above equation for $AVIN_h^k$ correspond to these two components. In a similar manner, for the backward traversal obtaining the solutions to $BA\_PPIN$ / $PPOUT$, the value of $PPOUT_l$ at the beginning of the $k$th loop iteration is obtained as: [...]

Once again, for the example shown in Figure 7, using the same definition of $M$ as before, we get: [...]

As noted earlier, the first step in the propagation phase of our algorithm is different from others, in that it is applied over a special interval representing the top-level program rather than a loop. For that step, the values of $AVIN_h$ for the entry node $h$ and of $PPOUT_l$ for the exit node $l$ are initialized to $\emptyset$.
For each interval processed in the propagation phase, after the initial determination of $AVIN_h$ in the forward traversal, the transfer functions given by Equations 1 and 2 are applied to obtain $AVOUT$ and $AVIN$ values for the other nodes. Similarly, during the backward traversal, after determining $PPOUT_l$ for the last node $l$, Equations 7 and 8 are applied to obtain $BA\_PPIN$ and $PPOUT$ values for the remaining nodes. Furthermore, during this backward traversal, after computing $PPOUT_p$ for node $p$, the forward correction given by Equation 9 is applied to $PPIN_i$ for each successor node $i$, as discussed earlier. Following the determination of $PPIN_i$ (which is complete after the last forward correction has been applied from its predecessor(s)), the values of $INSERT_i$ and $REDUND_i$ are obtained using Equations 5 and 6 respectively.
Communication Optimizations
Following the determination of INSERT and REDUND for each node, communications corresponding to the values of INSERT are placed at the exits of nodes, and the values of REDUND are used to delete redundant communication. We now describe how different optimizations are captured by the data flow procedure that we have described. Message vectorization is accounted for by the computation of ANTLOC for an entire interval, as it characterizes the communication that can be moved outside a loop. Since message vectorization is a well-understood optimization implemented by most distributed memory compilers based on data-dependence [33, 26, 18, 23, 22, 15], we shall focus on other important optimizations that require the generality of data-flow analysis.
Both 
Reduction in volume of communication
Under different conditions, this reduction could mean that the amount of data being communicated is reduced, or the number of processors to which data is sent could be reduced. This case arises when the data to be communicated is a subset of the data that is available, though it is available at fewer processors than needed. Thus, the data $D_1$ can be sent to just the extra processors, [...]

Our data flow analysis procedure moves communications as early as legally possible and avoids introducing unnecessary communication, thus handling conditional control flow effectively. Traditionally, researchers have proposed inserting sends ahead of receives to help hide the latency of communication. In the context of our framework, a better approach would be to place blocking sends² and non-blocking receives at the point of insertion of communication, and to insert a wait at the reference to non-local data so that the receive completes before that data is read. This leads to the initiation of communication at the earliest possible point (under the constraint that there is no speculative communication), and waiting for the data to arrive only when it is needed. Thus, for communication that can be moved significantly further ahead, much of the latency can be hidden if the underlying target architecture supports overlap between communication and computation.
Operations on Available Section Descriptors
In this section, we present the algorithms for various operations on the ASDs. Each operation is described in terms of further operations on the array section descriptors and the mapping descriptors that constitute the ASDs. There is an implicit order in each of those computations. The operations are first carried out over the array section descriptors, and then over the descriptors of the mapping functions applied to the resulting array section.
The results of these operations cannot always be computed exactly, either due to some part of the operand(s) being unknown at compile-time, or due to the ASDs not being closed under that operation. In that case, the compiler must appropriately either underestimate or overestimate the result so that the final optimization is only conservative, not incorrect. In Section 4.1, for each data flow equation, we described whether the result was underestimated or overestimated. Based on those constraints, we observe that the results of intersection and union operations are always underestimated in our framework, while the result of the difference operation may need to be underestimated or overestimated, depending on the data flow equation. In our descriptions, the special value $\perp$ is used to represent statically unknown parameters. This special value $\perp$ is treated as null when the results are being underestimated. In case of the difference operation, $var_1 - var_2$, when the result is to be overestimated, a resulting value of $\perp$ is interpreted as $var_1$.

Union Operation. The union operation presents some difficulty because the ASDs are not closed under the union operation. The same is true for data descriptors like the DADs and the BRSDs, that have been used in practical optimizing compilers [4, 17]. As we explained earlier, in the context of our framework, any errors introduced during this approximation should be towards underestimating the extent of the descriptor.
One way to minimize the loss of information in computing $\langle D_1, M_1 \rangle \cup \langle D_2, M_2 \rangle$ is to maintain a list consisting of (1) $\langle D_1, M_1 \rangle$, (2) [...] If $D_1 = D_2$, only the fourth term needs to be retained, since $\langle D_1, M_1 \cup M_2 \rangle$ subsumes all other terms. If $M_1 = M_2$, the third term, which effectively evaluates to $\langle D_1 \cup D_2, M_1 \rangle$, subsumes all other terms, which may hence be dropped. In addition to these optimizations, the compiler can use heuristics like dropping terms associated with the smaller array regions to ensure that the size of such lists is bounded by a constant. Further discussion of these heuristics is beyond the scope of this paper. In our prototype implementation, for simplicity, the compiler represents the result of the union operation as a list of the individual ASDs unless it can infer that one ASD is a subset of the other. This ensures, at the expense of some accuracy, that the size of the list is linearly bounded by the number of communications.
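The list-based union used in the prototype can be sketched as follows. The encoding is a deliberate simplification and an assumption of this sketch: each ASD is a pair of one-dimensional triplets `(l, u, s)` with positive strides, one for the data section and one for a constant mapping to grid positions, and all function names are hypothetical. The subset test is conservative: it may fail to detect a subset (keeping an extra list entry) but never reports one wrongly.

```python
def triplet_subset(a, b):
    """True if every element of triplet a = (l, u, s) is also in triplet b."""
    (l1, u1, s1), (l2, u2, s2) = a, b
    if l1 > u1:
        return True  # empty section
    last1 = l1 + ((u1 - l1) // s1) * s1  # last element actually reached
    if l1 < l2 or last1 > u2 or (l1 - l2) % s2 != 0:
        return False
    return last1 == l1 or s1 % s2 == 0

def asd_subset(x, y):
    """Conservative subset test on ASDs encoded as (section, mapping)."""
    return triplet_subset(x[0], y[0]) and triplet_subset(x[1], y[1])

def asd_union(asds, new):
    """Add `new` to the list unless an existing ASD already subsumes it."""
    if any(asd_subset(new, old) for old in asds):
        return list(asds)
    return [old for old in asds if not asd_subset(old, new)] + [new]

small = ((1, 50, 1), (1, 4, 1))
large = ((1, 100, 1), (1, 4, 1))
other = ((1, 50, 1), (5, 8, 1))
print(asd_union([small], large))
print(asd_union([small], other))
```

Since each union either drops or appends entries, the list never holds more entries than the number of ASDs that have been added to it.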
Difference Operation. The difference operation causes a part of the data associated with the first ASD to be invalidated at the processors to which that part is mapped under the second ASD.

Union Operation. The BRSDs are not closed under the union operation. The algorithm described in [17] to compute an approximate union potentially overestimates the region corresponding to the union. We need a different algorithm, one that underestimates the array region while being conservative. In the special case when the array regions identified by $D_1$ and $D_2$ differ only in one dimension, say, $i$, the union operation is given by: 2 )) When the regions corresponding to $D_1$ and $D_2$ differ only in one dimension, all except for one of the terms in the above expression evaluate to null. When there is more than one non-null term, the result may be represented as a list, and heuristics used (as discussed earlier) to keep such lists bounded in length.
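To make the idea of an underestimating union concrete, the following sketch merges two regular sections that differ in at most one dimension. It is an illustration under stated assumptions, not the formula of this paper: each dimension is a Fortran 90 style triplet `(l, u, s)` with `l <= u` and `s > 0`, both sections have the same number of dimensions, and the helper names (`normalize`, `union_triplet`, `union_sections`) are hypothetical. Whenever the exact union of two triplets is not itself a triplet, the sketch falls back to the larger operand, which is always contained in the true union.

```python
from typing import List, Tuple

Triplet = Tuple[int, int, int]   # (l, u, s), read as l:u:s with l <= u and s > 0
Section = Tuple[Triplet, ...]    # one triplet per array dimension


def normalize(t: Triplet) -> Triplet:
    """Lower u to the last index actually reached from l with stride s."""
    l, u, s = t
    return (l, l + ((u - l) // s) * s, s)


def size(t: Triplet) -> int:
    """Number of indices described by the triplet."""
    l, u, s = normalize(t)
    return (u - l) // s + 1


def union_triplet(a: Triplet, b: Triplet) -> Triplet:
    """Exact union for aligned, same-stride, touching ranges; else the larger operand."""
    (l1, u1, s1), (l2, u2, s2) = normalize(a), normalize(b)
    aligned = s1 == s2 and (l1 - l2) % s1 == 0
    touching = l1 <= u2 + s2 and l2 <= u1 + s1
    if aligned and touching:
        return (min(l1, l2), max(u1, u2), s1)
    return normalize(a) if size(a) >= size(b) else normalize(b)  # underestimate


def union_sections(d1: Section, d2: Section) -> List[Section]:
    """Merge sections differing in at most one dimension; otherwise keep both."""
    d1 = tuple(normalize(t) for t in d1)
    d2 = tuple(normalize(t) for t in d2)
    differing = [i for i, (x, y) in enumerate(zip(d1, d2)) if x != y]
    if len(differing) > 1:
        return [d1, d2]
    if not differing:
        return [d1]
    i = differing[0]
    return [d1[:i] + (union_triplet(d1[i], d2[i]),) + d1[i + 1:]]


left = ((1, 50, 1), (1, 100, 1))
right = ((51, 100, 1), (1, 100, 1))
print(union_sections(left, right))  # [((1, 100, 1), (1, 100, 1))]
```

When the differing dimension cannot be merged exactly, this sketch keeps only the larger operand. Returning both operands as a list would lose less information, at the cost of longer lists.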
Again, the formula for computing the difference, $R_i = S_i$ 
Preliminary Implementation and Results
We have built a prototype implementation of our data flow framework as part of the pHPF compiler for HPF [15]. Our implementation is meant to serve as a platform to investigate the potential performance benefits of the data flow analysis, and currently represents a very simplified version of the analysis presented in this paper. At present, our compiler only eliminates fully redundant communication; it does not try to reduce the amount of data communicated or hide the latency of communication. Furthermore, for the sake of simplicity of implementation, the compiler does not currently move communication across different loop nests. However, since the analysis is performed on the program before it is transformed with loop distribution for vectorizing communication and eliminating extra guard statements [18, 15], our results include some potential benefits of a more global analysis as well.
In our implementation, we have incorporated an extension that lets the analysis exploit, in special cases, information about the distribution of arrays on physical processors, not merely the alignment information. Consider the program shown in Figure 11. On the basis of alignment information, we would view the communications $B(i, j-1) \rightarrow VPROCS(i, j)$ and $B(i, j-2) \rightarrow VPROCS(i, j)$ as separate. However, taking into account the block distribution of the second array dimension, we can recognize the communication for $B(i, j-2)$ as subsuming the communication for $B(i, j-1)$, as the former involves two boundary columns and the latter involves one boundary column being communicated. We have implemented this by extending the test for one communication being a subset of another, for nearest-neighbor communication.
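The boundary-column argument can be checked with a small sketch. It is only an illustration, not the extended subset test of the compiler: it assumes a processor that owns a contiguous block of columns `lo..hi` of the second dimension, ignores the array boundary, and uses the hypothetical helpers `nonlocal_columns` and `subsumes`.

```python
from typing import Set


def nonlocal_columns(lo: int, hi: int, shift: int) -> Set[int]:
    """Columns j - shift, for owned columns j in lo..hi, that lie left of the block."""
    return {j - shift for j in range(lo, hi + 1) if j - shift < lo}


def subsumes(lo: int, hi: int, larger_shift: int, smaller_shift: int) -> bool:
    """True if the data fetched for the larger shift covers the smaller shift."""
    return nonlocal_columns(lo, hi, smaller_shift) <= nonlocal_columns(lo, hi, larger_shift)


# A processor owning columns 26..50 of a block-distributed dimension.
print(sorted(nonlocal_columns(26, 50, 1)))  # [25]
print(sorted(nonlocal_columns(26, 50, 2)))  # [24, 25]
print(subsumes(26, 50, 2, 1))               # True
```

In this model the inclusion holds whenever the block contains at least two columns; with a single-column block the two references need different columns, which is why the test depends on the distribution and not on alignment alone.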
We now describe the results of our experiments performed on five programs, which are part of an HPF benchmark suite developed by Applied Parallel Research, Inc. The first program, tomcatv (originally from the SPEC benchmarks), does mesh generation with Thompson's solver. The second program, x42, is an explicit modeling system using fourth-order differencing. The third program, tred2, is from the EISPACK library. It reduces a real symmetric matrix to a symmetric tridiagonal matrix, using and accumulating orthogonal similarity transformations. The program grid performs a 9-point stencil computation followed by global reductions. The last program, baro, performs computations for a shallow water atmospheric model.

Table 3: Execution times (in seconds) of tomcatv on IBM SP-1

Table 2 shows the static counts of the number of references to array and scalar variables in the program which required interprocessor communication, with and without the optimization for redundancy elimination. The first column describes, for each program, how the main HPF template was distributed on processors, since that affects the number of communications needed. Each of the programs was compiled with the number of physical processors left unspecified at compile time. As can be seen from Table 2, for the cases with more than ten references needing communication, we observed a range of 13.6% to 52.3% of those communications (in terms of the static counts of references) as being completely redundant. The results for the program baro have been presented in terms of individual results for each subroutine. It was particularly encouraging to note that the subroutine that shows the best improvement, cmslow, is the most frequently executed subroutine and accounts for the maximum amount of time spent in the program. The improvements for tred2 were modest; the only communications eliminated were those for scalars.
However, a hand analysis of the program showed opportunities for more global communication optimizations captured by our framework, which were not implemented in this version of the compiler.

We now present some performance results obtained on the IBM SP-1 machine. The programs were compiled using two versions of the HPF compiler: one that does not perform any data flow analysis to eliminate redundant communication, and one that does. These timings do not include any time spent on I/O, since that had been commented out from the main computation in those programs. Tables 3 and 4 show the performance of the programs tomcatv and grid for various values of the number of processors (p) and the data size (n). The tables give the execution times without applying the redundant message elimination (RME) optimization and after applying this optimization.

Table 3 shows noticeable improvements in the performance of tomcatv due to redundant message elimination. The performance improvement on 16 processors varies from 25% to 55% for different data sizes ranging from n = 65 to n = 1025. The relative gain in performance is lower for larger data sizes and for programs run on fewer processors, because computation time dominates communication time. However, even for larger data sizes, the performance improvement on 16 processors is quite significant. For smaller data sizes, the improvement is much more substantial. It is interesting to note that redundant message elimination enables the compiler to obtain speedups for a data size as small as n = 65, whereas there were no speedups obtained for that data size without this optimization. These results confirm the effectiveness of redundant message elimination in reducing the communication costs of this benchmark program.

[Table header: n; p = 1, p = 4, p = 8, p = 16; columns No RME and With RME under each value of p]

The performance improvement for grid is only modest.
This is because the communication time for this benchmark program is very small compared to the overall execution time, leaving relatively little room for improvement. Due to this low communication time, we observe good speedups even without redundant message elimination. However, even for this program, when n = 64, where the communication time is relatively higher, we notice a performance improvement of 3-16% after redundant message elimination. Thus, our optimization does reduce the communication cost, but makes a significant difference to overall performance only if the communication cost itself is high.
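This dependence on communication cost can be made explicit with a simple model. As a rough sketch, assume that execution time splits into a computation part and a non-overlapped communication part, and that redundant message elimination removes a fraction $r$ of the communication time. The decomposition and the symbols $T_{comp}$, $T_{comm}$, $f$, and $r$ are illustrative assumptions, not quantities measured in the experiments above.

```latex
\begin{align*}
T_{\text{no RME}} &= T_{comp} + T_{comm}, \qquad f = \frac{T_{comm}}{T_{comp} + T_{comm}},\\
T_{\text{RME}} &= T_{comp} + (1 - r)\,T_{comm},\\
\frac{T_{\text{no RME}} - T_{\text{RME}}}{T_{\text{no RME}}} &= \frac{r\,T_{comm}}{T_{comp} + T_{comm}} = r f .
\end{align*}
```

Under this model the overall gain can never exceed the communication fraction $f$. With hypothetical values $r = 0.4$ and $f = 0.05$ the improvement is 2%, whereas the same $r$ with $f = 0.6$ gives 24%.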
Related Work
Many other researchers have used data-flow analysis to optimize communication. Granston and Veidenbaum [12] use data-flow analysis to detect redundant accesses to global memory in a hierarchical, shared-memory machine. However, they do not explicitly represent information about the availability of data on processors. Instead, they rely on simplistic assumptions about scheduling of parallel loops, which are often not applicable. Amarasinghe and Lam [3] use the last write tree framework to perform optimizations like eliminating redundant messages. Their framework does not handle general conditional statements, and they do not eliminate redundant communication due to different references in arbitrary statements (for instance, statements appearing in different loop nests). Von Hanxleden et al. [31, 30] have developed a data flow framework for generating communication in the presence of indirection arrays. Their work focuses on irregular subscripts, and therefore does not attempt to obtain more precise information about array sections. Gong et al. [11] describe a data-flow procedure that unifies optimizations like vectorizing communication and removing partially redundant communication. They only handle programs with singly nested loops and unidimensional arrays, and with very simple subscripts.
Suppression of partially redundant code is a powerful code optimization and has found its way into a number of commercial compilers. Morel and Renvoise [24] first proposed a bidirectional bit-vector algorithm for the suppression of partial redundancies. The complexity of bidirectional problems for bit-vector representations of data flow information was addressed by later papers [8, 20, 9, 19]. This paper applies the techniques from [24] and [9] for eliminating partially redundant communication. This work extends the previous result on the decomposition of this bidirectional problem into efficient unidirectional problems by proving it in the context of an approximate data flow representation, namely the ASDs.
Interval analysis, introduced by Allen and Cocke [2], has been used to solve several data flow problems. The work by Gross and Steenkiste [13] was the first to extend interval analysis to handle array sections. Our data flow procedure refines the algorithms described in [13] by using loop-carried data dependences while summarizing data flow information for intervals. In addition, we apply data flow analysis to ASDs, which represent both array section information and information about the processor elements on which the array elements are available.
We have used ideas from well-known representations of array sections used in other contexts [6, 4, 17] for developing a representation of communication in this work. In particular, we use the BRSD proposed by Havlak and Kennedy [17] to represent, in our framework, the data involved in communication. The concept of a mapping function descriptor that we have introduced represents a crucial extension to the notion of data descriptor. It enables representation of communication and of the data made available at various processors by prior communications. That in turn allows data flow analysis to be used for powerful communication optimizations.
Conclusions
We have presented a data-flow framework for reducing communication costs in a program. This framework provides a unified algorithm for performing a number of optimizations that eliminate redundant communication and that reduce the volume of communication, by reducing both the data and the number of processors involved. The algorithm also determines the earliest point at which communication can legally be moved without introducing extra communication, which can help hide the latency of communication. The algorithm is quite general: it handles control flow and performs optimizations across loop nests. It also does not depend on a detailed knowledge of the explicit communication representation.
An important feature of our approach is that the analysis is performed at the granularity of sections of arrays and processors, which considerably enhances the scope of optimizations based on eliminating partial redundancies. We prove that, even in the context of an ASD representation, the bidirectional problem of determining placement can be decomposed into a backward problem followed by a forward correction. This ensures the practicality of the analysis by making it efficient. The preliminary results from a simplified implementation of this framework show significant performance improvements, and confirm the effectiveness of the optimization to eliminate redundant communication.
In the future, we plan to conduct more extensive experiments to study the performance impact of other optimizations captured by our framework, like reducing the volume of communication and hiding the latency of communication. This will require further examination of issues like management of buffers containing nonlocal data from other processors. There is also further work needed on the problem of optimal placement of communication. While our framework determines the earliest point at which communication can legally be moved, actually initiating the communication at that point may not necessarily lead to the best results.
Future work will also involve extending the data flow framework to perform interprocedural optimizations. There is also scope for integrating the concepts of ownership and availability of data, and developing algorithms for additional optimizations that exploit the fact that processors other than the owners can also send values to processors that need them.
