Array statements are often used to express data-parallelism in scientific languages such as Fortran 90 and High Performance Fortran. In compiling array statements for a distributed-memory machine, efficient generation of communication sets and local index sets is important. We show that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be efficiently enumerated as closed forms using regular sections. First, closed form solutions are presented for arrays that are distributed using block or cyclic distributions. These closed forms are then used with a virtual processor approach to give an efficient solution for arrays with block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to physical processors. These views are referred to as virtual-block or virtual-cyclic views depending on whether a block or cyclic distribution of the array on the virtual processors is used. The virtual processor approach permits different schemes based on the combination of the virtual processor views chosen for the different arrays involved in an array statement. These virtualization schemes have different indexing overheads. We present a strategy for identifying the virtualization scheme which will have the best performance. Performance results on a Cray T3D system are presented for hand-compiled code for array assignments. These results show that using the virtual processor approach, efficient code can be generated for execution of array statements involving block-cyclically distributed arrays.
Introduction
Programming languages such as High Performance Fortran (HPF) [5], Fortran D [6], Vienna Fortran [3], and Distributed Fortran 90 [13] support a programming model based on a single address space and provide directives for the explicit specification of data distributions for arrays. Block, cyclic, and block-cyclic distributions are the regular data distributions provided in these languages. Array statements involving array expressions are often used to express data-parallelism in these languages. Array expressions involve array sections which consist of array elements from a lower index to an upper index at a fixed stride. In order to generate high-performance target code, compilers for distributed-memory machines should produce efficient code for array statements involving distributed arrays.
Compilation of array statements with distributed arrays for parallel execution requires partitioning of the computation among the processors. Many compilers use the owner-computes rule to partition the computation in such a manner that the processor owning the array element performs the computation which modifies it. The computation performed on a processor may involve array elements resident on other processors. In the generated code, all the non-local data needed by a processor is fetched into temporary arrays in a processor's local memory using interprocessor communication. In order to reduce the communication overhead, each processor first determines all the data it needs to send to and receive from other processors and then performs the needed communication. This reduces the communication overhead by aggregating all the data movements needed from one processor to another into a single message. After communication, each processor then performs the computation in its local address space.
The overhead to perform the data movement consists of determination of the following data index sets and processor sets for each processor p:
Send processor set of processor p: set of processors to which p has to send data.
Send data index set of processor p to processor q: indices of the array elements which are resident on p but are needed by q.
Receive processor set of processor p: set of processors from which p has to receive data.
Receive data index set of processor p from processor q: indices of the array elements which are needed by p but are resident on q.
Closed form characterization of these sets would reduce the overhead of packing data into messages on the sending processor and unpacking data at the receiving processor. If the arrays have only block or cyclic distributions, then the data index sets and the processor sets can be characterized using regular sections for closed forms [7, 8, 12]. However, for the general block-cyclic distribution, closed form characterization of these sets using simple regular sections is not possible. This paper presents a virtual processor approach to efficiently enumerate the data index sets and processor sets when arrays have block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to physical processors. These views are referred to as virtual-block or virtual-cyclic views depending on whether a block or cyclic distribution of the array on the virtual processors is used. These virtualization views permit us to use the closed forms for block and cyclic distributions in the virtual processor domain.
A processor performs the computation for the virtual processors mapped to it. A message from processor p to q consists of all the data to be sent from the virtual processors on p to the virtual processors on q.
Under the owner-computes rule, a processor owning the array element in an array section on the right hand side of the array assignment statement sends data to the processor owning the corresponding element of the array section on the left hand side of the assignment. We call the array section that is assigned a value the target array section, and an array section of the array expression on the right hand side of the array assignment statement a source array section. For an array statement, each source and target array section pair is analyzed to determine the total data to be sent and received by each processor. The indexing overhead for a source and target array section pair depends upon the virtualization views used for each source and target array dimension. There are four possible combinations of virtual views per dimension, as either a virtual-block or a virtual-cyclic view can be used for each array axis. Although each of the possible combinations involves exactly the same communication between any pair of physical processors, they will generally have different indexing overheads. We present a selection strategy to choose the virtualization scheme with the minimum indexing overhead.
We have implemented the virtual processor approach on a Cray T3D multicomputer. The performance measurements demonstrate that the indexing overheads for the four virtualization schemes are significantly different. The selection strategy provides a good prediction of the virtualization scheme with the lowest indexing cost. The additional overhead for generating the structures facilitating the enumeration of the index sets is in the range of 5% to 50% of the execution time for the array statement.
The rest of the paper is organized as follows. In Section 2, related work is presented. Section 3 describes the block, cyclic, and block-cyclic distributions along with their indexing functions and identifies the issues involved in efficient execution of array statements involving distributed arrays. In Section 4, we present closed form solutions in terms of regular sections for processor sets and data index sets for block-wise or cyclically distributed arrays. We then use the virtual processor approach to extend these solutions to block-cyclic distributions in Section 5. Performance results on a Cray T3D system are presented in Section 6. Conclusions and discussions are provided in Section 7.
Related Work
The issue of generating code for message passing machines from a single address space program with data distribution directives was addressed by Koelbel in [12], where a closed form characterization of the data index sets was provided for computations involving identically distributed arrays with either a block or a cyclic distribution. Closed form characterizations of the processor sets were not developed and arrays with block-cyclic distributions were not considered. Compilation of array statements in Distributed Fortran 90 is described in [13], but Distributed Fortran 90 supports only the block distribution.
For an array expression of the form $B(l_2 : u_2 : s_2) = f(A(l_1 : u_1 : s_1))$ with the arrays distributed using block-cyclic distributions, Paalvast et al. [14] present techniques to enumerate the portion of $B$ which is modified by a processor $p$ and the portions of array $A$ which reside on $p$ but are needed by the other processors. Their communication scheme is based on scanning over the active array indices of array sections to determine the elements which need to be communicated. This scheme will incur a high run-time overhead since a local-to-global and a global-to-local translation will be needed for each active element in order to determine the destination processor.
The problem of active index-set identification for array statements involving block-cyclically distributed arrays was addressed by Chatterjee et al. [4] using a finite-state machine (FSM) to traverse the local index space of each processor. If all arrays in an array statement have the same block-cyclic distribution and access stride, the order of access of the local elements with the FSM approach turns out to be the same as the access order when the virtual-block view is taken with our approach. However, if the distribution or access stride of the array section on the left-hand side is different from that of an array section on the right-hand side, it appears that after determination of the active local indices of the r.h.s. section using an FSM, an explicit local-to-global translation corresponding to the r.h.s. section and a global-to-local translation corresponding to the l.h.s. section will need to be performed for each active element. In [4], restricted cases involving arrays with different strides are treated, but even with these, the generation of communication sets requires explicit index translation for each active element. With the virtual processor approach explicit local-to-global and global-to-local translation is not needed, even when the array sections differ in both distribution and access stride.
Recent independent work by Stichnoth also addresses the problem of active index-set and communication-set identification for array statements involving block-cyclically distributed arrays [16, 17]. The formulation proposed has similarities to the virtualization scheme with a virtual-cyclic view at both the source and target array. However, the approach in [17] does not attempt a closed form characterization of the active processor sets with respect to each source/destination processor. Although the proposed scheme does not require the inclusion of indexing data in the message, a protocol in which a processor sends both the data values and the indices on the receiving processor, to facilitate unpacking, is proposed and evaluated. This strategy increases the data volume transmitted but improves performance on machines with high communication bandwidth such as the iWarp system.
The Syracuse F90-D compiler's initial implementation uses compile-time characterization of communication only for block distributions [1] and relies upon run-time generation of schedules for the general block-cyclic case using the approach adopted by the PARTI system [15].
The implementation of the Fortran-D compiler at Rice University is being extended to handle arrays with block-cyclic distributions [9, 10]. Their approach for determining communication sets is based upon computing the intersection of data index sets corresponding to the left-hand side and right-hand side array references in an assignment statement. The intersection is determined using a scanning approach which efficiently exploits the repetitive pattern of the intersection of the two index sets. An approach similar to the FSM approach [4] for determining the local memory access sequence is used. Efficient techniques for the FSM table generation are presented for special cases. The approach relies upon run-time methods [15] to handle the general case.
The use of the virtual processor approach for addressing the problem of active index-set identification for array statements involving block-cyclic distributions and that of redistribution of arrays with block-cyclic distributions was first reported by us in [7, 8]. This paper builds on that work and extends it in several ways: explicit closed-form characterization of processor sets using regular sections, development of a cost model to determine the virtualization scheme with lowest indexing time, and more extensive experimental verification of the efficacy of the approach.
Data Distributions and Array Statements
In this section, we describe regular data distributions and compilation of array statements involving regularly distributed arrays. Consider an array $A(m : n)$ distributed onto $P$ processors. In a block distribution, contiguous blocks of size $\lceil (n - m + 1)/P \rceil$ are allocated to the processors. In a cyclic distribution, array elements are assigned to the processors in a cyclic fashion. In a block-cyclic distribution, specified as cyclic(b), blocks of size $b$ are allocated to the processors in a cyclic fashion. The block-cyclic distribution is the most general regular data distribution in languages such as HPF. An element $A(i)$ of the array $A$ has a global and a local index. The global index $i$ is the index in array $A$, as referenced in an HPF program. The portion of array $A$ located in a processor's local memory is referred to as $A_{loc}$ in the processor's node program generated by an HPF compiler. The index of element $A(i)$ in $A_{loc}$ is its local index. The relationships between the global index, the local index and the processor index for regular data distributions are shown in Table 1. A processor is said to own the elements of the distributed array which are allocated to it.
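Since Table 1 is not reproduced here, the following Python sketch illustrates the standard global-to-(processor, local index) mappings for the three regular distributions described above. The function names are hypothetical and the formulas are the usual textbook ones, assumed rather than copied from the table.

```python
def block_map(i, m, n, P):
    """Block distribution of A(m:n) on P processors -> (processor, local index)."""
    size = -(-(n - m + 1) // P)          # ceil((n - m + 1) / P) in integer arithmetic
    return (i - m) // size, (i - m) % size


def cyclic_map(i, m, P):
    """Cyclic distribution: elements are dealt to processors one at a time."""
    return (i - m) % P, (i - m) // P


def block_cyclic_map(i, m, b, P):
    """cyclic(b) distribution: blocks of size b are dealt to processors cyclically."""
    g = i - m                            # zero-based global offset
    block = g // b                       # index of the block containing the element
    return block % P, (block // P) * b + g % b
```

Under these formulas cyclic(1) coincides with the cyclic distribution, and cyclic(b) with $b = \lceil (n - m + 1)/P \rceil$ coincides with the block distribution.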
Let arrays $A(m_1 : n_1)$ and $B(m_2 : n_2)$ be distributed on $P_1$ and $P_2$ processors, respectively. [...]
PSend(p): the set of processors $\{q \mid DSend(p, q) \neq \emptyset,\ 0 \le q < P_2\}$, $0 \le p < P_1$.
PRecv(p): the set of processors $\{q \mid DRecv(p, q) \neq \emptyset,\ 0 \le q < P_1\}$, $0 \le p < P_2$.
Using these sets, the node program pseudo-code for the execution of the array statement is as shown in Fig. 2. In the program the receiving phase is combined with the execution phase. Rather than unpacking a message into a local temporary location and separately evaluating the function $f$ on each element of the local section, $f$ is applied on each element directly from the message buffer. This optimization reduces the overhead of an overlap which is useful to hide communication latency. However, this optimization may not be applicable when the array expression has multiple array sections on the right hand side. In this case, the receiving and execution phases are performed separately. The sending and receiving phase is executed for each right hand side array section, storing the result in temporary arrays which are distributed and aligned identically to the left hand side array. The execution phase evaluates the values for the elements of the left hand side array section using the local index set LIndex() to access these elements, and obtains the required input elements from the appropriate temporary arrays. Since the data and processor send and receive sets and the local index sets are integral to the execution of array assignments, efficient schemes for enumerating these sets are important.
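As a reference point for the closed forms, the four sets can be enumerated by brute force. The sketch below is not the paper's method: it scans every section element, which is exactly the per-element cost the closed forms avoid. It assumes positive strides, cyclic(b) distributions given as hypothetical `(m, b, P)` tuples, and records global indices (source indices of A in `dsend`, target indices of B in `drecv`).

```python
from collections import defaultdict


def owner(i, m, b, P):
    """Processor owning global index i of an array with lower bound m, cyclic(b) on P."""
    return ((i - m) // b) % P


def comm_sets(sec_a, dist_a, sec_b, dist_b):
    """Brute-force DSend, DRecv, PSend, PRecv for B(l2:u2:s2) = f(A(l1:u1:s1))."""
    l1, u1, s1 = sec_a
    l2, u2, s2 = sec_b
    dsend = defaultdict(list)   # (p, q) -> A indices resident on p, needed by q
    drecv = defaultdict(list)   # (q, p) -> B indices on q computed from data sent by p
    for ia, ib in zip(range(l1, u1 + 1, s1), range(l2, u2 + 1, s2)):
        p = owner(ia, *dist_a)  # owner of the source element
        q = owner(ib, *dist_b)  # owner of the target element (owner computes)
        dsend[(p, q)].append(ia)
        drecv[(q, p)].append(ib)
    psend = defaultdict(set)
    precv = defaultdict(set)
    for p, q in dsend:
        psend[p].add(q)
    for q, p in drecv:
        precv[q].add(p)
    return dsend, drecv, psend, precv
```

Block and cyclic distributions are covered as the special cases $b = \lceil N/P \rceil$ and $b = 1$, so the output can be used to cross-check a closed-form implementation on small inputs.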
For block and cyclic distributions, the send and receive data index and processor sets can be expressed as regular sections. However, for block-cyclic distributions, these sets cannot be expressed as simple regular sections. For instance, consider the local index set of $A(0 : 39 : 2)$ when $A(0 : 39)$ is distributed onto four processors using a cyclic(5) distribution, as shown in Fig. 3. The array section elements are separated by a constant stride of 2 in global index space. However, the elements of the array section do not have a constant stride in the local index space of a processor. For example, LIndex(0) = (0, 2, 4, 5, 7, 9) and LIndex(1) = (1, 3, 6, 8).
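The example above can be checked with a small brute-force enumeration. This is only an illustrative sketch (the function name `local_index_set` is hypothetical, and a positive stride is assumed); it scans the section and translates each owned element to its local index under a cyclic(b) distribution.

```python
def local_index_set(l, u, s, m, b, P, p):
    """Local indices on processor p of section (l:u:s) of A(m:...), cyclic(b) on P."""
    result = []
    for i in range(l, u + 1, s):
        g = i - m                 # zero-based global offset
        block = g // b            # block containing the element
        if block % P == p:        # blocks are dealt to processors cyclically
            result.append((block // P) * b + g % b)
    return result


lindex0 = local_index_set(0, 39, 2, 0, 5, 4, 0)
lindex1 = local_index_set(0, 39, 2, 0, 5, 4, 1)
gaps0 = [y - x for x, y in zip(lindex0, lindex0[1:])]
```

Here `gaps0` contains both 1 and 2, which is the non-constant local stride that prevents a single regular section from describing the set.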
We now construct the data send and receive sets and processor send and receive sets for block and cyclic distributions. [...]
Node code for processor $p$ developed using the closed forms is shown in Fig. 5. Nonblocking sends and blocking receives are the communication primitives used. Fig. 5 shows the most general form of the code. It can be optimized by removing the unnecessary checks if some of the parameters are known at compile-time.
In the sending phase, processor $p$ uses the closed forms for $DSend(p, q)$ to send messages to all processors $q \in PSend(p)$. In the receiving phase, processor $p$ first determines the number of messages it will receive using $PRecv(p)$. Then, it receives the messages non-deterministically in the order they arrive. The receive command recv(q, tmp_buf) returns the sender processor's index in q and the message in tmp_buf. The receive data index set $DRecv(p, q)$ is used to unpack the received message. If $p \in PSend(p)$, then the explicit message send and receive is replaced by a local memory copy of the buffers.
Block Distribution to Cyclic Distribution
Let $A$ be block distributed and $B$ be cyclically distributed over $P_1$ and $P_2$ processors, respectively. [...] For example, Fig. 6 illustrates a case in which $P_2 = 3$ and $s_2 = 2$. It can be seen that the stride between the section elements on any processor is $\mathrm{lcm}(s_2, P_2)$ (= 6).
Figure 5: Node code for $B(l_2 : u_2 : s_2) = f(A(l_1 : u_1 : s_1))$. $A(m_1 : n_1)$ and $B(m_2 : n_2)$ are block distributed on $P_1$ and $P_2$ processors, respectively.
[Figure: rows labelled "slice index", "array elements located on proc q+1, proc q+2", and "processor index".]
Node code for processor $p$ can be similarly developed using these closed forms.
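The lcm stride property for a cyclically distributed array can be checked numerically. The sketch below assumes a zero lower bound for the array and for the section, a positive stride, and the values $P_2 = 3$, $s_2 = 2$ from the example; the helper name is hypothetical.

```python
from math import gcd


def cyclic_section_on_proc(l, u, s, P, p):
    """Global indices of section (l:u:s) owned by processor p under a cyclic distribution."""
    return [i for i in range(l, u + 1, s) if i % P == p]


P, s = 3, 2
stride = s * P // gcd(s, P)                     # lcm(s, P)
for p in range(P):
    owned = cyclic_section_on_proc(0, 30, s, P, p)
    global_gaps = {y - x for x, y in zip(owned, owned[1:])}
    local_gaps = {y // P - x // P for x, y in zip(owned, owned[1:])}
    print(p, owned, global_gaps, local_gaps)
```

Each processor sees a single global gap of `stride` and a single local gap of `stride // P`, which is why the section on a processor is itself a regular section when the distribution is cyclic.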
Cyclic Distribution to Block Distribution
The case when A(m 1 : n 1 ) is cyclically distributed and B(m 2 : n 2 ) is block distributed is the dual of the previous case where A was block distributed and B cyclically distributed. The send processor and send data index sets of the former become the receive processor and receive data index sets of the latter. Similarly the receive sets of the former become the send sets of the latter.
Cyclic Distribution to Cyclic Distribution
We now consider the case where arrays $A$ and $B$ are cyclically distributed over $P_1$ and $P_2$ processors, respectively. [...] The section indices $i_1$ such that $i_1$ is the smallest non-negative integer for which the corresponding $c_1$ is also a non-negative integer. Let $r_1 = \mathrm{lcm}(s_1, P_1)$. [...] Node code for processor $p$ can be similarly developed using these closed forms.
Modifications for Negative Strides
We now describe the modifications to the closed form expressions to handle negative strides. If both $s_1 < 0$ and $s_2$ [...] The data send set $DSend(p, q)$ for $q \in PSend(p)$ can be similarly evaluated.
Virtual Processor Approach for Block-Cyclic Distributions
In this section, we present a virtual processor approach for efficient execution of array statements involving block-cyclically distributed arrays. Let $A(m_1 : n_1)$ and $B(m_2 : n_2)$ be distributed using a cyclic($b_1$) and cyclic($b_2$) distribution on $P_1$ and $P_2$ processors, respectively. For an array statement of the form $B(l_2 : u_2 : s_2) = f(A(l_1 : u_1 : s_1))$, the virtual processor approach involves:
1. Viewing a cyclic($b_1$) distribution of $A$ as a block (or cyclic) distribution on $VP_1$ virtual processors which are cyclically (or block-wise) mapped to $P_1$ processors. These views are referred to as virtual-block or virtual-cyclic views depending on whether a block or cyclic distribution of the array on the virtual processors is used. Fig. 8 gives a schematic illustration of the two views of a block-cyclic distribution. The cyclic($b_2$) distribution of $B$ is similarly viewed as a block or cyclic distribution on $VP_2$ virtual processors.
2. Determining the communication required to perform the array statement by using the closed forms of Section 4 in the virtual processor domain. Each physical processor bears the responsibility of performing the computation and communication for the virtual processors mapped to it.
Thus the virtual processor approach is characterized by a two-level mapping of array elements to physical processors. The first level maps the array elements to virtual processors and the second level maps virtual processors to physical processors. The mapping at each level can be represented using simple regular sections, which facilitates efficient implementation of this approach.
We now present the details of the virtualization schemes. Depending upon the virtualization views used for the source array $A$ and the target array $B$, four different schemes for executing the array statement are possible. The four schemes are shown in Table 2. Each scheme is associated with a different number of source and target virtual processors and a different communication pattern in the virtual processor domain, and incurs different indexing overheads. Hence, we also present a strategy to select the communication scheme with minimum indexing overhead.
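The two-level mapping can be sketched as follows. The code assumes a zero lower bound and, for the virtual-cyclic view, $P \cdot b$ virtual processors mapped block-wise in groups of $b$; that count is an assumption of this sketch and is not quoted from the text. Function names are hypothetical.

```python
def owner_direct(i, b, P):
    """Physical owner of element i under cyclic(b) on P processors."""
    return (i // b) % P


def owner_virtual_block(i, b, P):
    v = i // b            # level 1: block distribution, one block of size b per virtual processor
    return v % P          # level 2: virtual processors mapped cyclically to physical ones


def owner_virtual_cyclic(i, b, P):
    v = i % (P * b)       # level 1: cyclic distribution over P*b virtual processors (assumed count)
    return v // b         # level 2: virtual processors mapped block-wise to physical ones


agree = all(
    owner_direct(i, b, P) == owner_virtual_block(i, b, P) == owner_virtual_cyclic(i, b, P)
    for P in range(1, 6) for b in range(1, 7) for i in range(200)
)
```

Both views assign every element to the same physical processor as the direct formula, which is consistent with the statement that all schemes involve the same communication between physical processors.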
Virtualization Views
We now describe the virtual cyclic and virtual block views of a block-cyclic distribution.
Virtual-Block View
Let array $A(m : n)$ be distributed using a cyclic(b) distribution on $P$ processors. In the virtual-block view, $A$ is assumed to be block distributed on $VP = \lceil (n - m + 1)/b \rceil$ virtual processors. These virtual processors are assigned to $P$ processors in a cyclic fashion. The set of virtual processors on processor $p$ is $(p : VP : P)$. Fig. 9 illustrates the virtual-block view of a cyclic(2) distribution of $A(0 : 15)$ on two processors. The array has a block distribution on eight virtual processors $v_0$ to $v_7$, which are cyclically allocated to the two processors. [...]
Using closed form expressions developed in Section 4, data index sets can be evaluated in terms of the local indices of the virtual processors. However, the local index of an element on a virtual processor is not the same as its local index on the processor to which it is mapped. For instance, in Fig. 9, element $A(8)$ has a local index of 4 in $A_{loc}$ on processor $p_0$, but a local index of 0 on virtual processor $v_4$. Since the physical processor $p$ performs the computation and communication for the virtual processor $v$ mapped to it, it is necessary to determine the translation from the virtual processor's local index space to the physical processor's index space. If the virtual processor $v$ is mapped to processor $p$, then the array element with local index $j$ on $v$ has a local index $(v \ \mathrm{div}\ P)\, b + j$ on processor $p$. Under the virtual-block view, a stride of $s$ in the local index space of a virtual processor on processor $p$ remains unchanged in the local index space of $p$. Hence, an array section $(l : u : s)$ of $A$ in the local space of a virtual processor $v$ on processor $p$ corresponds to array section $((v \ \mathrm{div}\ P)\, b + l : (v \ \mathrm{div}\ P)\, b + u : s)$.
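The translation from a virtual processor's local index space to the physical processor's local index space can be illustrated with the Fig. 9 configuration (cyclic(2) on two processors). The function names below are hypothetical; the formulas are the ones stated above.

```python
def virtual_block_local(i, m, b):
    """Virtual-block view: (virtual processor, local index on it) of element A(i)."""
    g = i - m
    return g // b, g % b


def to_physical(v, j, b, P):
    """Physical processor of virtual processor v, and the physical local index of
    the element with local index j on v, i.e. (v div P) * b + j."""
    return v % P, (v // P) * b + j


def section_to_physical(v, l, u, s, b, P):
    """Section (l:u:s) in the local space of v, expressed in the local space of its processor."""
    base = (v // P) * b
    return base + l, base + u, s     # the stride is unchanged under the virtual-block view


v, j = virtual_block_local(8, 0, 2)
p, local = to_physical(v, j, 2, 2)
```

For $A(8)$ this yields virtual processor 4 with local index 0, mapped to processor 0 with local index 4, matching the example in the text.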
Consider the array section $A(l : u : s)$. Under the virtual-block view not all virtual processors necessarily own elements of the array section. We refer to virtual processors that own array section elements as being active. [...] $VPSend(v_p)$ and $VPRecv(v_p)$ are obtained from PSend() and PRecv() assuming a block or cyclic distribution on the virtual processors. Similarly, $VDSend(v_p, v_q)$ and $VDRecv(v_p, v_q)$ are obtained from DSend() and DRecv(). Note that VDSend() and VDRecv() are defined in terms of the local index space of the physical processor and a translation from the virtual processor local index space to physical index space will be required. Node pseudo-code for execution of the array statement $B(l_2 : u_2 : s_2) = f(A(l_1 : u_1 : s_1))$ using the processor sets and data index sets in the virtual processor domain is shown in Fig. 11. $Phy_A(v)$ denotes the processor to which virtual processor $v$ is mapped under the virtual view for the array $A(m_1 : n_1)$. Similarly, $Phy_B(v)$ denotes the processor to which virtual processor $v$ is mapped under the virtual view for the array $B(m_2 : n_2)$.
The node pseudo-code in Fig. 11 could be inefficient as it may involve sending multiple messages between two processors. The additional message startup overhead can be reduced by splitting the send and receive phase into two phases. First, $PSend(p)$ and $DSend(p, q)$ are evaluated by scanning the active virtual processors and determining the processor and data send and receive index sets. These sets are defined as [...] where $VAct_A(p)$ denotes the set of active virtual processors on processor $p$ corresponding to the section $A(l_1 : u_1 : s_1)$ and $VAct_B(p)$ denotes the set of active virtual processors on processor $p$ corresponding to the section $B(l_2 : u_2 : s_2)$. The node pseudo-code using these sets is as shown in Fig. 12. The initialization of the appropriate sets is not shown in the figure. The first phase evaluates the processor and data index send and receive sets. Phase two packs and sends the data to each processor in $PSend(p)$, receives messages from all processors $q \in PRecv(p)$ and evaluates the new values for $B$ using $DRecv(p, q)$.
The message contains all data communicated from processor p to q, i.e., data from all active virtual processors on p to their target virtual processors on q. To facilitate unpacking, the data is packed in increasing order of target virtual processor index. Data for a particular target virtual processor from all its source virtual processors on processor p is stored in the increasing order of source virtual processor index.
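The packing order described above can be sketched as a grouping step on the sending processor. This is an illustrative sketch only: `virtual_msgs` (triples of source virtual processor, target virtual processor, payload) and `phy_b` (the map from a target virtual processor to its physical processor) are hypothetical names, and the payloads stand in for the array sections that would actually be packed.

```python
from collections import defaultdict


def aggregate_messages(virtual_msgs, phy_b):
    """Group virtual-level transfers into one message per destination physical processor.

    Within a message, data is ordered by increasing target virtual processor index,
    and for each target by increasing source virtual processor index.
    """
    per_dest = defaultdict(list)
    for vp, vq, payload in virtual_msgs:
        per_dest[phy_b(vq)].append((vq, vp, payload))
    return {
        q: [payload for _, _, payload in sorted(items, key=lambda t: (t[0], t[1]))]
        for q, items in per_dest.items()
    }


msgs = [(0, 5, "a"), (2, 1, "b"), (0, 1, "c"), (0, 4, "d")]
packed = aggregate_messages(msgs, lambda v: v % 2)   # targets mapped cyclically to 2 processors
```

Because the receiver can derive the same ordering from its own receive sets, no index data needs to travel with the message.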
Strategy for Selection of Virtualization Schemes
Given block-cyclic distributions for $A(m_1 : n_1)$ and $B(m_2 : n_2)$, four choices are available for the virtualization scheme used, as shown in Table 2. The choice of the virtualization scheme depends on the additional indexing overhead per processor. Since the communication pattern between physical processors is identical for all four schemes, the scheme with lowest indexing overhead has the lowest total completion time. During the send phase, a processor $p$ packs all the data in $DSend(p, q)$ for all $q \in PSend(p)$. As shown in Section 5.2, $DSend(p, q)$ is a union of array sections. Associated with each array section is the time for evaluation of the loop bounds and the loop overhead. Assuming that this overhead is nearly equal for each section, the total indexing cost per processor during the sending phase, $t_s(p)$, is proportional to the total number of array sections to be communicated. [...] Using the closed forms in Section 4, approximate measures for $VPSend_{max}$ and $VPRecv_{max}$ can be obtained. These measures are shown in Table 3. The source array of size $N_1$ is distributed on $P_1$ processors using a cyclic($b_1$) distribution and the target array of size $N_2$ is distributed on $P_2$ processors using a cyclic($b_2$) distribution. Approximate measures for the maximum number of virtual processors per physical processor [...] Thus, given a source and target block-cyclic distribution, an approximation of the maximum indexing cost can be obtained for each of the four virtualization schemes and a choice of the scheme best suited for the given distributions can be made.
Performance Results
In this section, we present experimental results for the virtual processor approach. The experiments were performed on a 32-node Cray T3D. The node programs used the PVM message-passing library. The time required by each processor to execute the node program for the array assignment statement was measured and the maximum time among all processors reported. Times were measured using the rtclock() wall clock timer.
The goals of the performance measurement were to determine whether the indexing costs for the four virtualization schemes were significantly different and whether the developed heuristic provided a good indication of the scheme with lowest indexing cost. Further, an estimate of the table generation costs and a comparison of the table generation costs with the execution time for the array statement was desired. Also, a verification of the premise that the scheme with the lowest indexing cost also has the lowest execution time and table generation time was desired.
To reduce the number of independent parameters, we measured times for array section assignments of [...] Table 4 presents the indexing cost estimates (IE) for the four virtualization schemes and the corresponding indexing times (IT). The indexing time includes the time for copying the data into and out of message buffers and does not include the time spent for communication. In Table 4, vb-vb refers to the virtual-block to virtual-block scheme, vb-vc refers to the virtual-block to virtual-cyclic scheme, vc-vb refers to the virtual-cyclic to virtual-block scheme and vc-vc refers to the virtual-cyclic to virtual-cyclic scheme. Tables 4(a), 4(b), and 4(c) give the indexing cost estimates and indexing times for the data sets (a), (b), and (c), respectively. For the last three cases in each of the data sets, either the source or the target distribution is chosen to be a block distribution.
Comparing the indexing estimates and the indexing times for the scheme with lowest indexing cost indicates that the selection criteria developed in Section 5 provide a good prediction of the best indexing scheme. In all the measured cases, the "predicted best" scheme has the lowest indexing time. Furthermore, the relative order among the indexing cost estimates is the same as that between the indexing times for the four schemes. For schemes with the same predicted cost, those having a virtual-cyclic view for either the source or target distribution have a higher indexing time. This can be attributed to the fact that the non-unit stride between consecutive elements in the data index sets for the virtual-cyclic view results in poor cache performance due to a decrease in spatial locality.
Phase two uses tables generated during phase one of the virtual processor approach as shown in Fig. 12. Table 5 gives the execution times and the table generation times for the cases shown in Table 4. The execution times reported are for phase two of the virtual processor approach and include the indexing time as well as the communication time. The best table generation and execution times are marked in Table 5. Comparing the lowest indexing times in Table 4 and the lowest execution times in Table 5, it is observed that the virtualization scheme which has the lowest indexing cost also has the lowest execution time. This follows by noting that the communication pattern among the processors for all four schemes is identical. Thus the communication time for the four schemes is nearly equal and the scheme with lowest indexing time also has the lowest execution time. Comparing the lowest execution times and lowest table generation times in Table 5, it is further observed that the scheme with the lowest execution time also has the lowest table generation time. This can be attributed to the fact that the table generation time is also directly proportional to the total number of sections generated. A comparison of the table generation time and the execution times for the best cases shows that for most of the considered cases, the table generation time is in the range of 5% to 50% of the execution time.
Discussion and Conclusion
In this paper, we have presented techniques for e cient enumeration of the data index sets and processor index sets for array expression execution involving arrays distributed using the block, cyclic, and blockcyclic distributions. Closed forms expressed as regular sections for determining the send and receive data and processor sets for block and cyclic distributions were derived. A strategy based on virtual processors was used along with these closed forms to generate e cient indexing code for arrays distributed using block-cyclic distributions. A heuristic for selecting the appropriate virtual views at the source and target distribution was developed. Performance results for an implementation on the Cray T3D were presented. The performance results demonstrate that the developed heuristic provides a good indication of the virtualization scheme with the lowest execution time. It is also observed that for a majority of the cases considered the table generation overhead for the virtual processor approach is a small percentage of the execution time.
The virtual processor approach can be further optimized to handle frequently occurring cases, such as data redistribution and array expressions with unit strides and/or identically distributed arrays, more efficiently. These optimizations will reduce the table generation time but are case-specific and are not presented in the paper.
The developed scheme for compilation of the array statement can be extended to cover compilation of some forms of other data-parallel constructs, such as the WHERE statement in Fortran 90, and to build the parameter list when passing array sections of distributed arrays to a subroutine call. We do not address these extensions in this paper.
We have presented techniques for handling one-dimensional arrays. The virtual processor approach can be extended to handle multi-dimensional arrays. The communication sets are developed independently for each array dimension using the techniques presented for one-dimensional arrays. The communication sets for the array statement involving multi-dimensional arrays are obtained by taking the cross-product of the appropriate sets for each dimension. Consider arrays $A(m_{a1}:n_{a1}, m_{a2}:n_{a2})$ and $B(m_{b1}:n_{b1}, m_{b2}:n_{b2})$ distributed using block-cyclic distributions on a $P_1 \times P_2$ processor mesh. The processors in the mesh are labelled $(p_1, p_2)$, with $0 \le p_1 < P_1$ and $0 \le p_2 < P_2$. Consider the array statement $B(l_{b1}:u_{b1}:s_{b1}, l_{b2}:u_{b2}:s_{b2}) = f(A(l_{a1}:u_{a1}:s_{a1}, l_{a2}:u_{a2}:s_{a2}))$. The active virtual processors on processor $(p_1, p_2)$ are $\mathit{VPAct}_{A1}(p_1) \times \mathit{VPAct}_{A2}(p_2)$, where $\mathit{VPAct}_{A1}(p_1)$ and $\mathit{VPAct}_{A2}(p_2)$ are the sets of active virtual processors corresponding to the first and second dimensions of array $A$, respectively, and $\times$ denotes the cross-product of two sets. The sets $\mathit{VPSend}()$, $\mathit{VPRecv}()$, $\mathit{VDSend}()$, and $\mathit{VDRecv}()$ for the multi-dimensional array are defined similarly. Thus the communication code for the multi-dimensional array statement would be as shown in Fig. 13. The send and receive sets for the first dimension have a subscript one, while those for the second dimension have a subscript two. This communication code should be expressed in two phases to eliminate multiple messages between two physical processors.
[Fig. 13: communication code for the multi-dimensional array statement; sending phase for processor $(p_1, p_2)$.]
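The cross-product construction for multi-dimensional arrays can be illustrated with a small Python sketch. It is a brute-force illustration under stated assumptions and does not reproduce the closed forms of the paper: indices are zero-based, section upper bounds are inclusive as in Fortran 90, the virtual-block view is used in each dimension, and a virtual processor is taken to be active if it is mapped to the given physical processor and holds at least one element of the section. The function names are hypothetical.

```python
from itertools import product


def active_vps(l, u, s, k, P, p):
    """One dimension, virtual-block view: virtual processors (blocks of
    size k) mapped to physical processor p that hold at least one element
    of the section l:u:s. Brute-force enumeration, u inclusive."""
    return sorted({i // k for i in range(l, u + 1, s) if (i // k) % P == p})


def active_vps_2d(sec1, sec2, dist1, dist2, proc):
    """Active virtual processors on mesh processor proc = (p1, p2),
    formed as the cross-product of the per-dimension sets.
    sec1, sec2 are (l, u, s) triples; dist1, dist2 are (k, P) pairs."""
    (k1, P1), (k2, P2) = dist1, dist2
    p1, p2 = proc
    dim1 = active_vps(*sec1, k1, P1, p1)   # set for the first dimension
    dim2 = active_vps(*sec2, k2, P2, p2)   # set for the second dimension
    return list(product(dim1, dim2))
```

Because each dimension is handled independently, the per-dimension sets can be generated once and reused for every mesh processor that shares the same coordinate in that dimension.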
Note that the closed form expressions developed for the communication sets in Section 4 allow for a different number of physical processors at the source and target distribution. Thus a change of the shape of the underlying processor grid and the multi-dimensional array can be handled by the virtual processor approach.
HPF supports a two-level mapping of data arrays to an abstract processor grid. The language introduces a Cartesian grid referred to as a template. Arrays are aligned to the template and the templates are distributed onto the processor grid.
The execution of the array statement in the paper follows the owner-computes rule, which is prevalent in compilers for HPF-like languages. However, the closed form expressions developed in Section 4 are not restricted to the owner-computes rule and can be either directly used or adapted to cases where the iterations are distributed using a regular distribution, as in KALI [12]. For instance, consider the KALI-like code in Fig. 14(a), in which the iterations of the specified loop and the arrays A and B are distributed among the processors using a block-cyclic distribution. The execution of the loop can be performed by the equivalent sequence of array statements in Fig. 14(b). The array $T(0:N-1)$ is distributed identically to the iterations of the loop. The communication sets for the array statements can be determined using the virtual processor approach and the closed form expressions developed in Section 4.
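To make the owner-computes setting concrete, the following Python sketch enumerates by brute force which source processor sends which elements of A to which target processor for a one-dimensional statement B(lb:ub:sb) = f(A(la:ua:sa)). It is a reference enumeration for small cases, not the closed-form method of Section 4. It assumes zero-based indices and inclusive upper bounds, and it allows different block sizes and processor counts at the source and target. The function names are hypothetical.

```python
from collections import defaultdict


def owner(i, k, P):
    """Owner of zero-based index i under block-cyclic distribution
    with block size k on P processors."""
    return (i // k) % P


def comm_sets(a_sec, b_sec, a_dist, b_dist):
    """Brute-force send sets under the owner-computes rule.
    a_sec, b_sec are (l, u, s) sections; a_dist, b_dist are (k, P) pairs.
    Returns {(src, dst): [indices of A that src sends to dst]}."""
    la, ua, sa = a_sec
    lb, ub, sb = b_sec
    ka, Pa = a_dist
    kb, Pb = b_dist
    a_idx = range(la, ua + 1, sa)
    b_idx = range(lb, ub + 1, sb)
    if len(a_idx) != len(b_idx):
        raise ValueError("sections must have the same number of elements")
    sends = defaultdict(list)
    for ia, ib in zip(a_idx, b_idx):
        # The owner of B(ib) computes, so the owner of A(ia) must supply it.
        sends[(owner(ia, ka, Pa), owner(ib, kb, Pb))].append(ia)
    return dict(sends)
```

For instance, `comm_sets((0, 10, 2), (1, 6, 1), (2, 2), (3, 3))` pairs A(0:10:2), distributed with block size 2 on two processors, with B(1:6:1), distributed with block size 3 on three processors. Such an enumeration costs time proportional to the section length, which is the overhead that regular-section closed forms are intended to avoid.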
The developed code is available by anonymous ftp from ftp.cis.ohio-state.edu:pub/hpce/compiler/Source/arraysect.
