Abstract. In many scientific applications, array redistribution is required to enhance data locality and reduce remote memory accesses in parallel programs on distributed memory multicomputers. Since the redistribution is performed at runtime, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a generalized processor mapping technique to minimize the amount of data exchange for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) array redistribution and vice versa. The main idea of the generalized processor mapping technique is first to develop mapping functions for computing a new rank for each destination processor. Based on the mapping functions, a new logical sequence of destination processors can be derived. The new logical processor sequence is then used to minimize the amount of data exchange in a redistribution. The generalized processor mapping technique can handle array redistribution with arbitrary source and destination processor sets and can be applied to multidimensional array redistribution. We present a theoretical model to analyze the performance improvement of the generalized processor mapping technique. To evaluate the proposed technique, we have implemented it on an IBM SP2 parallel machine. The experimental results show that the generalized processor mapping technique provides performance improvement over a wide range of redistribution problems.
INTRODUCTION
THE data parallel programming model has become a widely accepted paradigm for programming distributed memory multicomputers. To execute a data parallel program efficiently on a distributed memory multicomputer, an appropriate data decomposition is critical. Data decomposition involves data distribution and data alignment. Data distribution deals with how data arrays should be distributed among processors. Data alignment deals with how data arrays should be aligned with respect to one another. The purpose of data decomposition is to balance the computational load and minimize communication overheads.
Many data parallel programming languages, such as High Performance Fortran (HPF) [9], Fortran D [6], Vienna Fortran [33], and High Performance C (HPC) [28], provide compiler directives for programmers to specify array distribution. The array distributions provided by these languages, in general, can be classified into two categories, regular and irregular. The regular array distribution has three types, BLOCK, CYCLIC, and BLOCK-CYCLIC(c). The irregular array distribution uses user-defined array distribution functions to specify array distribution.
In some algorithms, such as the multidimensional fast Fourier transform [29], the Alternating Direction Implicit (ADI) method for solving two-dimensional diffusion equations, and linear algebra solvers [21], an array distribution that is well suited for one phase may not be good for a subsequent phase in terms of performance. Array redistribution is required for those algorithms at runtime. Therefore, many data parallel programming languages support runtime primitives for changing a program's array decomposition [1], [2], [9], [28], [33]. Since array redistribution is performed at runtime, there is a performance trade-off between the efficiency of a new data decomposition for a subsequent phase of an algorithm and the cost of redistributing arrays among processors. Thus, efficient methods for performing array redistribution are of great importance for the development of distributed memory compilers for those languages.
In this paper, we present a generalized processor mapping technique to minimize the amount of data exchange for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution and vice versa, thereby reducing the data transmission cost of a redistribution. Compared with the technique proposed by Kalns et al. [12], the generalized processor mapping technique is effective not only for BLOCK to BLOCK-CYCLIC(r) (or vice versa) redistribution but also for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) array redistribution. Another contribution of the generalized processor mapping technique is its ability to handle array redistribution with arbitrary source and destination processor sets. We also present a theoretical model to compute the amount of data that is retained locally and to analyze the performance improvement through a redistribution. The generalized processor mapping technique has the following characteristics:
- The generalized processor mapping technique can minimize the amount of data that needs to be communicated in BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) array redistribution, thereby reducing the data transmission cost of a redistribution.
- The generalized processor mapping technique can handle array redistribution with arbitrary source and destination processor sets, as well as multidimensional arrays.
- The proposed mapping functions determine a unique logical processor sequence that achieves the maximum amount of data retained locally in a redistribution.
- If the source processor set and the destination processor set of a redistribution are disjoint, the generalized processor mapping technique provides no benefit.

We have implemented the generalized processor mapping technique on an IBM SP2 parallel machine. The experimental results show that the generalized processor mapping technique provides performance improvement for most redistribution samples.
The rest of this paper is organized as follows: In Section 2, a brief survey of related work will be presented. In Section 3, we will introduce the notations and terminology used in this paper. Section 4 presents the generalized processor mapping technique for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution. In Section 5, we will present the generalized processor mapping technique for multidimensional array redistribution. The performance analysis and experimental results will be given in Section 6. Conclusions are given in Section 7.
RELATED WORK
Many methods for performing array redistribution have been presented in the literature. These techniques can be classified into multicomputer compiler techniques [27] and runtime support techniques. We briefly describe the related research in these two approaches.
Gupta et al. [7] derived closed form expressions to efficiently determine the send/receive processor/data sets. They also provided a virtual processor approach [8] for addressing the problem of reference index-set identification for array statements with BLOCK-CYCLIC(c) distribution and formulated active processor sets as closed forms. A recent work in [16] extended the virtual processor approach to address the problem of memory allocation and index-set identification. By using their method, closed form expressions for index-sets of arrays that were mapped to processors using one-level mapping can be translated to closed form expressions for index-sets of arrays that were mapped to processors using two-level mapping, and vice versa. A similar approach that addressed the problems of index-set and communication-set identification for array statements with BLOCK-CYCLIC(c) distribution was presented in [24]. In [24], the BLOCK-CYCLIC(k) distribution was viewed as a union of k CYCLIC distributions. Since the communication sets for a CYCLIC distribution are easy to determine, the communication sets for a BLOCK-CYCLIC(k) distribution can be generated in terms of unions and intersections of CYCLIC distributions.
In [3] , Chatterjee et al. enumerated the local memory access sequence of communication sets for array statements with BLOCK-CYCLIC(c) distribution based on a finite-state machine. In this approach, the local memory access sequence can be characterized by an FSM at most c states. In [17] , Kennedy et al. also presented algorithms to compute the local memory access sequence for array statements with BLOCK-CYCLIC(c) distribution. Lee and Chen [18] derived communication sets for statements of arrays which were distributed in arbitrary BLOCK-CYCLIC(c) fashion. They also presented closed form expressions of communication sets for restricted block size. In [4] , we proposed a basiccycle calculation to efficiently generate the communication sets for array redistribution. The greatest advantage of this method is the ability of fast indexing. In [11] , we proposed efficient algorithms for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution. The most significant improvement of the algorithms is that a processor does not need to construct the send/receive data sets for a redistribution.
Thakur et al. [25], [26] presented algorithms for runtime array redistribution in HPF programs. For BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution (or vice versa), in most cases, a processor scans its local array elements once to determine the destination (source) processor for each block of array elements of size r in the local array. In [10], an approach for generating communication sets by computing the intersections of index sets corresponding to the LHS and RHS of array statements was presented. The intersections are computed by a scanning approach that exploits the repetitive pattern of the intersection of two index sets. In [22], [23], Ramaswamy and Banerjee used a mathematical representation, PITFALLS, for regular data redistribution. The basic idea of PITFALLS is to find all intersections between source and destination distributions. Based on the intersections, the send/receive processor/data sets can be determined and general redistribution algorithms can be devised. Prylli and Tourancheau [21] proposed a runtime scan algorithm for BLOCK-CYCLIC array redistribution. Their approach has the same time complexity as that proposed in [23] but uses a simpler basic operation. The disadvantage of these approaches is that, when the number of processors is large, the number of iterations of the outermost loop in the intersection algorithms increases as well. This leads to high indexing overheads and degrades the performance of a redistribution algorithm.
The above studies focus on the efficient generation of communication sets. For the communication part, a spiral mapping technique [32] was proposed. The main idea of this approach is to map formal processors onto actual processors such that global communication can be translated into local communication within a certain processor group. Since the communication is local to a processor group, communication conflicts can be reduced when performing a redistribution. Kalns and Ni [12], [13] proposed a processor mapping technique to minimize the amount of data exchange for BLOCK to BLOCK-CYCLIC(r) redistribution and vice versa. Using the data to logical processor mapping, they showed that the technique can achieve the maximum ratio between data retained locally and the total amount of data exchanged. Walker and Otto [30] used the standardized Message Passing Interface (MPI) to express the redistribution operations. They implemented the BLOCK-CYCLIC array redistribution algorithms in both a synchronous and an asynchronous scheme. Because the synchronous scheme incurs excessive synchronization overheads, they also presented random and optimal scheduling algorithms for BLOCK-CYCLIC array redistribution.
Kaushik et al. [14], [15] proposed a multiphase redistribution approach for BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution. The main idea of multiphase redistribution is to perform a redistribution as a sequence of redistributions such that the communication cost of data movement among processors in the sequence is less than that of direct redistribution. Instead of redistributing the entire array at one time, a strip mining approach was presented in [31]. In this approach, portions of the array elements are redistributed in sequence in order to overlap communication and computation. In [19], a generalized circulant matrix formalism was proposed to reduce the communication overheads for BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution. Using the generalized circulant matrix formalism, the authors derived direct, indirect, and hybrid communication schedules for cyclic redistribution with the block size changed by an integer factor k. They also extended this technique to solve some multidimensional redistribution problems [20]. However, as the array size increases, these methods incur a large amount of extra transmission cost, which degrades the performance of a redistribution algorithm.
PRELIMINARIES
In general, a BLOCK-CYCLIC(s) over P processors to BLOCK-CYCLIC(t) over Q processors redistribution can be classified as one of three types:

1. s is divisible by t, i.e., BLOCK-CYCLIC(s = kr) to BLOCK-CYCLIC(t = r) redistribution,
2. t is divisible by s, i.e., BLOCK-CYCLIC(s = r) to BLOCK-CYCLIC(t = kr) redistribution, and
3. s is not divisible by t and t is not divisible by s.

To simplify the presentation, we use kr → r, r → kr, and s → t to represent the first, the second, and the third types of redistribution, respectively, for the rest of the paper (a small sketch of this classification follows Definition 1). In this section, we first present the terminology used in this paper.

Definition 1. Given a BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution, BLOCK-CYCLIC(s), BLOCK-CYCLIC(t), s, and t are called the source distribution, the destination distribution, the source distribution factor, and the destination distribution factor of the redistribution, respectively.
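As a concrete illustration of the classification above, the following minimal C sketch (our addition, not code from the paper; the function name classify is an illustrative choice) tests the two divisibility conditions and returns the redistribution type:

```c
#include <stdio.h>

/* Classify a BLOCK-CYCLIC(s) -> BLOCK-CYCLIC(t) redistribution. */
int classify(int s, int t)
{
    if (s % t == 0) return 1;   /* kr -> r : s = k*t, k = s/t      */
    if (t % s == 0) return 2;   /* r -> kr : t = k*s, k = t/s      */
    return 3;                   /* s -> t  : neither is divisible  */
}

int main(void)
{
    printf("%d\n", classify(10, 5)); /* type 1, k = 2 */
    printf("%d\n", classify(5, 10)); /* type 2, k = 2 */
    printf("%d\n", classify(6, 4));  /* type 3        */
    return 0;
}
```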
Definition 2. Given an s → t redistribution on A[1:N] over M processors, the source local array of processor P_i, denoted by SLA_Pi[0 : N/M − 1], is defined as the set of array elements that are distributed to processor P_i in the source distribution, where 0 ≤ i ≤ M − 1. The destination local array of processor P_j, denoted by DLA_Pj[0 : N/M − 1], is defined as the set of array elements that are distributed to processor P_j in the destination distribution, where 0 ≤ j ≤ M − 1.
Definition 3. Given an s → t redistribution on A[1:N], the source processor of an array element in A[1:N] or in DLA_Pj[0 : N/M − 1] is defined as the processor that owns the array element in the source distribution, where 0 ≤ j ≤ M − 1. The destination processor of an array element in A[1:N] or in SLA_Pi[0 : N/M − 1] is defined as the processor that owns the array element in the destination distribution, where 0 ≤ i ≤ M − 1.
Definition 4. Given an s → t redistribution on A[1:N], a global complete cycle (GCC) of A[1:N] is defined as GCC = lcm(s × P, t × Q). We define A[1 : GCC] as the first global complete cycle of A[1:N], A[GCC + 1 : 2 × GCC] as the second global complete cycle of A[1:N], and so on.
Definition 5. Given an s → t redistribution on A[1:N], a local complete cycle (LCC) of a local array is defined as LCC_s = GCC/P in the source distribution and LCC_d = GCC/Q in the destination distribution. We define SLA_Pi[0 : LCC_s − 1] (DLA_Pj[0 : LCC_d − 1]) as the first local complete cycle of the local array, SLA_Pi[LCC_s : 2 × LCC_s − 1] (DLA_Pj[LCC_d : 2 × LCC_d − 1]) as the second local complete cycle of the local array, and so on.
We now give examples to clarify the above definitions. Given a one-dimensional array A[1:100] and M = 5 processors, Fig. 1 shows a BLOCK to BLOCK-CYCLIC(10) redistribution on A over the five processors. In this paper, we assume that the local array index starts from 0 and the global array index starts from 1. According to Definitions 4 and 5, the size of the global complete cycle (GCC) is equal to 100 and the size of the local complete cycle (LCC) is equal to 20 in both the source and destination distributions.
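As an illustration of Definitions 4 and 5, the following C sketch (our addition; the gcd/lcm helpers are ours, not the paper's code) computes the GCC and LCC sizes for the redistribution of Fig. 1, where a BLOCK distribution of A[1:100] over five processors is equivalent to BLOCK-CYCLIC(20):

```c
#include <stdio.h>

long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }
long lcm(long a, long b) { return a / gcd(a, b) * b; }

int main(void)
{
    long s = 20, t = 10, P = 5, Q = 5;  /* Fig. 1 parameters          */
    long GCC = lcm(s * P, t * Q);       /* GCC = lcm(s*P, t*Q)        */
    printf("GCC = %ld, LCC_s = %ld, LCC_d = %ld\n",
           GCC, GCC / P, GCC / Q);      /* prints 100, 20, 20         */
    return 0;
}
```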
THE GENERALIZED PROCESSOR MAPPING TECHNIQUE FOR kr → r AND r → kr ARRAY REDISTRIBUTION
To perform a redistribution such as the one shown in Fig. 1, the communication cost is, in general, more expensive than the computation cost in terms of execution time. Therefore, techniques for reducing communication costs are very important. In [12], a processor mapping technique was proposed to minimize the amount of data exchange in a redistribution. The proposed technique addressed the case of BLOCK to BLOCK-CYCLIC(r) redistribution. Fig. 2a shows an example of the processor mapping technique for the redistribution shown in Fig. 1. In Fig. 2a, in the destination distribution, "NS" represents the normal sequence of logical processor ranks, which runs from 0 to M − 1, where M is the number of processors. "MS" represents the mapping sequence of logical processor ranks that is generated by the mapping function of the processor mapping technique. The shaded portions represent the data retained on the same logical processor through the redistribution. In the normal sequence scheme, 20 array elements are retained locally. In the mapping sequence scheme, however, 50 array elements are retained locally. Since the global array size is equal to 100, the processor mapping technique provides a 30 percent improvement in terms of data transmission time for the redistribution shown in Fig. 1. We consider another two examples. Fig. 2b and Fig. 2c show redistributions with a different array size and a different destination distribution factor, respectively. In Fig. 2b, a BLOCK to BLOCK-CYCLIC(10) redistribution with a larger array size, A[1:500], is shown. Both the normal sequence scheme and the mapping sequence scheme retain the same number of array elements locally; the processor mapping technique does not provide a larger amount of local data in this case. In Fig. 2c, a BLOCK to BLOCK-CYCLIC(4) redistribution on a one-dimensional array A[1:100] over five processors is shown. Similar to the result of Fig. 2b, the normal sequence and the mapping sequence schemes retain the same number of array elements locally. We have the following two observations:

1. Given a BLOCK to BLOCK-CYCLIC(r) redistribution with fixed destination distribution factor r, the processor mapping technique is not effective when the array size is larger than a certain threshold.
2. Given a BLOCK to BLOCK-CYCLIC(r) redistribution with fixed array size N, the processor mapping technique is not effective when the destination distribution factor r is smaller than a certain threshold.

In fact, BLOCK to BLOCK-CYCLIC(r) redistribution (or vice versa) is a special case of kr → r (or r → kr) array redistribution with k = N/(Mr), where N is the array size and M is the number of processors. For general redistribution problems, we derive a generalized processor mapping technique for kr → r (or r → kr) array redistribution to minimize the amount of data exchange.
According to the values of LCC_s, LCC_d, and kr, kr → r and r → kr array redistributions can be classified into two types, the optimal type and the general type, as shown in Table 1. In the optimal type, the generalized processor mapping technique can derive a mapping sequence such that the amount of data exchange is minimal. In the general type, the generalized processor mapping technique can derive a mapping sequence that reduces the amount of data exchange. We will discuss the generalized processor mapping technique for the optimal type and the general type in Section 4.1 and Section 4.2, respectively.

Lemma 1. Given an s → t redistribution on A[1:N] over M processors, if an array element A[i] is distributed to destination processor P_j, so are A[i + GCC], A[i + 2 × GCC], ..., and A[i + (N/GCC − 1) × GCC], where 0 ≤ j ≤ M − 1 and 1 ≤ i ≤ GCC.

Proof. Since GCC = M × lcm(s, t) and LCC = lcm(s, t), both the source and destination distributions repeat with period GCC; hence, array elements separated by multiples of GCC are distributed identically. □

Lemma 2. Given an s → t redistribution on A[1:N] over M processors, for a destination processor P_j, the elements DLA_Pj[m], DLA_Pj[m + LCC], DLA_Pj[m + 2 × LCC], ..., and DLA_Pj[m + N/M − LCC] have the same source processor, where 0 ≤ j ≤ M − 1 and 0 ≤ m ≤ LCC − 1.
Proof. The proof of this lemma is similar to that of Lemma 1. □

Given a one-dimensional array A[1:100] and M = 5 processors, Fig. 3 shows a BLOCK-CYCLIC(10) to BLOCK-CYCLIC(5) redistribution on A over the M processors. According to Lemmas 1 and 2, each local complete cycle (LCC) has the same communication patterns. In Fig. 3, for source processor P_2, array elements SLA_P2[0:9] and SLA_P2[10:19] are in the first and the second LCC, respectively, and they have the same communication patterns. Therefore, for a kr → r redistribution, a processor only needs to construct the communication sets for its first LCC; it can then perform the whole redistribution. Similarly, to present the generalized processor mapping technique, we only discuss how to derive a mapping sequence in the first LCC.
A. Same source and destination processor sets: Given a kr → r redistribution on a one-dimensional array A[1:N] over M processors, we use ⟨P_0, P_1, P_2, ..., P_{M−1}⟩ and ⟨P̄_0, P̄_1, P̄_2, ..., P̄_{M−1}⟩ to represent the normal sequence and the mapping sequence, respectively, where P̄_j represents the new logical processor rank of P_j. The main idea of the generalized processor mapping technique is to distribute the global array elements onto destination processors according to the mapping sequence instead of the normal sequence in the destination distribution. For a destination processor P_j, the new logical processor rank of P_j can be determined by the following equation:

P̄_j = (j mod k) × ⌊M/k⌋ + min(j mod k, M mod k) + ⌊j/k⌋,    (1)

where j = 0 to M − 1. Fig. 4 shows a BLOCK-CYCLIC(10) to BLOCK-CYCLIC(5) redistribution on a one-dimensional array A[1:100] over five processors. Two kinds of logical processor sequences are illustrated in this example. The normal sequence of destination processor ranks is ⟨P_0, P_1, P_2, P_3, P_4⟩. According to (1), the new ranks of destination processors P_0, P_1, P_2, P_3, and P_4 are equal to 0, 3, 1, 4, and 2, respectively. Therefore, the mapping sequence of destination processors is ⟨P_0, P_3, P_1, P_4, P_2⟩. From Fig. 4, we can see that 20 array elements are retained locally in the normal sequence scheme, while 50 array elements are retained locally in the mapping sequence scheme. The generalized processor mapping technique provides a larger amount of local data. The following lemma shows that the mapping sequence generated by (1) achieves the maximum amount of data retained on the same logical processor through a redistribution.
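The following C sketch computes the mapping; the closed form is our reading of (1), reconstructed from the worked example rather than taken verbatim, and it reproduces the new ranks 0, 3, 1, 4, 2 of Fig. 4:

```c
#include <stdio.h>

/* Our reading of mapping function (1) for a kr->r redistribution
 * over M processors: the new rank of destination processor P_j. */
static int map_rank(int j, int k, int M)
{
    int lo = (j % k < M % k) ? j % k : M % k;  /* min(j mod k, M mod k) */
    return (j % k) * (M / k) + lo + j / k;
}

int main(void)
{
    int M = 5, k = 2;   /* Fig. 4: BC(10) -> BC(5) over 5 processors */
    for (int j = 0; j < M; j++)
        printf("P%d -> %d\n", j, map_rank(j, k, M)); /* 0 3 1 4 2 */
    return 0;
}
```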
Lemma 3. Given a kr → r redistribution on a one-dimensional array A[1:N] over M processors, (1) determines a mapping sequence of destination processors that achieves the maximum ratio ⌈k/M⌉ : k between local data and the global array size.

Proof. We prove the lemma in two parts: 1) the maximum ratio is ⌈k/M⌉ : k; 2) the mapping sequence generated by (1) achieves the maximum ratio ⌈k/M⌉ : k.

1. Given a kr → r redistribution, for a source processor P_i, where 0 ≤ i ≤ M − 1: If k < M, then at most r elements are retained in the local array in each local complete cycle. Since there are M local complete cycles in a GCC, the total amount of data retained in local arrays is Mr. In a kr → r redistribution, GCC = Mkr; therefore, the ratio between local data and the number of array elements in a GCC is Mr : GCC = Mr : Mkr = 1 : k. This is equal to ⌈k/M⌉ : k, since ⌈k/M⌉ = 1 when k < M. If k ≥ M, then at most r × ⌈k/M⌉ elements are retained in the local array in each local complete cycle. Since there are M local complete cycles in a GCC, the total amount of data retained in local arrays is Mr × ⌈k/M⌉. In a kr → r redistribution, GCC = Mkr; therefore, the ratio between local data and the number of array elements in a GCC is Mr × ⌈k/M⌉ : Mkr = ⌈k/M⌉ : k. From the above description, the maximum ratio is ⌈k/M⌉ : k.

2. In a kr → r redistribution, each GCC has the same communication patterns; therefore, we only need to prove that the mapping sequence achieves the maximum ratio in one GCC. If k < M, in the source distribution, the source processors of the array elements in LCC_0, LCC_1, ..., and LCC_{M−1} (i.e., A[1:kr], A[kr + 1 : 2kr], ..., and A[(M − 1) × kr + 1 : Mkr]) are P_0, P_1, ..., and P_{M−1}, respectively. In the destination distribution, under the mapping sequence, the destination processors of the first r array elements of the local complete cycles LCC_0, LCC_1, ..., and LCC_{M−1} (i.e., A[1:r], A[kr + 1 : kr + r], ..., and A[(M − 1) × kr + 1 : (M − 1) × kr + r]) are also P_0, P_1, ..., and P_{M−1}, respectively. Therefore, Mr array elements are retained on the same logical processor between the source and destination distributions in a GCC. The ratio between local data and the global array size in a GCC is equal to Mr : GCC = Mr : Mkr = 1 : k, which equals ⌈k/M⌉ : k when k < M. That means the mapping sequence achieves the ratio 1 : k. For the case of k ≥ M, the proof of this part is similar to the above.

Therefore, from these two parts, we know that (1) determines a logical sequence of destination processors that achieves the maximum ratio ⌈k/M⌉ : k between local data and the global array size. □

B. Different source and destination processor sets: Given a kr → r redistribution on a one-dimensional array A[1:N] over P source processors and Q destination processors, we use ⟨Q_0, Q_1, Q_2, ..., Q_{Q−1}⟩ and ⟨Q̄_0, Q̄_1, Q̄_2, ..., Q̄_{Q−1}⟩ to represent the normal sequence and the mapping sequence, respectively, where Q̄_j represents the new logical processor rank of Q_j. For a destination processor Q_j, the new logical processor rank of Q_j can be determined by the following equation:

Q̄_j = (j mod k) × ⌊Q/k⌋ + min(j mod k, Q mod k) + ⌊j/k⌋,    (2)

where j = 0 to Q − 1. An example of the generalized processor mapping technique for kr → r redistribution with different source and destination processor sets is shown in Fig. 5. In Fig. 5, there are four source processors and eight destination processors. According to (2), the new ranks of destination processors Q_0, Q_1, Q_2, Q_3, Q_4, Q_5, Q_6, and Q_7 are equal to 0, 4, 1, 5, 2, 6, 3, and 7, respectively. In the mapping sequence scheme, 20 array elements are retained locally in a global complete cycle. Since GCC = 40, the ratio between local data and the global array size is equal to 20 : 40 = 1 : 2 (a sketch of (2) follows Lemma 4 below). The following lemma shows that the generalized processor mapping technique achieves the maximum amount of data retained locally for a kr → r redistribution if LCC_s is equal to kr.
Lemma 4. Given a kr → r redistribution on a one-dimensional array A[1:N] over P source processors and Q destination processors: If LCC_s = kr, (2) determines a mapping sequence of destination processors that achieves the maximum ratio ⌈k/Q⌉ : k between local data and the global array size.

Proof. We prove the lemma in two parts: 1) the maximum ratio is ⌈k/Q⌉ : k; 2) the mapping sequence generated by (2) achieves the maximum ratio.

1. Given a kr → r redistribution, for a source processor P_i, where 0 ≤ i ≤ P − 1: If k < Q, then at most r elements are retained in the local array in each local complete cycle. Since there are P local complete cycles in a GCC, the total amount of data retained in local arrays is Pr. In a kr → r redistribution with LCC_s = kr, GCC = Pkr; therefore, the ratio between local data and the number of array elements in a GCC is Pr : GCC = Pr : Pkr = 1 : k. This is equal to ⌈k/Q⌉ : k, since ⌈k/Q⌉ = 1 when k < Q. If k ≥ Q, then at most r × ⌈k/Q⌉ elements are retained in the local array in each local complete cycle. Since there are P local complete cycles in a GCC, the total amount of data retained in local arrays is Pr × ⌈k/Q⌉, and the ratio between local data and the number of array elements in a GCC is Pr × ⌈k/Q⌉ : Pkr = ⌈k/Q⌉ : k. From the above description, the maximum ratio between local data and the global array size is ⌈k/Q⌉ : k.

2. If k < Q, in the source distribution, the source processors of the array elements in LCC_0, LCC_1, ..., and LCC_{P−1} are P_0, P_1, ..., and P_{P−1}, respectively. In the destination distribution, under the mapping sequence generated by (2), the destination processors of the first r array elements of LCC_0, LCC_1, ..., and LCC_{P−1} are Q_0, Q_1, ..., and Q_{P−1}, respectively. Therefore, Pr array elements are retained on the same logical processor between the source and destination distributions in a GCC. The ratio between local data and the global array size in a GCC is equal to Pr : GCC = Pr : Pkr = 1 : k. That means the mapping sequence achieves the ratio ⌈k/Q⌉ : k = 1 : k when k < Q. For the case of k ≥ Q, the proof of this part is similar to the above. Therefore, we know that (2) determines a logical sequence of destination processors that achieves the maximum ratio ⌈k/Q⌉ : k between local data and the global array size. □
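The sketch below applies the mapping function over the Q destination processors, under the same reconstruction caveat as for (1): the closed form is our reading of (2). It reproduces the new ranks 0, 4, 1, 5, 2, 6, 3, 7 of Fig. 5 (Q = 8) and 0, 3, 1, 4, 2, 5 of Fig. 7 (Q = 6):

```c
#include <stdio.h>

/* Our reading of mapping function (2): new rank of destination
 * processor Q_j among Q destination processors. */
static int map_rank(int j, int k, int Q)
{
    int lo = (j % k < Q % k) ? j % k : Q % k;  /* min(j mod k, Q mod k) */
    return (j % k) * (Q / k) + lo + j / k;
}

int main(void)
{
    int k = 2;
    int qs[2] = { 8, 6 };            /* Fig. 5 (Q = 8), Fig. 7 (Q = 6) */
    for (int t = 0; t < 2; t++) {
        for (int j = 0; j < qs[t]; j++)
            printf("%d ", map_rank(j, k, qs[t]));
        printf("\n");                /* 0 4 1 5 2 6 3 7 / 0 3 1 4 2 5 */
    }
    return 0;
}
```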
r → kr Array Redistribution
A. Same source and destination processor sets: In this section, we present the generalized processor mapping technique for r → kr array redistribution with the same source and destination processor sets. Given an r → kr redistribution on a one-dimensional array A[1:N] over M processors, we use ⟨P_0, P_1, P_2, ..., P_{M−1}⟩ and ⟨P̄_0, P̄_1, P̄_2, ..., P̄_{M−1}⟩ to represent the normal sequence and the mapping sequence, respectively, where P̄_j represents the new logical processor rank of P_j. For a destination processor P_j, the new logical processor rank of P_j can be determined by inverting the permutation defined by (1); that is, P̄_j is the unique rank i with (i mod k) × ⌊M/k⌋ + min(i mod k, M mod k) + ⌊i/k⌋ = j (3), where j = 0 to M − 1 (a sketch of this inversion follows Lemma 6 below). Fig. 6 shows a BLOCK-CYCLIC(5) to BLOCK-CYCLIC(10) redistribution on a one-dimensional array A[1:100] over five processors. In Fig. 6, two kinds of logical processor sequences are illustrated. The normal sequence of destination processor ranks is ⟨P_0, P_1, P_2, P_3, P_4⟩. According to (3), the new ranks of destination processors P_0, P_1, P_2, P_3, and P_4 are equal to 0, 2, 4, 1, and 3, respectively. Therefore, the mapping sequence of destination processors is ⟨P_0, P_2, P_4, P_1, P_3⟩. From Fig. 6, we can see that 20 array elements are retained locally in the normal sequence scheme, while 50 array elements are retained locally in the mapping sequence scheme. The generalized processor mapping technique provides a larger amount of local data. The following lemma states that the mapping sequence generated by (3) achieves the maximum amount of data retained on the same logical processor through an r → kr redistribution.
Lemma 5. Given an r → kr redistribution on a one-dimensional array A[1:N] over M processors, (3) determines a mapping sequence of destination processors that achieves the maximum ratio ⌈k/M⌉ : k between local data and the global array size.
Proof. The proof of this lemma can be easily established according to Lemma 3. □

B. Different source and destination processor sets: Given an r → kr redistribution on a one-dimensional array A[1:N] over P source processors and Q destination processors, we use ⟨Q_0, Q_1, Q_2, ..., Q_{Q−1}⟩ and ⟨Q̄_0, Q̄_1, Q̄_2, ..., Q̄_{Q−1}⟩ to represent the normal sequence and the mapping sequence, respectively, where Q̄_j represents the new logical processor rank of Q_j. For a destination processor Q_j, the new logical processor rank of Q_j can be determined by inverting (2); that is, Q̄_j is the unique rank i with (i mod k) × ⌊Q/k⌋ + min(i mod k, Q mod k) + ⌊i/k⌋ = j (4), where j = 0 to Q − 1.

Lemma 6. Given an r → kr redistribution on a one-dimensional array A[1:N] over P source processors and Q destination processors: If LCC_d = kr, (4) determines a mapping sequence of destination processors that achieves the maximum ratio ⌈k/Q⌉ : k between local data and the global array size.

Proof. The proof of this lemma can be easily established according to Lemma 4. □
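Since the r → kr mapping inverts the kr → r mapping, the sketch below (ours) builds the mapping sequence of (3)/(4) by numerically inverting the permutation rather than relying on a closed form; it reproduces the new ranks 0, 2, 4, 1, 3 of Fig. 6:

```c
#include <stdio.h>

/* Our reading of the kr->r mapping (1)/(2). */
static int kr_to_r_rank(int j, int k, int M)
{
    int lo = (j % k < M % k) ? j % k : M % k;
    return (j % k) * (M / k) + lo + j / k;
}

int main(void)
{
    int M = 5, k = 2, inv[64];
    for (int j = 0; j < M; j++)
        inv[kr_to_r_rank(j, k, M)] = j;   /* invert the permutation */
    for (int j = 0; j < M; j++)
        printf("%d ", inv[j]);            /* 0 2 4 1 3, as in Fig. 6 */
    printf("\n");
    return 0;
}
```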
General Type
According to Table 1, there are two kinds of redistribution in the general type: kr → r redistribution with LCC_s ≠ kr and r → kr redistribution with LCC_d ≠ kr. For kr → r redistribution, the mapping function is the same as (2). For r → kr redistribution, the mapping function is the same as (4). Fig. 7 shows an example of kr → r redistribution with LCC_s = 3kr. According to (2), the mapping sequence is ⟨Q_0, Q_3, Q_1, Q_4, Q_2, Q_5⟩. In Fig. 7, the normal sequence scheme and the mapping sequence scheme provide the same amount of local data; the generalized processor mapping technique does not provide a larger amount of local data than the normal method in this case. Fig. 8 shows another example of kr → r redistribution, with LCC_s = 2kr. According to (2), the mapping sequence is ⟨Q_0, Q_6, Q_1, Q_7, Q_2, Q_8, Q_3, Q_9, Q_4, Q_10, Q_5, Q_11⟩. In Fig. 8, the mapping sequence scheme provides a larger amount of local data than the normal sequence scheme. From these two examples, we see that the processor mapping technique provides different improvements for different kr → r redistributions. In Section 6, we will present a theoretical model to analyze the amount of local data under the generalized processor mapping technique. The model can also calculate the improvement of the generalized processor mapping technique for kr → r redistribution and vice versa.
MULTIDIMENSIONAL ARRAY REDISTRIBUTION
The generalized processor mapping technique can be extended to multidimensional arrays. To simplify the presentation, we use BC(k_0 r_0, k_1 r_1, ..., k_{n−1} r_{n−1}) → BC(r_0, r_1, ..., r_{n−1}) to represent an n-dimensional kr → r redistribution and BC(r_0, r_1, ..., r_{n−1}) → BC(k_0 r_0, k_1 r_1, ..., k_{n−1} r_{n−1}) to represent the reverse case. Since the source and destination processor sets may be different, we use P_0 × P_1 × ... × P_{n−1} and Q_0 × Q_1 × ... × Q_{n−1} to represent the source and the destination processor grids, respectively. The mapping functions (2) and (4) can be extended as follows. Given a BC(k_0 r_0, ..., k_{n−1} r_{n−1}) → BC(r_0, ..., r_{n−1}) redistribution, for a destination processor Q_j in the ith dimension, the new logical processor rank Q̄_j is determined by applying (2) in that dimension with k = k_i and Q = Q_i, where 0 ≤ j ≤ Q_i − 1 and 0 ≤ i ≤ n − 1. Given a BC(r_0, ..., r_{n−1}) → BC(k_0 r_0, ..., k_{n−1} r_{n−1}) redistribution, the new logical processor rank Q̄_j of a destination processor Q_j in the ith dimension is determined by applying (4) with k = k_i and Q = Q_i, where 0 ≤ j ≤ Q_i − 1 and 0 ≤ i ≤ n − 1.

Fig. 7. kr → r redistribution on A[1:N] with different sequences of destination processor ranks, where k = 2, r = 5, N = 120, P = 4, Q = 6.
Fig. 8. kr → r redistribution on A[1:N] with different sequences of destination processor ranks, where k = 2, r = 5, N = 120, P = 3, Q = 12.
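The multidimensional extension amounts to applying the one-dimensional mapping independently in each dimension of the destination processor grid, as the following sketch shows. The grid shape and the per-dimension k values below are illustrative assumptions, not taken from the paper:

```c
#include <stdio.h>

/* Our reading of mapping function (2), applied per dimension. */
static int map_rank(int j, int k, int Q)
{
    int lo = (j % k < Q % k) ? j % k : Q % k;
    return (j % k) * (Q / k) + lo + j / k;
}

int main(void)
{
    int n = 2;
    int Q[2]  = { 4, 6 };   /* destination grid Q_0 x Q_1 (assumed) */
    int kk[2] = { 2, 3 };   /* kr->r factor per dimension (assumed) */
    for (int d = 0; d < n; d++) {
        printf("dim %d:", d);
        for (int j = 0; j < Q[d]; j++)
            printf(" %d", map_rank(j, kk[d], Q[d]));
        printf("\n");       /* dim 0: 0 2 1 3 ; dim 1: 0 2 4 1 3 5 */
    }
    return 0;
}
```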
PERFORMANCE EVALUATION AND EXPERIMENTAL RESULTS

Theoretical Analysis
From the description in Section 4, we know that the generalized processor mapping technique can reduce the data transmission cost for kr → r redistribution and vice versa. In this section, we present a theoretical model to analyze the performance of the generalized processor mapping technique.
kr → r and r → kr Array Redistribution
We first consider the case of kr → r and r → kr array redistribution with the same source and destination processor sets. Given a kr → r (or r → kr) array redistribution on a one-dimensional array A[1:N] over M processors, since each global complete cycle (GCC) has the same communication patterns, we only consider the redistribution patterns in a GCC. To compare the normal method and the generalized processor mapping technique, we use L_normal and L_mapping to represent the amount of local data generated by the normal sequence and the mapping sequence in a GCC, respectively. The total amount of local data for a redistribution is therefore equal to L_normal × N/GCC for the normal sequence and L_mapping × N/GCC for the mapping sequence.

Given a kr → r (or r → kr) redistribution on a one-dimensional array A[1:N] over M processors, the value of L_normal is given by the closed-form expression (7), stated in terms of a quantity defined in (8), which in turn is defined through the function A(i) of (9). The value of L_mapping is given by (10); by Lemma 3, L_mapping = Mr × ⌈k/M⌉ in a GCC.

According to (7) and (10), the generalized processor mapping technique provides a larger amount of local data than the normal method whenever M exceeds the quantity defined in (8). Since that quantity is smaller than or equal to M, the generalized processor mapping technique is effective for all kr → r (or r → kr) redistributions.
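As a cross-check of (7) and (10), the following C sketch (ours) computes L_normal and L_mapping by brute force: it enumerates the r-blocks of one GCC and counts the elements whose source and destination processors coincide. It assumes the same source and destination processor sets and uses our reading of mapping function (1); it reproduces L_normal = 10 and L_mapping = 25 per GCC for the Fig. 4 example, and L_normal = 4 and L_mapping = 100 for the k = 2 sample discussed in Section 6.2 (assuming M = 50 and r = 2 there):

```c
#include <stdio.h>

/* Our reading of mapping function (1). */
static int map_rank(int j, int k, int M)
{
    int lo = (j % k < M % k) ? j % k : M % k;
    return (j % k) * (M / k) + lo + j / k;
}

/* Count elements retained locally in one GCC of a kr->r
 * redistribution over M processors; use_mapping = 0 gives
 * L_normal and use_mapping = 1 gives L_mapping. */
static long retained(int M, int k, int r, int use_mapping)
{
    long count = 0;
    for (int b = 0; b < M * k; b++) {          /* M*k r-blocks per GCC */
        int src = (b / k) % M;                 /* owner under BC(kr)   */
        int dst = use_mapping ? map_rank(b % M, k, M) : b % M;
        if (src == dst)
            count += r;                        /* whole block retained */
    }
    return count;
}

int main(void)
{
    /* Fig. 4: M=5, k=2, r=5 -> 10 and 25 per GCC (GCC = 50)      */
    printf("%ld %ld\n", retained(5, 2, 5, 0), retained(5, 2, 5, 1));
    /* Section 6.2 sample, k=2 (assumed M=50, r=2) -> 4 and 100   */
    printf("%ld %ld\n", retained(50, 2, 2, 0), retained(50, 2, 2, 1));
    return 0;
}
```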
Given a kr → r (or r → kr) redistribution with different source and destination processor sets, the theoretical analysis is constructed as follows:

- kr → r: Given a kr → r redistribution, if LCC_s = mkr, where m is a positive integer, the value of L_normal is given by the closed-form expression (12), stated in terms of a quantity defined in (13), which in turn is defined through the function B(i) of (14) and (15).
- r → kr: Given an r → kr redistribution with LCC_d = mkr, where m is a positive integer, the theoretical model for r → kr redistribution can be constructed by exchanging the variables P and Q in (12) to (15).
Experimental Results
To verify the performance analysis presented in Section 6.1, we have incorporated the generalized processor mapping technique into the algorithms proposed in [11] for kr → r and r → kr redistribution. We call the algorithms with and without the generalized processor mapping technique GPMT_KRR and KRR, respectively. All algorithms were written in the single program multiple data (SPMD) programming paradigm in C with MPI and executed on an IBM SP2 parallel machine. To obtain the experimental results, each test sample with a particular array size was executed 14 times by each algorithm. The mean time of these 14 runs, excluding the two maximum and the two minimum values, was used as the time to perform a redistribution. Single-precision arrays were used for the tests. Table 2 shows the time of GPMT_KRR and KRR to execute different kr → r and r → kr redistributions on a 50-node SP2. From Table 2, we have the following two observations:

1. The improvement of GPMT_KRR increases as the value of k decreases.
2. The improvement of GPMT_KRR is more significant when the array size increases.

The reason for the first observation is that the amount of local data provided by the normal sequence is far less than that of the mapping sequence when the value of k is small. For example, when k is equal to 2, the values of L_normal and L_mapping are equal to 4 and 100, respectively. In this case, GCC = 200, so the mapping sequence provides 96/200 (= 48 percent) improvement. When k is equal to 5, the values of L_normal and L_mapping are equal to 12 and 100, respectively. In this case, GCC = 500, so the mapping sequence provides 88/500 (= 17.6 percent) improvement. Therefore, when the value of k increases, the performance of GPMT_KRR and KRR becomes close. These phenomena match the performance analysis presented in Section 6.1. Fig. 9 shows the performance of GPMT_KRR and KRR when executing the redistribution samples (k = 2, 5, 10) shown in Table 2. The array size is 1.6 × 10^8 bytes. From Fig. 9a and Fig. 9b, we can see that the performance of GPMT_KRR and KRR becomes comparable as the value of k increases. The reason for the second observation is that, when the array size is small, the communication time is not significant in terms of the total redistribution time, so the improvement of the generalized processor mapping technique is not significant either. When the array size is large, the communication time dominates the performance of a redistribution; therefore, the improvement of the generalized processor mapping technique is more significant. Table 3 shows the time of GPMT_KRR and KRR to execute different kr → r and r → kr redistributions with different source and destination processor sets, where P = 50 and Q = 40. Fig. 10 shows the performance of GPMT_KRR and KRR when executing the redistribution samples (k = 2, 5, 10) shown in Table 3. From Table 3 and Fig. 10, we have observations similar to those obtained from Table 2 and Fig. 9. Fig. 11a and Fig. 11b show the performance of GPMT_KRR and KRR when executing two-dimensional BC(10, 4) → BC(5, 2) and BC(20, 15) → BC(5, 5) redistributions, respectively. Table 4 shows the execution times of the redistributions shown in Fig. 11. From Fig. 11, we can see that the improvement of the generalized processor mapping technique in Fig. 11a is larger than that in Fig. 11b. For the BC(10, 4) → BC(5, 2) and BC(20, 15) → BC(5, 5) redistributions, the values of k_0 and k_1 in Fig. 11a are smaller than the values of k_0 and k_1 in Fig. 11b. According to the first observation from Table 2, GPMT_KRR can achieve a larger improvement in Fig. 11a. From Fig. 11, we also observe that the improvement is more significant when the array size becomes large. The reason is the same as that described for Table 2.
From the above performance analysis and experimental results, we have the following remarks:

Remark 1. The generalized processor mapping technique can minimize the amount of data exchange for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) array redistribution. The data transmission cost can, therefore, be reduced.
Remark 2. The generalized processor mapping technique provides significant improvement when the value of k is small. However, when the value of k is large, the improvement of the generalized processor mapping technique will be limited.
Remark 3. The generalized processor mapping technique provides significant improvement when the array size is large.
CONCLUSIONS
Array redistribution is often required in data-parallel programs to enhance data locality and reduce remote memory accesses across algorithm phases. Since it is performed at runtime, efficient methods for array redistribution are required. In this paper, we have presented a generalized processor mapping technique to minimize the amount of data that needs to be communicated for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) array redistribution and vice versa. Based on the mathematical mapping functions, a new sequence of logical processors is derived that achieves the maximum amount of data retained locally through a redistribution. Lowering the data transmission cost in this way reduces the communication cost of a redistribution. The generalized processor mapping technique can handle array redistribution with arbitrary source and destination processor sets and can be applied to multidimensional arrays. The theoretical model and the experimental results show that the generalized processor mapping technique provides performance improvement over a wide range of redistribution problems. When the array size is large and the value of k is small, the generalized processor mapping technique performs very well for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) array redistribution and vice versa. Our techniques can only handle dense arrays and in-core programs. There are some possible extensions that could be made. One issue would be to consider out-of-core array redistribution. Another important future research direction would be to investigate redistribution techniques for irregular scientific computation programs. It would also be interesting to consider the redistribution of sparse arrays.
