Abstract. Traditional parallel schedulers running on cluster supercomputers support only static scheduling, where the number of processors allocated to an application remains fixed throughout the execution of the job. This results in underutilization of idle system resources thereby decreasing overall system throughput. In our research, we have developed a prototype framework called ReSHAPE, which supports dynamic resizing of parallel MPI applications executing on distributed memory platforms. The resizing library in ReSHAPE includes support for releasing and acquiring processors and efficiently redistributing application state to a new set of processors. In this paper, we derive an algorithm for redistributing two-dimensional block-cyclic arrays from P to Q processors, organized as 2-D processor grids. The algorithm ensures a contention-free communication schedule for data redistribution if Pr ≤ Qr and Pc ≤ Qc. In other cases, the algorithm implements circular row and column shifts on the communication schedule to minimize node contention.
Introduction
As terascale supercomputers become more common and as the high-performance computing (HPC) community turns its attention to petascale machines, the challenge of providing effective resource management for high-end machines grows in both importance and difficulty. A fundamental problem is that conventional parallel schedulers are static, i.e., once a job is allocated a set of resources, they remain fixed throughout the life of an application's execution. It is worth asking whether a dynamic resource manager, which has the ability to modify resources allocated to jobs at runtime, would allow more effective resource management. The focus of our research is on dynamically reconfiguring parallel applications to use a different number of processes, i.e., on dynamic resizing of applications. In order to explore the potential benefits and challenges of dynamic resizing, we are developing ReSHAPE, a framework for dynamic Resizing and Scheduling of Homogeneous Applications in a Parallel Environment. The ReSHAPE framework includes a programming model and an API, data redistribution algorithms and a runtime library, and a parallel scheduling and resource management system framework. ReSHAPE allows the number of processors allocated to a parallel message-passing application to be changed at run time. It targets long-running iterative computations, i.e., homogeneous computations that perform similar computational steps over and over again. By monitoring the performance of such computations on various processor sizes, the ReSHAPE scheduler can take advantage of idle processors on large clusters to improve the turnaround time of high-priority jobs, or shrink low-priority jobs to meet quality-of-service or advanced reservation commitments.
Dynamic resizing necessitates runtime redistribution of application data. Many high-performance computing applications and mathematical libraries like ScaLAPACK [1] require block-cyclic data redistribution to achieve computational efficiency. Data redistribution involves four main stages: data identification and index computation, communication schedule generation, message packing and unpacking, and finally, data transfer. Each processor identifies its part of the data to redistribute and transfers the data in the message passing step according to the order specified in the communication schedule. Node contention occurs when more than one processor sends a message to the same destination processor in a single communication step. A redistribution communication schedule aims to minimize node contention and maximize network bandwidth utilization. Data is packed, or marshalled, on the source processor to form a message and is unmarshalled on the destination processor.
In this paper, we present an algorithm for redistributing two-dimensional block-cyclic data from $P$ ($P_r$ rows $\times$ $P_c$ columns) to $Q$ ($Q_r$ rows $\times$ $Q_c$ columns) processors, organized as 2-D processor grids. We evaluate the algorithm's performance by measuring the redistribution time for different block-cyclic matrices. If $P_r \le Q_r$ and $P_c \le Q_c$, the algorithm ensures a contention-free communication schedule for redistributing data from the source processor set $P$ to the destination processor set $Q$. In other cases, the algorithm minimizes node contention by performing row or column circular shifts on the communication schedule. The algorithm discussed in this paper supports 2-D block-cyclic data redistribution only for one- and two-dimensional processor topologies. We also discuss in detail the modifications needed to port an existing scientific application to use the dynamic resizing capability of ReSHAPE using the API provided by the framework.
The rest of the paper is organized as follows: Section 2 discusses prior work in the area of data redistribution. Section 3 briefly reviews the architecture of the ReSHAPE framework and discusses in detail the two-dimensional redistribution algorithm and the ReSHAPE API. Section 4 reports our experimental results of the redistribution algorithm with the ReSHAPE framework tested on the SystemX cluster at Virginia Tech. We conclude in Section 5 discussing future directions to this research.
Related Work
Data redistribution within a cluster using a message passing approach has been extensively studied in the literature. Many past research efforts [2] [12] targeted the redistribution of cyclically distributed one-dimensional arrays between the same set of processors within a cluster on a 1-D processor topology. To reduce the redistribution overhead cost, Walker and Otto [12] and Kaushik [7] proposed a K-step communication schedule based on modulo arithmetic and tensor products, respectively. Ramaswamy and Banerjee [9] proposed a redistribution technique, PITFALLS, that uses line segments to map array elements to a processor. This algorithm can handle any arbitrary number of source and destination processors. However, it does not use communication schedules during redistribution, resulting in node contentions during data transfer. Thakur et al. [11] [10] use gcd and lcm methods for redistributing cyclically distributed one-dimensional arrays on the same processor set. The algorithms described by Thakur et al. [10] and Ramaswamy [9] use a series of one-dimensional redistributions to handle multidimensional arrays. This approach can result in significant redistribution overhead cost due to unwanted communication. Kalns and Ni [6] presented a technique for mapping data to processors by assigning logical processor ranks to the target processors. This technique reduces the total amount of data that must be communicated during redistribution. Hsu et al. [5] further extended this work and proposed a generalized processor mapping technique for redistributing data from cyclic(kx) to cyclic(x), and vice versa. Here, x denotes the number of data blocks assigned to each processor. However, this method is applicable only when the number of source and target processors is the same. Chung et al. [2] proposed an efficient method for index computation using the basic-cycle calculation (BCC) technique for redistributing data from cyclic(x) to cyclic(y) on the same processor set. An extension of this work by Hsu et al. [13] uses a generalized basic-cycle calculation method to redistribute data from cyclic(x) over P processors to cyclic(y) over Q processors. The generalized BCC uses a bipartite matching approach for data redistribution. Lim et al. [8] developed a redistribution framework that could redistribute a one-dimensional array from one block-cyclic scheme to another on the same processor set using a generalized circulant matrix formalism. Their algorithm applies row and column transformations on the communication schedule matrix to generate a conflict-free schedule.
Prylli et al. [14], Desprez et al. [3] and Lim et al. [15] proposed efficient algorithms for redistributing one- and two-dimensional block-cyclic arrays. Prylli et al. [14] proposed a simple scheduling algorithm, called Caterpillar, for redistributing data across a two-dimensional processor grid. At each step $d$ in the algorithm, processor $P_i$ ($0 < i \le P$) in the destination processor set exchanges its data with processor $P_{((P - i - d) \bmod P)}$. The Caterpillar algorithm does not have global knowledge of the communication schedule and redistributes the data using local knowledge of the communications at every step. As a result, this algorithm is not efficient for data redistribution using "non-all-to-all" communication. Also, the redistribution time for a step is the time taken to transfer the largest message in that step. Desprez et al. [3] proposed a general solution for redistributing one-dimensional block-cyclic data from a cyclic(x) distribution on a P-processor grid to a cyclic(y) distribution on a Q-processor grid for arbitrary values of P, Q, x, and y. The algorithm treats the source and target processors as disjoint sets and uses a bipartite matching to compute the communication schedule. However, this algorithm does not ensure a contention-free communication schedule. In recent work, Guo and Pan [4] described a method to construct schedules that minimize the number of communication steps, avoid node contentions, and minimize the effect of differences in message length in each communication step. Their algorithm focuses on redistributing one-dimensional data from a cyclic(kx) distribution on P processors to a cyclic(x) distribution on Q processors for any arbitrary positive values of P and Q. Lim et al. [15] propose an algorithm for redistributing a two-dimensional block-cyclic array across a two-dimensional processor grid, but the algorithm is restricted to redistributing data across different processor topologies on the same processor set. Park et al. [16] extended the idea described by Lim et al. [15] and proposed an algorithm for redistributing a one-dimensional block-cyclic array with a cyclic(x) distribution on P processors to cyclic(kx) on Q processors, where P and Q can be any arbitrary positive values.
To summarize, most of the existing approaches deal with redistribution of block-cyclic arrays across a one-dimensional processor topology, either on the same or on a different processor set. The Caterpillar algorithm by Prylli et al. [14] is the closest related work to our redistribution algorithm in that it supports redistribution on a checkerboard processor topology. In our work, we extend the idea in [15] [16] to develop an algorithm to redistribute two-dimensional block-cyclic data distributed across a 2-D processor grid topology. The data is redistributed from $P$ ($P_r \times P_c$) to $Q$ ($Q_r \times Q_c$) processors, where $P$ and $Q$ can be any arbitrary positive values. Our work differs from that of Desprez et al. [3], who assume that there is no overlap among processors in the source and destination processor sets. Our algorithm builds an efficient communication schedule and uses non-all-to-all communication for data redistribution. We apply row and column transformations using the circulant matrix formalism to minimize node contentions in the communication schedule.
System Overview
The ReSHAPE framework, shown in Figure 1 (a), consists of two main components. The first component is the application scheduling and monitoring module which schedules and monitors jobs and gathers performance data in order to make resizing decisions based on application performance, available system resources, resources allocated to other jobs in the system and jobs waiting in the queue. The second component of the framework consists of a programming model for resizing applications. This includes a resizing library and an API for applications to communicate with the scheduler to send performance data and actuate resizing decisions. The resizing library includes algorithms for mapping processor topologies and redistributing data from one processor topology to another. The individual components in these modules are explained in detail by Sudarsan and Ribbens [17] .
Resizing library
The resizing library provides routines for changing the size of the processor set assigned to an application and for mapping processors and data from one processor set to another. An application needs to be re-compiled with the resize library to enable the scheduler to dynamically add processors to or remove processors from the application. During resizing, rather than suspending the job, the application execution control is transferred to the resize library, which maps the new set of processors to the application and redistributes the data (if required). Once mapping is completed, the resizing library returns control to the application and the application continues with its next iteration.
Our API gives programmers a simple way to indicate resize points in the application, typically at the end of each iteration of the outer loop. At resize points, the application contacts the scheduler and provides it with performance data. The metric used to measure performance is the time taken to compute each iteration. The scheduler's decision to expand or shrink the application is passed as a return value. If an application is allowed to expand to more processors, the response from the Remap Scheduler includes the size and the list of processors to which the application should expand. A call to the redistribution routine remaps the global data to the new processor set. If the Remap Scheduler asks an application to shrink, then the application first redistributes its global data across a smaller processor set, retrieves its previously stored MPI communicator, and creates a new BLACS [18] context for the new processor set. The additional processes are terminated when the old BLACS context is exited. The resizing library notifies the Remap Scheduler about the number of nodes relinquished by the application.
Application Programming Interface (API)
A simple API allows user codes to access the ReSHAPE framework and library. The core functionality is accessed through the following internal and external interfaces. These functions are available for use by advanced application programmers. They provide the main functionality of the resizing library by contacting the scheduler, remapping the processors after an expansion or a shrink, and redistributing the data. These functions are listed as follows:
Figure 2(a) shows the source code for a simple MPI application that solves a sequence of linear systems of equations using ScaLAPACK functions. The original code was refactored to identify the global data structures and variables. The ReSHAPE API calls were inserted at the appropriate locations in the refactored code. Figure 2(b) shows the modified code.
Data Redistribution
The data redistribution library in ReSHAPE uses an efficient algorithm for redistributing block-cyclic arrays between processor sets organized in a 1-D (row or column format) or checkerboard processor topology. The algorithm for redistributing a 1-D block-cyclic array over a one-dimensional processor topology was first proposed by Park et al. [16]. We extend this idea to develop an algorithm to redistribute both one- and two-dimensional block-cyclic data across a two-dimensional processor grid. In our redistribution algorithm, we assume the following:
- Source processor configuration: $P_r \times P_c$ (rows $\times$ columns), $P_r, P_c > 0$.
- The data granularity is set at the block level, i.e., a block is the smallest unit of data that will be transferred and cannot be further subdivided. This block size is specified by the user.
- The data matrix, data, which needs to be redistributed, is of dimension $n \times n$.
- Let the block size be $NB$. Therefore, the total number of data blocks is $(n/NB) \cdot (n/NB) = N \times N$, represented using the matrix $Mat$.
- We use $Mat(x, y)$ to refer to block$(x, y)$, $0 \le x, y < N$.
- The data can be equally divided among the source and destination processors $P$ and $Q$ respectively, i.e., $N$ is evenly divisible by $P_r$, $P_c$, $Q_r$, and $Q_c$. Each processor has an integer number of data blocks.
- The source processors are numbered $P_{(i,j)}$, $0 \le i < P_r$, $0 \le j < P_c$, and the destination processors are numbered as $Q_{(i,j)}$, $0 \le i < Q_r$, $0 \le j < Q_c$.
Problem Definition. We define 2-D block-cyclic distribution as follows: Given a two-dimensional array of $n \times n$ elements with block size $NB$ and a set of $P$ processors arranged in a checkerboard topology, the data is partitioned into $N \times N$ blocks and distributed across the $P$ processors, where $N = n/NB$. Using this distribution, a matrix block $Mat(x, y)$ is assigned to the source processor $P_c \cdot (x \bmod P_r) + (y \bmod P_c)$, $0 \le x < N$, $0 \le y < N$. Here we study the problem of redistributing a two-dimensional block-cyclic matrix from $P$ processors to $Q$ processors arranged in a checkerboard topology, where $P \ne Q$ and $NB$ is fixed. After redistribution, the block $Mat(x, y)$ will belong to the destination processor $Q_c \cdot (x \bmod Q_r) + (y \bmod Q_c)$, $0 \le x < N$, $0 \le y < N$.
(a) Superblock: Figure 3(a) shows the checkerboard distribution of $8 \times 6$ block-cyclic data on the source and destination processor grids. The b00 entry in the source layout table indicates that the block of data is owned by processor $P_{(0,0)}$, the block denoted by b01 is owned by processor $P_{(0,1)}$, and so on. The number in the top right corner of every block indicates the id of that data block. From this data layout, a periodic pattern can be identified for redistributing data from the source to the destination layout. The blocks $Mat(0, 0)$, $Mat(0, 2)$, $Mat(2, 0)$, $Mat(2, 2)$, $Mat(4, 0)$ and $Mat(4, 2)$, owned by processor $P_{(0,0)}$ in the source layout, are transferred to processors $Q_{(0,0)}$, $Q_{(0,2)}$, $Q_{(2,0)}$, $Q_{(2,2)}$, $Q_{(1,0)}$ and $Q_{(1,2)}$. This mapping pattern repeats itself for blocks $Mat(0, 4)$, $Mat(0, 6)$, $Mat(2, 4)$, $Mat(2, 6)$, $Mat(4, 4)$ and $Mat(4, 6)$. Thus we can see that the communication pattern of the blocks $Mat(i, j)$, $0 \le i < 6$, $0 \le j < 4$, repeats for the other blocks in the data. A superblock is defined as the smallest set of data blocks whose mapping pattern from source to destination processor can be uniquely identified.
For a 2-D processor topology data distribution, each superblock is represented as a table of $R$ rows and $C$ columns, where
$$R = \mathrm{lcm}(P_r, Q_r), \qquad C = \mathrm{lcm}(P_c, Q_c).$$
The entire data is divided into multiple superblocks and the mapping pattern of the data in each superblock is identical to that of the first superblock, i.e., the data blocks located at the same relative position in all the superblocks are transferred to the same destination processor. A 2-D block matrix with $Sup$ elements is used to represent the entire data, where each element is a superblock. The dimensions of this block matrix are $Sup_R$ and $Sup_C$, where
$$Sup_R = N/R, \qquad Sup_C = N/C.$$
(c) Initial Data-Processor Configuration (IDPC): This table represents the initial processor layout for the data before redistribution for a single superblock. Since the data-processor mapping is identical over all the superblocks, only one instance of this table is created. The table has $R$ rows $\times$ $C$ columns. $IDPC(i, j)$ contains the processor id $P_{(i,j)}$ that owns the block $Mat(i, j)$ located at the same relative position in all the superblocks ($0 \le i < R$, $0 \le j < C$).
(d) Final Data-Processor Configuration (FDPC):
The table represents the final processor configuration for the data layout after redistribution for a single superblock. Like IDPC, only one instance of this table is created and used for all the data superblocks. The dimensions of this table are $R \times C$. $FDPC(i, j)$ contains the processor id $Q_{(i,j)}$ that owns the block $Mat(i, j)$ after redistribution, located at the same relative position in all the superblocks ($0 \le i < R$, $0 \le j < C$).
(e) The source processor for any data block $Mat(i, j)$ in the data matrix can be computed using the formula
$$Source(i, j) = P_c \cdot (i \bmod P_r) + (j \bmod P_c).$$
(f) Communication schedule send table ($C_{Transfer}$): This table contains the final communication schedule for redistributing data from the source to the destination layout. It is created by re-ordering the FDPC table. The columns of $C_{Transfer}$ correspond to the $P$ source processors and the rows correspond to individual communication steps in the schedule. The number of rows in this table is $(R \cdot C)/P$. The network bandwidth is completely utilized in every communication step, as the schedule involves all the source processors in data transfer. A positive entry in the $C_{Transfer}$ table indicates that in the $i$-th communication step, processor $j$ will send data to $C_{Transfer}(i, j)$, $0 \le i < (R \cdot C)/P$, $0 \le j < (P_r \cdot P_c)$.
(g) Communication schedule receive table ($C_{Recv}$): A positive entry at $C_{Recv}(i, j)$ indicates that processor $j$ will receive data from the source processor at $C_{Recv}(i, j)$ in the $i$-th communication step, $0 \le i < (R \cdot C)/P$, $0 \le j < (Q_r \cdot Q_c)$. If $(Q_r \cdot Q_c) \ge (P_r \cdot P_c)$, then the additional entries in the $C_{Recv}$ table are filled with -1.
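The block-to-processor mapping and the superblock dimensions defined above can be checked with a short Python sketch. The function names `owner` and `superblock_dims` are illustrative and are not part of the ReSHAPE API. The 2 × 2 source grid and 3 × 4 destination grid are an assumption, chosen because they reproduce the destination processors listed for the blocks of $P_{(0,0)}$ in the superblock example.

```python
from math import gcd


def owner(x, y, rows, cols):
    """Rank of the processor that owns block Mat(x, y) on a rows x cols grid."""
    return cols * (x % rows) + (y % cols)


def superblock_dims(pr, pc, qr, qc):
    """Superblock size: R = lcm(Pr, Qr) rows and C = lcm(Pc, Qc) columns."""
    return pr * qr // gcd(pr, qr), pc * qc // gcd(pc, qc)


# Assumed example grids (not stated explicitly in the text).
PR, PC, QR, QC = 2, 2, 3, 4
R, C = superblock_dims(PR, PC, QR, QC)

# Blocks of the first superblock that live on source processor P(0,0).
blocks_of_p00 = [(x, y) for x in range(R) for y in range(C)
                 if owner(x, y, PR, PC) == 0]

# Destination processors of those blocks, as (row, column) grid coordinates.
dest = [divmod(owner(x, y, QR, QC), QC) for x, y in blocks_of_p00]
print(R, C, dest)
```

Under these assumed grids the superblock has 6 rows and 4 columns, and the six blocks owned by $P_{(0,0)}$ map to $Q_{(0,0)}$, $Q_{(0,2)}$, $Q_{(2,0)}$, $Q_{(2,2)}$, $Q_{(1,0)}$ and $Q_{(1,2)}$.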
Algorithm.
Step 1: Create Layout table. The Layout array of tables is created by traversing all the data blocks in the matrix $Mat(i, j)$, where $0 \le i, j < N$. The superblocks in $Mat$ are traversed in row-major order. Pseudocode:
Step 2: Creating IDPC and FDPC tables. An entry at $IDPC(i, j)$ is calculated using the indices $i$ and $j$ of the table and the size of the source processor set $P$, $0 \le i < R$, $0 \le j < C$. The Source function returns the processor id of the owner of the data stored in that location before redistribution. Similarly, an entry $FDPC(i, j)$ is computed using the $i$ and $j$ coordinates of the table and the size of the destination processor set $Q$, $0 \le i < R$, $0 \le j < C$. The Source function returns the processor id of the owner of the redistributed data stored in that location. Pseudocode:
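A minimal Python sketch of Step 2 follows, assuming the Source formula from the definitions above is applied with the source grid for IDPC and with the destination grid for FDPC. The function name `build_config_tables` and the 2 × 2 to 3 × 4 example are hypothetical and serve only as an illustration.

```python
from math import gcd


def build_config_tables(pr, pc, qr, qc):
    """Return (IDPC, FDPC) as R x C lists of processor ranks for one superblock."""
    rows = pr * qr // gcd(pr, qr)   # R = lcm(Pr, Qr)
    cols = pc * qc // gcd(pc, qc)   # C = lcm(Pc, Qc)
    # Owner before redistribution: Pc * (i mod Pr) + (j mod Pc).
    idpc = [[pc * (i % pr) + (j % pc) for j in range(cols)] for i in range(rows)]
    # Owner after redistribution: same formula with the destination grid.
    fdpc = [[qc * (i % qr) + (j % qc) for j in range(cols)] for i in range(rows)]
    return idpc, fdpc


# Hypothetical example: 2 x 2 source grid, 3 x 4 destination grid.
idpc, fdpc = build_config_tables(2, 2, 3, 4)
for src_row, dst_row in zip(idpc, fdpc):
    print(src_row, dst_row)
```

Both tables are built once per redistribution, because every superblock shares the same mapping pattern.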
Step 3: Communication schedule tables ($C_{Transfer}$ and $C_{Recv}$). The $C_{Transfer}$ table stores the final communication schedule for transferring data between the source and the destination processors. The columns in $C_{Transfer}$ correspond to the source processors $P_{(i,j)}$. The table has $C_{TransferRows}$ rows and $(P_r \cdot P_c)$ columns, where
$$C_{TransferRows} = (R \cdot C)/(P_r \cdot P_c).$$
Each entry in the $C_{Transfer}$ table is filled by sequentially traversing the FDPC table in row-major order. The data corresponding to each processor is inserted in the appropriate column at the next available location. An integer counter keeps track of the next available location (next row) for each processor.
Pseudocode:
    processor_id = IDPC(i, j)
    C_Transfer(counter_j, processor_id) ← FDPC(i, j)
    Update counter_j
where $0 \le i < R$ and $0 \le j < C$. Each row in the $C_{Transfer}$ table forms a single communication step in which all the source processors send data to a unique destination processor. The $C_{Recv}$ table is used by the destination processors to identify the source of their data in a particular communication step.
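The schedule construction described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the ReSHAPE implementation: the helper names are hypothetical, one counter is kept per source processor, unused $C_{Recv}$ entries are set to -1, and a step is treated as contention-free when no destination appears twice in its row.

```python
from math import gcd


def build_transfer(pr, pc, qr, qc):
    """Fill C_Transfer by a row-major walk over one R x C superblock."""
    rows = pr * qr // gcd(pr, qr)
    cols = pc * qc // gcd(pc, qc)
    n_src = pr * pc
    steps = rows * cols // n_src
    table = [[-1] * n_src for _ in range(steps)]
    counter = [0] * n_src                      # next free row per source
    for i in range(rows):
        for j in range(cols):
            src = pc * (i % pr) + (j % pc)     # IDPC(i, j)
            dst = qc * (i % qr) + (j % qc)     # FDPC(i, j)
            table[counter[src]][src] = dst
            counter[src] += 1
    return table


def build_recv(transfer, n_dst):
    """C_Recv(i, C_Transfer(i, j)) = j; meaningful only without contention."""
    recv = [[-1] * n_dst for _ in transfer]
    for i, row in enumerate(transfer):
        for src, dst in enumerate(row):
            recv[i][dst] = src
    return recv


def contention_free(transfer):
    """True when every step sends to pairwise distinct destinations."""
    return all(len(set(row)) == len(row) for row in transfer)


# Hypothetical expansion from a 2 x 2 grid to a 3 x 4 grid.
c_transfer = build_transfer(2, 2, 3, 4)
c_recv = build_recv(c_transfer, 3 * 4)
print(c_transfer, contention_free(c_transfer))
```

For this expansion every step has distinct destinations. Reversing the direction (3 × 4 to 2 × 2) necessarily produces repeated destinations within a step, because twelve senders target four receivers.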
$$C_{Recv}(i, C_{Transfer}(i, j)) = j,$$
where $0 \le i < C_{TransferRows}$ and $0 \le j < (Q_r \times Q_c)$. Node contention can occur in the $C_{Transfer}$ communication schedule if any one of the following conditions is true:
If there are node contentions in the communication schedule, create a Processor Mapping (PM) table of dimension $R \times C$ and initialize it with the values from the FDPC table. To reduce node contentions, the PM table is circularly shifted in rows or columns. To maintain data consistency, the same operations are performed on the IDPC table and the superblock tables within the Layout array. The $C_{Transfer}$ table is created from the modified PM table. We identify three situations where node contentions can occur. Case 1 and Case 2 are applicable during both expansion and shrinking of an application, while Case 3 can occur only when an application is shrinking to a smaller destination processor set. Perform the following operations on IDPC, PM, and each 2-D table in the Layout array.
Case 1: If $P_r > Q_r$ and $P_c < Q_c$ then
1. Create $(R/P_r)$ groups with $P_r$ rows in each group.
2. For $1 \le i < P_r$, perform a circular right shift on each row $i$ by $P_c \cdot i$ elements in each group.
3. Create the $C_{Transfer}$ table from the resulting PM table.
Case 2: If $P_r < Q_r$ and $P_c > Q_c$ then
1. Create $(C/P_c)$ groups with $P_c$ columns in each group.
2. For $1 \le j < P_c$, perform a circular down shift on each column $j$ by $P_r \cdot j$ elements in each group.
3. Create the $C_{Transfer}$ table from the resulting PM table.
Case 3: If $P_r > Q_r$ and $P_c > Q_c$ then
1. Create $(C/P_c)$ groups with $P_c$ columns in each group.
2. For $1 \le j < P_c$, perform a circular down shift on each column $j$ by $P_r \cdot j$ elements in each group.
3. Create $(R/P_r)$ groups with $P_r$ rows in each group.
4. For $1 \le i < P_r$, perform a circular right shift on each row $i$ by $P_c \cdot i$ elements in each group.
5. Create the $C_{Transfer}$ table from the resulting PM table.
The $C_{Recv}$ table is not used when the schedule is not contention-free. Node contention results in overlapping entries in the $C_{Recv}$ table, rendering it unusable.
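The row shift of Case 1 can be illustrated with a small Python sketch. It covers only the circular right shift on rows (the column shift of Case 2 is analogous), and the instance used, a 2 × 2 source grid and a 1 × 4 destination grid with hand-written IDPC and FDPC tables, is a hypothetical example rather than one taken from the experiments. All function names are illustrative.

```python
def shift_rows_right(table, pr, pc):
    """In each group of pr rows, rotate row i (1 <= i < pr) right by pc * i."""
    cols = len(table[0])
    shifted = [row[:] for row in table]
    for start in range(0, len(table), pr):
        for i in range(1, pr):
            k = (pc * i) % cols
            row = shifted[start + i]
            shifted[start + i] = row[-k:] + row[:-k] if k else row
    return shifted


def schedule(idpc, pm, n_src):
    """Build C_Transfer from matching source (IDPC) and destination (PM) tables."""
    steps = len(idpc) * len(idpc[0]) // n_src
    table = [[-1] * n_src for _ in range(steps)]
    nxt = [0] * n_src
    for src_row, dst_row in zip(idpc, pm):
        for src, dst in zip(src_row, dst_row):
            table[nxt[src]][src] = dst
            nxt[src] += 1
    return table


def contention_free(table):
    return all(len(set(step)) == len(step) for step in table)


# Hypothetical Case 1 instance: P = 2 x 2, Q = 1 x 4, so R = 2 and C = 4.
PR, PC = 2, 2
idpc = [[0, 1, 0, 1], [2, 3, 2, 3]]
fdpc = [[0, 1, 2, 3], [0, 1, 2, 3]]

before = schedule(idpc, fdpc, PR * PC)
after = schedule(shift_rows_right(idpc, PR, PC),
                 shift_rows_right(fdpc, PR, PC), PR * PC)
print(before, contention_free(before))
print(after, contention_free(after))
```

In this instance the unshifted schedule sends two messages to the same destination in each step, while the shifted tables give every step four distinct destinations. The same shift is applied to IDPC and PM so that source and destination entries stay aligned.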
Step 4: Data marshalling and unmarshalling
If a processor's rank equals the value at $IDPC(i, j)$, then the processor collects the data from the same relative index $(i, j)$ of all the superblocks in the Layout array. Each collection of data over all the superblocks forms a single message for communication for processor $j$.
If there are no node contentions in the schedule, each source processor stores $(R \cdot C)/(P_r \cdot P_c)$ messages, each of size $(N \cdot N)/(R \cdot C)$, in the original order of the data layout. The messages received on the destination processor are unpacked into individual blocks and stored at an offset of $(R/Q_r) \cdot (C/Q_c)$ elements from the previous data block in the local array. The first data block is stored at the zeroth location of the local array. If the communication schedule has node contentions, the order of the messages is shuffled according to the row or column transformations. In such cases, the destination processor performs a reverse index computation and stores the data at the correct offset.
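The packing rule can be made concrete with a short Python sketch that lists which global blocks end up in one message. It assumes superblocks are visited in row-major order, and the function name `message_blocks` and the 12 × 12 block matrix with a 6 × 4 superblock are hypothetical choices for illustration.

```python
def message_blocks(n_blocks, sb_rows, sb_cols, i, j):
    """Global (row, col) of every block at relative position (i, j).

    n_blocks is N, the number of blocks per matrix dimension; sb_rows and
    sb_cols are the superblock dimensions R and C.
    """
    return [(i + a * sb_rows, j + b * sb_cols)
            for a in range(n_blocks // sb_rows)    # superblock row index
            for b in range(n_blocks // sb_cols)]   # superblock column index


# Hypothetical sizes: N = 12 blocks per dimension, R = 6, C = 4.
msg = message_blocks(12, 6, 4, 4, 2)
print(len(msg), msg)
```

The message holds $(N \cdot N)/(R \cdot C) = 6$ blocks here, one from each of the six superblocks.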
Step 5: Data Transfer
The message size in each send communication is equal to $(N \cdot N)/(R \cdot C)$ data blocks. Each row in the $C_{Transfer}$ table corresponds to a single communication step. In each communication step, the total volume of messages exchanged between the processors is $P \cdot (N \cdot N)/(R \cdot C)$ data blocks. This volume includes cases where data is locally copied to a processor without performing an MPI_Send and MPI_Recv operation. In a single communication step $j$, a source processor $P_i$ sends the marshalled message to the destination processor given by $C_{Transfer}(j, i)$, where $0 \le j < C_{TransferRows}$, $0 \le i < (P_r \cdot P_c)$.
Data Transfer Cost. Every communication call using MPI_Send and MPI_Recv has an associated latency overhead. Let $\lambda$ denote this time to initiate a message, and let $\tau$ denote the time taken to transmit a unit-size message from the source to the destination processor. Thus, the time taken to send a message from a source processor in a single communication step is $((N \cdot N)/(R \cdot C)) \cdot \tau$. The total data transfer cost for redistributing the data across the destination processors is
$$C_{TransferRows} \cdot \left(\lambda + \frac{N \cdot N}{R \cdot C} \cdot \tau\right).$$
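The cost expression can be evaluated with a short Python sketch. The values of $\lambda$ and $\tau$ below are made-up placeholders, not measurements, and $\tau$ is taken per data block because the message size is expressed in blocks. The function name `transfer_cost` and the choice of $N = 40$ are also assumptions.

```python
from math import gcd


def transfer_cost(n_blocks, pr, pc, qr, qc, lam, tau):
    """Return (steps, blocks per message, modelled transfer time)."""
    rows = pr * qr // gcd(pr, qr)            # R
    cols = pc * qc // gcd(pc, qc)            # C
    steps = rows * cols // (pr * pc)         # C_TransferRows
    msg_blocks = n_blocks * n_blocks // (rows * cols)
    return steps, msg_blocks, steps * (lam + msg_blocks * tau)


# 2 x 4 source grid to 5 x 8 destination grid, N = 40 blocks per dimension.
# lam and tau are placeholder values in seconds, chosen only for illustration.
steps, msg_blocks, cost = transfer_cost(40, 2, 4, 5, 8, lam=1e-4, tau=1e-3)
print(steps, msg_blocks, cost)
```

With 8 source processors and 10 communication steps, the schedule contains 80 send operations in total, including any local copies.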
Experiments and Results
This section presents experimental results which demonstrate the performance of our two-dimensional block-cyclic redistribution algorithm. The experiments were conducted on 50 nodes of a large homogeneous cluster (System X). Each node is a dual 2.3 GHz PowerPC 970 processor with 4GB of main memory. Message passing was done using MPICH2 [19] over a Gigabit Ethernet interconnection network. We integrated the redistribution algorithm into the resizing library and evaluated its performance by measuring the total time taken by the algorithm to redistribute block-cyclic matrices from $P$ to $Q$ processors. We present results from two sets of experiments. The first set of experiments evaluates the performance of the algorithm for resizing and compares it with the Caterpillar algorithm. The second set of experiments focuses on the effects of processor topology on the redistribution cost. Table 1 shows all the possible processor configurations for various processor topologies. Processor configurations for the one-dimensional processor topology ($1 \times (Q_r \cdot Q_c)$ or $(Q_r \cdot Q_c) \times 1$) are not shown in the table. For the two sets of experiments described in this section, we have used the following matrix sizes: 2000 × 2000, 4000 × 4000, 6000 × 6000, 8000 × 8000, 12000 × 12000, 16000 × 16000, 20000 × 20000 and 24000 × 24000. A problem size of 8000 indicates the matrix 8000 × 8000. The processor configurations listed in Table 1 evenly divide the problem sizes listed above.
Overall Redistribution Time
Every time an application acquires or releases processors, the globally distributed data has to be redistributed to the new processor topology. Thus, the application incurs a redistribution overhead each time it expands or shrinks. We assume a nearly-square processor topology for all the processor sizes used in this experiment. The matrix stores data as double precision floating point numbers. Figure 4(a) shows the overhead for redistributing large dense matrices for different matrix sizes using our redistribution algorithm. Each data point in the graph represents the data redistribution cost incurred when increasing the size of the processor configuration from the previous (smaller) configuration. Problem sizes 8000 and 12000 start execution with 2 processors, problem sizes 16000 and 20000 start with 4 processors, and the 24000 case starts with 6 processors. The starting processor size is the smallest size which can accommodate the data. The trend shows that the redistribution cost increases with matrix size, but for a fixed matrix size the cost decreases as we increase the number of processors. This makes sense because for a small processor set, the amount of data per processor that must be transferred is large. Also, the communication schedule developed by our redistribution algorithm is independent of the problem size and depends only on the source and destination processor set sizes.
The accompanying graph shows the overhead cost incurred while shrinking large matrices from $P$ processors to $Q$ processors. In this experiment, we assign the values for $P$ from the set {25, 40, 50} and $Q$ from the set {4, 8, 10, 25, 32}. Each data point in the graph represents the redistribution overhead incurred while shrinking at that problem size. From the graph, it is evident that the redistribution cost increases as we increase the problem size. Typically, a large difference between the source and destination processor set sizes results in a higher redistribution cost.
The rate at which the redistribution cost increases depends on the sizes of the source and destination processor sets. But we note that a smaller destination processor set size has a greater impact on the redistribution cost than the difference between the processor set sizes. This is shown in the graph, where the redistribution cost for shrinking from $P = 50$ to $Q = 32$ is lower than the cost when shrinking from $P = 25$ to $Q = 10$ or from $P = 25$ to $Q = 8$. Figures 5(a) and 5(b) compare the total redistribution cost of our algorithm and the Caterpillar algorithm. We have not compared the redistribution costs with the bipartite redistribution algorithm, as our algorithm assumes that data redistribution from $P$ to $Q$ processors includes an overlapping set of processors from the source and destination processor sets. The total redistribution time is the sum of the schedule computation time, the index computation time, the time for packing and unpacking the data, and the data transfer time. In each communication step, each sender packs a message before sending it and the receiver unpacks the message after receiving it. The Caterpillar algorithm does not attempt to schedule communication operations or to send equal-sized messages in each step. Figure 5(a) shows experimental results for redistributing block-cyclic two-dimensional arrays from a 2 × 4 processor grid to a 5 × 8 processor grid. On average, the total redistribution time of our algorithm is 12.7 times less than that of the Caterpillar algorithm. In Figure 5(b), the total redistribution time of our algorithm is about 32 times less than that of the Caterpillar algorithm.
In our algorithm, the total number of communication calls for redistributing from 8 to 40 processors is 80.
Effects of Processor Topology on Total Redistribution Time
In this experiment, we report the performance of our redistribution algorithm with four different processor topologies: One-dimensional-row (Row-major), One-dimensional-column (Column-major), Skewed-rectangular-row ($P_r \times P_c$, $P_r > P_c$) and Skewed-rectangular-column ($P_r \times P_c$, $P_r < P_c$). The processor configurations used for the Skewed-rectangular topologies are listed in Table 1. Figure 6(a) and Figure 6(b) show the overhead for redistributing problem sizes 20000 and 24000, respectively, across different processor topologies using our redistribution algorithm. The total redistribution cost for redistributing a 20000 × 20000 matrix across a one-dimensional topology is comparable to the total redistribution cost on a nearly-square processor topology (see Figure 4(a)). In the case of skewed-rectangular topologies, the total redistribution time is slightly higher than the redistribution cost with nearly-square processor topologies. We ran this experiment on other problem sizes, 8000 × 8000 and 16000 × 16000, and observed results similar to Figure 6(a). An increase in the total redistribution time for a skewed-rectangular topology can be due to one of two situations.
(1) There is an increase in the total number of messages to be transferred using the communication schedule.
(2) Node contention in the communication schedule is high.
Since the dimensions of a superblock depend upon the numbers of source and destination processor rows and columns, a change in the processor topology can change the number of elements in a superblock. As a result, the number of messages exchanged between processors will also vary, thereby increasing or decreasing the total redistribution time. Figure 6(b) shows that the total redistribution cost for a skewed processor topology suddenly increases when the processor size increases from 30 to 36 (10 × 3 to 18 × 2). In this case the number of elements in the superblock increases to 540. Table 2 shows the total MPI send/receive counts for redistributing between different processor sets on different topologies. From Table 2, we note that data redistribution using a skewed-rectangular processor topology requires exactly half the number of send/receive operations as compared to a nearly-square topology. The algorithm uses only 18 MPI send/receive operations to redistribute data from 4 to 20 processors and 36 to redistribute from 8 to 40 processors, as compared to 36 and 72 respectively required for a nearly-square topology. In Figure 6(a), the cost of redistribution in a P < Q topology is more than the redistribution cost for a P > Q topology. The reason for this additional overhead can be attributed to the increased number of node contentions in the communication schedule for the P < Q topology. The node contentions reduce as the processor size increases and the topology is maintained in subsequent iterations. When data is redistributed from $P = 25$ (square topology) to $Q = 40$ (skewed topology), node contentions in the communication schedule of $Q = 40$ (10 × 4) are higher compared to the schedule for redistribution to $Q = 40$ (4 × 10).
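The dependence of the superblock size on topology follows directly from $R = \mathrm{lcm}(P_r, Q_r)$ and $C = \mathrm{lcm}(P_c, Q_c)$, as the short Python sketch below shows for the 10 × 3 to 18 × 2 case discussed above. The 5 × 6 to 6 × 6 comparison is a hypothetical nearly-square configuration added for contrast and is not taken from Table 1. The function name `superblock_size` is illustrative.

```python
from math import gcd


def superblock_size(pr, pc, qr, qc):
    """Return (R, C, R * C) for a Pr x Pc to Qr x Qc redistribution."""
    rows = pr * qr // gcd(pr, qr)
    cols = pc * qc // gcd(pc, qc)
    return rows, cols, rows * cols


skewed = superblock_size(10, 3, 18, 2)       # 30 to 36 processors, skewed
near_square = superblock_size(5, 6, 6, 6)    # hypothetical nearly-square grids
print(skewed, near_square)

# Messages each source processor sends in the skewed case: (R * C) / P.
print(skewed[2] // (10 * 3))
```

The skewed case yields a 90 × 6 superblock with 540 elements, so each of the 30 source processors sends 18 messages, whereas the assumed nearly-square grids give a superblock of 180 elements.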
Discussion and Future Work
In this paper we have introduced a framework, ReSHAPE, that enables parallel message passing applications to be resized during execution. We have extended the functionality of the resizing library in ReSHAPE to support redistribution of 2-D block-cyclic matrices distributed across a 2-D processor topology. We build upon the work by Park et al. [16] to derive an efficient 2-D redistribution algorithm. Our algorithm redistributes a two-dimensional block-cyclic data distribution on a 2-D grid of $P$ ($P_r \times P_c$) processors to a two-dimensional block-cyclic data distribution on a 2-D grid with $Q$ ($Q_r \times Q_c$)
