We consider the problem of redistributing data on homogeneous and heterogeneous rings of processors. The problem arises in several applications, after each invocation of a load-balancing mechanism (but we do not discuss the load-balancing mechanism itself). We provide algorithms that aim at optimizing the data redistribution, both for unidirectional and bidirectional rings. One major contribution of the paper is that we are able to prove the optimality of the proposed algorithms in all cases except that of a bidirectional heterogeneous ring, for which the problem remains open.
Introduction
In this paper, we consider the problem of redistributing data on a heterogeneous ring of processors. The problem typically arises when a load-balancing phase must be initiated. Because of variations either in the resource performances (CPU speed, communication bandwidth) or in the system/application requirements (completed tasks, new tasks, migrated tasks, etc.), data must be redistributed between participating processors so that the current (estimated) load is better balanced. We do not discuss the load-balancing mechanism itself (we take it as external, be it a system, an algorithm, an oracle, or anything else). Rather we aim at optimizing the data redistribution induced by the load-balancing mechanism.
We adopt the following abstract view of the problem. There are n participating processors P 1 , P 2 , …, P n . Each processor P k initially holds L k atomic data items. The load-balancing system/algorithm/oracle has decided that the new load of P k should be L k -δ k . If δ k > 0, this means that P k now is overloaded and should send δ k data items to other processors; if δ k < 0, P k is underloaded and should receive -δ k data items from other processors. Of course there is a conservation law: $\sum_{k=1}^{n} \delta_k = 0$. The goal is to determine the required communications and to organize them (what we call the data redistribution) in minimal time.
We assume that the participating processors are arranged along a ring, either unidirectional or bidirectional, and either with homogeneous or heterogeneous link bandwidths; hence a total of four different frameworks to deal with. There are two main contexts in which processor rings are useful. The first context is that of many applications which operate on ordered data, and where the order needs to be preserved. Think of a large matrix whose columns are distributed among the processors, but with the condition that each processor operates on a slice of consecutive columns. An overloaded processor P i can send its first columns to the processor P j that is assigned the slice preceding its own slice (and P j would append these columns to the end of its slice). Similarly, P i can send its last columns to the processor which is assigned the next slice. Obviously, these are the only possibilities. In other words, the ordered unidimensional data distribution calls for a unidimensional arrangement of the processors, i.e. along a ring.
The second context that may call for a ring is the simplicity of the programming. Using a ring, either unidirectional or bidirectional, allows for a simpler management of the data to be redistributed. Data intervals can be maintained and updated to characterize each processor load. Finally, we observe that parallel machines with a rich but fixed interconnection topology (hypercubes, fat trees, grids, to quote a few) are on the decline. Heterogeneous cluster architectures, which we target in this paper, have a largely unknown interconnection graph, including gateways, backbones, and switches, and modeling the communication graph as a ring is a reasonable, if conservative, choice.
As stated above, we discuss four cases for the redistribution algorithms. We delay the formal statement of the redistribution problems until Section 2, but we summarize the main results as follows. In the simplest case, that of a unidirectional homogeneous ring, we derive an optimal algorithm, and we prove its correctness in full detail. Because the target architecture is quite simple, we are able to provide explicit (analytical) formulae for the number of data sent/received by each processor. The same holds true for the case of a bidirectional homogeneous ring, but the algorithm becomes more complicated. When assuming heterogeneous communication links, we still derive an optimal algorithm for the unidirectional case, but we have to use an asynchronous formulation. However, we are only able to solve the bidirectional case in the special case of light redistributions. We point out that one major contribution of the paper is the design of optimal algorithms, together with their formal proof of correctness; to the best of our knowledge, this is the first time that optimal algorithms have been introduced.
The rest of the paper is organized as follows. In Section 2 we formally state the optimization problem. For homogeneous networks (all links have same capacity), the optimal algorithms are described in Section 3 (unidirectional ring) and in Section 5 (bidirectional ring). For heterogeneous networks, the optimal asynchronous unidirectional algorithm is presented in Section 4, and the linear-programming based optimal algorithm for light redistributions on bidirectional links is explained in Section 6. Section 7 is devoted to a survey of related work. In Section 8, we overview some simulation results that confirm the usefulness of data redistributions. Finally, in Section 9 we conclude the paper and highlight future work directions.
Due to page limits, we were not able to include all the proofs in this paper. The missing proofs can be found in Renard, Robert, and Vivien (2004a) .
Framework
We consider a set of n processors P 1 , P 2 , …, P n arranged along a ring. The successor of P i in the ring is P i + 1 , and its predecessor is P i -1 , where all indices are taken modulo n. For 1 ≤ k, l ≤ n, C k, l denotes the slice of consecutive processors C k, l = P k , P k + 1 , …, P l -1 , P l .
We denote by c i, i + 1 the capacity of the communication link from P i to P i + 1 . In other words, it takes c i, i + 1 time units to send an atomic data item from processor P i to processor P i + 1 . In the case of a bidirectional ring, c i, i -1 is the capacity of the link from P i to P i -1 . We use the one-port model for communications; at any given time, there are at most two communications involving a given processor, one sent and the other received. A given processor can simultaneously send and receive data, so there is no restriction in the unidirectional case. However, in the bidirectional case, a given processor cannot simultaneously send data to its successor and its predecessor; neither can it receive data from both sides. This is the only restriction induced by the model: any pair of communications that does not violate the one-port constraint can take place in parallel.
Each processor P k initially holds L k atomic data items. After redistribution, P k will hold L k -δ k atomic data items. We call δ k the "unbalance" of P k . We denote by δ k, l the total unbalance of the processor slice C k, l :
$$\delta_{k,l} = \sum_{i=k}^{l} \delta_i .$$
Because of the conservation law of atomic data items, $\sum_{k=1}^{n} \delta_k = 0$. Obviously the unbalance cannot be larger than the initial load: L k ≥ δ k . In fact, we suppose that any processor holds at least one data item, both initially (L k ≥ 1) and after the redistribution (L k ≥ 1 + δ k ); otherwise we would have to build a new ring from the subset of resources still involved in the computation.
Homogeneous Unidirectional Ring
In this section, we consider a homogeneous unidirectional ring. Any processor P i can only send data items to its successor P i + 1 , and c i, i + 1 = c for all i ∈ [1, n]. We first derive a lower bound on the running time of any redistribution algorithm. Then, we present an algorithm achieving this bound (hence optimal), and we prove its correctness.
Lower Bound
We have the following bound on the optimal redistribution time:
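Lemma 1. Let τ be the optimal redistribution time. Then
$$\tau \ge \Bigl( \max_{1 \le k \le n,\ 0 \le l \le n-1} |\delta_{k,k+l}| \Bigr) \times c. \qquad (1)$$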
Proof. The processor slice $C_{k,k+l} = P_k, P_{k+1}, \dots, P_{k+l-1}, P_{k+l}$ has a total unbalance of $\delta_{k,k+l} = \delta_k + \delta_{k+1} + \dots + \delta_{k+l-1} + \delta_{k+l}$. If $\delta_{k,k+l} > 0$, then $\delta_{k,k+l}$ data items must be sent from $C_{k,k+l}$ to the other processors. The ring is unidirectional, so $P_{k+l}$ is the only processor in $C_{k,k+l}$ with an outgoing link. Furthermore, $P_{k+l}$ needs a time equal to $\delta_{k,k+l} \times c$ to send $\delta_{k,k+l}$ data items. Therefore, in any case, a redistribution scheme cannot take less than $\delta_{k,k+l} \times c$ to redistribute all data items. The same reasoning applies when $\delta_{k,k+l} < 0$: the slice must then receive $|\delta_{k,k+l}|$ data items, all of them through the single incoming link of $P_k$. □
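As a small illustration of the bound used in this proof, the following Python sketch enumerates every slice of consecutive processors and returns the largest absolute slice unbalance multiplied by c. The function names and the 0-based indexing are assumptions made for the example, not notation from the paper.

```python
def slice_unbalance(deltas, k, l):
    """Total unbalance of the l + 1 consecutive processors starting at index k
    (0-based, indices taken modulo n)."""
    n = len(deltas)
    return sum(deltas[(k + j) % n] for j in range(l + 1))


def unidirectional_lower_bound(deltas, c):
    """Largest |slice unbalance| times the (homogeneous) link cost c."""
    n = len(deltas)
    if sum(deltas) != 0:
        raise ValueError("the unbalances must sum to zero (conservation law)")
    return c * max(
        abs(slice_unbalance(deltas, k, l)) for k in range(n) for l in range(n)
    )


# Hypothetical instance: four processors, link cost c = 2.
bound = unidirectional_lower_bound([3, -2, 1, -2], c=2)
```

The double loop is quadratic in n with a linear inner sum; it favors readability over speed, which seems acceptable for an illustration.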
An Optimal Algorithm
Algorithm 1 is an optimal solution to our problem. We first prove its correctness (Lemma 3). Secondly, we prove its optimality (Lemma 4). Intuitively, if step 6 of this algorithm is always feasible, then each execution of step 3 has exactly a length of c, and the algorithm will meet the time bound of Lemma 1.
First, we point out that the slice C start, end is well defined in step 2 of the algorithm: for any slice with an unbalance δ, the slice made up from the remaining processors has the opposite unbalance -δ. Next, we state the particular role of the processor P start :
Lemma 2. Processor P start receives no data items during the execution of Algorithm 1.
Proof. We prove the result by contradiction. Suppose that at a given iteration s processor P start receives some data items. Then the predecessor of P start in the ring, P start -1 , sends a data item at this iteration. Thus, P start -1 being a sender, by the condition at step 5 of Algorithm 1, $\delta_{start,start-1} = \sum_{j=0}^{n-1} \delta_{start+j} \ge s$. However, due to the conservation law, $\sum_{i=1}^{n} \delta_i = 0$. Hence, 0 ≥ s, the desired contradiction.
To prove that Algorithm 1 is correct, we must show that during each iteration, any processor required to send a data item in step 6 actually holds at least one data item at this iteration. In other words, we must prove that no processor is asked to send a data item that it does not currently own. Let $L_i^s$ be the load of $P_i$ at the end of iteration s of Algorithm 1, with $L_i^0 = L_i$ the initial load.
Lemma 3. During iteration s of loop 3, if $P_i$ sends a data item, then $L_i^{s-1} \ge 1$.
Proof. We prove Lemma 3 by induction. By definition of unbalances (see Section 2), we know that each processor $P_i$ in the ring initially holds $L_i^0 = L_i \ge 1$ data items. Thus, the result holds for s = 1. Now we suppose that the result holds up to a certain iteration s (included), and we focus on iteration s + 1. There are two cases to consider, depending on whether processor $P_i$ is supposed to receive a data item during iteration s + 1 or not:
1. If processor $P_i$ is both a sender and a receiver during iteration s + 1, then $P_i$ is both a sender and a receiver during iteration s by the condition at step 5 of Algorithm 1. Then the load of $P_i$ after iteration s was the same as before that iteration: $L_i^s = L_i^{s-1}$. We conclude using the induction hypothesis.
2. If processor $P_i$ is a sender but not a receiver during iteration s + 1, we must verify that $P_i$ does not send a data item that it does not hold. Because $P_i$ is a sender we have, by the condition at step 5 of Algorithm 1:
$$\delta_{start,i} \ge s + 1. \qquad (2)$$
Furthermore, $P_i$ has sent a data item during each of the previous iterations. During iteration s + 1, $P_i$ is not a receiver. Thus, $P_{i-1}$ is not a sender during this iteration, and, by the condition at step 5 of Algorithm 1, we have $\delta_{start,i-1} < s + 1$. During each iteration from 1 to $\delta_{start,i-1}$, $P_{i-1}$ has sent a data item (see below for the proof that $\delta_{start,start+j} \ge 0$ for all j ∈ [0, n - 1]). Hence, during each of these iterations, $P_i$ was both a sender and a receiver, and neither its load nor its unbalance changed.
During each iteration from $1 + \delta_{start,i-1}$ to s, processor $P_i$ was a sender but not a receiver. So both its load and its unbalance decrease by one during each of these iterations. Hence
$$L_i^s = L_i - (s - \delta_{start,i-1}) \ge L_i - \delta_i + 1 \ge 1, \qquad (3)$$
where the first inequality follows from (2), which gives $s - \delta_{start,i-1} \le \delta_i - 1$, and the second from the hypothesis $L_i \ge 1 + \delta_i$ of Section 2.
The above proof relies on the property that, for any value of j ∈ [0, n - 1], $\delta_{start,start+j} \ge 0$. We now prove this result by contradiction. Hence we suppose that there exists a value j such that $\delta_{start,start+j} < 0$. We have two cases to consider, as follows.
1. start + j ∈ [start, end]. Then $\delta_{start,end} = \delta_{start,start+j} + \delta_{start+j+1,end}$ and $\delta_{start,end} < \delta_{start+j+1,end}$, which contradicts the maximality of $C_{start,end}$.
2. start + j ∉ [start, end]. Then $\delta_{start,start+j} = \delta_{start,end} + \delta_{end+1,start+j}$. So $\delta_{start,end} < -\delta_{end+1,start+j}$. However, as the sum of unbalances is null by definition, the sum of unbalances of $C_{end+1,start+j}$ is equal to the opposite of the sum of unbalances of $C_{start+j+1,end}$. Hence, $\delta_{start,end} < \delta_{start+j+1,end}$, which contradicts the maximality of $C_{start,end}$. □
Algorithm 1. Redistribution algorithm for homogeneous unidirectional rings.
We have proved the correctness of Algorithm 1. We still have to prove that when it terminates, the entire redistribution has actually been performed.
Lemma 4. When Algorithm 1 terminates after iteration δ max , i.e. at time τ, the load of any processor P i is equal to $L_i - \delta_i$.
Proof. We prove by induction on the processor indices, starting at processor $P_{start}$, that any processor $P_{start+j}$ has the desired load of $L_{start+j} - \delta_{start+j}$ at any iteration $s \ge \max_{0 \le i \le j} \delta_{start,start+i}$.
As stated by Lemma 2, processor P start never receives a data item during the algorithm execution. So, after δ start, start = δ start iterations of loop 3, P start is never the receiver or the sender of a data item. As required, P start exactly holds L start -δ start data items, i.e. its initial load minus the amount of data items sent.
We suppose the result proved up to a processor $P_{start+l}$ (with l ≥ 0) included. We focus on processor $P_{start+l+1}$. Using the induction hypothesis, we know that at any iteration $s \ge \max_{0 \le i \le l} \delta_{start,start+i}$, the total load of the slice $C_{start,start+l}$ is equal to $\sum_{i=start}^{start+l} (L_i - \delta_i)$. During the execution of the whole algorithm, processor $P_{start+l+1}$ has sent exactly $\delta_{start,start+l+1}$ data items (remember that for any j ∈ [0, n - 1], $\delta_{start,start+j} \ge 0$). All these send operations took place before or during iteration $\delta_{start,start+l+1}$. Furthermore, Lemma 2 states that processor $P_{start}$ never receives a data item during the execution. So, the total load of the slice $C_{start,start+l+1}$ does not change after iteration $\delta_{start,start+l+1}$, and its total load is equal to its initial total load minus the data items sent by processor $P_{start+l+1}$: $\bigl(\sum_{i=start}^{start+l+1} L_i\bigr) - \delta_{start,start+l+1}$. Therefore, after any iteration s, where $s \ge \max\bigl(\max_{0 \le i \le l} \delta_{start,start+i},\ \delta_{start,start+l+1}\bigr) = \max_{0 \le i \le l+1} \delta_{start,start+i}$, we know the total load of the slices $C_{start,start+l}$ and $C_{start,start+l+1}$. Therefore, we know the load of processor $P_{start+l+1}$ at any step t ≥ s:
$$\Bigl(\sum_{i=start}^{start+l+1} L_i - \delta_{start,start+l+1}\Bigr) - \sum_{i=start}^{start+l} (L_i - \delta_i) = L_{start+l+1} - \delta_{start+l+1}.$$
To conclude, we just need to remark that $\delta_{max} = \max_{0 \le i \le n-1} \delta_{start,start+i}$. □
The optimality of Algorithm 1 is a direct consequence of the previous lemmas:
Theorem 1. Algorithm 1 is optimal.
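The listing of Algorithm 1 is not reproduced in this text, so the Python sketch below is only one reading of the description given in this section: pick a slice of maximal unbalance starting at P start, then, at iteration s, let every processor P i with δ start, i ≥ s send one data item to its successor. All function and variable names are hypothetical, and indices are 0-based.

```python
def max_unbalance_slice(deltas):
    """Return (delta_max, start) for a slice of consecutive processors
    with maximal total unbalance (0-based indices, taken modulo n)."""
    n = len(deltas)
    best, start = 0, 0
    for k in range(n):
        acc = 0
        for l in range(n):
            acc += deltas[(k + l) % n]
            if acc > best:
                best, start = acc, k
    return best, start


def redistribute_unidirectional(loads, deltas):
    """Simulate the synchronous redistribution on a homogeneous
    unidirectional ring; returns (final_loads, number_of_iterations)."""
    n = len(loads)
    if sum(deltas) != 0:
        raise ValueError("the unbalances must sum to zero (conservation law)")
    delta_max, start = max_unbalance_slice(deltas)
    # prefix[j] is the unbalance of the slice P_start .. P_{start+j}.
    prefix, acc = [], 0
    for j in range(n):
        acc += deltas[(start + j) % n]
        prefix.append(acc)
    current = list(loads)
    for s in range(1, delta_max + 1):
        senders = [(start + j) % n for j in range(n) if prefix[j] >= s]
        # Lemma 3: every sender holds at least one item when the iteration starts.
        assert all(current[i] >= 1 for i in senders)
        for i in senders:  # all sends of one iteration happen in parallel
            current[i] -= 1
            current[(i + 1) % n] += 1
    return current, delta_max


# Hypothetical instance satisfying L_k >= 1 and L_k >= 1 + delta_k.
final_loads, iterations = redistribute_unidirectional([5, 1, 3, 3], [3, -2, 1, -2])
```

On this instance the simulation ends with loads [2, 3, 2, 5] after 3 iterations, which is the target L k - δ k for every processor and matches the lower bound of Lemma 1 when c = 1.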
Heterogeneous Unidirectional Ring
In this section we still suppose that the ring is unidirectional but we no longer assume the communication paths to have the same capacities. We build on the results of the previous section to design an optimal algorithm (Algorithm 2). In this algorithm, the amount of data items sent by any processor P i is exactly the same as in Algorithm 1 (namely δ start, i ). However, as the communication links have different capacities, we no longer have a synchronous behavior. A processor P i sends its δ start, i data items as soon as possible, but we cannot express its completion time with a simple formula. Indeed, if P i initially holds more data items than it has to send, we have the same behavior as previously: P i can send its data items during the time interval [0, δ start, i c i, i + 1 ]. In contrast, if P i holds fewer data items than it has to send (L i < δ start, i ), P i still starts to send some data items at time 0 but may have to wait to have received some other data items from P i -1 to be able to forward them to P i + 1 .
The asynchronousness of Algorithm 2 implies that it is correct by construction. Furthermore, when the algorithm terminates, the redistribution is complete (the proof is the same as in Lemma 4). It remains to prove that the running time of Algorithm 2 is optimal. We first compute this running time.
Algorithm 2. Redistribution algorithm for heterogeneous unidirectional rings.
Lemma 5. The running time of Algorithm 2 is $\max_{0 \le l \le n-1} \delta_{start,start+l}\, c_{start+l,start+l+1}$.
The result of Lemma 5 is surprising. Intuitively, it says that the running time of Algorithm 2 is equal to the maximum of the communication times of all the processors, if each of them initially stored locally all the data items it will have to send throughout the execution of the algorithm. In other words, there is no forwarding delay, whatever the initial distribution. The proof of Lemma 5 is technical and can be omitted at first reading.
Proof. We prove the result by contradiction, assuming that the running time of Algorithm 2, denoted as t max , is strictly greater than max 0 ≤ l ≤ n -1 δ start, start + l c start + l, start + l + 1 (we assume that the algorithm starts running at time 0). Let P i be any processor whose running time is t max , i.e. let P i be any processor which terminates the emission of its last data item at time t max . By hypothesis, t max > δ start, i c i, i + 1 . Therefore, there is some time during the running time of the algorithm at which processor P i is not sending any data items to processor P i + 1 . Let t i denote the latest time at which P i is not sending any data items. Then, by definition of t i , from time t i until the completion of the algorithm, processor P i is continuously sending data items to P i + 1 . Let n i denote the number of data items that P i sends during that interval. Note that we have t max = t i + n i c i, i + 1 . We now prove by induction that for any value of j ≥ 1.
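1. Processor $P_{i-j}$ sends a data item to processor $P_{i-j+1}$ during the time interval $\bigl[t_i - \sum_{k=1}^{j} c_{i-k,i-k+1},\ t_i - \sum_{k=1}^{j-1} c_{i-k,i-k+1}\bigr]$.
2. Between time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$ and the completion of the algorithm, processor $P_{i-j}$ sends at least $j + n_i$ data items to processor $P_{i-j+1}$.
3. $c_{i,i+1} \ge c_{i-j,i-j+1}$.
4. Right before time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$, processor $P_{i-j}$ is not sending any data items to processor $P_{i-j+1}$ (it is idle in sending).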
Once we have proved these properties, the contradiction follows from considering processor $P_{start}$. Processor $P_{start}$ only sends data items that it initially holds ($\delta_{start} = \delta_{start,start} \le L_{start}$), and receives no data items from its predecessor in the ring. However, using the above properties, there is a value of j ≥ 0 such that start = i - j, and between time $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$ and the completion of the algorithm, processor $P_{i-j-1}$ sends at least $j + 1 + n_i$ data items to processor $P_{i-j} = P_{start}$. Hence the contradiction.
The construction used in the proof is illustrated by Figure 1 . We start by proving the above properties for j = 1.
1. By definition of $t_i$, processor $P_i$ is not sending any data items to processor $P_{i+1}$ right before time $t_i$. Because of the "as-soon-as-possible" nature of the algorithm, processor $P_i$ is not holding a single data item right before time $t_i$ and is waiting for processor $P_{i-1}$ to send it one. Furthermore, the data item that processor $P_i$ started to send at time $t_i$ is sent to it by processor $P_{i-1}$ during the time interval $[t_i - c_{i-1,i},\ t_i]$.
2. Between time $t_i$ and the completion of the algorithm, processor $P_i$ sends $n_i$ data items to processor $P_{i+1}$. By hypothesis, processor $P_i$ holds at least one data item after the completion of the algorithm. As $P_i$ holds no data item right before time $t_i$, then between the times $t_i - c_{i-1,i}$ and $t_{max}$, $P_{i-1}$ sends at least $1 + n_i$ data items to $P_i$.
3. From what just precedes, and using the relationship between $t_i$, $n_i$, and $t_{max}$, we have $t_i + n_i c_{i,i+1} = t_{max}$ and $t_{max} \ge (t_i - c_{i-1,i}) + (1 + n_i) c_{i-1,i}$, which imply $c_{i,i+1} \ge c_{i-1,i}$, as $n_i$ is non-zero by definition.
4. Suppose that processor $P_{i-1}$ is sending a data item to processor $P_i$ right before the time $t_i - c_{i-1,i}$. Then, at the earliest, this data item is received by processor $P_i$ at time $t_i - c_{i-1,i}$. Due to the "as-soon-as-possible" nature of the algorithm, $P_i$ forwards this data item to processor $P_{i+1}$ (as it forwards data items received later). $P_i$ finishes forwarding this data item at time $t_i - c_{i-1,i} + c_{i,i+1} \ge t_i$ at the earliest. Therefore, processor $P_i$ has no reason not to be sending any data item at time $t_i$, which contradicts the definition of $t_i$. Hence, processor $P_{i-1}$ is not sending any data item to processor $P_i$ right before the time $t_i - c_{i-1,i}$.
Fig. 1 The construction used in the proof of Lemma 5.
We now proceed to the general case of the induction. We suppose that the property is proved up to a processor $P_{i-j}$ included (with j ≥ 1).
1. By induction hypothesis, processor $P_{i-j}$ is not sending any data items to processor $P_{i-j+1}$ right before time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$. Because of the "as-soon-as-possible" nature of the algorithm, processor $P_{i-j}$ is not holding a single data item right before this time and is waiting for processor $P_{i-j-1}$ to send one. Furthermore, the data item that processor $P_{i-j}$ started to send at time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$ is sent to it by processor $P_{i-j-1}$ during the time interval $\bigl[t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1},\ t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}\bigr]$.
2. Between time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$ and the completion of the algorithm, processor $P_{i-j}$ sends at least $j + n_i$ data items to processor $P_{i-j+1}$, by induction hypothesis. By hypothesis, processor $P_{i-j}$ holds at least one data item after the completion of the algorithm. As $P_{i-j}$ holds no data item right before time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$, then between the times $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$ and $t_{max}$, $P_{i-j-1}$ sends at least $1 + j + n_i$ data items to $P_{i-j}$.
3. From what just precedes, and using the relationship between $t_i$, $n_i$, and $t_{max}$, we have
$$t_{max} = t_i + n_i c_{i,i+1} \quad \text{and} \quad t_{max} \ge \Bigl(t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}\Bigr) + (1 + j + n_i)\, c_{i-j-1,i-j}.$$
Therefore,
$$n_i c_{i,i+1} + \sum_{k=1}^{j} c_{i-k,i-k+1} \ge (j + n_i)\, c_{i-j-1,i-j},$$
and thus $c_{i,i+1} \ge c_{i-j-1,i-j}$ as, by induction hypothesis, for any k ∈ [1, j], $c_{i,i+1} \ge c_{i-k,i-k+1}$.
4. Suppose that processor $P_{i-j-1}$ is sending a data item to processor $P_{i-j}$ right before the time $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$. Then, at the earliest, this data item is received by processor $P_{i-j}$ at time $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$. Due to the "as-soon-as-possible" nature of the algorithm, $P_{i-j}$ forwards this data item to processor $P_{i-j+1}$ (as it forwards data items received later). $P_{i-j}$ finishes forwarding this data item at time $t_i - c_{i-j-1,i-j} - \sum_{k=1}^{j-1} c_{i-k,i-k+1}$ at the earliest. Then, following the same line of reasoning, processor $P_{i-j+1}$ forwards it to $P_{i-j+2}$, which receives it at the earliest at time $t_i - c_{i-j-1,i-j} - \sum_{k=1}^{j-2} c_{i-k,i-k+1}$, and so on. So, processor $P_i$ receives this data item at the earliest at time $t_i - c_{i-j-1,i-j}$, and forwards it. Then, it finishes sending it at the earliest at time $t_i - c_{i-j-1,i-j} + c_{i,i+1} \ge t_i$, as we have seen that $c_{i,i+1} \ge c_{i-j-1,i-j}$. Therefore, processor $P_i$ has no reason not to be sending any data items at time $t_i$, which contradicts the definition of $t_i$. Hence, processor $P_{i-j-1}$ is not sending any data item to processor $P_{i-j}$ right before the time $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$. □
Theorem 2. Algorithm 2 is optimal.
Proof. Let τ denote the optimal redistribution time. Following the arguments used in the proof of Lemma 1 for the homogeneous case in Section 3.1, we obtain the lower bound:
$$\tau \ge \max_{1 \le k \le n,\ 0 \le l \le n-1} \delta_{k,k+l}\, c_{k+l,k+l+1}.$$
We conclude using Lemma 5. □
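The listing of Algorithm 2 is not reproduced in this text; the Python sketch below is a hedged simulation of its description (each processor P i sends its δ start, i data items as soon as it holds one and its outgoing link is free) and compares the simulated completion time with the closed form of Lemma 5. The function names, the 0-based indexing, and the convention `costs[i]` = c i, i + 1 are assumptions of the example, and `start` is assumed to be the first processor of a slice of maximal unbalance.

```python
def async_running_time(loads, deltas, costs, start):
    """Simulate as-soon-as-possible sending on a heterogeneous
    unidirectional ring and return the completion time."""
    n = len(loads)
    prev_finish = []   # completion times of the predecessor's sends
    makespan = 0
    to_send = 0        # becomes delta_{start,i} for the current processor
    for j in range(n):
        i = (start + j) % n
        to_send += deltas[i]
        finish, t = [], 0
        for m in range(to_send):
            # The m-th item sent is either held initially (available at time 0)
            # or is the (m - loads[i])-th item received from the predecessor.
            available = 0 if m < loads[i] else prev_finish[m - loads[i]]
            t = max(t, available) + costs[i]
            finish.append(t)
        if finish:
            makespan = max(makespan, finish[-1])
        prev_finish = finish
    return makespan


def lemma5_formula(deltas, costs, start):
    """Closed form of Lemma 5: max over l of delta_{start,start+l} * c."""
    n = len(deltas)
    best, acc = 0, 0
    for j in range(n):
        i = (start + j) % n
        acc += deltas[i]
        best = max(best, acc * costs[i])
    return best


# Hypothetical instance in which P_1 and P_2 must forward items they receive.
loads, deltas, costs = [4, 1, 1, 1], [3, 0, -1, -2], [2, 1, 3, 1]
simulated = async_running_time(loads, deltas, costs, start=0)
predicted = lemma5_formula(deltas, costs, start=0)
```

On this instance both values are 6: the forwarding performed by the lightly loaded processors does not delay the completion, as Lemma 5 states.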
Homogeneous Bidirectional Ring
In this section, we consider a homogeneous bidirectional ring. All links have the same capacity but a processor can send data items to its two neighbors in the ring; there exists a constant c such that, for all i ∈ [1, n], c i, i + 1 = c i, i -1 = c. We proceed as for the homogeneous unidirectional case. We first derive a lower bound on the running time of any redistribution algorithm, and then we present an algorithm attaining this bound.
Lower Bound
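We have the following bound on the optimal redistribution time.
Lemma 6. Let τ be the optimal redistribution time. Then
$$\tau \ge \max\Bigl\{ \max_{1 \le i \le n} |\delta_i|,\ \max_{1 \le i \le n,\ 1 \le l \le n-2} \Bigl\lceil \frac{|\delta_{i,i+l}|}{2} \Bigr\rceil \Bigr\}\, c. \qquad (5)$$
Proof. Consider any processor $P_i$ with positive unbalance ($\delta_i > 0$). Even if processor $P_i$ can send data items to both of its neighbors, because of the one-port model, it cannot send data items to both of them simultaneously. So processor $P_i$ needs a time of at least $\delta_i c$ to send $\delta_i$ data items, whatever the destinations of these data items.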
We have a symmetric result for the case δ i < 0. Hence a first lower bound on the optimal redistribution time τ:
$$\tau \ge \max_{1 \le i \le n} |\delta_i|\, c.$$
Now, consider any non-trivial slice of consecutive processors $C_{k,l}$. By "non-trivial" we mean that the slice is not reduced to a single processor (we have already considered that case) and that it does not contain all processors. We suppose that $\delta_{k,l} > 0$. So, in any redistribution scheme, at least $\delta_{k,l}$ data items must be sent by $C_{k,l}$. As this slice is not reduced to a single processor, the two processors at the extremities of the slice, $P_k$ and $P_l$, can simultaneously send data items to their neighbors outside the slice, $P_{k-1}$ and $P_{l+1}$, respectively. Therefore, during any time interval of length c, at most two data items can be sent from the slice. So, it takes at least a time of $\lceil \delta_{k,l} / 2 \rceil\, c$ for the slice $C_{k,l}$ to send $\delta_{k,l}$ data items. Once again, the reasoning is similar when receiving data items if $\delta_{k,l} < 0$. Hence a second lower bound on τ:
$$\tau \ge \max_{1 \le i \le n,\ 1 \le l \le n-2} \Bigl\lceil \frac{|\delta_{i,i+l}|}{2} \Bigr\rceil\, c.$$
We just gather the previous two lower bounds to obtain the desired bound. □
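The two bounds gathered in this proof can be evaluated directly. The Python sketch below does so under assumed conventions (0-based indices, a hypothetical function name, a single homogeneous link cost c); it is an illustration of the bound, not code from the paper.

```python
def bidirectional_lower_bound(deltas, c):
    """Lower bound for a homogeneous bidirectional ring: single processors
    count in full, non-trivial slices count for half (rounded up)."""
    n = len(deltas)
    if sum(deltas) != 0:
        raise ValueError("the unbalances must sum to zero (conservation law)")
    single = max(abs(d) for d in deltas)
    halved = 0
    for k in range(n):
        acc = deltas[k]
        for l in range(1, n - 1):          # slices of 2 .. n-1 processors
            acc += deltas[(k + l) % n]
            halved = max(halved, (abs(acc) + 1) // 2)   # ceil(|acc| / 2)
    return c * max(single, halved)


# Hypothetical instance: the slice {P_0, P_1} must export four items.
bound = bidirectional_lower_bound([2, 2, -2, -2], c=1)
```

For this instance the bound is 2, whereas the unidirectional bound of Lemma 1 would be 4: the slice made of the first two processors can export its four surplus items through both of its extremities.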
An Optimal Algorithm
Algorithm 3 is a recursive algorithm which defines communication patterns designed so as to decrease the value of δ max (computed at step 1) by one from one recursive call to another. The intuition behind Algorithm 3 is the following.
1. Any non-trivial slice $C_{k,l}$ such that $\lceil |\delta_{k,l}| / 2 \rceil = \delta_{max}$ and $\delta_{k,l} \ge 0$ must send two data items per recursive call, one through each of its extremities.
2. Any non-trivial slice $C_{k,l}$ such that $\lceil |\delta_{k,l}| / 2 \rceil = \delta_{max}$ and $\delta_{k,l} \le 0$ must receive two data items per recursive call, one through each of its extremities.
3. Once the mandatory communications specified by the two previous cases are defined, we take care of any processor P i such that |δ i | = δ max . If P i is already involved in a communication due to the previous cases, everything is settled. Otherwise, we have the freedom to choose whom P i will send a data item to (case δ i > 0) or whom P i will receive a data item from (case δ i < 0). To simplify the algorithm we decide that all these communications will take place in the direction from P i to P i + 1 .
Algorithm 3 is initially called with the parameter s = 1. For any call to Algorithm 3, all the communications take place in parallel and exactly at the same time, because the communication paths are homogeneous by hypothesis. One very important point about Algorithm 3 is that this algorithm is a set of rules which only specify which processor P i must send a data item to which processor P j , one of its immediate neighbors. Therefore, whatever the number of rules deciding that there must be some data item sent from a processor P i to one of its immediate neighbor P j , only one data item is sent from P i to P j to satisfy all these rules.
To prove that Algorithm 3 is optimal, we show that the set of rules is consistent, i.e. that it respects the one-port model, and that the value δ max (computed at step 1) decreases by one at each recursive call.
Lemma 7. Algorithm 3 satisfies all the one-port constraints.
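Lemma 8. Algorithm 3 terminates in exactly $\max\bigl\{ \max_{1 \le i \le n} |\delta_i|,\ \max_{1 \le i \le n,\ 1 \le l \le n-2} \lceil |\delta_{i,i+l}| / 2 \rceil \bigr\}$ recursive calls.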
The optimality of Algorithm 3 is then a simple corollary of Lemma 8 and of the lower bound defined by equation (5) (the missing proofs can be found in Renard, Robert, and Vivien 2004a).
Theorem 3. Algorithm 3 is optimal.
Heterogeneous Bidirectional Ring
In this section, we consider the most general case, that of a heterogeneous bidirectional ring. We do not know any optimal redistribution algorithm in this case. However, if we assume that each processor initially holds more data than it needs to send during the whole execution of the algorithm (what we call a light redistribution), then we succeed in deriving an optimal solution.
Light Redistribution
Throughout this section, we suppose that we have a light redistribution: we assume that the number of data items sent by any processor throughout the redistribution algorithm is less than or equal to its original load. There are two reasons for a processor P i to send data: (i) because it is overloaded (δ i > 0); (ii) because it has to forward some data to another processor located further in the ring. If P i initially holds at least as many data items as it will send during the whole execution, then P i can send at once all these data items. Otherwise, in the general case, some processors may wait to have received data items from a neighbor before being able to forward them to another neighbor.
Solution by Integer Linear Programming
Under the "light redistribution" assumption, we can build an integer linear program to solve our problem (see system 6). Let S be one of its solutions, and denote by $S_{i,i+1}$ the number of data items that processor $P_i$ sends to processor $P_{i+1}$. Similarly, $S_{i,i-1}$ is the number of data items that $P_i$ sends to processor $P_{i-1}$. In order to ease the writing of the equations, we impose in the first two equations of system 6 that $S_{i,i+1}$ and $S_{i,i-1}$ are non-negative for all i, which imposes the use of other variables $S_{i+1,i}$ and $S_{i-1,i}$ for the symmetric communications. The third equation states that after the redistribution, there is no more unbalance. We denote by τ the execution time of the redistribution. For any processor $P_i$, due to the one-port constraints, τ must be greater than the time spent by $P_i$ to send data items (fourth equation) or spent by $P_i$ to receive data items (fifth equation). Our aim is to minimize τ; hence the system:
$$\text{MINIMIZE } \tau \text{ SUBJECT TO} \quad
\begin{cases}
S_{i,i+1} \ge 0 & 1 \le i \le n \\
S_{i,i-1} \ge 0 & 1 \le i \le n \\
S_{i,i+1} + S_{i,i-1} - S_{i+1,i} - S_{i-1,i} = \delta_i & 1 \le i \le n \\
S_{i,i+1}\, c_{i,i+1} + S_{i,i-1}\, c_{i,i-1} \le \tau & 1 \le i \le n \\
S_{i+1,i}\, c_{i+1,i} + S_{i-1,i}\, c_{i-1,i} \le \tau & 1 \le i \le n
\end{cases} \qquad (6)$$
Algorithm 3. Redistribution algorithm for homogeneous bidirectional rings (for step s).
Lemma 9. Any optimal solution of system 6 is feasible, for example using the following schedule: for any i ∈ [1, n], P i starts sending data items to P i + 1 at time 0 and, after the completion of this communication, starts sending data items to P i -1 as soon as possible under the one-port model.
Proof. We have to show that we are able to schedule the communications defined by any optimal solution (S, τ) of system 6 so that the redistribution takes a time no greater than τ. For any i ∈ [1, n], we schedule at time 0 all emissions from $P_i$ to $P_{i+1}$. This communication is done in time $S_{i,i+1}\, c_{i,i+1}$: because of the "light redistribution" hypothesis, $P_i$ already holds all the data items that it must send. Because of the fourth equation of system 6, this communication ends before the time τ.
For any value of i ∈ [1, n], we still have to schedule the sending of data items from $P_i$ to $P_{i-1}$. We schedule this communication as soon as possible; therefore, at time $\max\{S_{i,i+1}\, c_{i,i+1},\ S_{i-2,i-1}\, c_{i-2,i-1}\}$, i.e. at the earliest time when (i) $P_i$ has ended sending data items to $P_{i+1}$, and (ii) $P_{i-1}$ has stopped receiving data items from $P_{i-2}$. Therefore, the communication from $P_i$ to $P_{i-1}$ ends at the date:
$$\max\{ S_{i,i+1}\, c_{i,i+1} + S_{i,i-1}\, c_{i,i-1},\ \ S_{i-2,i-1}\, c_{i-2,i-1} + S_{i,i-1}\, c_{i,i-1} \}.$$
Once again, this is true owing to the "light redistribution" hypothesis: no processor needs to wait to have received some data items before being able to send them to one of its neighbors.
The first term of the "max" expression is the time needed by P i to send data items to both P i + 1 and P i -1 . This term is less than or equal to τ because of the fourth equation of system 6. The second term of the "max" expression is the time needed by P i -1 to receive data items from both P i -2 and P i . This term is less than or equal to τ because of the fifth equation of system 6. □
So far, we have not mathematically defined a condition for the "light redistribution" hypothesis to hold. In fact, this is not mandatory: we use system 6 to find an optimal solution to the problem. If, in this optimal solution, for any processor P i , the total number of data items sent is less than or equal to the initial load (S i, i + 1 + S i, i -1 ≤ L i ), we are under the "light redistribution" hypothesis and we can use the solution of system 6 safely.
Solution Through Rational Linear Programming
Even if the "light redistribution" hypothesis holds, we may wish to solve the redistribution problem with a technique less expensive than integer linear programming (which is potentially exponential). An idea would be to first solve system 6 to find an optimal rational solution, which can always be done in polynomial time, and then to round the obtained solution to find a "good" integer solution. In fact, the following theorem shows that one of the two natural ways of rounding always leads to an optimal (integer) solution. The complexity of the light redistribution problem is therefore polynomial.
Theorem 4. Let $R$ be an optimal rational solution to the redistribution problem. For any $j$ in $[1, n]$, $R_j$ denotes the number of data items that processor $P_j$ sends to processor $P_{j+1}$ (using the notations of system 6, $R_j = S_{j,j+1} - S_{j+1,j}$). Let $F$ be the integer solution defined by $F_1 = \lfloor R_1 \rfloor$. Let $G$ be the integer solution defined by $G_1 = \lceil R_1 \rceil$. Then: (i) $F$ and $G$ are well defined by the single condition above; (ii) either $F$ or $G$ is an optimal integer solution.
Proof. Lemma 10 states that $F$ and $G$ are both fully defined. Lemma 11 states that there exists at least one optimal integer solution $E$ such that $|E_1 - R_1| < 1$. The only two solutions satisfying these constraints are $F$ and $G$. Hence the result.

Lemma 10. To fully define the number of data items sent between processors in any redistribution scheme, we only need to define, for a single given value of $j \in [1, n]$, the number of data items that processor $P_j$ sends to processor $P_{j+1}$.
Lemma 11. Let $R$ be an optimal rational solution to the redistribution problem: for any $j$ in $[1, n]$, $R_j$ denotes the number of data items processor $P_j$ sends to processor $P_{j+1}$. Then, there exists an optimal integer solution $E$ to the redistribution problem such that $|E_1 - R_1| < 1$.

MINIMIZE $\tau$ SUBJECT TO [constraints missing in the source]
The missing proofs can be found in Renard, Robert, and Vivien (2004a).
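To make Theorem 4 and Lemma 10 concrete, here is a hedged Python sketch of the rounding step. It assumes that the first component $R_1$ of an optimal rational solution is already available (for instance from any linear programming solver; computing it is not shown), that the imbalances $\delta_k$ are integers summing to zero, and that no link carries data in both directions at once. The function names and the encoding (`c_fwd[j]` for $c_{j,j+1}$, `c_bwd[j]` for $c_{j,j-1}$, 0-based indices modulo $n$) are illustrative assumptions. The recurrence used for Lemma 10 is the conservation law at each processor, $R_j - R_{j-1} = \delta_j$.

```python
import math
from fractions import Fraction


def flows_from_first(first, delta):
    """Lemma 10 sketch: all net flows R_j follow from R_1 and the imbalances."""
    assert sum(delta) == 0, "conservation law violated"
    flows = [first]
    for d in delta[1:]:
        flows.append(flows[-1] + d)  # R_j = R_{j-1} + delta_j
    return flows


def makespan(flows, c_fwd, c_bwd):
    """Largest per-processor send or receive time for the given net flows.

    A positive flows[j] travels from P_j to P_{j+1}; a negative one travels
    from P_{j+1} to P_j (assumption: a link is never used both ways).
    """
    n = len(flows)
    worst = 0
    for i in range(n):
        out_fwd = max(flows[i], 0)        # P_i -> P_{i+1}
        out_bwd = max(-flows[i - 1], 0)   # P_i -> P_{i-1}
        in_fwd = max(flows[i - 1], 0)     # P_{i-1} -> P_i
        in_bwd = max(-flows[i], 0)        # P_{i+1} -> P_i
        send = out_fwd * c_fwd[i] + out_bwd * c_bwd[i]
        recv = in_fwd * c_fwd[i - 1] + in_bwd * c_bwd[(i + 1) % n]
        worst = max(worst, send, recv)
    return worst


def best_rounding(r1, delta, c_fwd, c_bwd):
    """Build F (floor of R_1) and G (ceiling of R_1); keep the cheaper one."""
    F = flows_from_first(math.floor(r1), delta)
    G = flows_from_first(math.ceil(r1), delta)
    return min((F, G), key=lambda flows: makespan(flows, c_fwd, c_bwd))


if __name__ == "__main__":
    # Made-up instance; R_1 = 6/5 is assumed to come from a rational LP solver.
    delta = [3, 0, -3, 0]
    c_fwd, c_bwd = [1, 4, 1, 1], [3, 1, 1, 1]
    best = best_rounding(Fraction(6, 5), delta, c_fwd, c_bwd)
    print(best, makespan(best, c_fwd, c_bwd))
```

The optimality guarantee of Theorem 4 only applies when the supplied value really is the first component of an optimal rational solution; for an arbitrary input the function merely returns the cheaper of the two roundings.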
ularity changes from one computational kernel to the other, moving from a CYCLIC(r) distribution over p processors to a CYCLIC(s) distribution over q processors is a very useful redistribution procedure, which has been implemented using a caterpillar algorithm in ScaLAPACK (Prylli and Tourancheau 1997). Several papers, including Kalns and Ni (1995), Thakur, Choudhary, and Ramanujam (1996), Desprez et al. (1998), Park, Prasanna, and Raghavendra (1999), Garcia, Ayguadé, and Labarta (2001), Hsu et al. (2001), and Knoop and Mehofer (2002), have dealt with various optimizations of this redistribution procedure. Along this line of research, automatic data redistribution tools are presented in Garcia, Ayguadé, and Labarta (2001).
Even though we have not dealt with load-balancing algorithms in this paper, we quote some key references on the subject. For homogeneous platforms, see the collection of papers (Shirazi, Hurson, and Kavi 1995), and for heterogeneous clusters see chapter 25 in Buyya (1999). Several authors (Nicol and Saltz 1988; Nicol and Reynolds 1990; Flaherty et al. 1997a; Watts and Taylor 1998; Hu and Blake 1999) have proposed a mapping policy which dynamically minimizes system degradation (including the cost of remapping) for each computation step. Static strategies aiming at distributing independent chunks of work to two-dimensional processor grids are studied in Barbosa, Tavares, and Padilha (2000) and Beaumont et al. (2001a). Relaxing the geometrical constraints induced by two-dimensional grids leads to irregular partitionings (Crandall and Quinn 1993; Kaddoura, Ranka, and Wang 1996; Beaumont et al. 2001b) that allow for good load balancing but are much more difficult to implement. This approach has been extended to three-dimensional problems (Flaherty et al. 1997b).
Finally, we briefly mention three sample applications whose implementation can directly benefit from the redistribution strategies designed in this paper. The analysis of pulses propagating in a nonlinear medium calls for adaptive computational windows, and redistribution must occur frequently as the computation progresses (Bourgeade and Nkonga 2004). A two-level redistribution procedure is advocated in Lan, Taylor, and Bryan (2001) for structured adaptive mesh refinement. A multilevel diffusion re-partitioner is presented in Kumar (1997, 2000) for irregular grid computations and has been incorporated into the ParMetis library. Of course this short list could be extended dramatically.
Simulation Results
Due to lack of space, we refer the reader to Renard, Robert, and Vivien (2004a, 2004b) for the details. As expected, when the computation-to-communication ratio is high, the best strategy is to use no redistribution, as the cost is prohibitive. Conversely, when the computation-to-communication ratio is low, it pays off to use many redistributions, but not too many. As the ratio decreases, all trade-offs can be found.
Conclusion
We have considered the problem of redistributing data on rings of processors. For homogeneous rings, the problem has been completely solved. Indeed, we have designed optimal algorithms, and provided formal proofs of correctness, for both unidirectional and bidirectional rings. The bidirectional algorithm turned out to be quite complex, and requires a lengthy proof.
For heterogeneous rings, there remains further research to be conducted. The unidirectional case was easily solved, but the bidirectional case remains open. Still, we have derived an optimal solution for light redistributions, an important case in practice. The complexity of the bound for the general case shows that designing an optimal algorithm is likely to be a difficult task.
All our algorithms have been implemented and extensively tested. As expected, in some cases the cost of a data redistribution outweighs the benefit of correcting a slight imbalance of the work. Further work will aim at investigating how frequently redistributions must occur in real-life applications.
Author Biographies
Hélène Renard is currently a Ph.D. student in the Computer Science Laboratory LIP at ENS Lyon. She is mainly interested in parallel algorithm design for heterogeneous platforms and in load-balancing techniques.
Yves Robert received a Ph.D. from Institut National Polytechnique de Grenoble in 1986. He is currently a full professor in the Computer Science Laboratory LIP at ENS Lyon. He is the author of four books, 85 papers published in international journals, and 110 papers published in international conferences. His main research interests are scheduling techniques and parallel algorithms for clusters and grids. He is a senior member of IEEE, and serves as an associate editor of IEEE Transactions on Parallel and Distributed Systems. 
