Data redistribution algorithms for heterogeneous processor rings by Renard, Hélène et al.
HAL Id: hal-02102059
https://hal-lara.archives-ouvertes.fr/hal-02102059
Submitted on 17 Apr 2019
Data redistribution algorithms for heterogeneous
processor rings
Hélène Renard, Yves Robert, Frédéric Vivien
To cite this version:
Hélène Renard, Yves Robert, Frédéric Vivien. Data redistribution algorithms for heterogeneous processor rings. [Research Report] Laboratoire de l’informatique du parallélisme. 2004, 2+26p. hal-02102059
Laboratoire de l’Informatique du Parallélisme
École Normale Supérieure de Lyon
Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL no 5668
Data redistribution algorithms
for heterogeneous processor rings
Hélène Renard,
Yves Robert,
Frédéric Vivien
May 2004
Research Report No 2004-28
École Normale Supérieure de Lyon
46 Allée d’Italie, 69364 Lyon Cedex 07, France
Téléphone : +33(0)4.72.72.80.37
Télécopieur : +33(0)4.72.72.80.80
Adresse électronique : lip@ens-lyon.fr
Abstract
We consider the problem of redistributing data on homogeneous and heterogeneous
rings of processors. The problem arises in several applications, each time
a load-balancing mechanism is invoked (but we do not discuss the
load-balancing mechanism itself). We provide algorithms that aim at optimizing
the data redistribution, both for unidirectional and bidirectional rings,
and we give complete proofs of correctness. One major contribution of the paper
is that we are able to prove the optimality of the proposed algorithms in all
cases except that of a bidirectional heterogeneous ring, for which the problem
remains open.
Keywords: Heterogeneous rings, data redistribution algorithms, load-balancing
1 Introduction
In this paper, we consider the problem of redistributing data on a heterogeneous ring of processors.
The problem typically arises when a load-balancing phase must be initiated. Because
of variations either in resource performance (CPU speed, communication bandwidth) or in the
system/application requirements (completed tasks, new tasks, migrated tasks, etc.), data must
be redistributed between the participating processors so that the current (estimated) load is better
balanced. We do not discuss the load-balancing mechanism itself (we take it as external, be it a
system, an algorithm, an oracle, or anything else). Rather, we aim at optimizing the data redistribution
induced by the load-balancing mechanism.
We adopt the following abstract view of the problem. There are n participating processors
P1, P2, . . . , Pn. Each processor Pk initially holds Lk atomic data items. The load-balancing
system/algorithm/oracle has decided that the new load of Pk should be Lk − δk. If δk > 0, this
means that Pk now is overloaded and should send δk data items to other processors; if δk < 0, Pk is
under-loaded and should receive −δk items from other processors. Of course there is a conservation
law: $\sum_{k=1}^{n} \delta_k = 0$. The goal is to determine the required communications and to organize them
(what we call the data redistribution) in minimal time.
We assume that the participating processors are arranged along a ring, either unidirectional
or bidirectional, and either with homogeneous or heterogeneous link bandwidths, hence a total of
four different frameworks to deal with. There are two main contexts in which processor rings are
useful. The first context is that of the many applications which operate on ordered data, and where
the order needs to be preserved. Think of a large matrix whose columns are distributed among the
processors, but with the condition that each processor operates on a slice of consecutive columns.
An overloaded processor Pi can send its first columns to the processor Pj that is assigned the slice
preceding its own slice (and Pj would append these columns to the end of its slice); similarly, Pi
can send its last columns to the processor which is assigned the next slice; obviously, these are
the only possibilities. In other words, the ordered uni-dimensional data distribution calls for a
uni-dimensional arrangement of the processors, i.e., along a ring.
The second context that may call for a ring is the simplicity of the programming. Using a ring,
either uni- or bi-directional, allows for a simpler management of the data to be redistributed. Data
intervals can be maintained and updated to characterize each processor load. Finally, we observe
that parallel machines with a rich but fixed interconnection topology (hypercubes, fat trees, grids,
to quote a few) are on the decline. Heterogeneous cluster architectures, which we target in this
paper, have a largely unknown interconnection graph, which includes gateways, backbones, and
switches, and modeling the communication graph as a ring is a reasonable, if conservative, choice.
As stated above, we discuss four cases for the redistribution algorithms. We delay the formal
statement of the redistribution problems until Section 2, but we summarize the main results as
follows. In the simplest case, that of a unidirectional homogeneous ring, we derive an optimal
algorithm, and we prove its correctness in full detail. Because the target architecture is quite
simple, we are able to provide explicit (analytical) formulas for the number of data sent/received
by each processor. The same holds true for the case of a bidirectional homogeneous ring, but the
algorithm becomes more complicated. When assuming heterogeneous communication links, we
still derive an optimal algorithm for the unidirectional case, but we have to use an asynchronous
formulation. However, we have to resort to heuristics based upon linear programming relaxation
for the bidirectional case. We point out that one major contribution of the paper is the design of
optimal algorithms, together with their formal proofs of correctness: to the best of our knowledge,
this is the first time that optimal algorithms are introduced for this redistribution problem.
The rest of the paper is organized as follows. In Section 2 we formally state the optimization
problem. For homogeneous networks (all links have same capacity), the optimal algorithms are
described in Section 3 (unidirectional ring) and in Section 5 (bidirectional ring). For heteroge-
neous networks, the optimal asynchronous unidirectional algorithm is presented in Section 4, and
the linear-programming based optimal algorithm for light redistributions on bidirectional links is
explained in Section 6. Section 7 is devoted to a survey of related work. In Section 8, we report
some simulation results that confirm the usefulness of data redistributions. Finally, Section 9
concludes the paper and highlights future work directions.
2 Framework
We consider a set of n processors P1, P2, . . . , Pn arranged along a ring. The successor of Pi in the
ring is Pi+1, and its predecessor is Pi−1, where all indices are taken modulo n. For 1 ≤ k, l ≤ n,
Ck,l denotes the slice of consecutive processors Ck,l = Pk, Pk+1, . . . , Pl−1, Pl.
We denote by ci,i+1 the capacity of the communication link from Pi to Pi+1. In other words,
it takes ci,i+1 time-units to send a data item from processor Pi to processor Pi+1. In the case of
a bidirectional ring, ci,i−1 is the capacity of the link from Pi to Pi−1. We use the one-port model
for communications: at any given time, there are at most two communications involving a given
processor, one sent and the other received. A given processor can simultaneously send and receive
data, so there is no restriction in the unidirectional case; however, in the bidirectional case, a
given processor cannot simultaneously send data to its successor and its predecessor; neither can
it receive data from both sides. This is the only restriction induced by the model: any pair of
communications that does not violate the one-port constraint can take place in parallel.
Each processor Pk initially holds Lk atomic data items. After redistribution, Pk will hold
Lk − δk atomic data items. We call δk the unbalance of Pk. We denote by δk,l the total unbalance
of the processor slice Ck,l: δk,l = δk + δk+1 + · · · + δl−1 + δl.
Because of the conservation law of atomic data items, $\sum_{k=1}^{n} \delta_k = 0$. Obviously the unbalance
cannot be larger than the initial load: Lk ≥ δk. In fact, we suppose that any processor holds at
least one data item, both initially (Lk ≥ 1) and after the redistribution (Lk ≥ 1 + δk): otherwise we
would have to build a new ring from the subset of resources still involved in the computation.
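To make these definitions concrete, here is a minimal Python sketch. It is not part of the original report: the function names `slice_unbalance` and `check_instance`, the 0-based processor indices, and the sample loads are assumptions chosen for illustration only.

```python
def slice_unbalance(delta, k, l):
    """Total unbalance of the slice C_{k,l} = P_k, P_{k+1}, ..., P_l.

    Indices are 0-based and taken modulo n, so the slice may wrap
    around the end of the list (hypothetical helper, for illustration).
    """
    n = len(delta)
    total = 0
    i = k
    while True:
        total += delta[i]
        if i == l:
            break
        i = (i + 1) % n
    return total


def check_instance(loads, delta):
    """Check the assumptions of the framework on one instance."""
    assert len(loads) == len(delta)
    # Conservation law: the unbalances sum to zero.
    assert sum(delta) == 0, "conservation law violated"
    for load, d in zip(loads, delta):
        # At least one item initially (L_k >= 1) and afterwards (L_k >= 1 + delta_k).
        assert load >= 1 and load >= 1 + d
    return True


# Sample instance (made-up values): four processors on a ring.
loads = [5, 1, 3, 2]
delta = [3, -2, 1, -2]
check_instance(loads, delta)
wrapping = slice_unbalance(delta, 2, 0)  # slice P_2, P_3, P_0
```

In this example the wrapping slice P_2, P_3, P_0 has unbalance 1 − 2 + 3 = 2, which is the opposite of the unbalance of the remaining processor P_1.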
3 Homogeneous unidirectional ring
In this section, we consider a homogeneous unidirectional ring. Any processor Pi can only send
data items to its successor Pi+1, and ci,i+1 = c for all i ∈ [1, n]. We first derive a lower bound on
the running time of any redistribution algorithm. Then, we present an algorithm achieving this
bound (hence optimal), and we prove its correctness.
3.1 Lower bound
We have the following bound on the optimal redistribution time:
Lemma 1. Let τ be the optimal redistribution time. Then:
$$\tau \ge \left( \max_{1 \le k \le n,\ 0 \le l \le n-1} |\delta_{k,k+l}| \right) \times c. \quad (1)$$
Proof. The processor slice Ck,k+l = Pk, Pk+1, . . . , Pk+l−1, Pk+l has a total unbalance of δk,k+l =
δk+δk+1+· · ·+δk+l−1+δk+l. If δk,k+l > 0, δk,k+l data items must be sent from Ck,k+l to the other
processors. The ring is unidirectional, so Pk+l is the only processor in Ck,k+l with an outgoing
link. Furthermore, Pk+l needs a time equal to δk,k+l × c to send δk,k+l data items. Therefore, in
any case, a redistribution scheme cannot take less than δk,k+l × c to redistribute all data items.
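The same reasoning applies to the case δk,k+l < 0: Pk is the only processor of Ck,k+l with an
incoming link, and it needs a time of |δk,k+l| × c to receive the −δk,k+l missing data items.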
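The bound of Lemma 1 can be evaluated by enumerating all slices. The following Python sketch is an illustration only; the function names and the 0-based indexing are assumptions, not notation from the report.

```python
def max_slice_unbalance(delta):
    """Largest |delta_{k,k+l}| over all slices of the ring (0-based indices)."""
    n = len(delta)
    best = 0
    for k in range(n):
        running = 0
        for l in range(n):
            # running is delta_{k,k+l}, built incrementally around the ring
            running += delta[(k + l) % n]
            best = max(best, abs(running))
    return best


def lower_bound_uni_homogeneous(delta, c):
    """Right-hand side of Equation (1) for a homogeneous unidirectional ring."""
    return max_slice_unbalance(delta) * c


# Made-up instance: the slice reduced to P_0 has unbalance 3,
# so no redistribution can take less than 3 * c time units.
bound = lower_bound_uni_homogeneous([3, -2, 1, -2], c=2)
```

The enumeration costs O(n²) additions, which is enough for a sketch; for this sample instance the bound evaluates to 6.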
3.2 An optimal algorithm
We introduce the following redistribution algorithm (Algorithm 1).
We first prove the correctness of Algorithm 1 (Lemma 3). Secondly, we prove its optimality
(Lemma 4). Intuitively, if Step 6 of this algorithm is always feasible, then each iteration of the loop
at Step 3 has a length of exactly c, and the algorithm will meet the time bound of Lemma 1.
Algorithm 1 Redistribution algorithm for homogeneous unidirectional rings
1: Let δmax = max1≤k≤n,0≤l≤n−1 |δk,k+l|
2: Let start and end be two indices such that the slice Cstart,end is of maximal unbalance: δstart,end = δmax.
3: for s = 1 to δmax do
4:     for all l = 0 to n − 1 do
5:         if δstart,start+l ≥ s then
6:             Pstart+l sends to Pstart+l+1 a data item during the time interval [(s − 1) × c, s × c[
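The steps of Algorithm 1 can be simulated directly. The Python sketch below is a hedged illustration rather than the authors' implementation: the function name `algorithm1`, the 0-based indices, and the sample instance are assumptions. It also asserts, at each iteration, that every sender holds at least one data item.

```python
def algorithm1(loads, delta, c=1):
    """Simulate Algorithm 1 on a homogeneous unidirectional ring.

    Returns (total_time, final_loads). Processors are numbered from 0.
    """
    n = len(delta)
    # Steps 1-2: find delta_max and the first index of a slice reaching it.
    delta_max, start = 0, 0
    for k in range(n):
        running = 0
        for l in range(n):
            running += delta[(k + l) % n]
            if running > delta_max:
                delta_max, start = running, k
    # prefix[l] stands for delta_{start,start+l}
    prefix, running = [], 0
    for l in range(n):
        running += delta[(start + l) % n]
        prefix.append(running)

    load = list(loads)
    # Step 3: one iteration per time interval [(s-1)*c, s*c[
    for s in range(1, delta_max + 1):
        # Steps 4-5: P_{start+l} is a sender when delta_{start,start+l} >= s
        senders = [(start + l) % n for l in range(n) if prefix[l] >= s]
        for i in senders:
            # A sender must own an item before the iteration starts.
            assert load[i] >= 1
        # Step 6: all the sends of iteration s happen in parallel.
        for i in senders:
            load[i] -= 1
            load[(i + 1) % n] += 1
    return delta_max * c, load


total_time, final_loads = algorithm1([5, 1, 3, 2], [3, -2, 1, -2], c=2)
```

On this sample instance the simulation ends after three iterations, and each final load equals the initial load minus the unbalance.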
First, we point out that the slice Cstart,end is well-defined in Step 2 of the algorithm: for
any slice with an unbalance δ, the slice made up from the remaining processors has the opposite
unbalance −δ. Next, we state the particular role of the processor Pstart:
Lemma 2. Processor Pstart receives no data items during the execution of Algorithm 1.
Proof. We prove the result by contradiction. Suppose that at a given iteration s processor Pstart
receives some data items. Then the predecessor of Pstart in the ring, Pstart−1, sends a data
item at this iteration. Thus, Pstart−1 being a sender, by the condition at Step 5 of Algorithm 1,
$\sum_{j=0}^{n-1} \delta_{start+j} = \delta_{start,start-1} \ge s$. However, due to the conservation law, $\sum_{i=1}^{n} \delta_i = 0$. Hence,
0 ≥ s, the desired contradiction.
To prove that Algorithm 1 is correct, we must show that during each iteration, any processor
required to send a data item in Step 6 actually holds at least one data item at this iteration.
In other words, we must prove that no processor is asked to send a data item that it does not
currently own. Let $L_i^s$ be the load of Pi at the end of iteration s of Algorithm 1:
Lemma 3. During iteration s of loop 3, if Pi sends a data item, then $L_i^{s-1} \ge 1$.
Proof. We prove Lemma 3 by induction. Initially, by definition of unbalances (see Section 2), we
know that each processor Pi in the ring initially holds an amount of $L_i^0 = L_i \ge 1$ data items. Thus
the result holds for s = 1.
Now we suppose that the result holds up to a certain iteration s (included), and we focus on
iteration s + 1. There are two cases to consider, depending on whether processor Pi is supposed to
receive a data item during iteration s + 1 or not:
1. If processor Pi is both a sender and a receiver during iteration s + 1, then Pi is both a sender
and a receiver during iteration s by the condition at Step 5 of Algorithm 1. Then the load
of Pi after iteration s was the same as before that iteration: $L_i^s = L_i^{s-1}$. We conclude
using the induction hypothesis.
2. If processor Pi is a sender but not a receiver during iteration s + 1, we must verify that
Pi does not send a data item that it does not own. Because Pi is a sender, then, by the
condition at Step 5 of Algorithm 1, we have:
δstart,i ≥ s + 1. (2)
Furthermore, Pi has sent a data item during each of the previous iterations.
During iteration s + 1, Pi is not a receiver. Thus, Pi−1 is not a sender during this iteration,
and, by the condition at Step 5 of Algorithm 1, we have: δstart,i−1 < s + 1. During
each iteration from 1 to δstart,i−1, Pi−1 has sent a data item (see below for the proof that
δstart,start+j ≥ 0 for all j ∈ [0, n − 1]). Hence, during each of these iterations, Pi was both
a sender and a receiver, and neither its load nor its unbalance changed.
During each iteration from 1 + δstart,i−1 to s, processor Pi was a sender but not a receiver.
So both its load and its unbalance decreased by one during each of these iterations. Hence:
$$L_i^s = L_i - (s - \delta_{start,i-1}). \quad (3)$$
However, δi + δstart,i−1 = δstart,i. So Equation 3 is equivalent to: $L_i^s = L_i - \delta_i + \delta_{start,i} - s$.
From Equation 2 we know that δstart,i − s ≥ 1. In Section 2, we assumed that Li ≥ 1 + δi.
So, $L_i^s \ge 2$.
The above proof relies on the property that, for any value of j ∈ [0, n − 1], δstart,start+j ≥ 0.
We now prove this result by contradiction. Hence we suppose that there exists a value j such that
δstart,start+j < 0. We have two cases to consider:
1. j + start ∈ [start, end]. Then δstart,end = δstart,start+j + δstart+j+1,end and δstart,end <
δstart+j+1,end which contradicts the maximality of Cstart,end.
2. j + start /∈ [start, end]. Then δstart,j+start = δstart,end + δ1+end,j+start. So δstart,end <
−δ1+end,j+start. However, as the sum of unbalances is null by definition, the sum of unbal-
ances of C1+end,j+start is equal to the opposite of the sum of unbalances of Cj+1+start,end.
Hence, δstart,end < δj+1+start,end, which contradicts the maximality of Cstart,end.
We have proved the correctness of Algorithm 1. We still have to prove that when it terminates,
the entire redistribution has actually been performed:
Lemma 4. When Algorithm 1 terminates after iteration δmax, i.e., at time τ , the load of any
processor Pi is equal to Li − δi.
Proof. We prove by induction on the processor indices, starting at processor Pstart, that any
processor Pstart+j has the desired load of Lstart+j − δstart+j at any iteration s ≥ max0≤i≤j δstart,start+i.
As stated by Lemma 2, processor Pstart never receives a data item during execution. So, after
δstart,start = δstart iterations of loop 3, Pstart is neither the receiver nor the sender of any data item.
As required, Pstart then holds exactly Lstart − δstart data items, i.e., its initial load minus the number
of data items sent.
We suppose the result proved up to a processor Pstart+l (with l ≥ 0) included. We focus
on processor Pstart+l+1. Using the induction hypothesis, we know that at any iteration s ≥
max0≤i≤l δstart,start+i, the total load of the slice Cstart,start+l is equal to
$\sum_{0 \le i \le l} L_{start+i} - \sum_{0 \le i \le l} \delta_{start+i}$.
During the execution of the whole algorithm, processor Pstart+l+1 has sent exactly δstart,start+l+1
data items (remember that we showed in the proof of Lemma 3 that for any j ∈ [0, n − 1],
δstart,start+j ≥ 0). All these send operations took place before or during iteration δstart,start+l+1.
Furthermore, Lemma 2 states that processor Pstart never receives a data item during the execu-
tion. So, the total load of the slice Cstart,start+l+1 does not change after iteration δstart,start+l+1,
and its total load is equal to its initial total load minus the data items sent by processor Pstart+l+1:
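$\left( \sum_{0 \le i \le l+1} L_{start+i} \right) - \delta_{start,start+l+1}$. Therefore, after any iteration s, where
$$s \ge \max\left( \max_{0 \le i \le l} \delta_{start,start+i},\ \delta_{start,start+l+1} \right) = \max_{0 \le i \le l+1} \delta_{start,start+i},$$
we know the total load of the slices Cstart,start+l and Cstart,start+l+1. Therefore, we know the
load of processor Pstart+l+1:
$$L_{start+l+1}^{s} = \left[ \left( \sum_{0 \le i \le l+1} L_{start+i} \right) - \delta_{start,start+l+1} \right] - \left[ \sum_{0 \le i \le l} L_{start+i} - \sum_{0 \le i \le l} \delta_{start+i} \right] = L_{start+l+1} - \delta_{start+l+1}.$$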
To conclude, we just need to remark that δmax = max0≤i≤n−1 δstart,start+i.
The optimality of Algorithm 1 is a direct consequence of the previous lemmas:
Theorem 1. Algorithm 1 is optimal.
4 Heterogeneous unidirectional ring
In this section we still suppose that the ring is unidirectional but we no longer assume the
communication links to have the same capacities. We build on the results of the previous section to
design an optimal algorithm (Algorithm 2 below). In this algorithm, the number of data items
sent by any processor Pi is exactly the same as in Algorithm 1 (namely δstart,i). However, as
the communication links have different capacities, we no longer have a synchronous behavior. A
processor Pi sends its δstart,i data items as soon as possible, but we cannot express its completion
time with a simple formula. Indeed, if Pi initially holds more data items than it has to send,
we have the same behavior as previously: Pi can send its data items during the time interval
[0, δstart,i × ci,i+1[. On the contrary, if Pi holds fewer data items than it has to send (Li < δstart,i),
Pi still starts to send some data items at time 0 but may have to wait until it has received some other
data items from Pi−1 to be able to forward them to Pi+1.
Algorithm 2 Redistribution algorithm for heterogeneous unidirectional rings
1: Let δmax = max1≤k≤n,0≤l≤n−1 |δk,k+l|
2: Let start and end be two indices such that the slice Cstart,end is of maximal unbalance: δstart,end = δmax.
3: for all l = 0 to n − 1 do
4:     Pstart+l sends δstart,start+l data items one by one and as soon as possible to processor Pstart+l+1
The asynchronous nature of Algorithm 2 implies that it is correct by construction: a processor waits
until it has received a data item before forwarding it. Furthermore, when the algorithm terminates, the
redistribution is complete (the proof is the same as in Lemma 4). It remains to prove that the running
time of Algorithm 2 is optimal. We first compute this running time:
Lemma 5. The running time of Algorithm 2 is max0≤l≤n−1 δstart,start+l × cstart+l,start+l+1.
The result of Lemma 5 is surprising. Intuitively, it says that the running time of Algorithm 2
is equal to the maximum of the communication times of all the processors, if each of them initially
stored locally all the data items it will have to send throughout the execution of the algorithm. In
other words, there is no forwarding delay, whatever the initial distribution. The proof of Lemma 5
is technical and can be omitted at first reading.
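The claim of Lemma 5 can be checked numerically on small instances. The Python sketch below simulates the as-soon-as-possible sends of Algorithm 2 and compares the simulated completion time with the closed formula. It is an illustration under stated assumptions: the function names, the 0-based indices, the convention that `cost[i]` stands for ci,i+1, and the sample instance are all hypothetical.

```python
def find_start(delta):
    """Return (start, prefix), where prefix[l] stands for delta_{start,start+l}."""
    n = len(delta)
    best, start = 0, 0
    for k in range(n):
        running = 0
        for l in range(n):
            running += delta[(k + l) % n]
            if running > best:
                best, start = running, k
    prefix, running = [], 0
    for l in range(n):
        running += delta[(start + l) % n]
        prefix.append(running)
    return start, prefix


def simulate_algorithm2(loads, delta, cost):
    """Completion time of Algorithm 2; cost[i] is the per-item time on link i -> i+1."""
    n = len(delta)
    start, prefix = find_start(delta)
    arrivals = []  # times at which the current processor receives items
    finish = 0
    for l in range(n):
        p = (start + l) % n
        t, sent = 0, []
        for q in range(prefix[l]):
            # The q-th item sent is an initial item if available,
            # otherwise the next item received from the predecessor.
            ready = 0 if q < loads[p] else arrivals[q - loads[p]]
            t = max(t, ready) + cost[p]
            sent.append(t)
        if sent:
            finish = max(finish, sent[-1])
        arrivals = sent  # these items reach the successor at these times
    return finish


def lemma5_formula(delta, cost):
    """max over l of delta_{start,start+l} * c_{start+l,start+l+1}."""
    n = len(delta)
    start, prefix = find_start(delta)
    return max(prefix[l] * cost[(start + l) % n] for l in range(n))


loads, delta, cost = [4, 1, 1, 4], [3, 0, 0, -3], [2, 1, 3, 1]
simulated = simulate_algorithm2(loads, delta, cost)
predicted = lemma5_formula(delta, cost)
```

In this instance P_1 and P_2 each hold a single item but must send three, so they forward items received from their predecessors; the simulated time nevertheless equals the value given by the formula, 3 × 3 = 9.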
Proof. We prove the result by contradiction, assuming that the running time of Algorithm 2,
denoted as tmax, is strictly greater than max0≤l≤n−1 δstart,start+l × cstart+l,start+l+1 (we assume
that the algorithm starts running at time 0). Let Pi be any processor whose running time is tmax,
i.e., let Pi be any processor which terminates the emission of its last data item at time tmax. By
hypothesis, tmax > δstart,i × ci,i+1. Therefore, there is some time during the running time of the
algorithm at which processor Pi is not sending any data items to processor Pi+1. Let ti denote
the latest time at which Pi is not sending any data items. Then, by definition of ti, from time
ti until the completion of the algorithm, processor Pi is continuously sending data items to Pi+1.
Let ni denote the number of data items that Pi sends during that interval. Note that we have
tmax = ti + ni × ci,i+1. We now prove by induction that for any value of j ≥ 1:
1. Processor Pi−j sends a data item to processor Pi−j+1 during the time interval
$\left[ t_i - \sum_{k=1}^{j} c_{i-k,i-k+1},\ t_i - \sum_{k=1}^{j-1} c_{i-k,i-k+1} \right]$.
2. Between time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$ and the completion of the algorithm, processor Pi−j sends
at least j + ni data items to processor Pi−j+1.
3. $c_{i-j,i-j+1} \le c_{i,i+1}$.
4. Right before time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$, processor Pi−j is not sending any data items to
processor Pi−j+1 (it is idle in sending).
Once we have proved these properties, the contradiction follows from considering processor Pstart.
Processor Pstart only sends data items that it initially holds (δstart = δstart,start ≤ Lstart), and
receives no data items from its predecessor in the ring. However, using the above properties, there
is a value of j ≥ 0 such that start = i − j, and between time $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$ and the
completion of the algorithm, processor Pi−j−1 sends at least j + 1 + ni data items to processor
Pi−j = Pstart. Hence the contradiction.
[Figure: timeline (axis: Time, from 0 to tmax) of the communications Pi−3 → Pi−2, Pi−2 → Pi−1, Pi−1 → Pi, and Pi → Pi+1, with the instants ti − ci−1,i − ci−2,i−1 − ci−3,i−2, ti − ci−1,i − ci−2,i−1, ti − ci−1,i, and ti marked.]
Figure 1: The construction used in the proof of Lemma 5.
The construction used in the proof is illustrated by Figure 1. We start by proving the above
properties for j = 1.
1. By definition of ti, processor Pi is not sending any data items to processor Pi+1 right before
time ti. Because of the “as-soon-as-possible” nature of the algorithm, processor Pi is not holding
a single data item right before time ti and is waiting for processor Pi−1 to send it one.
Furthermore, the data item that processor Pi starts to send at time ti is sent to it by
processor Pi−1 during the time interval [ti − ci−1,i, ti].
2. Between time ti and the completion of the algorithm, processor Pi sends ni data items to
processor Pi+1. By hypothesis, processor Pi holds at least one data item after the completion
of the algorithm. As Pi holds no data item right before time ti, then between the times
ti − ci−1,i and tmax, Pi−1 sends at least 1 + ni data items to Pi.
3. From what precedes, and using the relationship between ti, ni, and tmax, we infer:
$$t_i + n_i \times c_{i,i+1} = t_{max} \ge (t_i - c_{i-1,i}) + (1 + n_i) \times c_{i-1,i} \ \Rightarrow\ c_{i,i+1} \ge c_{i-1,i}$$
as ni is nonzero by definition.
4. Suppose that processor Pi−1 is sending a data item to processor Pi right before the time
ti − ci−1,i. Then, at the earliest, this data item is received by processor Pi at time ti − ci−1,i.
Due to the “as-soon-as-possible” nature of the algorithm, Pi forwards this data item to processor
Pi+1 (as it forwards data items received later). Pi finishes forwarding this data item at time
ti − ci−1,i + ci,i+1 ≥ ti at the earliest. Therefore, processor Pi has no reason not to be sending
any data item at time ti, which contradicts the definition of ti.
We now proceed to the general case of the induction. We suppose that the property is proved
up to a processor Pi−j included (with j ≥ 1).
1. By induction hypothesis, processor Pi−j is not sending any data items to processor Pi−j+1
right before time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$. Because of the “as-soon-as-possible” nature of the algorithm,
processor Pi−j is not holding a single data item right before this time and is waiting for
processor Pi−j−1 to send one. Furthermore, the data item that processor Pi−j starts to
send at time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$ is sent to it by processor Pi−j−1 during the time interval
$\left[ t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1},\ t_i - \sum_{k=1}^{j} c_{i-k,i-k+1} \right]$.
2. Between time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$ and the completion of the algorithm, processor Pi−j sends
at least j + ni data items to processor Pi−j+1, by induction hypothesis. By hypothesis, processor
Pi−j holds at least one data item after the completion of the algorithm. As Pi−j holds no
data item right before time $t_i - \sum_{k=1}^{j} c_{i-k,i-k+1}$, then between the times $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$
and tmax, Pi−j−1 sends at least 1 + j + ni data items to Pi−j.
3. From what precedes, and using the relationship between ti, ni, and tmax, we infer:
$$\begin{aligned}
t_i + n_i \times c_{i,i+1} = t_{max} &\ge \left( t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1} \right) + (1 + j + n_i) \times c_{i-j-1,i-j} \\
\Rightarrow\quad n_i \times c_{i,i+1} + \sum_{k=1}^{j} c_{i-k,i-k+1} &\ge (j + n_i) \times c_{i-j-1,i-j} \\
\Rightarrow\quad c_{i,i+1} &\ge c_{i-j-1,i-j}
\end{aligned}$$
as, by induction hypothesis, for any k ∈ [1, j], ci,i+1 ≥ ci−k,i−k+1.
4. Suppose that processor Pi−j−1 is sending a data item to processor Pi−j right before the
time $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$. Then, at the earliest, this data item is received by processor
Pi−j at time $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$. Due to the “as-soon-as-possible” nature of the algorithm, Pi−j
forwards this data item to processor Pi−j+1 (as it forwards data items received later). Pi−j
finishes forwarding this data item at time $t_i - c_{i-j-1,i-j} - \sum_{k=1}^{j-1} c_{i-k,i-k+1}$ at the earliest.
Then, following the same line of reasoning, processor Pi−j+1 forwards it to Pi−j+2, which
receives it at the earliest at time $t_i - c_{i-j-1,i-j} - \sum_{k=1}^{j-2} c_{i-k,i-k+1}$, and so on. So, processor
Pi receives this data item at the earliest at time ti − ci−j−1,i−j, and forwards it. Then, it
finishes sending it at the earliest at time ti − ci−j−1,i−j + ci,i+1 ≥ ti, as we have seen that
ci,i+1 ≥ ci−j−1,i−j. Therefore, processor Pi has no reason not to be sending any data items
at time ti, which contradicts the definition of ti. Hence, processor Pi−j−1 is not sending any
data item to processor Pi−j right before the time $t_i - \sum_{k=1}^{j+1} c_{i-k,i-k+1}$.
Theorem 2. Algorithm 2 is optimal.
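Proof. Let τ denote the optimal redistribution time. Following the arguments used in the proof
of Lemma 1 for the homogeneous case in Section 3.1, a slice Ck,k+l of positive unbalance must send
δk,k+l data items through the outgoing link of Pk+l, and we obtain the lower bound:
$$\tau \ge \max_{1 \le k \le n,\ 0 \le l \le n-1} \delta_{k,k+l} \times c_{k+l,k+l+1}.$$
We conclude using Lemma 5, since δstart,start+l ≥ 0 for all l ∈ [0, n − 1].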
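The lower bound used in this proof can be evaluated as follows. This Python sketch is an illustration only. It makes one assumption explicit: it restricts the maximum to slices of positive unbalance, for which the argument of Lemma 1 applies directly to the outgoing link of the last processor of the slice. The function name and the convention that `cost[i]` stands for ci,i+1 (0-based) are hypothetical.

```python
def lower_bound_uni_heterogeneous(delta, cost):
    """Lower bound on the redistribution time for a heterogeneous unidirectional ring.

    Assumption of this sketch: only slices with positive unbalance are
    considered, each one being limited by the outgoing link of its last processor.
    """
    n = len(delta)
    best = 0
    for k in range(n):
        running = 0
        for l in range(n):
            last = (k + l) % n
            running += delta[last]  # running is delta_{k,k+l}
            if running > 0:
                best = max(best, running * cost[last])
    return best


# Made-up instance: the slice P_0, P_1, P_2 must push 3 items
# through the slow link of cost 3 leaving P_2.
bound = lower_bound_uni_heterogeneous([3, 0, 0, -3], [2, 1, 3, 1])
```

For this instance the bound is 9, the same value as the running time given by the formula of Lemma 5 on the same data.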
5 Homogeneous bidirectional ring
In this section, we consider a homogeneous bidirectional ring. All links have the same capacity
but a processor can send data items to its two neighbors in the ring: there exists a constant c such
that, for all i ∈ [1, n], ci,i+1 = ci,i−1 = c. We proceed as for the homogeneous unidirectional case:
we first derive a lower bound on the running time of any redistribution algorithm, and then we
present an algorithm attaining this bound.
5.1 Lower bound
We have the following bound on the optimal redistribution time:
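Lemma 6. Let τ be the optimal redistribution time. Then:
$$\tau \ge \max\left\{ \max_{1 \le i \le n} |\delta_i|,\ \max_{1 \le i \le n,\ 1 \le l \le n-1} \left\lceil \frac{|\delta_{i,i+l}|}{2} \right\rceil \right\} \times c. \quad (4)$$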
Proof. Consider any processor Pi with positive unbalance (δi > 0). Even if processor Pi can send
data items to both of its neighbors, because of the one-port model, it cannot send data items to
both of them simultaneously. So processor Pi needs a time of at least δi × c to send δi data
items, whatever the destinations of these data items. We have a symmetric result for the case
δi < 0. Hence a first lower bound on the optimal redistribution time τ:
$$\tau \ge \left( \max_{1 \le i \le n} |\delta_i| \right) \times c.$$
Now, consider any non trivial slice of consecutive processors Ck,l. By “non trivial” we mean
that the slice is not reduced to a single processor (we already treated that case) and that it does
not contain all processors. We suppose that δk,l > 0. So, in any redistribution scheme, at least
δk,l data items must be sent by Ck,l. As this slice is not reduced to a single processor, the two
processors at the extremities of the slice, Pk and Pl, can simultaneously send data items to their
neighbors outside of the slice, Pk−1 and Pl+1 respectively. Therefore, during any time interval of
length c, at most two data items can be sent from the slice. So, it takes a time of at least ⌈δk,l/2⌉ × c
for the slice Ck,l to send δk,l data items. Once again, the reasoning is similar when receiving data
items if δk,l < 0. Hence a second lower bound on τ:
$$\tau \ge \left( \max_{1 \le i \le n,\ 1 \le l \le n-1} \left\lceil \frac{|\delta_{i,i+l}|}{2} \right\rceil \right) \times c.$$
We just gather the previous two lower-bounds to obtain the desired bound.
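The combined bound of Lemma 6 can be computed by the same kind of enumeration as in the unidirectional case. The Python sketch below is illustrative; the function name, the 0-based indices, and the sample values are assumptions.

```python
def lower_bound_bi_homogeneous(delta, c):
    """Right-hand side of Equation (4) for a homogeneous bidirectional ring."""
    n = len(delta)
    # First term: a single processor sends or receives one item at a time.
    best = max(abs(d) for d in delta)
    # Second term: a slice of at least two processors moves at most
    # two items per interval of length c, one through each extremity.
    for i in range(n):
        running = delta[i]
        for l in range(1, n):
            running += delta[(i + l) % n]      # running is delta_{i,i+l}
            best = max(best, (abs(running) + 1) // 2)  # integer ceiling of |.|/2
    return best * c


# Made-up instance where a slice, not a single processor, gives the bound:
# the slice P_0, P_1, P_2 has unbalance 6, hence ceil(6 / 2) = 3 intervals.
bound = lower_bound_bi_homogeneous([2, 2, 2, -2, -2, -2], c=1)
```

Here every processor has |δi| = 2, yet the bound is 3 because of the slice of total unbalance 6.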
5.2 An optimal algorithm
Algorithm 3 (see below) is a recursive algorithm which defines communication patterns designed
so as to decrease the value of δmax (computed at Step 1) by one from one recursive call to another.
The intuition behind Algorithm 3 is the following:
1. Any non trivial slice Ck,l such that ⌈|δk,l|/2⌉ = δmax and δk,l ≥ 0 must send two data items
per recursive call, one through each of its extremities.
2. Any non trivial slice Ck,l such that ⌈|δk,l|/2⌉ = δmax and δk,l ≤ 0 must receive two data items
per recursive call, one through each of its extremities.
3. Once the mandatory communications specified by the two previous cases are defined, we take
care of any processor Pi such that |δi| = δmax. If Pi is already involved in a communication
due to the previous cases, everything is settled. Otherwise, we have the freedom to choose
whom Pi will send a data item to (case δi > 0) or whom Pi will receive a data item from
(case δi < 0). To simplify the algorithm we decide that all these communications will take
place in the direction from Pi to Pi+1.
Algorithm 3 is initially called with the parameter s = 1. For any call to Algorithm 3, all the
communications take place in parallel and exactly at the same time, because the communication
paths are homogeneous by hypothesis. One very important point about Algorithm 3 is that this
algorithm is a set of rules which only specify which processor Pi must send a data item to which
processor Pj , one of its immediate neighbors. Therefore, whatever the number of rules deciding
Algorithm 3 Redistribution algorithm for homogeneous bidirectional rings (for step s)
1: Let δmax = max{max1≤i≤n |δi|, max1≤i≤n,1≤l≤n−1 ⌈|δi,i+l|/2⌉}
2: if δmax ≥ 1 then
3:     if δmax ≠ 2 then
4:         for all slices Ck,l such that δk,l > 1 and ⌈|δk,l|/2⌉ = δmax do
5:             Pk sends a data item to Pk−1 during the time interval [(s − 1) × c, s × c[.
6:             Pl sends a data item to Pl+1 during the time interval [(s − 1) × c, s × c[.
7:         for all slices Ck,l such that δk,l < −1 and ⌈|δk,l|/2⌉ = δmax do
8:             Pk−1 sends a data item to Pk during the time interval [(s − 1) × c, s × c[.
9:             Pl+1 sends a data item to Pl during the time interval [(s − 1) × c, s × c[.
10:     else if δmax = 2 then
11:         for all slices Ck,l such that δk,l ≥ 3 do
12:             Pl sends a data item to Pl+1 during the time interval [(s − 1) × c, s × c[.
13:         for all slices Ck,l such that δk,l = 4 do
14:             Pk sends a data item to Pk−1 during the time interval [(s − 1) × c, s × c[.
15:         for all slices Ck,l such that δk,l ≤ −3 do
16:             Pk−1 sends a data item to Pk during the time interval [(s − 1) × c, s × c[.
17:         for all slices Ck,l such that δk,l = −4 do
18:             Pl+1 sends a data item to Pl during the time interval [(s − 1) × c, s × c[.
19:     for all processors Pi such that δi = δmax do
20:         if Pi is not already sending, due to one of the previous steps, a data item during the time interval [(s − 1) × c, s × c[ then
21:             Pi sends a data item to Pi+1 during the time interval [(s − 1) × c, s × c[.
22:     for all processors Pi such that δi = −(δmax) do
23:         if Pi is not already receiving, due to one of the previous steps, a data item during the time interval [(s − 1) × c, s × c[ then
24:             Pi receives a data item from Pi−1 during the time interval [(s − 1) × c, s × c[.
25:     if δmax = 1 then
26:         for all processors Pi such that δi = 0 do
27:             if Pi−1 sends a data item to Pi during the time interval [(s − 1) × c, s × c[ then
28:                 Pi sends a data item to Pi+1 during the time interval [(s − 1) × c, s × c[.
29:             if Pi+1 sends a data item to Pi during the time interval [(s − 1) × c, s × c[ then
30:                 Pi sends a data item to Pi−1 during the time interval [(s − 1) × c, s × c[.
31:     Recursive call to Algorithm 3 (s + 1)
that there must be some data item sent from a processor Pi to one of its immediate neighbors Pj ,
only one data item is sent from Pi to Pj to satisfy all these rules.
To prove that Algorithm 3 is optimal, we show that the set of rules is consistent, i.e., that it
respects the one-port model, and that the value δmax (computed at Step 1) decreases by one at
each recursive call.
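As an illustration of Step 1 of Algorithm 3, the following minimal Python sketch computes δmax for a ring given as a list of unbalances. The function name `delta_max` and the 0-based list representation (with indices taken modulo n) are assumptions made for this example only; they are not notation from the report.

```python
def delta_max(delta):
    """Step 1 of Algorithm 3: delta[i] is the unbalance of processor P_i on the ring."""
    n = len(delta)
    best = max(abs(d) for d in delta)                # max over single processors
    for i in range(n):
        total = delta[i]
        for l in range(1, n):                        # slice C_{i,i+l}, indices modulo n
            total += delta[(i + l) % n]
            best = max(best, (abs(total) + 1) // 2)  # integer form of ceil(|total| / 2)
    return best


print(delta_max([1, 1, 1, -1, -1, -1]))
```

In this example no single processor has an unbalance larger than 1, but the slice made of the first three processors has a total unbalance of 3, so the slice term determines δmax.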
Lemma 7. Algorithm 3 satisfies all the one-port constraints.
Proof. We call maximal slice a slice Ck,l of consecutive processors whose total unbalance satisfies
the condition: ⌈|δk,l|/2⌉ = δmax. We call maximal processor a processor Pi whose unbalance is equal
to δmax or −δmax: |δi| = δmax. Maximal slices are processed by rules at Steps 4 through 18, while
maximal processors are processed by the rules of Steps 19 and 22.
To prove that the set of rules obeys the one-port model, we have to prove that no processor
simultaneously receives one data item from both neighbors, and that no processor simultaneously
sends one data item to both neighbors. We only study the cases involving a processor receiving
data items from both neighbors, because the algorithm symmetrically processes sends and receives.
We prove the result by contradiction. So, suppose that there exists a processor Pj that receives
one data item from each neighbor, Pj−1 and Pj+1. There are four cases to consider:
Figure 2: Case 1a in the proof of Lemma 7 (processor Pj, of unbalance δj, lies between the positive maximal slices Ci,j−1 and Cj+1,k).
Figure 3: Case 1b in the proof of Lemma 7 (processor Pj, of unbalance δj, is the common extremity of the negative maximal slices Ci,j and Cj,k).
1. Pj−1 and Pj+1 are both sending a data item to Pj because of Steps 4 through 18. Then,
Pj−1 and Pj+1 send data items to Pj either because they are extremities of positive maximal
slices or because Pj is the extremity of (a) negative maximal slice(s). We thus have three
subcases to study:
(a) Pj−1 and Pj+1 are both extremities of positive maximal slices. Then there exist two
indices i and k such that the slices Ci,j−1 and Cj+1,k are both positive maximal slices.
So, by definition, there exist two values ε1 and ε2, each one either equal to 0 or 1, such
that δi,j−1 = 2δmax − ε1 and δj+1,k = 2δmax − ε2. This case is illustrated by Figure 2.
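Consider the slice Ci,k. By definition of δmax we have:
\[
\begin{aligned}
\left\lceil \frac{\delta_{i,k}}{2} \right\rceil \le \delta_{\max}
&\Leftrightarrow \left\lceil \frac{(2\delta_{\max}-\varepsilon_1)+\delta_j+(2\delta_{\max}-\varepsilon_2)}{2} \right\rceil \le \delta_{\max}\\
&\Leftrightarrow 4\delta_{\max}+\delta_j-\varepsilon_1-\varepsilon_2 \le 2\delta_{\max}\\
&\Leftrightarrow \delta_j \le \varepsilon_1+\varepsilon_2-2\delta_{\max}
\end{aligned}
\]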
However, by definition of δmax, δj is greater than or equal to −δmax. So we end up with
the constraint:
−δmax ≤ ε1 + ε2 − 2δmax ⇔ δmax ≤ ε1 + ε2. (5)
We then have three cases to consider:
i. δmax = 0: there is nothing to do, as stated by the test at Step 2. (In the remainder
of this proof, we no longer consider the cases where δmax = 0.)
ii. δmax = 1. Then, either ε1 = 1 and δi,j−1 = 1, or ε2 = 1 and δj+1,k = 1: in both
cases, this contradicts our hypothesis that Pj−1 and Pj+1 are both sending data
items to Pj because of Steps 4 through 18.
iii. δmax = 2. This case is illustrated by Figure 4. Equation 5 implies that ε1 =
ε2 = 1. Applying the general scheme defined by Steps 4 through 6 would lead to
the violation of the one-port model (cf. Figure 4(a)). However, each of the two
slices Ci,j−1 and Cj+1,k only needs to output three data items in two successive
calls to Algorithm 3 (the calls with δmax = 2 and δmax = 1). So, we only require
these maximal slices to output one data item during the call with δmax = 2, in the
direction from Pi to Pi+1 (cf. Figures 4(b) and 4(c)). Note that, in our example,
Pi outputs a data item at step δmax = 2: this is not because it is the extremity of
Ci,j−1 with δi,j−1 = 3 but because δi,k = 4.
This particular case is one of the reasons why we introduced the special processing
of Steps 10 through 18.
(b) Pj is the extremity of two negative maximal slices Ci,j and Cj,k with i < j < k.
So, by definition, there exist two values ε1 and ε2, each one either equal to 0 or 1, such
that δi,j = −2δmax + ε1 and δj,k = −2δmax + ε2. This case is illustrated by Figure 3.
Consider the slice Ci,k:
δi,k = δi,j + δj,k − δj = −4δmax + ε1 + ε2 − δj
By definition of δmax we have: δi,k ≥ −2δmax. So, δj ≤ −2δmax + ε1 + ε2. But
δj ≥ −δmax. Hence, δmax ≤ ε1 + ε2. We then have two cases to consider:
Figure 4: Case 1(a)iii in the proof of Lemma 7, with panels (a) behavior under the general rule,
(b) special processing: first step, and (c) special processing: second step. Figure 4(a) shows the
problem: the one-port model is violated if we apply the general rules to that case. Figures 4(b)
and 4(c) describe the two steps of the special processing: in the first step, only one data item is
output by the rightmost maximal slice; and in the second step, only one data item is output by
the slice which was the leftmost maximal slice.
i. δmax = 1. Then, either ε1 = 1 and δi,j = −1, or ε2 = 1 and δj,k = −1. In both
cases, this contradicts our hypothesis that Pj−1 and Pj+1 are both sending data
items to Pj because of Steps 4 through 18.
ii. δmax = 2. Then ε1 = ε2 = 1 and δi,j = δj,k = −3. As δj ≤ −2δmax + ε1 + ε2 and
as, by definition of δmax, δj ≥ −δmax, then δj = −δmax = −2.
Applying the general scheme defined by Steps 7 through 9 would lead to the viola-
tion of the one-port model (see Figure 5(a)). However, each of the two slices Ci,j
and Cj,k only needs to input three data items in two successive calls to Algorithm 3
(the calls with δmax = 2 and δmax = 1). So, we only require these maximal slices
to input one data item during the call with δmax = 2, in the direction from Pi to
Pi+1 (see Figures 5(b) and 5(c)). Remark that, in our example Pk inputs a data
item at step δmax = 2: this is not because it is the extremity of Cj,k with δj,k = −3
but because δi,k = −4.
This particular case is one of the reasons why we introduced the special processing
of Steps 10 through 18.
(c) Pj is the extremity of a negative maximal slice and one of its neighbors is the extremity
of a positive maximal slice. Without any loss of generality, suppose that Pj+1 sends
a data item to Pj because Ci,j is a maximal negative slice. Then Pj−1 sends a data
item to Pj because it is the extremity of some positive maximal slice Ck,j−1. So, by
definition, there exist two values ε1 and ε2, each one either equal to 0 or 1, such that
δi,j = −2δmax + ε1 and δk,j−1 = 2δmax − ε2. We have two cases to consider, depending
on whether the slice Ck,j−1 is enclosed in the slice Ci,j :
i. k ∈ [i, j − 2] (this case is illustrated by Figure 6). δi,k−1 + δj = δi,j − δk,j−1 =
(−2δmax + ε1) − (2δmax − ε2) = −4δmax + ε1 + ε2. However, by definition of δmax,
δi,k−1 ≥ −2δmax and δj ≥ −δmax. So, δmax ≤ ε1 + ε2. Once again, we have two
cases to consider:
A. δmax = 1. Then, as always, either ε1 = 1 and δi,j = −1, or ε2 = 1 and δk,j−1 = 1.
In both cases, this contradicts our hypotheses on Ci,j and Ck,j−1.
B. δmax = 2. Then ε1 = ε2 = 1 and δi,j = −3 and δk,j−1 = 3. Therefore,
δi,k−1 + δj = −6. As, by definition of δmax, δj ≥ −δmax = −2 and δi,k−1 ≥
−2δmax = −4, we have δj = −2 and δi,k−1 = −4.
Figure 5: Case 1(b)ii in the proof of Lemma 7, with panels (a) behavior under the general rule,
(b) special processing: first step, and (c) special processing: second step. Figure 5(a) shows the
problem: the one-port model is violated if we apply the general rules to that case. Figures 5(b)
and 5(c) describe the two steps of the special processing: in the first step, only one data item is
input by the leftmost negative maximal slice; and in the second step, only one data item is input
by the slice which was the rightmost maximal slice.
Figure 6: Case 1(c)i of the proof of Lemma 7 (the positive maximal slice Ck,j−1 is enclosed in the negative maximal slice Ci,j, whose extremity is processor Pj).
Figure 7: Case 1(c)ii of the proof of Lemma 7 (negative maximal slice Ci,j with extremity Pj, and positive maximal slice Ck,j−1 with k < i).
Similarly to the cases 1(a)iii and 1(b)ii, applying the general scheme defined
by Steps 4 through 9 would lead to the violation of the one-port model (cf.
Figure 8(a)). However, the slice Ci,j only needs to input three data items in
two successive calls to Algorithm 3 (the calls with δmax = 2 and δmax = 1) while
the slice Ck,j−1 only needs to output three data items. So, we only require the
slice Ci,j to input one data item and the slice Ck,j−1 to output one data item
during the call with δmax = 2, both communications being in the direction
from Pi to Pi+1 (cf. Figures 8(b) and 8(c)). Remark that, in our example, Pk
outputs a data item at step δmax = 2: this is not because it is the extremity of
Ck,j−1 with δk,j−1 = 3 but because δi,k−1 = −4.
This particular case is one of the reasons why we introduced the special pro-
cessing of Steps 10 through 18.
ii. k < i (this case is illustrated by Figure 7). Then, δk,i−1 = δj + δk,j−1 − δi,j =
δj + (2δmax − ε2) − (−2δmax + ε1) = δj + 4δmax − ε1 − ε2. However, by definition
of δmax, δk,i−1 ≤ 2δmax and δj ≥ −δmax. So, δj ≤ −2δmax − ε1 − ε2 and, thus,
−δmax ≤ −2δmax − ε1 − ε2. Hence, δmax = ε1 = ε2 = 0, which is absurd.
2. Pj−1 and Pj+1 are both sending data items to Pj : one sends data items due to Steps 4
through 18; the other one is a maximal processor which sends data items due to Steps 19
and 24. Without loss of generality, suppose that Pj−1 is the maximal processor.
We have two cases to consider, depending on whether Pj+1 is sending a data item to Pj because
of a positive or negative maximal slice.
(a) Pj+1 is the extremity of a positive maximal slice Cj+1,k. Figure 9 illustrates this case.
Therefore, there exists ε ∈ {0; 1}, such that δj+1,k = 2δmax − ε. By hypothesis, Pj−1
sends data items due to Steps 19 and 24, and thus the slice Cj−1,k is not a maximal
Figure 8: Case 1(c)iB in the proof of Lemma 7, with panels (a) behavior under the general rule,
(b) special processing: first step, and (c) special processing: second step. Figure 8(a) shows the
problem: the one-port model is violated if we apply the general rules to that case. Figures 8(b)
and 8(c) describe the two steps of the special processing: in the first step, only one data item is
input by the negative maximal slice; and in the second step, only one data item is output by the
slice which was the positive maximal slice.
Figure 9: Case 2a of the proof of Lemma 7 (maximal processor Pj−1 of unbalance δmax, processor Pj of unbalance δj, and positive maximal slice Cj+1,k of unbalance 2δmax − ε).
Figure 10: Case 2b of the proof of Lemma 7 (maximal processor Pj−1 of unbalance δmax, processor Pj of unbalance δj, and negative maximal slice Ci,j).
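slice, i.e., ⌈δj−1,k/2⌉ ≤ δmax − 1.
\[
\begin{aligned}
\left\lceil \frac{\delta_{j-1,k}}{2} \right\rceil
= \left\lceil \frac{\delta_{\max}+\delta_j+2\delta_{\max}-\varepsilon}{2} \right\rceil \le \delta_{\max}-1
&\Leftrightarrow \delta_{\max}+\delta_j+2\delta_{\max}-\varepsilon \le 2\delta_{\max}-2\\
&\Leftrightarrow \delta_j \le \varepsilon-2-\delta_{\max}\\
&\Rightarrow \delta_j \le -1-\delta_{\max}
\end{aligned}
\]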
which contradicts the definition of δmax.
(b) Pj is the extremity of a negative maximal slice Ci,j (Figure 10 illustrates this case).
Then, there exists ε ∈ {0; 1}, such that δi,j = −2δmax + ε. Therefore, δi,j−2 = δi,j −
δmax−δj = −3δmax+ε−δj. By hypothesis, Pj−1 sends data items due to Steps 19 and 24,
and thus the slice Ci,j−2 is not a maximal slice, which implies that δi,j−2 ≥ −2δmax +2.
Therefore, −3δmax + ε − δj ≥ −2δmax + 2 and thus δj ≤ −δmax + ε − 2 ≤ −δmax − 1,
which contradicts the definition of δmax.
3. Pj−1 and Pj+1 are both sending data items to Pj because both are maximal processors which
send data items due to Steps 19 through 24. This case is impossible as these steps only define
data item sending in the direction from Pi to Pi+1 and never in the reverse direction (from
Pi to Pi−1).
4. Pj is a maximal processor of negative unbalance and this is the reason why Pj−1 sends it
a data item (following Steps 19 through 24). There may be several reasons why Pj+1 would
also send a data item to Pj :
(a) Pj+1 is the extremity of a positive maximal slice Cj+1,k and it sends a data item due
to Steps 4 through 18. Then the test at Step 23 contradicts our hypothesis on Pj−1.
(b) Pj+1 is a positive maximal processor. But in this case Pj+1 sends a data item to Pj+2
and not to Pj .
(c) Pj is the extremity of a negative maximal slice Ci,j and Pj+1 sends it a data item due
to Steps 4 through 18. Then the test at Step 23 contradicts our hypothesis on Pj .
Lemma 8. Algorithm 3 terminates in exactly
\[
\max\left\{ \max_{1\le i\le n} |\delta_i|,\ \max_{1\le i\le n,\ 1\le l\le n-1} \left\lceil \frac{|\delta_{i,i+l}|}{2} \right\rceil \right\}
\]
recursive calls.
Proof. We prove that from one recursive call to Algorithm 3 to another, the value of δmax (com-
puted at Step 1) decreases by one. Therefore, we consider how unbalances change between the
initial call to Algorithm 3 and its recursive call (excluded). For the general case, we have to prove
four properties:
1. If the non-trivial slice Ck,l was initially a maximal slice, i.e., if ⌈|δk,l|/2⌉ = δmax, then after the
communications we have ⌈|δk,l|/2⌉ = δmax − 1.
As previously we focus on the case of a positive maximal slice. The rules of Algorithm 3 are
written so that the slice Ck,l sends two data items (or only one in the degenerate case when
δmax = 2 and δk,l = 3) during an execution of Algorithm 3. This is all we need to conclude,
provided that this slice does not receive any data item during this call.
Thus, suppose that Ck,l receives a data item. We have three cases to consider:
(a) The maximal slice Ck,l receives a data item from a processor which is the extremity
of another maximal slice and which sends a data item due to Steps 4 through 18. As
the other maximal slice is sending a data item, its unbalance is positive. Without
any loss of generality, we suppose it is a maximal slice of the form Cl+1,m. Then, by
definition of maximal slices, δk,l = 2δmax − ε1 and δl+1,m = 2δmax − ε2, with both ε1
and ε2 taking values in {0, 1}. Thus, δk,m = 4δmax − ε1 − ε2. However, by definition
of δmax, δk,m ≤ 2δmax. Hence, we obtain 2δmax ≤ ε1 + ε2, which implies δmax = 1 and
ε1 = ε2 = 1. Then, δl+1,m = 1 which contradicts the hypothesis that Cl+1,m sends a
data item due to Steps 4 through 18 (see the test at Step 4).
(b) The maximal slice Ck,l receives a data item from a processor which is maximal and
which sends data items because of Steps 19 through 24. This case can only arise if
this maximal processor has a positive unbalance. Without any loss of generality, we
suppose processor Pk−1 has an unbalance of δmax. Then, by definition of maximal slices,
δk,l ≥ 2δmax−1 and δk−1,l ≥ 3δmax −1. However, by definition of δmax, δk−1,l ≤ 2δmax.
So δmax = 1 and δk−1 = 1. Then we have δk,l = 1, and δk−1,l = 2. Therefore, Ck−1,l
is a maximal slice and processor Pk−1 sends a data item to processor Pk−2 rather than
to Pk.
(c) The maximal slice Ck,l receives a data item because one of its extremities is also the
extremity of a negative maximal slice. Without any loss of generality, we suppose this
negative maximal slice is of the form Ck,m (with l ∈ [k; m]). Ck,l being a positive
maximal slice, δk,l = 2δmax − ε1 with ε1 ∈ {0; 1}. Ck,m being a negative maximal slice,
δk,m = −2δmax + ε2 with ε2 ∈ {0; 1}. We have two cases to consider:
i. l < m. Then, δl+1,m = (−2δmax + ε2) − (2δmax − ε1) = −4δmax + ε1 + ε2. By
definition of δmax, δl+1,m ≥ −2δmax, and thus δmax = 1 and ε1 = ε2 = 1. Then
δk,m = −1 and, because of the test of Step 7, the rules of Steps 8 and 9 do not
apply, and the maximal slice Ck,l does not receive a data item because it is enclosed
in a negative maximal slice.
ii. l > m. Then, δm+1,l = (2δmax−ε1)−(−2δmax+ε2) = 4δmax−ε1−ε2. By definition
of δmax, δm+1,l ≤ 2δmax, and thus δmax = 1 and ε1 = ε2 = 1. Then δk,m = −1.
Thus, in both cases, δk,m = −1. Then, because of the test of Step 7, the rules of
Steps 8 and 9 do not apply, and the maximal slice Ck,l does not receive a data item
because one of its extremities is also the extremity of a negative maximal slice.
2. If processor Pi was initially maximal, i.e., if |δi| = δmax, then after the communications we
have |δi| = δmax − 1.
As previously, we only focus on the case δi = δmax. If, after communications, we do not have
|δi| = δmax − 1, then Pi has received one data item.
(a) Pi receives a data item from a processor which is the extremity of a positive maximal
slice and which sends a data item due to Steps 4 through 18. Without loss of generality,
suppose this processor is Pi+1 and the slice Ci+1,j . By definition of maximal slices, there
exists a value ε, either equal to 0 or 1, such that δi+1,j = 2δmax−ε. Then δi,j = 3δmax−ε.
As, by definition of δmax, δi,j ≤ 2δmax, this leads to δmax = ε = 1. So δi+1,j = 1, which
contradicts our hypothesis on Pi+1.
(b) Pi receives a data item, because of Steps 4 through 18, as it is the extremity of a negative
maximal slice. Without loss of generality, suppose the slice is Ci,j . By definition of
maximal slices, there exists a value ε, either equal to 0 or 1, such that δi,j = −2δmax+ε.
Then δi+1,j = (−2δmax + ε) − δmax = −3δmax + ε. As, by definition of δmax, δi+1,j ≥
−2δmax, this leads to δmax = ε = 1. So δi,j = −1, which contradicts our hypothesis on
Pi.
(c) Pi receives a data item from another maximal processor, say Pi−1, which sends data
items due to Steps 19 and 24. But two maximal processors side by side define a maximal
slice. Hence a contradiction because in a maximal slice Ci−1,i processor Pi−1 sends a
data item to processor Pi−2 and not to Pi.
3. After the communications took place, no processor Pi is such that |δi| = δmax.
As previously, let us consider the case δi = δmax after the communications took place.
Because of Case 2, such a case would only arise if the unbalance of Pi was equal to δmax − 1
before the communications (because of the one-port model guaranteed by Lemma 7) and if
Pi received a data item but sent none.
We have three cases to consider:
(a) Processor Pi receives a data item from a processor which is the extremity of a maximal
slice and which sends data items due to Steps 4 through 18. There is no configuration
that can arise where the maximal slice is negative. So the maximal slice is positive.
Without any loss of generality, we suppose it is a maximal slice of the form Ci+1,j .
Then, by definition of maximal slices, δi+1,j = 2δmax − ε1 and ε1 is either equal to 0 or
1. Thus, δi,j = 3δmax − ε1 − 1. However, Ci,j is not a maximal slice (as, by hypothesis
Pi is not sending any data items). Therefore, by definition of δmax, δi,j ≤ 2δmax − 2.
Hence, we obtain δmax ≤ ε1 − 1 which has no solution.
(b) Processor Pi receives a data item from a processor which is maximal and which sends
data items due to Steps 19 through 24. This case can only arise if this maximal processor
has a positive unbalance. Without any loss of generality, we suppose that processor Pi−1
has an unbalance of δmax. Then, δi−1,i = 2δmax − 1. Thus, Ci−1,i is a maximal slice,
which contradicts the assumption on Pi−1.
(c) Processor Pi receives a data item as it is the extremity of a negative maximal slice,
say Ci,j . Then, by definition of maximal slices, there exists ε ∈ {0, 1} such that δi,j =
−2δmax + ε. By definition of δmax we have δi+1,j ≥ −2δmax. As, δi+1,j = −3δmax + ε,
we obtain δmax ≤ ε. Then δi,j = −1 and, because of the test of Step 7, the rules of
Steps 8 and 9 do not apply, and Pi does not receive a data item as it is the extremity
of a negative maximal slice.
4. After the communications took place, no non-trivial slice Ck,l is such that ⌈|δk,l|/2⌉ = δmax.
Once again we only consider the case of positive slices. We can assume that the slice Ck,l
was not initially a maximal slice as this case has already been processed. So, there exists a
value ε1 ∈ {0, 1} such that δk,l = 2δmax − 2 − ε1 and we have three cases to consider:
(a) The slice Ck,l receives a data item from a processor which is the extremity of a maximal
slice which sends data items due to Steps 4 through 18. There is no configuration that
can arise where the maximal slice is negative. So the maximal slice is positive. Without
any loss of generality, we suppose it is of the form Cj,k−1. Then, by definition of maximal
slices δj,k−1 = 2δmax−ε2, with ε2 taking values in {0, 1}. Thus, δj,l = 4δmax−ε1−ε2−2.
However, by definition of δmax, δj,l ≤ 2δmax. Hence, we obtain 2δmax ≤ ε1 + ε2 + 2. We
have two sub-cases to consider:
i. δmax = 2. Then, ε1 = ε2 = 1. However, in this case δj,l = 4, Cj,l is a maximal
slice, and Pl sends a data item to Pl+1. Before the communications took place,
δk,l = 1. During the communications Ck,l receives at most two data items (as it has
two extremities) and sends at least one, from Pl. So, after the communications took
place, δk,l is equal to 0, 1, or 2, and the three cases are fine.
ii. δmax = 1. Then, we conclude using the results of Cases 2 and 3.
(b) The slice Ck,l receives a data item because it is enclosed in a negative maximal slice.
Without any loss of generality, we suppose this negative maximal slice is of the form
Ck,m. Ck,m being a negative maximal slice, δk,m = −2δmax + ε2 with ε2 ∈ {0; 1}.
i. l < m. Then, δl+1,m = (−2δmax + ε2)− (2δmax −2− ε1) = −4δmax + ε1 + ε2 +2. By
definition of δmax, δl+1,m ≥ −2δmax. The case δmax = 1 is settled using the result
of Case 3. Then δmax = 2, ε1 = ε2 = 1, δk,l = 1 and δl+1,m = −4. So, Cl+1,m is
a negative maximal chain and Pl sends a data item to Pl+1. So the unbalance of
Ck,l, which was originally equal to 1, increases at most by one between before and
after communications took place, and there is no problem.
ii. m < l. Then, δm+1,l = (2δmax − 2− ε1)− (−2δmax + ε2) = 4δmax − ε1 − ε2 − 2. By
definition of δmax, δm+1,l ≤ 2δmax. The case δmax = 1 is settled using the result of
Case 3. Then δmax = 2, ε1 = ε2 = 1, δk,l = 1 and δm+1,l = −4. So, Cm+1,l is a
negative maximal chain and Pm sends a data item to Pm+1. So the unbalance of
Ck,l, which was originally equal to 1, increases at most by one between before and
after communications took place, and there is no problem.
(c) The slice Ck,l only receives a data item from a processor which is maximal and which
sends data items because of Steps 19 through 24. This case can only arise if this
maximal processor has a positive unbalance. Then, due to Step 19, this is processor
Pk−1 which has an unbalance of δmax. For Ck,l to be such that ⌈|δk,l|/2⌉ = δmax after the
communications took place, necessarily, δk,l ≥ 2δmax − 2 before the communications.
Then δk−1,l ≥ 3δmax − 2. As we supposed that Pk−1 sends data items because of Steps 19
through 24, the slice Ck−1,l is not maximal and thus δk−1,l ≤ 2δmax − 2. Hence δmax ≤ 0,
a contradiction.
The optimality of Algorithm 3 is a simple corollary of Lemma 8 and of the lower bound defined
by Equation 4.
Theorem 3. Algorithm 3 is optimal.
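The behavior stated by Lemmas 7 and 8 can be observed on small instances with the following Python sketch, which transcribes the rules of Algorithm 3 and counts the number of calls. It is only an illustration under stated assumptions: the function names (`delta_max`, `one_call`, `redistribute`), the 0-based indexing modulo n, the restriction to rings of at least three processors, and the fixed-point loop used for Steps 25 through 30 are choices made for this example, not part of the report.

```python
def delta_max(delta):
    """Step 1: largest |delta_i| or ceil(|slice unbalance| / 2) over the ring."""
    n = len(delta)
    best = max(abs(d) for d in delta)
    for i in range(n):
        total = delta[i]
        for l in range(1, n):
            total += delta[(i + l) % n]
            best = max(best, (abs(total) + 1) // 2)
    return best


def one_call(delta):
    """Return the set of (sender, receiver) pairs selected by one call of Algorithm 3."""
    n, dmax, sends = len(delta), delta_max(delta), set()
    if dmax == 0:
        return sends
    for k in range(n):
        total = delta[k]
        for length in range(1, n - 1):            # non-trivial slices C_{k,l}
            l = (k + length) % n
            total += delta[l]
            if (abs(total) + 1) // 2 != dmax:     # only maximal slices trigger a rule
                continue
            left, right = (k - 1) % n, (l + 1) % n
            if dmax != 2:                         # Steps 4-9 (general rule)
                if total > 1:
                    sends |= {(k, left), (l, right)}
                elif total < -1:
                    sends |= {(left, k), (right, l)}
            else:                                 # Steps 10-18 (special processing)
                if total >= 3:
                    sends.add((l, right))
                if total == 4:
                    sends.add((k, left))
                if total <= -3:
                    sends.add((left, k))
                if total == -4:
                    sends.add((right, l))
    for i in range(n):                            # Steps 19-21
        if delta[i] == dmax and all(src != i for src, _ in sends):
            sends.add((i, (i + 1) % n))
    for i in range(n):                            # Steps 22-24
        if delta[i] == -dmax and all(dst != i for _, dst in sends):
            sends.add(((i - 1) % n, i))
    grew = dmax == 1                              # Steps 25-30: balanced processors relay
    while grew:
        grew = False
        for i in range(n):
            if delta[i] != 0:
                continue
            prev, nxt = (i - 1) % n, (i + 1) % n
            for src, dst in ((prev, nxt), (nxt, prev)):
                if (src, i) in sends and (i, dst) not in sends:
                    sends.add((i, dst))
                    grew = True
    return sends


def redistribute(delta):
    """Apply Algorithm 3 until the ring is balanced; return the number of calls."""
    delta, steps = list(delta), 0
    while delta_max(delta) >= 1:
        sends = one_call(delta)
        senders = [s for s, _ in sends]
        receivers = [r for _, r in sends]
        # One-port model (Lemma 7): at most one send and one receive per processor.
        assert len(senders) == len(set(senders))
        assert len(receivers) == len(set(receivers))
        for s, r in sends:
            delta[s] -= 1
            delta[r] += 1
        steps += 1
    return steps


print(redistribute([1, 1, 1, -1, -1, -1]), delta_max([1, 1, 1, -1, -1, -1]))
```

On the instances tried here the number of calls equals the initial value of δmax, as Lemma 8 states; with a communication cost c per data item, the redistribution then takes δmax × c time units.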
6 Heterogeneous bidirectional ring
In this section, we consider the most general case, that of a heterogeneous bidirectional ring. We
do not know any optimal redistribution algorithm in this case. However, if we assume that each
processor initially holds more data than it needs to send during the whole execution of the algorithm
(what we call a light redistribution), then we succeed in deriving an optimal solution.
6.1 Light redistribution
Throughout this section, we suppose that we have a light redistribution: we assume that the
number of data items sent by any processor throughout the redistribution algorithm is less than
or equal to its original load. There are two reasons for a processor Pi to send data: (i) because
it is overloaded (δi > 0); (ii) because it has to forward some data to another processor located
further in the ring. If Pi initially holds at least as many data items as it will send during the whole
execution, then Pi can send at once all these data items. Otherwise, in the general case, some
processors may wait to have received data items from a neighbor before being able to forward
them to another neighbor.
6.1.1 Solution by integer linear programming
Under the “light redistribution” assumption, we can build an integer linear program to solve our
problem (see System 6). Let S be a solution, and denote by Si,i+1 the number of data items
that processor Pi sends to processor Pi+1. Similarly, Si,i−1 is the number of data items that Pi
sends to processor Pi−1. In order to ease the writing of the equations, we require in the first two
equations of System 6 that Si,i+1 and Si,i−1 be nonnegative for all i, which means that the
symmetric communications are described by the separate variables Si+1,i and Si−1,i. The third
equation states that after the redistribution, there is no more unbalance. We denote by τ the
execution time of the redistribution. For any processor Pi, due to the one-port constraints, τ must
be at least the time spent by Pi to send data items (fourth equation) and the time spent by Pi to
receive data items (fifth equation). Our aim is to minimize τ , hence the system:
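\[
\text{Minimize } \tau \text{, subject to}\quad
\begin{cases}
S_{i,i+1} \ge 0 & 1 \le i \le n\\
S_{i,i-1} \ge 0 & 1 \le i \le n\\
S_{i,i+1} + S_{i,i-1} - S_{i+1,i} - S_{i-1,i} = \delta_i & 1 \le i \le n\\
S_{i,i+1}c_{i,i+1} + S_{i,i-1}c_{i,i-1} \le \tau & 1 \le i \le n\\
S_{i+1,i}c_{i+1,i} + S_{i-1,i}c_{i-1,i} \le \tau & 1 \le i \le n
\end{cases}
\tag{6}
\]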
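To make the role of each equation of System 6 concrete, the sketch below checks a candidate solution against the first three equations and returns the smallest τ allowed by the last two. It does not solve the integer linear program. The array layout is an assumption of this example: `fwd[i]` stands for Si,i+1, `bwd[i]` for Si,i−1, `c_fwd[i]` for ci,i+1 and `c_bwd[i]` for ci,i−1, with 0-based indices taken modulo n; the function name `smallest_tau` and the numerical values are hypothetical.

```python
def smallest_tau(fwd, bwd, c_fwd, c_bwd, delta):
    """Validate a candidate solution of System 6 and return the smallest feasible tau."""
    n = len(delta)
    tau = 0
    for i in range(n):
        nxt, prev = (i + 1) % n, (i - 1) % n
        if fwd[i] < 0 or bwd[i] < 0:                      # first two equations
            raise ValueError("negative number of data items")
        if fwd[i] + bwd[i] - bwd[nxt] - fwd[prev] != delta[i]:   # third equation
            raise ValueError(f"processor {i} is not balanced")
        sending = fwd[i] * c_fwd[i] + bwd[i] * c_bwd[i]                 # fourth equation
        receiving = bwd[nxt] * c_bwd[nxt] + fwd[prev] * c_fwd[prev]     # fifth equation
        tau = max(tau, sending, receiving)
    return tau


# Hypothetical 3-processor ring: P0 holds two extra items and sends one to each neighbor.
tau = smallest_tau(fwd=[1, 0, 0], bwd=[1, 0, 0],
                   c_fwd=[2, 1, 1], c_bwd=[3, 1, 1],
                   delta=[2, -1, -1])
print(tau)
```

Here the bound comes from the sending time of P0, which uses both of its links one after the other under the one-port model.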
Lemma 9. Any optimal solution of System 6 is feasible, for example using the following schedule:
for any i ∈ [1, n], Pi starts sending data items to Pi+1 at time 0 and, after the completion of this
communication, starts sending data items to Pi−1 as soon as possible under the one-port model.
Proof. We have to show that we are able to schedule the communications defined by any optimal
solution (S, τ) of System 6 so that the redistribution takes a time no greater than τ . For any
i ∈ [1, n], we schedule at time 0, all emissions from Pi to Pi+1. This communication is done in
time Si,i+1ci,i+1: because of the “light redistribution” hypothesis, Pi already holds all the data
items that it must send. Because of the fourth equation of System 6, this communication ends
before the time τ .
For any value of i ∈ [1, n], we still have to schedule the sending of data items from Pi to
Pi−1. We schedule this communication as soon as possible, therefore at time max{Si,i+1ci,i+1,
Si−2,i−1ci−2,i−1}, i.e., at the earliest time when (i) Pi has ended sending data items to Pi+1, and
(ii) Pi−1 has stopped receiving data items from Pi−2. Therefore, the communication from Pi to
Pi−1 ends at the date:
max{Si,i+1ci,i+1,Si−2,i−1ci−2,i−1} + Si,i−1ci,i−1 =
max{Si,i+1ci,i+1 + Si,i−1ci,i−1,Si−2,i−1ci−2,i−1 + Si,i−1ci,i−1}. (7)
Once again, this is true owing to the “light redistribution” hypothesis: no processor needs to wait
to have received some data items before being able to send them to one of its neighbors.
The first term of the “max” expression is the time needed by Pi to send data items to both Pi+1
and Pi−1. This term is less than or equal to τ because of the fourth equation of System 6. The
second term of the “max” expression is the time needed by Pi−1 to receive data items from both
Pi−2 and Pi. This term is less than or equal to τ because of the fifth equation of System 6.
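The schedule used in the proof of Lemma 9 can be sketched as follows. The code assumes the light redistribution hypothesis (every processor already holds what it sends) and uses a hypothetical array layout: `fwd[i]` for Si,i+1, `bwd[i]` for Si,i−1, `c_fwd[i]` for ci,i+1 and `c_bwd[i]` for ci,i−1, with 0-based indices modulo n. The function name and the sample values are illustrative only.

```python
def lemma9_schedule(fwd, bwd, c_fwd, c_bwd):
    """For each P_i, return ((start, end) of P_i -> P_{i+1}, (start, end) of P_i -> P_{i-1})."""
    n = len(fwd)
    schedule = []
    for i in range(n):
        prev2 = (i - 2) % n
        end_fwd = fwd[i] * c_fwd[i]                   # P_i -> P_{i+1} starts at time 0
        # P_i -> P_{i-1} waits until P_i has finished sending to P_{i+1}
        # and P_{i-1} has finished receiving from P_{i-2}.
        start_bwd = max(end_fwd, fwd[prev2] * c_fwd[prev2])
        end_bwd = start_bwd + bwd[i] * c_bwd[i]
        schedule.append(((0, end_fwd), (start_bwd, end_bwd)))
    return schedule


# Hypothetical 4-processor instance with unbalances (3, -1, -2, 0).
fwd, bwd = [2, 1, 0, 0], [1, 0, 0, 1]
c_fwd, c_bwd = [1, 2, 1, 1], [2, 1, 1, 3]
schedule = lemma9_schedule(fwd, bwd, c_fwd, c_bwd)
makespan = max(end for pair in schedule for _, end in pair)
print(schedule, makespan)
```

In this instance the last communication to finish is the one from P3 to P2, which has to wait for P2 to finish receiving from P1; its completion time coincides with the receiving time of P2 bounded by the fifth equation of System 6.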
So far, we did not mathematically define a condition for the “light redistribution” hypothesis to
hold. In fact, this is not mandatory: we use System 6 to find an optimal solution to the problem.
If, in this optimal solution, for any processor Pi, the total number of data items sent is less than or
equal to the initial load (Si,i+1 + Si,i−1 ≤ Li), we are under the “light redistribution” hypothesis
and we can use the solution of System 6 safely.
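This a posteriori test is a one-line check. In the sketch below, `fwd[i]` and `bwd[i]` stand for Si,i+1 and Si,i−1 and `load[i]` for the initial load Li; these names and the sample values are assumptions of the example.

```python
def is_light(fwd, bwd, load):
    """True if every processor sends no more data items than it initially holds."""
    return all(f + b <= held for f, b, held in zip(fwd, bwd, load))


print(is_light(fwd=[2, 1, 0, 0], bwd=[1, 0, 0, 1], load=[5, 1, 0, 1]))
print(is_light(fwd=[2, 1, 0, 0], bwd=[1, 0, 0, 1], load=[5, 0, 0, 1]))
```

In the second call P1 must forward one data item but initially holds none, so the solution of System 6 cannot be used as is.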
6.1.2 Solution through rational linear programming
Even if the “light redistribution” hypothesis holds, one may wish to solve the redistribution problem
with a technique less expensive than integer linear programming (which is potentially exponential).
An idea would be to first solve System 6 to find an optimal rational solution, which can
always be done in polynomial time, and then to round the obtained solution to find a “good”
integer solution. In fact, the following proposition shows that one of the two natural ways of rounding
always leads to an optimal (integer) solution. The complexity of the light redistribution problem
is therefore polynomial.
Proposition 1. Let R be an optimal rational solution to the redistribution problem. For any j in
[1, n], Rj denotes the number of data items that processor Pj sends to processor Pj+1 (using the
notations of System 6, Rj = Sj,j+1 − Sj+1,j). Let F be the integer solution defined by F1 = ⌊R1⌋.
Let G be the integer solution defined by G1 = ⌈R1⌉. Then:
(i) F and G are well-defined by the single condition above,
(ii) either F or G is an optimal integer solution.
Proof. Lemma 10 below states that F and G are both fully defined. Lemma 11 below states that
there exists at least one optimal integer solution E such that |E1−R1| < 1. The only two solutions
satisfying these constraints are F and G. Hence the result.
Lemma 10. To fully define the number of data items sent between processors in any redistribution
scheme, we only need to define, for a single given value of j ∈ [1, n], the number of data items that
processor Pj sends to processor Pj+1.
Proof. Without loss of generality, we suppose we have fixed the value of R1, the number of data
items sent by P1 to P2. (Note that R1 may be negative, meaning that in fact P2 sends data items
to processor P1.) After redistribution, the unbalance of P2 must be zero. Thus, δ2 +R1 −R2 = 0.
Therefore, as R1 is known, the value of R2 is also known. Using a direct induction, we then have
that, for any value of j ∈ [2, n], Rj = δj + Rj−1, and Rj is also known. As ∑_{i=1}^{n} δi = 0, one can
check that we also have δ1 + Rn − R1 = 0.
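The induction used in this proof can be sketched in a few lines of Python. This is only an illustration: the function name `flows_from_first`, the 0-based indexing and the list representation of the unbalances are our own conventions, not part of the report.

```python
def flows_from_first(delta, r1):
    """Return [R_1, ..., R_n] given the unbalances and the value of R_1.

    delta[j] is the unbalance of the (j+1)-th processor (0-based list) and
    the unbalances must sum to zero. Each R_j is the signed number of data
    items sent by P_j to P_{j+1}; a negative value means that the items
    travel in the opposite direction.
    """
    if sum(delta) != 0:
        raise ValueError("unbalances must sum to zero")
    flows = [r1]
    for d in delta[1:]:
        # Zero unbalance after redistribution: delta_j + R_{j-1} - R_j = 0.
        flows.append(d + flows[-1])
    # Closing the ring: delta_1 + R_n - R_1 = 0 follows from sum(delta) == 0.
    assert delta[0] + flows[-1] - flows[0] == 0
    return flows
```

For instance, with the (hypothetical) unbalances `[3, -1, -1, -1]` and `r1 = 2`, the remaining values are forced to be `1`, `0` and `-1`.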
Lemma 11. Let R be an optimal rational solution to the redistribution problem: for any j in
[1, n], Rj denotes the number of data items processor Pj sends to processor Pj+1. Then, there
exists an optimal integer solution E to the redistribution problem such that: |E1 − R1| < 1.
Proof. We prove Lemma 11 by contradiction. Therefore, we suppose that no optimal integer
solution E satisfies |E1−R1| < 1. So, let us take an optimal integer solution E such that |E1−R1| ≥
1. Let R1 = E1 + z + ε, where z ∈ Z and ε ∈] − 1; 1[ such that E1 + z ∈ [E1;R1]. Therefore
E1 ≤ E1 + z ≤ E1 + z + ε or E1 ≥ E1 + z ≥ E1 + z + ε. (8)
Thus, using the construction used in the proof of Lemma 10, we have:
∀i ∈ [1, n], Ei ≤ Ei + z ≤ Ei + z + ε or ∀i ∈ [1, n], Ei ≥ Ei + z ≥ Ei + z + ε. (9)
Then let F be a new integer solution to our problem defined by: Fi = Ei + z, ∀i ∈ [1, n]. Then,
|F1 −R1| = |(E1 + z) − (E1 + z + ε)| = |ε| < 1. If we prove that F is an optimal integer solution,
we will have reached the desired contradiction.
Consider any value i in [1, n]. We have two situations to deal with for processor Pi (under
redistribution Fi):
1. Fi−1 · Fi ≥ 0: under Fi, either processor Pi only communicates data items with one of its
neighbors, or it sends data items to one of them and receives data items from the other one.
Without any loss of generality, we suppose that Fi−1 ≥ 0 and Fi ≥ 0. Then, we must show
that
max{Fi−1ci−1,i, Fici,i+1} ≤ τint,
where τint is the duration of an optimal integer solution. However, Fi−1ci−1,i = (Ei−1 +
z)ci−1,i. If Ei−1 + z is null, Fi−1ci−1,i = 0 ≤ τint. Otherwise, Ei−1 + z is not null. As Ei−1 + z
is by definition an integer, and as |ε| < 1, Ei−1 + z and Ei−1 + z + ε have the same sign,
thus are (strictly) positive, and under both redistributions there are data items sent from
processor Pi−1 to processor Pi.
• If ε > 0, then
(Ei−1 + z)ci−1,i < (Ei−1 + z + ε)ci−1,i = Ri−1ci−1,i ≤ τrat ≤ τint,
as R is by definition an optimal rational solution, and as optimal rational solutions are
no worse than optimal integer solutions.
• If ε < 0, then
(Ei−1 + z + ε)ci−1,i < (Ei−1 + z)ci−1,i < Ei−1ci−1,i ≤ τint,
because of Equation 9, and as E is by definition an optimal integer solution.
2. Fi−1 · Fi < 0: either Pi receives data items from both of its neighbors, or Pi sends data
items to both of them. Without any loss of generality, we suppose that Pi sends data items
to both of them.
Then, we must show that
−Fi−1ci,i−1 + Fici,i+1 ≤ τint. (10)
However, −Fi−1ci,i−1 + Fici,i+1 = −(Ei−1 + z)ci,i−1 + (Ei + z)ci,i+1. As Ei−1 + z is by
definition an integer, and as |ε| < 1, Ei−1 + z and Ei−1 + z + ε have the same sign, thus are
(strictly) negative, and under both redistributions there are data items sent from processor
Pi to processor Pi−1. Similarly, under both redistributions there are data items sent from
processor Pi to processor Pi+1.
As R is by definition an optimal rational solution, and as optimal rational solutions are no
worse than optimal integer solutions, then:
−(Ei−1 + z + ε)ci,i−1 + (Ei + z + ε)ci,i+1 = −Ri−1ci,i−1 + Rici,i+1 ≤ τrat ≤ τint.
So, if ε(ci,i+1 − ci,i−1) ≥ 0, Equation 10 holds. Otherwise, ε(ci,i+1 − ci,i−1) < 0 and we have
two cases to consider, depending on the redistribution E :
• Ei−1 · Ei < 0: then Ei−1 < 0 and Ei > 0. Indeed, whatever the redistribution S, we
always have δi + Si−1 − Si = 0. As we have supposed that Fi−1 < 0 and Fi > 0, we have
δi > 0, which rules out Ei−1 > 0 and Ei < 0.
As E is an optimal integer solution, we then have:
−Ei−1ci,i−1 + Eici,i+1 ≤ τint.
Equation 9 implies that z and ε have the same sign. So, z(ci,i+1 − ci,i−1) < 0. Therefore,
−Fi−1ci,i−1+Fici,i+1 = −(Ei−1+z)ci,i−1+(Ei+z)ci,i+1 < −Ei−1ci,i−1+Eici,i+1 ≤ τint.
• Ei−1 · Ei ≥ 0. Without any loss of generality, let us suppose that ε > 0. Then,
ci,i+1 − ci,i−1 < 0. Because of Equation 9, as ε > 0 and as (Ei−1 + z) < 0, Ei−1 < 0,
and thus Ei ≤ 0.
−(Ei−1 + z)ci,i−1 + (Ei + z)ci,i+1 = −Ei−1ci,i−1 + Eici,i+1 + z(ci,i+1 − ci,i−1)
< −Ei−1ci,i−1 + Eici,i+1,
as ci,i+1 − ci,i−1 < 0. However, Ei ≤ 0, so
−(Ei−1 + z)ci,i−1 + (Ei + z)ci,i+1 < −Ei−1ci,i−1 ≤ τint
as E is an optimal integer solution.
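To make the rounding step of Proposition 1 concrete, here is a minimal Python sketch. It assumes that an optimal rational value of R_1 has already been obtained from System 6 by some linear programming solver, which is not shown. The names `redistribution_time`, `best_rounding`, `c_next` and `c_prev` are ours, and the cost model simply serializes the sends (and the receives) of each processor, as in the two cases of the proof above.

```python
import math
from fractions import Fraction

def redistribution_time(flows, c_next, c_prev):
    """Duration of a redistribution described by its signed flows.

    flows[i]  : items sent by P_i to P_{i+1} (negative: P_{i+1} sends to P_i)
    c_next[i] : per-item cost of the link from P_i to P_{i+1}
    c_prev[i] : per-item cost of the link from P_i to P_{i-1}
    Indices are 0-based and taken modulo n.
    """
    n = len(flows)
    worst = 0
    for i in range(n):
        out, inn = flows[i], flows[i - 1]
        # One-port model: the sends of P_i are serialized, and so are its receives.
        send = max(out, 0) * c_next[i] + max(-inn, 0) * c_prev[i]
        recv = max(inn, 0) * c_next[i - 1] + max(-out, 0) * c_prev[(i + 1) % n]
        worst = max(worst, send, recv)
    return worst

def best_rounding(delta, r1, c_next, c_prev):
    """Build F (floor of R_1) and G (ceiling of R_1) and keep the faster one."""
    candidates = []
    for first in {math.floor(r1), math.ceil(r1)}:
        flows = [first]
        for d in delta[1:]:
            flows.append(d + flows[-1])  # Lemma 10: R_j = delta_j + R_{j-1}
        candidates.append((redistribution_time(flows, c_next, c_prev), flows))
    return min(candidates)
```

As an illustration on made-up data, `best_rounding([3, -1, -1, -1], Fraction(13, 6), [1, 4, 1, 1], [3, 1, 1, 1])` compares the flows `[2, 1, 0, -1]` (duration 5) with `[3, 2, 1, 0]` (duration 8) and returns `(5, [2, 1, 0, -1])`.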
6.2 General case
6.2.1 Lower bound
We have the following bound on the optimal redistribution time:
Lemma 12. Let τ be the optimal redistribution time. Then:
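\[
\tau \ge \max \left\{
\begin{array}{l}
\displaystyle \max_{1 \le k \le n,\ \delta_k > 0} \ \delta_k \min\{c_{k,k-1},\, c_{k,k+1}\}, \\[1.5ex]
\displaystyle \max_{1 \le k \le n,\ \delta_k < 0} \ -\delta_k \min\{c_{k-1,k},\, c_{k+1,k}\}, \\[1.5ex]
\displaystyle \max_{\substack{1 \le k \le n,\\ 1 \le l \le n-2,\\ \delta_{k,k+l} > 0}} \ \min_{0 \le i \le \delta_{k,k+l}} \ \max\{\, i \cdot c_{k,k-1},\ (\delta_{k,k+l} - i) \cdot c_{k+l,k+l+1} \,\}, \\[1.5ex]
\displaystyle \max_{\substack{1 \le k \le n,\\ 1 \le l \le n-2,\\ \delta_{k,k+l} < 0}} \ \min_{0 \le i \le -\delta_{k,k+l}} \ \max\{\, i \cdot c_{k-1,k},\ (-\delta_{k,k+l} - i) \cdot c_{k+l+1,k+l} \,\}
\end{array}
\right\}
\tag{11}
\]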
Proof. Consider any processor Pi with positive unbalance (δi > 0). Even if processor Pi can send
data items to both of its neighbors, because of the one-port model, it cannot send data items
to both of them simultaneously. The best way for processor Pi to send δi data items is then to
send them using the fastest of its outgoing links. So processor Pi needs a time of at least
δi × min{ci,i−1, ci,i+1} to send δi data items, whatever the destinations of these data items. We
have a symmetric result for the case δi < 0. Hence the first two lines of System 11.
Now, consider any non trivial slice of consecutive processors Ck,l. By “non trivial” we mean
that the slice is not reduced to a single processor (we already treated that case) and that it does
not contain all processors. We suppose that δk,l > 0. So, in any redistribution scheme, at least
δk,l data items must be sent by Ck,l. As this slice is not reduced to a single processor, the two
processors at the extremities of the slice, Pk and Pl, can simultaneously send data items to their
neighbors outside of the slice, Pk−1 and Pl+1 respectively. Therefore, during the redistribution,
processor Pk sends a certain amount i ∈ [0, δk,l] of data items to processor Pk−1, while processor
Pl sends the remaining data items to Pl+1, which takes a time max{i · ck,k−1, (δk,l − i) · cl,l+1}.
We then choose for i a value which minimizes this time. We have a symmetric result for the case
δk,l < 0. Hence the last two lines of System 11.
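The bound of Lemma 12 can be evaluated by brute force, as in the following Python sketch. It assumes integer unbalances, takes δ_{k,k+l} to be the sum of the unbalances of the processors P_k, ..., P_{k+l} (indices modulo n), and uses our own array names: `c_next[i]` for c_{i,i+1} and `c_prev[i]` for c_{i,i−1}. No attempt is made at efficiency.

```python
def lower_bound(delta, c_next, c_prev):
    """Brute-force evaluation of the four terms of the lower bound (11)."""
    n = len(delta)
    bound = 0
    for k in range(n):
        d = delta[k]
        if d > 0:
            # P_k must emit d items, at best over its fastest outgoing link.
            bound = max(bound, d * min(c_prev[k], c_next[k]))
        elif d < 0:
            # P_k must absorb -d items, at best over its fastest incoming link.
            bound = max(bound, -d * min(c_next[k - 1], c_prev[(k + 1) % n]))
        for l in range(1, n - 1):  # l = 1, ..., n-2
            last = (k + l) % n
            d_slice = sum(delta[(k + t) % n] for t in range(l + 1))
            if d_slice > 0:
                # The slice emits through its two extremities.
                left, right = c_prev[k], c_next[last]
            elif d_slice < 0:
                # The slice absorbs through its two extremities.
                left, right = c_next[k - 1], c_prev[(last + 1) % n]
            else:
                continue
            m = abs(d_slice)
            # Best split of the m items between the two extremities.
            split = min(max(i * left, (m - i) * right) for i in range(m + 1))
            bound = max(bound, split)
    return bound
```

On a homogeneous ring of four processors with unit link costs and unbalances `[4, -4, 0, 0]`, the single-processor terms dominate and the function returns 4.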
6.2.2 Heuristic approaches
We do not know whether the bound given by Lemma 12 can always be reached, but we have no
counter-example proving that the bound is not tight.
When the solution found by System 6 does not satisfy the “light redistribution” hypothesis,
the system can be modified to enforce it: we obtain System 12, which finds a
solution satisfying the “light redistribution” hypothesis, if one exists. But there is no reason
a priori for the solution of System 12 to be optimal.
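\[
\text{Minimize } \tau, \text{ subject to }
\left\{
\begin{array}{ll}
S_{i,i+1} \ge 0 & 1 \le i \le n \\
S_{i,i-1} \ge 0 & 1 \le i \le n \\
S_{i,i+1} + S_{i,i-1} - S_{i+1,i} - S_{i-1,i} = \delta_i & 1 \le i \le n \\
S_{i,i+1}\, c_{i,i+1} + S_{i,i-1}\, c_{i,i-1} \le \tau & 1 \le i \le n \\
S_{i+1,i}\, c_{i+1,i} + S_{i-1,i}\, c_{i-1,i} \le \tau & 1 \le i \le n \\
S_{i,i+1} + S_{i,i-1} \le L_i & 1 \le i \le n
\end{array}
\right.
\tag{12}
\]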
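As a reading aid, the constraints of System 12 can be checked on a candidate solution with a short Python sketch. It does not solve the linear program; it only verifies a given assignment and reports the smallest compatible τ. The function name `check_system12`, the 0-based arrays `s_next[i]` (for S_{i,i+1}) and `s_prev[i]` (for S_{i,i−1}), and the argument `cap` (standing for the bounds L_i, whose definition lies outside this section) are our own conventions.

```python
def check_system12(s_next, s_prev, delta, c_next, c_prev, cap):
    """Return the smallest tau satisfying System 12 for the candidate,
    or None if one of the other constraints is violated."""
    n = len(delta)
    tau = 0
    for i in range(n):
        nxt, prv = (i + 1) % n, i - 1  # prv = -1 wraps around for i = 0
        if s_next[i] < 0 or s_prev[i] < 0:
            return None  # non-negativity
        sent = s_next[i] + s_prev[i]
        received = s_prev[nxt] + s_next[prv]  # S_{i+1,i} + S_{i-1,i}
        if sent - received != delta[i]:
            return None  # the unbalance of P_i is not corrected
        if sent > cap[i]:
            return None  # bound L_i on what P_i may send
        send_time = s_next[i] * c_next[i] + s_prev[i] * c_prev[i]
        recv_time = s_prev[nxt] * c_prev[nxt] + s_next[prv] * c_next[prv]
        tau = max(tau, send_time, recv_time)
    return tau
```

For example, with the made-up data `delta = [3, -1, -1, -1]`, `c_next = [1, 4, 1, 1]`, `c_prev = [3, 1, 1, 1]` and generous caps, the candidate `s_next = [2, 1, 0, 0]`, `s_prev = [1, 0, 0, 0]` is accepted with τ = 5.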
To conclude this section, we point out that the design of an optimal algorithm in the most
general case remains open. Given the complexity of the lower bound, the problem looks very
difficult to solve.
7 Related work
Redistribution algorithms have been the focus of an abundant literature. On the theoretical
side, in the framework of High Performance Fortran [25] compilation, Kremer [26] showed the
NP-completeness of a simple redistribution problem. This negative result shows that optimal
algorithms can be designed only for particular cases, such as the ring architecture in this paper.
To the best of our knowledge, no other redistribution algorithm has been proven optimal, but
several efficient algorithms have been designed for rings [20, 28, 13], trees or hypercubes [41]. The
elastic load balancing algorithm designed in [30, 4] has led to a data redistribution software used
for query processing [8] and medical image analysis [35].
The block-cyclic distribution of data arrays plays a very important role in scientific libraries [5].
In a CYCLIC(r) distribution over p processors, blocks of r consecutive elements of the array are
distributed to the processors in a wraparound fashion, and the parameter r is chosen to optimize the
granularity, i.e. the computation-to-communication ratio. Because this granularity changes from
one computational kernel to the other, moving from a CYCLIC(r) distribution over p processors
to a CYCLIC(s) distribution over q processors is a very useful redistribution procedure, which has
been implemented using a caterpillar algorithm in ScaLAPACK [34]. Several papers, including [23,
39, 14, 33, 19, 11, 24], have dealt with various optimizations of this redistribution procedure. Along
this line of research, automatic data redistribution tools are presented in [19].
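The CYCLIC(r) mapping described above is easy to state in code. The following Python sketch is only an illustration with names of our choosing (`cyclic_owner`, `moves`); it enumerates the elements that change owner by direct comparison and is not the caterpillar algorithm of [34].

```python
def cyclic_owner(index, r, p):
    """0-based processor owning array element `index` under CYCLIC(r) over p processors."""
    # Block number index // r, dealt to the processors in a wraparound fashion.
    return (index // r) % p

def moves(length, r, p, s, q):
    """Triples (element, source, destination) for the elements whose owner
    changes when going from CYCLIC(r) over p processors to CYCLIC(s) over q."""
    result = []
    for i in range(length):
        src, dst = cyclic_owner(i, r, p), cyclic_owner(i, s, q)
        if src != dst:
            result.append((i, src, dst))
    return result
```

For an array of 4 elements, going from CYCLIC(1) to CYCLIC(2) over 2 processors moves element 1 from processor 1 to processor 0 and element 2 from processor 0 to processor 1.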
Even though we did not deal with load-balancing algorithms in this paper, we quote some key
references on the subject. For homogeneous platforms, see the collection of papers [38], and for
heterogeneous clusters see chapter 25 in [9]. Several authors [17, 32, 31, 40, 21] propose a mapping
policy which dynamically minimizes system degradation (including the cost of remapping) for
each computation step. Static strategies aiming at distributing independent chunks of work to
two-dimensional processor grids are studied in [1, 2]. Relaxing the geometrical constraints induced
by two-dimensional grids leads to irregular partitionings [12, 22, 3] that allow for a good load-
balancing but are much more difficult to implement. This approach has been extended to three-
dimensional problems [18].
Finally, we briefly mention three sample applications whose implementation can directly benefit
from the redistribution strategies designed in this paper. The analysis of pulses propagating in
a nonlinear medium calls for adaptive computational windows, and redistribution must occur
frequently as the computation progresses [6]. A two-level redistribution procedure is advocated
in [27] for structured adaptive mesh refinement. A multi-level diffusion re-partitioner is presented
in [36, 37] for irregular grid computations and has been incorporated into the ParMetis library.
Of course this short list could be extended dramatically.
8 Experimental results
To evaluate the impact of the redistributions, we used the SimGrid [29] simulator to model an
iterative application, implemented on a platform generated with the Tiers network generator [10,
15].
Figure 11: The platform is composed of 90 machine nodes, connected through 192 communication
links.
We use the platform represented in Figure 11. The capacities of the edges are assigned using
the classification of the Tiers generator (local LAN link, LAN/MAN link, MAN/WAN link, ...).
For each link type, we use values measured using pathchar [16] between some machines in ENS
Lyon and some other machines scattered in France (Strasbourg, Lille, Grenoble, and Orsay), in
the USA (Knoxville, San Diego, and Argonne), and in Japan (Nagoya and Tokyo).
We randomly select p processors in the platform to build the execution ring. The communica-
tion speed is given by the slowest link in the route from a processor to its successor (or predecessor)
in the ring. The processing powers (CPU speeds) of the nodes are first randomly chosen in a list
of values corresponding to the processing powers (expressed in MFlops and evaluated with
a benchmark taken from LINPACK [7]) of a wide variety of machines (Pentium Pro 200MHz,
Pentium 2 350MHz, Celeron 400MHz, Athlon 1.4GHz, Pentium 4 1.7GHz, ...). These speeds are
then made to vary during the execution of the application.
We model an iterative application which runs for 100 iterations. At each iteration,
independent data are updated by the processors. We may think of an m × n data matrix whose
columns are distributed to the processors (we use n = m = 1000 in the experiment). Ideally, each
processor should be allocated a number of columns proportional to its CPU speed. This is how
the distribution of columns to processors is initialized.
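A possible way to compute such an initial distribution is sketched below in Python. The text only requires proportionality to the CPU speeds; the largest-remainder rounding used here, the function name `proportional_columns` and the restriction to integer speeds are assumptions made for the sake of the example.

```python
def proportional_columns(n_columns, speeds):
    """Split n_columns among processors proportionally to their integer speeds."""
    total = sum(speeds)
    # Integer part and remainder of the ideal share n_columns * s / total.
    counts = [n_columns * s // total for s in speeds]
    remainders = [n_columns * s % total for s in speeds]
    leftover = n_columns - sum(counts)
    # Hand the remaining columns to the processors with the largest remainders.
    order = sorted(range(len(speeds)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

With 1000 columns and hypothetical speeds `[200, 350, 400]`, the ideal shares are about 210.5, 368.4 and 421.1, and the function returns `[211, 368, 421]`.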
To motivate the need for redistributions, we create an unbalance by letting the CPU speeds
vary during the execution. The speed of each processor changes twice: first at some iteration
randomly chosen between iterations 20 and 40, and then at some iteration randomly
chosen between iterations 60 and 80 (see
Figure 12 for an illustration). We record the values of each CPU speed in a SimGrid trace.
[Plot: processing powers (from 0 to 250) as a function of the number of iterations (from 0 to 100), for Processor Amadeus and Processor Cambridge.]
Figure 12: Processing power of 2 sample machine nodes.
In the simulations, we use the heterogeneous bidirectional algorithm for light redistributions,
and we test five different schemes, each with a given number of redistributions within the 100
iterations. The first scheme has no redistribution at all. The second scheme implements a
redistribution after iteration number 50. The third scheme uses four redistributions, after iterations
20, 40, 60 and 80. The fourth scheme uses 9 redistributions, implemented every 10 iterations, and
the last one uses 19 redistributions, implemented every 5 iterations. Given the shape of the CPU
traces, some redistributions are likely to be beneficial during the execution.
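The five schemes can be summarized by the iterations after which a redistribution takes place. The short Python sketch below (the helper name `redistribution_points` is ours) makes explicit that each scheme with redistributions is periodic, a period of p iterations yielding 100/p − 1 redistributions.

```python
def redistribution_points(period, iterations=100):
    """Iterations after which a redistribution is triggered, for a fixed period."""
    return list(range(period, iterations, period))

# One entry per scheme, keyed by the label used in the text.
schemes = {
    "no redistribution": [],
    "1 redistribution": redistribution_points(50),    # after iteration 50
    "4 redistributions": redistribution_points(20),   # after 20, 40, 60, 80
    "9 redistributions": redistribution_points(10),   # every 10 iterations
    "19 redistributions": redistribution_points(5),   # every 5 iterations
}
```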
The last parameter to set is the computation-to-communication ratio, which amounts to setting
the relative (average) cost of a redistribution versus the cost of an iteration. When this parameter
increases, iterations take more time, and redistributions become more useful.
[Plot: execution time in sec. as a function of the computation-to-communication ratio (from 1 to 10), with one curve per scheme: no redistribution, 1 redistribution, 4 redistributions, 9 redistributions, 19 redistributions.]
Figure 13: Normalized execution time as a function of the computation-to-communication ratio,
for a ring of 8 processors.
In Figures 13 and 14, we plot the execution time of the different schemes. Both figures
report the same comparisons, but for different ring sizes: we use 8 processors in Figure 13, and
32 in Figure 14.
As expected, when the processing power is high (ratio = 10 in the figures), the best strategy
is to use no redistribution, as the cost of redistributions is prohibitive. Conversely, when the
processing power is low (ratio = 1 in the figures), it pays off to use many redistributions, but not
too many! As the ratio increases, all tradeoffs can be found.
[Plot: execution time in sec. as a function of the computation-to-communication ratio (from 1 to 10), with one curve per scheme: no redistribution, 1 redistribution, 4 redistributions, 9 redistributions, 19 redistributions.]
Figure 14: Normalized execution time as a function of the computation-to-communication ratio,
for a ring of 32 processors.
9 Conclusion
In this paper, we have considered the problem of redistributing data on rings of processors. For
homogeneous rings the problem has been completely solved. Indeed, we have designed optimal
algorithms, and provided formal proofs of correctness, both for unidirectional and bidirectional
rings. The bidirectional algorithm turned out to be quite complex, and requires a lengthy proof.
For heterogeneous rings there remains further research to be conducted. The unidirectional
case was easily solved, but the bidirectional case remains open. Still, we have derived an optimal
solution for light redistributions, an important case in practice. The complexity of the bound
provided for the general case shows that designing an optimal algorithm is likely to be a difficult
task.
All our algorithms have been implemented and extensively tested. We have reported some
simulation results for the most difficult combination, that of heterogeneous bi-directional rings.
As expected, in some cases the cost of data redistributions is not offset by the gain obtained
from correcting a small unbalance of the work. Further work will aim at investigating how
frequently redistributions must occur in real-life applications.
References
[1] J. Barbosa, J. Tavares, and A. J. Padilha. Linear algebra algorithms in a heterogeneous
cluster of personal computers. In 9th Heterogeneous Computing Workshop (HCW’2000),
pages 147–159. IEEE Computer Society Press, 2000.
[2] O. Beaumont, V. Boudet, A. Petitet, F. Rastello, and Y. Robert. A proposal for a heteroge-
neous cluster ScaLAPACK (dense linear solvers). IEEE Trans. Computers, 50(10):1052–1070,
2001.
[3] O. Beaumont, V. Boudet, F. Rastello, and Y. Robert. Matrix multiplication on heterogeneous
platforms. IEEE Trans. Parallel Distributed Systems, 12(10):1033–1051, 2001.
[4] A. Bevilacqua. A dynamic load balancing method on a heterogeneous cluster of workstations.
Informatica, 23(1):49–56, 1999.
[5] L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra,
S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK
Users’ Guide. SIAM, 1997.
[6] A. Bourgeade and B. Nkonga. Dynamic load balancing computation of pulses propagating in
a nonlinear medium. The Journal of Supercomputing, 28(3):279–294, 2004.
[7] R. P. Brent. The LINPACK Benchmark on the AP1000: Preliminary Report. In CAP Work-
shop 91. Australian National University, 1991. Website http://www.netlib.org/linpack/.
[8] L. Brunie, A. Flory, and H. Kosch. New static scheduling and elastic load balancing methods
for parallel query processing. In Basque International Workshop on Information Technology
BIWIT. IEEE Computer Society Press, 1995.
[9] R. Buyya. High Performance Cluster Computing. Volume 1: Architecture and Systems. Pren-
tice Hall PTR, Upper Saddle River, NJ, 1999.
[10] K. L. Calvert, M. B. Doar, and E. W. Zegura. Modeling internet topol-
ogy. IEEE Communications Magazine, 35(6):160–163, June 1997. Available at
http://citeseer.nj.nec.com/calvert97modeling.html.
[11] C. H. Hsu, Y. Chung, D. Yang, and C. Dow. A generalized processor mapping technique for
array redistribution. IEEE Trans. Parallel Distributed Systems, 12(7):743–757, 2001.
[12] P. E. Crandall and M. J. Quinn. Block data decomposition for data-parallel programming on
a heterogeneous workstation network. In 2nd International Symposium on High Performance
Distributed Computing, pages 42–49. IEEE Computer Society Press, 1993.
[13] E. Deelman and B. Szymanski. Dynamic load balancing in parallel discrete event simulation
for spatially explicit problems. In PADS’98, 12th Workshop on Parallel and Distributed
Simulation, pages 46–53. IEEE Computer Society Press, 1998.
[14] F. Desprez, J. Dongarra, A. Petitet, C. Randriamaro, and Y. Robert. Scheduling block-cyclic
array redistribution. IEEE Trans. Parallel Distributed Systems, 9(2):192–205, 1998.
[15] M. Doar. A better model for generating test networks. In Proceedings of Globecom ’96, Nov.
1996. Available at http://citeseer.nj.nec.com/doar96better.html.
[16] A. B. Downey. Using pathchar to estimate internet link characteristics. In Mea-
surement and Modeling of Computer Systems, pages 222–223, 1999. Available at
http://citeseer.nj.nec.com/downey99using.html.
[17] J. E. Flaherty, R. M. Loy, C. Özturan, M. S. Shephard, B. K. Szymanski, J. D. Teresco,
and L. H. Ziantz. Parallel structures and dynamic load balancing for adaptive finite element
computation. Applied Numerical Mathematics, 26(1-2):241–263, 1997.
[18] J. E. Flaherty, R. M. Loy, M. S. Shephard, B. K. Szymanski, J. D. Teresco, and L. H.
Ziantz. Adaptive local refinement with octree load balancing for the parallel solution of
three-dimensional conservation laws. J. Parallel and Distributed Computing, 47(2):139–152,
1997.
[19] J. Garcia, E. Ayguadé, and J. Labarta. A framework for integrating data alignment, dis-
tribution, and redistribution in distributed memory multiprocessors. IEEE Trans. Parallel
Distributed Systems, 12(4):416–431, 2001.
[20] M. Hamdi and C. Lee. Dynamic load balancing of data parallel applications on a distributed
network. In 9th International Conference on Supercomputing ICS’95, pages 170–179. ACM
Press, 1995.
[21] Y. Hu and R. Blake. Load balancing for unstructured mesh applications. Parallel and Dis-
tributed Computing Practices, 2(3), 1999.
[22] M. Kaddoura, S. Ranka, and A. Wang. Array decomposition for nonuniform computational
environments. Journal of Parallel and Distributed Computing, 36:91–105, 1996.
[23] E. T. Kalns and L. M. Ni. Processor mapping techniques towards efficient data redistribution.
IEEE Trans. Parallel Distributed Systems, 6(12):1234–1247, 1995.
[24] J. Knoop and E. Mehofer. Distribution assignment placement: effective optimization of
redistribution costs. IEEE Trans. Parallel Distributed Systems, 13(6):628–647, 2002.
[25] C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. L. Steele Jr., and M. E. Zosel. The High
Performance Fortran Handbook. The MIT Press, 1994.
[26] U. Kremer. NP-Completeness of dynamic remapping. In Proceedings of the Fourth Workshop
on Compilers for Parallel Computers, Delft, The Netherlands, 1993. also available as Rice
Technical Report CRPC-TR93330-S.
[27] Z. Lan, V. Taylor, and G. Bryan. Dynamic load balancing of samr applications on distributed
systems. In Proceedings of the ACM/IEEE Symposium on Supercomputing (SC’01). IEEE
Computer Society Press, 2001.
[28] C. Lee and M. Hamdi. Parallel image processing applications on a network of workstations.
Parallel Computing, 21:137–160, 1995.
[29] A. Legrand, L. Marchal, and H. Casanova. Scheduling Distributed Applications: The Sim-
Grid Simulation Framework. In Proceedings of the Third IEEE International Symposium on
Cluster Computing and the Grid (CCGrid’03), May 2003.
[30] S. Miguet and Y. Robert. Elastic load balancing for image processing algorithms. In H. Zima,
editor, Parallel Computation, LNCS 591, pages 438–451. Springer Verlag, 1992.
[31] D. Nicol and P. F. Reynolds Jr. Optimal dynamic remapping of data parallel computations.
IEEE Trans. Computers, 39(2):206–219, 1990.
[32] D. Nicol and J. Saltz. Dynamic remapping of parallel computations with varying resource
demands. IEEE Trans. Computers, 37(9):1073–1087, 1988.
[33] N. Park, V. Prasanna, and C. Raghavendra. Efficient algorithms for block-cyclic array
redistribution between processor sets. IEEE Trans. Parallel
Distributed Systems, 10(12):1217–1240, 1999.
[34] L. Prylli and B. Tourancheau. Fast runtime block-cyclic data redistribution on multiproces-
sors. J. Parallel Distributed Computing, 45:63–72, 1997.
[35] D. Sarrut and S. Miguet. ARAMIS: a remote access medical imaging system. In ISCOPE’99,
3rd International Symposium on Computing in Object-Oriented Parallel Environments, vol-
ume 1732 of Lecture Notes in Computer Science. Springer, 1999.
[36] K. Schloegel, G. Karypis, and V. Kumar. Multilevel diffusion schemes for repartitioning of
adaptive meshes. volume 47, pages 109–124, 1997.
[37] K. Schloegel, G. Karypis, and V. Kumar. A unified algorithm for load-balancing adaptive sci-
entific simulations. In Proceedings of the ACM/IEEE Symposium on Supercomputing (SC’00).
IEEE Computer Society Press, 2000.
[38] B. A. Shirazi, A. R. Hurson, and K. M. Kavi. Scheduling and load balancing in parallel and
distributed systems. IEEE Computer Science Press, 1995.
[39] R. Thakur, A. Choudhary, and J. Ramanujam. Efficient algorithms for array redistribution.
IEEE Trans. Parallel and Distributed Systems, 7(6):587–594, 1996.
[40] J. Watts and S. Taylor. A practical approach to dynamic load balancing. IEEE Trans. Parallel
and Distributed Systems, 9(3):235–248, 1998.
[41] M.-Y. Wu. On runtime parallel scheduling for processor load balancing. IEEE Trans. Parallel
and Distributed Systems, 8(2):173–186, 1997.
